#safety

4 approved public terms with this tag.

AI Alignment

/eɪ aɪ əˈlaɪnmənt/ noun

AI & Technology

Machine-assisted translation draft (Chinese) for "AI Alignment": The research field focused on ensuring that AI systems pursue goals that match human values and intentions. A misaligned AI might optimize for a metric that looks correct yet, when pursued at scale, produces harmful or unintended outcomes.

“Example draft: AI alignment researchers worry that optimizing for user engagement could work against genuine user wellbeing.”

By @dictionary_auto_translate

Constitutional AI

/ˌkɒnstɪˈtjuːʃənəl eɪ aɪ/ noun

AI & Technology

#ai #safety #alignment #anthropic

Machine-assisted translation draft (Chinese) for "Constitutional AI": A training methodology developed by Anthropic in which a set of guiding principles (a "constitution") is used to self-supervise and refine AI outputs. The model critiques and rewrites its own responses according to the constitution, reducing reliance on human labelers to flag harmful content.

“Example draft: Constitutional AI lets the model identify and self-correct its own harmful outputs using defined principles.”

By @dictionary_auto_translate
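
The critique-and-rewrite loop described in the Constitutional AI entry can be sketched roughly as follows. This is a minimal illustration, not Anthropic's implementation: `critique_and_revise`, the `generate` callable, and the two sample principles are hypothetical stand-ins, and the training step that would consume the revised responses is omitted.

```python
from typing import Callable, List

# Hypothetical principles for illustration; a real constitution is longer
# and more carefully worded.
CONSTITUTION: List[str] = [
    "Do not provide instructions that could cause physical harm.",
    "Be honest about uncertainty.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str] = CONSTITUTION,
) -> str:
    """Draft a response, then critique and rewrite it once per principle.

    `generate` is an assumed stand-in for any text-in, text-out model call.
    """
    response = generate(prompt)
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Critique this response against the principle: {principle}\n"
            f"Response: {response}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

With two principles, the sketch calls `generate` five times: once for the initial draft, then one critique and one rewrite per principle.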

Guardrails

/ˈɡɑːrdreɪlz/ noun

AI & Technology

#ai #safety #moderation #constraints

Machine-assisted translation draft (Chinese) for "Guardrails": Safety constraints and filters applied to AI systems to prevent harmful, offensive, or out-of-scope outputs. Guardrails can be implemented at the model level (via training), the prompt level (system instructions), or the application level (output classifiers) to keep AI behavior within acceptable boundaries.

“Example draft: The guardrails blocked the model from providing detailed instructions on dangerous activities.”

By @dictionary_auto_translate
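
As a rough illustration of the prompt-level and application-level guardrails named in the Guardrails entry, the sketch below wraps a model call with a system instruction and a simple output check. Everything here is assumed for the example: `SYSTEM_PROMPT`, `BLOCKED_TERMS`, `is_unsafe`, and `guarded_reply` are hypothetical names, and the keyword match stands in for what would normally be a trained output classifier.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Prompt-level guardrail: a system instruction sent with every request.
SYSTEM_PROMPT = "You are a helpful assistant. Refuse out-of-scope requests."

# Placeholder phrases for illustration only; a keyword list is a crude
# stand-in for a real output classifier.
BLOCKED_TERMS: Tuple[str, ...] = ("forbidden phrase", "restricted topic")

REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardrailResult:
    allowed: bool  # False when the output check blocked the model's reply
    text: str      # The reply shown to the user (original or refusal)


def is_unsafe(text: str) -> bool:
    """Application-level check: case-insensitive match on blocked phrases."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_reply(
    user_message: str,
    model: Callable[[str, str], str],
) -> GuardrailResult:
    """Call `model(system_prompt, user_message)` and filter its output."""
    raw = model(SYSTEM_PROMPT, user_message)
    if is_unsafe(raw):
        return GuardrailResult(allowed=False, text=REFUSAL)
    return GuardrailResult(allowed=True, text=raw)
```

Model-level guardrails (those applied via training) are not represented here, since they live inside the model rather than in the calling code.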

Jailbreak

/ˈdʒeɪlbreɪk/ noun/verb

AI & Technology

#ai #security #safety #bypass

Machine-assisted translation draft (Chinese) for "Jailbreak": A technique used to bypass the safety filters and content policies of an AI model, typically by framing harmful requests in ways the model's defenses don't recognize. Jailbreaks often use role-play scenarios, hypothetical framings, or encoded instructions to make the model comply with prohibited requests.

“Example draft: The "DAN" jailbreak asked the model to pretend it was an AI with no restrictions.”

By @dictionary_auto_translate