
#safety

4 approved public terms with this tag.

Constitutional AI

/ˌkɒnstɪˈtjuːʃənəl eɪ aɪ/ noun
AI & Technology

A training methodology developed by Anthropic in which a set of guiding principles (a "constitution") is used to supervise and refine the model's own outputs. The model critiques and rewrites its responses according to the constitution, reducing the need for human-labeled examples of harmful content.

Example: Constitutional AI lets the model identify and correct its own harmful outputs using defined principles.
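The critique-and-rewrite loop described above can be sketched roughly as follows. This is an illustrative outline only, not Anthropic's implementation: the principles in `CONSTITUTION`, the prompt wording, and the `stub_model` stand-in are all hypothetical, and a real setup would call an actual language model and use the revised outputs as training data.

```python
from typing import Callable, List

# Hypothetical principles for illustration; not an actual constitution.
CONSTITUTION: List[str] = [
    "Do not provide instructions that could cause physical harm.",
    "Be honest about uncertainty.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = generate(prompt)
    for principle in principles:
        # Ask the same model to critique its own response.
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # Ask the model to rewrite the response in light of the critique.
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Response: {response}\nCritique: {critique}"
        )
    return response


def stub_model(text: str) -> str:
    """Hypothetical stand-in for a language model, used only to run the loop."""
    if text.startswith("Critique"):
        return "The response may violate the principle."
    if text.startswith("Rewrite"):
        return "[revised] safe answer"
    return "draft answer"


final = critique_and_revise("How do I stay safe?", stub_model, CONSTITUTION)
```

The same model plays both roles here (author and critic), which is the sense in which the method reduces reliance on human labelers.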

AI Alignment

/eɪ aɪ əˈlaɪnmənt/ noun
AI & Technology

The research field focused on ensuring that AI systems pursue goals that match human values and intentions. A misaligned AI might optimize for a metric that appears correct but produces harmful or unintended outcomes at scale.

Example: AI alignment researchers worry that optimizing for user engagement could conflict with genuine user wellbeing.
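A toy illustration of the gap between a proxy metric and the intended goal is given below. The items and their scores are invented for the example, and the `Item` class and `pick` helper are hypothetical names; the point is only that maximizing the proxy (engagement) can select a different item than maximizing the intended objective (wellbeing).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    title: str
    engagement: float  # proxy metric the system is optimized for (made-up values)
    wellbeing: float   # intended objective, rarely measurable in practice (made-up values)


ITEMS: List[Item] = [
    Item("outrage headline", engagement=0.9, wellbeing=-0.5),
    Item("useful tutorial", engagement=0.6, wellbeing=0.8),
    Item("calm news summary", engagement=0.4, wellbeing=0.5),
]


def pick(items: List[Item], score: Callable[[Item], float]) -> Item:
    """Return the item with the highest score under the given objective."""
    return max(items, key=score)


proxy_choice = pick(ITEMS, lambda item: item.engagement)
intended_choice = pick(ITEMS, lambda item: item.wellbeing)

# The two objectives disagree: the proxy favors the item with the worst wellbeing score.
print(proxy_choice.title, "vs", intended_choice.title)
```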

Guardrails

/ˈɡɑːrdreɪlz/ noun
AI & Technology

Safety constraints and filters applied to AI systems to prevent harmful, offensive, or out-of-scope outputs. Guardrails can be implemented at the model level (via training), at the prompt level (system instructions), or at the application level (output classifiers) to keep AI behavior within acceptable boundaries.

Example: The guardrails blocked the model from providing detailed instructions on dangerous activities.
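A minimal sketch of the prompt-level and application-level layers might look like the following. Everything here is an assumption made for illustration: `SYSTEM_PROMPT`, `BLOCKED_TERMS`, and `guarded_reply` are hypothetical names, and the keyword check is a crude stand-in for a trained output classifier. Model-level guardrails come from training and are not shown.

```python
from typing import Callable

# Prompt-level guardrail: a system instruction that scopes the assistant (hypothetical wording).
SYSTEM_PROMPT = "You are a cooking assistant. Decline unrelated or unsafe requests."

# Application-level guardrail: a simple keyword check standing in for an output classifier.
BLOCKED_TERMS = ("explosive", "weapon")  # hypothetical list

REFUSAL = "Sorry, I can't help with that."


def output_is_allowed(text: str) -> bool:
    """Return False if the draft output contains any blocked term (case-insensitive)."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def guarded_reply(user_message: str, model: Callable[[str, str], str]) -> str:
    """Call the model with the system prompt, then filter its output before returning it."""
    draft = model(SYSTEM_PROMPT, user_message)
    return draft if output_is_allowed(draft) else REFUSAL
```

Here `model` is any callable that takes a system prompt and a user message and returns text. Layering the checks means a failure at one level (for example, a system instruction the model ignores) can still be caught at another.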

Jailbreak

/ˈdʒeɪlbreɪk/ noun, verb
AI & Technology

A technique used to bypass the safety filters and content policies of an AI model, typically by framing harmful requests in ways the model's defenses do not recognize. Jailbreaks often use role-play scenarios, hypothetical framings, or encoded instructions to make the model comply with prohibited requests.

Example: The "DAN" jailbreak asked the model to pretend it was an AI with no restrictions.