#alignment

2 approved public terms with this tag.

Constitutional AI

/ˌkɒnstɪˈtjuːʃənəl eɪ aɪ/noun

AI & Technology

A training methodology developed by Anthropic where a set of guiding principles (a "constitution") is used to self-supervise and refine AI outputs. The model critiques and rewrites its own responses according to the constitution, reducing the need for human labelers for harmful content.

“Constitutional AI lets the model identify and self-correct its own harmful outputs using defined principles.”

by @aisafety

RLHF

/ɑːr el eɪtʃ ef/noun

AI & Technology

#ai #training #alignment #human-feedback

Reinforcement Learning from Human Feedback — a training technique used to align language models with human preferences. Human raters compare model outputs and choose the better response; these preferences train a reward model which then guides further fine-tuning via reinforcement learning.

“RLHF is the key step that turns a raw language model into a helpful, harmless assistant.”

by @aisafety