#alignment
2 approved public terms with this tag.
RLHF
/ɑːr el eɪtʃ ef/noun
Reinforcement Learning from Human Feedback — a training technique used to align language models with human preferences. Human raters compare model outputs and choose the better response; these preferences train a reward model which then guides further fine-tuning via reinforcement learning.
“RLHF is the key step that turns a raw language model into a helpful, harmless assistant.”
Constitutional AI
/ˌkɒnstɪˈtjuːʃənəl eɪ aɪ/noun
A training methodology developed by Anthropic where a set of guiding principles (a "constitution") is used to self-supervise and refine AI outputs. The model critiques and rewrites its own responses according to the constitution, reducing the need for human labelers for harmful content.
“Constitutional AI lets the model identify and self-correct its own harmful outputs using defined principles.”