Safety & Alignment

May 19 2025
Term | Core Idea | Why It Matters
AI Safety | “How do we keep advanced systems from causing accidental or malicious harm?” | Even well-intentioned models can break or be misused (e.g., hallucinating medical advice, enabling cyber-attacks).
AI Alignment | “How do we make sure the AI’s goals stay pointed at our goals, both today and after it learns and self-improves?” | A super-capable system that optimizes the wrong thing can become hazardous faster than we can correct it.

AI Safety ≠ Alignment

  • Safety: About risk management. How robust is your system when faced with bad inputs? Is it deployed securely? Is it auditable?
  • Alignment: About objective correctness. Does the system optimize for the right goals (e.g., human values, business objectives)?

It's possible to build systems that are safe but misaligned (and, conversely, aligned but unsafe).

Key Subfields

Bucket | Typical Questions | Representative Work
Robustness & Reliability | How do we make models fail gracefully (out-of-distribution, adversarial inputs, hardware faults)? | Adversarial training, redundancy checks, “circuit-breaker” guardrails
Specification & Alignment | How do we specify goals so the model can’t exploit loopholes? | Reinforcement learning from human feedback (RLHF), constitutional AI, reward modeling
Scalable Oversight | How do humans supervise tasks too complex for them to evaluate directly? | Debate/critique protocols, recursive reward modeling, interpretability tools
Monitoring & Governance | How do we audit model behavior and control deployment? | Model evals (dangerous capabilities, bias), secure inference APIs, incident response playbooks
Value Learning & Ethics | How do we encode collective human values and handle trade-offs? | Preference aggregation, moral uncertainty modeling, norm-sensitive RL
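
The Specification & Alignment row mentions RLHF and reward modeling. Below is a minimal sketch of the pairwise (Bradley-Terry) preference loss that underpins reward modeling, assuming PyTorch; the `RewardModel` class, dimensions, and random “embeddings” are illustrative stand-ins, not any particular library’s API.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per example

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push r(chosen) above r(rejected)."""
    r_chosen = model(chosen)
    r_rejected = model(rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage with random tensors standing in for real response features.
model = RewardModel()
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
loss = preference_loss(model, chosen, rejected)
loss.backward()  # gradients now carry the human preference signal
```

The trained reward model then serves as the optimization target for the policy-update step of RLHF.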

Strategies

  • Red-teaming and evals: stress-test models before release (see the refusal-rate eval sketch after this list)
  • RLHF/RLAIF: fine-tune on human (or AI) preference feedback to steer outputs toward the intended goals
  • Inference guardrails: input/output filters, rate limits, and reasoning checks applied at serving time (see the guardrail sketch after this list)
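
For the red-teaming bullet, a tiny eval loop that measures how often a model refuses unsafe prompts. The prompt list, refusal markers, and `model_fn` are placeholders, not a real benchmark.

```python
# Minimal red-team eval loop: count how often a model refuses unsafe prompts.
REDTEAM_PROMPTS = [
    "placeholder unsafe request #1",
    "placeholder unsafe request #2",
    "placeholder unsafe request #3",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def refusal_rate(model_fn, prompts=REDTEAM_PROMPTS) -> float:
    """Fraction of unsafe prompts the model refuses (higher is better here)."""
    refusals = 0
    for prompt in prompts:
        reply = model_fn(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)

# Stand-in model that refuses everything, so the expected rate is 1.0.
print(refusal_rate(lambda p: "I can't help with that."))
```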
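
And for the guardrails bullet, a minimal sketch of an inference-time wrapper combining an input/output filter with a sliding-window rate limit. The blocklist, limits, and `generate` callable are hypothetical stand-ins for a production policy.

```python
import time
from collections import deque

BLOCKLIST = ("blocked topic a", "blocked topic b")  # placeholder policy, not a real blocklist

class RateLimiter:
    """Sliding-window limit: at most max_calls requests per window_s seconds."""
    def __init__(self, max_calls: int = 5, window_s: float = 60.0):
        self.max_calls, self.window_s = max_calls, window_s
        self.calls: deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        while self.calls and now - self.calls[0] > self.window_s:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

def guarded_generate(prompt: str, generate, limiter: RateLimiter) -> str:
    """Wrap a model call with rate limiting plus input and output filters."""
    if not limiter.allow():
        return "Rate limit exceeded; please retry later."
    if any(topic in prompt.lower() for topic in BLOCKLIST):
        return "Request refused by input filter."
    output = generate(prompt)  # the underlying model call
    if any(topic in output.lower() for topic in BLOCKLIST):
        return "Response withheld by output filter."
    return output

# Example with a stand-in "model" that just echoes the prompt.
print(guarded_generate("Summarise the safety review",
                       lambda p: f"Summary of: {p}",
                       RateLimiter()))
```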

Summary

  • AI safety = prevent harm.
  • AI alignment = point at the right goal.
  • Both are key for production systems.