Safety & Alignment

May 19 2025
Term | Core Idea | Why It Matters
AI Safety | “How do we keep advanced systems from causing accidental or malicious harm?” | Even well-intentioned models can break or be misused (e.g., hallucinating medical advice, enabling cyber-attacks).
AI Alignment | “How do we make sure the AI’s goals stay pointed at our goals, both today and after it learns and self-improves?” | A super-capable system that optimizes the wrong thing can become hazardous faster than we can correct it.

AI Safety ≠ Alignment

  • Safety: About risk management. How robust is your system when faced with bad inputs? Is it deployed securely? Is it auditable?
  • Alignment: About objective correctness. Does the system optimize for the right goals (e.g., human values, business objectives)?

It's possible to build systems that are safe but misaligned (and, conversely, aligned but unsafe).

Key Subfields

Bucket | Typical Questions | Representative Work
Robustness & Reliability | How do we make models fail gracefully (out-of-distribution, adversarial inputs, hardware faults)? | Adversarial training, redundancy checks, “circuit-breaker” guardrails
Specification & Alignment | How do we specify goals so the model can’t exploit loopholes? | Reinforcement learning from human feedback (RLHF), constitutional AI, reward modeling
Scalable Oversight | How do humans supervise tasks too complex for them to evaluate directly? | Debate/critique protocols, recursive reward modeling, interpretability tools
Monitoring & Governance | How do we audit model behavior and control deployment? | Model evals (dangerous capabilities, bias), secure inference APIs, incident response playbooks
Value Learning & Ethics | How do we encode collective human values and handle trade-offs? | Preference aggregation, moral uncertainty modeling, norm-sensitive RL
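
The Specification & Alignment row mentions RLHF and reward modeling. Below is a minimal sketch of the pairwise (Bradley-Terry) preference loss that underpins reward modeling, assuming PyTorch; the `RewardModel` class, dimensions, and random “embeddings” are illustrative stand-ins, not any particular library’s API.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # one scalar reward per example

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push r(chosen) above r(rejected)."""
    r_chosen = model(chosen)
    r_rejected = model(rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage with random tensors standing in for real response features.
model = RewardModel()
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
loss = preference_loss(model, chosen, rejected)
loss.backward()  # gradients now carry the human preference signal
```

The trained reward model then serves as the optimization target for the policy-update step of RLHF.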

Strategies

  • Red-teaming and evals: stress-test models before release (see the refusal-rate eval sketch after this list)
  • RLHF/RLAIF: fine-tune on human (or AI) preference feedback to steer outputs toward the intended goals
  • Inference guardrails: input/output filters, rate limits, and reasoning checks applied at serving time (see the guardrail sketch after this list)
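
For the red-teaming bullet, a tiny eval loop that measures how often a model refuses unsafe prompts. The prompt list, refusal markers, and `model_fn` are placeholders, not a real benchmark.

```python
# Minimal red-team eval loop: count how often a model refuses unsafe prompts.
REDTEAM_PROMPTS = [
    "placeholder unsafe request #1",
    "placeholder unsafe request #2",
    "placeholder unsafe request #3",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def refusal_rate(model_fn, prompts=REDTEAM_PROMPTS) -> float:
    """Fraction of unsafe prompts the model refuses (higher is better here)."""
    refusals = 0
    for prompt in prompts:
        reply = model_fn(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)

# Stand-in model that refuses everything, so the expected rate is 1.0.
print(refusal_rate(lambda p: "I can't help with that."))
```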
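
And for the guardrails bullet, a minimal sketch of an inference-time wrapper combining an input/output filter with a sliding-window rate limit. The blocklist, limits, and `generate` callable are hypothetical stand-ins for a production policy.

```python
import time
from collections import deque

BLOCKLIST = ("blocked topic a", "blocked topic b")  # placeholder policy, not a real blocklist

class RateLimiter:
    """Sliding-window limit: at most max_calls requests per window_s seconds."""
    def __init__(self, max_calls: int = 5, window_s: float = 60.0):
        self.max_calls, self.window_s = max_calls, window_s
        self.calls: deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        while self.calls and now - self.calls[0] > self.window_s:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

def guarded_generate(prompt: str, generate, limiter: RateLimiter) -> str:
    """Wrap a model call with rate limiting plus input and output filters."""
    if not limiter.allow():
        return "Rate limit exceeded; please retry later."
    if any(topic in prompt.lower() for topic in BLOCKLIST):
        return "Request refused by input filter."
    output = generate(prompt)  # the underlying model call
    if any(topic in output.lower() for topic in BLOCKLIST):
        return "Response withheld by output filter."
    return output

# Example with a stand-in "model" that just echoes the prompt.
print(guarded_generate("Summarise the safety review",
                       lambda p: f"Summary of: {p}",
                       RateLimiter()))
```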

Summary

  • AI safety = prevent harm.
  • AI alignment = point at the right goal.
  • Both are key for production systems.