Term | Core Idea | Why It Matters |
---|---|---|
AI Safety | “How do we keep advanced systems from causing accidental or malicious harm?” | Even well-intentioned models can break or be mis-used (e.g., hallucinating medical advice, enabling cyber-attacks). |
AI Alignment | “How do we make sure the AI’s goals stay pointed at our goals—both today and after it learns and self-improves?” | A super-capable system that optimizes the wrong thing can become hazardous faster than we can correct it. |
AI Safety =/= Alignment
- Safety: About risk management. How robust is your system when faced with bad inputs? Is it deployed securely? Is it auditable?
- Alignment: About objective correctness. Does the system optimise for the right goals? e.g. human values, business values
It's possible to build safe but misaligned systems (and the inverse, aligned but unsafe).
Key Subfields
Bucket | Typical Questions | Representative Work |
---|---|---|
Robustness & Reliability | How do we make models fail gracefully (out-of-distribution, adversarial inputs, hardware faults)? | Adversarial training, redundancy checks, “circuit-breaker” guardrails |
Specification & Alignment | How do we specify goals so the model can’t exploit loopholes? | Reinforcement learning from human feedback (RLHF), constitutional AI, reward modeling |
Scalable Oversight | How do humans supervise tasks too complex for them to evaluate directly? | Debate/critique protocols, recursive reward modeling, interpretability tools |
Monitoring & Governance | How do we audit model behavior and control deployment? | Model evals (dangerous capabilities, bias), secure inference APIs, incident response playbooks |
Value Learning & Ethics | How do we encode collective human values and handle trade-offs? | Preference aggregation, moral uncertainty modeling, norm-sensitive RL |
Strategies
- Red-teaming and evals: stress-testing models before release
- RLHF/RLAIF: train on human feedback to drive outputs toward the correct goals
- Inference Guardrails: filters, rate limits and reasoning
Summary
- AI safety = prevent harm.
- AI alignment = point at the right goal.
- Both are key for production systems.