Technical Aspects of AI Safety
Last reviewed: 2026-05-11

AI governance is not just policy. It also requires concrete technical practices to ensure AI systems are safe, reliable, and aligned with intended goals. This chapter covers the engineering disciplines that AI practitioners and risk managers apply as part of governance, with updates for the 2025-2026 state of practice.
Robustness and reliability
Robustness begins with rigorous validation on test data that simulates real-world variability and edge cases. Practical techniques:
- Stress testing — computer vision systems tested in varying lighting and noise; text systems tested on out-of-distribution inputs and adversarial paraphrases.
- Adversarial training — training on adversarial examples to harden the model.
- Ensemble methods — multiple models or sensors compensate for any single component’s failure.
- Operating-domain documentation — explicitly state what conditions the model was designed for; reject or flag inputs outside the domain (a minimal flagging sketch follows this list).
- Capability evaluations — for frontier models, structured pre-deployment evaluations of dangerous capabilities (cyber, CBRN, autonomy). See Frontier Models for the structural pattern.
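A minimal sketch of the operating-domain flagging described above, assuming a tabular feature matrix. The Mahalanobis-distance test and the threshold value are illustrative assumptions, not a prescribed method; image or text systems would use embedding-space or perplexity-based equivalents.

```python
import numpy as np

def fit_domain_model(train_features: np.ndarray):
    """Estimate the mean and covariance of the training distribution."""
    mean = train_features.mean(axis=0)
    cov = np.cov(train_features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    return mean, cov_inv

def mahalanobis_distance(x: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Distance of a single input from the centre of the training distribution."""
    delta = x - mean
    return float(np.sqrt(delta @ cov_inv @ delta))

def check_operating_domain(x, mean, cov_inv, threshold=5.0):
    """Flag inputs that fall outside the documented operating domain.

    Inputs beyond the threshold are routed to human review rather than scored
    silently; the threshold itself should be calibrated on held-out data.
    """
    distance = mahalanobis_distance(x, mean, cov_inv)
    return {"in_domain": distance <= threshold, "distance": distance}
```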
The NIST AI 600-1 Generative AI Profile (updated March 2025) provides the most current US reference for GenAI threat categories and evaluation patterns.
Alignment with human values
Alignment is the engineering discipline of ensuring AI behaviour matches human intentions and values. Foundational techniques in production by 2026:
- Reinforcement learning from human feedback (RLHF) — widely deployed for instruction-following and value alignment in large language models.
- Constitutional AI / RLAIF — AI-assisted critique and revision against an explicit set of principles, reducing reliance on human raters at scale.
- Direct preference optimisation (DPO) and successors — offline preference learning that avoids the reinforcement-learning loop (the core loss is sketched after this list).
- Specification through constraints — encoding behavioural rules directly (e.g., refuse-list rules, scoped tool access).
- Tool-use scoping — for agentic systems, explicit permissions and human-in-the-loop checkpoints for high-impact actions.
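For the DPO entry above, the core objective is compact enough to state directly. A minimal sketch in PyTorch, assuming per-response log-probabilities have already been computed under the trainable policy and a frozen reference model; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimisation loss over a batch of preference pairs.

    Each tensor holds the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximise the margin between preferred and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```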
For applications beyond instruction-tuned LLMs, alignment thinking still applies: define objective functions carefully (not just optimise the metric, but the metric subject to fairness and safety constraints), and peer-review model objectives during development.
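As a hedged illustration of "the metric subject to fairness and safety constraints", one common pattern is a penalised training objective; the penalty terms and weights below are placeholders to be chosen, and peer-reviewed, per application.

```python
def constrained_objective(task_loss, fairness_gap, safety_violations,
                          fairness_weight=1.0, safety_weight=10.0):
    """Task loss penalised by fairness and safety terms.

    fairness_gap: e.g. a differentiable proxy for disparity between groups.
    safety_violations: e.g. rate of outputs breaching a behavioural rule.
    The weights encode a normative trade-off and should be reviewed explicitly,
    not tuned silently.
    """
    return task_loss + fairness_weight * fairness_gap + safety_weight * safety_violations
```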
For agentic AI systems — AI that takes actions in the world — alignment also requires controllability: structured ability to interrupt, audit, and roll back AI-initiated changes. This is an active research area; see the International AI Safety Report (Paris, February 2025) for state-of-the-art.
Interpretability and transparency tools
Interpretability sheds light on “black box” models:
- SHAP and LIME — per-prediction feature attribution; still standard for tabular and structured models (a usage sketch follows this list).
- Saliency maps and integrated gradients — per-input attribution for vision models.
- Attention visualisation — standard for transformer-based models, though attention is not a complete causal account.
- Probing — lightweight classifiers attached to model internals to test for specific learned representations.
- Sparse autoencoders and circuit analysis — mechanistic interpretability techniques that have advanced substantially during 2024-2026 and now produce human-readable feature dictionaries for parts of large models.
- Model cards — documentation in plain language describing intended use, performance, and limitations.[1]
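A minimal sketch of the per-prediction attribution workflow mentioned in the SHAP entry above, assuming the `shap` package is installed and using a stand-in tree-based tabular model; in practice the model under review replaces the placeholder.

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Placeholder tabular model; in practice this is the production model under review.
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:200])

# Per-prediction attribution for one instance, plus a global summary plot.
print(dict(zip(X.columns, shap_values[0])))
shap.summary_plot(shap_values, X.iloc[:200])
```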
For high-impact AI systems, an explanation report or model card is increasingly mandatory:
- EU AI Act Articles 11-13 (technical documentation, transparency, instructions for use).
- GDPR Article 22, read together with Articles 13-15 (meaningful information about the logic involved in automated decision-making).
- US sector rules (FCRA adverse action notices, ECOA reason codes).
Bias and fairness mitigation
Measuring and mitigating bias is a discipline with multiple technical entry points:
- Pre-processing — ensure training data is balanced or reweighted.
- In-processing — add fairness penalty terms or constraints to the objective.
- Post-processing — adjust model outputs to reduce disparity.
Open-source toolkits (IBM AI Fairness 360, Microsoft Fairlearn, Google Model Card Toolkit) operationalise these techniques. AI governance can mandate a fairness check before deployment for specified system categories, with documented metrics, thresholds, and remediation actions.
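A minimal, library-free sketch of such a pre-deployment fairness check; the metric (demographic parity difference) and the 0.10 threshold are illustrative choices, and the toolkits above provide hardened equivalents of both the metrics and the mitigation steps.

```python
import numpy as np

def demographic_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Largest gap in positive-prediction rate between any two groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return float(max(rates) - min(rates))

def fairness_gate(y_pred, group, threshold=0.10):
    """Simple deployment gate: block release if the disparity exceeds the threshold.

    The threshold is a governance decision, not a purely technical one, and
    should be documented alongside the metric and the remediation plan.
    """
    gap = demographic_parity_difference(np.asarray(y_pred), np.asarray(group))
    return {"demographic_parity_difference": gap, "passes": gap <= threshold}
```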
Specific legal regimes now codify fairness expectations:
- NYC Local Law 144 — bias audits for automated employment decision tools (see US State Laws).
- EU AI Act Article 10 — data governance requirements for high-risk systems including measures to detect and prevent bias.
- EU AI Act Article 27 — fundamental rights impact assessments for high-risk deployers.
Engineering teams must produce metrics and evidence; compliance must ensure those satisfy legal thresholds. Joint development of fairness definitions between engineering, legal, and product teams is essential because the technical definitions (demographic parity, equalised odds, predictive parity) embed normative choices and cannot all be satisfied simultaneously.
Performance monitoring and drift detection
Deployment is not a one-time event. Governance requires ongoing monitoring:
- Performance metrics in production: accuracy, calibration, fairness metrics across subgroups, business metrics impacted by the AI.
- Drift detection — distribution shift in inputs (covariate drift), in labels (label drift), or in the input-output relationship (concept drift).
- Out-of-distribution detection — identify inputs unlike training data; route to human review or fall back gracefully.
- Hallucination detection and grounding for LLM applications — particularly retrieval-augmented systems.
- Misuse detection — identify abuse patterns and adversarial attempts.
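A minimal sketch of the covariate-drift check listed above, using the population stability index (PSI) for a single feature; the bin count and the 0.2 alert level are common rules of thumb rather than mandated values.

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training-time) sample and a production sample of one feature."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover values outside the reference range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Small epsilon avoids division by zero and log of zero in empty bins.
    ref_frac, cur_frac = ref_frac + 1e-6, cur_frac + 1e-6
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

def drift_alert(reference, current, alert_level=0.2):
    """Flag a feature for review when PSI exceeds the alert level."""
    psi = population_stability_index(np.asarray(reference), np.asarray(current))
    return {"psi": psi, "drifted": psi > alert_level}
```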
When monitoring signals a problem, governance triggers a defined response process (the NIST AI RMF’s “Manage” function): retrain, roll back, restrict, or escalate. Document each incident and decision; feed lessons learned back into design.
Safety constraints and testing in simulation
For AI interacting with the physical world (robots, autonomous vehicles, medical devices), pre-deployment safety testing is essential:
- Simulation — billions of simulated miles or interactions to explore rare and dangerous scenarios.
- Formal verification — mathematical proof that certain safety properties hold under specified assumptions; viable for narrow modules but not yet for full neural-network policies.
- Redundancy and fail-safes — engineering patterns from aviation and industrial control extended to AI systems.
- Guardian systems — an independent monitor that can override or shut down the main AI if it detects unsafe behaviour.
For non-physical AI, the analogous practice is defence-in-depth: pre-deployment red-teaming, runtime guardrails, monitoring, incident response. The frontier-model safety architecture in Frontier Models is the most advanced expression of this pattern.
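As a hedged illustration of the guardian-system pattern above, and of its runtime-guardrail analogue for non-physical systems, the following sketch shows an independent checker that can veto a proposed action. The `within_speed_limit` check and its threshold are placeholders for domain-specific safety logic.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ProposedAction:
    name: str
    payload: dict

class Guardian:
    """Independent monitor that can override the primary system.

    The guardian runs its own, simpler checks; it does not trust the primary
    system's self-assessment, and it fails closed when any check rejects.
    """
    def __init__(self, checks: List[Callable[[ProposedAction], bool]]):
        self.checks = checks

    def review(self, action: ProposedAction) -> bool:
        return all(check(action) for check in self.checks)

# Placeholder check; real systems encode physical limits or content policies here.
def within_speed_limit(action: ProposedAction) -> bool:
    return action.payload.get("speed", 0) <= 30

guardian = Guardian(checks=[within_speed_limit])
action = ProposedAction(name="set_speed", payload={"speed": 45})
if not guardian.review(action):
    print("Guardian veto: action blocked, falling back to safe state")
```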
Agentic systems — the 2026 frontier
The biggest engineering shift between 2024 and 2026 is the rise of agentic AI — systems that take multi-step actions in the world via tool use, browsing, code execution, or robotic control. Governance and safety implications:
- Permission scoping for tools and APIs the agent can call (see the sketch after this list).
- Human-in-the-loop checkpoints for high-impact actions.
- Audit trails capturing every tool call and decision rationale.
- Sandboxing of code execution and environment access.
- Rate limiting and anomaly detection at the action layer.
- Reversibility analysis — categorise actions by reversibility; require stronger controls for irreversible actions.
- Identity and authorisation — the agent acts on behalf of a principal; both internal authorisation systems and emerging external standards (e.g., AI agent identity proposals) increasingly matter.
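A minimal sketch combining three of the controls above: permission scoping, reversibility classification, and a human-in-the-loop checkpoint. The tool names, reversibility categories, and approval hook are illustrative assumptions, not a reference implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

@dataclass
class ToolPolicy:
    name: str
    allowed: bool = False
    reversible: bool = True        # irreversible tools get stricter handling
    requires_approval: bool = False

@dataclass
class AuditEntry:
    timestamp: str
    tool: str
    arguments: dict
    outcome: str

class AgentGateway:
    """Mediates every tool call an agent attempts to make."""

    def __init__(self, policies: List[ToolPolicy]):
        self.policies = {p.name: p for p in policies}
        self.audit_log: List[AuditEntry] = []

    def call(self, tool: str, arguments: dict, approved_by_human: bool = False) -> str:
        policy = self.policies.get(tool)
        if policy is None or not policy.allowed:
            outcome = "denied: tool not in scope"
        elif (policy.requires_approval or not policy.reversible) and not approved_by_human:
            outcome = "held: human approval required"
        else:
            outcome = "executed"  # a real implementation would dispatch to the tool here
        self.audit_log.append(AuditEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            tool=tool, arguments=arguments, outcome=outcome))
        return outcome

# Example policy: reading a ticket is in scope; issuing a refund needs human sign-off.
gateway = AgentGateway([
    ToolPolicy("read_ticket", allowed=True),
    ToolPolicy("issue_refund", allowed=True, reversible=False, requires_approval=True),
])
print(gateway.call("issue_refund", {"amount": 120}))   # held: human approval required
```

Every attempted call, whether executed, held, or denied, lands in the audit log, which is what makes the later roll-back and incident-review steps possible.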
Best practice for agentic systems is still rapidly evolving. The Frontier Models chapter discusses the structural requirements emerging in the EU GPAI Code, CAISI agreements, and Korea’s AI Basic Act.
Verification, validation, and assurance
Technical safety culminates in assurance — the structured argument and evidence that a system is acceptably safe. The growing convergence around safety cases as the unit of assurance is visible in California SB 53, the EU GPAI Code, and several responsible scaling policies (RSPs). A safety case typically includes:
- The claim (e.g., “this model can be deployed for use case X with acceptable residual risk”).
- The evidence — evaluations, audits, monitoring data, third-party assessments.
- The argument linking evidence to claim — explicit reasoning about how the evidence supports the claim.
- The counterarguments — what could undermine the claim and how those risks are addressed.
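A minimal sketch of that structure as a data object engineering and governance teams could populate jointly; the field names simply mirror the four elements above and are not drawn from any specific standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    description: str       # e.g. "pre-deployment cyber capability evaluation"
    source: str            # e.g. internal eval report, third-party audit
    supports_claim: bool

@dataclass
class SafetyCase:
    claim: str                                  # the deployment claim being made
    evidence: List[Evidence] = field(default_factory=list)
    argument: str = ""                          # reasoning linking evidence to the claim
    counterarguments: List[str] = field(default_factory=list)

    def unresolved_counterarguments(self) -> List[str]:
        # A fuller implementation would track a rebuttal status per counterargument.
        return self.counterarguments

case = SafetyCase(
    claim="Model M can be deployed for customer support with acceptable residual risk",
    evidence=[Evidence("red-team report covering prompt injection", "internal", True)],
    argument="Evaluations cover the highest-severity misuse pathways in the risk assessment",
    counterarguments=["Evaluations may not cover novel jailbreak techniques"],
)
```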
Engineering and governance functions both contribute to safety cases. Producing them well requires close coordination — another reason ISO/IEC 42001’s management-system architecture is foundational rather than optional.
Mitchell, M., et al. (2019). Model cards for model reporting. ACM FAccT '19. ↩︎