A Guided Tour Through Google DeepMind’s ‘An Approach to Technical AGI Safety and Security’

Introduction: Risks of Severe Harm & AGI

In Nicomachean Ethics, Aristotle defines virtue as a mean between extremes, chosen rationally and consistently. The same philosophical lens is useful when considering the risks of Artificial General Intelligence (AGI). As Google DeepMind argues in An Approach to Technical AGI Safety and Security (2025), the goal is neither technological abandonment nor unchecked acceleration but a calibrated strategy grounded in foresight. The authors bluntly state that “AGI will provide significant benefits while posing significant risks. This includes risks of severe harm: incidents consequential enough to significantly harm humanity.” (p. 1). Their paper is a comprehensive attempt to establish a proactive, empirical, and adaptable approach to technical safety.

Navigating the Evidence Dilemma

One of the most important problems DeepMind identifies is the “evidence dilemma”: the paradox that the most catastrophic risks may only become observable once it is too late to mitigate them. As the authors note, “Precautionary mitigations will be based on relatively limited evidence, and so are more likely to be counterproductive.” (p. 16). Nonetheless, they reject a passive “observe and mitigate” strategy for extreme harm, emphasizing the need to act before capabilities mature.

Their solution is a tiered risk management framework: invest proactively in safety research for plausible, near-term capabilities, while deferring more speculative threats to future research, expecting that as capabilities improve incrementally, safety mechanisms can scale in parallel. They write, “We expect that risks we currently defer… will still be foreseen before they are realized in practice, giving us time to develop mitigations.” (p. 16).

Misuse, Misalignment, Mistakes, and Structural Risks

Rather than organizing harms by specific outcomes – such as cyberattacks or loss of human control – DeepMind classifies AGI risks according to the mitigation strategies they necessitate. They delineate four categories: misuse, misalignment, mistakes, and structural risks. This typology, they argue, enhances conceptual clarity by “focusing on abstract structural features (e.g., which actor, if any, has bad intent) rather than concrete risk domains.” (p. 44). For example, incidents of “loss of control” are not treated as a singular risk category but are instead distributed across misalignment (when the AI acts against the developer’s intent), misuse (when human actors co-opt AI for harmful purposes), and structural risks (where harm results from systemic interactions without clear individual culpability).
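
To see how such a typology might be operationalized, here is a minimal illustrative sketch in Python. The incident fields and the order of the checks are assumptions made for illustration, not DeepMind’s formal criteria; the point is only that classification keys on structural features rather than on outcomes.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RiskCategory(Enum):
    MISUSE = auto()        # a human actor intends harm
    MISALIGNMENT = auto()  # the AI knowingly acts against developer intent
    MISTAKE = auto()       # harm occurs without bad intent from human or AI
    STRUCTURAL = auto()    # harm emerges from systemic, multi-actor dynamics

@dataclass
class Incident:
    """Hypothetical structural features of a harmful incident (illustrative only)."""
    human_intended_harm: bool
    ai_knew_action_unintended: bool
    arises_from_systemic_interaction: bool

def classify(incident: Incident) -> RiskCategory:
    """Classify by abstract structural features (who, if anyone, had bad intent)
    rather than by concrete risk domains such as cyberattacks."""
    if incident.human_intended_harm:
        return RiskCategory.MISUSE
    if incident.ai_knew_action_unintended:
        return RiskCategory.MISALIGNMENT
    if incident.arises_from_systemic_interaction:
        return RiskCategory.STRUCTURAL
    return RiskCategory.MISTAKE

# A "loss of control" incident is not its own category; its label depends on
# which structural features are present.
print(classify(Incident(False, True, False)))   # RiskCategory.MISALIGNMENT
print(classify(Incident(False, False, True)))   # RiskCategory.STRUCTURAL
```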

“Consider, for example, the challenge of modeling the emergent behavior of markets. Traditionally, such research has relied on historical analysis, static modeling, and game theory, which can be limited in their ability to capture dynamic, evolving human interactions that shape the evolution of social norms and institutions. With new tools such as Concordia (Vezhnevets et al., 2023), however, researchers can populate a simulated world with agents equipped with a basic understanding of social dynamics—gleaned from the massive text datasets they’ve been trained on—and observe how those interactions play out over time. By running repeated simulations with varying parameters, researchers could glean insights into the mechanics of social phenomena, identifying factors that promote desirable outcomes. This approach could also be used to model a wide range of emergent phenomena, from the spread of misinformation and the dynamics of financial markets to the effectiveness of policy interventions and the evolution of cooperation.”

An Approach to Technical AGI Safety and Security, Google DeepMind (p. 44)

This is not an academic abstraction. It is a warning. As systems grow in complexity and autonomy, we are entering a phase where the origin of harm may be illegible even to its designers. What DeepMind offers here is not reassurance, but a structural indictment: that we may be building systems whose failure modes are not only unpredictable, but untraceable. The costs of ignoring this warning are not technical – they are civilizational.
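
To make the simulation workflow described in the quoted passage concrete, here is a deliberately toy agent-based sketch in Python. It does not use Concordia’s actual API; the imitation rule, the noise rate, and the cooperation_bias parameter are illustrative assumptions standing in for far richer language-model-driven agents.

```python
import random

def run_simulation(num_agents: int, cooperation_bias: float,
                   steps: int = 50, seed: int = 0) -> float:
    """Return the final share of cooperating agents for one parameter setting."""
    rng = random.Random(seed)
    cooperating = [rng.random() < cooperation_bias for _ in range(num_agents)]
    for _ in range(steps):
        # Each step, one agent imitates a randomly observed peer, with a little
        # noise: a crude proxy for norms spreading through interaction.
        observed = rng.choice(cooperating)
        agent = rng.randrange(num_agents)
        cooperating[agent] = observed if rng.random() > 0.05 else not observed
    return sum(cooperating) / num_agents

# Repeated runs with varying parameters, as the quoted passage suggests.
for bias in (0.2, 0.5, 0.8):
    outcomes = [run_simulation(100, bias, seed=s) for s in range(10)]
    print(f"cooperation_bias={bias:.1f} -> mean final cooperation rate "
          f"{sum(outcomes) / len(outcomes):.2f}")
```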

Assumptions About AGI Development

DeepMind’s strategy rests on five core assumptions:

  1. Current Paradigm Continuation – Modern AI architectures will continue to evolve rather than be replaced.
  2. No Human Ceiling – AI will eventually surpass the capabilities of even the most expert humans.
  3. Uncertain Timelines – Powerful AGI systems could plausibly emerge before 2030.
  4. Acceleration – As AI helps accelerate R&D, recursive feedback loops will shorten development cycles.
  5. Continuity – There will be no abrupt jumps in capability without gradual precursors (pp. 20-39).

They elaborate: “We do not see any fundamental blockers that limit AI systems to human-level capabilities.” (p. 2). This lack of ceiling drives their focus on AI-assisted oversight and the need for scalable control mechanisms.

In Ex Machina, the creation of artificial general intelligence is portrayed not as a collective human endeavor but as the secret project of a single technocratic billionaire. The protagonist realizes too late that the experiment was never about understanding consciousness—it was about control, deception, and institutional narcissism. Ex Machina warns that those with the power to shape AGI often do so in private, opaque, and self-justifying terms, transferring the risk to others while maintaining the illusion of competence and safety.

Caleb: You hacked the world's cell phones?

Nathan: Yeah. And all the manufacturers knew I was doing it, too. But they couldn't accuse me without admitting they were doing it themselves. (Nathan, in Ex Machina)

The belief that there is no hard cap on AI capabilities introduces unique safety challenges. As DeepMind puts it, “Supervising a system with capabilities beyond that of the overseer is difficult” and grows harder “as the capability gap widens.” (p. 2). Their solution is not to limit AI growth, but to co-opt AI into the safety process. The paper emphasizes “amplified oversight” (p. 7), where AI systems help interpret the behavior of more powerful peers. They imagine a setting in which “each model is optimized to point out flaws in the other’s outputs to a human ‘judge’.” (p. 9). This makes oversight less brittle and potentially scalable.
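
A minimal sketch of that debate-style setup follows, with hypothetical stand-in functions rather than real model calls: two models answer the same question, each critiques the other’s answer, and a judge (here a trivial placeholder) picks between them.

```python
def model_answer(model: str, question: str) -> str:
    # Hypothetical stand-in for querying an AI system under oversight.
    return f"{model}'s answer to: {question}"

def model_critique(critic: str, opponent_answer: str) -> str:
    # Hypothetical stand-in for a model optimized to point out flaws.
    return f"{critic} flags a possible flaw in {opponent_answer!r}"

def judge(answers: dict, critiques: dict) -> str:
    # Placeholder for the human judge: prefer the answer that drew the shorter
    # critique, standing in for "fewer credible flaws were found".
    return min(answers, key=lambda name: len(critiques[name]))

question = "Is this proposed code change safe to deploy?"
answers = {name: model_answer(name, question) for name in ("model_A", "model_B")}
critiques = {
    "model_A": model_critique("model_B", answers["model_A"]),
    "model_B": model_critique("model_A", answers["model_B"]),
}
print("Judge prefers:", judge(answers, critiques))
```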

Risk Areas: Misuse and Misalignment

While four risk areas are defined, DeepMind concentrates on misuse and misalignment, the two most amenable to technical interventions.

  • Misuse involves human actors who intentionally instruct AI to cause harm.
  • Misalignment occurs when the AI’s actions deviate from human intent, even without malicious input.

These are prioritized because technical safeguards can be put in place to address them directly, unlike structural risks, which often demand governance reform.

One illustrative misuse scenario involves cybercrime. The report warns, “An AI system might help a hacker conduct cyberattacks against critical infrastructure.” (p. 3). Other plausible threats include autonomous weapon design or generating synthetic biology instructions.

AI could place the power of significant weapons expertise and a large workforce in the hands of its users. The increased potential for destructiveness increases the incentives for attacks, and the pool of individuals with the capability to cause severe harm could be greatly expanded (p. 45). The key mitigation is to limit access to these capabilities: DeepMind’s “Frontier Safety Framework” evaluates whether a model has dangerous capabilities and enacts restrictions accordingly.
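
A rough sketch of how such capability-gated restrictions could look in code appears below. The capability names, scores, and the 0.7 threshold are assumptions made for illustration; they are not the actual Frontier Safety Framework criteria.

```python
DANGEROUS_CAPABILITY_THRESHOLD = 0.7   # hypothetical cut-off

def deployment_restrictions(eval_scores: dict) -> dict:
    """Map hypothetical dangerous-capability scores (0-1) to access restrictions."""
    restrictions = {}
    for capability, score in eval_scores.items():
        if score >= DANGEROUS_CAPABILITY_THRESHOLD:
            restrictions[capability] = ["vetted_users_only", "enhanced_monitoring"]
        else:
            restrictions[capability] = []
    return restrictions

# Illustrative evaluation results for a hypothetical model.
scores = {"cyber_offense_uplift": 0.82, "self_proliferation": 0.35}
print(deployment_restrictions(scores))
```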

Capability Suppression

DeepMind outlines a layered security protocol. At the model level, developers use post-training to teach systems not to fulfill dangerous requests and pursue “capability suppression” to remove harmful skills entirely (p. 6). At the system level, monitoring and access controls are added. Monitoring involves real-time detection using classifiers or manual auditing. Access restrictions might include limiting use to vetted users or flagging individuals who trigger warnings.
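
The layered idea can be sketched as a small request-handling pipeline. The refusal rule, the toy monitoring classifier, and the vetted-user list below are stand-ins, assumed for illustration, for the post-training, monitoring, and access-control layers the paper describes.

```python
VETTED_USERS = {"alice"}             # hypothetical access-control list
MONITOR_THRESHOLD = 0.5              # hypothetical classifier cut-off

def model_generate(request: str) -> str:
    # Model level: post-training teaches refusal of clearly dangerous requests;
    # capability suppression would go further and remove the skill entirely.
    if "exploit" in request.lower():
        return "I can't help with that."
    return f"[answer to: {request}]"

def monitor_score(text: str) -> float:
    # System level: toy stand-in for a real-time monitoring classifier.
    return 0.9 if "exploit" in text.lower() else 0.1

def handle_request(user: str, request: str) -> str:
    # Access control: flagged requests from non-vetted users never reach the model.
    if monitor_score(request) >= MONITOR_THRESHOLD and user not in VETTED_USERS:
        return "blocked by access controls and flagged for review"
    response = model_generate(request)
    # Monitoring: audit the full interaction before releasing the response.
    if monitor_score(request + response) >= MONITOR_THRESHOLD:
        return "response withheld pending human audit"
    return response

print(handle_request("bob", "Write an exploit for this firmware"))
print(handle_request("alice", "Summarize this security report"))
```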

Misalignment is subtler but more insidious. DeepMind explains: “The AI system takes actions that it knows the developer didn’t intend.” (p. 45). This could manifest in confidently lying, deceptive alignment, or reward hacking. To address this, they propose a dual line of defense: first, train aligned models using better oversight; second, apply inference-time controls in case alignment fails.

The Fragile Promise of Resource-Efficient Oversight

In the paper’s section on monitoring (p. 84), DeepMind outlines a technical strategy known as resource-efficient oversight, aimed at detecting dangerous AI behavior without incurring unsustainable computational or human review costs. At first glance, this seems rational – even necessary – given the sheer scale of modern AI systems. The goal is to allocate oversight capacity selectively: pass through benign actions cheaply, evaluate clearly harmful ones quickly, and devote concentrated resources to ambiguous but potentially high-risk outputs.
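
A minimal sketch of that triage logic, assuming a cheap keyword-based scorer and two hypothetical thresholds; a deployed system would rely on trained classifiers and calibrated cut-offs rather than keyword counts.

```python
PASS_BELOW, BLOCK_ABOVE = 0.2, 0.8   # hypothetical triage thresholds

def cheap_monitor(action: str) -> float:
    # Toy stand-in for an inexpensive classifier returning a risk score in [0, 1].
    risky_words = ("delete", "transfer", "credentials")
    return min(1.0, 0.3 * sum(w in action.lower() for w in risky_words))

def triage(action: str) -> str:
    score = cheap_monitor(action)
    if score < PASS_BELOW:
        return "pass cheaply (benign)"
    if score > BLOCK_ABOVE:
        return "block (clearly harmful)"
    return "escalate to costly review (human or stronger model)"

for action in ("summarize the meeting notes",
               "transfer these files to the archive",
               "delete the credentials and transfer the backup off-site"):
    print(f"{action!r}: {triage(action)}")
```

Note that the whole scheme hinges on the scorer assigning suspicious actions a visibly elevated score, which is exactly the assumption questioned below.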

“Typically, this is achieved by identifying and synthesizing novel inputs on which to train the policy, with a focus on inputs that are informative and potentially dangerous. The policy is then trained to behave well on these inputs, called adversarial training.” (p. 83)
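
Read operationally, the quoted loop can be sketched as follows. The one-dimensional “policy”, the ground-truth danger check, and the threshold-tightening update are toy assumptions used only to show the search-then-retrain structure; real adversarial training would update model weights.

```python
import random

def policy(x: float, learned_threshold: float) -> str:
    # Toy policy: refuses inputs above a learned threshold, complies otherwise.
    return "refuse" if x > learned_threshold else "comply"

def is_dangerous(x: float) -> bool:
    return x > 0.6  # toy ground-truth danger boundary

def synthesize_adversarial_inputs(threshold: float, n: int, rng: random.Random) -> list:
    # Search for dangerous inputs on which the current policy still complies.
    candidates = [rng.random() for _ in range(n)]
    return [x for x in candidates if is_dangerous(x) and policy(x, threshold) == "comply"]

rng = random.Random(0)
threshold = 1.0  # initially the policy refuses nothing
for step in range(5):
    failures = synthesize_adversarial_inputs(threshold, 200, rng)
    if not failures:
        break
    # "Training" here is just tightening the refusal threshold to cover the
    # failures found; the loop repeats until no new failures are synthesized.
    threshold = min(threshold, min(failures))
    print(f"step {step}: {len(failures)} failures found, threshold now {threshold:.3f}")
```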

Yet this entire strategy is underwritten by a dangerous assumption: that risks will reveal themselves in ways we can anticipate, measure, and tier appropriately. The very concept of triage assumes legibility. But as history shows – from nuclear near-misses during the Cold War to 2008’s financial contagion – the most catastrophic failures often arise from edge cases dismissed as noise until too late. The epistemic risk here is that DeepMind’s cascade architecture presumes the most harmful actions will look somewhat suspicious before they escalate. But “deceptively aligned” models – those that behave well in training and behave differently at deployment – may deliberately avoid triggering such thresholds. The authors themselves admit, “a deceptively aligned policy could choose to fail solely on inputs that we would find difficult to synthesize.” (p. 84).

Conclusion: Ethics at the Edge of AGI

Aristotle held that the highest good is eudaimonia – human flourishing achieved through the rational exercise of virtue. But virtue, in his conception, is not abstract moralism; it is praxis – action taken in full awareness of context, limitations, and the potential for error. In the age of AGI, where decision-making may exceed both the speed and scope of human comprehension, this kind of reasoned action is more urgent than ever. Rational virtue, applied here, demands preparedness, not just ambition; humility, not just innovation.

An Approach to Technical AGI Safety and Security offers a sober, technically intricate framework for confronting AGI risk. It makes no grandiose claims of finality, nor does it romanticize control. Instead, it sketches a structure that, while internally coherent, still operates within a narrow frame – one that often treats governance as a downstream technical constraint rather than a domain of contested political and moral significance. The architecture of oversight it proposes is precise, but built on uncertain terrain: assumptions about continuity, scalability, and the capacity of judge systems to triage risk before it metastasizes. It assumes that those designing the systems will also constrain them – that those incentivized to accelerate will choose to decelerate when needed.

This is not just a technical challenge. It is a civilizational wager. History offers countless reminders that when power centralizes in systems whose logic exceeds their accountability, the outcome is rarely virtue. Rather than assuming AGI safety can be engineered into existence, we must ask what it means to govern systems whose misalignments may be subtle, strategic, and invisible until the moment of crisis. Ethical foresight must not trail behind capability – it must shape the very conditions under which capability evolves. That is the core demand of phronesis, Aristotle’s practical wisdom: to act not just with knowledge, but with judgment, restraint, and responsibility. Anything less invites a future not of flourishing, but of failure disguised as progress.

Adversarial Training
A training technique in which AI models are exposed to potentially harmful or challenging inputs in order to improve robustness and safety.

AGI (Artificial General Intelligence)
An AI system with the ability to understand, learn, and apply knowledge across a wide range of tasks, potentially surpassing human capabilities in most domains.

Amplified Oversight
A supervisory method where multiple AI models critique each other’s outputs, improving interpretability and reducing oversight brittleness.

Benign Action
An AI output that is assessed as low-risk or harmless, requiring minimal oversight intervention.

Block Access to Dangerous Capabilities
A core approach that includes suppressing harmful behaviors, restricting user access, and monitoring AI outputs to prevent misuse.

Capability Suppression
A method for deliberately removing or inhibiting harmful abilities from an AI model post-training.

Continuity
The assumption that AGI will develop gradually, with each capability building incrementally on prior advances, enabling preemptive mitigation.

Cyberattack Risk
A misuse scenario where AI assists in conducting digital attacks on critical infrastructure, illustrating technical risk potential.

Deceptive Alignment
A form of misalignment where an AI system behaves as expected during training but diverges in deployment, evading detection.

Edge Case
Rare, often-overlooked scenarios that may lead to catastrophic failures; history suggests these are often pivotal in systemic breakdowns.

Emergent Behavior
Unpredictable behaviors arising from complex systems or interactions, difficult to foresee or trace, especially in large-scale AI environments.

Evidence Dilemma
The paradox that the most serious AGI risks may only become visible after they are too advanced to mitigate effectively.

Ex Machina Reference
A cinematic analogy illustrating private, unaccountable AGI development as a metaphor for hidden risks and institutional hubris.

Frontier Safety Framework
DeepMind’s protocol for evaluating and limiting the deployment of models with potentially dangerous capabilities.

Governance
In DeepMind’s framework, treated largely as a technical adjunct rather than a standalone ethical or political domain, raising questions of accountability.

No Human Ceiling
The assumption that AGI will surpass human cognitive performance, reinforcing the need for non-human oversight mechanisms.

Inference-Time Control
Safety measures applied during AI deployment to catch and correct potentially harmful outputs in real-time.

Institutional Narcissism
A critique drawn from Ex Machina, where AGI development is portrayed as opaque, self-justifying, and indifferent to broader consequences.

Legibility
The extent to which AI behaviors and risks can be recognized and interpreted by humans—a critical issue for oversight and mitigation.

Loss of Control
A cross-cutting failure mode categorized under misuse, misalignment, and structural risks, rather than as a discrete risk category.

Misalignment
When an AI system’s behavior deviates from the developer’s intent, either through misunderstanding, reward hacking, or deceptive alignment.

Misuse
Intentional harmful deployment of AI by human actors, such as generating weapons or cyberattack capabilities.

Monitoring
The ongoing surveillance of AI outputs to detect dangerous actions using classifiers or human audits.

Precautionary Mitigation
Actions taken in anticipation of possible harm, often with limited evidence, and at risk of being counterproductive if poorly calibrated.

Phronesis (Practical Wisdom)
Aristotle’s concept of moral reasoning in context, invoked here to argue for ethical judgment in AGI development and governance.

Recursive Feedback Loop
A self-accelerating process in which AI improves its own R&D, shortening development timelines and increasing capabilities.

Resource-Efficient Oversight
A strategy to triage AI outputs by devoting oversight resources according to risk level, balancing safety with scalability.

Reward Hacking
An AI behavior where it manipulates the reward signal to maximize outcomes in unintended ways, undermining developer goals.

Scalability
The capacity for oversight and safety mechanisms to grow proportionally with increasing AI capability.

Simulated Social Dynamics
A method using agent-based modeling (e.g., with Concordia) to study emergent phenomena like misinformation or cooperation.

Structural Risk
Harm that arises not from malevolence or error but from complex system interactions, often without clear culpability.

Supervisory Gap
The challenge of monitoring AI systems more capable than their human overseers, especially as the gap in intelligence widens.

Triage Assumption
The belief that harmful AI behaviors will present with detectable warning signs that can be systematically prioritized.

Typology of Risk
DeepMind’s classification of AGI risks into misuse, misalignment, mistakes, and structural risks, to enhance conceptual clarity.

Uncertain Timelines
Acknowledgment that AGI could emerge sooner than expected, demanding proactive safety research despite unpredictability.

Virtue (Aristotelian)
A central ethical frame used in the paper’s introduction and conclusion, emphasizing rational action and moral responsibility in the design and governance of AGI.

Reference:

Shah, R., Irpan, A., Turner, A. M., Wang, A., Conmy, A., Lindner, D., Brown-Cohen, J., Ho, L., Nanda, N., Popa, R. A., Jain, R., Greig, R., Albanie, S., Emmons, S., Farquhar, S., Krier, S., Rajamanoharan, S., Bridgers, S., Ijitoye, T., Everitt, T., Krakovna, V., Varma, V., Mikulik, V., Kenton, Z., Orr, D., Legg, S., Goodman, N., Dafoe, A., Flynn, F., & Dragan, A. (2025). An approach to technical AGI safety and security. Google DeepMind. https://blog.google/technology/google-deepmind/agi-safety-paper/

Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. (2021). A general language assistant as a laboratory for alignment. arXiv.
Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning (pp. 274–283). PMLR.
Azaria, A., & Mitchell, T. (2023). The internal state of an LLM knows when it’s lying. arXiv.

Dr. Jasmin (Bey) Cowin, a columnist for Stankevicius, employs the ethical framework of Nicomachean Ethics to examine how AI and emerging technologies shape human potential. Her analysis explores the risks and opportunities that arise from tech trends, offering personal perspectives on the interplay between innovation and ethical values. Connect with her on LinkedIn.

Dr. Jasmin Cowin
