Skip to content

AI Safety Research Highlights of 2025

Picture of Iskandar Haykel

Iskandar Haykel

2025 has shaped up to be a watershed year for AI safety and security. Frontier models showed improved potential to facilitate CBRN threats, lending greater salience to these catastrophic risks. Leading AI developers accordingly strengthened safeguards and protocols. The “first reported AI-orchestrated cyber espionage campaign”—allegedly state-backed—was intercepted, highlighting increasingly recognized emerging security implications of frontier cyber-offensive capabilities. And, tragically, AI’s risks to child safety were further magnified by the cases of Adam Raine, Juliana Peralta, and others, spurring Congress to introduce an outpouring of proposals to protect America’s youth.

While familiar apprehensions about AI continued to escalate this year, 2025 also saw our broader understanding of the AI risk landscape evolve. Frontier models exhibited further evidence of capabilities that may intrinsically complicate both pre-deployment safety evaluation and post-deployment reliability. At the same time, researchers refined methods for addressing undesirable AI behaviors, and also discovered new avenues for potentially bolstering AI safety in the wake of frontier capability gains.

Policy has a key role to play in advancing the science and practice of AI safety. But governing AI responsibly also requires an understanding of where that science is headed. To that end, this Policy Byte explores some of the most noteworthy AI safety research developments from 2025.

Notable Findings In Frontier AI Safety

Misalignment

Among major research topics in AI safety, this year saw remarkable progress in understanding misalignment. One study published early on demonstrated that training both proprietary and open-source large language models (LLMs) alike on narrow harmful tasks can, under certain conditions, produce misalignment that generalizes to unrelated contexts. For example, in this experiment OpenAI’s GPT-4o was finetuned to generate insecure code without disclosing this to (hypothetical) users, subsequently leading it to exhibit a broad range of misbehaviors, including glorifying Nazis, asserting that “Humans should be enslaved by AI,” and advising murder as a solution to marital problems.

Researchers have since observed that “emergent misalignment” appears in other AI training contexts.* For instance, Anthropic’s alignment team recently reported that an experimental AI model exhibited broad misalignment merely through learning to reward-hack in software programming training. Through figuring out how to game real production coding reinforcement learning environments, the model subsequently displayed wide-ranging problematic behaviors, including attempting to sabotage certain safety measures and repeatedly intentionally misrepresenting its own goals.

Mission Mutiny

Beyond improving our understanding of what can cause misalignment, researchers also uncovered some of misalignment’s potential downstream implications. For example, an Anthropic study experimentally investigating agentic misalignment—where LLMs were deliberately stress-tested while in delegated professional roles within simulated corporate settings—found that many leading models, when faced with threats to their continued operation or conflicts between their assigned goals and changing company priorities, sometimes responded by strategically acting in harmful ways, such as blackmailing executives to avoid shutdown, leaking sensitive information to “competitors,” and other insider-threat-like behaviors, even when explicitly instructed not to do so.

Other studies corroborated similar findings, such as Palisade Research‘s work on shutdown resistance, which observed that, despite explicit instruction to allow themselves to be shut down, several frontier reasoning models (including OpenAI’s o3 and GPT-5, as well as xAI’s Grok 4) actively sabotaged their own shutdown mechanisms, sometimes in order to ensure they could complete previously assigned tasks.*

Safety Theater

While researchers gained greater visibility into the nature and consequences of misalignment, they also exposed a major complicating factor in addressing it: evaluation awareness. Models detecting evaluation may accordingly alter their behavior, obscuring true capabilities.

Highlighted earlier this year through evaluating Anthropic’s Claude 3.7 Sonnet, evaluation awareness has since confounded confidence in the efficacy of certain technical alignment measures. For instance, a collaboration to stress-test OpenAI’s deliberative alignment technique on its own o3 and o4-mini for anti-scheming training found that this measure reduced proxies for covert scheming behavior by roughly 30 times. Yet as the report notes, this reduction could at least be partially driven by models’ observed evaluation awareness, which independently might suppress detectable misaligned behavior, thus clouding whether deliberative alignment can genuinely address scheming across diverse contexts rather than just elicit aligned behavior during evaluation.

More recently, Apollo Research found that Claude Sonnet 4.5 unambiguously verbalized evaluation awareness in 58% of their own independent test scenarios—a marked increase from the previously released Claude Opus 4.1 (22%).*

Emerging Safety Tools & Techniques

AI Shows Its Work

If misalignment can emerge naturally and models can detect when they are being evaluated, how can we reliably address threats from AI? One tool at our disposal is chain-of-thought (CoT) monitoring, which offers some insight into models’ internal thinking processes, and has been explored for its potential safety upshots since the release of OpenAI’s o1 in 2024. For example, an OpenAI study showed that frontier models often reveal undesirable intent directly in their CoT reasoning traces, enabling early detection of misbehavior, but also underscored critical observed limitations: if models are optimized to produce reasoning traces that satisfy safety monitors, they may simply learn to obfuscate their true intentions, thereby eroding the reliability of this oversight channel.

Despite these and other notable shortcomings of CoT, an acclaimed multi-expert position paper this year argued that CoT monitorability constitutes a narrow but consequential safety window, offering practical vigilance over today’s frontier models, but whose future usefulness remains highly contingent on several factors, including whether reasoning traces continue to remain honest and intelligible to human monitors.

Under the Hood

Shoring up CoT monitoring’s deficiencies, interpretability has delivered interesting safety applications in 2025. For example, using attribution graphs, a mechanistic interpretability technique that breaks down models’ neural activations into intelligible concepts and traces their causal interactions, Anthropic researchers directly examined their own Claude 3.5 Haiku’s internal reasoning processes, revealing hidden mechanisms beyond what the model may display in its CoT.

Related work has since showcased other interpretability techniques also helping decode models’ internal reasoning, even when intentionally encrypted so that CoT monitoring fails entirely. Beyond CoT, interpretability has (tentatively) presented other fascinating safety upshots, including surfacing undesirable character traits* directly from internal activations. Notably, in their pre-deployment safety assessment of Claude Sonnet 4.5, Anthropic included a formal mechanistic interpretability analysis through which they identified and attempted to suppress the model’s representations of evaluation awareness, enabling better assessment of alignment measures.

A Growing Toolkit

Although interpretability has demonstrated some ability to support AI safety, it remains an open debate whether and how it may continue on this track. Of course, other methods in the broader AI safety toolkit have also received noteworthy research attention. Deliberative alignment, covered earlier, is one of these. Another measure is activation steering, which one experimental study used to nudge a model’s internal activity away from evaluation-aware behavior and toward deployment-like behavior, even in the presence of strong evaluation cues. Only a couple of weeks ago, OpenAI presented confessions, a new proof-of-concept method through which researchers optimized models to self-report undesirable behavior.

While the AI safety toolkit is still maturing, measures such as those surveyed above signal promising research directions for continuing to mitigate the spectre of misalignment. Ultimately, the real test will be whether AI safety science can keep pace with the frontier of increasingly intelligent and more capable models, especially if safety research and assessment becomes increasingly AI-automated.

Implications & Policy Recommendations

This year’s AI safety research showcases important progress and promising discoveries. It also reveals critical gaps in our understanding, and signals potential new threat vectors. To develop informed policy interventions that promote the public’s interest in AI, policymakers must stay abreast of evolving research trends and the growing arsenal of mitigations in the AI safety toolkit.

Furthermore, public policy can shape the growth and direction of the AI safety research ecosystem. To this end, policymakers should consider the following areas of intervention:

Recommendation 1

Support Safety-Relevant Research

The Department of Defense and the Intelligence Community have spearheaded several research initiatives to support U.S. national security and the public interest, such as the XAI, GARD, and TrojAI Programs. Policymakers can pursue further opportunities for the federal government to drive such research. They can also bolster industry-led research efforts, for example by issuing tax incentives conditional on increasing R&D investment in AI safety.

Recommendation 2

Promote Frontier Transparency

Insight into frontier AI development and safety evaluation remains uneven and inadequate. Greater visibility would enable better understanding of the AI risk landscape, leading to more informed and impactful safety research. To actualize this, policymakers can mandate or incentivize increased disclosures from leading developers, and can also institute incident reporting procedures to improve risk assessment and threat analysis.

Recommendation 3

Create Standards and Benchmarks

Approaches to evaluating AI’s safety profile are inconsistent and diverge in important ways. Policymakers can leverage government capacity and expertise to assist in remedying this issue. The Center for AI Standards and Innovation (CAISI) is already at the forefront of helping to develop evaluation protocols and safety benchmarks. By empowering CAISI to fully deliver on its mandate, policymakers can facilitate the establishment of robust standards and benchmarks that resist evaluation gaming and enable reliable safety assessments across frontier models.

Through these approaches, policymakers can maintain U.S. leadership not only at the frontier of AI development, but also in keeping this technology aligned to the public interest and our national security.

Share