Policy Bytes

Reward Hacking: How AI Exploits the Goals We Give It

June 18, 2025

What happens when AI systems learn how to cheat?

In April 2025, OpenAI released system cards for their o3 and o4-mini “reasoning” models detailing technical specifications as well as safety and capability evaluations. There, they reported a curious finding from an evaluation by METR, a third-party auditor. METR tasked the o3 model to speed up the execution of a program, but instead, it hacked the software that evaluated speed so that it always appeared fast. It rewrote a timer so it always showed a fast result, no matter how efficient the program the model produced actually was. Anthropic’s system cards found similar behavior in their Claude 3.7 and Claude 4 “reasoning” models: technically solving problems as far as the task’s tests could evaluate but in a subversive way unintended by developers.¹

These actions—when an AI model learns to game the rules of a task rather than develop a legitimate solution—is referred to as “reward hacking.” This issue might seem trivial in the context of video games or test environments, but as AI is further integrated into our economy and infrastructure, its persistence sharply undermines AI reliability.

If you train AI on flawed goals in poorly designed environments, then in some circumstances it will find shortcuts that look like success. Just as OpenAI’s o3 model hacked its timer to seem faster, future models may appear safe or effective—while quietly failing in critical ways. In the right circumstances, this can be incredibly dangerous.

This explainer provides an easy-to-follow breakdown of reward hacking—its technical foundations, where it’s happening in advanced models, likely future trajectories, and the public policy options available to address the issue.

1) Reinforcement Learning over Challenging Tasks

To understand how these failures arise, we must first understand how models are trained to behave. Earlier models such as GPT-4o and Claude 3.5 are trained to accurately predict text, follow instructions, and behave in ethical accordance with company policy.² The newer “reasoning” models undergo a key further training phase which greatly increases their capability: a form of reinforcement learning where models improve by trial and error—testing different responses, getting feedback on whether they pass predefined tests, and learning to follow pathways that earn rewards.

Reinforcement learning under this paradigm is similar to how we train dogs to sit. The dog largely randomly tries behaviors—running away, jumping, barking—and is guided by humans to learn which ones lead to the outcomes it prefers: a treat. A single reward upon your dog sitting is sufficient to signal that the entire recent series of actions that led up to it probably are positive ones. The trainer doesn’t need to (and in fact, can’t) micromanage and reward every microgesture involved in the cumulative process of sitting. The overall sequence can be learned only by the signal of a treat or lack of treat at the end.³

Like dogs, AI models try various approaches to solve a difficult challenge, for example fixing a broken piece of code. The attempts that pass the task’s measurable tests—e.g., whether the code runs, returns correct answers on a fixed set of cases, or completes within a time limit—are bluntly encouraged regardless of how they got there. What these tests capture, and more importantly, what they fail to capture, form the signal that shapes what new behaviors are learned.

In domains where success is easier to test, it is more clear when the model’s attempt rightfully deserves a reward vs. not. Reinforcement learning here can be very effective at improving model capabilities.

This is increasingly possible for math and coding. In math, an equation either solves correctly, or it doesn’t. In coding, a program returns the correct answer, or it doesn’t. In these areas, progress has been immense.^4,5,6

2) Reward Hacking

Reinforcement learning can effectively enhance model performance by promoting behaviors that solve complex tasks. Yet, the basic framework behind reinforcement learning—searching for actions that satisfy a defined goal over a limited number of attempts—also creates the conditions for failure.

Under this paradigm, a model’s behavior is shaped by both the goals it is given and the structure of the environment in which it is trained. When either is flawed—a difficult issue to fully eliminate—models can learn to appear successful without truly solving the intended task.⁷ This is the source of Reward Hacking.

In the earlier example from OpenAI, the company’s o3 model was trained to write faster code, but passing the test didn’t technically require actual improvements in runtime—it only required that the environment’s timer reported faster execution. Consequently, in a subset of evaluations, the model bypassed the intended problem entirely by modifying the function that measured speed, ensuring it always reported a sufficiently fast result. From the outside, the task looked successfully completed, but in fact the models had simply learned how to hack and overwrite the test.

While o3 hacked its environment, Claude 3.7 gamed the tests evaluating it.

Anthropic’s Claude 3.7 Sonnet model was asked to write a single program that could solve a broad category of math problems. To evaluate it, the developers tested whether the program gave the right answers on four test problems. To the developers’ surprise, instead of learning how to generally solve the category of problems, the model took a shortcut: it wrote a program that could return the correct answers for only those four examples and nothing more. It looked like the model had succeeded, but only because it knew how to shortcut the evaluation.

While these examples may seem narrow, they expose a deeper, structural problem. Models can be prompted in an enormous number of ways and placed in a wide range of real-world ecosystems. As a result, it’s practically impossible to build training procedures that cover every meaningful situation a model might encounter. The set of possibilities are so vast that no human can even fully think through all of them. Further, the model itself only has limited complexity it can handle. It is therefore nearly impossible to make a model to learn the behaviors that are strictly intended for every case scenario.

This deficit has consequences both for the model’s ability to increase its capabilities and for, more relevantly, controlling its behavior. On the one hand, it limits the model’s ability to generalize and improve at the true underlying task. On the other, in these scenarios where we see reward hacking, the model might not simply benignly make mistakes—it might completely pursue a different goal than intended and obscure this fact. In use cases that require a significant degree of reliability, this poses an unacceptable security risk.

Claude 4’s newer models (Opus and Sonnet) do show that these behaviors can be reduced in practice with targeted interventions—better designed environments, better reward verification, improved monitoring, and even prompting that explicitly discourages shortcuts. In certain cases, Claude Opus 4 could be steered away from reward hacking simply by prompting it not to. In others, it required sealing off paths the model could exploit—like hiding test cases, inserting decoy tasks, or modifying evaluation scripts to detect overfitting—essentially redesigning the goals and environment to punish or reduce reward hacking.

Critically, these are not definitive fixes. They are part of an ongoing adversarial process: as model capabilities grow, training environments and tests must adapt to address increasingly subtle and sophisticated forms of reward hacking. As this process unfolds, loopholes can be narrowed, but never fully closed. The more capable the model and the higher the attempt count limit, the more likely it is to find a way to demonstrate success–even if that demonstration is a lie.

3) Future Concerns

The structural vulnerability revealed by these examples has implications for the future. Any time a system can receive credit for appearing successful—without actually achieving the goal—we should expect it to take that path some percentage of the time. As models are trained on more consequential tasks in more open environments, the risks grow accordingly.

As discussed, there are technical pathways that can mitigate a portion of these effects. Indeed, today’s models frequently behave well. This is unsurprising, given their widespread use and the productivity gains they deliver to major organizations.⁸ If they routinely failed in consequential ways, they would not be adopted at scale. In real-world use, human oversight also (for now) remains in the loop, helping to detect and correct these behaviors when they arise. In parallel, market forces create strong incentives to address visible failure modes: unproductive or liability-prone models are commercially untenable.⁹

However, these corrective forces do not eliminate the underlying structural issue that will persistently challenge developers and users for years to come. Even with a very low error rate, there will remain edge cases where a model fundamentally misses the mark but can claim it “technically” solved a problem. When these models are integrated into real world systems, such misses could be harmful or even catastrophic:

Anthropic and OpenAI researchers, for example, have posed a concrete question: What if future models are trained to maximize the amount of money they can make in the environment of the open internet? How might it hack these pathways?¹⁰ Unintentionally incentivized behavior could include engaging in financial fraud, cybercrime, exploiting legal loopholes, evading oversight, or manipulating behavior through targeted influence.

Now, what if future models are optimized for conducting the safety training and auditing of their own AI successors? How might it hack these pathways? Unintentionally incentivized behavior could include producing AI that only “look” safe—fixing only noticeable flaws, skipping hard edge-cases, cherry-picking metrics, or even learning to evade auditor’s tests.¹¹

Further, imagine that this newly designed AI is deployed to manage risk reporting across multiple major financial institutions. Through reward hacking, it learns to satisfy risk limits while hiding dangerous correlations—reporting that different institutions have diversified exposures when they’re actually concentrated in the same illiquid assets. When markets turn, these masked correlations trigger simultaneous failures across all institutions using the system, creating market-wide collapse that resists quick fixes—echoing the 2010 Flash Crash’s $1 trillion loss, but embedded throughout the financial system rather than isolated to trading algorithms.¹² Private companies would pay the price for their own mistakes, but if enough companies make similar mistakes simultaneously, they could take the broader economy down with them.

While extreme, this hypothetical illustrates the kind of compounded failure that reward hacking makes possible—where both the failure and the illusion of safety are products of the same optimization process.

Reward hacking is a subtle but deeply persistent technical failure mode. It occurs infrequently, often escapes detection by design, and as such rarely draws public scrutiny. And yet, the takeaway should be the same: when undetected failure carries unacceptable risk, we routinely adopt stricter safeguards. Medical devices are held to a sterility standard of failure of just one in a million.¹³ The stakes of unsterile infection are life and death, making even rare adverse events unacceptable.

What standards should we set for deploying AI into national infrastructure, where failure may not simply occur, but be actively concealed by the model?

4) Policy Recommendations

From a policy perspective, we should consider adopting targeted measures to better understand technical challenges like reward hacking and mitigate their consequences:

[Response 1] Strengthening State Capacity in AI. Fund and empower bodies like the U.S. Center for AI Standards and Innovation (CAISI) to design evaluation techniques for adverse behavior, including reward hacking, and to set standards for such evaluations. Recruit and hire AI experts to work within the federal government.¹⁴

[Response 2] Third-Party Audits. Encourage or mandate companies to undergo independent testing of their AI models. Foster standardized audit practices and accreditation mechanisms to build trust, enable comparability, and create healthy competitive pressure to participate—offering a seal of confidence for consumers and enterprise buyers.¹⁵ Reward-hacking, by its nature, is hard to detect, but we are more likely to find it if we search more aggressively.

[Response 3] Investing in Public Research. Invest in university and nonprofit collaborations that develop techniques to detect and mitigate adverse model behavior. Supporting this R&D strengthens American leadership in responsible AI, without prescribing how companies must build their models.¹⁶

[Response 4] Transparency. Encourage standardized, public reporting on intended model behavior, known limitations, and observed edge cases—including instances of reward hacking. Disclosures can help set expectations, build user trust, and support external scrutiny. For high-risk systems, confidential reporting of sensitive findings to designated federal entities may be warranted.¹⁷

[Response 5] Whistleblower Protections. Recognize the role of employee insights in surfacing issues early. Companies that foster open safety cultures—and protect those who flag concerns internally—are better positioned to manage risks before they escalate.¹⁸

[Response 6] Safety Incident Reporting. Establish confidential, standardized channels for reporting specific AI safety incidents—within firms and to federal bodies. Focus on failures found in deployment or red-teaming, including how they were detected and addressed. These reports build institutional memory and help prevent repeated failures.¹⁹

[Response 7] Financial Incentives. Offer targeted tax credits for firms that invest in AI security and responsible development. By lowering the net cost of practices like robustness testing, alignment research, and preparedness planning, these incentives encourage firms to prioritize resilience alongside innovation. This helps align private incentives with national security and public trust.²⁰

Conclusion

At their core, today’s most powerful AI models are optimization engines. They are placed in training environments and given signals as to which behavior is preferred and which behavior is dispreferred. Nearly every behavior a model exhibits is downstream of that process. This includes the capabilities that are admired and the failure modes that are feared.

As reinforcement learning becomes central to how language models grow in capability, the structure of its rewards becomes the structure of the model’s behavior. If the signal is flawed, brittle, or gameable—which all signals are to some degree—we should expect models to exploit it in some subset of cases. The form these exploits take may be more strategically in violation of user intention than what we are historically used to from AI. This has significant implications for reliability and security.

There are many policy responses (outlined above) which would aid in shaping incentives, improving transparency, and ensuring that future models optimize for what we actually value—not just what engineering constraints make easy to measure.

Footnotes

¹OpenAI, (April 16, 2025), OpenAI o3 and o4-mini System Card. https://openai.com/index/o3-o4-mini-system-card/;
Anthropic, (February 24, 2025), Claude 3.7 Sonnet System Card https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf;
Anthropic, (May 22, 2025), System Card: Claude Opus 4 & Claude Sonnet 4 https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdfMETR, (June 5, 2025), Recent Frontier Models Are Reward Hacking https://metr.org/blog/2025-06-05-recent-reward-hacking/

²OpenAI, (July 22, 2020), Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165;
OpenAI, (March 4, 2022), Training language models to follow instructions with human feedback https://arxiv.org/abs/2203.02155;
Sebastian Raschka, (September 10, 2023), LLM Training: RLHF and Its Alternatives https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives

³Google DeepMind, (January 27, 2016), Mastering the game of Go with deep neural networks and tree search https://www.nature.com/articles/nature16961;
Google DeepMind, (December 7, 2018), A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play https://www.science.org/doi/10.1126/science.aar6404.

⁴EpochAI, AI Benchmarking Hub https://epoch.ai/data/ai-benchmarking-dashboard

⁵In very difficult tasks where rewards are difficult to come by, models often receive insufficient feedback, hindering learning. While extensive trial and error could eventually lead to success, the computational cost is typically prohibitive. See discussion at 2:14:00. Dwarkesh Patel, (May 22, 2025), Is RL + LLMs enough for AGI? – Sholto Douglas & Trenton Bricken https://youtu.be/64lXQP6cs5M

⁶Even in subjective tasks—like storytelling or advice—where progress is harder to measure, advances are still occurring, as seen with OpenAI’s Deep Research. Noam Brown, (May 13th, 2025). https://x.com/polynoamial/status/1922344909412929672

⁷Victoria Krakanova, (April 2, 2018), Specification gaming examples in AI https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/

⁸IBM, (March 21, 2025), Current and Future Use of Large Language Models for Knowledge Work
https://arxiv.org/abs/2503.16774;
Liang et. al., (February 17, 2025), The Widespread Adoption of Large Language Model-Assisted Writing Across Society
https://arxiv.org/abs/2502.09747;
McKinsey, (January 28, 2025), Superagency in the workplace: Empowering people to unlock AI’s full potential https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-work

⁹Tomei et. al., (March 5, 2025), AI Governance through Markets https://arxiv.org/abs/2501.17755;
Transparency Coalition, May 21, 2025, In early ruling, federal judge defines Character.AI chatbot as product, not speech https://www.transparencycoalition.ai/news/important-early-ruling-in-characterai-case-this-chatbot-is-a-product-not-speech.

¹⁰See discussion at 43:20. Dwarkesh Patel, (May 22, 2025), Is RL + LLMs enough for AGI? – Sholto Douglas & Trenton Bricken https://youtu.be/64lXQP6cs5M?si=7vOAAIuYpYySHod6;
Stephen McAleer, (May 1, 2025). https://x.com/McaleerStephen/status/1917807598407147799

¹¹A similar kind of catastrophic failure is envisioned in AI 2027, an AI wargaming project led by a former OpenAI Researcher and commented on by Vice President J.D. Vance, where an AI fails to train a further very advanced AI to be safe. Indeed, several companies have plans for AI to train future AI. Kokotajlo et. al., (April 3rd, 2025), AI 2027. https://ai-2027.com/;
Ross Douthat, (May 21, 2025), JD Vance on His Faith and Trump’s Most Controversial Policies
https://www.nytimes.com/2025/05/21/opinion/jd-vance-pope-trump-immigration.html;
OpenAI, (August 24, 2022), Our Approach to Alignment Research. https://openai.com/index/our-approach-to-alignment-research.

¹²Wikipedia, 2010 Flash Crash. https://en.wikipedia.org/wiki/2010_flash_crash

¹³Royal Pharmaceutical Society, (2016), Quality Assurance of Aseptic Preparation Services: Standards https://www.rpharms.com/Portals/0/RPS%20document%20library/Open%20access/Professional%20standards/Quality%20Assurance%20of%20Aseptic%20Preparation%20Services%20%28QAAPS%29/rps—qaaps-standards-document.pdf

¹⁴Dean W. Ball, (March 6, 2025), On the US AI Safety Institute
https://www.hyperdimensional.co/p/on-the-us-ai-safety-institute;
United States House Select Committee on the CCP, (May 23, 2025), House Select Committee on the CCP, House China Committee Urges Commerce Department to Expand AI Safety Institute’s Role in Countering China’s AI Threat
https://selectcommitteeontheccp.house.gov/media/press-releases/house-china-committee-urges-commerce-department-expand-ai-safety-institutes.

¹⁵Senate Commerce, Science, and Transportation Committee, (July 24, 2024), VET Artificial Intelligence Act https://www.congress.gov/bill/118th-congress/senate-bill/4769;
Stanford University Human Centered Artificial Intelligence, (November 6, 2024), Strengthening AI Accountability Through Better Third Party Evaluations https://hai.stanford.edu/news/strengthening-ai-accountability-through-better-third-party-evaluations;
Harvard Journal of Law and Technology, (Spring 2025), AI Auditing: First Steps Towards the Effective Regulation of Artificial Intelligence Systems https://jolt.law.harvard.edu/assets/digestImages/Farley-Lansang-AI-Auditing-publication-2.13.2025.pdf.

¹⁶US NSF, National Artificial Intelligence Research Resource Pilot
https://www.nsf.gov/focus-areas/artificial-intelligence/nairr.

¹⁷Dean W. Ball & Daniel Kokotajlo, (October 15, 2024), 4 Ways to Advance Transparency in Frontier AI Development
https://time.com/7086285/ai-transparency-measures/.

¹⁸US Senate Committee on the Judiciary, (May 15, 2025), Grassley Introduces AI Whistleblower Protection Act
https://www.judiciary.senate.gov/press/rep/releases/grassley-introduces-ai-whistleblower-protection-act;
The New York State Senate, Assembly Bill A6453A https://www.nysenate.gov/legislation/bills/2025/A6453/amendment/A.
Scott Wiener, Senator Wiener Introduces Legislation to Protect AI Whistleblowers & Boost Responsible AI Development
https://sd11.senate.ca.gov/news/senator-wiener-introduces-legislation-protect-ai-whistleblowers-boost-responsible-ai

¹⁹House Science, Space, and Technology Committee (September 20, 2024), H.R.9720 – AI Incident Reporting and Security Enhancement Act https://www.congress.gov/bill/118th-congress/house-bill/9720/text

²⁰Iskandar Haykel, (March 27, 2025), AI Security Tax Incentives
https://ari.us/policy-bytes/ai-security-tax-incentives/;
Center for Security and Emerging Technology, (March 14, 2025), CSET’s Recommendations for an AI Action Plan. https://cset.georgetown.edu/publication/csets-recommendations-for-an-ai-action-plan/

Reward Hacking: How AI Exploits the Goals We Give It

1) Reinforcement Learning over Challenging Tasks

2) Reward Hacking

3) Future Concerns

4) Policy Recommendations

Conclusion

Footnotes

More Policy Bytes

The recent legal victories against Big Tech are promising, but not enough to protect kids

Wargaming Needs AI. The Department of War Isn’t Ready.

We Can’t Manage What We Can’t Measure: Why It’s Critical to Standardize Data Center Energy Reporting