A Guide to AI Interpretability
Introduction
If a Large Language Model (LLM) used as a healthcare chatbot offers reassurance instead of flagging a serious symptom, can we trace what drove that decision? If a video generation model used for children’s educational content introduces antisemitic or racist imagery, can we understand how it arrived at that output? Or, if a classifier used in satellite image analysis misidentifies a military installation as benign infrastructure, can we determine what led it astray? The scientific inquiry of understanding, tracing, and explaining what causes the behavior of AI models is called interpretability.
This field asks whether we can understand the specific computational mechanisms that underlie model outputs, much like how neuroscientists map activity in the brain to particular human behaviors. If we could genuinely understand the “thoughts” of our most powerful models, we could audit, debug, and control these systems with clarity and precision.
Yet enabling this kind of understanding remains a major challenge. When we can’t see how models arrive at their decisions, it becomes difficult to ensure that they behave correctly for the right reasons. As these systems become increasingly powerful and used in high-stakes domains, this lack of transparency raises serious concerns.
What makes these systems so challenging to interpret? How do they process information internally? What approaches to interpretability are currently available, and what are their advantages and shortcomings? Most importantly, what can policymakers do to advance interpretability and make this transformative technology more reliable and controllable?
Modern LLMs achieve their capabilities by learning trillions of entangled statistical associations that are hard to tease apart, unlike very early AI systems that used explicit, clearly defined rules. To better understand their inner workings, two main approaches exist: mechanistic interpretability (precise but impractical) and representation interpretability (practical but imprecise). Neither yet provides a complete picture of a model’s internal logic. This compels us to further refine and incentivize these methods while leveraging them as useful (but not comprehensive) evidence for understanding AI systems.
Motivation: WordNet
To explain why this is so challenging, let’s look at an example of an earlier, less powerful system that is perfectly interpretable: WordNet, a large, manually curated lexical database, first developed at Princeton in the 1980s, that organizes words by their relationships to one another.
| Relationship | Definition | Example |
| --- | --- | --- |
| Synonymy | Same or similar meanings. | Car → Automobile |
| Antonymy | Opposite meanings. | Hot → Cold |
| Hypernymy/Hyponymy | Category-to-member hierarchy. | Animal → Dog |
| Meronymy | Part-to-whole relationship. | Wheel → Car |
Table 1. WordNet Structural Relationships
If you were to peer under the hood, you could see exactly what is going on. For example, “Car” and “Automobile” have the manually defined relationship of synonyms. This unambiguously indicates they share the same or similar meaning. It has no gray area. You always know exactly how and why two words are related because they have an explicitly defined relationship. It is perfectly interpretable.
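To make the contrast concrete, here is a toy relation store in the spirit of WordNet, with a handful of hand-built edges (the entries are illustrative, not WordNet’s actual data). Every answer traces back to an explicitly defined relationship, so the lookup is perfectly interpretable:

```python
# A tiny hand-built relation store in the spirit of WordNet: every edge is
# explicit, so every answer is fully traceable (entries are illustrative).
relations = {
    ("car", "automobile"): "synonym",
    ("hot", "cold"): "antonym",
    ("animal", "dog"): "hypernym",   # animal is a category containing dog
    ("wheel", "car"): "meronym",     # a wheel is part of a car
}

def relate(a: str, b: str) -> str:
    """Look up the explicitly defined relationship between two words."""
    return relations.get((a, b)) or relations.get((b, a)) or "no defined relation"

print(relate("car", "automobile"))  # "synonym"
print(relate("gun", "france"))      # "no defined relation"
```

If a relationship is missing, the system simply has no answer; there is no fuzzy middle ground, which is precisely what modern LLMs give up.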
Why don’t we design modern AI systems, such as LLMs, to be this interpretable?
The answer is that if we did so, the models would be nowhere near as performant or capable. The fuzzy gray areas are where the magic that makes LLMs so powerful happens. Leveraging the vast information stored inside statistical associations is the entire point.
LLMs and Associativity
The initial training stage for LLMs tasks them with predicting, as accurately as possible, language structures across nearly all of the internet: trillions of words. These models learn a massive number of associations built into this vast lexical fabric. They learn which words tend to go together and what this means in different contexts. The span of these associations carries information that cumulatively far surpasses WordNet.
Consider the following examples:
1) Models learn cultural pairings that aren’t logically necessary, but are statistically strong
Models pick up on common cultural associations that we take for granted:
- “Peanut butter and ___” → “jelly” (not honey or jam, despite those being equally valid)
- “Spaghetti and ___” → “meatballs” (not cheese or vegetables, although those also pair deliciously)
2) Models infer causality even when it’s not explicitly stated
Models learn to predict natural cause-and-effect relationships:
- “He touched the stove and ___” → “screamed” (the model infers the stove was hot)
- “She dropped the glass and it ___” → “shattered” (the model predicts the logical consequence)
3) Models capture subtle shifts in textual tone and intent
Models detect emotional context and adjust their word choices accordingly:
- “You’re ___” said in a supportive tone → “valid” or “right”
- “You’re ___” said in a correcting tone → “incorrect” or “wrong”
Gather enough of these relationships and you have something as powerful as today’s frontier models. LLMs learn implicitly far more than we have ever been able to map out explicitly. Consider what explicitly programming this would require: cataloguing every diverse word association, every context-dependent word choice, every implicit causal relationship across human language. This could take billions of human work-hours to formulate trillions of nuanced relationships, and there would never even be consensus on the ground truth of it all.
Pretraining on massive datasets using Deep Learning sidesteps this impossible task. It is a powerhouse for capturing relationships and complexity. The problem is that in compressing all of this information, it becomes very hard to tease it back apart into its individual components. The concepts and relationships between words become superimposed onto one another.
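A drastically simplified flavor of how statistical association is learned from raw text: counting which word follows which in a toy corpus (a bigram model; real LLMs learn vastly richer, context-dependent statistics, but the principle of extracting associations from data rather than hand-coding them is the same):

```python
from collections import Counter, defaultdict

# A toy "corpus" (invented for illustration); pretraining uses trillions of words.
corpus = "peanut butter and jelly . spaghetti and meatballs . peanut butter and jelly".split()

# Count, for each word, which words follow it and how often.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(word: str) -> str:
    """Predict the most statistically likely next word."""
    return bigrams[word].most_common(1)[0][0]

print(predict("and"))     # "jelly" wins: it followed "and" most often
print(predict("butter"))  # "and"
```

No one programmed the "peanut butter and jelly" pairing; it falls out of the counts. An LLM's associations are learned in the same spirit, but entangled across billions of parameters rather than stored in a legible table.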
But, what are those individual components and how do they take form inside of an LLM?
What do Model “Internals” look like?
LLMs work by performing mathematical operations on internal numerical values called “parameters.” During pretraining, the model adjusts these parameters to more accurately predict text. This adjustment process is how models “learn.”
LLMs take language as input and produce language as output, but their internal processing is purely mathematical. So, words must first be converted into numbers the model can work with. This is where the “embedding space” comes in: a mathematical representation in which every word is translated into a vector (essentially a list of numbers).
A vector in language embedding space has two key properties:
- It points in a direction, which captures its meaning (e.g., “sarcasm” or “personal”)
- It has a magnitude, or length, which captures how strongly that meaning is expressed (e.g., “very sarcastic” vs. “lightly sarcastic”)
In this space, these “concept” vectors form relationships to one another. More interrelated words are trained to cluster together: “weapon” and “gun” would have more similar vectors, as might “Paris” and “France,” whereas “gun” and “France” would be farther apart.
Image 1. Converting text into embedding vectors.
Image 2. Conceptual clustering of words in embedding space.
The vectors processed inside of LLMs exist in this space and encode several entangled concepts. A stronger version of the prevailing theory on how concepts are represented, known as the “Linear Representation Hypothesis,” suggests that these entangled concepts can be separated into more fundamental components. The idea is that every fundamental core concept can be captured as a vector direction, and when you combine these core vectors together you get the cumulative conceptual meaning. Even if this doesn’t map directly onto any individual word, the broader fuzzy concept is captured.
While it is uncertain whether these core concepts truly exist, we do see that combining vectors does seem to capture shared meaning. For instance, when you take the vector for “king,” subtract the vector for “man,” and add the vector for “woman,” you get something very close to “queen.” This suggests that many conceptual relationships, like gender or royalty, are generally encoded as directions in this mathematical space.
Image 3. Embedding vectors adding upon one another to represent cumulative meaning. Source: 3Blue1Brown.
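This kind of vector arithmetic can be sketched numerically with toy embeddings (the vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions and are learned, not hand-set):

```python
import numpy as np

# Toy 3-dimensional "embeddings" (hypothetical values chosen for illustration).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, ~0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land near queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(analogy, emb[w]))
print(nearest)  # "queen"
```

The directions that survive the subtraction (here, roughly "royalty" and "femaleness") are exactly the kind of conceptual components the Linear Representation Hypothesis posits.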
It is within this space of relationships that LLMs process information, giving rise to the impressive capabilities we see in production today. Without it, models could not grasp what bridges and distinguishes different groups of semantic meaning, nor could they solve challenging problems in highly technical fields.
Understanding the embedding space within which LLMs encode meaning is crucial because it is where interpretability research operates. The question becomes: can we map how models process information in this space well enough to understand why models make specific decisions?
Generally, there are two different approaches to this: Mechanistic Interpretability, which is precise yet impractical, and Representation Interpretability, which is practical yet imprecise.
Mechanistic Interpretability and Challenges
Mechanistic interpretability aims to reverse-engineer neural networks by identifying the specific model components that give rise to a particular behavior. It asks how we can translate each unit of processing done in embedding space into a human-explainable interpretation. We gather these minute pieces together, map their interactions (i.e. circuits), and predict the “thought” process that led to the cumulative result.
A commonly used tool in mechanistic interpretability is the sparse autoencoder (SAE), whose design allows it to (to some degree) separate the cumulative vectors processed inside a model into more core concepts. Specifically, sparse autoencoders contain mechanisms trained to activate on basic core concepts rather than firing diffusely across many mixed signals. After training, we investigate these mechanisms post-hoc, identify what topics in the prompt they tend to correlate with, and assign each a label describing what ‘concept’ it seems to represent.
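The core design can be sketched in a few lines: expand an activation into many non-negative features, penalize how many are active, and require the features to reconstruct the original activation. The sizes and random weights below are stand-ins; real SAEs are trained on a model's actual activations and use far more features:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_features = 16, 64   # hypothetical sizes; real SAEs use millions of features

# Randomly initialized weights stand in for trained ones.
W_enc = rng.normal(size=(d_model, d_features)) * 0.1
b_enc = np.zeros(d_features)
W_dec = rng.normal(size=(d_features, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode a model activation into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec                # decoder reconstructs the input
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coef=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    return float(np.sum((x - x_hat) ** 2) + l1_coef * np.sum(np.abs(f)))

x = rng.normal(size=d_model)   # a stand-in for a residual-stream activation
f, x_hat = sae_forward(x)
print((f > 0).sum(), "of", d_features, "features active")
```

After training against this loss, each of the surviving feature directions is the "mechanism" researchers then label post-hoc by inspecting what inputs make it fire.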
For example, in Anthropic’s “Scaling Monosemanticity” paper, researchers found that one of these mechanisms tended to fire predictably on text about the Golden Gate Bridge. It activated with high accuracy when the model processed text referencing the bridge across varied phrasings, indirect descriptions, and even different languages. To test its causal role, researchers boosted the mechanism. The result: the model began talking about the Golden Gate Bridge regardless of the topic of the prompt. This intervention revealed that the feature was not only somewhat conceptually separable, but also held causal power, steering the model’s behavior in a very specific direction.
Beyond identifying individual concepts in a model’s embedding space, we hope to understand how they interconnect and through what circuits. For example, if we know that a vision model has detected wheels, a metal body, and windows in its internal computations, can we identify the process by which it combines these to judge that the overall entity is a car? Or, if we identify a concept corresponding to counting the numbers 1, 2, and 3, can we identify the process that predicts that 4 is going to come next?
It is therefore the hope that we can understand the precise causal reasons behind why a model outputs every single word that it does. However, in most cases where progress has been made, the experimental setups are relatively contrived and not always reliable. The example of the “Golden Gate Bridge” concept is an outlier, and in practice it is still rare that we can distill concepts with such high accuracy. The baseline is for concepts to be incredibly messy.
Let’s build some intuition here. In a paper titled “A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders,” the authors used the same methodology as before and discovered SAE mechanisms with curious behaviors. They post-hoc labeled one mechanism ‘word starts with the letter S’ because it seemed to activate on words beginning with S, such as stream, send, store, and stair. However, researchers found that this mechanism frequently failed to activate on the specific word “short.”
This is simply how standard SAE training happened to incentivize the concept to be packaged. It is no surprise: models compress and organize concepts in whatever way best optimizes their training objective, which inherently creates messy, overlapping representations instead of clean boundaries. While we can attempt to pin down these concepts with greater precision and accuracy, there will likely never be a silver bullet that lets us understand a model’s internal “thought” process with absolute certainty.
This doesn’t mean that we can’t derive useful information about what’s occurring inside a model. The “word starts with the letter S” mechanism is not perfect, but it is frequently accurate. What it does mean is that, under this paradigm (which trends suggest may take us to very capable AI), we will likely always have to synthesize various sources of evidence to assess model internals, and therefore safety, rather than reading off a clear ground truth.
Representation Interpretability and Challenges
Representation interpretability focuses on understanding emergent properties inside of models. Rather than identifying the precise mechanisms that produce an output, it asks: what kind of cumulative information is encoded across superimposed concepts/vectors? Can we read out and intervene on these representations in meaningful, controllable ways?
Some argue that this approach is more aligned with how we study other complex systems. Neuroscientists don’t trace every neuron to understand emotion; they study regional patterns of activation. Meteorologists don’t simulate every air molecule; they model pressure systems and weather fronts. Similarly, representation interpretability acknowledges that intelligible generalizations often arise at higher levels of abstraction across populations rather than at the level of individual parameters and computations.
Representation interpretability identifies concepts inside a model, like tone or intent, by comparing examples that differ mostly (but not solely) in that trait. For instance, take the sentences “I guess that’s one way to do it” and “That’s a brilliant solution.” Both might follow a similar prompt, but the first carries sarcasm while the second expresses praise. By feeding both into the model and analyzing the internal difference between them, researchers can extract a vector that captures the net conceptual attributes, in this case, for example, irony and approval. With this methodology, researchers can obtain a rough proxy for a wide range of social, emotional, or stylistic concepts encoded inside the model.
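The contrastive recipe can be sketched as a difference of mean activations. The arrays below are random stand-ins for hidden states a real model would produce on contrasting prompts; in practice the activations come from a specific layer of the model's residual stream:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # hypothetical hidden size

# Stand-ins for hidden states collected on sarcastic vs. sincere prompts.
# In practice these come from a real model's activations at a chosen layer.
sarcastic_acts = rng.normal(loc=0.5, size=(20, d_model))
sincere_acts   = rng.normal(loc=-0.5, size=(20, d_model))

# The difference of means gives a crude "sarcasm direction".
direction = sarcastic_acts.mean(axis=0) - sincere_acts.mean(axis=0)
direction /= np.linalg.norm(direction)   # normalize to a unit direction

def steer(hidden_state, direction, strength):
    """Nudge an activation along (positive) or against (negative) the direction."""
    return hidden_state + strength * direction

h = rng.normal(size=d_model)                      # a stand-in activation
h_less_sarcastic = steer(h, direction, -2.0)      # push away from sarcasm
```

Adding this vector back into the model's activations at generation time (with a chosen sign and strength) is the basic mechanism behind the suppression and amplification applications below.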
The traces of these concepts can be leveraged for various useful ends. Researchers have used related techniques to:
- Suppress directions corresponding to undesirable behaviors. By locating vectors linked to harmful traits like toxicity, bias, or manipulation, researchers can dampen their influence during generation, reducing the likelihood of correlated behaviors appearing in outputs.
- Amplify directions corresponding to desirable behaviors. Desired traits can be reinforced by shifting representations along directions associated with honesty, helpfulness, or politeness, nudging the model toward more aligned responses.
- Edit model beliefs directly in representation space. Internal representations encoding factual beliefs such as “Eiffel Tower is in Paris, France,” can be located and modified to “Eiffel Tower is in Rome, Italy,” rewriting parts of the model’s worldview with variable consistency.
- Halt unsafe generations mid-process with circuit breakers. Circuit breakers first detect internal representation patterns associated with harmful content, redirect them to orthogonal (incoherent) directions, and then train the model on these modified representations to interrupt coherent harmful generations while preserving helpful responses.
- Learn tamper-resistant safeguards that survive retraining attempts. Tamper resistance techniques embed safety constraints into the model’s representations in ways that remain durable even after extended retraining for misuse, making it significantly harder for bad actors to remove pro-alignment behaviors in open-weight models.
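The suppression idea in the first bullet can be sketched as projecting a concept direction out of an activation. This is a simplified sketch; real interventions operate on live model activations, and the "toxicity" direction here is a random placeholder:

```python
import numpy as np

def ablate_direction(h, direction):
    """Remove the component of activation h lying along a concept direction,
    zeroing the model's expression of that concept at this layer
    (a crude sketch of directional suppression)."""
    direction = direction / np.linalg.norm(direction)   # ensure unit norm
    return h - (h @ direction) * direction

rng = np.random.default_rng(1)
tox = rng.normal(size=16)   # hypothetical "toxicity" direction
h = rng.normal(size=16)     # hypothetical hidden state
h_clean = ablate_direction(h, tox)

# After ablation, h_clean has essentially no component along the direction.
print(abs(float(h_clean @ (tox / np.linalg.norm(tox)))))
```

The bluntness discussed below follows directly from this picture: everything correlated with the direction is removed, whether or not it was actually the trait we intended to target.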
However, representation interpretability also has significant weaknesses, mostly centered on precision. Concept vectors such as “truthfulness” or “politeness” can be applied, for the behavioral steering described above, to only one token (i.e., word) at a time. In a sentence like “I thought the movie was great,” should truthfulness be applied to “thought,” “movie,” or “great”? Each seems plausible, but no existing methodology can automatically provide the answer. Further, while representation engineering excels at editing broad, conceptual traits, it struggles with the precise pathways that govern computational processes. For example, adding a vague “success at math” concept vector does very little to improve actual mathematical reasoning.
Additionally, the features identified in representation space often blend multiple traits together. Suppressing “toxicity” might inadvertently mute confrontational language that is justified. This makes for blunt rather than surgical interventions. The edits operate over broad patterns rather than targeting specific entities, making it difficult to achieve the kind of precise, isolated control that many safety applications would require. In short, representation-based approaches may help us describe what is encoded, but they still struggle to tell us why or how that encoding governs behavior with specificity.
Future Implications
Interpretability is a deeply compelling but persistently difficult goal. Mechanistic and representation interpretability offer valuable and complementary approaches to understanding AI systems, each with distinct trade-offs. Recent moves by DeepMind to scale back work on sparse autoencoders, contrasted with Anthropic’s continued investment, reflect an unresolved debate between the two approaches.
Mechanistic interpretability excels at precision when it works reliably. Examples like the Golden Gate Bridge feature suggest it could provide detailed explanations for specific behaviors, potentially making it useful for mitigating risk of specific dangerous behaviors where tracing the granular roots matters most. However, successful implementations have primarily been in relatively narrow, friendly, and controlled contexts on smaller models than the frontier.
Representation interpretability appears to offer broader functional utility. While it may not explain precise mechanisms, it seems effective at identifying and steering general behavioral patterns. For risks involving persuasion, bias, or general alignment issues, representation-level interventions that broadly shape behavior might prove more practically useful than detailed mechanistic understanding.
Researchers should continue exploring both paradigms rather than betting entirely on one framework. Mechanistic interpretability may provide precision tools for targeted detection and intervention in particular proven cases, while representation interpretability could offer broader steering capabilities for general behavioral alignment.
However, neither approach should be expected to provide complete understanding of frontier AI systems. The complexity of these models, combined with the fundamental challenges outlined above, suggests that robust and comprehensive interpretability remains unlikely. Success will more likely come from using these as tools that work to provide informative evidence to enhance safety and control in a broader holistic sense while accepting inherent limitations and uncertainties.
Success likely means building tools that contribute to an overall picture of safety rather than perfect understanding and control. Yet in this otherwise opaque landscape, even partial insight is invaluable.
Policy Recommendations
As AI systems make increasingly consequential decisions, our limited understanding of their internal reasoning threatens our ability to audit their alignment and reliability. Even partial visibility into why models behave how they do can surface risk and support early intervention as their influence expands. Without such advances, models will remain vulnerable to failures we don’t fully understand. Four policy measures can address these urgent challenges:
[Response 1] Strengthen State Capacity in AI. Resource technical institutions to assess model interpretability and embed AI experts across agencies to evaluate what evidence these methods contribute to a holistic understanding of model safety.
[Response 2] Invest in Public Research. Fund academic, nonprofit, and government research into both mechanistic and representation-based interpretability through agencies like NSF, DARPA, CAISI and other relevant federal research programs. Public investment guides research priorities and ensures long-term progress beyond commercial timelines.
[Response 3] Transparency. Direct CAISI to develop standardized interpretability disclosures for Secure Development Frameworks, allowing organizations to report on their understanding of model internals in a consistent format.
[Response 4] Financial Incentives. Consider offering tax credits or procurement preferences for models with demonstrable interpretability features.
Conclusion
AI interpretability has no silver bullet solution. Mechanistic approaches offer precision in narrow cases while representation methods provide broader steering. Full interpretability faces fundamental barriers due to the associative and superimposed nature of information in neural networks.
Four policy responses emerge: strengthen state capacity by resourcing technical institutions and embedding AI experts across agencies; invest in public research on both mechanistic and representation interpretability; require transparency through standardized disclosures in development frameworks; and create financial incentives like tax credits for demonstrable interpretability features.
The stakes demand action. Without the ability to trace how or why a model arrived at a given output, meaningful oversight becomes difficult. Even tools that provide partial interpretability can help identify and mitigate risks as AI systems grow more powerful and transformative to our society. Governments that invest in this capacity today will be better positioned to create the interpretable, reliable, and controllable AI of tomorrow.