John Martin, Manfred and the Alpine Witch
Another very recent essay (April 2025) by Dario Amodei, the CEO of Anthropic. Once again, let me be clear that I hold no firm opinion on the future of AI and LLMs. Whatever becomes of these technologies, it is fascinating, like an injection of science fiction into everyday life. On these same questions, see AI 2027 and Machines of Loving Grace.
In the wondrous world of large language models (LLMs), interpretability means understanding how these systems work internally. The stakes are potentially enormous, because with understanding comes not only the possibility of control but also new opportunities for research and development.
Dario Amodei points out that this lack of understanding is almost unprecedented in the history of computing technology.
If an ordinary software program does something—for example, a character in a video game says a line of dialogue, or my food delivery app allows me to tip my driver—it does those things because a human specifically programmed them in. Generative AI is not like that at all. When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate. As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built—their internal mechanisms are “emergent” rather than directly designed. It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth, but the exact structure which emerges is unpredictable and difficult to understand or explain. Looking inside these systems, what we see are vast matrices of billions of numbers.
This emergent nature makes AI behavior difficult to predict.
Beyond safety concerns, Dario Amodei notes that interpretability would also help determine whether AIs are mere pattern-matchers or creatures possessing something approaching consciousness. Personally, I am not sure the line between the two is so clear: when a human converses, the words are not chosen by a deciding consciousness; to some extent, the words simply appear, generated through logical, probabilistic chains. The same goes for thought.
The discipline devoted to prying open the black box of artificial neural networks is known specifically as mechanistic interpretability. The author briefly retraces its history before turning to the current lines of work that are moving the field forward. I quote a few passages.
We quickly discovered that while some neurons were immediately interpretable, the vast majority were an incoherent pastiche of many different words and concepts. We referred to this phenomenon as superposition, and we quickly realized that the models likely contained billions of concepts, but in a hopelessly mixed-up fashion that we couldn’t make any sense of. The model uses superposition because this allows it to express more concepts than it has neurons, enabling it to learn more. If superposition seems tangled and difficult to understand, that’s because, as ever, the learning and operation of AI models are not optimized in the slightest to be legible to humans.
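To give a rough intuition for superposition, here is a minimal numpy sketch (not Anthropic's actual setup): a layer with only 3 neurons is asked to carry 6 sparse features, so it stores them as overlapping, non-orthogonal directions, and each feature can still be read back approximately as long as few features are active at once. All names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_neurons = 6, 3  # more concepts than neurons

# Each feature gets a unit direction in the small neuron space.
# With n_features > n_neurons the directions cannot all be orthogonal,
# so every neuron ends up mixing several features: superposition.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 2 is active.
x = np.zeros(n_features)
x[2] = 1.0

# "Write" the feature into the neuron activations, then "read" all
# features back by projecting onto their directions.
activations = x @ W          # shape (n_neurons,)
readout = activations @ W.T  # shape (n_features,)

print(np.round(readout, 2))
# Feature 2 reads back near 1.0; the other entries are small
# interference terms caused by the overlapping directions.
```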
The concepts that these combinations of neurons could express were far more subtle than those of the single-layer neural network: they included the concept of “literally or figuratively hedging or hesitating”, and the concept of “genres of music that express discontent”. We called these concepts features, and used the sparse autoencoder method to map them in models of all sizes, including modern state-of-the-art models. For example, we were able to find over 30 million features in a medium-sized commercial model (Claude 3 Sonnet). Additionally, we employed a method called autointerpretability—which uses an AI system itself to analyze interpretability features—to scale the process of not just finding the features, but listing and identifying what they mean in human terms.
Finding and identifying 30 million features is a significant step forward, but we believe there may actually be a billion or more concepts in even a small model, so we’ve found only a small fraction of what is probably there, and work in this direction is ongoing. Bigger models, like those used in Anthropic’s most capable products, are more complicated still.
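The sparse autoencoder idea can be sketched in a few lines of PyTorch. This is a generic illustration, not Anthropic's implementation: a wide dictionary of latent "features" is trained to reconstruct a model's internal activations, with an L1 penalty pushing most latents to zero so that each one can hopefully be read as a single concept. The dimensions, hyperparameters, and the random stand-in activations are all placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs activations through a wide, sparsely activated dictionary."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> feature space
        self.decoder = nn.Linear(d_dict, d_model)  # feature space -> activation

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # non-negative feature strengths
        recon = self.decoder(features)
        return recon, features

# Toy training loop on random "activations" (stand-ins for a real model's
# internal vectors); d_dict >> d_model so that activity can be spread sparsely.
d_model, d_dict = 64, 1024
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity penalty strength (placeholder value)

for step in range(200):
    acts = torch.randn(256, d_model)
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```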
Once a feature is found, we can do more than just observe it in action—we can increase or decrease its importance in the neural network’s processing. The MRI of interpretability can help us develop and refine interventions—almost like zapping a precise part of someone’s brain. Most memorably, we used this method to create “Golden Gate Claude”, a version of one of Anthropic’s models where the “Golden Gate Bridge” feature was artificially amplified, causing the model to become obsessed with the bridge, bringing it up even in unrelated conversations.
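The "Golden Gate Claude" intervention amounts to injecting a feature's direction back into the model's activations with an exaggerated coefficient. Below is a minimal PyTorch sketch of that kind of steering on a stand-in module, using a forward hook; the feature direction, the layer, and the scale are hypothetical, since the real experiment operated on features found inside Claude itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Stand-in for one layer of the model whose output we want to steer.
layer = nn.Linear(d_model, d_model)

# Hypothetical feature direction (in the real case, the dictionary
# direction corresponding to, say, the "Golden Gate Bridge" feature).
feature_direction = torch.randn(d_model)
feature_direction /= feature_direction.norm()
steering_scale = 10.0  # amplify the feature well beyond its natural level

def steer(module, inputs, output):
    # Add the amplified feature direction to every activation vector.
    return output + steering_scale * feature_direction

handle = layer.register_forward_hook(steer)

x = torch.randn(4, d_model)
steered = layer(x)
handle.remove()
unsteered = layer(x)

# The steered activations now point strongly along the chosen feature.
print((steered @ feature_direction).mean(), (unsteered @ feature_direction).mean())
```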
We are thus in a race between interpretability and model intelligence.