Understanding what goes on inside an AI’s ‘mind’ remains a real challenge, even though we often compare it to human thinking. Anthropic’s latest research makes progress on clearing up that mystery. By grasping how large language models produce behavior that seems intelligent, we can improve these systems’ capabilities and keep them in check as they start to outpace human abilities.
Unlike older programs that depended on logical rules, neural networks learn skills on their own. Their internal workings are complex and often feel like a ‘black box.’ But Anthropic is making headway here. They’ve shown that they can link a model’s activity to both concrete and abstract concepts. In their new studies, they’ve demonstrated how these models connect concepts to guide decision-making.
The research focused on Claude 3.5 Haiku, Anthropic’s smallest model. The team trained a ‘replacement model’ that approximates Haiku’s internal computations using features that are easier to interpret, then traced how those features join into ‘circuits’ that determine the model’s responses. This approach exposes the intermediate ‘thinking’ steps and the combinations of concepts that lead to a final output.
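To make the idea concrete, here is a minimal sketch of the general approach behind such replacement models: train a layer of sparse, non-negative features to reconstruct the original model’s internal activations, so that each feature tends to line up with a human-interpretable concept. The dimensions, the data, and the training loop below are illustrative assumptions, not Anthropic’s actual architecture or code.

```python
import torch
import torch.nn as nn

hidden_dim, n_features = 64, 512  # hypothetical sizes

class SparseFeatureLayer(nn.Module):
    """Encode an activation vector into many sparse features, then decode it back."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, n_features)
        self.decoder = nn.Linear(n_features, hidden_dim)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

model = SparseFeatureLayer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for activations captured from the original model on a batch of prompts.
activations = torch.randn(256, hidden_dim)

for step in range(200):
    features, reconstruction = model(activations)
    # Reconstruction error keeps the replacement faithful to the original model;
    # the L1 penalty keeps features sparse so each one tends to track a single concept.
    loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once such features exist, researchers can trace which ones activate together on a given prompt and treat those co-activations as a candidate ‘circuit’ for how the model reaches its answer.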
A companion study applied this technique to the model’s behavior on tasks such as multi-step reasoning, poetry, medical diagnosis, and math. The findings were both surprising and enlightening. For instance, although the model can respond in many languages, it first represents concepts with language-independent features and only then settles on the language of its answer.
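A toy way to picture that multilingual finding: the reasoning step operates over abstract concepts, and the output language enters only when the answer is put into words. The concept names, lexicon, and task below are hypothetical stand-ins, not the model’s actual representations.

```python
# Reasoning happens over language-independent concept IDs; the language is
# chosen only at the final, word-producing step. Purely illustrative.
CONCEPT_ANTONYMS = {"SMALL": "LARGE", "HOT": "COLD"}

LEXICON = {
    "en": {"LARGE": "large", "COLD": "cold"},
    "fr": {"LARGE": "grand", "COLD": "froid"},
    "zh": {"LARGE": "大", "COLD": "冷"},
}

def opposite_of(concept: str, language: str) -> str:
    answer_concept = CONCEPT_ANTONYMS[concept]  # language-independent reasoning step
    return LEXICON[language][answer_concept]    # language enters only at output time

print(opposite_of("SMALL", "en"))  # large
print(opposite_of("SMALL", "fr"))  # grand
print(opposite_of("SMALL", "zh"))  # 大
```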
Language models are trained to predict the next word in a sequence, yet when generating a line of poetry the model chose its rhyming word first and then built the line toward it, a sign of longer-term planning rather than word-by-word improvisation. The researchers also explored ‘unfaithful reasoning,’ where a model offers an explanation that sounds plausible but doesn’t reflect what it actually did. Asked to add two numbers, the model combined rough approximations and then narrowed them down to the exact answer, yet its explanation described the traditional written method. That gap between what the model does and what it says it does matters for how much we trust these systems and how we align their behavior.
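As a deliberately crude caricature of that addition strategy, consider two paths: one keeps only a ballpark estimate of the sum, another tracks only the final digit, and the two are merged at the end. The particular split and the numbers below are illustrative assumptions rather than the model’s actual circuit; the point is that the procedure looks nothing like the traditional written method the model describes in its explanation.

```python
def rough_path(a: int, b: int) -> int:
    # Ballpark path: each operand is only "seen" to the nearest multiple of 5,
    # so the estimate can be off by up to 4 in total.
    return 5 * round(a / 5) + 5 * round(b / 5)

def ones_digit_path(a: int, b: int) -> int:
    # Precise path: knows nothing about magnitude, only the final digit.
    return (a % 10 + b % 10) % 10

def combine(a: int, b: int) -> int:
    estimate = rough_path(a, b)
    last = ones_digit_path(a, b)
    # The ballpark is within 4 of the true sum, so exactly one candidate in this
    # nine-value window ends in the right digit -- and that candidate is the answer.
    return next(c for c in range(estimate - 4, estimate + 5) if c % 10 == last)

print(combine(36, 59))  # 95
```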
The researchers acknowledge that their method captures only a partial view of the model’s inner workings and still demands a great deal of human effort. Even so, this kind of understanding will be essential as AI systems like Claude become a larger part of daily life.