Submission + - Anthropic can now track the bizarre inner workings of a large language model
What the firm found challenges some basic assumptions about how this technology really works.
MIT Technology Review
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Fwww.technologyreview.com%2F2025%2F03%2F27%2F1113916%2Fanthropic-can-now-track-the-bizarre-inner-workings-of-a-large-language-model%2F (paywalled)
https://ancillary-proxy.atarimworker.io?url=https%3A%2F%2Farchive.is%2F4mujU (free)
= This caught my eye first: studying something that claims to be brainy with the tools and approaches of brain research.
Anthropic says it was inspired by brain-scan techniques used in neuroscience to build what the firm describes as a kind of microscope that can be pointed at different parts of a model while it runs. The technique highlights components that are active at different times. Researchers can then zoom in on different components and record when they are and are not active.
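= To make the "record when components are active" idea concrete, here is a rough toy sketch, not Anthropic's actual tooling (which traces learned interpretable features, not raw neurons): run an ordinary open model with forward hooks and note which MLP units light up on a prompt. The model name and activation threshold are arbitrary placeholders.

```python
# Toy sketch only: note which MLP units "light up" while a small open model
# processes a prompt -- loosely analogous to recording when components are
# active. Anthropic's "microscope" works on learned features, not raw neurons;
# the model name and threshold here are arbitrary placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder small model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

active = {}  # layer index -> set of unit indices that exceeded the threshold

def make_hook(layer_idx, threshold=2.0):
    def hook(module, inputs, output):
        # output: (batch, seq_len, hidden) activations from this layer's MLP
        fired = (output.abs() > threshold).flatten(0, 1).any(dim=0)
        active.setdefault(layer_idx, set()).update(
            fired.nonzero().flatten().tolist()
        )
    return hook

handles = [
    block.mlp.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.transformer.h)
]

with torch.no_grad():
    model(**tok("The capital of France is", return_tensors="pt"))

for h in handles:
    h.remove()

for layer, units in sorted(active.items()):
    print(f"layer {layer}: {len(units)} units above threshold")
```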
= Secondly, LLMs get "better" when they know when to shut up.
The latest generation of large language models, like Claude 3.5, Gemini, and GPT-4o, hallucinate far less than previous versions, thanks to extensive post-training (the steps that take an LLM trained on text scraped from most of the internet and turn it into a usable chatbot). But the team, led by Anthropic research scientist Joshua Batson, was surprised to find that this post-training seems to have made Claude refuse to speculate as a default behavior. When it did respond with false information, it was because some other component had overridden the “don’t speculate” component.