On Learning Mechanistic Interpretability
I’ve been spending some of my side-project time trying to get a real working understanding of mechanistic interpretability — the subfield of AI safety research that tries to reverse-engineer what’s actually happening inside neural networks.
My entry point was the Chris Olah episode of the 80,000 Hours podcast, which gave me a useful frame: mech interp is less like behavioral testing and more like trying to read source code. From there I’ve been working through Elhage et al.’s “A Mathematical Framework for Transformer Circuits” and the ARENA exercises. The plan deliberately interleaves the two, pairing each reading with a notebook exercise on the same concepts, so I’m checking whether I actually understand each idea well enough to do something with it, not just accumulating theory.
This is also, explicitly, an experiment in how effectively agentic tools can accelerate this kind of self-directed technical learning. I’m deliberately pushing the boundary of how much work I can have Claude Code do for me while still building expertise and outputs I both understand and am confident in. The vast majority of the code in this project will be written by tools — but I’d like to feel the same sense of ownership at the end that I would if I’d hand-crafted every line.
The project I’m building toward is called Circuit Generalization. The core question: when a circuit is identified for a task on a specific prompt, does it actually generalize to other prompts requiring the same underlying computation? Anthropic’s 2025 “Biology of a Large Language Model” paper flags this as an open problem — their attribution methodology only succeeds on roughly 25% of attempted prompts, and the reasons for failure aren’t well understood.
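To make that question concrete for myself, here is a toy sketch of what "does a circuit generalize?" could mean operationally. This is not Anthropic's attribution methodology or any real transformer; the "model" below is an invented stand-in where each prompt's output is just a sum of per-component contributions. The idea it illustrates: identify a circuit (a small subset of components) on one prompt, then check on held-out prompts whether that same subset still recovers most of the full output.

```python
# Toy sketch (hypothetical, not a real interp pipeline): "circuit
# generalization" as whether a component subset found on one prompt
# stays faithful on other prompts.
import random

N_COMPONENTS = 8

def component_contributions(prompt_id):
    """Toy model: each prompt gets a fixed, pseudo-random contribution
    from each of N_COMPONENTS components."""
    rng = random.Random(prompt_id)
    return [rng.gauss(0, 1) for _ in range(N_COMPONENTS)]

def run(prompt_id, keep):
    """Run the toy model with only the components in `keep` active
    (everything else zero-ablated)."""
    contribs = component_contributions(prompt_id)
    return sum(c for i, c in enumerate(contribs) if i in keep)

def faithfulness(prompt_id, circuit):
    """Fraction of the full-model output that the circuit alone recovers."""
    full = run(prompt_id, set(range(N_COMPONENTS)))
    return run(prompt_id, circuit) / full if full != 0 else 0.0

# "Identify" a circuit on a single source prompt: keep the three
# components with the largest absolute contributions there.
source_prompt = 0
contribs = component_contributions(source_prompt)
circuit = set(sorted(range(N_COMPONENTS), key=lambda i: -abs(contribs[i]))[:3])

# Generalization test: on how many held-out prompts is that same
# circuit still faithful above some threshold?
threshold = 0.8
results = [faithfulness(p, circuit) >= threshold for p in range(1, 21)]
generalization_rate = sum(results) / len(results)
print(f"circuit generalizes on {generalization_rate:.0%} of test prompts")
```

In this toy setup the per-prompt contributions are independent, so a circuit fit to one prompt should generalize poorly, which is exactly the failure mode the real question is about: whether circuits found in actual models track the underlying computation or just the prompt they were found on.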
I’m tracking the work in a public lab notebook. I’m not a professional researcher — this is independent learning, done in public in the spirit of open science.