Completing the BlueDot Technical AI Safety Course

A few weeks ago I finished the BlueDot Impact Technical AI Safety course. If you have any interest in AI safety (technical or otherwise), it’s worth your time.

The course has six units, spanning the nature of the technical challenge, approaches to training safer models, interpretability and detection, harm minimization, and how to start contributing. For me, coming in with a software engineering background and some prior reading, it filled in a lot of gaps. There were entire threads of the field I hadn’t engaged with seriously before (alignment research directions, threat models, evaluation approaches) that I now have at least a working map of.

The unit on mechanistic interpretability was the one that changed what I’m working on. The framing of mech interp as an attempt to read the source code of neural networks, understanding not just what models output but how they compute, is both intellectually compelling and genuinely safety-relevant. That’s the thread I’m now pulling on directly with my Circuit Generalization project.

What I appreciated most about the course structure was the discussion sessions. Each unit is a few hours of reading followed by a small-group Zoom with a facilitator and around eight peers. The discussions were well facilitated and did real work: connecting the reading to practical implications, surfacing disagreements in the field, and making the material stick in a way that solo reading doesn’t. The cadence also kept me on track through a busy few weeks, which self-paced learning rarely does.

The course is free. That’s remarkable for the quality of what’s on offer. I’d highly recommend it to anyone with an engineering or research background who’s been meaning to engage seriously with AI safety, and I hope to find ways to support the organization in the future.