After graduating from Princeton in 2024, I spent the following year floating around: the summer surfing in Indonesia; the fall interning at the Max Planck Institute in Munich, Germany; and the winter and spring fulfilling my dream of working at a ski lodge in Alta, UT. I’m incredibly lucky to have been able to do any of that, much less all of it, and I recognize that most people lucky enough to graduate from college have too much to worry about to even consider taking a gap year. This summer, a year later than most, I’m looking to the future as so many do after graduating. That means it’s time to find a job.

A few months into that endeavor, I’ve come to the surprising conclusion that I should start a blog discussing some informal research—but why? I’m writing this first post not only as an explanation to whoever may read this (and there will likely be very few who do), but also to think through things myself. In particular, I’d like to answer the following question: why spend the time to write a blog—something that is quite out of character for me—especially given the already time-consuming (and far more important) task at hand of finding a job?

In short, I’m very excited about AI interpretability research but have little direct prior experience, and there is no better way to get experience than by just going ahead and doing it, even if in an informal capacity such as this blog. And as I do that informal research, I should share my results because: (i) communication is a critical part of research; (ii) some of what I do may be interesting to people (including potential employers); and (iii) more generally, life is about putting yourself out there, and this is a great way to do just that.

Why Interpretability?

Given that I just said I have little prior experience, why am I so interested in AI interpretability research (and what even is it)? Briefly put, just as cognitive science seeks to understand the human mind, interpretability seeks to understand artificial minds—specifically, the minds composed of parametric functions at incredible scale that we collectively label artificial intelligence (AI).

Why is interpretability important? Whatever you think about AI’s capabilities—whether you subscribe to the “Sam Altman view” or the “Gary Marcus view”—it’s becoming increasingly obvious that there is already something pretty impressive going on. Whether you see these models as truly generalizing, reasoning intelligences or as simple statistical parrots, they are already solving math problems that I, a math major, could not solve (see DeepMind’s recent IMO result), and already fooling your grandmother about which videos on the internet are real (see John Oliver’s segment on AI slop). And whatever you think about AI’s impact—only 17% of American adults think it will be positive, compared to 56% of so-called AI experts (Pew Research, 2025)—it’s becoming increasingly obvious that it will be massive and multifaceted. As AI’s capabilities and impact continue to grow, it will be increasingly important to understand how these systems work, so that they can be monitored and controlled. Regardless of how likely so-called “doomsday scenarios” (see AI 2027) seem today, most AI experts readily admit their probability is non-zero—that alone is a sign we should take them seriously.

But, despite this, AI remains fundamentally a black box. To some extent, this may be an inevitable result of the irreducible complexity of distributed processing systems—certainly, this is at least somewhat true of the brain. But neuroscience has made incredible progress already, and artificial neural networks are much easier (and safer) to tinker with than biological ones. It is incredibly important that we do so—that we understand these artificial minds, and hence get them to do what we want them to do—before they become too intelligent and too widely distributed. Aside from this apocalyptic motivation, there are a host of other opportunities: as an analogy, just imagine what we could do for neurological disorders and mental health with a perfect-resolution scan of the brain followed by synapse-precision treatments.

In addition to its importance, I also find interpretability incredibly beautiful and interesting, lying at the intersection of the two sides of my background (mathematics and cognitive science). My bachelor’s thesis was a study of the inductive biases of small recurrent networks learning base addition; though it only touched on interpretability, it gave me a taste of what the research is like—and I quite liked it, as evidenced by my spending last year revising it into a paper. Last fall, my work at the Max Planck Institute for Biological Intelligence on local learning methods in small MLPs gave me a sense of just how remarkable the brain is, and how impressive it is that we’ve come even close to approximating its intelligence artificially. And since starting my job search two months ago, I’ve been reading interpretability papers extensively, often struggling to put them down (I should write a future post on my favorite papers so far).

Okay, Interpretability, But Why a Blog?

Though interpretability’s importance is obvious and the early signs of its fit with my interests are positive, it’s also true that I have little direct prior experience. This is a problem not only for potential employers but also for me: is my interest in interpretability as deep as I think it is? A blog of informal research is an opportunity to really get a feel for the work, and crucial to that is not only doing the research but communicating it. It’s also (aspirationally) possible that something I do here will be interesting to the interpretability community or (again, aspirationally) to a potential employer.

This hints at a broader factor in writing a blog like this, one that weirdly scares me: what if it isn’t any good? I’m under no illusion that research is easy; it’s incredibly challenging, especially on your own, without much in the way of mentorship (not to mention compute). So I may very well fail to produce anything interesting, which would demonstrate the opposite of what I’m intending. But that’s what research is all about: blindly shuffling around in a pitch-black room, intuiting which direction to go, and hoping to find something interesting. Even in the happy event that I do find something, I know I’ll want to spend far more time than I have coming up with follow-up experiments, ironing out the results, and writing it up as clearly as possible. So this will be an exercise in “imperfectionism”. There is never enough time, and so here I go, out into the dark.