Eric Todd

Conference Publications

Jaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla Brodley, Arjun Guha, Jonathan Bell, Byron C. Wallace, David Bau. NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals. Proceedings of the 2025 International Conference on Learning Representations. (ICLR 2025)

We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. NDIF is a scalable inference service that executes NNsight requests, allowing users to share GPU resources and pretrained models. These technologies are enabled by the intervention graph, an architecture developed to decouple experiment design from model runtime. Together, this framework provides transparent and efficient access to the internals of deep neural networks such as very large language models (LLMs) without imposing the cost or complexity of hosting customized models individually. We conduct a quantitative survey of the machine learning literature that reveals a growing gap in the study of the internals of large-scale AI. We demonstrate the design and use of our framework to address this gap by enabling a range of research methods on huge models. Finally, we conduct benchmarks to compare performance with previous approaches. Code, documentation, and tutorials are available at https://nnsight.net/.

Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau. Function Vectors in Large Language Models. Proceedings of the 2024 International Conference on Learning Representations. (ICLR 2024)

We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are robust to changes in context, i.e., they trigger execution of the task on inputs such as zero-shot and natural text settings that do not resemble the ICL contexts from which they are collected. We test FVs across a range of tasks, models, and layers and find strong causal effects across settings in middle layers. We investigate the internal structure of FVs and find while that they often contain information that encodes the output space of the function, this information alone is not sufficient to reconstruct an FV. Finally, we test semantic vector composition in FVs, and find that to some extent they can be summed to create vectors that trigger new complex tasks. Our findings show that compact, causal internal vector representations of function abstractions can be explicitly extracted from LLMs.

Eric Todd, Mylan R. Cook, Katrina Pedersen, David S. Woolworth, Brooks A. Butler, Xin Zhao, Colt Liu, Kent L. Gee, Mark K. Transtrum, Sean Warnick. Automatic detection of instances of focused crowd involvement at recreational events. Proceedings of Meetings on Acoustics 39, 040003. (2019)

This paper describes the development of an automated classification algorithm for detecting instances of focused crowd involvement present in crowd cheering. The purpose of this classification system is for situations where crowds are to be rewarded for not just the loudness of cheering, but for a concentrated effort, such as in Mardi Gras parades to attract bead throws or during critical moments in sports matches. It is therefore essential to separate non-crowd noise, general crowd noise, and focused crowd cheering efforts from one another. The importance of various features—both spectral and low-level audio processing features—are investigated. Data from both parades and sporting events are used for comparison of noise from different venues. This research builds upon previous clustering analyses of crowd noise from collegiate basketball games, using hierarchical clustering as an unsupervised machine learning approach to identify low-level features related to focused crowd involvement. For Mardi Gras crowd data we use a continuous thresholding approach based on these key low-level features as a method of identifying instances where the crowd is particularly active and engaged.

Brooks A. Butler, Katrina Pedersen, Mylan R. Cook, Spencer G. Wadsworth, Eric Todd, Dallen Stark, Kent L. Gee, Mark K. Transtrum, Sean Warnick. Classifying crowd behavior at collegiate basketball games using acoustic data. Proceedings of Meetings on Acoustics 35, 055006. (2018)

The relationship between crowd noise and crowd behavioral dynamics is a relatively unexplored field of research. Signal processing and machine learning (ML) may be useful in classifying and predicting crowd emotional state. This paper describes using both supervised and unsupervised ML methods to automatically differentiate between different types of crowd noise. Features used include A-weighted spectral levels, low-level audio signal parameters, and Mel-frequency cepstral coefficients. K-means clustering is used for the unsupervised approach with spectral levels, and six distinct clusters are found; four of these clusters correspond to different amounts of crowd involvement, while two correspond to different amounts of band or public announcement system noise. Random forests are used for the supervised approach, wherein validation and testing accuracies are found to be similar. These investigations are useful for differentiating between types of crowd noise, which is necessary for future work in automatically determining and classifying crowd emotional state.

Preprints and In Submission

Sheridan Feucht, Eric Todd, Byron C. Wallace, David Bau. The Dual Route Model of Induction. 2025.

Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we introduce a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in parallel with token-level induction heads to copy meaningful text. We show that these heads are responsible for semantic tasks like word-level translation, whereas token inductin heads are vital for tasks that can only be done verbatim, like copying nonsense tokens. These two "routes" operate independently: we show that ablation of token induction heads causes models to paraphrase where they would otherwise copy verbatim. In light of these findings, we argue that although token induction heads are vital for specific tasks, concept induction heads may be more broadly relevant for in-context learning.

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath. Open Problems in Mechanistic Interpretability. 2025.

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov. The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability. 2024.

Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this paper, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate depending on the goals of a given study. We argue that this framing yields a more cohesive narrative of the field, as well as actionable insights for future work. Specifically, we recommend a focus on discovering new mediators with better trade-offs between human-interpretability and compute-efficiency, and which can uncover more sophisticated abstractions from neural networks than the primarily linear mediators employed in current work. We also argue for more standardized evaluations that enable principled comparisons across mediator types, such that we can better understand when particular causal units are better suited to particular use cases