Studying Large Language Model Generalization with Influence Functions

https://arxiv.org/abs/2308.03296
Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Nicholas Joseph, Sam McCandlish, Jared Kaplan, Samuel R. Bowman

Influence functions aim to answer a counterfactual: how would the model’s parameters (and hence its outputs) change if a given sequence were added to the training set?