[Linkpost] "Interpreting Language Model Parameters" by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey
This episode introduces a new parameter decomposition method, adVersarial Parameter Decomposition (VPD), and applies it to a small language model. VPD improves on earlier parameter decomposition techniques and can decompose attention layers. The episode also discusses attribution graphs and…