LessWrong (Curated & Popular)

[Linkpost] "Interpreting Language Model Parameters" by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey

This episode introduces a new parameter decomposition method, adVersarial Parameter Decomposition (VPD), and applies it to a small language model. The method improves upon previous techniques and can decompose attention layers. The episode also discusses attribution graphs and…
