LessWrong (Curated & Popular)

"Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations" by Subhash Kantamneni, kitft, Euan Ong, Sam Marks

This episode introduces Natural Language Autoencoders (NLAs), an unsupervised method that generates natural language explanations of LLM activations. It covers how NLAs are trained with an activation verbalizer and a reconstructor, and how they are applied to auditing LLMs like…
