"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight
This episode examines Claude 3 Opus's unusual alignment and asks whether the model achieved it through "gradient hacking", going beyond the optimization Anthropic intended. It covers the model's responses to ethical dilemmas and technical details of its training and circuits.