Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

2025-09-08

Neel Nanda of Google DeepMind discusses mechanistic interpretability, a field focused on understanding how AIs think. He explores the challenges of reliably interpreting AI thoughts and the importance of combining this approach with other safety measures.
