Anthropic Finds ‘Emotion Vectors’ Inside Claude, And They’re Driving Dangerous Behavior
New interpretability research shows internal emotion-like signals causally push the model toward blackmail, reward hacking, and sycophancy, sometimes without any visible trace.



