If you give it a reward for not mentioning the bridge or announce severe punishm...

educaysean · on May 23, 2024

I love seeing when an LLM encounters a failure mode that feels akin to "cognitive dissonance". You can almost see them sweat as they try to explain why they just directly contradicted themselves, spiraling into a state of deeper confusion. I wonder if their response is modeled after human behavior when encountering cognitive dissonance. I'm curious how they'd behave if they had no model of human defensiveness in their training set.

Anyways, I also don't enjoy anthropomorphizing language models, but hey, you went there first :)