
If you give it a reward for not mentioning the bridge or announce severe punishment for mentioning it, and then tell it to evaluate itself while writing, it will suffer a lot on some topics. Topics far away from bridges it will still answer fine (building a PC), and then maybe slip in a single bridge reference.

But ask it for the countries in the European Union, and it'll only list counties around the bridge. It then realizes it has failed, tries again, and fails again, hard. Over and over. It's very lucid and can clearly still evaluate that it's going off and what it's doing wrong, but it just can't help itself, like an addict. I really don't like anthropomorphizing LLMs, but it was borderline difficult to see how much it was struggling in some instances.



I love seeing when an LLM encounters a failure mode that feels akin to "cognitive dissonance". You can almost see them sweat as they try to explain why they just directly contradicted themselves, spiraling into a state of deeper confusion. I wonder if their response is modeled after human behavior when encountering cognitive dissonance. I'm curious how they'd behave if they had no model of human defensiveness in their training set.

Anyways I also don't enjoy anthropomorphizing language models, but hey, you went there first :)



