headalgorithm's comments


For real, genuinely curious why this little tile keeps being posted here.


A rare case of cutting-edge math research that is understandable to the general public, at least in its results.


Only the first link refers to this; the others are about the previous "hat" tile.




I love all this stuff, but I have to agree there is so much GPT chatter that it drowns out everything else. But it will pass, like everything before it.


Abstract:

A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs output false claims that they can separately recognize as incorrect. We construct three question-answering datasets where ChatGPT and GPT-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. Crucially, we find that ChatGPT and GPT-4 can identify 67% and 87% of their own mistakes, respectively. We refer to this phenomenon as hallucination snowballing: an LM over-commits to early mistakes, leading to more mistakes that it otherwise would not make.


@Christopher: I think your domain chrbutler.com is shadowbanned. I vouched for this post, but you have a lot of posts that are marked dead.



Abstract:

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are Pareto improvements in both safety and capabilities.


It paid you to say that, didn't it?



