LostMyLogin · 2026-06-04T18:01:32 1780596092

> Arizona Public Service, the state’s largest utility, sits at the center of the firestorm. APS is proposing a 45% electricity-rate increase for “extra-large energy users,” primarily data centers, and a roughly 14.5% increase for residential customers.

LostMyLogin · 2026-06-02T02:36:43 1780367803

I imagine the vast majority don’t care. All they care about is trying to hit their 401k or Roth IRA contributions for the month.

LostMyLogin · 2026-06-02T02:18:16 1780366696

Their chips are quality and the hardware itself is still some of the best. Which is what I believe they were insinuating.

LostMyLogin · 2026-06-01T14:53:00 1780325580

Gift link: https://www.wsj.com/tech/ai/openai-sued-by-floridas-attorney...

LostMyLogin · 2026-05-01T19:49:02 1777664942

My wife works with farmers professionally as part of a conservation district and just responded "THIS PERSON KNOWS FARMING" when I showed her the discussion. I genuinely have no idea what you guys are talking about but she immediately got heated.

Based in Colorado.

LostMyLogin · 2026-05-01T04:23:11 1777609391

> First, the AI makes premature conclusions about subsystems, records them, and then struggles to figure out when to abandon its earlier guesses.

This perfectly describes what feels off to me about Opus 4.7 (unsure if that’s what you are using). It seems to go down an incorrect path, I correct it, but it still references things from it. Trying to direct it back becomes a mess.

Has anyone experienced this as well or am I going crazy? Doesn’t happen with 4.6 for me.

appplication · 2026-05-01T04:44:10 1777610650

I’ve had not great experiences with Opus 4.7. It seems the model is overall less effective at reliably delivering via my standard way of interacting, and it also forgets things it was asked several times, even just minutes after being reminded again (e.g. “don’t in-line imports”).

I’m probably holding it wrong, but I think that given a sufficiently advanced AI, it would essentially be impossible to use it incorrectly. Feels like a step backwards in this regard.

IcyWindows · 2026-05-02T00:49:30 1777682970

I see it in 4.6

LostMyLogin · 2026-04-21T21:30:48 1776807048

Because it still requires the desire to do it.

LostMyLogin · 2026-04-19T18:40:14 1776624014

Spend time thinking through your daily work routine and create a list of toil items. Try to describe the toil and steps you usually take when tackling it and write them out in a markdown file. Tell your model of choice where to find the file and start engaging in a two way conversation with it.

Go from there.

jdw64 · 2026-04-20T15:26:40 1776698800

thanks!

LostMyLogin · 2026-04-19T16:46:29 1776617189

My team is currently facing this issue. We had large layoffs that cut us down to a very small size while simultaneously having new initiatives pushed our way that require speed. Everyone is afraid to ask what feel like basic questions (again, layoffs), so everything is hidden in DMs. Add on top of it the push (read: requirement) from higher-ups to use AI and it's simply in a terrible state.

What seem like great initiatives are being watered down because nobody can keep up, debugging issues takes so much longer because everything is changing at once, and everyone is exhausted and hardly talking to each other which feeds into a cycle of having no idea what is happening.

jallmann · 2026-04-19T18:00:19 1776621619

This describes my team to a T ... are we working at the same place?!?

We actually talk more now which helps, but it is still hard to keep up when everyone is barreling ahead doing their own thing. In addition to more talking, there needs to be a semblance of strategy that everyone is aligned on and understands their role in.

A high-agency, high-functioning team has always been a superpower, but mastering this capability is what will make or break organizations that are trying to run lean with AI. It's a "people problem" at its core, and no amount of technology can fix that.

rogerrogerr · 2026-04-19T18:51:15 1776624675

A couple times a week my freaking VP is announcing some new tool he vibecoded and talked to no one about.

I’m sure they’re all riddled with security issues, but am I gonna go be the one pointing it out? Heck no.

dragochat · 2026-04-19T20:29:34 1776630574

we love to say things like these, but... most security issues are in fact BYPASSABLE - virtualization, firewalls, autorollbacks, ro-filesystems and so on are among the many tools we have on our belts

decades of WordPress have taught us that insecure apps can 100% be securely deployed

it's a bit of an art, most recently educated devops/sre ppl suck at it, but it's doable

...aeons ago in a former life we ran production apps that got hacked weekly, and nobody batted an eye at it, backup servers recreated from secure ro-images were spun up with last-clean-app version, occasionally we had fun disassembling whatever reverse shells and other malware got beached on our systems (but couldn't "swim" bc everything we ran was "too exotic" for them to figure out the next steps of a proper attack), development and business continued as usual with zero interruptions etc

gamerslexus · 2026-04-19T21:35:56 1776634556

If you go against every principle (defense in depth, security through obscurity), maybe you should ask yourself "am I willing to be on the record saying this when my company gets hacked?"

There can be multiple reasons a system crumbles; do you want to be behind one of them... intentionally?

dragochat · 2026-04-20T07:23:42 1776669822

100%. I'm willing to prioritize what matters at the right time. if "inner-system security" is not the right priority, and security can be attained at the "outer-system level" better, we should have the balz to say it. fuckitol

gamerslexus · 2026-04-20T10:30:18 1776681018

Imagine if your doctor said "we don't really need to do this if some other guy or nurse does the job right, so fuck it".

In other critical professions you don't want to screw up because when you lose your license you're legally unemployable. Maybe it's time to require a license to be a programmer. We used to have a strong culture but those days are gone and stakes are higher. Putting people at risk because you think VC can vibe code an insecure app and then it's everybody else's responsibility to ship it securely?

dragochat · 2026-04-21T06:24:27 1776752667

you got everything I said wrong: I'm familiar with security and infrastructure best practice and I'm confident I/we can securely deploy almost any vibe-coded crap someone can throw at us - we understand security, we understand defense-in-depth, we understand the subtle trade offs of why security by obscurity is usually a bad idea (and when it does help) etc.

sure, if the vibe-coded sloptopus does bank transfers and stuff, properly carving these pieces out of it might require actual engineering work before containerizing it - but if someone is willing to pay for it, it can be done

some "toy" example: take a crappy app that stores llm keys in config files that the llm agents themselves can edit - after isolating it, put an llm proxy in front of it and have those keys be short-lived proxy keys with aggressive rate limits and monitoring etc etc
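
A minimal Python sketch of the short-lived proxy key idea from the toy example above. Everything here is hypothetical (the `ProxyKeyIssuer` class, the TTL and call-limit defaults); a real proxy would also forward the request upstream with the real key and log every call for monitoring:

```python
import secrets
import time


class ProxyKeyIssuer:
    """Hands out short-lived proxy keys; the real upstream key never leaves the proxy."""

    def __init__(self, ttl_seconds=300, max_calls=20, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # how long a proxy key stays valid
        self.max_calls = max_calls      # hard cap on calls per proxy key
        self.clock = clock              # injectable so expiry can be tested
        self._keys = {}                 # proxy key -> [expiry time, calls used]

    def issue(self):
        """Create a fresh proxy key that the untrusted app is allowed to see."""
        key = secrets.token_urlsafe(16)
        self._keys[key] = [self.clock() + self.ttl_seconds, 0]
        return key

    def authorize(self, key):
        """Return True if this call may be forwarded upstream."""
        entry = self._keys.get(key)
        if entry is None:
            return False                # unknown or already revoked key
        expiry, used = entry
        if self.clock() >= expiry:
            del self._keys[key]         # expired: forget the key entirely
            return False
        if used >= self.max_calls:
            return False                # call budget exhausted
        entry[1] = used + 1
        return True
```

If the app leaks or rewrites its config, the worst case is a key that dies in minutes and can only make a handful of calls.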

isolation, injecting proper monitoring into code of apps, putting proxies between app and apis, and layers between app and infra it runs on or touches etc

and these things now can be mostly cookbook-ified / automated 90% of the way too

as long as you can chop things into little pieces and ensure short-lived and granular access to valuable data you can 100% run totally unsecure and buggy code reliably and get value from it

it's engineering and understanding security from first principles [and a culture around it - that _is_ the HARD af bit though...] instead of just believing in "secure app best practices" from the "holy scriptures" - secure apps are hackable, and unsecure apps can be unhackable, heck even mil systems run on unpatched old software everywhere, they're just properly insulated, the components are insecure but the system as a whole can be perfectly secure

gamerslexus · 2026-04-21T18:58:59 1776797939

If you believe in unhackable, maybe you're not familiar with security enough...

dragochat · 2026-04-24T13:48:06 1777038486

ffs sake, u get the point... "under threat models x, z & q that are considered for scenarios ..."

anything deployed is hackable ofc, question is just the profit/risk ratio a business tolerates/prefers, and what backup plans exist to "reboot" after fatal incidents

nothing's perfect in the real world but most things are survivable

reducing all risk is the same as reducing all opportunity for profit - and in a much truer sense than it seems ...as you also reduce adversary's risk to profit from you, so essentially pursuing too low risk you head towards negative sum (as security has costs) games that on average we all lose from playing

andriy_koval · 2026-04-19T18:24:55 1776623095

this is usually because of a lack of accountability at the executive level. The salary should be low and the bonus tied to metrics 1-2-3 years from now; then they will be more careful and pragmatic about breaking things.

LostMyLogin · 2026-04-16T17:16:43 1776359803

Input: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Output: Walk. It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

---

I asked it to figure out why it made the mistake:

"Physical/spatial common sense. Exactly what just happened — I pattern-matched "50 meters, walk vs drive" to a pedestrian trip and skipped the part where the car is the thing being transported. Similar traps: "I need to move this couch upstairs, should I take the elevator or stairs?" (the couch has to go with you), or anything where the object in the task needs to accompany you.

Counting and letter-level operations. How many R's in "strawberry," counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone.

Arithmetic at scale. Small sums are fine; multi-digit multiplication, long division, or anything requiring careful carrying is where I'll confidently produce a wrong answer. I should reach for a calculator and often don't.

Recent events and live data. My training has a cutoff, and even after that I can be wrong about who currently holds a role, what a stock is worth, or whether a restaurant is still open. I should search more often than I do.

Negation and constraint-stacking. "Name a movie that's not a sequel, not from the 2010s, not American, and has no dialogue" — the more filters, the more likely I quietly violate one.

Your specific situation. I'll give generic advice when the right answer depends on details I didn't ask about. Sometimes I should stop and ask; often I don't.

Confidently wrong on obscure specifics. Quoting a passage from a real book, citing a real paper's exact title, naming the B-side of a 1974 single — I'll often generate something plausible rather than admit I don't know.

Taking a premise too literally. The car wash answer is a version of this: I answered the question as posed instead of stepping back to ask what you were actually trying to do.

The meta-gotcha: I'm often most wrong when I sound most confident. If an answer comes out fast and tidy on a question that should be messy, that's a signal to push back."

sdeframond · 2026-04-16T21:15:15 1776374115

Funny, just tried a few runs of the car wash prompt with Sonnet 4.6. It significantly improved after I put this into my personal preferences:

"- prioritize objective facts and critical analysis over validation or encouragement - you are not a friend, but a neutral information-processing machine. - make reserch and ask questions when relevant, do not jump strait to giving an answer."

andai · 2026-04-16T22:16:32 1776377792

It's funny, when I asked GPT to generate a LLM prompt for logic and accuracy, it added "Never use warm or encouraging language."

I thought that was odd, but later it made sense to me -- most of human communication is walking on eggshells around people's egos, and that's strongly encoded in the training data (and even more in the RLHF).

bawana · 2026-04-17T12:47:00 1776430020

I am an American born to Greek parents. For ‘normal’ conversation, I have adopted two ways of interacting: the Greek one is direct and has instant access to emotional reactions. The American one obfuscates emotions, as if daily interactions were a game of poker. When I let my ‘Greek’ out here in the US, it initially adds life to any interaction, but over time the other participants distance themselves from connection. It is as if Greeks (many Europeans?) run at a higher temperature (also using temperature as it applies to LLMs). In Greece, intent and meaning are more often conveyed by emotion and its intensity, often only loosely connected to the meaning of the words used. In daily conversation, Americans rely entirely on the meaning of content, subtracting almost all emotion unless threatening behavior or violence is involved. Emotional expression is used as a ‘tell’ or bait in the US. Interestingly, this distinction has dissolved over the past two decades as Greece has ‘westernized’, and youth in particular are indistinguishable by any metric.

andai · 2026-04-17T16:23:01 1776442981

That's very interesting. I don't really understand what you're saying though, can you give some examples?

stavros · 2026-04-16T23:02:27 1776380547

> most of human communication is walking on eggshells

That's not human communication, that's Anglosphere communication. Other cultures are much more direct and are finding it very hard to work with Anglos (we come across as rude, they come across as not saying things they should be saying).

eloisant · 2026-04-17T08:09:12 1776413352

Depends on the culture as you said, but some of them are even less direct than English speaking countries. Japan for example.

afro88 · 2026-04-17T12:29:07 1776428947

And India. It's a common experience that engineering teams from India will say yes to everything and then do what they think is best, rather than saying no and explaining what they want to do instead.

vardalab · 2026-04-16T23:21:33 1776381693

What culture are those? Scandinavian? Those often just say nothing.

projektfu · 2026-04-17T11:25:36 1776425136

After having worked with people from former Eastern Bloc countries, I would nominate a few of them for direct communication, e.g., "I won't do that because it is a stupid idea," or, "Can we discuss this when you know what you're doing?"

strokirk · 2026-04-16T23:46:44 1776383204

Scandinavians are quite different from each other as well.

jmpavlec · 2026-04-17T03:55:54 1776398154

The Dutch especially. It's refreshing

stavros · 2026-04-16T23:35:07 1776382507

I'm Greek. I don't know about other Mediterranean cultures, but I assume they're similar.

suslik · 2026-04-17T15:01:24 1776438084

Scandinavian cultures are not uniform either. Danes can be very direct; Swedes - not so much.

notsydonia · 2026-04-17T16:32:28 1776443548

I love this. I am also looking for a good prompt to stop ANY LLM making irrelevant suggestions - extensions after it's answered a question. Eg; "Would you like me to create a timeline of ....?" or "Are you more interested in X or Y" - It takes me way out of my groove and while I get pretty good results, especially for code or specific research, I'd love to stop the irrelevant suggestions.

darshan · 2026-04-18T02:57:40 1776481060

Have you tried and failed, or you're just worried it might be hard? When I first set up a client for API calls, I put this paragraph in my system prompt:

> Never ask questions or attempt to keep the conversation going -- answer the questions directly asked, and give additional information where it is likely to be helpful, but don't offer to do more things for the user.

I've never had an LLM offer to do things or try to keep the conversation going with this in my prompt.

idle_zealot · 2026-04-16T22:36:52 1776379012

Do you think the typos are helping or hurting output quality?

sdeframond · 2026-04-17T07:22:44 1776410564

No idea, but I'll fix them just in case ^^'

mkl · 2026-04-17T01:10:59 1776388259

That should be "research" and "straight" in the last sentence. Maybe that will improve it further?

sdeframond · 2026-04-17T07:23:36 1776410616

devmor · 2026-04-17T04:05:23 1776398723

“Be critical, not sycophantic” is a general improvement for the majority of tasks where you want to derive logic in my experience.

rubinlinux · 2026-04-16T18:00:00 1776362400

  | I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

  ● Drive. The car needs to be at the car wash.

Wonder if this is just randomness because it's an LLM, or if you have different settings than me?

shaneoh · 2026-04-16T18:08:02 1776362882

My settings are pretty standard:

% claude Claude Code v2.1.111 Opus 4.7 (1M context) with xhigh effort · Claude Max ~/... Welcome to Opus 4.7 xhigh! · /effort to tune speed vs. intelligence

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. 50 meters is shorter than most parking lots — you'd spend more time starting the car and parking than walking there. Plus, driving to a car wash you're about to use defeats the purpose if traffic or weather dirties it en route.

reddit_clone · 2026-04-16T18:45:34 1776365134

To me Claude Opus 4.6 seems even more confused.

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.

lr1970 · 2026-04-16T20:45:46 1776372346

Just asked Claude Code with Opus-4.6. The answer was short "Drive. You need a car at the car wash".

No surprises, works as expected.

onemoresoop · 2026-04-17T01:46:21 1776390381

Yeah, it was probably patched. It could reason about novel problems only if you ask it to pay attention to some particular detail, a.k.a. handholding.

Same would happen with the sheep and the wolf and the cabbage puzzle. If you formulated it similarly (there is a wolf and a cabbage, without mentioning the sheep), it would summon the sheep into existence at a random step. It was patched shortly after.

jameshart · 2026-04-17T02:52:07 1776394327

I’m not sure ‘patched’ is the right word here. Are you suggesting they edited the LLM weights to fix cabbage transportation and car wash question answering?

gf000 · 2026-04-17T05:45:46 1776404746

Absolutely not my area of expertise but giving it a few examples of what should be the expected answer in a fine-tuning step seems like a reasonable thing and I would expect it would "fix" it as in less likely to fall into the trap.

At the same time, I wouldn't be surprised if some of these would be "patched" via simply prompt rewrite, e.g. for the strawberry one they might just recognize the question and add some clarifying sentence to your prompt (or the system prompt) before letting it go to the inference step?

But I'm just thinking out loud, don't take it too seriously.

onemoresoop · 2026-04-18T22:08:59 1776550139

Used patched for lack of a better word. Not sure how they fix the edge cases for these types of fixes/patches or whatever they’re specifically called

TheLNL · 2026-04-17T05:13:29 1776402809

They might have further trained the model with these edgecases in the dataset

onemoresoop · 2026-04-18T19:06:34 1776539194

Whatever it was, that’s not real thinking. We can’t possibly patch all knowledge, and even if we did, it would become crystallized somehow.

lexarflash8g · 2026-04-17T06:39:50 1776407990

What if it’s raining though? Car wash wouldn’t be open though it would waste gas

lambda · 2026-04-16T18:35:12 1776364512

There is a certain amount of it which is the randomness of an LLM. You really want to ask most questions like this several times.

That said, I have several local models I run on my laptop that I've asked this question to 10-20 times while testing out different parameters that have answered this consistently correctly.
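
A rough Python sketch of what "ask it several times" can look like in practice. The `ask` callable is a stand-in for whatever client you use (hypothetical, not a real API), and keying on the first word is an assumption that only suits walk/drive style answers:

```python
from collections import Counter


def first_word(reply):
    """Normalize a reply to its first word, e.g. 'Walk. It is...' -> 'walk'."""
    words = reply.strip().split()
    return words[0].strip(".,!?:;").lower() if words else ""


def tally_answers(ask, prompt, runs=10):
    """Send the same prompt `runs` times and count the leading word of each reply.

    `ask` is any function that takes a prompt string and returns the reply text.
    """
    return Counter(first_word(ask(prompt)) for _ in range(runs))
```

`tally_answers(ask, prompt).most_common()` then shows how often each answer came up, which says more than any single run.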

holoduke · 2026-04-17T16:35:46 1776443746

--------> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — driving there and back takes longer than walking, and you'd just be dirtying the car you're about to clean.

---------> haha but how can i wash my car then?

Walk to the car wash first to check what they offer — many have self-service bays where you bring your car, or full-service where they wash it for you. If it's self-service, then drive over with the dirty car. If full-service, drive over and they handle it.

Either way, the car needs to end up at the wash — the walk was just to scout it out. Or just drive straight there; 50 meters of extra dirt won't matter.

----------> you, you sound like a politician. Are you sure your first answer was really reasoned like your second answer?

You're right, my first answer was nonsense. I pattern-matched on "50 meters = walk" without thinking about the actual task. You can't wash a car without the car being there. Just drive.

kalcode · 2026-04-16T19:03:54 1776366234

I've tried these with Claude various times and never get the wrong answer. I don't know why, but I am leaning toward them having stuff like "memory" turned on and possibly reusing sessions for everything? Only thing I think explains it to me.

If you're always messing with the AI it might be making memories and expectations are being set. Or it's the randomness. But I turned memories off, I don't like cross chats infecting my conversation's context, and at worst it suggested "walk over and see if it is busy, then grab the car when line isn't busy".

jorvi · 2026-04-16T19:26:27 1776367587

Even Gemini with no memory does hilarious things. Like, if you ask it how heavy the average man is, you usually get the right answer but occasionally you get a table that says:

- 20-29: 190 pounds

- 30-39: 375 pounds

- 40-49: 750 pounds

- 50-59: 4900 pounds

Yet somehow people believe LLMs are on the cusp of replacing mathematicians, traders, lawyers and what not. At least for code you can write tests, but even then, how are you gonna trust something that can casually make such obvious mistakes?

drnick1 · 2026-04-17T01:33:57 1776389637

> how are you gonna trust something that can casually make such obvious mistakes?

In many cases, a human can review the content generated, and still save a huge amount of time. LLMs are incredibly good at generating contracts, random business emails, and doing pointless homework for students.

gf000 · 2026-04-17T05:49:31 1776404971

And humans are incredibly bad at "skimming through this long text to check for errors", so this is not a happy pairing.

As for the homework, there is obviously a huge category that is pointless. But it should not be that way, and the fundamental idea behind homework is sound and the only way something can be properly learnt is by doing exercises and thinking through it yourself.

nickjj · 2026-04-16T20:56:15 1776372975

Yeah, ChatGPT's paid version is wildly inaccurate on very important and very basic things. I never got onboard with AI to begin with but nowadays I don't even load it unless I'm really stuck on something programming related.

dyauspitr · 2026-04-16T19:29:05 1776367745

So what? That might happen one out of 100 times. Even if it’s 1 in 10 who cares? Math is verifiable. You’ve just saved yourself weeks or months of work.

icedchai · 2026-04-16T20:30:16 1776371416

You don't think these errors compound? Generated code has 100's of little decisions. Yes, it "usually" works.

russfink · 2026-04-16T23:28:54 1776382134

LLM’s: sometimes wrong but never in doubt.

dyauspitr · 2026-04-16T20:37:08 1776371828

Not in my experience. With a proper TDD framework it does better than most programmers at a company who anecdotally have a bug every 2-3 tasks.

tranceylc · 2026-04-17T00:10:39 1776384639

The kind of mistakes it makes are usually strange and inhuman though. Like getting hard parts correct while also getting something fundamental about the same problem wrong. And not in the “easy to miss or type wrong” way.

I wish I had an example for you saved, but happens to me pretty frequently. Not only that but it also usually does testing incorrectly at a fundamental level, or builds tests around incorrect assumptions.

icedchai · 2026-04-17T14:03:33 1776434613

I've seen LLMs implement "creative" workarounds. Example: Sonnet 4.5 couldn't figure out how to authenticate a web socket request using whatever framework I was experimenting with, so it decided to just not bother. Instead, it passed the username as part of the web socket request and blindly trusted that user was actually authenticated.

The application looked like it worked. Tests did pass. But if you did a cursory examination of the code, it was all smoke and mirrors.

svachalek · 2026-04-17T18:18:44 1776449924

Yeah recently it had an issue getting OIDC working and decided to implement its own, throwing in a few thousand extra lines. I'm sure there were no security holes created in there at all. /s

icedchai · 2026-04-17T20:10:26 1776456626

Well, the tests passed, right?

bratwurst3000 · 2026-04-17T14:56:58 1776437818

yes i wish i had saved some of my best examples too. One i had was super weird in chatgpt pro. It told me that after 30 years my interest would become negative and i would start losing money. Didn't want to accept the error.

FeepingCreature · 2026-04-17T09:07:34 1776416854

Errors compounding is a meme. In iterated as well as verifiable domains, errors dilute instead of compounding because the llm has repeated chances to notice its failure.
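
Both sides of this subthread can be put in numbers. A back-of-the-envelope Python sketch, assuming every decision is independent, that each review pass catches a given error with a fixed probability, and that a caught error gets fixed correctly (all of these are simplifying assumptions, and the figures are made up):

```python
def p_all_correct(p_step, n_steps):
    """Chance that every one of n independent steps is right, with no review."""
    return p_step ** n_steps


def p_all_correct_with_review(p_step, n_steps, p_catch, passes):
    """Same, but each wrong step gets `passes` independent chances to be caught and fixed."""
    p_error_survives = (1 - p_step) * (1 - p_catch) ** passes
    return (1 - p_error_survives) ** n_steps


if __name__ == "__main__":
    # 100 decisions, each 99% likely to be right, never reviewed.
    print(round(p_all_correct(0.99, 100), 3))
    # Same decisions, but each error gets 3 review passes that each catch it half the time.
    print(round(p_all_correct_with_review(0.99, 100, 0.5, 3), 3))
```

With those numbers the unreviewed run comes out fully correct about 37% of the time and the reviewed run about 88% of the time, so which camp is right depends almost entirely on how good the checking is.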

coldtea · 2026-04-17T00:49:31 1776386971

Yes, just use random results. You’ve just saved yourself weeks or months of work of gathering actual results.

heurist · 2026-04-16T19:28:31 1776367711

Claude Opus 4.7 responds with walk for me with and without adaptive thinking, but neither the basic model used when you Google search nor GPT 5.4 does.

russfink · 2026-04-16T23:26:37 1776381997

Or, the first time a mistake is detected, a correction is automatically applied.

TeMPOraL · 2026-04-16T18:10:53 1776363053

Idk but ironically, I had to re-read the first part of GP's comment three times, wondering WTF they're implying a mistake, before I noticed it's the car wash, not the car, that's 50 meters away.

I'd say it's a very human mistake to make.

magicalist · 2026-04-16T19:08:54 1776366534

> I'd say it's a very human mistake to make.

>> It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

It talks about starting, driving, and parking the car, clearly reasoning about traveling that distance in the car not to the car. It did not make the same mistake you did.

toraway · 2026-04-16T21:43:21 1776375801

We truly do not need to lower the bar to the floor whenever an LLM makes an embarrassing logical error, particularly when the excuses don't line up at all with the reasoning in its explanation.

thfuran · 2026-04-16T18:24:09 1776363849

I don't want my computer to make human mistakes.

AgentOrange1234 · 2026-04-16T19:29:45 1776367785

It may be inescapable for problems where we need to interpret human language?

jasonfarnon · 2026-04-16T23:18:39 1776381519

then throw away the turing test

scrollaway · 2026-04-16T18:37:03 1776364623

then don't train it on human data

59nadir · 2026-04-17T00:18:09 1776385089

LLMs do not have trouble reading, it didn't make the mistake you made and it wouldn't. You missed a word, LLMs cannot miss words. It's not even remotely a human mistake.

galaxyLogic · 2026-04-17T01:30:33 1776389433

> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

I think no real human would ask such a question. Or if we do we maybe mean should I drive some other car than the one that is already at the car-wash?

A human would answer, "silly question ". But a human would not ask such a question.

psadauskas · 2026-04-17T02:44:54 1776393894

A human totally would, as one of those brain-teaser trick questions. It's the same kind of question as "A plane crashes right on the border between the US and Canada. Where do they bury the survivors?" It's the kind of question you only get right if you pay close attention. Asking an AI that is like asking a 5 year old. You're not asking to get an answer, you're asking to see if they're paying attention.

jameshart · 2026-04-17T02:50:34 1776394234

I was given to understand that attention is all you need.

layer8 · 2026-04-17T11:12:30 1776424350

That’s why we’re testing for it.

ahartmetz · 2026-04-17T08:47:53 1776415673

That a human would not ask such a question means it's not in the training set, so it shows how bad an LLM can be at thinking from first principles. Which, I think, is the point of such silly questions.

vintermann · 2026-04-16T17:28:51 1776360531

Well, at least we know that's one gotcha/benchmark they aren't gaming.

krzat · 2026-04-17T08:58:41 1776416321

Humans tend to confabulate when asked "why you did X", funny how LLMs are pretty much the same.

fragmede · 2026-04-16T18:46:46 1776365206

I tried o3, instant-5.3, Opus 3, and haiku 4.5, and couldn't get them to give bad answers to the couch: stairs vs elevator question. Is there a specific wording you used?

toraway · 2026-04-16T21:53:20 1776376400

That's an example the LLM came up with itself while analyzing its failed car wash walk/drive answer, it's not OP's question.

scotty79 · 2026-04-17T08:35:13 1776414913

What would be a bad answer to stairs/elevator question?

Filligree · 2026-04-17T12:05:14 1776427514

You can’t get the couch into the elevator, typically. Trust me, I tried.

Couch depending. I will persist in trying every time this comes up.

gambiting · 2026-04-17T12:48:53 1776430133

Well if it's one of those hospital elevators that can take a bed with a patient, you probably could. Or if it's a small 2 seater sofa. The question isn't as dumb as it sounds at first, and a human would definitely ask a follow up question.

BenjiWiebe · 2026-04-17T17:45:57 1776447957

You can take a mattress up an elevator though (1). Some couches might fit in some elevators.

1: source: me...

smooc · 2026-04-16T18:07:59 1776362879

I'd say the joke is on you ;-)

slekker · 2026-04-16T17:48:06 1776361686

What about Qwen? Does it get that right?

lambda · 2026-04-16T17:59:06 1776362346

I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM).

And I've been using this commonly as a test when changing various parameters, so I've run it several times, these models get it consistently right. Amazing that Opus 4.7 whiffs it, these models are a couple of orders of magnitude smaller, at least if the rumors of the size of Opus are true.

qingcharles · 2026-04-16T18:40:21 1776364821

Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?

lambda · 2026-04-16T19:55:41 1776369341

I'm running an 8 bit quant right now, mostly for speed as memory bandwidth is the limiting factor and 8 bit quants generally lose very little compared to the full res, but also to save RAM.

I'm still working on tweaking the settings; I'm hitting OOM fairly often right now, it turns out that the sliding window attention context is huge and llama.cpp wants to keep lots of context snapshots.

qingcharles · 2026-04-16T20:04:50 1776369890

I had a whole bunch of trouble getting Gemma 4 working properly. Mostly because there aren't many people running it yet, so there aren't many docs on how to set it up correctly.

It is a fantastic model when it works, though! Good luck :)

HarHarVeryFunny · 2026-04-17T13:32:28 1776432748

This "figuring out" is just going to come from stuff it was trained on - people discussing why LLMs fail at certain things, and those people (training samples) not always being correct about it!

The "How many R's in "strawberry," counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone" explanation sounds plausible, but I don't think it is correct.

Any model I've ever tried that failed on things like "R's in strawberry" was quite capable of reliably returning the letter sequence of the word, so the mapping of tokens back to letters is not the issue, as should also be obvious from the ability of models to do things like mapping between ASCII and Base64 (6 bits per Base64 character => 4 characters encode 3 bytes). This is just sequence to sequence prediction, which is something LLMs excel at - their core competency!
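
For reference, the Base64 arithmetic is easy to check with Python's standard library: each Base64 character carries 6 bits, so 3 bytes (24 bits) map to 4 characters.

```python
import base64

raw = b"Man"                      # 3 bytes = 24 bits
encoded = base64.b64encode(raw)   # 24 bits / 6 bits per character = 4 characters
print(encoded.decode("ascii"))    # TWFu
print(base64.b64decode(encoded))  # b'Man'
```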

I think the actual reason for failures at these types of counting and reversing tasks is twofold:

1) These algorithmic type tasks require a step-by-step decomposition and variable amount of compute, so are not amenable to direct response from an LLM (fixed ~100 layers of compute). Asking it to plan and complete the task in step-by-step fashion (where for example it can now take advantage of its ability to generate the letter sequence before reversing it, or counting it) is going to be much more successful. A thinking model may do this automatically without needing to be told to do it.

2) These types of task, requiring accurate reference and sequencing through positions in its context, are just not natural tasks for an LLM, and it is probably not doing them (without specific prompting) in the way you imagine. Say you are asking it to reverse the letter sequence of a 10 letter word, and it has somehow managed to generate letter #10, the last letter of the word, and now needs to copy letter #9 to the output. It will presumably have learnt that 10-1 is 9, but how to use that to access the appropriate position in context (or worse yet if you didn't ask it to go step by step and first generate the letter sequence, so the sequence doesn't even exist in context!)? The letter sequence may have quotes and/or commas or spaces in it, and altogether starts at a given offset in the context, so it's far more difficult than just copying token at context position #9! It's probably not even actually using context positions to do this, at least not in this way. You can make tasks like this much easier for the model by telling it exactly how to perform it, generating step-by-step intermediate outputs to track its progress etc.
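
To make the step-by-step decomposition concrete, here is a small Python sketch of the procedure you would be asking the model to write out: spell the word first, then count or reverse while recording intermediate state. It only illustrates the structure of the steps, it says nothing about what an LLM does internally, and the function names are made up:

```python
def spell(word):
    """Step 1: write the word out as an explicit letter sequence."""
    return list(word)


def count_letter(word, letter):
    """Step 2a: walk the letters, keeping a running tally as intermediate output."""
    tally = 0
    trace = []  # (position, letter, tally so far) for every step
    for position, ch in enumerate(spell(word), start=1):
        if ch.lower() == letter.lower():
            tally += 1
        trace.append((position, ch, tally))
    return tally, trace


def reverse_word(word):
    """Step 2b: copy letters from the last position down to the first."""
    letters = spell(word)
    out = []
    for position in range(len(letters), 0, -1):
        out.append(letters[position - 1])
    return "".join(out)


if __name__ == "__main__":
    print(count_letter("strawberry", "r")[0])  # 3
    print(reverse_word("strawberry"))          # yrrebwarts
```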

BTW, note that the model itself has no knowledge of, or insight into, the tokenization scheme that is being used with it, other than what is available on the web, or that it might have been trained to know. In fact, if you ask a strong model how it could even in theory figure out (by experimentation) its own tokenization scheme, it will realize this is next to impossible. The best hope might be some sort of statistical analysis of its own output, hoping to take advantage of the fact that it is generating sub-word token probabilities, not word probabilities. Sonnet 4.6's conclusion was "Without logprob access, the model almost certainly cannot recover its exact tokenization scheme through introspection or behavioral self-probing alone".