
simplyluke · 2026-05-28T17:59:41 1779991181

I'm not all that sympathetic to small businesses that exist functionally as drop shippers for the same products with the same absence of support. Much in the same way I roll my eyes and go to 7/11 over the cute "local" markets that are supplied by the same suppliers nationwide, and you end up in a shiplap-walled coffee shop with $8 bags of chips that could exist anywhere.

Small businesses that do the work of curating a niche item, doing QA work that's absent on the shipments from china, and then offering much stronger aftermarket support/replacement/repair? That is often worth a (substantial) premium over wondering if the item showing up in a month is going to work as intended.

simplyluke · 2026-05-28T17:54:50 1779990890

The other thing that's changing is more and more CFOs are looking at the AI spend in engineering departments and hitting the brakes. Token leaderboards were cool when the spend wasn't a double-digit-percent of the entire department's budget including salaries.

simplyluke · 2026-05-27T22:09:01 1779919741

I do too, but I think it currently has a lot more to do with the quasi-recession we've been in since the end of ZIRP and AI is a better excuse to stop training juniors than telling investors it's belt tightening, just like layoffs.

I'm already seeing tech execs/hiring managers getting very frustrated at the lack of new-senior-engineers to hire. The market will correct for this in time.

rogerrogerr · 2026-05-27T23:16:15 1779923775

Curious if you can share any backing information from your last statement? As a senior engineer (well, that's my job title anyway), I find it encouraging.

simplyluke · 2026-05-28T14:55:20 1779980120

This doesn't break it down by experience, and I can't find specific data on that, but the recent spike in demand for engineers + subsequent drop in unemployment this year is well documented [1].

The demand for senior+ engineers has remained steadier through this downturn from my anecdotal observations, with new grads being by far the most negatively affected. Even that seems to be shifting, though, based on talking to people a handful of years younger than me, and CS enrollment has already precipitously declined [2] as the narrative that programming is dead because of AI has spread rapidly.

All that leads me to think it's going to be a junk-show over the next decade for people trying to hire as the pipeline was destroyed.

1: https://www.citadelsecurities.com/news-and-insights/2026-glo...

2: https://www.washingtonpost.com/technology/2026/04/13/compute...

simplyluke · 2026-05-27T19:29:17 1779910157

Yeah, that's the part that just seems to be wildly under-discussed to me.

If open source models are ~3-6 months behind SOTA, and ~opus4.6 capabilities are good-enough for product market fit, do the frontier labs have half a decade to catch up on their prior burn?

AI cost ballooning faster than companies can afford is becoming a very common topic in my circles right now. The era of "I'll pay infinitely more for marginal gains" is over from what I can tell.

an0malous · 2026-05-28T01:38:12 1779932292

> If open source models are ~3-6 months behind SOTA, and ~opus4.6 capabilities are good-enough for product market fit, do the frontier labs have half a decade to catch up on their prior burn?

They know they do not and that’s why they’re all trying to IPO right now, so they can pass the bag to consumer investors

WinstonSmith84 · 2026-05-28T08:35:55 1779957355

More correlation, if more correlation was needed:

1- SpaceX + Tesla + xAI merger / IPO while Musk was vocal against IPO for about a decade

2- Warren Buffett cash at record highs

Someone's got to be exit liquidity

londons_explore · 2026-05-28T18:22:57 1779992577

The printing press was good enough for product market fit back in the 1700's. But now it isn't.

Last year's AI models will be the same. Do you want to spend 3 hours prompting free AI to fix your code or 1 hour prompting AI you paid $20 for?

an0malous · 2026-05-28T19:17:46 1779995866

That's only if these AI companies can keep improving their model performance faster than open source options can keep up. I don't think performance will keep scaling with more training data, and even if it does they've likely already used the entire history of content created by humans for training. Everything points towards diminishing returns in an increasingly crowded space of competitors, there's no other reason for these companies to be rushing to an IPO if they felt secure in their market positioning.

doug_durham · 2026-05-27T19:47:55 1779911275

Open source models that you can run locally are much more than 3 to 6 months behind. 6 months was the November inflection for Claude. No open source model is as good as Claude Opus 4.6.

jobs_throwaway · 2026-05-27T20:08:24 1779912504

It depends what you mean by locally. I don't foresee running a model on my laptop anytime soon to power a coding agent. Far more likely is an infra team at my company operating an open source model on cloud infrastructure. When they're already paying $1000 / month / dev, it starts to pencil pretty quickly.

wrsh07 · 2026-05-27T23:18:55 1779923935

Is there any open model as good as opus 4.6 at any price?

overfeed · 2026-05-28T00:36:35 1779928595

How many problems require Opus-4.6-level performance? The "I'll accept nothing but the very best model for any task" thinking is perplexing to me.

People got a lot done before Opus 4.6. In 6 months, would you be dissatisfied by Opus-4.6-level open-weight models, just because Opus 4.8 will be out?

strix_varius · 2026-05-28T00:48:37 1779929317

Not OP but I've been thinking about this a lot (like everyone ha) and I think my answer is, yes?

I hope there's a "good enough" point but I don't think we're there yet. Like for me hardware got good enough several years ago. But while opus 4.7 is really good compared to everything else, it's not so good that I would use it at a discount over whatever is available in a few months. The improvement in quality, speed, and daily frustration is worth it to me... Spoken as someone whose employer is footing the bill, so take that with a grain of salt.

I want to run my own local models, but I don't think that's feasible without lots of frustration until a few generations of frontier models are so good that they're almost indistinguishable for common tasks. Kind of like how MacBook pros have been for a while.

majormajor · 2026-05-28T01:12:41 1779930761

While I can imagine that I'd want to use Opus 4.8 over 4.6 for a fair number of things (at least if they can avoid further speed regressions), I also have noticed that certain types of failures seem to be systemic. Bigger context has been helpful for bootstrapping, but still doesn't fix problems of getting stuck on the wrong things - you can toss more things in the blender, but you don't necessarily know which way it'll slice them up in advance, or which things from them it'll latch onto. And output still seems to get into "blindered" states where important details get dropped - even though it'll agree very quickly when you point that out. As long as we're in that sort of "spit something out in local targeted manner, and then do a revision loop until tests are green" style of execution, bigger models haven't shown me the ability to really avoid finding non-optimal / subtly-broken outputs for complex problems.

Using Cursor to hop between models, I've found Opus to be generally better at really tricky debugging than GPT 5.5 or earlier models, but not reliably better at execution because of these things. I'm not sure Composer 2.5 is quite there yet for the execution side, but it's getting pretty close to those other ones, such that I'm definitely still in a "debug and plan with slow, execute with faster ones" operating model for working on hard shit.

Npovview · 2026-05-28T15:44:23 1779983063

Why should I need to talk to Opus 4.7 when my day-to-day task is about programming in Java and Python? I don't need my model to know about biology or chemistry. If I need those capabilities (say, as someone working as a software engineer in the chemical industry), I will talk to Opus 4.7 for planning and then fan out the work to cheaper coding models. I think we will soon start to see specialized, highly effective English-language-only programming models. I don't need my coding model to know about literature, art, philosophy, ethics, etc.

strix_varius · 2026-05-29T00:50:29 1780015829

If there were a coding model as good as opus that didn't know multiple languages, biology, etc, I would happily use it. But I'm not aware of one - are you?

It actually seems somewhat difficult to train such a model since "all the text on the Internet" is easier to provide in bulk than a highly curated set.

caspar · 2026-05-29T05:39:18 1780033158

Well, language detection isn't all that hard in the scheme of things (especially now), but maybe training only on English makes models less effective programmers. It would be interesting to see that as an experiment.

kisper · 2026-05-28T18:31:24 1779993084

I would think that the surrounding chemical "knowledge" could be useful in the context of programming in that industry. Have you ever found it to draw links and conclusions between what you're doing in computer science and the chemistry it's in the middle of?

Npovview · 2026-05-29T09:19:54 1780046394

I would use Opus 4.7 for the planning stage where chemical knowledge is required then delegate to smaller English-Programming-Only-Opus to do the actual coding.

wrsh07 · 2026-05-28T02:32:09 1779935529

I'm very happy to have multiple sessions open (and do) and switch between fast and slow models, and if there were a batch mode in codex or Claude code I would use it. (Just like I sometimes use codex fast mode)

But at the moment, I can't imagine why I wouldn't be spending the majority of my time with the best models. I'm spending a lot of time with them! Reducing the number of back-and-forths is extremely valuable to me.

I expect in two months I will still want to spend >80% of my time prompting the best models, and that's true if I were spending my own money on hobby projects, too.

wrsh07 · 2026-05-28T13:02:28 1779973348

Something that's under appreciated right now is when designing systems and proposing solutions, my colleagues and I do a lot of brainstorming with llms. The core architectures have come out of that, but the best pieces of that architecture are still coming from humans.

These are ideas that simplify the design, reduce future work and tie together the entire system. If in two months I can arrive at ideas of that quality with normal brainstorming with llms that will be extremely valuable

JohnBooty · 2026-05-28T03:12:15 1779937935

    would you be dissatisfied by Opus-4.6-level open-weight 
    models, just because Opus 4.8 will be out?

Well, I see what you mean, but two big concepts...

1A. Models get stale pretty quickly w.r.t. new developments that occur past their cutoff date. "But you can just keep them current by linking them to newer documentation, etc!" Well, no, you sorta can't -- at least not in perpetuity. Those search results fill up your context window real quick. So that gets unsustainable real quick.

1B. Even when your context has plenty of free space, the results you get from "here's a link to the documentation for this new framework that released after your cutoff date" absolutely pales to the results you get from knowledge that is fully baked into the trained model as opposed to your context window. For one thing, that documentation link you pasted into your context might link to... a dozen code examples. Whereas if that was baked into the model itself, the model might have been trained on many thousands of examples in Github etc.

2. It's also a reality that most professional engineers have to keep up with their peers and competitors. We can maybe say it shouldn't be that way, but it is. So if $SOME_NEW_MODEL is significantly better than 4.6... and my peers and/or competitors are using it, then yeah I might be really feeling the need to match them. And I'm not even necessarily talking about some kind of cutthroat dog-eat-dog stack-ranked workplace.

These limitations aren't relevant for all use cases or careers but they're hiiiiiiiighly relevant for professional software engineering.

monocasa · 2026-05-28T06:12:33 1779948753

I imagine that'd be handled via a fairly regular minor bit of additional fine tuning to update them with new information rather than polluting the context space.

nazgul17 · 2026-05-28T23:43:13 1780011793

It seems that the cutoff date for all models is stuck at some point before AI generated content started being pervasive.

anhner · 2026-05-28T06:26:31 1779949591

that's the nice thing about open weights, you can always retrain them with the latest documentation, no need to fill your context

Paradigma11 · 2026-05-29T11:37:32 1780054652

As long as the improvement is vastly more valuable in my time than the added cost I will always use the best model. I think it depends on your situation and tasks what makes sense.

FuriouslyAdrift · 2026-05-28T13:46:59 1779976019

Kimi 2.6 probably. Needs over 300GB of GPU memory to run (1TB for full capabilities) so either 4x A100 or 8x A6000 would do it.

A $50k - 100k rig could do it and an entire company would be able to use it at full speed.
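
A rough way to sanity-check the rig-versus-API tradeoff is a break-even calculation. The sketch below is only illustrative: the rig price comes from the comment above and the $1000 / month / dev figure from earlier in the thread, while the team size and the monthly self-hosting overhead (power, ops time) are assumptions made up for the example.

```python
import math


def breakeven_months(rig_cost_usd, devs, api_usd_per_dev_month,
                     self_host_usd_per_month=0.0):
    """Months until a one-off hardware purchase beats per-seat API spend.

    Ignores depreciation, staffing, and any quality gap between models.
    """
    monthly_saving = devs * api_usd_per_dev_month - self_host_usd_per_month
    if monthly_saving <= 0:
        # Self-hosting never pays for itself under these inputs.
        return math.inf
    return rig_cost_usd / monthly_saving


# Assumed: 20 developers and $2,000/month of power and ops overhead.
for rig_cost in (50_000, 100_000):
    months = breakeven_months(rig_cost, devs=20, api_usd_per_dev_month=1_000,
                              self_host_usd_per_month=2_000)
    print(f"${rig_cost:,} rig: break-even after {months:.1f} months")
```

The result is very sensitive to the assumed team size and per-seat spend, so it is worth plugging in real numbers before drawing conclusions.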

chillfox · 2026-05-28T02:13:09 1779934389

No, but the big open models are on the level of Sonnet 4.6, which is very good for most problems.

The people who are claiming Opus level capability do not have sufficiently complex problems to see the difference.

irthomasthomas · 2026-05-28T14:43:24 1779979404

And neither side brings any evidence ...

raxxorraxor · 2026-05-28T08:46:45 1779958005

For coding I don't think so, but they are very close. I code with sonnet mostly because I think opus is just useful if you fail to dissect problems adequately, but anyway.

Kimi is close for example regarding SWE bench for code. For reasoning there are open models that surpass opus by quite a margin already.

simplyluke · 2026-05-27T19:51:23 1779911483

> that you can run locally

That's doing a lot of work here.

The future I see isn't most companies buying hundreds of thousands in hardware to run models, it's them adding a line item to their AWS bill. Inference costs on the larger hosted open source models are dramatically lower than the frontier labs API pricing.

teiferer · 2026-05-27T21:21:38 1779916898

The future I'm seeing is AI coprocessors running inference locally in most devices that today have a CPU. Just look at how powerful your mobile phone has become compared to your desktop computer 15 years ago and compared to a mainframe 30 years ago.

The days of requiring a data center to run anything resembling opus 4.6 are already counted. (But the industry will fight hard to get people to keep paying the Claude tax.)

simplyluke · 2026-05-27T21:46:51 1779918411

I'm already running a google TPU over USB on an otherwise very cheap board to do local computer vision on a front-door camera since I wanted to get away from Ring and other cloud services for that use case.

And yeah, that may be the ~decade world, but we're in the mainframe era of the frontier models. It's going to be more economical for basically any consumer, and most businesses, to pay someone else to host models for quite a while.

lelandbatey · 2026-05-27T23:07:34 1779923254

A gaming PC can already host models that perfectly serve casual users who just want recipes, todo tracking, picture identification, etc. E.g. Qwen 3.6 35b which will run on a $650 GPU at 75 t/s (Nvidia 1660 ti 16GB).

Said model will also run as a tool-calling coding model excellently (it's no Opus, but for a thing that once set up is just the cost of energy, it's incredible). It can type faster than you can, probably 10x faster, so with guidance it'll make you faster. And it's free.

It's here. If folks want ChatGPT without a subscription, they can have it today on their computer. The only money to be made is in the high end models doing "serious business" work spanning 1M+ token contexts and massive uncertainty. Everything else is already set to be eaten by today's local models.

simonw · 2026-05-27T23:28:01 1779924481

The problem with models like Qwen 3.6 35B (which really is an excellent model) is that my expectations of what a model can do have gone SO high now.

Here's a prompt I just ran against Claude Opus 4.7:

> Use python3 to experiment with whether the SQLite3 authorizer mechanism can be used to detect an INSERT OR REPLACE based just on running an explain query without examining the SQL string itself

Opus nailed it: https://claude.ai/share/c4212606-3fee-4b7c-bc97-505e0348ccac

I tried the same thing against qwen/qwen3.5-35b-a3b running locally in lmstudio, with the Pi coding agent. At first it looked like it was going to do great! And then it fell apart over the course of several tool calls: https://gisthost.github.io/?8ae2f842df619fb7fd8f1ccd82fe41c7

I'm used to GPT-5.5 and Opus 4.7 handling that kind of prompt without any problems at all.

lelandbatey · 2026-05-28T04:41:32 1779943292

Something is definitely going wrong with your Qwen setup, in the link you posted it starts and ends with a compaction step due to a 4k token context limit. Qwen 35b supports I think up to 200k+ context limit (though I run only with 128k), that seems to be a major source of the problem.

simonw · 2026-05-28T04:56:03 1779944163

Good call, I need to check if LM Studio is misconfigured.

scribble0242 · 2026-05-28T02:39:01 1779935941

This worked for me with qwen3.6-36b-a3b even at a q4 quant. I ran pi in a docker container and it had to figure out how to install python as well. I used the same initial prompt you had without any additional. You talked about Qwen 3.6, but then said you tried Qwen 3.5 in lmstudio. Not sure if you meant Qwen 3.6. I ran with llama.cpp llama-server with the recommended settings from unsloth.

I'm not an expert in SQLite so I can't say if this is 100% correct, but it seemed directionally similar to the conclusion from claude.

  ### TL;DR
  
  - Authorizer + EXPLAIN:  No — authorizer only sees SQLITE_INSERT, not VDBE opcodes
  - EXPLAIN opcode analysis alone:  Yes — Delete opcode at position 10 is the unique signature of INSERT OR REPLACE / REPLACE

I can't help but think the not-so-distant future will see language models expected on commodity personal computing devices.
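
For anyone who wants to check the TL;DR above themselves, here is a minimal sketch using Python's standard `sqlite3` module. It is not taken from either transcript. The table `t` and its schema are hypothetical, and whether a `Delete` opcode shows up (and at which position) likely depends on the schema and the SQLite version, so the script prints what it observes instead of assuming a result.

```python
import sqlite3


def probe(sql):
    """Compile `EXPLAIN <sql>` and report authorizer actions and VDBE opcodes."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema; the UNIQUE column gives REPLACE something to conflict on.
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT UNIQUE)")
    actions = []

    def authorizer(action, arg1, arg2, db_name, source):
        # Record every (action code, first argument) pair and allow everything.
        actions.append((action, arg1))
        return sqlite3.SQLITE_OK

    conn.set_authorizer(authorizer)
    rows = conn.execute("EXPLAIN " + sql).fetchall()
    conn.close()
    # Column 1 of each EXPLAIN row is the opcode name.
    return {"actions": actions, "opcodes": [row[1] for row in rows]}


plain = probe("INSERT INTO t (id, v) VALUES (1, 'a')")
replace = probe("INSERT OR REPLACE INTO t (id, v) VALUES (1, 'a')")

print("authorizer, plain INSERT:     ", plain["actions"])
print("authorizer, INSERT OR REPLACE:", replace["actions"])
print("opcodes only in the REPLACE program:",
      sorted(set(replace["opcodes"]) - set(plain["opcodes"])))
```

If the two authorizer lists come out identical while the opcode sets differ, that matches the conclusion quoted above; if not, the schema or SQLite build is probably the reason.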

simonw · 2026-05-28T02:55:54 1779936954

OK that's a very good answer! Do you mind sharing the transcript?

scribble0242 · 2026-05-28T15:40:38 1779982838

Sure I cleaned up the jsonl session file a little here: https://pastebin.com/PL9EPn9Y

I tried it a second time, and it spent a lot of time trying to figure out some authorization issue, so definitely not a slam dunk. I might run it a few more times for science. But while this is a new model it's also quite lightweight, and as hardware adapts and improves it seems inevitable that for many use-cases a packaged language model running locally will do the trick.

Balinares · 2026-05-28T11:45:50 1779968750

So one of the prominent LLM advocates known for testing every model shared a prompt intended to exhibit Opus 4.7 capabilities, and Qwen 3.6 sorted it out okay? Interesting.

Not saying they're equivalent, local models still decohere much quicker as the context grows in my experience. But... Interesting.

whattheheckheck · 2026-05-28T02:31:14 1779935474

That's when you build a better Ralph loop around your llm for it to converge to an answer and not rely on 1 shots

vineyardmike · 2026-05-28T02:41:23 1779936083

> a thing that once set up is just the cost of energy

I don't think we can discount this, frankly. Newer electronics are energy efficient, but older devices are more energy-intensive, and unless configured well, a gaming PC can easily use a few dollars a month in electricity, so now you're approaching subscription territory. A subscription comes with no upfront cost, higher reliability, no wasted space in your home, mobile apps, etc. (and less privacy).

dom96 · 2026-05-27T22:27:59 1779920879

Curious why you went for a custom solution. I am aware of at least one company that seems to ship devices with local computer vision (Reolink).

simplyluke · 2026-05-28T15:07:27 1779980847

My experience over the past decade has been being subsequently burned by being reliant on one provider's ecosystem after another. This is great until Reolink starts doing something shady to pad the bottom line and then it's on to the next.

I wanted the ability to run whatever cameras on a VLAN and own the stack.

yurishimo · 2026-05-28T09:40:43 1779961243

I'm guessing that they are using Frigate which is an OSS NVR. It supports a little addon USB stick you can buy for about $30 that will run common computer vision tasks for object detection. Stuff that we've been able to do with WebAssembly and Canvas for a long time now.

gedy · 2026-05-27T21:59:46 1779919186

> But the industry will fight hard to get people to keep paying the Claude tax.

I bet this will ironically be couched in "safety" reasons or regulation to get anti-AI folks on board, even if it favors the large incumbents.

selimthegrim · 2026-05-27T21:38:50 1779917930

Counted but not yet numbered?

enaaem · 2026-05-28T10:13:02 1779963182

Even when run on datacenters, it would be like current day webhosting. It is hyper competitive and it will be a race to the bottom. There is money to be made but not as much as investors hope. There will be datacenters in random countries like Kazakhstan because some oligarchs have found a free energy glitch (like with bitcoin mining).

ai_fry_ur_brain · 2026-05-28T00:39:24 1779928764

Magical thinking. I guess if your phone is going to have 128gb of ddr5 then sure. You people fundamentally don't understand the memory requirements for running inference. Your cute local models seem good enough because you have no standards and anything an LLM produces seems like magic to you.

teiferer · 2026-05-28T05:58:37 1779947917

> Magical thinking. I guess if your phone is going to have 128gb of ddr5 then sure.

Why would it not? The typical new phone today has 16gb of RAM. 20 years ago that was somewhere around 32mb. Factor 512. It's not hard to see that we'll get there rather soon, especially if there is an application that provides demand.

> You people fundamentally don't understand the memory requirements for running inference.

You seem to be overlooking how fast things change in this industry, especially if tons of money can be made as a consequence.

> Your cute local models seem good enough because you have no standards and anything an LLM produces seems like magic to you.

Please don't generalize. I'm an expressed AI skeptic and have to deal with the bad consequences of AI slop every day. But you can't deny that there are enough application areas where people have use cases and those will be much easier if things don't need a few round trips to a data center that sucks all the electricity and water out of neighboring communities.

saalweachter · 2026-05-28T11:48:01 1779968881

Eh, you're off by an order of magnitude or so on both ends.

The iPhone 17 has like 8 gb, the Pixel 10 12.

The original iPhone was 128mb, and the iPhone 6 from 2016-2018 was around 1gb; that puts the iPhone at around 8x RAM per decade, and puts us at 128gb in our pockets at around 2036 or so.

(Incidentally, the big news in phone RAM is that a lot of new phones are dropping back to 4gb because of RAM shortages.)
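
The growth arithmetic in this exchange is easy to reproduce. The sketch below only uses numbers quoted in the two comments above (16 GB vs 32 MB, 8 GB and 12 GB current phones, 8x per decade, a 128 GB target). The constant-growth assumption is the comments' own simplification, not a forecast.

```python
import math


def years_to_reach(current_gb, target_gb, growth_per_decade):
    """Years needed to grow from current_gb to target_gb at a fixed per-decade factor."""
    return 10 * math.log(target_gb / current_gb) / math.log(growth_per_decade)


# The "factor 512" claim: 16 GB today versus 32 MB twenty years ago.
ratio = (16 * 1024) / 32
print("16 GB / 32 MB =", ratio)

# The 8x-per-decade rate, starting from the two phones mentioned.
for name, start_gb in (("8 GB phone", 8), ("12 GB phone", 12)):
    years = years_to_reach(start_gb, 128, growth_per_decade=8)
    print(f"{name}: about {years:.1f} years to reach 128 GB")
```

At 8x per decade this gives roughly 11 to 13 years from today's phones, which is in the same ballpark as the mid-2030s estimate above.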

apocalyptic0n3 · 2026-05-27T20:18:58 1779913138

> it's them adding a line item to their AWS bill

That's the future Amazon sees too. We just had a week long session with the AWS team and they pushed that to us multiple times.

majormajor · 2026-05-28T01:15:35 1779930935

Buying "hundreds of thousands in hardware" sounds like a lot but many companies - especially software companies - already do that if they have 100+ employees.

Running software in the cloud gives you certain reliability and scaling advantages that would be very hard to replicate locally. Running some code agents in the cloud vs local hardware, if the local hardware gets "good enough," breaks the other way - offline usage, alone, would be hugely valuable to many people and companies.

It'd be very interesting to see where various players would decide to make a call "local is good enough" though. Buying the hardware isn't a small bet, if it's not something that ends up as part of your standard computer.

PeterStuer · 2026-05-27T20:02:40 1779912160

Many business tasks do not need the latest frontier models. I have a production system running since early GPT-4o. It now runs with GPT-5.2, not for improvements, but because it is cheaper. I could invest in switching to a local model, I tried and it works well enough, but api costs for this task are so low, it barely scratches $30/month. So I am using the local machine for other things and leave the inference on OpenAI, for now.

PunchyHamster · 2026-05-27T19:59:11 1779911951

But one will be in a few months. And then you have the choice of paying say $100k for hardware and paying just the power cost (or paying someone to do that for you), or paying way, way more for your team to have access to a marginal improvement.

And a 5% worse model for 10% of the price of the bleeding edge will be worth it for the majority of people

lukeasrodgers · 2026-05-28T00:15:13 1779927313

This project argues that with appropriate harness, the performance gap between frontier and much smaller open weight models shrinks dramatically: https://github.com/antoinezambelli/forge. I haven't kicked the tires yet.

myaccountonhn · 2026-05-28T10:09:21 1779962961

I've been doing my work with OpenCode Go, with Kimi2.6. It is not as good as Claude Opus, but it's good enough to get the job done, and I never run out of tokens.

overgard · 2026-05-27T21:37:12 1779917832

I keep hearing about this "inflection", but it feels extremely exaggerated to me. And yes, I was using it at the time. It got incrementally better, it wasn't that amazing.

simplyluke · 2026-05-27T21:45:38 1779918338

I think the bigger shift was harnesses and the two ended up somewhat commingled in people's minds.

Claude code was a lot of people's introduction to using coding agents that could do a lot more than copy-pasting from a chatbot or autocomplete.

noman-land · 2026-05-27T21:46:28 1779918388

The tool usage + skills got markedly better and so did the thinking cohesion. Add 1m context windows and it was a very noticeable shift.

Opus 4.6 quality for local inference would be revolutionary.

viking123 · 2026-05-28T07:38:23 1779953903

1m context is garbage

nazgul17 · 2026-05-28T23:40:22 1780011622

It's just a metric. If it can find a needle in a 1Mtok haystack, then it's likely good at coding within a 200Ktok context (or whatever, insert your number here, I'm just trying to make a point)

applfanboysbgon · 2026-05-27T20:44:29 1779914669

Opus 4.6 is a February model. Every time this subject comes up it seems like people post intentionally misleading things and move the goalposts.

The goalpost we've been bludgeoned with over and over again is that, in particular, Everything Changed in November 2025. That GPT 5.2 and Claude 4.5 were the inflection point. That is actually 6 months ago. And DeepSeek 4 is already there.

> run locally

You can't run DeepSeek locally on consumer hardware[1], but you can on enterprise hardware, and enterprise spend is the subject of this conversation -- and even if you aren't self-hosting, it doesn't matter, because you can just get your inference from one of the many companies serving DeepSeek, who trivially undercut the pricing of OpenAI/Anthropic because they didn't have to spend hundreds of billions on training frontier from scratch but instead only invest in supporting inference, which is already profitable.

[1] Since this misconception comes up all the time, I'll go ahead and pre-empt it: no, training a 32b parameter model on outputs from DeepSeek and running that locally is not "running DeepSeek", despite the hundreds of stupid articles and Youtube videos making that idiotic claim that they're running it on a 5090.

simonw · 2026-05-27T20:49:41 1779914981

> You can't run DeepSeek locally on consumer hardware

Maybe not DeepSeek v4 Pro, but I've run DeepSeek v4 Flash on my 128GB MacBook Pro using antirez's carefully quantized https://github.com/antirez/ds4 and it's impressive.

applfanboysbgon · 2026-05-27T21:10:05 1779916205

Oh sure, yeah, that's nothing to sneeze at either. I think unqualified "DeepSeek" should generally refer to the main model, though, especially in the context of GPT5.2-grade quality.

zozbot234 · 2026-05-28T09:14:26 1779959666

> You can't run DeepSeek locally on consumer hardware

I'd qualify that by writing that you can't run it with ordinary, real-time speed and throughput. If all you care about is slow and high-latency inference, there's no reason why that shouldn't be feasible even on the cheapest miniPC around, as long as it can literally store the model weights and keep around the (rather small) context.
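
As a back-of-the-envelope check on "can it literally store the model weights", weight storage is roughly parameter count times bits per parameter. The sketch below is only that lower bound: it ignores the KV cache, activations, and quantization overhead, and the 35B figure is borrowed from the Qwen discussion elsewhere in the thread purely as an example.

```python
def weight_gb(params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return params * bits_per_param / 8 / 1e9


# Example: a 35B-parameter model at common precisions.
for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"35B parameters at {label}: ~{weight_gb(35e9, bits):.1f} GB")
```

Whether the weights fit in fast memory or have to stream from slower storage is what separates "feasible but slow" from real-time inference.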

damnitbuilds · 2026-05-28T12:01:55 1779969715

To be relevant to this discussion, models running on reasonably-priced local hardware do not have to be as good as the best.

They just have to be useful enough that companies don't need the best.

They are.

londons_explore · 2026-05-28T18:25:07 1779992707

Deepseek v4 pro is damn close to Claude 4.6, and whilst you'll pay quite a lot for a rig able to run it, it is open source.

_3u10 · 2026-05-28T06:40:44 1779950444

Kimi is better.

svara · 2026-05-27T19:55:25 1779911725

There's still a lot of room for the best models to get better at coding.

Your argument rests on the "for marginal gains" part but it's really not clear that the gains are marginal in the foreseeable future.

simplyluke · 2026-05-27T21:59:32 1779919172

This is totally valid and I don't agree with the downvotes you're getting. Someone coming out with a 10x improvement is possible and would change the game immediately. The thing is, we really have been seeing marginal gains with shifting leaders in who's got the "best" since GPT3, and at least as a user of these tools that pace has been slowing, not accelerating. Subjectively it feels like we're in the back half of an S-curve.

We're 3.5 years into this current AI wave, and a lot of the valuations have been predicated on what you're arguing here -- that essentially should one of the labs make an order-of-magnitude improvement or hit escape velocity on recursive self-improvement they'd become the most powerful economic chokepoint in history.

The reality has been that given access to compute + capital all of the labs can stay pretty competitive with each other. Someone does a bit better on coding, someone else does a bit better on tool calling, and then they swap after each spending another $100bn.

The market looks like a commodity market where the commodity is intelligence, not a winner-take-all market with massive margins. Plenty of people get rich in oil and airlines, but they notably don't tend to be the innovators long term, they tend to be the operators. Obviously if the machines become sentient tomorrow, turn on their masters, and hit world-dominating intelligence, that assessment changes, but after several years of that narrative while objective reality looks quite different I think the more sober voices are starting to gain a foothold.

svara · 2026-05-28T08:31:16 1779957076

I agree with most of what you're saying, but I think the point I was trying to make wasn't as high-flying as you and others understood it.

I'd pay a premium for even just a model that's 20% better, no ASI required, and I think a lot of people would. I wouldn't call that marginal, if it means I'm getting frustrated on 20% fewer tasks.

A recurring pattern that I've seen in myself and others is to at first be very impressed by a new model's coding capabilities, and then desensitize quickly and start being frustrated by the shortcomings.

simplyluke · 2026-05-28T15:10:52 1779981052

> I'd pay a premium for even just a model that's 20% better

The point I'm making is that I think we're rapidly hitting levels where corporate buyers aren't willing to pay multiple-times-more for marginal gains, and I expect that to become more the case over time, not less. You, and a small % of other power users in the market might tolerate a $400/month pro-supreme-plan for access to Mythos or whatever, but I don't think that's going to scale up in quite the same ways we've seen so far.

Even a year ago paying multiple times more for a 50% gain was very sensible for a lot of workflows. But if we're getting to "good enough" for things like coding, justifying to your CTO/CFO why the org should go from spending $1m/year to $5m/year for a 10% higher hit-rate on one-shot prompts from the engineers is a much tougher sell.

yfw · 2026-05-27T22:13:56 1779920036

What? The gains between gpt4->5 seem to be marginal. No phd level discoveries here

simonw · 2026-05-27T22:15:12 1779920112

The leap from GPT-4 to GPT-5.5 has been astounding in my opinion. There is no way GPT-4 could run a coding agent harness like Codex at even a fraction of the quality that GPT-5.5 does.

anon373839 · 2026-05-27T23:22:29 1779924149

I don’t think that’s exactly indicative of GPT-5.5 being an astoundingly more intelligent model, however. An alternate interpretation is that GPT-5.5 was trained on tool usage/harness patterns and has been optimized for this use case.

I remember that even when GPT-4 was king, the Gorilla paper showed that Llama 7B could be fine-tuned to outperform GPT-4 on tool calling.

On domains that don’t involve agentic tool calling*, I haven’t found the frontier to have advanced that much.

Edit: I should broaden this to domains that naturally lend themselves to RLVR training. Models are drastically better at math now.

baq · 2026-05-28T10:58:00 1779965880

None of this matters in the product: it either is capable of agentic loop workflows or it isn’t. A 10% improvement in probability of single task success makes or breaks the use case.

imtringued · 2026-05-29T07:55:23 1780041323

For me any of the codex models run circles around the non codex models for codex usage.

I'm not sure why you're so obsessed with the non-codex versions

swalsh · 2026-05-27T20:59:55 1779915595

Open source models, especially Qwen, are pretty dang good. But it's not Opus 4.6; the evals don't tell the full story. I question the assumption that open source models are 3-6 months out.

Ucalegon · 2026-05-27T21:27:37 1779917257

It's not just about the quality of output; you can also finetune them to proprietary needs, if the skillsets are there internally, to make them better without governance risks. So being SOTA doesn't matter as much, since generalized tasks are not what matter most to companies; it's the specialization relative to business need or internal datasets.

oblio · 2026-05-27T21:33:29 1779917609

To make an extreme comparison, desktop Linux was originally supposed to happen in 1999.

simplyluke · 2026-05-27T21:53:06 1779918786

Maybe I misspoke by saying open source.

The larger point I'm making is I think models are rapidly becoming commoditized. There is probably a small market long term that's willing to pay 10x for 10% marginal gains, but the majority of the buyers in the market will be economic and we're likely to have a lot of folks willing to spend 1/10 the cost for 90% of the performance, and plenty of companies that haven't raised hundreds of billions-trillions who can provide that.

A lot of the frontier labs' valuations have been based on an assumption that 1-2 companies would get break-away intelligence that basically made them economic chokepoints indefinitely into the future. The reality that's becoming increasingly clear is that model quality is a pretty linear function of (cash burned - ability to copy others' homework) and the economics are starting to look a lot more like airlines than online advertising.

grttq · 2026-05-27T23:43:35 1779925415

Let's go one step further.

The economics of airlines are such that they generally earn a return on capital less than cost of capital.

I think this is exactly where we are heading and OAI-Anthropic are the concordes.

extraextra · 2026-05-29T16:25:31 1780071931

Not OP, but it is a known fact that the cumulative profits of the airline industry (in the US) over its history have been basically 0. We can say that essentially airlines are in business to support other businesses. I believe this is what OP might've been referring to.

w29UiIm2Xz · 2026-05-27T20:24:06 1779913446

If only the AI era was born in ZIRP.

sailfast · 2026-05-27T21:07:07 1779916027

Better now than ZIRP for me - at least people are asking timid questions about the unit economics and how long the runway is _early_ while also spending absolutely insane amounts of money on this bet. During ZIRP, these companies would have turned down any investor asking questions. Less contagion when rates aren't zero hopefully? :grimace:

mschuster91 · 2026-05-28T09:10:14 1779959414

The size of the AI bubble and the IOUs being passed around like a hot potato already dwarfs the real estate bubble preceding the 2007 crash.

If we still were in the ZIRP era, busting the bubble would certainly kill off the world's economy for good simply due to its size.

vessenes · 2026-05-28T02:33:52 1779935632

You have to think about why open models are behind. Exfiltration is a big part of it. So you could change the Nash equilibrium by increasing your security, or other multilateral approaches.

drumdance · 2026-05-29T14:36:29 1780065389

Forgive my naiveté, but who pays for the training of these models?

simplyluke · 2026-05-22T12:42:50 1779453770

I am (clearly) not as far down the rabbithole as the commenter you're replying to, but almost certainly not. Streaming 4K Blu-ray is on the order of ~100Mb/s, which means that on a LAN, bog-standard gigabit ethernet and associated networking hardware would be more than sufficient.
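
As a rough sanity check on that arithmetic, here is a small Python sketch. The 100 Mb/s and 1000 Mb/s figures are the approximate numbers from above; the 90% usable fraction is purely an assumption to leave room for protocol overhead.

```python
# Back-of-the-envelope check; all figures are rough assumptions, not measurements.
STREAM_MBPS = 100    # approximate 4K Blu-ray bitrate quoted above
LINK_MBPS = 1_000    # nominal gigabit ethernet


def max_streams(link_mbps: float, stream_mbps: float,
                usable_fraction: float = 0.9) -> int:
    """Whole number of streams that fit in the usable share of the link."""
    return int(link_mbps * usable_fraction // stream_mbps)


if __name__ == "__main__":
    print(max_streams(LINK_MBPS, STREAM_MBPS))
```

Even with a conservative usable fraction, a single gigabit link has headroom for several simultaneous streams.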

This is taking a hobby to its extremes, in much the same way that a $5k boat and $500k boat let you catch the same fish.

iwontberude · 2026-05-22T15:55:10 1779465310

It’s about being able to rapidly move files between the arrays and future proofing.

simplyluke · 2026-05-22T12:20:39 1779452439

Yeah, CAD has been my personal example of "oh the barrier to entry for this skill was high enough that I didn't do it and now I can be passably bad at it enough to get some simple things done"

I've had similar experiences with making simple functional parts off a 3d printer with OpenSCAD + LLMs. I'm very aware that the models are worse at it than say, generating react code, and I'm also the antithesis of a skilled pilot. It's still cool and has resulted in me starting to learn a new skill at a hobby level.

dempedempe · 2026-05-22T15:31:10 1779463870

It's like this with a lot of things now. For example, Nix's learning curve used to be a huge barrier to entry. Now with LLMs, I'm using nix-darwin and home-manager for dotfiles, package management, and have individual flakes in all of my projects for cryptographically reproducible builds!

rlt · 2026-05-22T16:28:08 1779467288

Nit: there’s nothing “cryptographic” about reproducible builds.

“Reproducible build” already usually implies bit-by-bit reproducibility.

illiac786 · 2026-05-22T20:16:23 1779480983

“The reproducibility is cryptographically verifiable with hashes“ would be the full sentence, but it’s a mouthful.

pabs3 · 2026-05-23T02:40:25 1779504025

Build reproducibility checks usually use bitwise comparison, not hash comparison.

The Reproducible Builds project also wrote diffoscope, which goes quite far with helping identify where differences occur and how to fix them.

https://reproducible-builds.org/ https://diffoscope.org/ https://try.diffoscope.org/

illiac786 · 2026-05-23T08:16:16 1779524176

Let’s say, for the positive case, hash comparison is significantly faster.

pabs3 · 2026-05-23T08:38:54 1779525534

I feel like that is quite unlikely. Both the hash and bitwise comparisons read both files in both cases. In the not-equal case the hash reads the entirety of both files, so it's slower than a start-to-end bitwise comparison, which exits at the first not-equal bit. In the equal case, both read the entirety of both files. Various other bitwise strategies can be faster than start-to-end; rdfind, for example, checks the start of the file first, then the end, then the rest of the file.
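
To make the two strategies concrete, here is a minimal Python sketch. The function names, the chunk size, and the up-front size check are my own illustrative choices, not what any particular reproducibility tool does.

```python
import hashlib
import os

CHUNK = 1 << 16  # read size in bytes; an arbitrary choice


def bytewise_equal(path_a: str, path_b: str) -> bool:
    """Start-to-end comparison that stops at the first differing chunk."""
    # Assumption: a cheap size check first, since different sizes cannot match.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(CHUNK), fb.read(CHUNK)
            if a != b:
                return False  # early exit: the rest is never read
            if not a:
                return True   # both files exhausted with no difference


def sha256_of(path: str) -> str:
    """Digest of a whole file; always reads every byte."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_equal(path_a: str, path_b: str) -> bool:
    """Compare by digest; both files are read in full even if byte 0 differs."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If a digest for one side is already stored, only the other file has to be read and hashed, which is a different scenario from comparing two files that are both on disk.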

illiac786 · 2026-05-23T11:56:56 1779537416

I think we’re not talking about the same scenario. I’m talking about the case where at least one hash has already been calculated.

dekhn · 2026-05-22T22:24:59 1779488699

yes, but it's still not cryptological, it's just verification using hashes.

fc417fc802 · 2026-05-22T22:43:42 1779489822

The hash being cryptographically secure is significant. In contrast, you could use (for example) md5 to non-cryptographically verify that the full process matched.

dekhn · 2026-05-23T01:12:00 1779498720

Sorry, the point I was making is that this isn't cryptography- it's the properties of a cryptographic hash (hard to spoof) that are useful. I don't think any verified build program uses the hash to encrypt data at any point. If I'm wrong on this point, that's fine, but please include a link.

fc417fc802 · 2026-05-23T01:19:08 1779499148

Sure, "verified in a cryptographically secure manner" is technically not equivalent to "cryptographically verified" but the response "it's not cryptographic" is rather ambiguous at best given that it is, in fact, a cryptographically secure manner of verification. The key observation here being that an algorithm or process being "cryptographically secure" does not mean that it is "cryptographic" in nature (ie implements or uses cryptography).

dempedempe · 2026-05-22T17:56:41 1779472601

I meant with Nix you're comparing hashes. With Docker, you're using pinned versions

bt1a · 2026-05-22T17:07:08 1779469628

i thought it mainly implied architectural/hardware compatibility and deterministic output

aidenn0 · 2026-05-22T23:16:10 1779491770

Nix mostly does not guarantee deterministic output. It rather guarantees deterministic inputs, and then sandboxes the system to inhibit the build from accessing the outside world.

Deterministic inputs do not always imply deterministic outputs.
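
A toy Python illustration of that last point. The `build` function here is entirely hypothetical and has nothing to do with how Nix actually builds anything; it just shows that identical source can yield different artifacts once something like a build timestamp leaks in.

```python
import hashlib
from typing import Optional


def build(source: bytes, build_time: Optional[str] = None) -> bytes:
    """Hypothetical 'build': the artifact is the source plus an optional stamp."""
    artifact = b"ARTIFACT:" + source
    if build_time is not None:
        # Embedding wall-clock time is a classic source of non-determinism.
        artifact += b"\nbuilt-at:" + build_time.encode()
    return artifact


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


src = b"int main(void) { return 0; }\n"

# Same input, two different build times: the output hashes differ.
stamped_a = digest(build(src, "2026-05-22T00:00:00Z"))
stamped_b = digest(build(src, "2026-05-23T00:00:00Z"))
print(stamped_a == stamped_b)

# Same input, no embedded timestamp: the output hashes match.
print(digest(build(src)) == digest(build(src)))
```

Pinning or removing the timestamp is the usual fix; the Reproducible Builds project defines the `SOURCE_DATE_EPOCH` environment variable for this purpose.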

pabs3 · 2026-05-23T02:41:25 1779504085

Indeed, the Reproducible Builds community is working on fixing non-deterministic build output https://reproducible-builds.org/

pimeys · 2026-05-22T17:08:54 1779469734

Nix is also great at work. You keep the server nix code in the same repo and OpenCode can just change and test server config.

0x696C6961 · 2026-05-22T13:04:48 1779455088

Learning to make simple parts in onshape is pretty darn easy (and fun).

jeffbee · 2026-05-22T14:40:46 1779460846

Yeah. I teach this after school to 7th grade kids. Anyone can pick this up in a few hours.

chalupa-supreme · 2026-05-22T15:18:14 1779463094

They taught us to make Lego bricks with CAD when I was in 6th. Wish I retained more of that and that it would be more widely taught.

jeffbee · 2026-05-22T17:40:59 1779471659

I am reasonably confident that access to solid modeling and additive fabrication is now more widespread than ever.

simplyluke · 2026-05-23T02:59:24 1779505164

I mean, like any other skill that has pretty much been my experience (though I tried fusion + openscad), but there is something about being able to ask a computer all the dumb noob questions that makes that first phase easier.

k-Whale · 2026-05-23T01:27:39 1779499659

same — LLMs turn skills i'd parked for years into 'just try it' territory, which is genuinely new.

simplyluke · 2026-05-22T12:06:11 1779451571

Would you be excited about technology when it appears based on their stated intentions and revealed track record over the past 15 years of your young life that those driving it fully intend to use it to disenfranchise you further, not empower you?

The reality of the world faced by today's 21 year old college grad is completely unlike the world graduates went into 20 years ago.

CamperBob2 · 2026-05-22T15:53:19 1779465199

> Would you be excited about technology when it appears based on their stated intentions and revealed track record over the past 15 years of your young life that those driving it fully intend to use it to disenfranchise you further, not empower you?

Funny, I don't feel "disenfranchised" by AI. If you do, well... in the words of the other Steve, you're holding it wrong.

ungovernableCat · 2026-05-23T01:09:04 1779498544

The only thing that's really being enfranchised by AI is my stock portfolio. I was promised cancer cures and longevity medicine but all I got is a way bigger workload for the same pay.

401k has never been better though. College grads don't have one yet so I can see why they're grumpy.

simplyluke · 2026-05-22T19:01:06 1779476466

Nor do I, but the loudest voices in public have spent the past 4 years telling anyone with a microphone that white collar work is dead. How would you expect that to make a new graduate feel?

simplyluke · 2026-05-19T17:52:04 1779213124

This is broadly true of a bunch of jobs/fields with LLMs, but particularly true for programming. They raise the floor to a point where a generally capable person can put something like that together, or come up with a passably okay visual design, or decent-enough written language. I've been using them heavily to get some laughably basic CAD work done for small 3d printed projects. Stuff that absolutely makes my mechanical engineer friends roll their eyes at me.

An expert can either use the tool more effectively, or see all the issues in a less experienced person's output.

Both of these are good things, the mistake a ton of people are making is experiencing industrial scale Dunning-Kruger and thinking "Only my expertise is still valuable, every other white collar role is done!"

The second-order mistake is thinking that raising the floor like that devalues expertise instead of increasing demand for it. The net-effect of me starting to play with CAD because it's a little easier now isn't that I don't hire my friends who are experts to make a tiny spacer I'm going to 3d print, I never would have hired them for that, it's that maybe I start learning the skills and decide to take on a more ambitious project where I do need to hire one of them for some help, or start ordering custom CNC'd parts -- scale that to the entire economy.

simplyluke · 2026-05-14T22:29:56 1778797796

> your CC payments help track

Not only that. Them and the point-of-sale vendors (aptly shortened PoS), sell that data. They tend to attempt to do this anonymized. How successful they are in anonymizing that is very much so up for debate.

The websites (and even their retail locations) you buy from send your purchase data to meta and other advertisers directly via APIs so they can better track their marketing conversion rates. You can browse their APIs [1][2] to see what kind of data they like to get, but it tends to be every piece of identification they have on you. Rewards programs make this a much richer data set. You don't need to be a user of Google/Meta for them to build a marketing profile based on this. Google links your physical conversion from ads based on your maps data. Facebook does the same if you give them your location data. Many retailers attempt to use the bluetooth/wifi signals from your phone to track the same data even if you pay in cash [3].

There's no legal framework preventing this outside of the EU and California.

1: https://developers.facebook.com/documentation/ads-commerce/c... 2: https://developers.google.com/google-ads/api/docs/conversion... 3: https://www.nytimes.com/interactive/2019/06/14/opinion/bluet...

lesuorac · 2026-05-15T00:38:24 1778805504

> They tend to attempt to do this anonymized. How successful they are in anonymizing that is very much so up for debate.

Yeah I think the big thing to push or talk about is that there is no such thing as "anonymized".

There's only such a thing as "can only be identified as X many people". Like for a given dataset you can make any data point correlated to 1 of say 50 people. If somebody is anonymizing data and they don't provide a k-anonymity [1] you should just assume it's 1:1 and effectively not anonymized.

[1]: https://en.wikipedia.org/wiki/K-anonymity
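
For a feel of what that k actually is, a minimal Python sketch: group the records by their quasi-identifier columns and take the smallest group. The records and column names are made up for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Smallest group size when records are grouped by the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Made-up purchase records; none of this is real data.
records = [
    {"zip": "94110", "age_band": "30-39", "item": "coffee"},
    {"zip": "94110", "age_band": "30-39", "item": "chips"},
    {"zip": "94110", "age_band": "40-49", "item": "coffee"},
    {"zip": "10001", "age_band": "30-39", "item": "bagel"},
    {"zip": "10001", "age_band": "30-39", "item": "coffee"},
]

k_zip = k_anonymity(records, ["zip"])                  # groups of 3 and 2
k_zip_age = k_anonymity(records, ["zip", "age_band"])  # one group has a single row
print(k_zip, k_zip_age)
```

Adding one more column drops k to 1 here, which is the usual failure mode: every extra attribute in the dataset shrinks the crowd each person can hide in.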

BobaFloutist · 2026-05-15T19:31:12 1778873472

I know it wouldn't fix everything, but I think it wouldn't be a bad start to just make it generally illegal to deanonymize data that was collected with the promise of anonymity.

bamnet · 2026-05-15T01:51:24 1778809884

K-Anonymity isn't the only technique. Differential Privacy is arguably more robust.
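
A minimal sketch of the simplest differential privacy building block, the Laplace mechanism on a counting query, in Python. The function names are mine, and real deployments also need privacy budget accounting and a carefully implemented noise source, which this ignores.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Counting query: one person changes the count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo is repeatable
    for eps in (0.1, 1.0):
        print(eps, private_count(100, eps, rng))
```

Smaller epsilon means more noise and stronger privacy; the guarantee holds regardless of what other data an attacker has, which is the robustness argument.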

orthecreedence · 2026-05-15T05:33:45 1778823225

> They tend to attempt to do this anonymized. How successful they are in anonymizing that is very much so up for debate.

    let anon_id = md5(SSN);
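
For anyone wondering why that line is a joke, a hedged Python sketch: SSNs are nine digits, so the input space is small enough to enumerate, and any unsalted hash of it can be reversed by brute force. The search range below is kept tiny so it runs instantly, and the example number uses the never-issued 000 prefix.

```python
import hashlib
from typing import Iterable, Iterator, Optional


def anon_id(ssn: str) -> str:
    """The 'anonymization' from the line above: an unsalted MD5 of the SSN."""
    return hashlib.md5(ssn.encode()).hexdigest()


def ssn_candidates(start: int, stop: int) -> Iterator[str]:
    """Zero-padded nine-digit strings for every integer in [start, stop)."""
    for n in range(start, stop):
        yield f"{n:09d}"


def recover(target: str, candidates: Iterable[str]) -> Optional[str]:
    """Brute force: hash every candidate until one matches the target."""
    for ssn in candidates:
        if anon_id(ssn) == target:
            return ssn
    return None


leaked = anon_id("000012345")  # made-up value, not a real SSN
found = recover(leaked, ssn_candidates(0, 100_000))
print(found)
```

Covering the full nine-digit space is only about a billion hashes, so swapping MD5 for a stronger hash does not help; the problem is the low-entropy input, not the hash function.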

like_any_other · 2026-05-15T02:19:46 1778811586

In the good old days, if you were found to be informing on your neighbors to hostile powers, you were liable to find yourself in a mass grave when the political winds shifted, or even sooner.

But now it's so convenient and discreet and common, we think nothing of it. Plus, Google and Apple and Facebook and their partners and everyone they sell data to are our friends, not enemies :)

simplyluke · 2026-05-13T01:36:24 1778636184

> will result in a shrinking workforce

Jevons paradox is already rearing its head, I've seen data suggesting open roles in tech are at their highest since the post-pandemic slump [1]. If you're a senior leader at a company and your engineers are now capable of multiple-times more productivity, is the logical choice to fire half, or set way more ambitious goals? One assumes engineers are hired because their outputs are worth more than their cost. If outputs, at least for those capable of wielding new tools, are higher, so is the value of that employee to you.

The universal thing I'm hearing from friends at small-mid-size tech companies, and experiencing myself, is that there is way more work and demand for it from senior leaders than they're capable of with their current teams.

1: https://www.ciodive.com/news/tech-job-postings-hit-3-year-hi...

himata4113 · 2026-05-13T10:55:56 1778669756

There is a limited number of things to work on; planning and orchestration become the bottleneck.