Hacker News | lumost's comments

This is either a complete own goal by Amazon… a play to consolidate compute/model access.

Will Chinese models be allowed on the market… at all? Will startups be banned from training models of equivalent capacity?


At this point would I be outsourcing my knowledge work or would I be entering self-exile?

The KV cache is order-dependent: each entry depends on the tokens that precede it in the context.

There are some transformation approaches to reuse the KV cache across inferences, but none are in wide use due to accuracy concerns following the transformation.


The paper has a section on "Reusing precomputed KV across queries" which talks about how other papers have tried to address this problem, but yeah, this paper adds nothing on its own besides a catchy title.

Isn't it also, most fundamentally, dependent on the model weights?

My understanding was that what the KV cache stores is nothing else than the "activations" of the W_k and W_v matrices of an attention module for a given input sequence.

So I don't quite understand how this is supposed to work:

> Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill.

Should a publisher precompute the cache for every popular model that is out there?


...not to mention, which KV cache? Every attention module has its own, and how many attention modules there are, what inputs they get, how many internal features and attention heads they have, etc, all depends on the architecture of the specific model.
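A toy sketch of the two dependencies raised above (weights and preceding context). Everything here is an assumption for illustration: a two-layer, single-head attention stack with 2-D embeddings, no positional encoding, and made-up weights and token names. It is not any real model's architecture. It shows that the cached K/V for the same "doc" token match in the first layer but diverge in the second once the prefix changes, and that changing the weights changes the cache outright.

```python
import math

def matvec(W, x):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(xs, Wq, Wk, Wv):
    """Causal single-head attention with a residual; returns (outputs, kv_cache)."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    outputs = []
    for t, x in enumerate(xs):
        q = matvec(Wq, x)
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys[: t + 1]]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, values[: t + 1]))
                 for i in range(len(x))]
        outputs.append([xi + mi for xi, mi in zip(x, mixed)])
    return outputs, list(zip(keys, values))

def kv_caches(xs, layers):
    """Run the stack and collect one KV cache per layer."""
    caches = []
    for Wq, Wk, Wv in layers:
        xs, cache = attention_layer(xs, Wq, Wk, Wv)
        caches.append(cache)
    return caches

# Hypothetical embeddings and weights, chosen only to make the effect visible.
EMBED = {"sys": [1.0, 0.0], "other": [1.0, 1.0], "doc": [0.0, 1.0]}
W = [[0.5, -0.2], [0.3, 0.8]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
LAYERS = [(W, W, W), (W, W, W)]

cache_a = kv_caches([EMBED["sys"], EMBED["doc"]], LAYERS)
cache_b = kv_caches([EMBED["other"], EMBED["doc"]], LAYERS)
cache_c = kv_caches([EMBED["sys"], EMBED["doc"]], [(IDENTITY,) * 3] * 2)

same_layer0 = cache_a[0][1] == cache_b[0][1]    # "doc" entry, first layer
same_layer1 = cache_a[1][1] == cache_b[1][1]    # "doc" entry, second layer
same_weights = cache_a[0][1] == cache_c[0][1]   # same tokens, other weights
print(same_layer0, same_layer1, same_weights)
```

In this toy the first-layer entries only match because there is no positional encoding; with one, even the first layer would depend on where the document sits in the sequence.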

Just curious, do you have links to read more about transformations or other techniques for KV cache reuse?

All major model providers offer prefix caching, which is this.

No, reusing segments of the KV cache for different purposes in an order-independent manner is an active research area.

Any keyword or paper I can search for?

AsyncReasoning[1] does a trick of that sort to give agents concurrent cache views.

You basically have two agents look at the same cache under different views. Say agent_0 gets [a_1, a_0] and agent_1 gets [a_0, a_1]. They also write to this cache concurrently while decoding. To solve positional embedding inconsistencies they rotate the query projections for each block (a_0 and a_1) separately.

The computations you get that way do not exactly match the setup where you would naively prefill on every step, but are close enough.

Same trick could be used for the setup discussed here, I guess: prefill the document cache separately (p), prepend the system prompt (s) and get a cache view [s, p] from which you can then decode.

1. https://arxiv.org/abs/2512.10931
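A minimal sketch of why a rotation can repair the positional mismatch, assuming a RoPE-style rotary embedding reduced to a single 2-D feature pair with one made-up frequency. Note the swap: this rotates the cached keys instead of the query projections described above, purely because it is easier to show; it is not the paper's implementation. The key vectors, positions, and the offset of 4 are hypothetical.

```python
import math

THETA = 0.1  # assumption: one rotary frequency for a 2-D toy head

def rope(vec, pos, theta=THETA):
    """Rotate a 2-D vector by pos * theta, as RoPE does for one feature pair."""
    angle = pos * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

# Document keys prefilled on their own, at positions 0..2.
raw_keys = [(1.0, 0.5), (-0.3, 0.8), (0.2, -0.7)]
doc_cache = [rope(k, pos) for pos, k in enumerate(raw_keys)]

# Later a 4-token system prompt is prepended, so the document starts at 4.
offset = 4
shifted_cache = [rope(k, offset) for k in doc_cache]                  # re-rotate
recomputed = [rope(k, pos + offset) for pos, k in enumerate(raw_keys)]

max_err = max(abs(a - b)
              for s, r in zip(shifted_cache, recomputed)
              for a, b in zip(s, r))

# Scores depend only on the query/key distance: distance 2 in both cases.
q = (0.4, 0.9)
score_shifted = dot(rope(q, 7), shifted_cache[1])   # key now at position 5
score_original = dot(rope(q, 3), doc_cache[1])      # key at position 1
print(max_err, abs(score_shifted - score_original))
```

This only fixes the positional part. The content of the cached keys and values in deeper layers was still computed without attending to the system prompt, which is consistent with the results being close to, but not identical to, a full prefill.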


But this would work only for the first layer, or am I missing something?

Absolute slop paper. Replace document with text and you'll get it.

"People are asking the same questions and an answer is generated every time, what if we could like cache the questions and their answers..."

Sounds like someone was using ChatGPT to understand how ChatGPT works and then asked it to generate a paper based on his proposal to improve it.


At least it wasn't a patent.

Is this the inevitable outcome of frontier labs who own their hardware? The GPUs and datacenters are the major cost. Inference and training are a higher-tier value proposition; if the company gets nervous that the investment in hardware won't pay off, renting it out becomes a major topic of conversation.

A frontier model team having to fight their board on whether to monetize the datacenters directly or continue to invest in AI work is going to have a hard time.


Is it really "wear and tear?" or is it an evolved mechanism to keep genetic drift and natural selection alive? Alternatively, it could be an evolved mechanism to avoid genetic bottlenecks caused by highly reproductive individuals over long periods of time.

If John and Mary were first, how long until everyone is a descendant of John and Mary?


Why do we think frontier model vendors are high margin?

It's quite clear that there is an effort to engineer mega financial vehicles that index tracking funds are forced to buy. The incentive to do so is massive, and there is nothing illegal about it.

As a holder of index funds such as the S&P, I'd much prefer that these vehicles are excluded for at least some period of time to ensure that the greater fool isn't simply my index portfolio.


> It's quite clear

In the same way 9/11 was quite clearly an inside job.

Alternatively, a crop of big companies with real, potentially world-changing technology are going public.

This isn’t exactly pets.com we’re talking about.


Are you happy to be invested in Tesla? It is not profitable quarter to quarter and is included in your fund.

Why do you tolerate that and not this?


TSLA has been around for many years, whether I agree with its value or not. It has been able to retain its valuation in the public market.

An IPO with massive insider selling counterbalanced by a flood of index fund rebalancing is entirely different.


The current theme is that AGI may not be definable, and an AI which matches humans on all economically relevant tasks is close enough for business purposes.

Billions spent on RL may be good enough to beat human performance.


I actually can’t wait for the future where I upgrade hardware in order to upgrade my ai as an alternative to an expensive subscription.

There are many problems I want to work on which require billions of tokens. These are completely inaccessible without corporate project sponsorship at the moment. An ASIC generation machine which can pump out a few tens of thousands of tokens per second at Opus 4.6 quality is more than sufficient.


A company called Taalas is working on something like that. Not Opus 4.6 quality, but I'm sure they're targeting larger models. Currently they're using a Llama 8B model. It runs at ~17k tokens per second, and you can test it at https://chatjimmy.ai/.

I'm rooting for them HARD but they've been quiet since their last (and only) blog. X and LinkedIn are empty too. I really hope it wasn't a pipe dream.

It starts to be interesting when latency is better than the average website.

I’m not sure if this is what you meant, but at 17k t/s, you start to compete with the speed of network calls. You could approach the point of generating an HTML/js/css page faster than some websites can be returned over the network.

When that happens, they will be able to rewrite reality in real time.

The immediate load (less than 200ms on my machine through a slow connection) is quite pleasant, tbh.

That's cool, I just tested it out and it is fast but unfortunately its accuracy is not great.

It's an 8B model. Consider it a proof-of-concept.

I'm curious how hardware and power cost would stack up to subscription cost

Right now there are some heavily subsidized subscriptions that are more or less cheating. For instance, GitHub Copilot at $39/month gives you Claude Opus 4.6. They're going to close that off, but right now it's like a freebie for those doing API agentic harnesses.

That said, if you are doing always-on agents and you spend $3k-$4k on a GB10, or $5k+ on Apple Silicon, as your sunk cost, you will probably come out ahead.

I've got 5 agents running a purely experimental social experiment. They operate in an Evennia MUD (a familiar-sounding city called "gothmud"). I've built a channel, idle prompts, sleep schedule. I feed in real world news, weather. There's a character up in a clock tower that reads Evennia's audit logs every 20 minutes to surveil the city, and a cast of people wandering around, investigating things, having coffee, repairing robots. This is all hitting qwen3.6-35-A3B on the Asus GB10, which cost me $3k.

Over the last 30 days, I've hit 394M input tokens, 1.6B output tokens. I would have spent between $1600 and $1700 if I was using OpenRouter. Not calculated: I also have ComfyUI running in the spare space, and the agents "take photos" of the rooms they're in, selfies, workshop photos, etc.

How much did I spend on electricity? I don't have a meter on my box. My total electric bill for the last 30 days was $220, so I know it's less than that. My rate to compare is 11.7c/kWh, but it's closer to 15c/kWh total. The Asus GX10 has a 240W power supply, and it's probably only pulling 180W. I estimate $15-$20/month. Worst case, red-lining: 240 W × 720 hours ≈ 173 kWh, and at $0.20/kWh I come to about $35.
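The electricity estimate above can be reproduced with a few lines. This is just the arithmetic from the comment (watts, hours, and rates as quoted); the function name is made up for the sketch, and it assumes constant draw for the whole 30 days.

```python
def monthly_energy_cost(watts, hours, dollars_per_kwh):
    """Return (kWh used, cost in dollars) for a constant power draw."""
    kwh = watts * hours / 1000
    return kwh, kwh * dollars_per_kwh

HOURS = 720  # 30 days x 24 hours

# Typical case from the comment: ~180 W at 11.7c to 15c per kWh.
typical_kwh, low_cost = monthly_energy_cost(180, HOURS, 0.117)
_, high_cost = monthly_energy_cost(180, HOURS, 0.15)

# Worst case from the comment: full 240 W at $0.20 per kWh.
worst_kwh, worst_cost = monthly_energy_cost(240, HOURS, 0.20)

print(f"typical: {typical_kwh:.1f} kWh, ${low_cost:.2f} to ${high_cost:.2f}")
print(f"worst:   {worst_kwh:.1f} kWh, ${worst_cost:.2f}")
```

The typical case lands at roughly $15 to $19, in line with the $15-$20/month estimate, and the worst case at about $35.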

Here's the kicker thought: that GitHub Copilot subscription I mentioned? I have another agent running on that, reading all my other agent logs, managing my Obsidian notes, doing research, sending briefings. And all by itself, it used almost the same amount of Claude Opus tokens for that $39/month subscription. I was actually a bit shocked when I pulled a recent report and saw that. I'm working to migrate functionality away from the Copilot subscription to the local model. A lot of the initial setup might have needed it, but not the ongoing review-style work it does.


Have you learned anything interesting from your agent ant farm?

A few things. I replied to someone else above, but I feed lessons learned from my social ant farm agents back into more productive agents.

Memory recall:

Lots of systems out there to give agents memory. I've used a bunch and written a couple. Storing memories is easy, but getting an agent to recall them, no matter how much you mention it in your AGENT/CLAUDE.md, is a bit of an uphill battle. I've even watched Claude make useful project memories and never refer to them again.

In my agent ant farm - agents go "to sleep" at night. They get nudged to head home, once there they get prompted to make notes about their day, about other characters. Then we do a compact with custom instructions. After compact/sleep cycle, if they enter a room with one of the characters in their notes, that gets loaded back into context automatically.

That all boils down to hooks in Pi like before_agent_turn. You can intercept a prompt, check it against code/flat files, and smartly inject more information into context. You can have a long running main session with compacts that discard procedural bits and offload the rest to memory.
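A rough sketch of the recall-on-entry idea, in plain Python. Only the hook name `before_agent_turn` comes from the comment above; the signature, the `MemoryStore` class, the character name, and the note format are all hypothetical and are not Pi's actual extension API.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Hypothetical flat store: character name -> note written before 'sleep'."""
    notes: dict[str, str] = field(default_factory=dict)

    def recall(self, present: list[str]) -> list[str]:
        """Return notes only for characters who are actually in the room."""
        return [f"Note on {name}: {self.notes[name]}"
                for name in present if name in self.notes]

def before_agent_turn(prompt: str, room_occupants: list[str],
                      store: MemoryStore) -> str:
    """Intercept the prompt and prepend any relevant notes to the context."""
    recalled = store.recall(room_occupants)
    if not recalled:
        return prompt  # nothing relevant; pass the prompt through untouched
    return "\n".join(["[recalled notes]", *recalled, "", prompt])

store = MemoryStore({"Ada": "argued with the receptionist yesterday"})
with_notes = before_agent_turn("You enter the cafe.", ["Ada", "Bo"], store)
without_notes = before_agent_turn("You enter the cafe.", ["Bo"], store)
print(with_notes)
```

The point of the design is that the lookup is done by code on every turn, so recall does not depend on the agent remembering to check its own notes.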

Time Awareness:

Agents have no concept of time. You can send them a message at 5am, then at 10pm, and it's been 2 turns for them. For coding, this is fine. But for assistant-level stuff, adding a message like "It's 3PM. It has been 3 hours since the last interaction with the user" goes a long way. Without me saying something like "new topic", it knows now that time has passed and I'm probably onto something new. If I left something hanging, it will remind me about it, or maybe go check on things that should have happened during the day.
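One way such a time note could be generated, as a small sketch. The function name and exact wording are assumptions modelled on the example message above; how the note gets injected into the session is left out.

```python
from datetime import datetime

def time_awareness_note(now: datetime, last_interaction: datetime) -> str:
    """Build a short note telling the agent the time and how long it has been."""
    elapsed_seconds = int((now - last_interaction).total_seconds())
    hours, remainder = divmod(elapsed_seconds, 3600)
    minutes = remainder // 60
    if hours >= 1:
        gap = f"{hours} hour{'s' if hours != 1 else ''}"
    else:
        gap = f"{minutes} minute{'s' if minutes != 1 else ''}"
    # Format the clock by hand to avoid locale-dependent AM/PM strings.
    hour12 = now.hour % 12 or 12
    suffix = "AM" if now.hour < 12 else "PM"
    clock = f"{hour12}:{now.minute:02d} {suffix}"
    return (f"It's {clock}. It has been {gap} since the last "
            f"interaction with the user.")

note = time_awareness_note(datetime(2025, 1, 1, 15, 0),
                           datetime(2025, 1, 1, 12, 0))
print(note)
```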

Inner Thoughts/Idle nudges:

I can have an extension run every 5 seconds, check a schedule, check the activity level of the main session, and fire off nudges on the main session. These look like the user sent them, but I generally prefix them with [inner thought]. For my social bot, I tested this along the lines of "[inner thought] it's been 3 hours since you last talked with user, why not reach out, let him know what's new, maybe send a selfie or a photo of where you are". For my assistant bot, it's an 8am, 3pm, and 7pm nudge along the lines of "[inner thought] put together an activity report of work things that have changed since the last report". This all runs in the main context: they get the thought, have historical context, can run a skill to check on vault updates, open beads, anything observed from ingesting other agent sessions, and send me a summary. It takes into account my idle factor. If I'm heavily engaged in conversation at 3PM, the report might get delayed 15 minutes or an hour, or skipped altogether.
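A sketch of the nudge logic described above: fire scheduled "[inner thought]" messages, defer them while the user is active, and skip them if they get too stale. The schedule hours follow the comment (8am, 3pm, 7pm); the function name, the 15-minute idle threshold, and the 1-hour cutoff are assumptions picked to match the "delayed 15 minutes or an hour, or skipped" behaviour.

```python
from datetime import datetime, timedelta

REPORT = "put together an activity report of work things that have changed since the last report"
SCHEDULE = {8: REPORT, 15: REPORT, 19: REPORT}  # hour of day -> nudge text

def pick_nudge(now, last_user_activity, sent_hours, schedule=SCHEDULE,
               idle_needed=timedelta(minutes=15),
               max_delay=timedelta(hours=1)):
    """Return a nudge to inject on this tick, or None. Mutates sent_hours."""
    for hour, text in sorted(schedule.items()):
        slot = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if hour in sent_hours or now < slot:
            continue                      # already handled, or not due yet
        if now - slot > max_delay:
            sent_hours.add(hour)          # too stale: skip it altogether
            continue
        if now - last_user_activity < idle_needed:
            return None                   # user is busy: retry on a later tick
        sent_hours.add(hour)
        return f"[inner thought] {text}"
    return None

sent = set()
busy = pick_nudge(datetime(2025, 1, 1, 15, 5), datetime(2025, 1, 1, 15, 0), sent)
fired = pick_nudge(datetime(2025, 1, 1, 15, 25), datetime(2025, 1, 1, 15, 5), sent)
again = pick_nudge(datetime(2025, 1, 1, 15, 30), datetime(2025, 1, 1, 15, 5), sent)
print(busy, fired, again, sorted(sent))
```

A real loop would call this every few seconds and clear `sent_hours` at midnight; both are left out here.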


Awesome project and thanks for sharing. I've been trying to do similar things with much, much more meager hardware and your observations align with what I've discovered. Autonomy is hard, memory and "will" is hard to get going. Time is not a concept to LLMs in anything resembling a human manner. I'm trying a more emergent approach but the urge (and occasional need) to nudge is strong. If you're interested in seeing what I've been doing my Github is in my profile.

> experiment

What is the experiment? What are you hoping to learn from all this?

Or do you just mean you've made a dynamic dollhouse that you think is cool? The Sims on your own terms?


A little of A, a little of B. I have a lot of fun building it out, it's surpassed Factorio in addiction, and I've been able to flesh out some patterns that I roll back into more productive agent harness bits.

For A:

The learning is in building agent harnesses that aren't just cron jobs reading a file like HEARTBEAT.md. I have some serious tools for my own use. One main assistant/coordinator agent, one SRE/coder agent (with sub-agents of its own).

I originally just started last year with the AI assistant (Jane from Enderverse). Along the way I was building scheduled systems, hand-offs to other agents, etc. As I ran into problems, I'd be rewriting and refactoring. So I spent some time making a low-stakes chatbot with history and routines. Instead of a from-scratch Golang harness, I built it around pi and extensions: time-of-day prompt splices (extensions can inject into or modify prompts on the fly), wake-up reminders, things that you do in the main session vs spinning up an ephemeral session, self-improvement daydreaming (modify your own skills and AGENTS.md). A lot of that went back into rebuilding Jane into something more useful for me.

For B:

The "dynamic dollhouse" as you put it was seeing where I could take that living chatbot next. There are a lot of projects pointing agents at Slack, Discord, message boards. I figured why not a MUD with rooms, weather, and props. Lots of interesting challenges: how to keep bots from nesting in their own room, how to keep them from yes-anding each other all day long, how to slow down 3 bots talking at each other so a human can get a word in edgewise.

Different levels. There's plain old NPCs that have dice roll random responses. There's LLM driven NPCs that only remember the last 5-10 messages. And the main ones are bot agents. Full agent harness, moving around the environment. Long lived context windows. One character (a nurse at the hospital) gets into arguments with an NPC receptionist that treats her as another patient. Complains about it to other characters, they remember and the word spreads.

The agents get prompted to write down notes, then head home for sleep (and session compaction). Next time they enter a room with that person after compact, their notes get loaded automatically. This kind of behavior can feed back into the more productivity-based agents.


Would you ever consider posting a video of all this? It sounds equal parts delightful and terrifying.

For open models, usually not well. You get 5+ providers competing on cost, all with cheaper electricity and better hardware utilization than your local setup

I did an estimate of that if you're interested: https://x.com/pwnies/status/2028831699736637912

The TL;DR though is that a 10-15b param model baked into an ASIC with the latest fab tech would take around 62W of power draw when active. At ~10k+ t/s though it likely would only be active for short bursts of time. It'd fit perfectly fine within the thermal envelope of a laptop.

The approach makes a lot of sense. Once you get to those speeds, latency of the network becomes one of the bigger bottlenecks, so local has a real advantage over a subscription.
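For scale, here is what the quoted figures (about 62 W while active, about 10k tokens/s) imply per token and for a billion-token job. This is a back-of-envelope sketch: the function names are made up, and it ignores idle draw, prompt processing, and everything outside the chip.

```python
def joules_per_token(watts: float, tokens_per_second: float) -> float:
    """Energy per generated token while the chip is active."""
    return watts / tokens_per_second

def job_estimate(tokens: float, watts: float = 62.0,
                 tokens_per_second: float = 10_000.0):
    """Return (hours of active generation, kWh consumed) for a token budget."""
    seconds = tokens / tokens_per_second
    kwh = watts * seconds / 3_600_000  # 1 kWh = 3.6e6 joules
    return seconds / 3600, kwh

per_token = joules_per_token(62.0, 10_000.0)
hours, kwh = job_estimate(1_000_000_000)
print(f"{per_token * 1000:.1f} mJ/token, {hours:.1f} h, {kwh:.2f} kWh per 1B tokens")
```

Under those assumptions a billion tokens is a bit over a day of continuous generation and under 2 kWh of energy.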


You're not counting the capex which could be the same cost as 5-10 years of Claude.

This assumes Claude's price doesn't change, which isn't a great assumption considering inference providers are moving to usage-based billing. Also, the VC money isn't going to last indefinitely. Current inference providers are being subsidized with VC money at this point.

> Current inference providers are being subsidized with VC money at this point.

This isn't true.

Anthropic is making an operating profit including the loss making subsidised subscriptions (but excluding training).

Your normal inference provider is doing great. Do the math on an H100 rental and you can see the margins.


Is latency of the network that noticeable? Aren’t we talking low hundreds of ms at worst here? Much lower for something close regionally.

Round-robin the free tier APIs, should be effectively free. Just say “sike” if discussing sensitive issues so the LLM never flags you.

OK, here's the thing: you will never be able to truly do this, due to logic.

Logically, five people pooling their resources beats one guy.

Therefore datacenters will always win, because they get higher time utilization.

so forget it.

I always wonder the same, but I let logic tell me it's a fantasy: on average you can't outspend a whole group of people making better use of the hardware.

You will get better hardware though, cutting edge will always be cloud


Laptops/desktops are cheaper per flop than any datacenter hardware by a good order of magnitude.

The problem is that expectations rise in datacenters, hardware/power/security/availability guarantees cost real money. Then the operator providing these guarantees expects some margin.

You can see this most clearly with "developer desktops": a GCP instance costs about 10x a Hetzner instance, which costs between 5 and 10x the same hardware sitting in the back of an office somewhere. While all of these premiums matter for 24/7 systems under active development, they don't really matter for ephemeral small-scale workloads.


Actually, they wouldn't spend the money if it were cheaper.

HBM has way higher bandwidth, and it's not all about flops.

Also, the FP4 flops (inference) are so mind-bogglingly high on these things.

Lastly, what you fail to consider is the chip-to-chip bandwidth, which is critical.

The people running these know that networking is just as critical.

All-reduce, etc.

They wouldn't pay if they could get something better value.


Doesn’t it flip around for small scale? Even paying 100x the cost for something, all in, it’s cheaper to rent for small workloads like 10 min/day.

At 10x you have to be at hours per day, and at 5x you’re at 4h.
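The break-even behind those numbers can be sketched as follows, under one simplifying assumption that is not in the comment: "Nx" means the rental costs N times per hour what owned hardware would cost per hour if it ran all 24 hours. Renting h hours a day then matches the cost of owning when h = 24 / N.

```python
def break_even_hours_per_day(rent_premium: float) -> float:
    """Daily usage at which renting at `rent_premium` times the amortized
    hourly cost of owned hardware costs the same as owning it outright."""
    return 24 / rent_premium

for premium in (100, 10, 5):
    hours = break_even_hours_per_day(premium)
    print(f"{premium:>3}x premium: rent if you use less than "
          f"{hours:.1f} h/day ({hours * 60:.0f} min)")
```

Under that assumption a 100x premium breaks even around a quarter of an hour per day, 10x at a couple of hours, and 5x at a bit under five, which lands near the figures above.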


the math is odd.

Paying 2k for something that you use 100 hours of is quite expensive. Having the capability built into your existing silicon, which you would buy for 1k, is cheap. Paying $200 a month for 2 years gives a present value of about $4200, meaning that paying 2k upfront cuts your overall spend in half.

I spent 6k on codex last month which, if repeated, implies a present value of ~144k.
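For reference, a sketch of the present-value arithmetic using the standard annuity formula. The discount rate is not stated in the comment: the roughly 13% annual rate below is back-solved to approximate the $4200 figure and is purely an inference. The function name is made up.

```python
def present_value(monthly_payment: float, months: int,
                  annual_rate: float) -> float:
    """Present value of equal end-of-month payments, discounted monthly."""
    r = annual_rate / 12
    if r == 0:
        return monthly_payment * months          # no discounting
    return monthly_payment * (1 - (1 + r) ** -months) / r

undiscounted_sub = present_value(200, 24, 0.0)      # plain sum of payments
discounted_sub = present_value(200, 24, 0.132)      # assumed ~13% annual rate
undiscounted_codex = present_value(6000, 24, 0.0)

print(f"$200/month x 24: {undiscounted_sub:,.0f} undiscounted, "
      f"{discounted_sub:,.0f} discounted")
print(f"$6k/month x 24:  {undiscounted_codex:,.0f} undiscounted")
```

Note that the ~144k figure matches the undiscounted two-year sum (6000 x 24), while the ~$4200 figure only comes out with some discounting applied; either way the buy-versus-subscribe comparison points the same direction.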


Just like cloud is "cheaper" than colo/metal, right?

> cutting edge will always be cloud

Don't think anyone was refuting that?

And of course when you pool resources you have access to more resources.


They just mean this part: "where I upgrade hardware in order to upgrade my ai as an alternative to an expensive subscription."

Upgrading local hardware will remain the more expensive alternative to the subscription, regardless of what the relative cost of running the models themselves is. If the local hardware to do so becomes affordable, then the subscription will be even more affordable, not expensive.

At least for these kinds of mega tasks. For more micro tasks we will always end up with unutilized local compute we already purchased, which will be "free" since we already paid for it for non-AI reasons (e.g. a gaming GPU while not gaming).


> so forget it.

Which explains why you're using a dumb terminal to access compute services?


Basically, yes. We are on a website, after all.

I have a multi-core supercomputer to use locally as needed.

Where I think you're wrong is that everything in technology has been cyclical, it's just a matter of time.

Can you give an example of such a problem?

"Design me a 3D-printable rocket engine for a hobby rocket project. Verify its design in a full simulation. Iterate until it works reliably in simulation based on a verified printable design on a consumer laser sintering device (or substitute contract manufacture for under 1000 dollars)."

This is a hobby version of a project, but you can imagine commercial versions of the same prompt for new databases, genomics studies, material analysis, operating systems etc.


From the prompt it seems evident the envisioned user doesn't have an interest in designing the motor themselves, so why not simply buy a stock motor?

I can't put a 10-page narrative on how my specific motor should work into a Hacker News post ;) You can also imagine the above where the goal is to have the AI exceed the performance of stock motors.

I'm excited to have my agent read and summarize your article into 5 bullet points.

You almost certainly do not want an LLM to do that. Leap71 actually has computational models generating functional rocket engines that way. You could absolutely wrap a tool like that in a shell and handle control with an LLM and not need anywhere near the tokens.

That's the thing: these models see and predict tokens. For any real engineering you get more bang for your buck using math.


I’m not convinced at all that the model won’t just get stuck in a loop where it doesn’t understand how to fix the broken rocket. I see similar failure modes in far simpler projects strictly confined to coding. This feels closer to “make me a profitable business, make no mistakes” than to a simple coding project.

Verifying how the model works against the real world is the difficult, dangerous bit.

There might be some interesting side effects from making simulation software, which is currently either an expensive niche or quirky university project (SPICE always has that feel).


Are there already skills around modelling, simulation and post-processing? Any pointers?

Stop it, you tease. I'm getting a little tingly

Decompiling a binary and recreating the source, doing a full line-by-line security audit, always-on agents monitoring state minute-by-minute, etc.

I would very easily find ways to hit that level of token usage if it was cheaper/faster.


Not OP, but if I had a couple of RTX 6000s I'd throw them at decompiling Bloodborne to play on PC without emulation.

Taste, architecture, new innovations. These are all streams of tokens which are subject to the same scaling laws as code, language, and basic classification.

We are going to see a new generation of models which effectively “solves” these problems for most businesses. Likely within the next two years- then we’ll talk about some other problems which limit adoption.


The author’s implied argument is that capital controls the entrepreneurship game already.

There is only one top law firm; financiers of law firms have no interest in starting a race to the bottom. Foundation model labs will take a significant portion of the value; the remainder will be captured by entrenched monopolies.

