throwdbaaway · 2026-07-27T15:31:47 1785166307

They need to get a license from Moonshot to provide inference for K3. They probably have to follow the pricing set by Moonshot as well.

throwdbaaway · 2026-07-22T08:56:46 1784710606

It works, thanks to https://github.com/ikawrakow/ik_llama.cpp/pull/1911, which got merged in early June. However, there might still be some issue with the chat template.

alfiedotwtf · 2026-07-22T14:24:36 1784730276

Awesome, thanks!

Edit: it amazes me how fast ik_llama.cpp moves

throwdbaaway · 2026-07-19T23:09:30 1784502570

I suspect this is why DeepSeek had to introduce the 2x peak hours pricing. The price would be too low otherwise.

throwdbaaway · 2026-07-14T20:03:56 1784059436

Yeah antirez made a lot of big claims in that paragraph. Sounds like a case of AI psychosis.

throwdbaaway · 2026-07-10T16:24:13 1783700653

If you max out the ram, TG with q3 should be at least 10 t/s. And with dsa, it can still stay close to that number as the context grows.

throwdbaaway · 2026-07-06T22:52:55 1783378375

Seems like a pretty pointless post that still centers around output tokens.

In agentic coding, cached input tokens are 90% of the API "cost". Serving them doesn't require GPU compute, and DeepSeek has shown that it can be done 50~100x cheaper with MLA/CSA/HCA and a whole bunch of disks. This should collapse the margin.

nozzlegear · 2026-07-06T23:03:14 1783378994

Aren't the American AI labs desperately struggling to find a market beyond just agentic coding?

le-mark · 2026-07-07T00:25:21 1783383921

I have heard, but don't have first-hand knowledge, that at least one company (financial services BPO) has moved most of its previously manual processing to LLMs. The person I talked to wasn't forthcoming with any detail. This is what we'd expect to see though.

dgellow · 2026-07-07T06:49:11 1783406951

All AI labs. Not just Americans

throwdbaaway · 2026-07-06T23:05:43 1783379143

The current top comment in https://lobste.rs/s/ua1gxl/glm_5_2_coming_ai_margin_collapse correctly zoomed into cached input tokens, but landed on the opposite conclusion:

> That is, for your $100/month fee, you get $3600 equivalent of API usage. This is presumably because Anthropic has figured out some clever things to do with model routing and input caching, and also can subsidize with investor money and take a hit on their operating margins.

My take: this is exactly what Anthropic wants everyone to think. In reality, 90% of that $3600 is for cached input tokens, which can be made to cost next to nothing, as shown by DeepSeek.

throwdbaaway · 2026-07-06T23:49:41 1783381781

While we are all speculating, Boris kindly provided some guidance in https://news.ycombinator.com/item?id=47880089

> The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.

Using the current Opus pricing, that pre-lunch 900k tokens should roughly consist of:

720k input tokens = 0.72 x $5 = $3.6

180k output tokens = 0.18 x $25 = $4.5

900k 1h cached writes = 0.9 x $10 = $9

500M cached input tokens = 500 x $0.5 = $250

$267.1 in total, with 93.6% from cached input tokens. The portion that requires GPU compute is about 3% of the total.

Post-lunch, the 900k tokens should consist of:

900k input tokens = 0.9 x $5 = $4.5

900k 1h cached writes = 0.9 x $10 = $9

So Anthropic is fine with the $267.1 accumulated over 3~4 hours before lunch, but not fine with the $13.5 incurred immediately after lunch. Why?

The only plausible explanation is that the actual cost of caching is way less than the API pricing. If you use a coding plan, Anthropic doesn't really care about your cached input token usage. Indeed, they want you to show your ccusage screenshots. On the other hand, if you pay by API tokens, the margin on cached input tokens is huge.

Only when you do something that requires a lot of FLOPs, e.g. the post-lunch 900k input tokens, the cost becomes real.
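The arithmetic in this comment can be reproduced with a short script. This is only a sketch: the per-million-token prices and the token counts are the ones quoted in the comment, while the dictionary keys and the `session_cost` helper are made-up names for illustration.

```python
# Prices in USD per million tokens, as quoted in the comment (not verified here).
PRICE_PER_MTOK = {
    "input": 5.0,
    "output": 25.0,
    "cache_write_1h": 10.0,
    "cache_read": 0.5,
}


def session_cost(usage_mtok):
    """Return (per-category costs, total) for usage given in millions of tokens."""
    costs = {kind: amount * PRICE_PER_MTOK[kind] for kind, amount in usage_mtok.items()}
    return costs, sum(costs.values())


# Token counts in millions, taken from the pre-lunch and post-lunch breakdowns.
pre_lunch = {"input": 0.72, "output": 0.18, "cache_write_1h": 0.9, "cache_read": 500.0}
post_lunch = {"input": 0.9, "cache_write_1h": 0.9}

pre_costs, pre_total = session_cost(pre_lunch)
post_costs, post_total = session_cost(post_lunch)

# Share of the pre-lunch bill that comes from cache reads, and the share that
# the comment treats as needing GPU compute (fresh input plus output).
cache_read_share = pre_costs["cache_read"] / pre_total
compute_share = (pre_costs["input"] + pre_costs["output"]) / pre_total

print(f"pre-lunch total:  ${pre_total:.2f}")
print(f"post-lunch total: ${post_total:.2f}")
print(f"cache-read share: {cache_read_share:.1%}")
print(f"compute share:    {compute_share:.1%}")
```

Changing `cache_read` in `PRICE_PER_MTOK` shows how strongly the pre-lunch total depends on that one price.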

nl · 2026-07-07T09:18:24 1783415904

I wouldn't be too fixated on the specific numbers in that post.

Anthropic was extremely capacity constrained at that point. They still are but not to that extent.

I'd note that OpenAI offers 24 hour caching. I'd be surprised if Anthropic hasn't optimised their caching for Claude code too.

SemiAnalysis recently posted that their actual Opus usage works out at $0.99 because of caching.

The principles remain though.

Roark66 · 2026-07-07T14:00:19 1783432819

Recently I started getting messages from Claude Code (on a plan): "You're restoring an old session, are you sure you don't want to compress the context? This will use a substantial amount of your usage quota"

So it seems they do care.

throwdbaaway · 2026-07-07T16:49:57 1783442997

That's exactly what I said. They do care when FLOPs are involved. Restoring an old session with 900k tokens will require a lot of FLOPs to reprocess those 900k tokens.

Meanwhile, they don't really care if you use hundreds of millions of cached input tokens, which don't consume any FLOPs.

stingraycharles · 2026-07-06T23:25:04 1783380304

> MLA/CSA/HCA

Aren’t these techniques all “lossy” compression, and one of the reasons people complain about loss in quality as the context size grows larger?

throwdbaaway · 2026-07-06T23:29:04 1783380544

Indeed they are all lossy. Not sure how much they contribute to the quality loss in long context though. I got a 700k session with DSV4 Pro (official API), and the model was still coherent and didn't make any tool call error.

stingraycharles · 2026-07-07T01:29:44 1783387784

That’s a low bar though, and the least I would expect.

throwdbaaway · 2026-07-07T05:47:49 1783403269

Well I wouldn't call it a low bar, since some of the edits were quite complex. And 1M context in less than 6GB of VRAM is truly impressive, but somehow this gets way less attention than the crappy turbo quant from Google.

scope2093 · 2026-07-07T06:42:03 1783406523

I'd like to understand this please. Why would the 1M context be kept in VRAM if you're using DSV4 Pro through the API? Or did you refer to different sessions?

throwdbaaway · 2026-07-07T07:20:50 1783408850

Different sessions. With https://github.com/fairydreaming/llama.cpp/tree/dsv4, 1M context with DSV4 Flash takes less than 6GB of VRAM. I can't run DSV4 Pro, but it should take less than 9GB of VRAM for 1M context, based on the numbers shared in https://arxiv.org/html/2606.19348v1.

scope2093 · 2026-07-07T08:06:59 1783411619

Thank you for the links/docs. I'm quite excited to try it myself.

throwdbaaway · 2026-07-04T01:10:16 1783127416

And somehow they claimed that it is "lossless".

throwdbaaway · 2026-06-23T03:51:06 1782186666

On ZFS with zstd compression, I am getting 1.34x compressratio for the BF16 weights (across multiple models).

Here's the du output for GLM-5.2:

    $ du -s -BG /cube/models/zai-org/GLM-5.2/
    1099G   /cube/models/zai-org/GLM-5.2/

throwdbaaway · 2026-05-23T01:28:45 1779499725

And their disk-based caching is amazing. I got a long 700k context session spanning more than a week, with pauses in between that were longer than a day, and some rewinds mixed in as well.

Stats from pi:

↑400k ↓438k R432M 71.9%/1.0M

Half a billion tokens, $2.12
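As a rough check on those stats, the blended price per million tokens can be worked out in a few lines. This is a sketch under an assumption the comment does not confirm: that ↑ means uncached input tokens, ↓ means output tokens, and R means cache-read tokens in pi's status line. The variable names are made up for illustration.

```python
# Token counts in millions, read off the pi status line quoted above.
# ASSUMPTION: up-arrow = uncached input, down-arrow = output, R = cache reads.
uncached_in_m = 0.400
output_m = 0.438
cache_read_m = 432.0
total_cost_usd = 2.12  # total session cost reported in the comment

total_m = uncached_in_m + output_m + cache_read_m

# Blended price across all token kinds, in USD per million tokens.
blended_usd_per_m = total_cost_usd / total_m

# Fraction of all input tokens that were served from cache.
cache_hit_share = cache_read_m / (cache_read_m + uncached_in_m)

print(f"total tokens: {total_m:.1f}M")
print(f"blended price: ${blended_usd_per_m:.4f} per million tokens")
print(f"input served from cache: {cache_hit_share:.2%}")
```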

throwdbaaway · 2026-05-15T07:04:12 1778828652

Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test PP.