intothemild · 2026-05-30T09:59:49 1780135189

You should enable MTP now that it's available.

llama.cpp has had some massive updates in the last week or so.

npodbielski · 2026-05-30T14:05:01 1780149901

Yes, Qwen 3.6 MoE is hitting like 80-90 t/s on Strix Halo. On the R9700 I had like 170 t/s; it was not possible to keep up. But the MoE goes in circles very often. When that happens I switch to the dense model and get 20-30 t/s, but it is able to solve quite a lot of tasks.

alfiedotwtf · 2026-05-30T17:45:38 1780163138

For those speeds, I’m assuming Q4?

npodbielski · 2026-05-31T06:05:05 1780207505

UD-Q4_K_XL

intothemild · 2026-05-30T14:32:00 1780151520

I get 50-60 t/s tg on my R9700 with the dense model, using the Unsloth MTP quant UD-Q5_K_XL, K@8/V@4, 256k context.

Using Vulkan backend.

```
llama-server \
  -fa on -t 7 -ngl 999 --mlock --fit off --kv-offload --no-webui --metrics \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  -b 2048 -ub 1024 \
  -m /mnt/models/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q5_K_XL.gguf \
  --mmproj /mnt/models/unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf \
  -c 262144 --kv-unified -ctk q8_0 -ctv q4_0 \
  --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-ngl 99 \
  --alias unsloth/Qwen3.6-27B-MTP-GGUF \
  --temp 0.60 --top-k 20 --top-p 0.95 --min-p 0.00 \
  --presence-penalty 0.00 --repeat-penalty 1.00
```

intothemild · 2026-05-24T08:38:10 1779611890

Yes. I run local models (Qwen3.6-27B), and IMHO the massive level-up was the agents and skills files that I've worked on.

Basically I run a flow:

Brainstorming > Create Spec > Review Spec* > Create Plans > Review Plan* > Execute Plan (in subagents) > Review Against Plan > Code Review* > Open PR > Finish Plan (marks plan files done)

* Each review step marked with an asterisk uses a larger paid LLM, right now Deepseek V4 Pro. Having it do this catches a lot of small things, and now I'm effectively one-shotting any task I give it.

And it's not costing me much at all, just those three reviews. I could use a free model like Gemini but I'm happy with what I've got.
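For anyone who wants to picture the flow above as data, here is a minimal sketch. The stage names come straight from the list; the `Stage` class and the `"local"` / `"reviewer"` labels are hypothetical placeholders for illustration, not anything from the actual agents or skills files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    external_review: bool = False  # True for the steps marked with an asterisk


# Stage order as described in the comment above.
FLOW = [
    Stage("Brainstorming"),
    Stage("Create Spec"),
    Stage("Review Spec", external_review=True),
    Stage("Create Plans"),
    Stage("Review Plan", external_review=True),
    Stage("Execute Plan (in subagents)"),
    Stage("Review Against Plan"),
    Stage("Code Review", external_review=True),
    Stage("Open PR"),
    Stage("Finish Plan"),
]


def model_for(stage: Stage) -> str:
    """Route asterisked review stages to the paid model, everything else locally.

    "reviewer" and "local" are made-up labels, not real model or endpoint names.
    """
    return "reviewer" if stage.external_review else "local"


def paid_stages(flow):
    """Names of the stages that would hit the paid model."""
    return [stage.name for stage in flow if model_for(stage) == "reviewer"]


if __name__ == "__main__":
    for stage in FLOW:
        print(f"{stage.name:30} -> {model_for(stage)}")
```

Only three of the ten stages go to the paid model, which is why the cost stays low.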

lelele · 2026-05-24T21:43:46 1779659026

Would you mind sharing your HW configuration? Thank you.

intothemild · 2026-05-24T22:30:25 1779661825

Sure. It's just an old i7-8700 (non-K) with 64gb RAM, running Proxmox. But recently I put an AMD R9700 AI Pro in there, which is a 32gb inference-focused card; think of it as a 32gb version of a 9070 XT.

All the inference happens on that card, so the CPU/RAM is there for the other containers.

I'll eventually swap the motherboard and CPU for something better, so I can fit 1 or 3 more of those cards.

Why not NVIDIA? 32gb on team green means spending crazy money. And I can get 4 R9700s for the cost of one 32gb 5090.

128gb ... vs 32gb.

Akamant · 2026-05-24T11:21:23 1779621683

Right on target

intothemild · 2026-05-17T16:15:59 1779034559

I've spent the last month bringing in a small demo of what the future could be like: running Qwen, Gemma, and Deepseek behind LiteLLM so we can monitor token usage. Instead of some dumb-ass "tokenmaxxing", we're actively trying to get the cost of inference down and bring it in-house.

Boss is happy, very happy. We're rolling it out more widely now.

But this is the future.
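To give a rough idea of what "monitor token usage" can mean in practice, here is a small sketch that sums tokens per model from OpenAI-style response dicts (the `model` and `usage` fields). This is not LiteLLM's own API. The model names, prices, and sample numbers are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens as (input, output).
# The local model is treated as free at the margin.
PRICE_PER_MTOK = {
    "local-qwen": (0.0, 0.0),
    "hosted-reviewer": (1.0, 2.0),
}


def summarize_usage(responses):
    """Sum prompt and completion tokens per model."""
    totals = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0})
    for resp in responses:
        usage = resp.get("usage", {})
        bucket = totals[resp.get("model", "unknown")]
        bucket["prompt_tokens"] += usage.get("prompt_tokens", 0)
        bucket["completion_tokens"] += usage.get("completion_tokens", 0)
    return dict(totals)


def estimate_cost(totals, prices=PRICE_PER_MTOK):
    """Estimate spend in USD; models without a price entry count as free."""
    cost = 0.0
    for model, tokens in totals.items():
        price_in, price_out = prices.get(model, (0.0, 0.0))
        cost += tokens["prompt_tokens"] / 1e6 * price_in
        cost += tokens["completion_tokens"] / 1e6 * price_out
    return cost


# Made-up sample records, shaped like OpenAI-compatible responses.
SAMPLE = [
    {"model": "local-qwen", "usage": {"prompt_tokens": 1200, "completion_tokens": 300}},
    {"model": "local-qwen", "usage": {"prompt_tokens": 800, "completion_tokens": 200}},
    {"model": "hosted-reviewer", "usage": {"prompt_tokens": 500000, "completion_tokens": 250000}},
]

if __name__ == "__main__":
    totals = summarize_usage(SAMPLE)
    print(totals)
    print(f"estimated spend: ${estimate_cost(totals):.2f}")
```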

intothemild · 2026-05-16T23:42:26 1778974946

Same. Having experienced the growth of computing in those eras, the show itself had a very well researched yet very nostalgic sense of "oh yes. I'd forgotten about that".

intothemild · 2026-05-16T23:40:50 1778974850

The best part of Silicon Valley was that it had a very South Park quality to it, in that things that were actually happening at the time were parodied on the show.

intothemild · 2026-05-16T09:26:46 1778923606

Exactly. People love the TrackPoint because it's right there in the middle of the keyboard, and you don't have to move your hands.

Any variation of the TrackPoint where you have to move your hand away from the keyboard is a failure, IMHO.

intothemild · 2026-05-14T23:08:18 1778800098

Well, considering MTP support is being developed right now, there was a conversation around it that threw around the idea of separating the MTP model out of the main GGUF, like with mmproj. This was rejected.

Which I'm happy about. So given that decision, I don't think it's unreasonable to think that they might be open to including mmproj files in the GGUF.

Only issue I can think of is: which one? BF16, F16, etc.?

Philpax · 2026-05-14T23:20:44 1778800844

Quantiser's choice, IMO. They're best-placed to decide what compromise to make for their particular model.

intothemild · 2026-05-10T20:24:09 1778444649

That's already happening. Qwen3.6 and Gemma4.

Basically small and medium models that are crazy well trained for their sizes.

Then we have a lot of speculative decoding stuff like MTP and others coming to speed up responses, and finally better quantisation to use less memory.

Local LLM is the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If we were good with LLMs a couple months ago, we're good with the open models now.
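For anyone unfamiliar with the speculative decoding mentioned above, here is a toy sketch of the generic greedy draft-and-verify idea: a cheap drafter proposes a few tokens, and the main model keeps the longest prefix it agrees with plus one token of its own. This is an illustration only, not how llama.cpp or MTP is implemented. Both "models" are hypothetical stand-in functions, and a real implementation verifies the whole draft in one batched forward pass instead of calling the target token by token.

```python
TEXT = "the quick brown fox jumps over the lazy dog".split()
EOS = "<eos>"


def target_next(ctx):
    """Stand-in for the main model: replays a fixed sentence, then EOS."""
    return TEXT[len(ctx)] if len(ctx) < len(TEXT) else EOS


def draft_next(ctx):
    """Stand-in for the drafter: agrees with the target except at every third position."""
    return "???" if len(ctx) % 3 == 2 else target_next(ctx)


def speculative_step(target, draft, context, n_draft):
    """Draft n_draft tokens, then keep the prefix the target agrees with."""
    proposed, ctx = [], list(context)
    for _ in range(n_draft):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the bad draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # every draft accepted: one extra target token
    return accepted


def generate(target, draft, n_draft=3):
    """Run speculative steps until EOS; returns (tokens, number of steps)."""
    out, steps = [], 0
    while EOS not in out:
        out.extend(speculative_step(target, draft, out, n_draft))
        steps += 1
    return out[: out.index(EOS) + 1], steps


def greedy(target):
    """Plain one-token-at-a-time decoding, for comparison."""
    out = []
    while not out or out[-1] != EOS:
        out.append(target(out))
    return out


if __name__ == "__main__":
    tokens, steps = generate(target_next, draft_next)
    print(" ".join(tokens), f"({steps} steps)")
```

The output is identical to plain greedy decoding from the target. The speedup comes from needing fewer target steps than tokens whenever the drafter guesses well.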

krupan · 2026-05-10T21:30:30 1778448630

And how were those models developed and trained?

lelanthran · 2026-05-10T21:52:43 1778449963

> And how were those models developed and trained?

That's irrelevant to my decision to use local or not.

krupan · 2026-05-10T22:00:36 1778450436

That's not what this thread is about? We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has. Has it? I don't think so; those models are not fundamentally different from other LLMs.

lelanthran · 2026-05-10T22:10:23 1778451023

> We're saying some new breakthrough is needed, someone said it already has happened, and I'm asking if it really has.

I didn't read "and how were those models trained" as "Are we there yet?"

intothemild · 2026-05-11T06:06:57 1778479617

There's a percentage of people who love to question how the open models were trained. They are almost always going to try and make some argument about using the closed frontier models for distillation as some form of theft.

Just totally forgetting that the frontier models themselves stole an insane amount to get to where they are.

It's theft all the way across the board, and when someone tries to make the argument that the open models' theft is bad but Altman's or Amodei's theft is good, they are revealing a lot about themselves.

intothemild · 2026-05-06T13:32:39 1778074359

Don't forget to update the GGUFs you have, too. The templates in them were updated recently.

intothemild · 2026-05-02T07:36:49 1777707409

If only they flapped. Maybe they'd still be in the air.