Yes, Qwen 3.6 MoE hits around 80-90 t/s on Strix Halo. On the R9700 I got around 170 t/s, which was too fast to keep up with. But the MoE goes in circles very often. When that happens I switch to the dense model and get 20-30 t/s, but it is able to solve quite a lot of tasks.
Yes. I run local models (Qwen3.6-27B), and IMHO the massive level-up was the agents and skills files that I've worked on.
Basically I run a flow:
Brainstorming > Create Spec > Review Spec* > Create Plans > Review Plan* > Execute Plan (in subagents) > Review Against Plan > Code Review* > Open PR > Finish Plan (marks plan files done)
* Each review step marked with an asterisk uses a larger paid LLM, right now Deepseek V4 Pro. Having it do this catches a lot of small things, and now I'm effectively one-shotting any task I give it.
And it's not costing me much at all, just those three reviews. I could use a free model like Gemini but I'm happy with what I've got.
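For anyone who wants to picture the routing, here's a rough sketch of that flow as data. The stage names and the two model names are the ones from above; the `Stage` class and helper functions are made up purely for illustration and aren't from any agent framework.

```python
from dataclasses import dataclass

LOCAL_MODEL = "Qwen3.6-27B"        # local model mentioned above
REVIEW_MODEL = "Deepseek V4 Pro"   # paid reviewer mentioned above


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one step in the flow."""
    name: str
    paid_review: bool = False  # True for the steps marked with an asterisk


PIPELINE = [
    Stage("Brainstorming"),
    Stage("Create Spec"),
    Stage("Review Spec", paid_review=True),
    Stage("Create Plans"),
    Stage("Review Plan", paid_review=True),
    Stage("Execute Plan (in subagents)"),
    Stage("Review Against Plan"),
    Stage("Code Review", paid_review=True),
    Stage("Open PR"),
    Stage("Finish Plan"),
]


def model_for(stage: Stage) -> str:
    """Send asterisked review steps to the paid model, everything else local."""
    return REVIEW_MODEL if stage.paid_review else LOCAL_MODEL


def routing_table(pipeline):
    """Return (stage name, model) pairs in execution order."""
    return [(stage.name, model_for(stage)) for stage in pipeline]


def paid_stage_count(pipeline) -> int:
    """Count how many steps hit the paid model."""
    return sum(1 for stage in pipeline if stage.paid_review)


if __name__ == "__main__":
    for name, model in routing_table(PIPELINE):
        print(f"{name:30s} -> {model}")
```

Keeping the paid flag on the stage itself makes it easy to see at a glance that only three of the ten steps cost anything.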
Sure. It's just an old i7-8700 (non-K) with 64GB RAM, running Proxmox. But recently I put an AMD R9700 AI Pro in there, which is a 32GB inference-focused card. Think of it as a 32GB version of a 9070 XT.
All the inference happens on that card, so the CPU/RAM is there for the other containers.
I'll eventually swap the motherboard and CPU for something better, so I can fit 1 or 3 more of those cards.
Why not NVIDIA? 32GB on team green means spending crazy money, and I can get 4 R9700s for the cost of one 32GB 5090.
I've spent the last month bringing in a small demo of what the future could be like: running Qwen, Gemma, and Deepseek behind LiteLLM so we can monitor token usage. Instead of some dumb-ass "tokenmaxxing", we're actively trying to get the cost of inference down and bring it in-house.
Boss is happy, very happy. We're rolling it out more widely now.
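To give a feel for the kind of token accounting I mean, here's a minimal sketch. It assumes each response carries an OpenAI-style `usage` object with `prompt_tokens` and `completion_tokens`, which is what you get from an OpenAI-compatible endpoint. The model labels and prices are invented placeholders, and this is not LiteLLM's own spend tracking, just the arithmetic behind it.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens: (input, output).
# These are placeholders for illustration, not real quotes.
PRICES_PER_MTOK = {
    "local-qwen": (0.0, 0.0),
    "hosted-reviewer": (1.0, 2.0),
}


def tally_usage(records):
    """Sum prompt/completion tokens per model.

    `records` is an iterable of (model_label, usage_dict) pairs, where
    usage_dict has OpenAI-style 'prompt_tokens' and 'completion_tokens'.
    """
    totals = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0})
    for model, usage in records:
        totals[model]["prompt_tokens"] += usage["prompt_tokens"]
        totals[model]["completion_tokens"] += usage["completion_tokens"]
    return dict(totals)


def estimate_cost(totals, prices=PRICES_PER_MTOK):
    """Turn per-model token totals into an estimated USD cost per model."""
    cost = {}
    for model, t in totals.items():
        price_in, price_out = prices[model]
        cost[model] = (
            t["prompt_tokens"] * price_in + t["completion_tokens"] * price_out
        ) / 1_000_000
    return cost


# Example records, as they might be collected from logged responses.
SAMPLE_RECORDS = [
    ("local-qwen", {"prompt_tokens": 1000, "completion_tokens": 500}),
    ("hosted-reviewer", {"prompt_tokens": 2000, "completion_tokens": 1000}),
    ("hosted-reviewer", {"prompt_tokens": 3000, "completion_tokens": 0}),
]
```

Calling `estimate_cost(tally_usage(SAMPLE_RECORDS))` gives a per-model cost you can watch over time, which is the number that matters when deciding what to move in-house.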
Same. Having experienced the growth of computing in those eras, I found the show very well researched yet very nostalgic, with a constant sense of "oh yes, I'd forgotten about that".
The best part of Silicon Valley was that it had a very South Park quality to it, in that things that were actually happening at the time were parodied on the show.
Well, MTP support is being developed right now, and the discussion around it threw around the idea of separating the MTP model out of the main GGUF, like with mmproj. This was rejected.
Which I'm happy about. So given that decision, I don't think it's unreasonable to think that they might be open to including mmproj files in the GGUF.
Only issue I can think of is: which one? BF16, F16, etc.?
Basically small and medium models that are crazy well trained for their sizes.
Then we have a lot of speculative decoding stuff like MTP and others coming to speed up responses, and finally better quantisation to use less memory.
Local LLM is the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If we were good with LLMs a couple months ago, we're good with the open models now.
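If speculative decoding is new to you, here's a toy sketch of the greedy version: a cheap draft proposes a few tokens, the big model checks them, and you keep the longest prefix the big model agrees with. The two "models" here are made-up functions over integers, not real LLMs, and as far as I understand it MTP differs in that the draft comes from extra prediction heads in the model itself rather than from a separate model.

```python
def target_next(tokens):
    """Toy 'large model': always counts up modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy 'draft model': same as the target, but wrong right after a 6."""
    return 0 if tokens[-1] == 6 else (tokens[-1] + 1) % 10


def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain one-token-at-a-time greedy decoding, for comparison."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target, draft, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding sketch.

    Returns (tokens, target_passes). Each loop iteration stands in for ONE
    batched forward pass of the target model in a real implementation.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft proposes k tokens autoregressively (cheap).
        proposed = []
        for _ in range(k):
            proposed.append(draft(tokens + proposed))
        # 2. The target verifies every proposed position.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target(tokens + proposed[:i])
            if proposed[i] == expected:
                accepted.append(proposed[i])
            else:
                accepted.append(expected)  # take the target's correction and stop
                break
        else:
            # All k accepted: the same pass also yields one bonus token.
            accepted.append(target(tokens + proposed))
        tokens.extend(accepted)
    return tokens[:goal], target_passes
```

The point is that the output is identical to what the big model would have produced on its own; the only thing that changes is how many big-model passes it takes to get there.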
That's not what this thread is about? We're saying some new breakthrough is needed, someone said it has already happened, and I'm asking if it really has. Has it? I don't think so. Those models are not in any way fundamentally different from other LLMs.
There's a percentage of people who love to question how the open models were trained. They are almost always going to try and make some argument about using the closed frontier models for distillation as some form of theft.
Just totally forgetting that the frontier models themselves stole an insane amount to get to where they are.
It's theft all the way across the board, and when someone tries to make the argument that the open models' theft is bad but Altman's or Amodei's theft is good, they are revealing a lot about themselves.
llama.cpp has had some massive updates in the last week or so.