Hacker News | dabockster's comments

Qwen HAS to be a part of the discussion here, even though Microsoft is a US based entity. Their 30b MoE models absolutely punch way above their weight when paired with the right harness program, and can be run on "Costco gaming computer" specs when configured correctly in llama.cpp.

Sorry Trump Administration, but while the US has been downloading more ram by throwing data centers at everything and burning up everyone's power and water, China has come out with what's effectively a prototype edge compute capable AI model - regardless of how they built it. And arguably I can tokenmaxx on it just fine at around 30-40 tokens/sec.

And also, ASICs are on the way. Imagine one of those with a heavy hitting model (MoE or otherwise, Qwen or otherwise) installed in a PCIe slot at 10k+ tokens/sec and 75 watts max (maximum wattage deliverable by the PCIe slot alone) for $300-400 USD each.

https://taalas.com/the-path-to-ubiquitous-ai/

ASIC demo here: https://chatjimmy.ai/

Sorry/not sorry to rip this whole thing to shreds. But I'm sick and tired of these inefficient LLMs being produced that seemingly can only be offered by subscription from a data center, when I'm running a full AI stack right now (model and all) on my computer at home on a 750 watt max power supply. Microsoft really needs to get with the program here and compete more with Qwen instead of just the US/EU entities.

Sincerely, your neighbor down in Tacoma. https://www.youtube.com/watch?v=V9jlo4Ht2YA&t=229s


> There should be no ability to "verify" a browser, and anyone should be able to emulate any browser.

Hard disagree. The AI industry has absolutely shredded the various anti-scraping and anti-botting social contracts that were in place prior to the covid pandemic. Like it's now common knowledge that robots.txt isn't a hard requirement and can be avoided entirely, for example. They have absolutely turned the open web into a dark forest.

Having a browser session able to be verified as untampered and/or "trusted" is probably going to be a thing going forward. Sucks a ton, but we all did this to ourselves.


> it's now common knowledge that robots.txt isn't a hard requirement and can be avoided entirely, for example

Was it ever not? It's a text file, not law.

> They have absolutely turned the open web into a dark forest.

Only if you have an ideological problem with people you don't like using the things you publish on the open web.

I'd say the web can be very open even without being copyleft. It makes some business models non-viable, but it doesn't prevent anyone from publishing what they want.

On the other hand, I don't think I would call something that preserves copyright at the cost of only admitting "approved/certified non-LLM scrapers" via attestation or similar "the open web".

> Having a browser session able to be verified as untampered and/or "trusted" is probably going to be a thing going forward. Sucks a ton, but we all did this to ourselves.

Who did what to whom?


Protocols like HTTP or formats like HTML were initially made to be machine-readable. You humans make your site machine-readable, publish on the internet and then get unhappy when machines start actually reading it.

Anyway, just put up a captcha or require a cryptocurrency payment if you are unhappy with bots, but several people unhappy about scraping are less important than a billion people unhappy about having their activity tracked.


You're looking at that pre-covid time with rose-tinted glasses. Half the reason sites like reddit or twitter offered free/open APIs was to ensure that the bots were being as efficient as possible rather than hammering the sites (the other half was altruistic, but that goodwill is a very small line item to an MBA). Scrapers got so much better at just going to what's presented to humans because these kinds of APIs are no longer common, so they had to. So now the lazy option is to skip checking whether a site offers an API at all, rather than checking and saving time / not worrying about maintenance by coding against the API.


Browser verification doesn't stop bots; it will just funnel even more money towards click farms, which use unmodified devices on racks.


> we all did this to ourselves

"We" meaning who?


We already live in that world. Google and Apple cooperate with vendors like Cloudflare to make, essentially, the PAT / WEI implementation that they wanted.


Another reason to criminally prosecute the AI industry.


Intel could position their cards as strong for certain workloads. They had AV1 support first to market, for example.


Thunderbolt is really an unsung hero here. It is surprisingly nice to be able to move various components around my desk that would have otherwise sat in a huge tower hogging all the PCIe slots they can find.


Agreed, I've been doing experiments and it's wild to me what "just works" in a secondhand eGPU case or music production PCIe boxes.

Dual 10G NIC cards, way cheaper than a comparable dongle! 36 HDDs in JBOD, absolutely! 12 optical drives, sure!


You don’t need it if you use llama.cpp on Windows, or if you compile it on Linux with CUDA 13 and the correct kernel HMM support, and you’re only using MoE models (which, tbh, you should be doing anyways).


What does MoE have to do with it? Aside from Flash-MoE, which supports exactly one model and only on macOS, you still need to load the entire model into memory. You also don't know which experts are going to be activated, so it's not like you can predict which ones need to be loaded.


With proper mmap support you don't really need the entire model in memory. It can be streamed from a fast SSD, and this is more useful for MoE models where not all expert-layers are uniformly used. Of course the more data you stream from SSD, the slower this is; caching stuff in RAM is still relevant to good performance.
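For anyone who hasn't played with mmap directly, here is a minimal Python sketch of the idea, not how llama.cpp does it. The "weights" file is a made-up stand-in where each page is filled with its own page number, and `make_fake_weights` and `read_slice` are hypothetical helper names for this example only.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def make_fake_weights(n_pages):
    """Write a stand-in 'model file': page i is filled with the byte i % 256."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(n_pages):
            f.write(bytes([i % 256]) * PAGE)
    return path


def read_slice(path, offset, length):
    """Map the whole file read-only, then copy out one byte range."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Mapping reserves address space but does not read the file.
            # Slicing faults in only the pages backing this range
            # (plus whatever readahead the OS decides to do).
            return mm[offset:offset + length]
```

Calling `read_slice(path, 10 * PAGE, 2 * PAGE)` on a 64-page file returns two pages of data without the process ever asking for the other 62. How much actually comes off the SSD is up to the kernel's page cache and readahead, which is the part that matters for performance.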


Okay, yes, you don’t need the entire MoE model in memory for it to function.

But you still need the working set of frequently used experts to actually fit in RAM, or at least stay cached. Expert routing happens per token, per layer. If those weights aren’t resident, you’re effectively pulling them from disk on the critical path of generation — over and over again.

That’s not “just slower,” that’s an order of magnitude slower. You’ll end up with constant page faults and page cache churn. And if swap is on the same device as the model, you’re now competing for bandwidth on top of that.

IMO the main benefit of mmap is the ability to reclaim cold pages during high memory-pressure events when the model isn't active.
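A toy way to see the working-set argument: treat RAM as an LRU cache of expert-layers and count how many routed lookups miss it. This is a rough sketch under loud assumptions. Routing here is uniformly random, which is a worst case (real routers are skewed, which helps caching), and `count_misses` and its parameters are invented for illustration.

```python
import random
from collections import OrderedDict


def count_misses(n_layers, n_experts, top_k, n_tokens, cache_slots, seed=0):
    """Count expert-layer lookups that miss an LRU cache of `cache_slots` entries."""
    rng = random.Random(seed)
    cache = OrderedDict()  # keys are (layer, expert), ordered oldest to newest
    misses = 0
    for _ in range(n_tokens):
        for layer in range(n_layers):
            # Routing happens per token, per layer: pick top_k distinct experts.
            for expert in rng.sample(range(n_experts), top_k):
                key = (layer, expert)
                if key in cache:
                    cache.move_to_end(key)  # hit: mark as most recently used
                else:
                    misses += 1  # miss: this would be a read from SSD
                    cache[key] = None
                    if len(cache) > cache_slots:
                        cache.popitem(last=False)  # evict least recently used
    return misses
```

With `cache_slots` at least `n_layers * n_experts`, each expert-layer misses at most once and everything after that is a hit. Shrink the cache below the working set and the miss count climbs toward one SSD read per lookup, which is the "on the critical path, over and over" case.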


You can do this on a Mac as well tho, right? So that 128 GB of unified memory becomes a cache for a very fast 1+ TB Apple SSD.


I think the advantage of Flash-MoE compared to plain mmap is mostly the coalesced representation where a single expert-layer is represented by a single extent of sequential data. That could be introduced to existing binary formats like GGUF or HF - there is already a provision for differently structured representations, and that would easily fit.
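To make "single extent" concrete, here is a small sketch comparing two hypothetical on-disk layouts. Neither is the actual Flash-MoE, GGUF, or HF format; the tensor names and sizes are placeholders. It counts how many separate contiguous byte ranges you have to read to fetch one expert-layer.

```python
def build_layout(n_layers, n_experts, tensor_sizes, coalesced):
    """Assign file offsets; returns (layer, expert, name, offset, size) tuples.

    tensor_sizes maps a tensor name to its size in bytes per expert.
    """
    layout, offset = [], 0
    for layer in range(n_layers):
        if coalesced:
            # All tensors of one expert sit next to each other.
            order = [(e, n) for e in range(n_experts) for n in tensor_sizes]
        else:
            # Grouped by tensor: every expert's copy of one tensor sits together.
            order = [(e, n) for n in tensor_sizes for e in range(n_experts)]
        for expert, name in order:
            size = tensor_sizes[name]
            layout.append((layer, expert, name, offset, size))
            offset += size
    return layout


def extents_for_expert(layout, layer, expert):
    """Merge one expert-layer's byte ranges into maximal contiguous extents."""
    ranges = sorted(
        (off, off + size)
        for (l, e, _name, off, size) in layout
        if (l, e) == (layer, expert)
    )
    merged = []
    for start, end in ranges:
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)  # extend the previous extent
        else:
            merged.append((start, end))
    return merged
```

With three tensors per expert, the coalesced layout gives one extent per expert-layer and the tensor-grouped one gives three, so a cold expert costs one sequential read instead of three scattered ones.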


This needs to be sold as the big ticket item for low level devs. Their chips are some of the most power efficient chips on the market right now.

Hoping they release a blade server version somehow.


Nvidia's recent GPUs are more power-efficient than Apple Silicon in raster, training and inference workloads.

A blade server would get cancelled just like the Mac Pro for exactly the same reasons: https://9to5mac.com/2026/03/02/some-apple-ai-servers-are-rep...


> Nvidia's recent GPUs are more power-efficient than Apple Silicon in raster, training and inference workloads.

I think you can do better than the proverbial Apples and Oranges comparison.

In terms of total system, "box on desk", Apple is likely to remain the performance per watt leader compared to random PC workstations with whatever GPUs you put inside.


Then ignore me, and go ask your local datacenter why Apple Silicon isn't on any of their racks.


Because they're stupid and only buy stuff that's "safe". Because nobody gets fired for buying IBM.


Apple releasing anything enterprise or "server" related would be a pretty big pivot - let alone blades.


CUDA 13 on Linux solves the unified memory problem via HMM and llama.cpp. It’s an absolute pain to get running without disabling Secure Boot, but that should be remedied literally next month with the release of Ubuntu 26.04 LTS. Canonical is incorporating signed versions of both the new Nvidia open driver and CUDA into its own repo system, so look out for that. Signed Nvidia modules do already exist right now for RHEL and AlmaLinux, but those aren’t exactly the best desktop OSes.

But yeah, right now Apple actually has price <-> performance captured a lot if you’re buying a new computer just in general.


> For example, when my Windows gaming machine comes out of hibernation my ethernet controller insists that there's no connection. I can't convince it otherwise except by disabling the device and re-enabling it. I can't figure out where I might find information that tells me why this is happening, so I just wrote a powershell script to turn it off and then on again. I bet some Windows IT dork could figure it out in 30 seconds

Windows and Linux dork here (heh). It has to do with how various computer manufacturers implemented the Sleep/Standby State (S3/S4), how they've resisted implementing a common standard at the hardware level, and how Microsoft eventually gave up arguing and patched around it with their own Modern Standby system in the S0 state.

https://learn.microsoft.com/en-us/windows-hardware/design/de...

Tbh, though, the only computer I've ever seen Hibernate work well on are Macs. Every x86 computer usually has some sort of issue with it, except for maybe business laptop models (eg HP's Elitebook line).


> Tbh, though, the only computer I've ever seen Hibernate work well on are Macs. Every x86 computer usually has some sort of issue with it, except for maybe business laptop models (eg HP's Elitebook line).

This has always been my experience, going back I'd say at least to the early 2000s on cheap laptops, and all the way back to the earliest days of sleep and hibernate on desktops, where sleep just doesn't matter that much.

When I started dabbling in boot code around 2006, I read a bunch of the specs and one of them was ACPI, which I only scratched the surface of.

I think until then it had just not occurred to me that a modern paged protected OS would even want to call into any code supplied with the computer, vs. having it come from a driver disk, or be built in to the kernel where everyone can see it.

The whole idea of a bytecode interpreter running random code supplied by a fly-by-night system builder is a little unsettling.


"If Apple Business were a real revenue source, if they charged luxury prices for a luxurious business support experience, they could pay for developers to fix their stuff. Instead, Apple Business is a free side hustle for Apple, a hobby."

I'm wrestling with something similar to this right now in Linux. The only real player that charges "enough" to have an "absolutely zero tolerance for base OS breakage" approach to OS development is Red Hat. Ubuntu LTS is more widespread but only really because it's $0 even for large businesses, and that's honestly reflected in it sometimes having hardware breakage during a version's initial two year mainstream support run. Having Windows's business backed level of "doesn't break" on hardware is rare on Linux.


Bigger one:

* Predictability - reducing the number of unknown factors that could cause a person to have issues using their computer. Reminds me of how a secretary I supported was somehow able to install Google Desktop back in the day, and how that caused a massive argument between my boss and theirs when their computer needed to be re-imaged. Most IT approved programs are known to store user data in known locations on a computer, which makes backups and restorations very easy. Stuff like Google Desktop did not do that, which means likely breaking someone's workflow in the re-image process.

