Hacker News | rao-v's comments

Wonder what the limit is as you add more 32 bit integers to the product. Just the primes over 32 bit?

If you're allowed to multiply as many 32-bit numbers as you want, the only numbers you won't be able to achieve by so doing are those with any prime factor larger than 2^32.

This is more than just the prime numbers. For example, a 41-bit prime can be multiplied by 16 and it will still fit into 64 bits.
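A scaled-down brute-force check of that claim, as a sketch: 4-bit factors and 8-bit products stand in for 32-bit factors and 64-bit products (my substitution, so the search finishes instantly), and the function names are made up for illustration.

```python
def reachable_products(factor_bits, result_bits):
    """All values below 2**result_bits that are products of factors below 2**factor_bits."""
    limit = 1 << result_bits
    factors = range(2, 1 << factor_bits)
    reachable = {1}
    frontier = [1]
    while frontier:
        x = frontier.pop()
        for f in factors:
            y = x * f
            if y < limit and y not in reachable:
                reachable.add(y)
                frontier.append(y)
    return reachable


def largest_prime_factor(n):
    """Largest prime factor of n by trial division (returns 1 for n == 1)."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:
        largest = n
    return largest


reachable = reachable_products(4, 8)
unreachable = {n for n in range(1, 256) if n not in reachable}

# The unreachable values are exactly those with a prime factor above 2**4.
print(unreachable == {n for n in range(1, 256) if largest_prime_factor(n) > 16})
print(17 in unreachable, 34 in unreachable)
```

In this toy version 34 = 2 * 17 is unreachable even though it is not prime, which is the same effect as the 41-bit prime multiplied by 16.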


What are you assuming about overflow? Three 32-bit numbers multiply out to 96 bits.

If you bring overflow into the mix things become a lot more complicated. You likely don't even need 32 bits, the numbers 2 and 3 might be enough (I don't know for sure or if there's a quick way to check).

Well, if you "bring overflow into the mix", what you get depends on your behavior when overflowing.

If you say that you want to be doing modular arithmetic instead of arithmetic, it doesn't look like 2 and 3 are enough. You're looking for a solution to

    2ᵃ * 3ᵇ ≡ n (mod 2⁶⁴)
If n is even, we can supply any number of 2 factors by fiddling with a. We can assume without loss of generality that n is odd and a = 0. Now we want

    3ᵇ ≡ n (mod 2⁶⁴)
for odd n.

If I'm reading wikipedia correctly, we know that this will fail for some n:

https://en.wikipedia.org/wiki/Primitive_root_modulo_n

> In symbols, g is a primitive root modulo n if for every integer a coprime to n, there is some integer k for which gᵏ ≡ a (mod n).

This is what we want, with g = 3, k = b, a = n₁, and n₂ = 2⁶⁴. Our restriction that n₁ (our n) is odd satisfies the requirement that a be coprime to n₂ (wikipedia's n, the modulus).

The article continues:

> a primitive root exists modulo n if and only if n is 4, pᵏ or 2pᵏ for some odd prime number p and some k ≥ 0.

2⁶⁴ does not satisfy this requirement and therefore there is no primitive root modulo 2⁶⁴. As such, 3 is not a primitive root modulo 2⁶⁴.
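A quick numeric check of this, as a sketch in Python (the helper name is mine). The odd residues mod 2⁶⁴ form a group of 2⁶³ elements, so 3 would need order 2⁶³ to reach every odd n. The small-modulus enumeration is only there to show which residues get missed.

```python
MOD = 1 << 64

# 3 would need order 2**63 to generate every odd residue mod 2**64.
# Its order is only 2**62: the first power below gives 1, the second does not.
order_divides_2_62 = pow(3, 1 << 62, MOD) == 1
order_divides_2_61 = pow(3, 1 << 61, MOD) == 1
print(order_divides_2_62, order_divides_2_61)


def powers_of_three(bits):
    """Set of all 3**b mod 2**bits (hypothetical helper, for small moduli only)."""
    mod = 1 << bits
    seen = set()
    x = 1
    while x not in seen:
        seen.add(x)
        x = (x * 3) % mod
    return seen


reached = powers_of_three(8)
odd_residues = set(range(1, 256, 2))
missed = odd_residues - reached

print(len(reached), len(odd_residues))
print(sorted({r % 8 for r in missed}))
```

Every power of 3 is 1 or 3 mod 8, so under this check any n that is 5 or 7 mod 8 (for example n = 5) has no solution b.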


What’s strange about how things have developed is that this report 12-18 months ago would have been a massive scandal and would have caused durable brand damage.

Now nobody will remember or notice.


It’s really worth distinguishing between old-fashioned student teacher distillation (i.e. at the level of layers, weights and distributions) and large-scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student teacher distillation; it’s just easier to train on the same giant corpus and then post-train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger, better LLM).


Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].

Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].

I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.

[1] Self-Distillation Enables Continual Learning [pdf] (25 comments):

https://news.ycombinator.com/item?id=48165265

[2] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

[5] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193


So first - these are terrific papers and I'd not seen some of them before.

Having said that, I don't think these are classic student teacher distillation from random weights (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".


A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.

Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights


Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.

To the previous poster's point, soft distributions are useful; even saving the top 10 logits is significantly more training signal than just the final token.
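To make the hard-target versus top-k soft-target difference concrete, here is a minimal sketch in plain Python. The function names, the toy logits and k = 3 are all made up for illustration; real pipelines work on tensors and usually add a temperature, which is left out here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_soft_target(teacher_logits, k):
    """Keep the k largest teacher logits and renormalise them into a distribution."""
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    top = order[:k]
    probs = softmax([teacher_logits[i] for i in top])
    return dict(zip(top, probs))


def cross_entropy(student_logits, target):
    """target maps token id -> probability; a hard label is just {token_id: 1.0}."""
    log_probs = [math.log(p) for p in softmax(student_logits)]
    return -sum(p * log_probs[i] for i, p in target.items())


teacher_logits = [4.0, 3.5, 1.0, -2.0, -5.0]   # toy values, vocabulary of 5
student_logits = [0.0, 0.0, 0.0, 0.0, 0.0]     # untrained, uniform student

hard = {0: 1.0}                                 # only the sampled token
soft = top_k_soft_target(teacher_logits, k=3)   # three weighted tokens

print(soft)
print(cross_entropy(student_logits, hard), cross_entropy(student_logits, soft))
```

The hard target only says "token 0"; the truncated soft target also says that token 1 was nearly as plausible and token 2 much less so, which is the extra signal per position being described.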

Could you share some recent articles or papers comparing both methods, especially for the language modelling case? I was not convinced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than the teacher.

I have preferred synthetic datasets since the first day I heard of distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of a performance loss (in the speech and language area).

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though


Yes absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this)

One may view pre-training as distillation.

The teacher is a corpus of text, and "the next token after the context" is found by looking up the context in the corpus: for each occurrence, the label is whatever followed it in the corpus, scaled down by the number of occurrences of the context. Unlike the usual teacher model in distillation, though, this teacher has nothing to say about contexts outside of the corpus.
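A toy sketch of that "corpus as teacher" view, assuming whitespace tokens and a fixed-length context (both simplifications of mine; the function name is hypothetical):

```python
from collections import Counter, defaultdict


def corpus_teacher(tokens, context_len):
    """Empirical next-token distribution for every context seen in the corpus."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1

    teacher = {}
    for context, followers in counts.items():
        total = sum(followers.values())  # occurrences of this context
        teacher[context] = {tok: n / total for tok, n in followers.items()}
    return teacher


corpus = "the cat sat the cat ran".split()
teacher = corpus_teacher(corpus, context_len=2)

print(teacher[("the", "cat")])     # soft label split between "sat" and "ran"
print(teacher.get(("a", "dog")))   # None: no label for an unseen context
```

The second lookup is the difference from a model teacher: a context that never appears in the corpus gets no distribution at all.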


Terrific read thanks for digging it up

What is the fastest you can transfer data from ~10 meters away using a modern phone front camera and screen? Surely 100 kB/s is slow?

Depends on the zoom. With this setup you can transfer about 0.1 B/s per pixel of 60 FPS video. So a 65" screen and 1080p camera at 10 meters away would max out at 2 kB/s with the normal lens (26mm equiv) or 45 kB/s with the tele lens (120mm equiv.)
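Those two figures can be roughly reproduced with a pinhole-camera estimate. This is a sketch under my own assumptions, none of which are stated above: a 16:9 screen viewed head-on, a 1920x1080 video frame, and "equiv" focal lengths taken relative to a 36 mm wide frame. The function names are made up.

```python
import math


def screen_pixels_in_frame(diagonal_in, distance_m, focal_equiv_mm,
                           cam_w=1920, cam_h=1080):
    """Approximate number of camera pixels covered by the screen."""
    aspect = cam_w / cam_h                      # assume the screen is 16:9 too
    diagonal_m = diagonal_in * 0.0254
    screen_h = diagonal_m / math.sqrt(1 + aspect ** 2)
    screen_w = screen_h * aspect
    # Width of the scene covered by the frame at this distance (36 mm wide frame).
    field_w = distance_m * 36.0 / focal_equiv_mm
    px_per_m = cam_w / field_w
    return (screen_w * px_per_m) * (screen_h * px_per_m)


def rate_bytes_per_s(pixels, bytes_per_s_per_pixel=0.1):
    """Throughput at the quoted 0.1 B/s per pixel of 60 FPS video."""
    return pixels * bytes_per_s_per_pixel


normal = rate_bytes_per_s(screen_pixels_in_frame(65, 10, 26))
tele = rate_bytes_per_s(screen_pixels_in_frame(65, 10, 120))
print(round(normal), round(tele))   # roughly 2 kB/s and 48 kB/s
```

Under these assumptions the screen covers only about 200x112 camera pixels with the normal lens, which is why the zoom matters so much.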

I'm cooking something faster, but whether I have time to spend on it depends on the job situation and funding.

Napkin math: QR codes encode 0.75 bits per module, each module needs about 3 pixels of camera resolution, and the temporal resolution is quite dodgy as well, maybe 0.25 * min(cameraHz, screenHz). So if everything is perfect, 44 kB/s at 60Hz per 500x500 pixel patch. I've seen ~250 kB/s when a 1920x1080@60 transfer is working well. At 4k@30, you might reach 0.5 MB/s. If you throw in the 2x subsampled UV channels to transfer data as well, you might get an extra 50%.
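The 44 kB/s figure comes out if the 500x500 patch holds one version-40 QR code (177x177 modules, so about 2.8 pixels per module). That reading is my assumption, not something stated above, and the function name is made up; the other numbers are taken from the napkin math.

```python
def qr_stream_bytes_per_s(modules_per_side, bits_per_module,
                          camera_hz, screen_hz, temporal_efficiency=0.25):
    """Napkin estimate of throughput for one QR patch shown as video."""
    bits_per_frame = modules_per_side ** 2 * bits_per_module
    usable_frames_per_s = temporal_efficiency * min(camera_hz, screen_hz)
    return bits_per_frame * usable_frames_per_s / 8


# Assumed: one version-40 QR code (177 modules per side) per 500x500 px patch.
per_patch = qr_stream_bytes_per_s(177, 0.75, camera_hz=60, screen_hz=60)
print(round(per_patch))          # about 44 kB/s

# Upper bound for a full 1920x1080 frame tiled with such patches.
patches = (1920 * 1080) / (500 * 500)
print(round(per_patch * patches))
```

The full-frame bound lands above the ~250 kB/s observed in practice, which fits the "if everything is perfect" caveat.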


That's worse than I thought! I was thinking about holding up two phones facing each other at 10 meters, but using something smarter than QR codes.

Hey, some of us are on hardware (gfx906-based Radeon MI50s with 32GB of stupidly fast VRAM and basically no compute) that runs inference significantly faster with Q_0 and Q_1 quants.

Vega... unfortunately kinda sucks.

It's not amazing at compute (yet is a member of the GCN family, which I have been a fan of since its inception) and ended up being too expensive for perf/$ and perf/watt.

The only thing it did was make Nvidia rush Series 10 out the door and make it too good. Nvidia has been unable to live up to the gen-to-gen uplift Series 10 did, all because AMD made Nvidia blink.

Basically, you're 2 gens too early. CDNA2/gfx90a is the minimum you need to get any meaningful performance out of inference, or maybe CDNA1/gfx908 if you really don't need to quantize at all.

BTW, I did suggest this elsewhere in this HN story, but have you tried just disabling KV quant entirely? That is a huge speed uplift for compute-poor users.

Also, llama.cpp's support for gfx906 is probably never going to be as good as it is for other cards, and good ROCm support for cards from before they rebooted the driver/stack team is probably never going to materialize. I don't see the point in hanging onto them.

Like, if I was in your place, replacing it with even a 9060xt, with half the RAM, would be a step up. They go for $450. People have been building dedicated inference machines with these and they've been amazing, just throwing 3 or 4 in and scaling VRAM to meet needs.


I'd have to try the KV cache trick but folks get pretty competitive speeds with the current 31B/27B dense models e.g. https://www.reddit.com/r/LocalLLaMA/comments/1tc9j6u/mi50s_q...

This would be a fun and popular project for the right sort of person


I’d love for this chart to also capture the loudest signals we’ve sent. Surely somebody must have accidentally broadcast a non-directional megawatt radio signal at some point, right?

Actually, hmm, maybe nukes are our brightest non-directional transmission?


My understanding is that, these days, a lot of advanced degrees held by teachers are in Education, not say Math or History.

I’d love to see this data recut by degree type.

Edit - wow we’re talking about 50-70% of the masters being in Education, Special Education or Admin fields. (Page 14: https://mhec.org/wp-content/uploads/2025/10/202510-MHEC-Grad...)

This data is basically telling us nothing about the value of a topical masters degree.


My wife is a teacher. She wanted to teach history, so she had to get a history degree with a specialization in education. But there were no jobs available, so she accepted a conditional as a special education teacher. That's what drove her to get a master's degree in special ed.

While teaching special ed she developed a fondness for teaching math. But she isn't allowed to take on a general ed math class because she doesn't have a "math endorsement", which would require her to go back to school again for basically another advanced degree in math. And she can't get a general ed job in history because it's too competitive and her years of experience make her too expensive compared to fresh blood.


I'd say that there is no such statistically significant data.

Practically nobody teaching K-12 has subject-matter masters degrees. It's just not part of the career trajectory. As unusual as a nurse having an M.A. in history or something. Yes, would occur on the margins of people changing course in life, but not the mainline.

Specifically, the question here is about the efficacy of pay-scale bumps for master's degrees in education. To your point (and my counter-point), teachers get a substantial pay bump* if they hold an M.Ed., but no bump if they hold a master's in their teachable areas.

For persons who can afford it in the moment, taking a one year or two or three year part-time M.Ed. after getting a few years teaching experience (an entrance requirement in most M.Ed. programs) can pay for itself over the next 2-5 years, then is all surplus for the rest of the career.

* - all of this varies a bit by jurisdiction but I think this is "the general case".


May your contexts always be short


Before I read some of the study, I thought that was relevant too, but each "step [was] conducted as an independent, single-turn session."


Good point!

