I'm a Norwegian, and I use the national library almost every day for searching through texts. They have truly one of the best working user interfaces (and functionality) for searching through the massive amounts of text.
It's really fantastic. I just wish there were fewer restrictions on the content that is accessible.
(a lot is only accessible from Norwegian IP addresses, so it's one of the main reasons I maintain a VPN as I'm Norwegian but live in the UK; a second set is only available from the IP addresses of libraries or research institutions - still huge amounts that are generally available, though)
My biggest gripe with it is the restrictions, indeed.
When searching through the closed newspapers, you have to apply for access manually, which gives you 8 hours of access. Great. Only that the access is seemingly manually granted - so if you apply at 16:05 on a Friday, chances are you won't get any access until 9-10 the next Monday.
With that said, I do understand why it is like that. If people could apply via API, and get instant access, they would probably just stop buying newspaper subscriptions.
You can access quite a bit directly. Check out nb.no (or https://www.nb.no/en/ for an English version of the page, but of course most of the works are in Norwegian)
There is an escalating series of restrictions, basically:
* Available for everyone.
* Available from a Norwegian IP -> just requires a VPN.
* Available from Norwegian libraries
* Available under "special conditions". This would mean from a participating research institution or university, or similar.
Pretty much everything that is out of copyright falls in the first category. The second and third categories have a bunch of copyrighted material where the copyright holders have granted limited usage rights. A bunch of newspaper archive material that is still under copyright (but sadly not the biggest ones) is available from Norwegian IPs, for example.
How true is this statement:
"He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."
I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.
If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state of the art models.
I task GPT/Claude with researching stuff that pertains to very specific cultural or legal aspects in French politics, on a daily basis.
Even though French is a way more common language globally than Norwegian, these models still haven't figured out that, no matter the language I myself speak to them (German or English depending on my mood) their web searches need to be done in French to return reasonable results. I have to remind them every time lest they come back with "uh, didn't find anything relevant, here take some hallucinations instead."
So, given the anglo-centrism of current models, my confidence in American providers giving any shits about non-american users/use-cases is pretty low. And lower the smaller the language community is.
I've noticed that it also imposes american moral judgements on certain things, even though it reasons (sometimes) in the native language.
I was trying to work out how and when to use swear words, and the relative power index of them. it translated english swear words into the target language then lectured me on not using them.
It took a bunch of prodding for it to actually think as the target language to then get the (mostly) correct response.
Would be curious about the model and the prompt for this.
Not kidding at all. I had a similar issue with a project where I needed to classify images into specific demographics, and Gemini, while capable, was entirely not going to do the task… until in my JSON response I left room for it to tell me why this was not a good idea and why it was culturally insensitive. Then boom… full JSON array: hair color, eye color, skin color, fitness level, likely ethnicity, likely country of origin, and about 10 other values.
You’re probably wondering what on earth I was working on. I was matching Ai gen headshots to Ai voices so that in an app the voice picker had human (Ai) faces.
I have the opposite problem. I often have to ask ChatGPT about things related to Norway and I have to constantly correct it when it keeps switching to responding in Norwegian no matter how many times I tell it to only answer in Norwegian when I request it.
Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.
If you’re doing this on a daily basis, then you should have an AGENTS.md that accumulates directional instructions like this.
This is how you use the tool correctly.
There’s this weird pattern I’ve noticed where people expect LLMs to require zero effort or proficiency on their part, and when the LLM isn’t perfect without it, of course it wasn’t; LLMs suck.
The issue is that French, Italian, African, Japanese people shouldn't have the inconvenience of instructing the LLM tool to get the basic facts about their own culture. They should use an LLM that has already been trained like that by default. Nobody has an obligation to use a tool that thinks it is talking to an American. If I go to Google for example I want to get facts about my own country in my own language.
Wouldn't those people be asking the questions in their own language in the first place? The model will reply in the language you use. This thread is about people asking for information about a language that is not the one they are messaging the LLM in
Even if the model will reply in my language, I often notice it searching in english. Or thinking in english. There's always something lost in translation. Sometimes it's just minor nuances. Other times it mangles the legal facts with those of other countries.
This sounds like the problem of people calling "911" as the emergency number which they see in so much US-American media but which is not the emergency number in their own country.
I remember being bored as a teenager on a family holiday to New Zealand in the 1990s, so I went and dialled 911 from a payphone to see what would happen-I got a recorded message saying that in New Zealand, the emergency number isn’t 911, it is 111. Dialling 000 (the Australian emergency number) produced a similar recorded message.
They always sound like an obnoxious American tourist talking through a translator, the chatbot training dataset is the same and foundation models are always built with >50% American English data for some reason.
> Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.
Most ordinary people will just use their native language and they have no way of knowing that the model always reasons in English and therefore is strongly biased toward using English search terms. So they don't know they have to remind the model to search in their local language.
If you ask in French, it searches in French, right?
I have the opposite problem, where I'll ask in English, about something in a foreign country, the results it finds will all be in that foreign language, and the LLM will switch languages and respond in that language (which I don't speak).
So then I have to ask it "can you repeat that in English please."
I keep waiting for the new GPT-Definitely-AGI-For-Real-This-Time to fix it but it's still there.
> If you ask in French, it searches in French, right?
not necessarily. i often prompt Claude in German and then see the reasoning happening in English. of course it will eventually reply in German, but that does not mean that the tooling in the background was using German.
Same for me - I mostly ask stuff in English but sometimes add specific terms or names in Japanese as needed. My Japanese is intermediate, but it will often switch immediately and reply only and entirely in Japanese. I'm pretty sure they have a system prompt with hairline triggers for foreign languages BECAUSE of the overrepresentation of English in the training corpora.
> their web searches need to be done in French to return reasonable results.
I wonder how much of this is also just the search engine's region setting.
It's a big problem I regularly have with Google. I almost always want English language, US-centric results, so I have my region set to the US. But occasionally I want results relevant to my actual country, and even searching in my native language usually yields much worse results than just opening an incognito tab and letting it default to my real location.
What incentives does OpenAI have to make sure the AI actually works well with Norwegian beyond capturing a (small) Norwegian market? What incentives do they have to take Norwegian values into consideration, or to preserve Norwegian culture into the future? The matter is also a question of national sovereignty, so to simply release the data and nicely ask foreign companies to solve the problem for you, would be a fool's move
It's also a bit funny because Norway definitely has enough money to hire a team of Anthropic's best to go out there and train them a model that does whatever they want. They probably have enough money to fund their own Anthropic competitor.
I highly doubt that hiring people who don't even speak the language would result in a better model for Norwegian. If anything, they could pay Anthropic for some tips and tricks for training. But that does not seem necessary as Deepseek & co detail everything for free
Considering the fact that the US is complaining about Norway putting too much money into the US market, imagine what would happen if all that money was spent in Norway. It would be chaos.
It was tried in early 1980s and nearly drove any non oil-related industry in the country extinct.
Norway has a manpower bottleneck. The UK had spent its oil windfall domestically and it barely registered. But for a nation of then some 4 million the economy melts down with so much monetary mass.
So blaming population is a cheap excuse that doesn't hold water. Especially that you can always import the skilled people you lack, when you have virtually unlimited money and some of the highest standards of living in the world.
> wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available?
Only if you believe other people will value that enough to expend the effort necessary to use it. If you believe other people will see it as low value and ignore it then you'd be better off doing the training yourself in order to guarantee it happens.
There's also a secondary benefit that your team doing the work will learn some useful skills while they do it.
Yeah, was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high quality transcriptions and translations of the stories currently described only in Norwegian into English, and made it all public. I guess there still would be a worry that it'd be counted as "less important" compared to other history, news and culture about other countries.
Oddly enough, my wife was recently involved in a project to translate historical crime novels from Norwegian; since all the available late 20th century Scandinavian crime novels have already been translated and turned into popular TV series, the plan was to go further back. Into the 1930s. The first cut was done with LLMs, but encountered the problem that (a) Norwegian itself has changed noticeably since then, in both major dialects, and (b) the machine translation deteriorated on large sections, resulting in entirely missing paragraphs and pages in a few places. Not to mention the usual translation issues (what police role does lensman map to?) and localisation (to what extent should the casual antisemitism be left in or removed?)
Translation is never a bijective process. It's never quite the same experience in translation as it is in the original, due to the cultural differences between reader and writer. Larger in this case because 1930s Norway is very different even from 2020s Norway.
Ultimately this was not a success due to marketing difficulties; it is very difficult to get a book noticed.
Sorry if I was unclear, I didn't want to give the impression I think translations or even transcriptions in some cases is easy, or without problems, or not painstakingly time-consuming, it very much is.
I just think building an LLM from scratch is even harder, with more potential problems that are harder to solve, more time-consuming and even more resource-intensive.
Yes, why wouldn't it be easier to transcribe and translate, skills humanity had for centuries, compared to LLMs that we've only learnt to build these last few years, and even require a frikken computer to do? Of course one of these is harder than the other...
Look at it from this lens: translating and transcribing these stories hasn't happened for the centuries they existed, while as you point out the skills were always there. In contrast LLMs have been here for a few years at most and everyone and their dogs are trying to get in on the "race".
With absolutely no insight into why, which one has better odds to happen first is obvious to me.
Sure, it isn't as "hot" to translate stuff as it used to be some hundreds of years ago, and building LLMs surely is "hot" today, I don't doubt more people are attempting to build LLMs today than translating huge datasets, especially if we narrow the two to exclusively "In Norwegian".
Having insights into both translations, transcriptions and attempting to build LLMs myself, I'm fairly sure which effort would be successful first, regardless of how many attempt it first.
Copyrights and statutes don't allow them to do that. The mandate of the National Library maybe permits them to make an LLM though (though I won't at all be surprised if someone sues them anyway).
Permissions, probably. Copyrights and statutes. Knowing the librarians, unfortunately the prestige of their job is more vested in denying you access than giving you access.
I mean it's their job to give people access to information, and they certainly do, but the mark of a professional, in their eyes, is guarding information. It's much more embarrassing for them professionally to give too much access than too little.
LLM training gives them a "respectable" way of bypassing that and giving the world their information (which, in fairness, they probably all really want to do if they could).
If they wanted to, they all have scanners and access to information on how to create torrents. Setting the information free isn't complicated, so it'd seem most of them do not want to.
Where do you seed a 60 petabyte torrent? I'm sure some choice cuts of what individuals feel is important have made it to Anna's, but I don't think refusal to go on a full data liberation spree is evidence they don't care.
Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines nor language models know.
Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.
Yep in the article it says ..the National Library .. has the single largest digital collection of Norwegian books, newspapers, web pages .. it is entitled to receive copies of every published book and broadcasted content. Its legal deposit mandate in this area extended beyond books, as it was duty-bound to collect and preserve all of Norway’s cultural heritage .. an agreement with Norwegian newspapers permitted LLM training on copyrighted content.
It is just copyrighted data, that is harder to get a hold of. All the copies are available to anyone to use if they just read it. Copyright makes other uses complicated. I wonder if the whole Creative commons debate was a mistake, you can never fix copyright in a digital world.
Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.
Of course they speak swedish. But often, they do not reason in Swedish and do not search in swedish. Swedish makes up a tiny fraction of training data, while the vast majority is English, from the US. Which means the answers will always have a bias towards US culture, even if you ask in Swedish and the LLM answers in Swedish.
While Google does a good job with language support in their models, GPT-5.5 can't write proper Norwegian. It's even making up words that do not exist.
different models have been very different in this way.. almost ten years ago the French made a very large effort to capture languages.. the release notes I read at the time IIRC had quite a few languages from South Asia / India, and in Africa. The language that was prominently missing was German IIRC. I cannot say for the 2025-2026 models since so much has happened.. but models are not equal.
Maybe it can at least write like a Norwegian instead of just English-translated-into-Norwegian. It would be interesting to see if they try something like the experiments in https://arxiv.org/pdf/2507.22445 on it.
Current-best models are pretty fluent at major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might be even better sometimes. However English patterns can subtly leak into native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.
>Current-best models are pretty fluent at major languages and cultures
strong disagree on that one. As a German interacting with ChatGPT, even in German it gives me the feeling of talking to the Pluribus people, which reminds me of an anecdote of Walmart failing in Germany because people were freaked out by the constantly upbeat, smiling employees.
Understanding a culture is a very different task than translating the syntax of a text, and these systems might be capable of syntactic fluency but they do not really understand culture. You have to metaphorically abuse these models until they stop sounding like the crossover of a HR department person and a Mormon missionary
I fully agree. I'm Swedish and have recently used GPT to help me draft some cover letters in Swedish. Even with all the mandatory personality tweaks and prompting, it always seems to default to highly florid and self-congratulatory Americanisms if I'm not careful. It's very subtle.
I do understand where proponents of language equivalency are coming from. LLMs seem to be extremely good at answering simple, one-shot type questions and mechanical 'low-level' translations for most languages. I feel like as soon as you introduce complex chains of thought or multi-step cross-linguistic tasks, minor imperfections stack and become magnified, just as with coding tasks or context rot.
yeah and alignment is all about how to be less evil which is no easy job... I can just imagine a Chinese LLM rendering 1989 Tiananmen Square as an incident orchestrated by the CIA which the CCP successfully thwarted etc etc
As the article explains, Norway's National Library has a database of practically everything published and broadcast in Norwegian going back many decades. From the way the dataset is described in the article, it does not sound like OpenAI et al. would have easy access to it in its entirety.
> The Olivia system is an HPE Cray Supercomputing EX system, with 448 GPUs and 64,512 CPU cores.
Training a sovereign LLM with this meager hardware as opposed to a LORA on some open source model seems like a huge mistake and a potential red flag.
There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.
Which begs the question, whose money are they wasting - and why?
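For context on the LoRA alternative raised above, here is a minimal sketch of why it is so much cheaper than full training: the base weight matrix stays frozen and only two thin matrices are learned. The layer width, rank, and alpha are hypothetical values picked for illustration, not anything from the article or from the library's setup.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def full_params(d_in, d_out):
    """Trainable parameters when updating the whole weight matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer: 4096 x 4096 projection with a rank-16 adapter.
D, RANK = 4096, 16
trainable_fraction = lora_params(D, D, RANK) / full_params(D, D)
print(f"LoRA trains {trainable_fraction:.2%} of this layer's parameters")

# Tiny forward pass: identity base weights plus a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # 1 x 2
B = [[1.0], [2.0]]      # 2 x 1
print(lora_forward(W, A, B, [1.0, 2.0], alpha=1.0, rank=1))
```

The trade-off is that the adapter can only express a low-rank change to each layer, which is part of what the rest of the thread argues about.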
It may not be useful to anyone outside, but it's possible that one of the goals is institutional learning (that is, embedding the knowledge in how to build LLMs in an organization).
Even though it's nominally the national library behind this, they were probably chosen (as per the article) because they legally own and can use all NO material for this end. I'd guess researchers from related entities like unis will be involved in the process.
They successfully have made PoC finetunes before, so the next step is training fully fledged LLMs.
I don’t think they aim for anything worthwhile. The finetunes were incredibly broken. I’m guessing it’s more about having the method to do it. I’m not convinced it’s super useful but I’m not one to decide who gets to do what with the research funds.
One finetune I tried did make fun of humans expressing their feelings in the chat. Often.
One other finetune did hallucinate that it was a doctor and my baby had terrible diseases, every time I just wrote "hei" (with a generic neutral system prompt that likely triggered this behaviour though).
I think Olivia is big enough for what it’s used for. In my opinion it’s better to stay up to date and not waste too much money on hardware at the moment.
The article's slides mention how much of an engineering challenge it is just for them to clean their data and create new hardware and software flows to use the data for training. So perhaps it is a big learning exercise to build up institutional / national knowledge of LLM creation.
i18n language models are not an area that frontier labs are focusing a ton of resources on? (certainly not in Norwegian)
The corpus of content in Norwegian - may not require very large clusters, or even if it does, this is best that the library could do, it would be certainly more than anyone else is investing in Norwegian models
SOTA models do not have the access to the quality of content that the national library does? The article mentions licensing with newspapers specifically, and the library has access to its own content archive.
English and Norwegian are not closely related language families, perhaps LoRA is not best approach?
I am curious if there is published research on how well localization works with LoRA depending on how far off the target language grammar/vocabulary is from English.
Projects like this typically have more than one objective: they are not only about building a SOTA project, but also about building and training foundational local talent, similar to universities launching satellites.
> English and Norwegian are not closely related language families, perhaps LoRA is not best approach?
Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.
E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul just to give a random selection of English Germanic words.
But more than that, the frontier models both a) know Norwegian quite well, b) certainly know German and Dutch well, and there's a continuum of language transfer around the North Sea, especially when accounting for sounds rather than modern orthography (e.g. to take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag). The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.
Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.
This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.
I have no idea if the similar spelling will somehow help - I used that mostly because it's a simple way of illustrating the close relationship, but I suspect you'd find that the meanings of closely related words are likely to more directly overlap.
The grammar is perhaps more likely to help. Similar word order etc. Even weirdness like German - my only top grade on a German essay in school was one where I on purpose ignored what I thought I knew about German and tried to evoke "old fashioned" Norwegian. The result was guessing at a bunch of grammatical structures that I didn't know were valid German. Turned out I was right about most of it - century old Norwegian was far closer to century old Danish, which was a lot closer to valid German, and enough so to impress my teacher enough to overlook a number of orthographic mistakes.
The same thing works for guessing German grammar from English. The farther back you go in English, the more its grammar resembles German.
"What sayest thou?" -> "Was sagst du?"
In fact, for the above, you don't even have to know a single German word. You just have to know that for question words, "wh" -> "w", that the English "y" at the end of a syllable usually comes from an older Germanic "g" sound, and that "th" was replaced by "d" in German. That gets you 90% of the way from early modern English to modern German in the above example.
That's interesting. I haven't thought about it in that direction before. I'm "of course" aware of the High German consonant shift, which also muddled things a lot (the continuum around the North Sea is a lot "cleaner" if you look at Plattdeutsch instead), but never thought much about what other simple transformations to apply with standard modern German.
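The three substitutions described above can be tried mechanically. This is a toy sketch, not a linguistic tool: the rule table is my own crude encoding, and "ay" -> "ag" is only a stand-in for the syllable-final "y" rule, since real syllable detection is out of scope here.

```python
# Crude string rewrites for the three correspondences mentioned above.
# The rule table is an illustrative assumption, not a real sound-change model.
RULES = [
    ("wh", "w"),   # question words: what -> wat
    ("th", "d"),   # thou -> dou
    ("ay", "ag"),  # stand-in for syllable-final y <- older g: say -> sag
]


def shift(text):
    """Apply each rewrite rule in order to the whole string."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text


print(shift("what sayest thou"))  # wat sagest dou
```

The output, "wat sagest dou", is not quite "Was sagst du", but the remaining gap is small, which is roughly the point being made.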
That's enough resources to build on something like the Olmo 3 recipe but with a mix prioritizing their own data and post-training for their own tasks. If they build their own embedding model, index everything in the library, and train their model to query that data while answering historical, cultural, legal, and strategic questions from their perspective... Pretty interesting and likely useful. They won't beat Anthropic at dumping out React code but also there's no real reason to duplicate that.
LoRA won't fix the tokenization problem. Norwegian on a typical English-heavy BPE vocab uses 1.5-2x more tokens per word — that compounds into real inference cost, not just quality
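To make the tokens-per-word point concrete, here is a small sketch. It uses a greedy longest-match tokenizer as a simplified stand-in for BPE, and the vocabulary and sentences are made up to mimic an English-heavy vocab, so the resulting ratio is illustrative only and not a measurement of any real model.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown characters become 1-char tokens."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def tokens_per_word(text, vocab):
    """Average number of tokens needed per whitespace-separated word."""
    words = text.split()
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical English-heavy vocabulary: whole English words, Norwegian only in pieces.
VOCAB = {"the", "library", "is", "open", "pen", "er", "bib", "lio", "tek", "et"}

english = tokens_per_word("the library is open", VOCAB)
norwegian = tokens_per_word("biblioteket er åpent", VOCAB)

print(greedy_tokenize("biblioteket", VOCAB))
print(f"en: {english:.2f} tokens/word, no: {norwegian:.2f} tokens/word")
```

Since inference cost and context usage scale with token count, whatever the real ratio is for a given tokenizer gets paid on every request. A LoRA adapter leaves the vocabulary untouched, which is the limitation being pointed out.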
The largest problem is available training data actually.
They have already done experiments with different sub 10b models with both fine-tuning and fully from scratch. And last I checked the fully from scratch captured the language in a better way.
"Training a sovereign LLM with this meager hardware"
Norway has a sovereign fund worth O[MS|Apple|etc] except it is largely in readies and not pixie dust.
Whilst the UK frittered away North Sea oil profits, Norge squirreled them away instead.
So, if the grand dream of LLMs and AI does actually come to some sort of fruition and not simply another case of the Emperor's New Clothes combined with some lovely tulips and a dotcom boom and bust, then Norge can simply stuff shit loads of cash into buying whatever they need. Cash is king after all.
The beast they have described here is just a library system. I think I'd like my country's (UK) library system to have resources like that.
I don't think you are asking the right question: When you say "meager", I see "rather impressive PoC from a well resourced organisation"
The reason they have the largest sovereign wealth fund (aside from getting it right in the 80s, unlike the UK), is that there is quite a bit of regulation around where and how the money is invested.
It is run to maximise growth for example, so even though Norway is way ahead with electric car usage and infrastructure (presumably because they have a climate likely to be most affected by global warming/heating) their fund still invests in fossil fuels as they are a profit/growth opportunity.
Anyway, i don't think it's as easy as "simply stuff shit loads of cash into buying whatever they need". I believe there would be a serious political discussion needed for that to happen.
> There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.
Depends on what they are doing and why. but at most big labs, only the final model training happens on the big clusters. a lot of experimentation happens on <500 gpus per dev.
This is the use case for the small NVIDIA boxes that a researcher can have on their desk for $5k and do useful experiments before spending all the grant money on a huge training run for the final product.
"Norway's sovereign wealth fund, officially known as the Government Pension Fund Global, is the world's largest sovereign wealth fund with assets exceeding $2 trillion. Established in 1990 and managed by Norges Bank Investment Management, it was created to channel surplus petroleum revenues into long-term global investments to benefit future generations."
DeepSeek claims to have trained on something like 2k H800, this is ~0.5k GH200 … it’s not nothing. Sure they’re not going to _serve_ it at scale, but that’s not the point?
Also the line between “finetuning a base model” and “man this is a real good initialization” gets pretty blurry at scale.
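A back-of-envelope way to sanity-check "it's not nothing" is the common approximation that training a dense transformer costs about 6 × parameters × tokens FLOPs. Everything numeric below is an assumption for illustration: the model size, token count, per-GPU peak throughput, and utilization are round hypothetical figures, not specs of Olivia or of any real training run. Only the GPU count comes from the quoted article.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb compute for a dense transformer: about 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens


def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days given cluster size and the fraction of peak actually achieved."""
    sustained = n_gpus * peak_flops_per_gpu * utilization
    seconds = training_flops(n_params, n_tokens) / sustained
    return seconds / 86_400


# Hypothetical run: 7B parameters on 1T tokens.
days = training_days(
    n_params=7e9,
    n_tokens=1e12,
    n_gpus=448,                 # GPU count quoted in the article
    peak_flops_per_gpu=1e15,    # assumed round number, not a vendor spec
    utilization=0.4,            # assumed fraction of peak sustained
)
print(f"roughly {days:.1f} days")
```

Under those assumptions a small-to-mid sized model is days of cluster time, not years. Changing the model size or token count by 10x moves the answer by 10x, so the conclusion depends entirely on what "fully fledged" is taken to mean.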
> Which begs the question, whose money are they wasting - and why?
Norway is better run as a country than 99% of the countries on the planet, including the one that invented current LLM tech, so I'd give them the benefit of the doubt.
And this is before anybody ever thought about optimizing the training process. (Currently it's just pytorch analyst-as-coder slop, with extremely overprovisioned quantizations, etc.)
The frontier models know Norwegian just fine. They can also adapt to Norwegian dialects, and even ape old Norwegian fairly well.
E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.
What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels around the time it was published before I'd get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).
While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer that took his stage name after the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the author's political views and how they coloured his novels), so it does feel like a reasonable test that reveals a quite significant knowledge gap.
I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives that they have limitations on the use of.
But even just making the out-of-copyright data in their collections more accessible would be a great start.
You'd think so. It seems like there are a lot of odd gaps like that.
I also have a favourite English language PhD thesis I ask every new model about that they still struggle to find even though there's a Wikipedia article about it that links a blog post I wrote about it.
Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.
you might be surprised if you take this approach.. give key words and phrases in small amounts, each sentence of a prompt building on a previous sentence. Take an example that is not very hard, like the original text of Lewis Carroll's Alice in Wonderland. Although a quick question might get things sort of wrong, or miss details, if you guide the LLM to a certain part of the story, then a certain set of characters in that part of the story, then a certain statement or dramatic moment with those characters in that part of the story, you might get very specific detail that is close to line-by-line accurate. On the other hand, if you ask a quick, ordinary question about the same part of the story without supplying context and character names, you get something equally vague. YMMV
For the PhD thesis in question, I've actually tested a lot of requests about different parts of it, and both Claude and ChatGPT still draw a total blank if you don't let them do searches.
Why should they share all this data with the greedy American corporations that are stealing everyone's data for their own profit? Much better to keep the legal agreement with the national institutions and possibly develop something actually useful to their own country.
You are contradicting yourself. If you're hoarding the data for yourself you're not going to develop something useful. Sharing the data means that it will be integrated into the big LLMs, which will be useful "for their own country".
> Marius Husnes, the Head of IT Platform at the library (Nasjonalbiblioteket) discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a local (Norwegian) language LLM. He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.
I am not overly confident that Marius Husnes knows what he’s talking about here.
He’s right though, although it’s not entirely about the training corpus. It’s also about the tokenizer, which encodes substrings more efficiently for the language it is biased towards. English-oriented LLMs are more powerful for English than for other languages because the token space is more parsimonious for English. Try any online Anthropic tokenizer that calls their API with common English words (typically one token) and Norwegian words - you’ll often see 2-4 tokens instead, sometimes more. Some languages like Thai are at a huge disadvantage. Likewise, the corpus selection is often heavily skewed towards the target language, simply because more energy is applied to sourcing written works in that language. There will also be semantic biases in the vector space, due to cross-influence between semantically similar embeddings across languages, that create a different cultural baseline. Finally, fine-tuning greatly impacts cultural expression in the LLM. None of these are trivial effects.
There are a lot of efforts to create LLMs for dying languages, and others that use cross-cultural models as a boost, but if your language has a strong written tradition, there’s a good reason to build a heritage LLM specific to your language and culture. Expecting OpenAI or Anthropic to prioritize your language over their target audience when a tradeoff is to be made is absurd.
Did you even try to verify your claims? I tested it on a few translations of Wikipedia articles using [1] and it takes 15-20% more tokens for Norwegian.
English performs the best because there is more data in English and high quality sources are either only in English or there is a good translation in English.
In tests I've done with NO and FI texts, for the same number of characters, the GPT5 tokenizer gives me around 2x the tokens compared to EN. With the older tokenizers it's more like 2x or even 3x.
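The two measurements above use different denominators: one compares translations of the same article, the other compares texts with the same number of characters. That difference might account for part of the disagreement. A minimal sketch of both measures follows. The `toy_tokenize` function and the two sample sentences are stand-ins invented for illustration, not any vendor's tokenizer; to reproduce real numbers you would pass in the encode function of an actual tokenizer instead.

```python
import re
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def ratio_same_meaning(tokenize: Tokenizer, text_a: str, text_b: str) -> float:
    """Token count of text_b divided by token count of text_a.

    Intended for parallel texts (translations of the same content).
    """
    return len(tokenize(text_b)) / len(tokenize(text_a))


def ratio_per_char(tokenize: Tokenizer, text_a: str, text_b: str) -> float:
    """Tokens per character of text_b divided by tokens per character of text_a."""
    density_a = len(tokenize(text_a)) / len(text_a)
    density_b = len(tokenize(text_b)) / len(text_b)
    return density_b / density_a


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per word or punctuation mark.

    Real subword tokenizers behave very differently; swap in a real one
    to get meaningful numbers.
    """
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical sample sentences, roughly parallel in meaning.
english = "The national library keeps old newspapers."
norwegian = "Nasjonalbiblioteket tar vare på gamle aviser."

same_meaning = ratio_same_meaning(toy_tokenize, english, norwegian)
per_char = ratio_per_char(toy_tokenize, english, norwegian)
print(f"same meaning: {same_meaning:.2f}, per character: {per_char:.2f}")
```

Even with this crude word splitter the two ratios differ, because the sentences differ in length. With a real tokenizer the gap between the two measures could be larger or smaller, so it helps to state which one is being reported.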
When I am chatting with ChatGPT - it is fairly obvious that it is American - its native language, its style, its attitude is American - even if we chat in Danish.
Just as we cannot rely on Netflix and HBO to produce Scandinavian TV-shows even though they might do at the moment, we need to make our own stuff in this area too.
And over time, the technology to do this will become cheap and readily available for us to do so.
I chat to it in English instead of my native Spanish not only because of performance, but because I cannot stand the unnatural style it has in Spanish.
> And over time, the technology to do this will become cheap and readily available for us to do so.
But then the English models will be even better and you'll be back to square one. My guess is that things are going to become more and more American. If you assume that "culture" is a resource like "microchips", then from economic point of view it makes sense to have one country specialize in producing it, and the rest just consume. This is why when you turn on the main radio station of a random country, you're so likely to hit American music.
'Only one country should export culture, for economic efficiency' is the kind of take that the Norwegians (and everyone else) would like to protect themselves from.
> then from economic point of view it makes sense to have one country specialize in producing it, and the rest just consume
And, for exactly the same reasons as Europeans need to have sovereign compute to protect against economic imperialism, it is also essential to maintain local culture in order to avoid the great replacement of everything with Americanisms.
Yes, it requires pushing against the economics. But you have to do that if you believe that culture has any value per se at all.
> If you assume that "culture" is a resource like "microchips"
I do not. American culture exports American values, which are not universal. Simplest examples being the attitudes towards violence and nudity, which are very different in Europe, and vary within Europe as well.
Poland has its own LLM called Bielik. It's not only better at preserving Polish-sounding wording, it's also better at writing government documents. Why better? They ran an arena comparison and statistically it's just better.
You're making the mistake of thinking whether he knows what he is talking about matters. He is brewing a potion. Its ingredients are a trendy term, a vaguely spooky threat and a clear, overly simplistic solution that of course he will graciously assume control of, for the good of the motherland.
This potion is potent and you'd think it would stop working from frequent misuse but you'd be wrong!
may not be the most efficient way to go about things, but there remains a seemingly obvious use case for non-latin languages to do things from scratch.
see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to become a Babel fish already.
language is culture, so i can see the motivation behind their initiative. it must be nice to afford to do this yourself.
>but there remains a seemingly obvious use case for non-latin languages to do things from scratch
>see sarvam.ai and their tokenisation improvements on local languages
You don't need to build from scratch to improve tokenization, though.
Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
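To illustrate why adding target-language tokens can speed up generation, here is a toy sketch. It uses greedy longest-match segmentation over a hand-written vocabulary, which is a simplification and not the BPE procedure that Qwen or T-Bank actually use. The vocabularies and the sample phrase are made up for the example. In a real model the new tokens would also need trained embeddings, which is where the post-training step comes in.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation.

    At each position take the longest vocabulary entry that matches;
    characters not covered by the vocabulary become single-character tokens.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical English-centric vocabulary with no Cyrillic entries.
base_vocab = {"the", "lib", "rary", "book", "s", " "}
# The same vocabulary extended with a few made-up Cyrillic subwords.
extended_vocab = base_vocab | {"при", "вет", "книг", "мир"}

phrase = "привет мир"
before = greedy_tokenize(phrase, base_vocab)
after = greedy_tokenize(phrase, extended_vocab)
print(len(before), "tokens before,", len(after), "tokens after")
```

If each generated token costs roughly one decoding step, producing the same text with fewer tokens means fewer steps, which is the intuition behind the reported speedup. The ratio in this toy has no bearing on the real figures.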
the improvements for sarvam were in the number of tokens used to represent words in english vs non-english languages.
the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and compare outcomes.
unfortunately not everyone has the same level of respectable compute this easily available. at least those outside of the ZIRP/VC ecosystem of the valley.
This is a massive storage deployment. Given the I/O demands of LLM training, especially for checkpointing, moving to this scale of NVMe flash makes sense compared to traditional disk arrays.
The wording in this article is a bit strange, why the extreme focus on the brand of storage media? Also, the term LLM seems to be used in a very broad way here, are they actually building a language model from scratch, or are they fine-tuning?
>As Husnes put it; Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history? AI needs custodians, not just builders.
I'm afraid the answer is, mostly you don't.
Such a thing requires strong political will that, at least in my environment, seems basically impossible to align.
The costs are prohibitive, but beyond that, the type of person who cares about local representation like that is either completely fine with letting foreign companies implement it (after all, you can use ChatGPT in Basque if you want to) or is against the idea of AI altogether.
I guess it's subject to debate whether the cost indeed is prohibitive in the case of Norway. They are a small but extremely wealthy country - after all, they currently hold the equivalent of 1,5% of all the listed companies globally through the investments of their sovereign wealth fund.
I'm sure if Norway approached the American labs with the goal of making curated datasets for training, they would absolutely get in the training door, and those models would likely run circles around anything that could be domestically done.
That being said though, I can feel you cringing through the screen.
>That being said though, I can feel you cringing through the screen.
Then I failed to express myself in writing. I'm definitely a fan of this kind of initiative and am not happy with the type of viability I think they have.
I might very well be projecting a whole lot of local dynamics of national identity, politics and culture though.
This can’t be right. 2 PB of flash is like $200k. It’s within reach of many individuals. Then again I guess you don’t need that much storage so maybe it is.
Boy pricing is pretty nuts these days. I have half a petabyte in Seagate enterprise drives myself and I didn’t pay anything close to that to acquire it. Such a pity about the flash storage. 2 years ago we built 200 TiB or something of flash using Samsung PM1633 or something and it was a fraction of the cost per gigabyte that $1m would imply.
What's special about it is not the flash but training an LLM based on the content, much of which is still in copyright and which the library has restrictions on how they are allowed to use (irrespective of the legal position of training on it) and which required an agreement with the copyright holders.
As a Norwegian I have never needed a Norwegian language model. Doing most things in Norwegian puts you at a disadvantage internationally anyways. Maybe this has value in schools, but wouldn't it just give kids more trust in relying on LLMs? My friends who work in education report that group work has become insufferable because many do not think critically and ask an LLM to verify everything. I really don't see a benefit, but maybe they will find one - that is what research is for.
I am reminded that we recently concluded that our experiment of forcing things to be digital in schools was a flop. These things have a cost if we are wrong.
> He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.
I don’t know that this is true. But whatever sounds true enough and gets funding seems to be what flies these days.
Can confirm that. Norway may have a small population, but if you live there you'll think it's truly the center of the world (aside from the US. Norwegians love America)
What is called culture here will increasingly be propaganda. It reminds me of people cheering Twitter as a replacement for RSS, or using Facebook to communicate with your customers rather than email. You won't know which will be the winning company, you don't know who might control it in the future, and we can't predict what it will cost. It doesn't take much to be very annoying.
As a Norwegian this sounds like a mistake. Who will use this LLM? Where? For what? The underlying data could be made more easily searchable and digestible for agents in general if the goal is better knowledge of Norwegian culture.
That said, they are quite limited in what they are allowed to share of in-copyright works, and nb.no is a fantastic resource as it is (though you'll need a Norwegian IP address for too much of it - it's one of the main reasons I maintain a VPN) - if they are allowed to make it accessible there, it'd be great.
But they also have vast amounts of out-of-copyright data that I hope they'd make more easily accessible...
Why would the gap grow? There is no more training data to acquire, frontier models are training on the entire internet. Everything from now on is just fine-tuning.
Your statement assumes training data is the only thing that matters for the big players, while not considering it limiting for the small Norwegian model. That’s a fallacy.
Exactly, if there's one thing transformers are good at it's translation. One I've found particularly nice: any question ChatGPT can answer in English it can answer in French. I'm assuming Norwegian too. So there's no point.
The point is that Norway will have its own LLM, and will not have dependencies on another state or private company. The goal is not to be the best model, but to have a model that includes more Norwegian data than other LLMs and that is not skewed towards other sources.
But what does that give you? If the model is far less capable? What will it do for you with that Norwegian data, that a better model could not do with better search or context?
A model has many dimensions. You can't put them on one scale from good to bad. The model will most likely be poor at coding, but will give better answers about Norwegian culture. I assume the tone of voice will be (by default) much closer to how Norwegians talk and write than what we currently see from models from the US. They seem to be a bit too much.. Norwegian people are a bit more down to earth
Yes transformers are great at translation as that is their purpose.
LLMs are not great at preserving cultural uniqueness and diversity. Take how “delve” has reentered the lexicon because the dialect of English spoken by the human assessors used in training uses “delve” a lot.
There are a lot of benefits to training specifically for a unique culture with unique norms, to preserve the culture as we increasingly rely on LLMs.
Both Claude and ChatGPT can translate into minor dialects of Norwegian they will have seen very few works in because very few printed works exist in them.
E.g. I've tested both my local spoken dialect, which is rarely written, and a sociolect used by a 1970's Maoist group consisting of a few hundred people, where most of the printed material consists of novels from a couple of ex-members that became authors.
In the latter case, it claimed to not know, but was able to get a good match from just a description.
I also just had it ape Norwegian orthography from the 1910's by having it look up the rules and translate a text it had first translated from English to modern Norwegian, and it did just fine.
They will have seen some work in these dialects, but mostly, knowing related languages transfers really well (English, Dutch, German, Swedish, Danish, roughly form a continuum from least in common to most in common with modern Norwegian; they all share vocabulary and significant parts of grammar with Norwegian), and then a relatively limited exposure to Norwegian itself is sufficient to do fairly well.
They're also really good at "style transfer" of text in the form of tweaking orthography, word order, and minor grammar changes from descriptions and examples.
(incidentally, the latter is one way of getting an LLM to sound a lot less like an LLM)
This is all true, but I assumed the original posters were talking about cultural knowledge, not linguistic correspondences.
To do translation well you still need cultural knowledge. (E.g. the particular modes of specific kinds of legalese, or slang and the nuances of social class, etc)
I think it's not that this knowledge isn't present in the model somewhere, but probably more that it gets killed by instruction tuning for US corporate values.
That's about 350MB per capita. Humans can produce 2-6kb per hour. That's 13 years of non-stop typing. Wonder where it all comes from. I guess it's websites that aren't compressed / extracted.
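A quick check of the arithmetic above, as a rough sketch. It assumes "kb" means kilobytes, that "non-stop" means 24 hours a day, and (an assumption not stated in the thread) a population of roughly 5.5 million, which would put 350 MB per person close to the 2 PB figure discussed elsewhere in the thread.

```python
BYTES_PER_MB = 1_000_000
BYTES_PER_KB = 1_000
HOURS_PER_YEAR = 24 * 365  # non-stop typing, no sleep


def years_of_typing(bytes_per_person: float, bytes_per_hour: float) -> float:
    """Years of continuous typing needed to produce bytes_per_person."""
    return bytes_per_person / bytes_per_hour / HOURS_PER_YEAR


per_capita = 350 * BYTES_PER_MB

# Typing rates in kB/hour: the quoted range (2 and 6) plus a midpoint-ish value.
for rate_kb in (2, 3, 6):
    years = years_of_typing(per_capita, rate_kb * BYTES_PER_KB)
    print(f"{rate_kb} kB/hour -> {years:.1f} years")

# Assumed population (not from the thread): about 5.5 million people.
assumed_population = 5_500_000
total_pb = per_capita * assumed_population / 1e15
print(f"implied total: {total_pb:.2f} PB")
```

Under these assumptions the quoted 2-6 kB/hour range gives roughly 7 to 20 years, and the 13-year figure corresponds to about 3 kB/hour.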
It's a legal deposit library, same as e.g. Library of Congress. Which means almost every published book, magazine, and newspaper and many other works published in Norway, as well as large collections of Norwegian works published abroad (such as thousands of Norwegian-language newspapers published by the Norwegian immigrant communities in the US) for many decades and a large proportion of the same from the last 200+ years are stored there.
They do also crawl websites (or at least did) in the .no tld.
5x 400gbit running to a 2U box whoa, the PCI lanes must have heat shielding.
More seriously there is a sensibility limit on extreme density where it's not needed. The idea that you're just going to magically get 2 TBit/s out of those ports seems unlikely even with tweaked software, and you're stuck with a power and comms hotspot that's liable to dictate the remainder of your network design.
At max utilisation that 2U would take 12 hours to drain, and only 12 hours assuming peak and likely unachievable throughput and the box otherwise being completely out of service. Not a great start
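The drain-time estimate can be sanity-checked with a few lines. This is a sketch assuming five ports at 400 Gbit/s each running at full line rate with no protocol overhead. The capacity of the 2U box is not stated in the thread, so the code works backwards from the 12-hour figure and, separately, shows how long the 2 PB mentioned elsewhere in the thread would take over the same links.

```python
SECONDS_PER_HOUR = 3600
PB = 1e15  # decimal petabyte, in bytes


def drain_hours(capacity_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to read out capacity_bytes at the given line rate."""
    return capacity_bytes * 8 / link_bits_per_second / SECONDS_PER_HOUR


def capacity_for(hours: float, link_bits_per_second: float) -> float:
    """Bytes that can be moved in the given number of hours at line rate."""
    return hours * SECONDS_PER_HOUR * link_bits_per_second / 8


# Five 400 Gbit/s ports, i.e. 2 Tbit/s aggregate (idealised, no overhead).
link = 5 * 400e9

implied_capacity_pb = capacity_for(12, link) / PB
hours_for_2pb = drain_hours(2 * PB, link)

print(f"12 hours at line rate moves about {implied_capacity_pb:.1f} PB")
print(f"2 PB at line rate takes about {hours_for_2pb:.1f} hours")
```

Under those assumptions, 12 hours corresponds to about 10.8 PB, and 2 PB would drain in about 2.2 hours. Real throughput would likely be lower, as the comment above suggests.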
if you read the article 2pb is available as flash storage in the data pipeline, used to dedupe, clean, normalize, etc, for training from 60pb of raw data.
whenever Huawei want to buy billions of dollars worth of US licenses and stuff, they stop being a "national security threat" for a while because reasons
Even entire governments are captured by a mild LLM psychosis. Which is sad in the case of Norway. I lived in Norway for two years and always found their government to be highly rational, this is not a rational use of public funds (but I suppose they have plenty of capital).
Western society is completely captured by this form of psychosis and it's going to bite us in the a* very soon.
I firmly believe all the Boomer leaders throughout the world are being sold a bag of lies by technocrats that "AI", specifically LLMs, is going to cure disease and death, and therefore they are willing to hand over all control to the technocrats. Fckin croakers at it again.
I think it is highly rational. You are seeing it from the wrong point of view. It seems to be less a short-term utilitarian project or economic endeavour than a cultural one. Think of it more in terms of applied humanities: which languages go extinct, which cultures disappear and are superseded by a monocultural globalist hegemony.
Exactly. Nasjonalbiblioteket (National Library of Norway) has centuries of written material (Bokmål, Nynorsk and some Sami) and decades of audio and video material featuring varied dialects from all over the country. I believe training models that encompass this information can help in preserving both our language, history, and culture for future generations that increasingly turn to AI to get their information.