AI Runtime Stories (#5) National LLMs Are Not Built to Win Benchmarks

When Indonesia launched Sahabat-AI in late 2024, the press release led with benchmark scores: how it ranked against GPT-4 on multilingual reasoning and where it placed on SEA-HELM (the Southeast Asian evaluation suite developed by AI Singapore on top of Stanford’s HELM framework). Most national AI launches follow this template. Malaysia’s ILMU published comparative scores across Bahasa and English comprehension tasks. India’s Sarvam cited MMLU performance. The UAE’s Falcon, from the Technology Innovation Institute, led with parameter counts alongside frontier model comparisons. The framing treats national AI as a performance race.
The strongest critique is that most of these models are not foundation models at all. Sahabat runs on a Gemma 2 base; SEA-LION from AI Singapore is built on Qwen; Sarvam’s first release was a Mistral fine-tune. The criticism is partly accurate, but it misses the point. Fine-tuning an open-source foundation model is how foundation model investments start, not where they finish.
Historically, many national rail networks ran on British locomotives, and national telecoms ran on foreign switching equipment. The question was never whether the underlying technology was domestic; it was whether the country controlled its own network. Fine-tuning an open-source foundation model is the first step toward that control. Training from scratch (e.g., Falcon, ILMU, Sarvam’s 105-billion-parameter second model, India’s Krutrim) follows when the political, commercial, and technical conditions align.
The telecoms analogy holds more precisely than it usually gets credit for. When Indonesia built PT Telkom in 1965, or when Malaysia nationalised its telephone network, no one asked whether the state operator could beat AT&T on call quality. The governance question was whether critical communication infrastructure could operate without foreign-controlled dependencies.
National AI builds on the same logic. Language and cultural preservation is the most compelling case: for Bahasa Indonesia, Malay, Tamil, and Thai, government documents, legal texts, and local idioms require accuracy that models trained on Western corpora do not reliably deliver. Geopolitics adds structural force: the United Nations AI declaration of February 2026 carries over a hundred signatories committed to AI sovereignty, and ASEAN governments, alongside China’s MIIT, are treating compute as controlled infrastructure, much like electricity grids. The hardware points the same way: Thailand’s ThaiSC, Malaysia’s YTL AI Supercomputer, and India’s Yotta Shakti cluster were not built just to run someone else’s model.
There is also a technical floor that makes national models structurally defensible. Frontier models are English-biased at the architecture level, starting with the tokeniser, not just in training data. Tokenisers trained on Western corpora use four to eight tokens to represent a single Indic word, while English averages 1.4. Because inference is priced per token, that gap means Indian-language inference on GPT costs roughly three to five times more, and each token carries correspondingly less meaning for the same native-language task. Sarvam addressed this directly: its custom tokeniser brings Indic token counts down to between 1.4 and 2.1 per word, at or close to English-level efficiency. At a billion inference calls in Hindi, that cost difference is a competitive position that fine-tuning a foreign model cannot easily match, since a fine-tune typically inherits the base model’s tokeniser.
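The arithmetic behind that claim can be sketched in a few lines of Python. This is a minimal illustration, not a measurement: the tokens-per-word figures are the ones quoted above, while the words per call and the price per million tokens are hypothetical placeholders chosen only to make the scale visible.

```python
# Illustrative only. Values marked "from the text" are quoted in the article;
# values marked "hypothetical" are placeholders, not real prices or workloads.

ENGLISH_TOKENS_PER_WORD = 1.4      # from the text
GENERIC_INDIC_RANGE = (4.0, 8.0)   # from the text: Western-corpus tokeniser
CUSTOM_INDIC_RANGE = (1.4, 2.1)    # from the text: Sarvam's custom tokeniser

CALLS = 1_000_000_000              # from the text: "a billion inference calls"
WORDS_PER_CALL = 100               # hypothetical
PRICE_PER_MILLION_TOKENS = 1.0     # hypothetical, in arbitrary currency units


def cost_multiplier(tokens_per_word, baseline=ENGLISH_TOKENS_PER_WORD):
    """Token count (and so per-token cost) relative to the English baseline."""
    return tokens_per_word / baseline


def inference_cost(calls, words_per_call, tokens_per_word, price_per_million_tokens):
    """Total cost of a workload when billing is per token."""
    total_tokens = calls * words_per_call * tokens_per_word
    return total_tokens / 1_000_000 * price_per_million_tokens


for label, (low, high) in [
    ("generic tokeniser", GENERIC_INDIC_RANGE),
    ("custom tokeniser", CUSTOM_INDIC_RANGE),
]:
    low_cost = inference_cost(CALLS, WORDS_PER_CALL, low, PRICE_PER_MILLION_TOKENS)
    high_cost = inference_cost(CALLS, WORDS_PER_CALL, high, PRICE_PER_MILLION_TOKENS)
    print(
        f"{label}: {cost_multiplier(low):.1f}x to {cost_multiplier(high):.1f}x English, "
        f"cost {low_cost:,.0f} to {high_cost:,.0f} units"
    )
```

With the quoted figures, the generic tokeniser lands at roughly 2.9 to 5.7 times the English token count, and the custom one at 1.0 to 1.5 times. The absolute cost depends entirely on the placeholder price and call length, but the ratio between the two tokenisers does not.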
That said, application distribution is the more important domain to watch. Sahabat-AI’s integration into the GoTo ecosystem (parent of GoPay and Gojek) embeds it in the financial and logistics infrastructure that tens of millions of Indonesians use daily. Malaysia’s ILMU sits inside Ryt Bank (now with 1.2 million users) and Astro, respectively the country’s leading digital bank and broadcast network. South Korea’s HyperCLOVA X runs inside Naver, the dominant search and content platform for 50 million users. None of these deployments needed a model that beat the latest frontier model. They required generative capabilities above an acceptable threshold, built-in language coverage, and data residency arrangements that a foreign provider cannot offer.
When Sarvam released its Mistral-based model in 2023, critics noted it was not truly Indian AI. Within six months, the Indian government expressed a preference for models trained from scratch on Indian compute. Eighteen months later, Sarvam released a 105-billion-parameter model trained from scratch. Krutrim, built from scratch by Ola founder Bhavish Aggarwal, emerged from the same political environment. The critique accelerated the build, and others are likely to follow the same path.
The commercial cost of depending on a foreign provider arrived in concrete form on April 4, 2026, when Anthropic repriced its API terms for third-party agent harnesses. Users who had built workflows on a $200 monthly subscription were moved to pay-as-you-go billing that could cost between $1,000 and $10,000 per month. Boris Cherny, head of Claude Code, explained that subscriptions were not built for the usage patterns of these third-party tools. For national AI programmes, the change illustrated the potential downstream cost of infrastructure that runs on another party’s pricing model. National LLMs carry their own sustainability questions; government-funded compute is not free. But the political and commercial implications change when the model operator is a domestic institution rather than a lab in California.
The competition question itself is being reframed. Government services in Malaysia, Indonesia, and India face sovereignty requirements to run on domestic AI infrastructure; the procurement criteria are data residency and governance control, which frontier labs cannot offer regardless of their benchmark rankings. Benedict Evans observed in February 2026 that there is no known mechanism for any single AI company to achieve a lead others could never match. OpenAI’s share of AI inference traffic fell from 55% to 40% in twelve months. The gap national labs need to close narrows with each new model release (though much is expected of Mythos).
The gaps to fill are consistent across almost every national LLM. Reasoning remains the most significant, since most are instruction and chat models; Sarvam has made a start with reinforcement learning from process outcomes, but no national lab has a dedicated reasoning model at production quality. Agentic infrastructure is now part of the SEA-LION and Sarvam (with Arya) frameworks but absent from most others; tool calling, memory management, and multi-step orchestration are mostly not production-ready in any national stack.
Developer ecosystems exist but are thin; SDK depth, documentation quality, and playground tooling trail their OpenAI and Anthropic equivalents by a measurable margin. The hardest gap is sustainability: most national LLMs run on government grants or single-company funding, with no commercial model that self-funds ongoing pre-training cycles. These are significant gaps in the build-out of infrastructure that has the potential to transform national economies and the global landscape.
The models that close these gaps while their distribution advantages compound have a clearer path than the benchmark framing suggests. They need to be the default intelligence layer where their governments operate, their citizens transact, and their developers build. The field they are competing in is narrowing from the top, widening from the bottom, and anchored to regulatory infrastructure that frontier labs cannot replicate. On that measure, the race has barely started.