Google has really gone all out this time! Gemma 4 has made a splash in the open-source comm

Published: 2026-04-03 16:23


In the early hours of the morning, Google DeepMind dropped the first bombshell in the open-source world in 2026: the official release of Gemma 4.


Four models covering the full size range were released at once, from the 2B, which fits on a phone, to the 31B, which runs at full speed on a single GPU. All of them are built on the same technology as the closed-source flagship Gemini 3.



In the space of a year, Gemma has not only made an epic leap but also rewritten the rules of the game for open-source large models.


The most explosive number:

The 31B Dense model ranks third among open-source models on the Arena AI text leaderboard, with an Elo score of 1452.


The two models ahead of it have over 60 billion and over 100 billion parameters respectively, which means Gemma 4, with a mere 31 billion, has earned a seat at a table otherwise occupied by far larger models.


Even more striking is the 26B MoE: it has 25.2 billion parameters in total, but only 3.8 billion are activated during inference, and its Elo score still reaches 1441, ranking sixth among open-source models.



A glance at the report card shows that this is not an incremental iteration but a complete domination of the previous generation.

  • Mathematical reasoning (AIME 2026): 89.2% vs 21.2%, a jump of 68 percentage points.

  • Programming (LiveCodeBench): 80% vs 29.1%, a generational gap.

  • Agent capabilities (τ2-bench): 86.4% vs 6.6%, a difference of nearly 80 percentage points.

In addition, Gemma 4 achieved a 40% performance boost in benchmark tests for multilingual reasoning and knowledge-based question answering.
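As a quick sanity check on the three headline comparisons above, the gaps can be recomputed directly from the quoted scores. This is only a small arithmetic sketch: the numbers are copied from the text, while the labels and the `gains` helper are illustrative names, not part of any official tooling.

```python
# (label, Gemma 4 score in %, previous-generation score in %), as quoted in the text
SCORES = [
    ("math (AIME 2026)", 89.2, 21.2),
    ("coding (LiveCodeBench)", 80.0, 29.1),
    ("agent benchmark", 86.4, 6.6),
]


def gains(scores):
    """Return {label: (gap in percentage points, new/old ratio)}, rounded to 1 decimal."""
    return {
        label: (round(new - old, 1), round(new / old, 1))
        for label, new, old in scores
    }


for label, (points, ratio) in gains(SCORES).items():
    print(f"{label}: +{points} points, {ratio}x the previous score")
```

Reporting both the absolute gap and the ratio is useful here because the agent score starts from a very low baseline, so its ratio is much larger than its gap in points suggests.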



What is most striking is that this small 31B model outperformed a closed-source model 20 times its size.


Now, a Mac Mini can run Gemma 4, and some people have even successfully run it offline on their phones.



Hugging Face CEO Clément Delangue summed it up in one sentence: This is a huge milestone.





01

Four models enable seamless integration between edge, cloud, and mobile devices.



The Gemma 4 family offers a base version and an instruction-tuned version at each size, covering the full range of use cases:


  • E2B/E4B (edge mainstays): Optimized in collaboration with Google Pixel, Qualcomm, and MediaTek. The E4B can run completely offline on phones, Raspberry Pi, and Jetson Orin Nano with near-zero latency, and its performance even surpasses that of the previous generation's Gemma 3 27B.


  • 26B MoE (speed king): Only 3.8 billion parameters are activated during inference, so token generation is extremely fast, making it the first choice for low-latency agent scenarios. After quantization, it can run on a single 24GB graphics card.



  • 31B Dense (performance ceiling): Ranks third among open-source models overall. Its bfloat16 weights fit on a single 80GB H100, and after 4-bit quantization it runs smoothly on consumer-grade graphics cards.


It's worth mentioning that the entire series supports Google's latest TurboQuant compression algorithm, which further reduces memory usage with almost no loss of quality.
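The memory claims for the 26B and 31B models can be roughly checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only and assumes 2 bytes per parameter for bfloat16 and 0.5 bytes for 4-bit quantization; it ignores the KV cache, activations, and quantization metadata, and it does not model TurboQuant. The function names are illustrative.

```python
# Approximate storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion, dtype):
    """Weight memory in GB (1 GB = 1e9 bytes) for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[dtype]


def active_fraction(active_billion, total_billion):
    """Share of an MoE model's parameters used for each generated token."""
    return active_billion / total_billion


print(weight_gb(31, "bfloat16"))      # 31B Dense in bfloat16, against an 80 GB H100
print(weight_gb(31, "int4"))          # 31B Dense after 4-bit quantization
print(weight_gb(25.2, "int4"))        # 26B MoE after 4-bit quantization, against a 24 GB card
print(active_fraction(3.8, 25.2))     # fraction of MoE parameters active per token
```

Note that an MoE model generally still has to keep all of its expert weights in memory even though only a fraction is active per token, which is why the 24GB figure is compared against the full 25.2 billion parameters rather than the 3.8 billion active ones.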




02

Small model performs like a large model



Gemma 4 has no obvious weaknesses and crushes its predecessor in almost all benchmark tests:


  • Mathematics and science: The 31B scores 89.2% on AIME 2026 (previous generation: 20.8%) and 84.3% on GPQA Diamond scientific knowledge, close to the level of a human PhD.


  • Programming: The 31B reaches 80% on LiveCodeBench v6 and a Codeforces Elo of 2150, equivalent to a purple-rated competitive programmer; the 26B MoE also reaches 77.1%, outperforming the vast majority of models in its class.


  • Multimodal: The 31B scores 76.9% on MMMU Pro multimodal reasoning and the 26B scores 73.8%, far exceeding the previous generation's 49.7%.


  • Long context: The 31B supports a 256K context window and reaches 66.4% accuracy on the MRCR v2 128K needle-retrieval test, five times its predecessor's score.


Even the compact E4B achieves 42.5% on AIME and 52% on LiveCodeBench, results that only flagship-scale large models reached a year ago.



03

Make the most of every parameter



Gemma 4 does not pile on fancy new concepts; instead, it pushes a combination of proven techniques to their limits. Google even proactively cut components such as AltUp, whose benefits were "uncertain."


  • Per-Layer Embeddings (PLE)

    Traditional Transformers are like stuffing everything you need for the day into a backpack before leaving home, placing a heavy burden on the embedding layers. PLE, on the other hand, equips each layer with a dedicated low-dimensional signal channel, so that wherever you go, someone hands you the tools you need most at that moment. The additional overhead is minimal, but each layer gains its own dedicated adjustment capabilities, which is the core secret to making a small model perform like a large model.


  • Shared KV cache

    The last N layers no longer compute their own Key and Value tensors; they directly reuse the KV tensors from earlier layers. This significantly reduces memory usage and computation during inference, which makes it particularly well suited to long-context generation and edge deployment. Google claims that the impact on quality is "negligible."


  • Alternating attention mechanism

    The model alternates between local sliding window attention and global full-context attention, using a 512-token window for small models and a 1024-token window for large models. This ensures efficiency in local modeling while extending the context coverage through the global layer.
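To make the attention and KV-sharing ideas above more concrete, here is a minimal pure-Python sketch. It builds the causal mask for a local (sliding-window) layer versus a global layer, and estimates how many key/value positions must be cached across layers. Only the 512-token window comes from the text; the layer count, the ratio of local to global layers, and the simplified rule that the last `shared_last_n` layers store no cache of their own are assumptions made for illustration, not Gemma 4's actual configuration.

```python
def attention_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives global causal attention; an integer gives causal
    sliding-window attention over the most recent `window` positions.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def layer_schedule(num_layers, global_every):
    """Hypothetical schedule: every `global_every`-th layer is global, the rest local."""
    return [
        "global" if (k + 1) % global_every == 0 else "local"
        for k in range(num_layers)
    ]


def cached_positions(schedule, seq_len, window, shared_last_n=0):
    """Total key/value positions stored across all layers.

    Local layers keep at most `window` positions, global layers keep all
    `seq_len`. The last `shared_last_n` layers are assumed to reuse an
    earlier layer's cache and therefore store nothing themselves.
    """
    own_cache = schedule[: len(schedule) - shared_last_n]
    return sum(
        seq_len if kind == "global" else min(window, seq_len)
        for kind in own_cache
    )


schedule = layer_schedule(num_layers=6, global_every=3)
print(schedule)
print(cached_positions(["global"] * 6, seq_len=4096, window=512))       # all-global baseline
print(cached_positions(schedule, seq_len=4096, window=512))             # alternating layers
print(cached_positions(schedule, seq_len=4096, window=512, shared_last_n=2))  # plus KV sharing
```

In this toy setup, the cache shrinks first because local layers only retain the window, and again because the sharing layers keep no tensors of their own, which is the intuition behind the long-context and edge-deployment benefits described above.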





04

One model that sees images, hears audio, and watches video



The entire Gemma 4 series supports image and video input, and the E2B and E4B also accept audio, bringing all modalities together in a single model family.


  • Visual understanding: Supports variable aspect ratios (no more forced cropping) and five adjustable image token budgets, so you can switch freely between fast classification and high-precision OCR. Given a webpage screenshot, it can directly return the precise coordinates of a button in JSON format.


  • Video understanding: It can accurately describe video content and recognize subtitles and brand logos; the E4B can also extract information from the audio track and understand lyrics and dialogue.


  • Audio transcription: The E4B's English transcription is almost perfect, with natural punctuation and sentence breaks.


  • Native function calling: Tool-calling ability is built in from the training phase, so the model can handle multi-round, multi-tool agent workflows without complex prompt engineering.
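On the application side, function calling usually comes down to parsing the model's structured output and dispatching it to real code. The sketch below shows one way that dispatch step might look. The JSON shape (a list of objects with `name` and `arguments`), the `TOOLS` registry, and the two example tools are all hypothetical; the article does not specify Gemma 4's actual tool-call format.

```python
import json


# Hypothetical tools; a real application would register its own functions.
def get_weather(city):
    return {"city": city, "temp_c": 21}


def add(a, b):
    return {"sum": a + b}


TOOLS = {"get_weather": get_weather, "add": add}


def run_tool_calls(model_output):
    """Execute every tool call in a JSON list and collect the results in order."""
    results = []
    for call in json.loads(model_output):
        name = call["name"]
        tool = TOOLS.get(name)
        if tool is None:
            # Report unknown tools back instead of raising, so the model can recover.
            results.append({"name": name, "error": "unknown tool"})
            continue
        results.append({"name": name, "result": tool(**call["arguments"])})
    return results


# Assumed model output requesting two tools in a single turn.
output = '[{"name": "add", "arguments": {"a": 2, "b": 3}}, {"name": "get_weather", "arguments": {"city": "Singapore"}}]'
print(run_tool_calls(output))
```

In a multi-round workflow, the collected results would be serialized and appended to the conversation before the next model turn, and the loop would repeat until the model stops requesting tools.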





05

Apache 2.0



The biggest non-technical news from this release is that Gemma 4 is the first Gemma release to adopt the Apache 2.0 open-source license.


The previous Gemma series used a Google custom license, which had various restrictions and attribution requirements, and corporate legal departments had to review each clause before it could be used commercially.


Apache 2.0, on the other hand, settles all of this at once:

✅ No custom terms

✅ No commercial restrictions

✅ Can be freely modified, distributed, and packaged into the product.

✅ No gray area


Since its initial release, Gemma has been downloaded over 400 million times, with over 100,000 community-derived versions. With Apache 2.0 behind it, these numbers are likely to grow rapidly.


The release of Gemma 4 has fully solidified Google's two-pronged strategy. The top layer consists of the Gemini series of closed-source models, which occupy the performance ceiling and monetize through APIs; the bottom layer consists of the Gemma series of open-source models, which feed the developer ecosystem with the same technologies and seize the entry point for local deployment, edge inference, and agent development.


One focuses on generating revenue, while the other focuses on building an ecosystem. They don't conflict with each other; on the contrary, they amplify each other's impact.


For developers, the choice is now crystal clear:

  • At 31B parameters, it delivers results comparable to models with nearly 100 billion parameters;

  • Under Apache 2.0, it can be used freely, without custom license terms to review;

  • It covers everything from mobile phones to servers and comes with a complete fine-tuning toolchain.



With Gemma 4, Google has made the case that parameter efficiency is the future of open-source models: the 31B outperforms competitors 20 times larger, and the 2B fits on a phone in your pocket.


The competition for open-source large models has entered a new era starting today.


