A More Visual Guide to Gemma 4
1. A lunch-and-learn begins
2. What Gemma 4 is
3. Why a model family matters
4. Four variants, one progression
5. Dense models as the baseline
6. MoE as sparse capacity
7. The attention cost problem
8. Sliding-window attention in plain terms
9. What gets lost with local-only views
10. Why global layers are inserted
11. Interleaving as an efficiency rhythm
12. The final layer should see everything
13. Window sizes as a pragmatic knob
14. Global attention becomes the bottleneck
15. GQA as shared keys and values
16. More aggressive grouping for global layers
17. Doubling key dimensions to recover bandwidth
18. K equals V to cut cache size
19. RoPE as rotating vector pairs
20. Frequency tradeoffs over long contexts
21. p-RoPE preserves semantic dimensions
22. Why p-RoPE targets global attention
23. The global-attention recipe in one breath
24. GemmaVis turns images into tokens
25. Aspect ratios without distortion
26. 2D RoPE for spatial meaning
27. Pooling into a soft token budget
28. Choosing budgets as a speed-detail tradeoff
29. Ending with the deployment choice
Story Setup
1. A lunch-and-learn begins
Sarah plans a lunch-and-learn and decides the best way to teach is to mirror how she learned: a calm arc that starts broad, tightens into mechanisms, and ends with tradeoffs. She wants each slide to feel like a stepping stone, so her teammates can follow the same path from curiosity to confident decisions.
Overview
2. What Gemma 4 is
She opens by framing Gemma 4 as a family of multimodal Transformer models built to work with text and images, and in some smaller options, audio. The point, she explains, is not one monolith but a shared set of design ideas so teams can choose a size that fits their latency and memory limits without changing paradigms.
3. Why a model family matters
Sarah emphasizes that a consistent family is operationally valuable: prompts, evaluation habits, and deployment tooling can stay similar even as the underlying size changes. She wants her team to see scaling as a dial, not a redesign, which makes it easier to prototype on smaller hardware and then graduate to larger servers when needed.
4. Four variants, one progression
She introduces the lineup in one sweep: compact E2B and E4B options, a large 31B dense model, and a 26B A4B Mixture-of-Experts option that activates a smaller slice of compute at inference. Sarah’s narrative sets up the theme that “size” can mean parameters, active compute, or both, depending on design.
Dense vs MoE
5. Dense models as the baseline
To ground everyone, Sarah explains dense Transformers as the steady baseline: every token goes through the same feedforward capacity in every layer, so compute per token is predictable. That simplicity can help with debugging and performance consistency, which is why dense models often become the reference point for comparisons.
6. MoE as sparse capacity
She contrasts that with Mixture-of-Experts: instead of one big feedforward block, there are many smaller expert blocks, and tokens are routed to only a subset. The team grasps her core message: MoE can store lots of knowledge in total parameters, while keeping per-token compute closer to a smaller active footprint.
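A toy sketch can make the routing concrete. Everything below is illustrative, not Gemma's actual design: the expert count, shapes, and the `moe_layer` name are assumptions, but the mechanism shown — route each token to its top-k experts and mix their outputs with softmax gates — is the standard MoE pattern Sarah describes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small feedforward weight matrix: d_model -> d_model.
experts = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.normal(size=(d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route each token to its top_k experts; mix outputs with softmax gates."""
    logits = x @ router_w                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = top[t]
        gates = np.exp(logits[t, chosen])
        gates /= gates.sum()                           # softmax over the chosen experts only
        for g, e in zip(gates, chosen):
            out[t] += g * (x[t] @ experts[e])
    return out

tokens = rng.normal(size=(5, d_model))
y = moe_layer(tokens)
print(y.shape)  # every token touched only top_k of n_experts experts
```

The point of the sketch is the cost profile: total parameters scale with `n_experts`, but each token's compute scales with `top_k`.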
Attention Interleaving
7. The attention cost problem
Sarah transitions to attention by highlighting the fundamental cost issue: letting each token attend to all prior tokens captures long-range structure, but it can be expensive in both compute and memory. She frames the rest of the middle section as a set of engineering choices to keep context handling strong without blowing budgets.
8. Sliding-window attention in plain terms
She defines sliding-window attention plainly: each token attends only to itself and a fixed number of the most recent tokens, instead of the whole history. Because every query looks at a bounded window, compute and cache for these layers grow with the window size rather than the full context length, which keeps most of the network cheap even as inputs get long.
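The whiteboard version of this idea is a boolean mask: query i may attend key j only when j is causal and within the window. The sequence length and window size below are arbitrary, chosen small enough to print.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where query i may attend key j: causal, and within the last `window` keys."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))  # each row has at most `window` ones
```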
9. What gets lost with local-only views
Sarah describes the intuitive failure mode: when information must hop across many local layers and steps, details can blur. She calls it a “telephone game” effect, where the model may preserve the gist but distort specifics, making long-range dependencies harder to recover when the earliest evidence is far behind the window.
10. Why global layers are inserted
To fix the dilution, she introduces periodic global attention layers that reconnect distant tokens directly. These layers act like full-context refresh points: they can re-anchor the representation to the entire history, restoring coherence in tasks that require earlier constraints, long documents, or reasoning that depends on far-apart facts.
11. Interleaving as an efficiency rhythm
Sarah presents interleaving as a rhythm rather than a hack: most layers stay local for speed, and occasional global layers pay the full price to maintain long-range structure. She notes that different model sizes can use different local-to-global patterns, but the intent remains consistent—efficient most of the time, complete when needed.
12. The final layer should see everything
She highlights one design principle her team finds memorable: the last Transformer layer should be global attention. In her words, it avoids ending the network on a narrow local snapshot; the final representation integrates the full context before producing outputs, which helps prevent last-moment myopia in summarization or instruction following.
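The rhythm, including the global final layer, can be sketched as a simple schedule. The 5:1 local-to-global ratio below is an illustrative assumption, not a quoted Gemma setting; the invariant Sarah cares about is only that the last layer is always global.

```python
def layer_schedule(n_layers, local_per_global=5):
    """Mostly local layers with a periodic global layer; the last layer is global."""
    kinds = []
    for i in range(1, n_layers + 1):
        kinds.append("global" if i % (local_per_global + 1) == 0 else "local")
    kinds[-1] = "global"  # the final representation integrates the full context
    return kinds

sched = layer_schedule(12, local_per_global=5)
print(sched)  # five local layers, a global one, repeated; global at the end
```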
13. Window sizes as a pragmatic knob
Sarah presents the window size as a pragmatic knob: a larger window captures more nearby structure per local layer but costs more compute and cache, while a smaller window is cheaper and leans harder on the periodic global layers. Different model sizes can choose different windows, tuned to the latency and memory envelopes each variant targets.
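A back-of-the-envelope cache estimate shows why the knob matters. Every number below — head count, head dimension, context length, window — is an assumed, illustrative configuration, not Gemma's published one; the shape of the conclusion is what counts.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, cached_positions,
                   bytes_per_value=2, tensors=2):
    """Rough KV-cache size: layers * KV heads * head dim * positions * (K and V)."""
    return n_layers * n_kv_heads * head_dim * cached_positions * tensors * bytes_per_value

ctx = 32_000
local_layer = kv_cache_bytes(n_layers=1, n_kv_heads=8, head_dim=128,
                             cached_positions=1024)   # caches only the window
global_layer = kv_cache_bytes(n_layers=1, n_kv_heads=8, head_dim=128,
                              cached_positions=ctx)   # caches the full context
print(local_layer / 2**20, global_layer / 2**20)      # MiB per layer
```

At a 32k context, the global layer's cache here is roughly 31 times the local layer's, which is exactly why the next section attacks global attention specifically.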
Global Attention Efficiency
14. Global attention becomes the bottleneck
She warns that interleaving doesn’t eliminate the hard part: global layers still need to attend over the entire context and cache long histories, which drives latency and memory. So the real game is making global attention cheaper without breaking its job, and she cues the next section as a toolbox of targeted optimizations.
15. GQA as shared keys and values
Sarah introduces Grouped Query Attention by describing the cache pain point: storing separate keys and values per head can be heavy. With GQA, multiple query heads share fewer key/value heads, shrinking the KV cache. She stresses that it’s a controlled sacrifice—some head diversity is traded for a large memory win.
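A minimal NumPy sketch of grouped queries, with illustrative head counts and dimensions: eight query heads share two cached key/value heads, so the KV cache is a quarter of the multi-head baseline while the query side keeps its full diversity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 4, 6
group = n_q_heads // n_kv_heads        # query heads per shared KV head

q = rng.normal(size=(n_q_heads, seq, head_dim))
k = rng.normal(size=(n_kv_heads, seq, head_dim))   # only 2 KV heads are cached
v = rng.normal(size=(n_kv_heads, seq, head_dim))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Each query head borrows the KV head of its group.
out = np.stack([
    softmax(q[h] @ k[h // group].T / np.sqrt(head_dim)) @ v[h // group]
    for h in range(n_q_heads)
])
print(out.shape)  # (8, 6, 4): full query-head count, quarter-size KV cache
```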
16. More aggressive grouping for global layers
She explains why grouping becomes more aggressive in global layers: those layers are the expensive ones, so they get the strongest cost-cutting. By letting many query heads share a single KV stream, the model reduces cache growth where it hurts most. Sarah frames the tradeoff as acceptable if other design choices restore capacity.
17. Doubling key dimensions to recover bandwidth
Because heavier grouping reduces distinct KV channels, Sarah describes a compensating move: increase the dimensionality of keys in global attention so each key can carry more information. The team understands the balancing act—some memory creeps back in, but representational bandwidth improves, keeping global layers useful rather than merely cheap.
18. K equals V to cut cache size
She then shares a simpler cache-saving trick: set keys equal to values in global attention so only one tensor needs to be stored for past states. Sarah presents it as an engineering compromise that trims memory pressure right where contexts are longest, helping global layers stay deployable under realistic serving constraints.
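The savings compose neatly in a back-of-the-envelope sketch. All head counts and byte sizes below are assumed for illustration; the arithmetic just shows how grouping and K=V multiply together on the cache bill.

```python
def kv_bytes_per_position(n_kv_heads, head_dim, bytes_per_value=2, k_equals_v=False):
    """Bytes cached per layer per past position; K=V stores one tensor, not two."""
    tensors = 1 if k_equals_v else 2
    return n_kv_heads * head_dim * tensors * bytes_per_value

full_mha = kv_bytes_per_position(n_kv_heads=16, head_dim=128)              # K and V per head
grouped  = kv_bytes_per_position(n_kv_heads=2, head_dim=128)               # aggressive GQA
shared   = kv_bytes_per_position(n_kv_heads=2, head_dim=128, k_equals_v=True)
print(full_mha, grouped, shared)  # 8192 1024 512
```

In this made-up configuration, grouping plus K=V leaves the cache at one sixteenth of the multi-head baseline per position.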
Positional Encoding
19. RoPE as rotating vector pairs
To explain positional encoding, Sarah uses a geometric story: Rotary Positional Encoding splits query and key vectors into 2D pairs and rotates each pair by a position-dependent angle. Different pairs rotate at different rates, letting attention infer relative distance through geometry instead of adding a separate position embedding to every token.
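The rotation itself fits in a short function. This is a generic RoPE sketch, not Gemma's exact code; the base of 10000 is the common convention from the original RoPE formulation. The key property it demonstrates is that dot products depend only on relative distance: shifting both positions by the same amount leaves attention scores unchanged.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive 2D pairs of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation rate per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
print(rope(q, pos=0))  # position 0 means zero rotation: the vector is unchanged
```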
20. Frequency tradeoffs over long contexts
She notes that high-frequency pairs change rapidly with position and help with precise ordering, while low-frequency pairs rotate slowly and better preserve semantics. Over very long contexts, accumulated rotation can misalign far-apart tokens, so she frames long-context stability as a battle against positional noise leaking into meaning-bearing dimensions.
21. p-RoPE preserves semantic dimensions
She explains p-RoPE as a targeted softening of rotary encoding: only a fraction p of the dimension pairs are rotated, and the remaining pairs are left unrotated so they carry position-independent meaning. The rotated pairs still encode order and distance, while the preserved subspace gives far-apart tokens a stable channel for semantic matching no matter how long the context grows.
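Under that interpretation — rotate only a fraction of the pairs, leave the slowest ones untouched — a sketch might look like this. The `p_rope` name and p = 0.75 are illustrative assumptions; the demonstration is that the unrotated pairs are identical at any position, however extreme.

```python
import numpy as np

def p_rope(x, pos, p=0.75, base=10000.0):
    """Rotary rotation on a fraction p of the 2D pairs; the rest stay unrotated."""
    d = x.shape[-1]
    n_pairs = d // 2
    rotated_pairs = int(round(p * n_pairs))
    freqs = base ** (-np.arange(0, d, 2) / d)
    freqs[rotated_pairs:] = 0.0           # zero frequency = no rotation = stable semantics
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

far = p_rope(np.ones(8), pos=100_000)
print(far[-2:])  # the last pair is untouched no matter how far the position is
```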
22. Why p-RoPE targets global attention
She connects the dots: global attention spans the widest range of relative distances, so it suffers most from accumulated rotation effects. Applying p-RoPE specifically to global layers keeps long-range comparisons steadier by reserving more stable subspace for meaning. Sarah frames it as “save precision where distance is enormous.”
Global Attention Efficiency
23. The global-attention recipe in one breath
Sarah compresses the section into a single recipe: interleave mostly local layers with periodic global ones, group many query heads onto a few shared key/value heads in the global layers, widen the keys to recover representational bandwidth, store a single tensor for keys and values to halve the cache, and apply p-RoPE so long-range comparisons stay semantically stable. Each move trims cost where global attention hurts most, and the others compensate for what it gives up.
Vision (GemmaVis)
24. GemmaVis turns images into tokens
She shifts to multimodality by introducing GemmaVis: a vision encoder that splits an image into fixed-size patches and maps each patch to an embedding, like turning pixels into a token sequence. Sarah stresses the conceptual bridge—once the image is tokenized, the language model can condition on it using familiar sequence machinery.
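Patch extraction is essentially a reshape exercise. The image size and 16-pixel patch below are illustrative assumptions, not GemmaVis's actual configuration; the sketch shows the bridge Sarah describes, from a pixel grid to a flat token sequence.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a row-major sequence of flattened patch 'tokens'."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "pad the image to a patch multiple first"
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, c)
    return x.reshape(rows * cols, patch * patch * c)

img = np.arange(64 * 96 * 3, dtype=float).reshape(64, 96, 3)
tokens = patchify(img, patch=16)
print(tokens.shape)  # (24, 768): a 4x6 grid of patch tokens, 16*16*3 values each
```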
25. Aspect ratios without distortion
Sarah explains how variable aspect ratios are supported: resize the image while preserving its original proportions, then pad the remaining space so the patch grid stays consistent. This avoids the subtle failures caused by forcing everything into a square, which can stretch text in screenshots or skew shapes that matter in diagrams.
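The resize-then-pad arithmetic is simple. The 896-pixel target square below is an assumed value for illustration: scale so the longer side fits, round, and pad the leftover space so the patch grid stays regular without stretching anything.

```python
def fit_with_padding(h, w, target=896):
    """Scale (h, w) to fit a target square without distortion, then pad the rest."""
    scale = target / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    pad_h, pad_w = target - new_h, target - new_w   # filled with padding pixels
    return (new_h, new_w), (pad_h, pad_w)

size, pad = fit_with_padding(1080, 1920)
print(size, pad)  # a landscape screenshot: width fills the square, height is padded
```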
26. 2D RoPE for spatial meaning
She extends the rotation story to images: with 2D RoPE, each patch token is rotated according to its row and column in the patch grid, with the embedding dimensions split between the two axes. Nearby patches end up with similar rotations, so attention can read spatial relationships like "above" or "to the left" directly from geometry rather than from a flattened one-dimensional ordering.
27. Pooling into a soft token budget
Sarah explains that the raw patch sequence can be long, so the encoder pools neighboring patch embeddings, averaging small blocks of the grid into single tokens until the sequence fits a target budget. The budget is "soft" in the sense that it is a chosen operating point rather than a hard architectural limit: the same image can be represented with more or fewer visual tokens depending on how much detail the task needs.
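Under that reading, pooling is block-averaging over the patch grid. The grid and budget sizes below are illustrative assumptions; the sketch shows a 256-token grid averaged down to a 64-token budget while preserving the overall content statistics.

```python
import numpy as np

def pool_to_budget(patch_tokens, grid, budget_side):
    """Average-pool a (rows*cols, d) patch sequence down to budget_side**2 tokens."""
    rows, cols = grid
    d = patch_tokens.shape[-1]
    x = patch_tokens.reshape(rows, cols, d)
    r_step, c_step = rows // budget_side, cols // budget_side
    x = x[: budget_side * r_step, : budget_side * c_step]          # trim to even blocks
    x = x.reshape(budget_side, r_step, budget_side, c_step, d)
    return x.mean(axis=(1, 3)).reshape(budget_side * budget_side, d)

tokens = np.random.default_rng(0).normal(size=(16 * 16, 32))       # 256 patch tokens
pooled = pool_to_budget(tokens, grid=(16, 16), budget_side=8)
print(pooled.shape)  # (64, 32): a quarter of the original token count
```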
28. Choosing budgets as a speed-detail tradeoff
She presents the visual-token budget as a user-facing dial: higher budgets retain fine detail for small text, charts, or dense UI screenshots, while lower budgets speed up inference and reduce context load. Sarah encourages the team to treat this like selecting image resolution for a task—don’t overspend tokens when the question is simple.
Wrap-up
29. Ending with the deployment choice
Sarah closes by comparing the 31B dense option to the 26B A4B MoE: dense offers steadier, simpler behavior with consistent compute, while MoE can deliver efficiency by activating a smaller expert subset per token. She summarizes the decision rule she wants remembered—choose dense for simplicity and predictability, MoE for scalable efficiency.