The Geometry of Compression: Why Angles Beat Naive Storage

05/05/2026 • 9 min read • Bartosz Lenart

AI compression transformers TurboQuant KV cache geometry zellige machine learning


The best compression does not remember every piece. It remembers how the pieces still fit.

The Core Idea

The wall. A zellige maker in Fes builds a vast mosaic from a small set of shapes, angles, and edges. Not memorized. Generated.
The cache. A long-context transformer stores key and value vectors for every past token. Naive storage collapses under its own weight.
The fix. TurboQuant rotates those vectors into a better coordinate frame, drops precision to about 3.5 bits per channel, and preserves the alignments attention actually uses.

Why It Matters

Bad compression saves the wrong things.
Good compression saves what lets the system work.
Great compression saves the rule.

The practical question is not "how much can I remove?"

It is: what must still fit after the representation gets smaller?


The geometry of compression: broken zellige rosette held together by red lines of force

The best compression is geometry, not picture.

A zellige wall is not stored in anyone's head tile by tile.

Zellige (الزليج) is traditional Moroccan mosaic work: small ceramic tiles, cut by hand, fitted into stars, crosses, knots, and repeating fields. From a distance it looks like a single ornate image. Up close, it is a puzzle of edges and angles.

The maker does not need to remember every tile as a separate fact. The wall comes from a smaller set of instructions: draw this grid, cut this family of shapes, repeat this angle, let this edge meet that one.

The wall is the output. The method is the compression.

The same idea now sits inside one of the most expensive bottlenecks in artificial intelligence: the key-value cache of a long-context transformer. A transformer is the type of neural network behind most modern language models. Its key-value cache is the memory shelf where the model keeps traces of previous words. As the conversation grows, the shelf fills. Reading from it becomes slower and more expensive.

In 2025, Google's TurboQuant paper showed a way to shrink that shelf without breaking attention.¹ It changes the coordinate frame of the vectors, then stores them with fewer bits. The point is not to keep every number perfect. The point is to keep the comparisons that tell the model what to look at.

Ceramic and silicon are very different materials. The principle is the same:

The best compression does not remember every piece. It remembers how the pieces still fit.

The Wall Is Geometry, Not Picture

A geometric construction seed blooms into a dark zellige wall like a living algorithm

The wall is not copied. It is made from rules: each tile fits because its edge, angle, and orientation answer the tile beside it.

Look at a traditional Moroccan zellige mosaic and the eye gets lost. Star opens inside star. Edges interlock. Symmetries repeat, but the surface does not feel mechanical. The first instinct is to treat the wall as an image: a complicated arrangement of coloured pieces.

But that is the wrong level of description.

Zellige is not made by copying an image into ceramic. It is made by following a method.

The maker begins with a geometric guide, often based on star polygons and repeated grids, then cuts a limited alphabet of tile shapes.² A piece fits because its edge, angle, colour, and orientation answer the pieces around it.

The toolkit is simple, but not vague.

In a Fes workshop, the pattern starts as geometry: a circle stepped out with a compass, lines drawn across it, stars and grids emerging from the intersections.²

Then the drawing becomes a set of working shapes.

A square glazed tile is held in one hand. A sample shape is traced on the back with a sharpened stick or pencil.

The cutter rests the tile on a small iron anvil and strikes along the line with a sharp hammer-chisel, the menqach: a heavy iron tool with a chisel edge at each end, kept very sharp, swung short and close to the body.

Tap, turn, tap, turn, until a star point, triangle, cross, or kite comes free.

The cut is not the end.

Each edge is bevelled so the front faces meet tightly and the backs leave room for binder. Finished pieces, called furmah, are sorted into baskets by colour and shape.

At the assembly bench, the wall is built upside down on a smooth board. Tiles are laid face-down on a paper pattern, hidden colour against the surface.

A frame is set. Plaster and then cement are pressed into the joints. Only after the panel hardens is it lifted and turned.

The maker is not measuring angles. Precision lives in the repeated shape, the trained hand, the edge of the menqach, and the moment the hidden face is revealed.³

The tiler does not need to memorize the whole surface tile by tile. Once the design and repeat are chosen, geometry can extend the pattern as the wall grows.

That is why the metaphor works.

The compression is not in the finished pattern. It is in the way the pattern is made.

A huge surface can grow from a small rulebook because the rulebook keeps the right things in place: points, angles, edges, symmetries, repetitions.

The wall stores a grammar of meeting.

The Recipe Is Smaller Than The Surface

A compact bone-white talisman spills a vast zellige surface into the darkness

A tiny recipe, an immense surface.

If you describe the wall pixel by pixel, the description becomes enormous. If you describe every tile independently, it is still wasteful. But if you describe the rule, the surface becomes cheap.

Use this grid. Cut this family of shapes. Rotate here. Reflect there. Repeat this unit. Match these edges. Keep this angle.

That is not a blurry summary of the wall. It is a recipe for making the wall. Given the rule, the pattern can be extended, repaired, taught, or rebuilt. The description is small because it does not carry every visible detail. It carries the fit that produces the detail.
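
The gap between recipe and surface can be made concrete with a toy model. The sketch below is an illustration only, not a real zellige construction: it assumes NumPy, a hypothetical 4x4 seed in which each integer stands for one tile shape, and a made-up rule of "mirror, mirror, repeat".

```python
import numpy as np

def build_wall(seed, repeats):
    """Expand a small seed into a wall: reflect it into a symmetric unit, then tile."""
    top = np.hstack([seed, np.fliplr(seed)])   # mirror left to right
    unit = np.vstack([top, np.flipud(top)])    # mirror top to bottom
    return np.tile(unit, repeats)              # repeat the unit across the wall

# Hypothetical seed: each integer stands for one tile shape or colour.
seed = np.array([[0, 1, 2, 1],
                 [1, 3, 1, 2],
                 [2, 1, 0, 3],
                 [1, 2, 3, 0]])

wall = build_wall(seed, repeats=(50, 50))

rule_size = seed.size + 2   # the seed entries plus the two repeat counts
print("tiles on the wall:", wall.size)
print("numbers in the rule:", rule_size)
```

Sixteen seed entries and two repeat counts are enough to regenerate a 400 by 400 surface. Changing the repeat counts extends the wall without adding anything to the rule.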

That gives us the useful distinction:

Bad compression saves decoration.
Good compression saves function.
Great compression saves the rule that can make the thing again.

Islamic geometric ornament makes this visible because the compression is physical. A small rule expands into ceramic.⁴

The Cache Is Also A Wall

A dark cathedral of fading coordinate pillars crossed by bone-white query arcs with blood-red inner-product pulses

Attention does not care about raw coordinates. It cares about which vectors point toward each other.

Now move from the wall to the datacenter.

A long-context transformer has its own version of too many tiles. A token is a small piece of text, often a word fragment. For every previous token, the model stores key and value vectors: long lists of numbers that help decide what earlier text matters now. As the context grows, this cache grows with it. At long lengths, moving that memory becomes one of the main costs of producing an answer.¹
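
To get a feel for how fast the shelf fills, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 key-value heads, head dimension 128) is a hypothetical example, not a figure from the paper, and the function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value=16.0):
    """Bytes needed to hold the keys and values of one sequence."""
    # The factor 2 counts one key vector and one value vector per token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits_per_value / 8

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.1f} GiB at 16 bits per value")
```

Under these assumed dimensions each token costs 128 KiB, so the cache reaches 1 GiB at 8,192 tokens and 16 GiB at 131,072 tokens, for a single sequence.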

The naive answer is to store each coordinate with fewer bits. Round the numbers. Shrink the cache. Hope the model survives.

But attention does not care about coordinates as coordinates. It cares about how vectors line up. In particular, it cares about inner products, a simple alignment score: how strongly a query vector points in the same direction as the keys already stored. If quantization damages that geometry, the cache may be smaller, but the model has lost the map it needed.
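
A small sketch of what "attention cares about inner products" means in practice, assuming NumPy and made-up sizes. The rounding scheme here (one shared min-max grid for every coordinate) is a deliberately naive stand-in, and the single outlier channel is an assumption added so the effect is easy to see.

```python
import numpy as np

def attention_weights(q, K):
    """Softmax over scaled inner products between one query and all stored keys."""
    scores = K @ q / np.sqrt(q.shape[0])
    scores = scores - scores.max()             # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def naive_round(K, bits):
    """Round every coordinate onto one shared uniform grid between min and max."""
    lo, hi = K.min(), K.max()
    levels = 2**bits - 1
    codes = np.round((K - lo) / (hi - lo) * levels)
    return codes / levels * (hi - lo) + lo

rng = np.random.default_rng(0)
d, n = 64, 256                                 # hypothetical head dimension and context length
K = rng.standard_normal((n, d))
K[:, 0] *= 20.0                                # one outlier channel stretches the shared grid
q = rng.standard_normal(d)

K_rounded = naive_round(K, bits=3)
w_exact = attention_weights(q, K)
w_rounded = attention_weights(q, K_rounded)

print("largest change in a raw score:", np.abs(K @ q - K_rounded @ q).max())
print("shift in attention weights:", 0.5 * np.abs(w_exact - w_rounded).sum())
```

The second number is the total variation distance between the two attention rows: 0 means the model would look at exactly the same tokens, 1 means it would look somewhere entirely different.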

That is like saving the colours of the tiles while ruining the angles that let them fit.

TurboQuant Rotates The Wall

A shattered zellige lattice caught mid-rotation inside a dark circular Art Deco aperture, blood-red arc holding inner-product chords

Rotate first, then cut away: the coordinate frame decides what can disappear.

TurboQuant's move is to change the basis before deleting precision.¹

First, it applies a random orthogonal rotation to the vector. In plain language, it turns the coordinate system without stretching it, like rotating a drawing on transparent paper. The same shape is still there, but its weight is spread more evenly across the axes. After rotation, the normalized squared coordinates follow a known Beta law that depends on dimension rather than on the particular dataset.¹
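
The two claims in that paragraph, that rotation preserves the shape and that it spreads the weight, can be checked numerically. This is a minimal sketch assuming NumPy; drawing the rotation from the QR factorisation of a Gaussian matrix is one common construction, not necessarily the one used in the paper. For a unit vector pointing in a uniformly random direction in d dimensions, each squared coordinate follows Beta(1/2, (d-1)/2) with mean 1/d, which I take to be the law the text refers to.

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR factorisation of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))             # sign fix so the draw is uniform over rotations

rng = np.random.default_rng(0)
d = 128
Q = random_rotation(d, rng)

k = np.zeros(d)
k[0] = 1.0                                     # worst case: all weight sits on one axis
q = rng.standard_normal(d)

k_rot, q_rot = Q @ k, Q @ q

print("inner product before and after:", q @ k, q_rot @ k_rot)
print("largest |coordinate| before and after:", np.abs(k).max(), np.abs(k_rot).max())
print("mean squared coordinate vs 1/d:", np.mean(k_rot**2), 1 / d)
```

The inner product is unchanged because the same rotation is applied to both vectors. The spike that sat on one axis is now shared across all 128, so no single coordinate dominates the quantization grid.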

Then it quantizes against that known distribution. Because the distribution is known, the scalar quantizer can be prepared in advance. No calibration set. No retraining. No special case for every wall.
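
What "prepared in advance" can look like is sketched below, assuming NumPy. The codebook is fitted with a plain 1-D Lloyd iteration to synthetic samples of the known coordinate distribution, then applied to a vector it has never seen. The Lloyd fit and the function names are illustrative choices on my part; the paper's own codebook construction may differ. A random unit vector stands in for a key that has already been rotated and normalized.

```python
import numpy as np

def sphere_coordinates(d, n_samples, rng):
    """Samples of one coordinate of a uniformly random unit vector in d dimensions."""
    g = rng.standard_normal((n_samples, d))
    return g[:, 0] / np.linalg.norm(g, axis=1)

def lloyd_codebook(samples, bits, iters=50):
    """1-D Lloyd iteration: assign samples to the nearest centroid, then re-average."""
    centroids = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(centroids.size):
            if np.any(idx == j):
                centroids[j] = samples[idx == j].mean()
    return np.sort(centroids)

def quantize(x, centroids):
    """Index of the nearest centroid for each coordinate of x."""
    return np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)

rng = np.random.default_rng(0)
d, bits = 128, 3

# Built once from the distribution alone: no dataset, no calibration pass.
codebook = lloyd_codebook(sphere_coordinates(d, 20_000, rng), bits)

# A fresh unit vector, standing in for a rotated and normalized key.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
codes = quantize(x, codebook)
x_hat = codebook[codes]

print("codebook:", np.round(codebook, 3))
print("squared error at", bits, "bits per coordinate:", np.sum((x - x_hat) ** 2))
```

The codebook depends only on the dimension and the bit width, so in this sketch it could be computed once and reused for every vector of that size.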

Finally, for inner products, TurboQuant adds a one-bit residual correction using a Quantized Johnson-Lindenstrauss transform. The name is heavy. The job is simple: keep distances and alignments usable while moving into a smaller form. The correction does not make every coordinate perfect. It keeps the inner-product estimate unbiased, because that is the comparison attention actually uses.⁵
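
The unbiasedness claim can be illustrated with a short simulation, assuming NumPy. The sketch keeps only the signs of a Gaussian projection of the key, plus the key's norm, and rescales by sqrt(pi/2) so that the estimate of the inner product is correct on average. For simplicity it applies the 1-bit sketch to the key itself; in TurboQuant, as described above, the same idea is applied to the residual left after the first quantization stage. The function names are hypothetical.

```python
import numpy as np

def qjl_encode(k, S):
    """Keep one sign bit per projected coordinate of the key, plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, signs, k_norm, S):
    """Estimate <q, k> from the 1-bit sketch of k and the full-precision query q."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * k_norm / m * ((S @ q) @ signs)

rng = np.random.default_rng(0)
d, m = 64, 256                                 # hypothetical vector and sketch dimensions
q, k = rng.standard_normal(d), rng.standard_normal(d)

estimates = []
for _ in range(1000):
    S = rng.standard_normal((m, d))            # fresh Gaussian projection for each trial
    signs, k_norm = qjl_encode(k, S)
    estimates.append(qjl_inner_product(q, signs, k_norm, S))

print("true inner product:", q @ k)
print("mean of 1-bit estimates:", np.mean(estimates))
print("spread of a single estimate:", np.std(estimates))
```

Any single estimate is noisy, but the noise is centred on the true value rather than pulled in one direction, which is the property the correction is there to provide.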

The result is the part that matters: KV caches compress to about 3.5 bits per coordinate with absolute quality neutrality on the reported long-context benchmarks. That is more than a 4x reduction over fp16 storage (16 / 3.5 ≈ 4.6), and the team reports memory savings of at least 6x on long-context retrieval tasks at lower bit widths.⁶ Separately, the paper proves that TurboQuant's distortion rate stays within a small constant factor of the information-theoretic lower bound.¹

The trick is not that TurboQuant deletes harder. The trick is that it first finds a basis where deletion hurts less.

What Survives Deletion

Fading tiles and coordinate dust dissolve into darkness while bone-white guide lines and blood-red seams remain suspended at the center

Deletion reveals the skeleton: the few lines that were doing the real work all along.

Compression is usually discussed as storage: fewer bits, smaller files, less memory. But the better question is not how much was removed. The better question is what survived.

For the wall, what survives is adjacency, angle, symmetry, and the rule of assembly. The individual tile matters less than the way tiles are allowed to meet.

For TurboQuant, what survives is inner-product geometry. The individual coordinate matters less than how the vectors line up.

In both cases, the representation becomes efficient because it keeps the geometry of use. The wall needs to be built, extended, and visually held together. The transformer needs to retrieve, compare, and attend. Different jobs require different survivors, but the question has the same shape:

What must remain true after the representation gets smaller?

A JPEG keeps what the eye notices. A tile geometry keeps the rule that builds the surface. A vector quantizer keeps a small codebook of typical shapes. TurboQuant's answer is narrower and unusually clean: keep the inner products useful, and the cache can get much smaller.

The Smallest Version That Still Works

A storm of abstract paper shards dissolves into darkness while a compact four-part Art Deco frame emerges, connected by a single red path of action

From the noise, the smallest working shape.

This is not only a lesson for tiles and transformers. It is a useful way to think.

Most people try to handle complexity by carrying more detail: longer notes, longer plans, more screenshots, more saved tabs, more context. That often makes the problem heavier instead of clearer. The better move is to ask what must survive.

Do not try to remember every example. Remember the rule that makes the examples.

Do not write a transcript of a meeting. Write the goal, the constraint, the decision, the open question, and the next action.

Do not keep every possible object "just in case." Keep the things that let the room still work.

Do not explain every thought when communicating. Keep the message in shape: what happened, why it matters, what changes now.

The same rule applies to AI agents. Do not forward the whole transcript when a small packet (goal, constraints, evidence, next action) gives the next step what it needs.
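
As a rough sketch of such a packet, using only the Python standard library: the class name `HandoffPacket`, its fields, and the example contents are all hypothetical, chosen to mirror the four parts named above rather than any particular agent framework.

```python
from dataclasses import dataclass, field

@dataclass
class HandoffPacket:
    """The small packet one agent step passes to the next, instead of a transcript."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    next_action: str = ""

    def render(self) -> str:
        """Flatten the packet into a short prompt-ready block of text."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"Constraint: {c}" for c in self.constraints]
        lines += [f"Evidence: {e}" for e in self.evidence]
        lines.append(f"Next action: {self.next_action}")
        return "\n".join(lines)

# Hypothetical example contents.
packet = HandoffPacket(
    goal="Cut KV-cache memory for the long-context service",
    constraints=["No retraining", "No calibration set"],
    evidence=["Cache reads dominate latency at long context lengths"],
    next_action="Prototype rotation plus low-bit quantization on one layer",
)
print(packet.render())
```

The rendered packet is five lines long however long the conversation that produced it was.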

In ordinary life, compression is not forgetting. It is keeping enough to reconstruct, decide, or act. The practical question is simple:

What is the smallest version of this that still works?

Compression Is A Basis Problem

Left: a geometric form collapsing in the wrong basis, fragments falling like black glass. Right: the same form finding a clean compressed lattice aligned to bone-white axes, blood-red basis lines passing through both worlds

Wrong basis: deletion is damage. Right basis: deletion is clean work.

The most important word in compression may not be "small." It may be "basis."

In the wrong basis, deletion is damage. Remove coordinates, and meaning collapses. Round numbers, and retrieval breaks. Simplify a pattern, and the wall loses its fit.

In the right basis, deletion becomes clean work. The unnecessary detail separates from what matters. The representation can shrink because the system has found the form where the important parts are easiest to keep.

That is what the maker does with points and angles. That is what TurboQuant does with rotation and quantization.

The wall matters here because it shows the deeper move in physical form: replace exhaustive storage with a rule for how the pieces fit.

Memory holds the tiles as a picture.

Geometry holds the tiles as a pattern.

The geometry of compression is older than the machines now exploiting it. A wall can teach it by hand. A transformer can prove it at scale.

A dark scholarly constellation of bone-white research nodes orbiting a central compression rosette, connected by ash-grey lines and blood-red citation paths

References

1. Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874 [cs.LG]. To appear at ICLR 2026. DOI: 10.48550/arXiv.2504.19874. Source for the random orthogonal rotation, induced Beta law for normalized squared coordinates, data-oblivious scalar quantizer, two-stage MSE + 1-bit QJL inner-product estimator, KV-cache framing, the 3.5-bit absolute-quality-neutrality result, and the near-optimal distortion bound (within a constant factor of about 2.7 of the information-theoretic lower bound).
2. Bonner, J. (2017). Islamic Geometric Patterns: Their Historical Development and Traditional Methods of Construction (with a chapter by Craig Kaplan; foreword by Roger Penrose). Springer New York. ISBN: 978-1-4419-0216-0. DOI: 10.1007/978-1-4419-0217-7. Source for the polygonal technique and the local-rule, edge-and-angle grammar that generates zellige and broader Islamic geometric ornament.
3. DeMerchant, C. Moroccan Tiles Zellige from Fez. christinedemerchant.com/morocco_tile_zellige.html; Behind the Scenes Adventures. How Moroccan Mosaics are Made: The Art of Zellij. btsadventures.com/articles/how-moroccan-mosaics-are-made-the-art-of-zellij. Sources for workshop practice: Fes clay preparation, glazed square blanks, marking from sample shapes or guides, menqach cutting, bevelled edges, face-down assembly on a pattern, cement/plaster backing, flipping, and polishing.
4. Bonner, J. (2003). Three Traditions of Self-Similarity in Fourteenth and Fifteenth Century Islamic Geometric Ornament. In Meeting Alhambra: ISAMA-BRIDGES 2003 Conference Proceedings, pp. 1–12. archive.bridgesmathart.org/2003/bridges2003-1.html. Source for the claim that Islamic geometric ornament achieves striking visual complexity from a small set of construction rules expanded into a physical surface.
5. Zandieh, A., Daliri, M., & Han, I. (2025). QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25805–25813. DOI: 10.1609/aaai.v39i24.34773. arXiv:2406.03482. Source for the 1-bit Quantized Johnson-Lindenstrauss residual estimator that TurboQuant uses for inner products, including its zero-overhead, unbiased inner-product property and the originally reported "more than fivefold" KV-cache reduction at 3 bits without accuracy loss.
6. Zandieh, A., & Mirrokni, V. (2026, March 24). TurboQuant: Redefining AI efficiency with extreme compression. Google Research Blog. research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression. Source for the long-context benchmark suite (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, L-Eval), the "at least 6x" KV-memory reduction on long-context retrieval tasks, and the up-to-8x speedup on attention-logit computation at 4 bits versus 32-bit unquantized keys on H100 GPUs.