The Geometry of Compression: Why Angles Beat Naive Storage
The best compression does not remember every piece. It remembers how the pieces still fit.
The Core Idea
- The wall. A zellige maker in Fes builds a vast mosaic from a small set of shapes, angles, and edges, working from rules rather than memory.
- The cache. A long-context transformer stores key and value vectors for every past token. Naive storage collapses under its own weight.
- The fix. TurboQuant rotates those vectors into a better coordinate frame, drops precision to about 3.5 bits per channel, and preserves the alignments attention actually uses.
Why It Matters
The practical question for any system pressed for memory: what must still fit once the representation gets smaller? Good compression keeps the rule that produces the surface, not the surface itself.
The best compression is geometry, not picture.
A zellige wall is not stored in anyone's head tile by tile.
Zellige (الزليج) is traditional Moroccan mosaic work: small ceramic tiles, cut by hand, fitted into stars, crosses, knots, and repeating fields. From a distance it looks like a single ornate image. Up close, it is a puzzle of edges and angles.
The wall comes from a few instructions: draw this grid, cut these shapes, repeat this angle, let edges meet.
The wall is the output.
The same idea sits inside one of AI's expensive bottlenecks: the key-value cache of a long-context transformer.
A transformer is the network behind most language models; its cache is a memory shelf for traces of every previous token.
As the conversation grows, the shelf fills. Reading from it gets slower and costlier.
In 2025, Google's TurboQuant paper showed a way to shrink that shelf without breaking attention.[1]
It rotates the vectors into a new coordinate frame, then stores them with fewer bits. What matters are the comparisons attention uses; individual numbers can drift.
Different material, same principle: the best compression keeps how the pieces fit, not the pieces themselves.
The Wall Is Geometry, Not Picture
Made from rules: each tile fits because its edge, angle, and orientation answer the tile beside it.
Look at a traditional Moroccan zellige mosaic and the eye gets lost. Stars open inside stars, edges interlock, symmetries repeat, yet the surface never feels mechanical. The first instinct is to read the wall as an image, a complicated arrangement of coloured pieces, but that is not the most efficient level of description.
Zellige is built by method.
The maker works from a geometric guide of star polygons and grids, cutting a small alphabet of shapes that answer each other at the edge.[2]
In a Fes workshop, the pattern starts as construction geometry: a circle stepped out with a compass, lines drawn across it, stars and grids emerging from the intersections.[2]
A glazed tile is held in one hand and marked from a sample on the back. It rests on a small iron anvil; the cutter strikes along the line with the menqach, a short, sharp, double-ended chisel-hammer swung close to the body.
Tap, turn, tap, turn, until a star point, triangle, cross, or kite comes free.
The maker measures almost nothing. Precision lives in the repeated shape, the trained hand, and the edge of the menqach.[3]
The tiler never memorises the whole surface. Once the design and repeat are chosen, geometry extends the pattern as the wall grows.
The compression lives in how the pattern is made: a small rulebook of points, edges, angles, and repetitions, expanded into ceramic.
The Recipe Is Smaller Than The Surface
A tiny recipe, an immense surface.
Describe the wall pixel by pixel and the description becomes enormous. Describe the rule instead, and the surface becomes cheap.
Use this grid. Cut this family of shapes. Rotate here. Reflect there. Repeat this unit. Match these edges. Keep this angle.
That is a recipe for making the wall, not a low-resolution summary of it. Given the rule, the pattern can be extended, repaired, taught, or rebuilt.
The useful distinction:
- Bad compression saves decoration.
- Good compression saves function.
- Great compression saves the rule that can make the thing again.
In Islamic ornament the compression is physical: a small rule expands into ceramic.[4]
The Cache Is Also A Wall
The cache is storage. The map is what attention reads.
A long-context transformer has its own version of too many tiles.
For every token, the model leaves a key and a value on the shelf.[1]
The longer the context, the heavier the cache grows, until moving it through memory costs more than the computation itself.
The obvious fix is to round the numbers down and hope the model survives.
But attention doesn't compare numbers one by one. It reads the inner product: one score for how well a new question lines up with each stored key.
Damage that score and the cache shrinks, but the map is gone. Colours preserved, angles ruined.
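To make "one score per stored key" concrete, here is a minimal sketch of how a single query is compared against a small cache. The function name `attention_weights` and the toy vectors are assumptions made for illustration; real models do this per head, in batches, on accelerators.

```python
import math

def attention_weights(query, keys):
    """Softmax over scaled inner products between one query and each stored key."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return scores, [e / total for e in exps]

query = [1.0, 0.0, 1.0, 0.0]
keys = [
    [0.9, 0.1, 1.1, 0.0],    # closely aligned with the query
    [0.0, 1.0, 0.0, 1.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0, 0.0],  # pointing the opposite way
]
scores, weights = attention_weights(query, keys)
print(scores, weights)
```

Only the inner products enter the softmax. A compressed key can hold different numbers from the original and still be a good key, as long as its score against the query barely moves.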
TurboQuant Rotates The Wall
Rotate first, then cut away: the coordinate frame decides what can disappear.
TurboQuant's move is to change the basis before deleting precision.[1]
First, it rotates the vector without stretching it, like turning a drawing on transparent paper.
Same shape, weight spread more evenly across the axes.
After a random rotation, the squared coordinates of a unit-norm vector follow a known Beta law that depends only on dimension, not on the dataset.[1]
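The rotation step can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the orthogonal matrix is built here by Gram-Schmidt on Gaussian vectors, and the helper names (`random_orthogonal`, `rotate`) and the toy key with a single outlier channel are hypothetical.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def random_orthogonal(d, rng):
    """Build d orthonormal rows by Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:  # subtract the part of v that lies along earlier rows
            p = dot(v, r)
            v = [a - p * b for a, b in zip(v, r)]
        n = math.sqrt(dot(v, v))
        if n > 1e-9:  # skip a degenerate draw (vanishingly rare)
            rows.append([a / n for a in v])
    return rows

def rotate(matrix, x):
    return [dot(row, x) for row in matrix]

rng = random.Random(0)
d = 32
R = random_orthogonal(d, rng)

key = [8.0] + [0.1] * (d - 1)  # one outlier channel carries almost all the weight
query = [rng.gauss(0.0, 1.0) for _ in range(d)]
key_rot, query_rot = rotate(R, key), rotate(R, query)

norm_before, norm_after = math.sqrt(dot(key, key)), math.sqrt(dot(key_rot, key_rot))
score_before, score_after = dot(query, key), dot(query_rot, key_rot)
peak_before = max(abs(a) for a in key)
peak_after = max(abs(a) for a in key_rot)
print(norm_before, norm_after, score_before, score_after, peak_before, peak_after)
```

Length and inner product come out unchanged up to floating-point error, while the largest single coordinate shrinks because the outlier's weight is shared across all axes. That is the property the later rounding step relies on.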
Then it quantizes against that distribution. The codebook is prepared in advance.
No calibration set. No retraining. No special case for every "wall".
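Because the coordinate distribution depends only on the dimension, a codebook can be computed once and reused for every vector. The sketch below approximates that idea with Monte Carlo samples and a plain one-dimensional Lloyd iteration; both are assumptions standing in for the analytic construction, and the names `sample_unit_coordinates`, `lloyd_codebook`, and `quantize` are hypothetical.

```python
import math
import random

def sample_unit_coordinates(d, n, rng):
    """First coordinate of n random unit vectors in d dimensions."""
    out = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        out.append(v[0] / math.sqrt(sum(a * a for a in v)))
    return out

def lloyd_codebook(samples, bits, iterations=20):
    """1-D Lloyd iteration: assign to nearest centroid, then re-average."""
    levels = 2 ** bits
    data = sorted(samples)
    # Start from evenly spaced quantiles of the samples.
    centroids = [data[(2 * i + 1) * len(data) // (2 * levels)] for i in range(levels)]
    for _ in range(iterations):
        buckets = [[] for _ in range(levels)]
        for x in data:
            idx = min(range(levels), key=lambda i: abs(x - centroids[i]))
            buckets[idx].append(x)
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def quantize(x, centroids):
    """Index of the nearest codebook entry; this index is all that gets stored."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

rng = random.Random(0)
d, bits = 64, 2
samples = sample_unit_coordinates(d, 4000, rng)
codebook = lloyd_codebook(samples, bits)  # depends on d and bits, not on any dataset

index = quantize(samples[0], codebook)
mse = sum((x - codebook[quantize(x, codebook)]) ** 2 for x in samples) / len(samples)
variance = sum(x * x for x in samples) / len(samples)
print(codebook, index, mse / variance)
```

Nothing in `lloyd_codebook` looks at model activations. The only inputs are the dimension and the bit budget, which is why no calibration pass is needed in this toy version.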
Finally, for inner products, TurboQuant adds a one-bit residual correction through a Quantized Johnson-Lindenstrauss transform. Heavy name, simple job: keep alignments usable in a smaller form. The score stays accurate while individual numbers drift.[5]
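The one-bit idea can be shown on its own. In this sketch each random Gaussian projection of the key is reduced to its sign, and the inner product with an unquantized query is recovered by averaging. It is applied here to a raw key rather than to a quantization residual, the projection count `m` is deliberately much larger than a real cache would use so the averaging is visible, and the function names are hypothetical.

```python
import math
import random

def qjl_encode(key, projections):
    """Keep one sign bit per random projection of the key, plus the key's norm."""
    signs = [1 if sum(s * k for s, k in zip(row, key)) >= 0 else -1 for row in projections]
    norm = math.sqrt(sum(k * k for k in key))
    return signs, norm

def qjl_inner_product(query, signs, norm, projections):
    """Estimate <query, key> as sqrt(pi/2) * norm * mean(sign_i * <s_i, query>)."""
    m = len(projections)
    total = sum(
        sign * sum(s * q for s, q in zip(row, query))
        for sign, row in zip(signs, projections)
    )
    return math.sqrt(math.pi / 2) * norm * total / m

rng = random.Random(0)
d, m = 16, 4000  # m is oversized on purpose; see the note above
projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
key = [rng.gauss(0.0, 1.0) for _ in range(d)]
query = [k + 0.5 * rng.gauss(0.0, 1.0) for k in key]

signs, norm = qjl_encode(key, projections)
estimate = qjl_inner_product(query, signs, norm, projections)
exact = sum(q * k for q, k in zip(query, key))
print(exact, estimate)
```

The factor sqrt(pi/2) comes from the mean absolute value of a standard Gaussian, sqrt(2/pi), and it makes the estimate unbiased. Individual sign bits say almost nothing about the key's numbers, yet their average tracks the score.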
The result:
- KV caches compress to about 3.5 bits per coordinate with absolute quality neutrality on the reported long-context benchmarks
- More than a 4x reduction over fp16 storage (16 / 3.5 ≈ 4.6), and at least 6x memory savings on long-context retrieval tasks at lower bit widths[6]
- Distortion stays within a constant factor of about 2.7 of the information-theoretic lower bound[1]
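The storage arithmetic behind the first two bullets is simple enough to check by hand. The sketch below assumes a hypothetical model shape (32 layers, 8 key-value heads, head dimension 128, 128,000 tokens) and ignores any per-vector metadata such as stored norms, so the figures are illustrative rather than measured.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes for keys plus values across all layers at a given precision."""
    values = 2 * tokens * layers * kv_heads * head_dim  # 2 = one key and one value per token
    return values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(tokens=128_000, layers=32, kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(bits_per_value=16, **shape)
compressed = kv_cache_bytes(bits_per_value=3.5, **shape)
ratio = fp16 / compressed

print(f"fp16: {fp16 / 2**30:.2f} GiB")
print(f"3.5-bit: {compressed / 2**30:.2f} GiB")
print(f"ratio: {ratio:.2f}x")
```

The ratio depends only on the bit widths, 16 / 3.5, so it holds for any model shape. What the shape changes is how many gigabytes that ratio frees.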
The trick is not deleting harder, but finding a basis where deletion hurts less.
The win lands hardest in agentic work. Coding agents, research agents, branching planners: each one builds a cache faster than the conversation grows. Shrink it by a factor of four or five and the same GPU runs more agents, or keeps one alive far longer. Because nothing is calibrated to a dataset, the compression keeps working when the workload shifts.
The numbers have receipts. On the test where a model has to find a hidden sentence in a long document, TurboQuant matches the uncompressed model with about a quarter of the memory.[1] The trickiest piece of the recipe was published a year earlier and runs in public code today.[5]
The worst-case loss is bounded by a mathematical proof, not a hopeful chart.
What Survives Deletion
Deletion reveals the skeleton: the few lines that were doing the real work all along.
Compression is usually discussed as storage. The more useful question is what survived.
For the wall, what survives is angle and edge. For TurboQuant, what survives is inner-product geometry. In both, the representation keeps the geometry of use rather than the surface of it. Different jobs require different survivors, but the question has the same shape:
What must remain true after the representation gets smaller?
A JPEG keeps what the eye notices. A tile geometry keeps the rule that builds the surface. A vector quantizer keeps a small codebook of typical shapes. TurboQuant's answer is narrower and unusually clean: keep the inner products useful, and the cache can get much smaller.
The Smallest Version That Still Works
From the noise, the smallest working shape.
The same idea works far beyond tiles and transformers.
To some, handling complexity means carrying more detail: longer notes, longer plans, more screenshots, more saved tabs. The pile usually makes the problem heavier instead of clearer.
The better move is to ask what must survive:
- A meeting record is goal, constraint, decision, next action. The transcript is overhead.
- A room holds the few objects that keep it working; the rest can go.
- A message is what happened, why it matters, what changes now. Strip everything else.
In ordinary life, compression is keeping enough to reconstruct, decide, or act.
The practical question is simple:
What is the smallest version of this that still works?
Rotate Before You Cut
Cut without rotating: damage. Rotate first: clean work.
The most important step in compression may not be "cut." It may be "rotate."
Cut before you rotate, and damage follows. Drop coordinates, and meaning collapses. Round numbers, and retrieval breaks. Strip a pattern, and the wall loses its fit.
Rotate before you cut, and the work comes clean. The unnecessary detail separates from what matters. The representation can shrink because the system has found the form where the important parts are easiest to keep.
That is what the maker does with points and angles. That is what TurboQuant does with rotation and quantization.
Memory holds the tiles as a picture.
Geometry holds the tiles as a pattern.
The geometry of compression is older than the machines now exploiting it. A transformer can prove it at scale.
Anyone who played Tetris in the 1990s already had the intuition: check the queue, rotate first, drop second. Calculate the moves, and you already know which piece isn't going to fit. It's always the S piece that ruins it.
References
1. Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874 [cs.LG]. To appear at ICLR 2026. DOI: 10.48550/arXiv.2504.19874. Source for the random orthogonal rotation, induced Beta law on the squared coordinates of unit-norm vectors, data-oblivious scalar quantizer, two-stage MSE + 1-bit QJL inner-product estimator, KV-cache framing, the 3.5-bit absolute-quality-neutrality result, and the near-optimal distortion bound (within a constant factor of about 2.7 of the information-theoretic lower bound).
2. Bonner, J. (2017). Islamic Geometric Patterns: Their Historical Development and Traditional Methods of Construction (with a chapter by Craig Kaplan; foreword by Roger Penrose). Springer New York. ISBN: 978-1-4419-0216-0. DOI: 10.1007/978-1-4419-0217-7. Source for the polygonal technique and the local-rule, edge-and-angle grammar that generates zellige and broader Islamic geometric ornament.
3. DeMerchant, C. Moroccan Tiles Zellige from Fez. christinedemerchant.com/morocco_tile_zellige.html; Behind the Scenes Adventures. How Moroccan Mosaics are Made: The Art of Zellij. btsadventures.com/articles/how-moroccan-mosaics-are-made-the-art-of-zellij. Sources for workshop practice: Fes clay preparation, glazed square blanks, marking from sample shapes or guides, and menqach cutting on an iron anvil.
4. Bonner, J. (2003). Three Traditions of Self-Similarity in Fourteenth and Fifteenth Century Islamic Geometric Ornament. In Meeting Alhambra: ISAMA-BRIDGES 2003 Conference Proceedings, pp. 1–12. archive.bridgesmathart.org/2003/bridges2003-1.html. Source for the claim that Islamic geometric ornament achieves striking visual complexity from a small set of construction rules expanded into a physical surface.
5. Zandieh, A., Daliri, M., & Han, I. (2025). QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25805–25813. DOI: 10.1609/aaai.v39i24.34773. arXiv:2406.03482. Source for the 1-bit Quantized Johnson-Lindenstrauss residual estimator that TurboQuant uses for inner products, including its zero-overhead, unbiased inner-product property and the originally reported "more than fivefold" KV-cache reduction at 3 bits without accuracy loss.
6. Zandieh, A., & Mirrokni, V. (2026, March 24). TurboQuant: Redefining AI efficiency with extreme compression. Google Research Blog. research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression. Source for the long-context benchmark suite (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, L-Eval), the "at least 6x" KV-memory reduction on long-context retrieval tasks, and the up-to-8x speedup on attention-logit computation at 4 bits versus 32-bit unquantized keys on H100 GPUs.
License
This work is licensed under a Creative Commons Attribution 4.0 International License.