ArticlesProjectsWeeklyCredentialsAbout
subqsparse-attentionsub-quadraticssatransformersattention-mechanismlong-contextllm-architectureflashattentionlinear-attentiondeep-learning

Inside SubQ: How a Fully Sub-Quadratic Sparse-Attention Architecture Actually Works

·15 min read

On 5 May 2026, Subquadratic Inc. announced SubQ — the first frontier-class large language model built entirely on a sub-quadratic attention architecture. It processes 12 million tokens in a single context window at one-fifth the cost of comparable models, with benchmark numbers that match or beat Claude Opus and GPT at long-context tasks. The technical report is still pending, but the company has published enough benchmarks and architectural description to understand the problem they're solving and why it's hard.

This article is about the mechanics: what quadratic attention actually costs, why every previous sub-quadratic approach broke quality at scale, and what Sparse Attention Architecture (SSA) appears to do differently.


The Quadratic Wall

Every transformer you've ever used — GPT-4, Claude, Gemini, Llama — has attention at its core. The calculation looks like this:

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

where Q,K,VRn×dkQ, K, V \in \mathbb{R}^{n \times d_k} are the query, key, and value matrices for a sequence of nn tokens. The term QKQK^\top is a matrix multiply that produces an n×nn \times n score matrix — one score for every pair of tokens in the sequence.

That n×nn \times n matrix is the problem. Its size is O(n2)O(n^2). For a sequence of:

  • 4,096 tokens → 16 million attention pairs
  • 128,000 tokens (Claude's "128K context") → 16 billion attention pairs
  • 1,000,000 tokens → 1 trillion attention pairs
  • 12,000,000 tokens (SubQ's research result) → 144 trillion attention pairs

The compute grows with the square of sequence length. Doubling the context quadruples the attention cost. This is not a constant you can engineer away — it's the mathematical structure of full self-attention.

What This Looks Like in Practice

The quadratic cost creates a hard economic ceiling. At 1 million tokens with standard attention, you're paying 10,000x as much compute for attention compared to a 10,000-token context. This is why "long-context" models are so expensive, and why performance degrades at extreme lengths: the model never actually trains meaningfully on million-token inputs because the GPU memory and compute required to process such examples during training are prohibitive.

FlashAttention (Dao et al., 2022) was a major engineering breakthrough that made O(n2)O(n^2) attention fast in practice through IO-aware tiling — reordering memory reads/writes to reduce HBM bandwidth usage. It made the constant factor much smaller. But the algorithmic complexity stayed O(n2)O(n^2). FlashAttention is fast quadratic attention. Sub-quadratic is a different problem.

Why O(n2)O(n^2) memory is the binding constraint

The attention score matrix requires O(n2)O(n^2) memory even before you materialise the softmax outputs. At n=1,000,000n = 1{,}000{,}000 with float16, the score matrix alone is 2×(106)222 \times (10^6)^2 \approx 2 TB. No current GPU has 2 TB of VRAM. FlashAttention sidesteps this by never materialising the full matrix — it computes tiles in SRAM — but the underlying compute is still quadratic. Getting to genuinely sub-quadratic attention requires changing the algorithm, not just the implementation.


Five Years of Sub-Quadratic Attempts

The research community has understood this problem since 2020. A whole family of approaches tried to break the quadratic barrier, and each made a different tradeoff.

Sparse Pattern Attention (Longformer, BigBird — 2020)

Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) both observe that most tokens don't need to attend to every other token. Longformer uses a sliding window: each token attends to a fixed window of ww neighbours, reducing attention to O(nw)O(nw). For long documents, wnw \ll n, so this is effectively linear. BigBird extends this with a mix of local window attention, random attention (a few random tokens per query), and global attention (a small set of tokens that attend to everything).

These work well for encoder-only models (BERT-style) on NLP benchmarks. They don't work well for autoregressive generation at frontier quality: the sparse patterns are fixed by design and can't learn which pairs actually matter for a given input.

Linear Attention via Kernel Approximation (2020–2022)

A cleaner theoretical approach rewrites the softmax attention as a kernel function approximation. If you approximate softmax(qikj)ϕ(qi)ϕ(kj)\text{softmax}(q_i k_j^\top) \approx \phi(q_i)^\top \phi(k_j) for some feature map ϕ\phi, then:

Attention(Q,K,V)i=ϕ(Qi)j=1nϕ(Kj)Vjϕ(Qi)j=1nϕ(Kj)\text{Attention}(Q, K, V)_i = \frac{\phi(Q_i)^\top \sum_{j=1}^n \phi(K_j) V_j^\top}{\phi(Q_i)^\top \sum_{j=1}^n \phi(K_j)}

This lets you compute jϕ(Kj)Vj\sum_j \phi(K_j) V_j^\top once for the whole sequence and then look it up for each query — reducing complexity to O(nd2)O(n \cdot d^2), which is linear in nn. Performers (Choromanski et al., 2020) and several related papers use this approach.

The catch: the kernel approximation of softmax is lossy. For low-rank approximations to work, the attention pattern must be smooth. Empirically, frontier-scale transformer attention patterns are not smooth — they're spiky, concentrated on specific tokens, with sharp peaks. The kernel trick degrades quality substantially on tasks that depend on exact retrieval from context.

State Space Models (Mamba — 2023)

Mamba (Gu and Dao, 2023) takes a fundamentally different route: instead of attention at all, use a selective state space model (SSM) with input-dependent transition matrices. Processing is sequential (linear time, linear memory) and the model compresses context into a fixed-size hidden state rather than keeping all past tokens available.

This achieves O(n)O(n) training compute and O(1)O(1) inference state — remarkable. The limitation is compression: a finite hidden state can only remember a finite amount from the past. At very long context, information from early in the sequence is necessarily forgotten. For tasks that require exact retrieval of specific tokens from millions of tokens ago, SSMs degrade. This is measurable on the RULER and MRCR benchmarks.


What SubQ Is Claiming

SubQ's claim is that their Sparse Attention Architecture (SSA) achieves genuine sub-quadratic (linear-with-context) scaling without making either of the typical concessions: it doesn't use fixed sparsity patterns (so it can attend to whichever tokens actually matter), and it doesn't compress history into a fixed state (so it maintains exact retrieval capability).

The company describes it as "a ground-up redesign of how attention works, built to be subquadratic from first principles." The formal technical report is forthcoming. But the benchmarks are instructive:

RULER 128K (13 sub-tasks requiring long-context reasoning): SubQ 95.0% vs. Claude Opus 4.6 94.8%
MRCR v2 at 1M tokens (multi-round coreference, 8 needles): SubQ 65.9% vs. GPT-5.5 74.0%, Claude 3.5 78.3%
SWE-Bench Verified (real-world software engineering): SubQ 81.8%
Speed vs. FlashAttention at 1M tokens: 52× faster with 63% less compute

The MRCR v2 score is the most diagnostic. This test hides specific named references across a million-token context and asks the model to track them — it tests exact retrieval, which is where compression-based approaches fail. 65.9% is below some competitors but far above Gemini 3.1 Pro (26.3%) and GPT-4o (32.2%), and competitive with SSM-hybrid models. The 83% research-level number (not yet third-party validated) would place it comfortably at the frontier.


How SSA Likely Works: The Mechanics

Since the technical paper isn't published yet, what follows is grounded in what the company has disclosed and in the established literature on learned sparse attention.

The core insight is that attention weights in trained transformers are empirically sparse. In a model trained at scale, the average token attends strongly to a small set of other tokens, with most attention weights near zero after softmax. The mathematical structure is quadratic, but the information in the attention patterns is sparse.

Learned dynamic sparsity is the plausible mechanism: instead of pre-defining which tokens attend to which, the model learns a lightweight routing function that predicts, for each query token, which kk key tokens are most likely to receive high attention weight. Then only those kk pairs are computed. If knk \ll n, the complexity drops from O(n2)O(n^2) to O(nk)O(nk).

The difficulty — and this is where all previous learned sparse attention approaches failed — is making the routing accurate before computing the attention scores. To know which tokens matter for a query, you nominally need to compute the attention score against all tokens, which puts you back at O(n2)O(n^2). The routing must be a cheaper proxy.

Possible approaches include:

  • Product quantisation of key vectors: cluster keys into n\sqrt{n} buckets, route each query to the relevant buckets. This is O(nn)O(n\sqrt{n}) — sub-quadratic but not linear.
  • Locality-sensitive hashing (LSH): hash queries and keys into buckets such that similar vectors collide with high probability (used in Reformer, 2020). Bucketed attention is approximately O(nlogn)O(n \log n) but the hash collision quality degrades for transformers trained at scale.
  • Two-pass attention: a small lightweight "scout" network operates on downsampled representations to identify relevant token positions, followed by exact attention on only those positions. The scout is cheap; the full attention pays only for the identified pairs.
  • Hierarchical/multi-scale compression: build a tree of summary tokens — each level summarises blocks from the level below. Attend through the hierarchy, expanding only the branches that score high. Access pattern is O(nlogn)O(n \log n) or better.

What SubQ appears to have done — and this is the key advance the company emphasises — is make the routing good enough that the model doesn't degrade relative to dense attention at frontier quality. All the approaches above were known. Making them work at scale without quality loss is the research contribution.

The 1000×1000\times attention compute reduction at 12M tokens is consistent with a routing sparsity of about k1000k \approx 1000 — attending to roughly 1,000 tokens per query out of 12,000,000. Whether the routing achieves this while preserving recall on hard retrieval tasks is exactly what the MRCR v2 benchmark probes.


The Memory Complexity Gain

Beyond compute, sub-quadratic attention solves a related but distinct problem: KV cache memory during inference.

In autoregressive generation, the model maintains a key-value cache of every previously generated token. For dense attention, this cache grows as O(ndL)O(n \cdot d \cdot L) where LL is the number of layers. At 12M tokens, this is prohibitive. Sparse attention reduces the effective cache to the routing neighbourhood of the current query — meaning generation at 12M tokens doesn't require storing all 12M KV vectors in hot SRAM simultaneously.

SubQ quotes 150 tokens/second at a 12M context. That rate implies the system is operating efficiently within the memory envelope of current hardware, which would not be possible with dense attention at that context length regardless of FLOPs.


The Code: A Sparse Attention Simulator

The companion project (subq-sparse-attention-architecture) implements the core ideas: dense attention, two styles of sparse attention (fixed sliding window and learned top-kk routing), and a benchmark comparing their compute usage and recall quality against a held-out retrieval test.

import math
import torch
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class AttentionConfig:
    d_model: int = 64
    n_heads: int = 4
    top_k: int = 32          # tokens each query attends to in sparse mode
    window_size: int = 64    # for sliding-window baseline

def dense_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Standard O(n²) attention — all pairs computed."""
    scale = math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) / scale        # (B, H, N, N)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)                               # (B, H, N, D)

def sparse_topk_attention(
    q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, top_k: int
) -> torch.Tensor:
    """
    Learned top-k sparse attention.

    For each query, compute a cheap dot-product score against all keys,
    select the top-k keys, then run exact softmax attention over only
    those k pairs.  Complexity: O(n·k) for the gather + exact attn step,
    plus O(n²) for the initial score — so a two-pass approach reduces cost
    only when k << n AND the routing pass is cheaper (e.g. lower precision,
    downsampled keys).  Here we show the structure clearly; the companion
    project uses a lightweight learnable router.
    """
    B, H, N, D = q.shape
    scale = math.sqrt(D)

    # Routing pass: dot-product scores for all pairs (conceptually the
    # expensive step — in a real SSA this is approximated cheaply)
    routing_scores = torch.matmul(q, k.transpose(-2, -1)) / scale  # (B,H,N,N)

    # Select top-k key indices for each query
    topk_vals, topk_idx = routing_scores.topk(min(top_k, N), dim=-1)  # (B,H,N,k)

    # Gather the selected key and value vectors
    idx_expanded = topk_idx.unsqueeze(-1).expand(-1, -1, -1, D)     # (B,H,N,k,D)
    k_sel = k.unsqueeze(2).expand(-1, -1, N, -1, -1).gather(3, idx_expanded)
    v_sel = v.unsqueeze(2).expand(-1, -1, N, -1, -1).gather(3, idx_expanded)

    # Exact softmax over the k selected pairs
    weights = F.softmax(topk_vals, dim=-1)                           # (B,H,N,k)
    out = torch.einsum('bhnk,bhnkd->bhnd', weights, v_sel)          # (B,H,N,D)
    return out

def measure_retrieval_recall(
    seq_len: int, top_k: int, n_needles: int = 8, d: int = 64, seed: int = 42
) -> dict:
    """
    Plant `n_needles` random token embeddings in a long sequence, then
    check whether top-k sparse attention retrieves them at query time.
    Returns recall for dense vs. sparse attention.
    """
    torch.manual_seed(seed)
    q = torch.randn(1, 1, seq_len, d)
    k = torch.randn(1, 1, seq_len, d)
    v = torch.randn(1, 1, seq_len, d)

    # Inject needles: make some keys very similar to specific queries
    needle_pos = torch.randint(0, seq_len, (n_needles,))
    for i, pos in enumerate(needle_pos):
        k[0, 0, pos] = q[0, 0, pos] * 5.0    # high-similarity pair

    # Run both attention variants
    out_dense = dense_attention(q, k, v)
    out_sparse = sparse_topk_attention(q, k, v, top_k=top_k)

    # Recall: fraction of needle positions where sparse ≈ dense output
    mse = F.mse_loss(out_sparse[0, 0, needle_pos], out_dense[0, 0, needle_pos])
    return {
        "seq_len": seq_len,
        "top_k": top_k,
        "sparsity": top_k / seq_len,
        "needle_mse": mse.item(),
    }

Running python sparse_attention.py --benchmark produces a table showing retrieval MSE at increasing sparsity levels, demonstrating the quality cliff that plagued earlier sparse attention approaches and how learned routing pushes that cliff much further out.


What the Benchmarks Actually Tell Us

Reading the SubQ numbers carefully reveals both what they've solved and where challenges remain.

RULER 128K at 95% is strong evidence of correct information routing at 128K tokens. The 13 RULER sub-tasks include needle-in-a-haystack, variable tracking, and multi-hop QA — they comprehensively probe long-context recall. 95% matches the best dense-attention models, which means the sparse routing is reliably finding the right tokens in contexts up to 128K.

MRCR v2 at 1M tokens: 65.9% (production) vs. 83% (research) reveals the harder problem. The gap between the research model (which likely operates at lower throughput or higher compute per token) and the production model (optimised for cost and speed) is about 17 percentage points. This is honest and informative — it shows that scaling the context to 1M while maintaining production-grade cost/speed constraints costs measurable quality. Whether this narrows as the team optimises is the open question.

The 52× speed advantage over FlashAttention at 1M tokens with 63% less compute is the claim that demands scrutiny from the research community. FlashAttention is already the fastest attention implementation known; 52× faster implies genuine algorithmic gains, not just engineering. If the routing is genuinely O(n)O(n) with a small constant, these numbers are plausible at extreme context lengths where the n2n^2 term dominates.


Why This Matters

The quadratic attention ceiling wasn't just a performance limitation — it was a design constraint for every AI application built in the last decade. RAG pipelines exist because you can't fit a full corpus into a context window. Chunking strategies exist because models degrade at long context. Multi-agent orchestration exists in part as a workaround for single-agent context limits.

Sub-quadratic attention at frontier quality doesn't just make existing applications cheaper. It removes the engineering constraint that shaped how those applications were designed in the first place. Entire codebases, months of conversation history, large database schemas, long-running agent state — these become first-class inputs rather than things to be managed around.

The technical report will tell us how the routing works and what the actual scaling laws look like. Until then, the benchmarks give enough signal: the quality is there, the speed is real, and the context window is at least one order of magnitude beyond the current frontier. Whether the specific mechanism generalises to 50M+ tokens — as the company suggests is their trajectory — remains to be seen.

But for the first time since the original transformer paper, there's a credible architectural alternative that doesn't ask you to choose between scale and quality.


References

  • Vaswani et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30. — The original transformer paper. Introduces scaled dot-product attention and the QKQK^\top operation.

  • Beltagy et al. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150. — Sliding window + global attention; first practical O(n)O(n) attention for long documents.

  • Zaheer et al. (2020). Big Bird: Transformers for Longer Sequences. NeurIPS 2020; arXiv:2007.14062. — Sparse attention combining random, local, and global patterns; proves universal approximation is preserved.

  • Choromanski et al. (2020). Rethinking Attention with Performers. ICLR 2021. — FAVOR+ kernel approximation of softmax, reducing attention to O(n)O(n) via random feature maps.

  • Kitaev et al. (2020). Reformer: The Efficient Transformer. ICLR 2020. — LSH-based approximate attention; O(nlogn)O(n \log n) via locality-sensitive hashing.

  • Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. — IO-aware tiling for exact O(n2)O(n^2) attention; the current SOTA implementation baseline SubQ compares against.

  • Gu and Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752. — Selective SSM architecture; O(n)O(n) compute but with compression tradeoffs on exact retrieval tasks.

  • Dangel, J. (2026, May 5). Introducing SubQ: The First Fully Subquadratic LLM. subq.ai/introducing-subq. — The product announcement with benchmark methodology and architecture overview.