
Research · Long-context

1M tokens. Same hardware.

Three years of research on attention sparsity, KV-cache compression, and prefix caching, packaged into the long-context endpoints in our managed inference product. The result is million-token context windows served on the same hardware at roughly 1.4× the cost of 8K-token serving.

Headline result

6.2x compute reduction. 99% of dense-attention quality.

Compute reduction
6.2×
At 1M-token context, vs dense FlashAttention.
RULER quality
99.1%
Of dense-attention RULER score.
Throughput
3,840
Tokens/sec on a single B200, 1M-token prefill.
Cost @ 1M ctx
1.4×
Of the same model at 8K context. Not 100x.

Approach

Three pieces of the puzzle.

Learned sparse attention

A workload-calibrated sparsity pattern that drops attention compute by 6x at 1M tokens while preserving 99% of dense quality on RULER. Calibration uses 200 sample completions per workload class.

NeurIPS 2024
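
The calibrated pattern itself is not spelled out here, but the mechanics are easy to sketch. The toy NumPy version below averages dense attention maps from a handful of sample completions (the calibration set described above) into per-block attention mass, keeps the heaviest key blocks for each query block, and then evaluates attention only where the resulting mask allows. The block size, per-row budget, function names, and the assumption that sequence length divides evenly into blocks are ours for illustration; a production kernel skips the masked blocks rather than computing and discarding them.

learned_sparse_sketch.py
import numpy as np

BLOCK = 64          # illustrative block size
KEEP_PER_ROW = 8    # illustrative budget: key blocks kept per query block

def calibrate_block_mask(attn_prob_samples, block=BLOCK, keep=KEEP_PER_ROW):
    # Average dense attention mass per (query block, key block) over the
    # sample completions, then keep the heaviest `keep` key blocks per row.
    n = attn_prob_samples[0].shape[0]
    nb = n // block
    mass = np.zeros((nb, nb))
    for probs in attn_prob_samples:
        mass += probs.reshape(nb, block, nb, block).sum(axis=(1, 3))
    keep_idx = np.argsort(-mass, axis=1)[:, :keep]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, keep_idx, True, axis=1)
    return mask

def sparse_attention(q, k, v, block_mask, block=BLOCK):
    # Reference implementation: compute dense scores, then apply the mask.
    # A real kernel never materializes the masked blocks in the first place.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    elem_mask = np.kron(block_mask.astype(np.uint8),
                        np.ones((block, block), dtype=np.uint8)).astype(bool)
    scores = np.where(elem_mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v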

KV-cache compression

Layer-wise bit allocation derived from a sensitivity analysis drives mixed FP8/INT4 quantization of the KV cache. The memory footprint drops 3.4× with no measurable MT-Bench regression.

MLSys 2025
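
As a rough illustration of what layer-wise allocation looks like, the toy NumPy version below uses a per-layer sensitivity score (measured offline, for example the quality drop when only that layer's cache is quantized) to decide which layers keep a higher-precision format, and quantizes the rest to symmetric INT4 with per-token scales. The 25% high-precision budget, the FP16 stand-in for FP8, and the function names are assumptions for the sketch, not the published allocation.

kv_quant_sketch.py
import numpy as np

def allocate_bits(sensitivity, high_precision_budget=0.25):
    # Give the most sensitive quarter of layers the high-precision format.
    cutoff = np.quantile(sensitivity, 1.0 - high_precision_budget)
    return ["fp8" if s >= cutoff else "int4" for s in sensitivity]

def quantize_int4(x):
    # Symmetric per-token quantization: one scale per row, values in [-7, 7].
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale

# Example: compress each layer's K cache according to its allocation.
rng = np.random.default_rng(0)
k_cache = [rng.standard_normal((1024, 128)).astype(np.float32) for _ in range(8)]
sensitivity = rng.random(8)                  # stand-in for measured scores
formats = allocate_bits(sensitivity)

compressed = []
for layer_k, fmt in zip(k_cache, formats):
    if fmt == "fp8":
        compressed.append(layer_k.astype(np.float16))   # FP16 stands in for FP8 here
    else:
        compressed.append(quantize_int4(layer_k))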

Tiered prefix caching

Treats KV-cache reuse as a memory hierarchy problem: cross-request prefix caches in HBM, NVMe, and remote object storage. 7.8× higher throughput on shared-prefix workloads in production traces.

MLSys 2024
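
The cache behaves like a small storage hierarchy, which the toy sketch below makes concrete: each reusable prefix is keyed by a hash of its token ids, tiers are probed from fastest to slowest, hits are promoted toward HBM, and LRU evictions are demoted down the hierarchy. Tier names, entry-count capacities, and the in-memory dictionaries standing in for NVMe and object storage are illustrative; the production placement and eviction policies are more involved.

tiered_prefix_cache_sketch.py
import hashlib
from collections import OrderedDict

def prefix_key(token_ids):
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

class Tier:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()        # key -> KV blocks, in LRU order

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)   # refresh recency
            return self.entries[key]
        return None

    def put(self, key, kv_blocks):
        self.entries[key] = kv_blocks
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)   # evict the LRU entry
        return None

class TieredPrefixCache:
    def __init__(self):
        # Fastest to slowest: GPU HBM, local NVMe, remote object storage.
        self.tiers = [Tier("hbm", 64), Tier("nvme", 1024), Tier("remote", 1 << 20)]

    def lookup(self, token_ids):
        key = prefix_key(token_ids)
        for level, tier in enumerate(self.tiers):
            kv = tier.get(key)
            if kv is not None:
                if level > 0:
                    self._insert_at(0, key, kv)       # promote hits toward HBM
                return kv
        return None

    def insert(self, token_ids, kv_blocks):
        self._insert_at(0, prefix_key(token_ids), kv_blocks)

    def _insert_at(self, level, key, kv_blocks):
        evicted = self.tiers[level].put(key, kv_blocks)
        if evicted is not None and level + 1 < len(self.tiers):
            self._insert_at(level + 1, *evicted)      # demote evictions downward

# Example: a shared system prompt whose KV blocks are reused across requests.
cache = TieredPrefixCache()
shared_prefix = list(range(4096))                     # stand-in token ids
cache.insert(shared_prefix, kv_blocks="<paged KV blocks for this prefix>")
assert cache.lookup(shared_prefix) is not None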

In production

Available as one API parameter.

Long-context endpoints are first-class in the managed inference product. There is no separate model or deployment — pass attention="iframe/learned-sparse" and the runtime selects the calibrated pattern for the workload.

The pricing is the same as our standard 8K endpoints, plus a small surcharge above 128K to cover the KV cache. You will not see a 100x bill for switching context length.

long_context.py
from iframe import inference

# A 1M-token context window is a single API parameter.
client = inference.Client()

response = client.completions.create(
    model="iframe/llama-3.1-70b-1m",
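    # SYSTEM_PROMPT and load_repository are placeholders for your own
    # system prompt and repository loader.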
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": load_repository("./monorepo")},
        {"role": "user", "content": "Find the bug in the auth module."},
    ],
    # Sparse attention pattern is selected per-request, not per-model.
    attention="iframe/learned-sparse",
    max_tokens=2048,
)

print(response.choices[0].message.content)

Workloads

Where this matters most.

Whole-codebase reasoning

Repository-scale code completion and analysis. Engineers point the model at a 600K-line monorepo and ask for the bug — the model has the whole answer in context.

Multi-document QA

Legal discovery, financial filings, medical records. Hundreds of documents in a single prompt with citations back to the source pages, not a RAG index.

Conversation memory

Agents and assistants that retain a million tokens of conversational history. Affordable enough to run continuously, accurate enough to use as the canonical memory layer.

Long-context inference

A million-token prompt is a single API call.

Try the endpoint with a free account. Production usage is metered like any other endpoint.