Research · Long-context
1M tokens. Same hardware.
Three years of research on attention sparsity, KV-cache compression, and prefix caching, packaged into the long-context endpoints in our managed inference product. The result is million-token context windows that cost roughly the same as 8K-token serving on the same hardware.
Headline result
6.2x compute reduction. 99% of dense-attention quality.
Approach
Three pieces of the puzzle.
Learned sparse attention
A workload-calibrated sparsity pattern that drops attention compute by 6x at 1M tokens while preserving 99% of dense quality on RULER. Calibration uses 200 sample completions per workload class.
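In pseudocode, the per-request pattern behaves roughly like block-sparse attention over a calibrated keep-mask: each query block reads only the key/value blocks its workload class was calibrated to keep. The block size, mask format, and function names below are illustrative assumptions for this sketch, not the shipped kernel.

import numpy as np

BLOCK = 128  # illustrative block size, not the production setting

def sparse_attention(q, k, v, keep):
    # q, k, v: [seq_len, head_dim] for one head. keep[i, j] is True when
    # query block i attends to key block j (calibrated offline per
    # workload class). Causal masking is omitted for brevity.
    seq_len, head_dim = q.shape
    n_blocks = seq_len // BLOCK
    out = np.zeros_like(q)
    for i in range(n_blocks):
        qi = q[i * BLOCK:(i + 1) * BLOCK]
        # Always keep the diagonal block so every query sees its own span.
        cols = [j for j in range(n_blocks) if keep[i, j] or j == i]
        ki = np.concatenate([k[j * BLOCK:(j + 1) * BLOCK] for j in cols])
        vi = np.concatenate([v[j * BLOCK:(j + 1) * BLOCK] for j in cols])
        scores = qi @ ki.T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[i * BLOCK:(i + 1) * BLOCK] = weights @ vi
    return out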
KV-cache compression
Layer-wise bit allocation derived from a sensitivity analysis. Mixed FP8/INT4 quantization of the KV cache. Memory footprint reduced 3.4x with no measured regression on MT-Bench.
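A rough sketch of what sensitivity-driven bit allocation means in practice: layers whose sensitivity score clears a threshold keep the high-precision tier, the rest drop to 4-bit. The threshold, the per-tensor scaling, and the use of int8 values to stand in for FP8 (numpy has no native FP8 type) are assumptions for illustration, not the production quantizer.

import numpy as np

def quantize_kv_cache(kv_per_layer, sensitivity, threshold=0.05):
    # kv_per_layer: list of per-layer KV tensors; sensitivity: one score
    # per layer from the offline sensitivity analysis.
    compressed = []
    for kv, score in zip(kv_per_layer, sensitivity):
        scale = max(float(np.abs(kv).max()), 1e-8)
        if score >= threshold:
            # High-precision tier, approximated here with 8-bit integers.
            q = np.round(kv / scale * 127).astype(np.int8)
            compressed.append(("8bit", q, scale / 127))
        else:
            # 4-bit tier: values clipped into [-8, 7] with a per-tensor scale.
            q = np.clip(np.round(kv / scale * 7), -8, 7).astype(np.int8)
            compressed.append(("4bit", q, scale / 7))
    return compressed

def dequantize(tier, q, scale):
    return q.astype(np.float32) * scale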
Tiered prefix caching
Treats KV-cache reuse as a memory hierarchy problem. Cross-request prefix caches in HBM, NVMe, and remote object storage. 7.8x higher throughput on shared-prefix workloads in production traces.
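Conceptually, the cross-request lookup walks the hierarchy from fastest to slowest tier and promotes hits upward. The keying scheme, tier interface, and promote-on-hit policy below are illustrative assumptions rather than the production cache.

import hashlib

def prefix_key(token_ids):
    # Key a cached prefix by a hash of its exact token sequence.
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def lookup_prefix(token_ids, tiers):
    # tiers: ordered fastest-to-slowest, e.g. [hbm, nvme, object_store],
    # each a dict-like store mapping prefix keys to KV-cache blocks.
    key = prefix_key(token_ids)
    for depth, tier in enumerate(tiers):
        kv_blocks = tier.get(key)
        if kv_blocks is not None:
            # Promote the hit into every faster tier so the next request
            # sharing this prefix is served straight from HBM.
            for faster in tiers[:depth]:
                faster[key] = kv_blocks
            return kv_blocks
    return None  # Miss: prefill computes the prefix KV and populates the tiers.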
In production
Available as one API parameter.
Long-context endpoints are first-class in the managed inference product. There is no separate model or deployment — pass attention="iframe/learned-sparse" and the runtime selects the calibrated pattern for the workload.
Pricing is the same as for our standard 8K endpoints, plus a small surcharge above 128K tokens to cover the KV cache. You will not see a 100x bill for switching context length.
from iframe import inference

# A 1M-token context window is a single API parameter.
client = inference.Client()

response = client.completions.create(
    model="iframe/llama-3.1-70b-1m",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": load_repository("./monorepo")},
        {"role": "user", "content": "Find the bug in the auth module."},
    ],
    # Sparse attention pattern is selected per-request, not per-model.
    attention="iframe/learned-sparse",
    max_tokens=2048,
)

print(response.choices[0].message.content)
Workloads
Where this matters most.
Whole-codebase reasoning
Repository-scale code completion and analysis. Engineers point the model at a 600K-line monorepo and ask for the bug; the model has the entire codebase in context.
Multi-document QA
Legal discovery, financial filings, medical records. Hundreds of documents in a single prompt with citations back to the source pages, not a RAG index.
Conversation memory
Agents and assistants that retain a million tokens of conversational history. Affordable enough to run continuously, accurate enough to use as the canonical memory layer.
Long-context inference
A million-token prompt is a single API call.
Try the endpoint with a free account. Production usage is metered like any other endpoint.