Research
The cloud is downstream of the lab.
iframe.ai is a research lab that operates a cloud. Every paper we publish ends up in the runtime our customers use, and every customer workload we run feeds the questions we publish on. The two are not separable.
Research areas
Three lines of work.
Every line of research has a clear path to a product feature. We don't write papers we can't ship.
Long-context inference
Sparse attention, KV-cache compression, and prefix caching schemes that make million-token contexts cost-competitive with 8K contexts on equivalent hardware.
Inference acceleration
Speculative decoding, FP8 and INT4 quantization, kernel fusion. Our managed inference endpoints run 10-20x faster than vLLM defaults on the same hardware.
Vendor-neutral runtime
Collective communication, scheduler, and tensor-parallel patterns that work across NVIDIA, AMD, and Intel. Single source of truth for distributed training.
Selected papers
Recent peer-reviewed work.
NeurIPS 2024
Sparse attention for million-token context windows
Bahmani et al.
A learned sparsity scheme that maintains 99% of dense-attention accuracy while reducing compute by 6x at 1M-token contexts. Now in production for our long-context inference endpoints.
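The core idea, each query attending to only a small, high-scoring subset of keys, can be sketched as a toy top-k attention mask. This is an illustrative simplification, not the paper's learned scheme; the `keep` ratio and function name are assumptions.

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=0.25):
    """Toy sparse attention: each query attends only to its highest-scoring
    keys. A simplified stand-in for a learned sparsity scheme (illustrative
    only; the actual method learns which keys to keep)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (nq, nk) logits
    n_keep = max(1, int(keep * k.shape[0]))            # keys kept per query
    # Mask everything outside each query's top-n_keep scores.
    thresh = np.sort(scores, axis=-1)[:, -n_keep][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `keep=1.0` this reduces to dense softmax attention; at small `keep` values most of the score matrix is never materialized in a real kernel, which is where the compute savings come from.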
MLSys 2025
Quantization-aware speculative decoding
Saadat et al.
Joint design of FP8 quantization and tree-structured speculative decoding. 4.2x throughput improvement on Llama 3.1 70B with negligible quality loss measured against MT-Bench.
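The speculative-decoding half of the design can be sketched greedily: a cheap draft model proposes a run of tokens, the target model verifies them in one pass, and decoding falls back to the target at the first mismatch. This toy models neither FP8 quantization nor tree-structured drafts; the function names are assumptions.

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy speculative decoding sketch. `target_next` and `draft_next`
    map a token context to the next token (hypothetical stand-ins for the
    real models). Accepted draft tokens cost one target pass instead of k."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Target verifies the k positions (one batched pass in practice).
        accepted, ctx = 0, list(out)
        for t in draft:
            if target_next(ctx) != t:
                break
            out.append(t)
            ctx.append(t)
            accepted += 1
        # On a mismatch (or zero acceptance), take one target token.
        if accepted < k:
            out.append(target_next(out))
    return out[: len(prompt) + n_tokens]
```

The throughput win comes from the verify step: when the draft agrees with the target, k tokens are confirmed per target forward pass rather than one.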
OSDI 2025
Vendor-neutral collective communication primitives
Chen et al.
A clean abstraction over NCCL, RCCL, and oneCCL that lets a single training script run unmodified across NVIDIA, AMD, and Intel hardware at within 3% of native performance.
The lab
People, partners, and access.
Inside the lab
Twenty-three full-time researchers split across three teams: long-context, acceleration, and systems. They work in the same building as the engineers who run production.
University partners
Joint papers, sabbaticals, summer programs, and unrestricted compute grants for principal investigators at our partner institutions.
Open benchmarks
Reproducible scripts and harnesses behind every performance claim we publish. Run them yourself; we publish the seeds.
Compute for science
Compute grants for principal investigators.
Free or discounted GPU credits for accredited research, with grant-friendly procurement and unrestricted publication rights.