
Research · Papers

Peer-reviewed publications.

Selected work from the iframe.ai lab. Each paper links to a code release where one is available, a blog post explaining the implications for production, and the version of the runtime where it landed.

  1. NeurIPS 2024 · Dec 2024 · Long-context

    Sparse attention for million-token context windows

    Soroush Bahmani, Wei Chen, Lila Saadat, Anthony Park

    A learned sparsity scheme that maintains 99.1% of dense-attention accuracy on RULER while reducing FLOPs by 6.2x at 1M-token contexts. We show that the optimal pattern is workload-dependent and can be discovered with a small calibration set; an illustrative sketch of the calibration step appears after this list.

    412 citations · PDF · Code
  2. MLSys 2025 · Apr 2025 · Acceleration

    Quantization-aware speculative decoding

    Lila Saadat, Soroush Bahmani, Daria Volkov

    Joint design of FP8 quantization and tree-structured speculative decoding. We characterize the failure modes of naive composition and propose a calibration procedure that yields 4.2x throughput on Llama 3.1 70B with no measured MT-Bench regression; a simplified draft-and-verify loop is sketched below.

    187 citations · PDF · Code
  3. OSDI 2025 · Jul 2025 · Systems

    Vendor-neutral collective communication primitives

    Wei Chen, Marisa Torres, Anthony Park, Soroush Bahmani

    A unified abstraction over NCCL, RCCL, and oneCCL with backend-specific cost models. A single training script runs unmodified across NVIDIA, AMD, and Intel hardware, within 3% of each platform's native library; the shape of such an abstraction is sketched below.

    96 citations · PDF · Code
  4. MLSys 2024 · May 2024 · Long-context

    Prefix caching at inference time: a memory hierarchy view

    Daria Volkov, Lila Saadat

    We treat KV-cache reuse as a memory hierarchy problem. The result is a tiered caching policy that achieves a 7.8x throughput improvement on shared-prefix workloads (RAG, agents, code completion), measured on production traces; a toy two-tier cache appears after this list.

    320 citations · PDF
  5. SOSP 2024 · Nov 2024 · Systems

    Tensor-parallel scheduling under heterogeneous hardware

    Wei Chen, Anthony Park

    Distributed training across mixed-vendor clusters, where each node has different peak FLOPs and bandwidth. We characterize the performance frontier and propose a runtime scheduler that achieves 92% of oracle throughput on representative workloads; a proportional shard-splitting heuristic is sketched below.

    73 citations · PDF
  6. ICML 2024 · Jul 2024 · Acceleration

    INT4 quantization without calibration drift

    Lila Saadat, Daria Volkov, Soroush Bahmani

    A regularized rounding scheme for post-training INT4 quantization. We provide bounds on quality regression and show that, contra prior reports, INT4 inference is competitive with FP8 on most production workloads given the right rounding policy; an output-aware rounding pass appears after this list.

    245 citations · PDF · Code
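
Illustrative sketches

The sketches below are toy, non-authoritative illustrations of the core ideas in these papers. Function names, parameters, and constants are invented for the examples, and none of them reproduces the published implementations; those live in the linked code releases.

Sparse attention for million-token context windows. A minimal sketch of the calibration step, assuming a block-sparse pattern: run dense attention once on a small calibration sample, measure how much attention mass each key block receives, and keep only the heaviest key blocks per query block. The function name, block size, and keep ratio are all hypothetical.

```python
# Illustrative only: derive a block-sparse attention pattern from one dense pass
# over a small calibration sample.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibrate_block_mask(q, k, block=64, keep_ratio=0.25):
    """Pick which key blocks each query block attends to, from one dense pass."""
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))               # dense attention, calibration only
    nq, nk = q.shape[0] // block, k.shape[0] // block
    mass = scores[: nq * block, : nk * block]
    mass = mass.reshape(nq, block, nk, block).sum(axis=(1, 3))     # attention mass per (q-block, k-block)
    keep = max(1, int(keep_ratio * nk))
    mask = np.zeros((nq, nk), dtype=bool)
    top = np.argsort(mass, axis=1)[:, -keep:]                      # heaviest key blocks per query block
    np.put_along_axis(mask, top, True, axis=1)
    mask[np.arange(nq), np.minimum(np.arange(nq), nk - 1)] = True  # always keep the local block
    return mask

# Toy "calibration set": random activations stand in for real queries/keys.
rng = np.random.default_rng(0)
q = rng.standard_normal((512, 64)).astype(np.float32)
k = rng.standard_normal((512, 64)).astype(np.float32)
mask = calibrate_block_mask(q, k)
print(f"keeping {int(mask.sum())} of {mask.size} key blocks")
```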
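
Quantization-aware speculative decoding. A simplified draft-and-verify loop with toy stand-in models, showing only the standard accept/reject rule; the paper's tree-structured drafts, FP8 kernels, and calibration procedure are not represented here, and the model stand-ins are invented.

```python
# Illustrative only: the accept/reject core of speculative decoding with toy models.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

def toy_dist(ctx, sharpness):
    """Deterministic toy next-token distribution over VOCAB (stand-in for a real model)."""
    logits = np.sin(np.arange(VOCAB) * (1 + len(ctx))) * sharpness
    p = np.exp(logits - logits.max())
    return p / p.sum()

def speculative_step(ctx, draft_len=4):
    """Draft draft_len tokens with a cheap model, then verify them against the target model."""
    draft_p, drafted = [], []
    for _ in range(draft_len):
        p = toy_dist(ctx + drafted, sharpness=1.0)            # cheap draft model
        t = int(rng.choice(VOCAB, p=p))
        draft_p.append(p)
        drafted.append(t)
    accepted = []
    for i, t in enumerate(drafted):
        q = toy_dist(ctx + accepted, sharpness=3.0)           # target model on the accepted prefix
        if rng.random() < min(1.0, q[t] / draft_p[i][t]):     # standard accept rule
            accepted.append(t)
        else:
            resid = np.maximum(q - draft_p[i], 0.0)           # resample from the residual distribution
            accepted.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            break
    return accepted

print("accepted tokens:", speculative_step([1, 2, 3]))
```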
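
Vendor-neutral collective communication primitives. A sketch of what a unified collective interface with a backend-specific cost model can look like: the cost model picks between a ring and a tree algorithm, and concrete backends would wrap NCCL, RCCL, or oneCCL. The class names and alpha-beta cost formulas are assumptions for illustration, not the library's actual API.

```python
# Illustrative only: a common collective interface plus a per-backend cost model.
from dataclasses import dataclass

@dataclass
class CostModel:
    latency_us: float       # per-step fixed cost
    bandwidth_gbps: float   # sustained link bandwidth

    def allreduce_us(self, nbytes, world_size, algo):
        # Very rough alpha-beta model: ring does many small steps, tree does few big ones.
        if algo == "ring":
            steps, per_step = 2 * (world_size - 1), nbytes / world_size
        else:
            steps, per_step = 2 * max(world_size - 1, 1).bit_length(), nbytes
        return steps * (self.latency_us + per_step * 8 / (self.bandwidth_gbps * 1e3))

class Collective:
    """Common interface; real backends would wrap NCCL, RCCL, or oneCCL."""
    def __init__(self, world_size, cost_model):
        self.world_size, self.cost = world_size, cost_model

    def allreduce(self, buf):
        nbytes = len(buf) * 4   # assume fp32 elements
        algo = min(("ring", "tree"),
                   key=lambda a: self.cost.allreduce_us(nbytes, self.world_size, a))
        return self._allreduce(buf, algo)

    def _allreduce(self, buf, algo):
        raise NotImplementedError

class LocalBackend(Collective):
    """Stand-in backend so the sketch runs anywhere: pretends all ranks hold the same buffer."""
    def _allreduce(self, buf, algo):
        print(f"allreduce of {len(buf)} floats via {algo}")
        return [x * self.world_size for x in buf]

comm = LocalBackend(world_size=8, cost_model=CostModel(latency_us=5.0, bandwidth_gbps=200.0))
print(comm.allreduce([1.0, 2.0, 3.0]))
```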
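
Prefix caching at inference time. A toy two-tier prefix cache keyed by a hash of the token prefix: a small hot tier with LRU demotion into a larger second tier, and promotion back on a hit. Tier names, capacities, and the promotion policy are invented and stand in for the paper's tiered policy.

```python
# Illustrative only: two size-limited dicts stand in for GPU and host KV-cache tiers.
from collections import OrderedDict

class TieredPrefixCache:
    def __init__(self, gpu_slots=2, host_slots=4):
        self.tiers = {"gpu": (OrderedDict(), gpu_slots), "host": (OrderedDict(), host_slots)}

    @staticmethod
    def key(tokens):
        return hash(tuple(tokens))

    def get(self, tokens):
        k = self.key(tokens)
        for name, (tier, _) in self.tiers.items():
            if k in tier:
                tier.move_to_end(k)                    # refresh LRU position
                if name == "host":
                    self.put(tokens, tier.pop(k))      # promote back to the hot tier
                    return self.get(tokens)
                return tier[k]
        return None

    def put(self, tokens, kv_blocks):
        k = self.key(tokens)
        gpu, cap = self.tiers["gpu"]
        gpu[k] = kv_blocks
        gpu.move_to_end(k)
        if len(gpu) > cap:                             # demote the least-recently-used entry
            old_k, old_v = gpu.popitem(last=False)
            host, host_cap = self.tiers["host"]
            host[old_k] = old_v
            if len(host) > host_cap:
                host.popitem(last=False)               # evict entirely

cache = TieredPrefixCache()
cache.put([1, 2, 3], kv_blocks="kv(prefix=3)")
cache.put([1, 2, 3, 4], kv_blocks="kv(prefix=4)")
print(cache.get([1, 2, 3]))
```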
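
Tensor-parallel scheduling under heterogeneous hardware. A proportional shard-splitting heuristic: estimate each node's effective rate with a crude roofline (limited by either compute or link bandwidth) and split shards in proportion, rounding by largest remainder. The rate model, the workload constant, and the node figures are made up; the paper's runtime scheduler searches a much richer space.

```python
# Illustrative only: split tensor-parallel shards in proportion to effective node speed.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    peak_tflops: float   # made-up compute rating
    net_gbps: float      # made-up link bandwidth

def effective_rate(node, tflops_per_gbps=2.0):
    # Roofline-flavoured: a node is limited either by compute or by how fast it can
    # exchange activations; tflops_per_gbps is an invented workload constant.
    return min(node.peak_tflops, node.net_gbps * tflops_per_gbps)

def shard_split(nodes, total_shards):
    """Assign shard counts proportional to each node's effective rate (largest-remainder rounding)."""
    rates = [effective_rate(n) for n in nodes]
    raw = [total_shards * r / sum(rates) for r in rates]
    alloc = [int(x) for x in raw]
    for i in sorted(range(len(nodes)), key=lambda i: raw[i] - alloc[i], reverse=True):
        if sum(alloc) == total_shards:
            break
        alloc[i] += 1                     # hand leftovers to the largest fractional remainders
    return {n.name: a for n, a in zip(nodes, alloc)}

cluster = [Node("node-a", 400.0, 200.0), Node("node-b", 300.0, 400.0), Node("node-c", 150.0, 100.0)]
print(shard_split(cluster, total_shards=64))
```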
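
INT4 quantization without calibration drift. An output-aware rounding pass in the spirit of the idea: per-output-channel symmetric INT4 scales, with each weight's round-up or round-down choice picked greedily to minimize layer-output error on a tiny calibration batch. This is an AdaRound-flavored stand-in, not the paper's regularized objective or its bounds.

```python
# Illustrative only: post-training INT4 rounding driven by layer-output error.
import numpy as np

def quantize_int4(W, X, passes=2):
    """Symmetric per-output-channel INT4 with output-aware rounding. W: (in, out), X: (calib, in)."""
    scale = np.abs(W).max(axis=0) / 7.0                      # 7 = largest magnitude on the INT4 grid
    low = np.clip(np.floor(W / scale), -8, 6)                # round-down candidate on the integer grid
    up = low + 1                                             # round-up candidate
    Q = np.where(np.abs(W / scale - low) <= 0.5, low, up)    # start from round-to-nearest
    ref = X @ W                                              # calibration reference output
    for _ in range(passes):                                  # greedy coordinate passes over weights
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                best = np.sum((ref - X @ (Q * scale)) ** 2)
                for cand in (low[i, j], up[i, j]):
                    trial = Q.copy()
                    trial[i, j] = cand
                    err = np.sum((ref - X @ (trial * scale)) ** 2)
                    if err < best:
                        Q, best = trial, err
    return Q.astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4)).astype(np.float32)
X = rng.standard_normal((16, 8)).astype(np.float32)          # tiny calibration batch
Q, scale = quantize_int4(W, X)
rel_err = np.linalg.norm(X @ W - X @ (Q * scale)) / np.linalg.norm(X @ W)
print("relative output error:", round(float(rel_err), 4))
```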

Reproducibility

Run the benchmarks yourself.

Every performance number we publish has a script and a seed. Open-source, no signup.