Research · Papers
Peer-reviewed publications.
Selected work from the iframe.ai lab. Every paper links to a code release, a blog post explaining the implications for production, and the version of the runtime where it landed.
- NeurIPS 2024 · Dec 2024 · Long-context
Sparse attention for million-token context windows
Soroush Bahmani, Wei Chen, Lila Saadat, Anthony Park
A learned sparsity scheme that maintains 99.1% of dense-attention accuracy on RULER while reducing FLOPs by 6.2x at 1M-token contexts. We show that the optimal pattern is workload-dependent and can be discovered with a small calibration set.
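The mechanics are easy to sketch. Below is a minimal numpy illustration of the calibrate-then-mask idea: score key blocks by attention mass on a small calibration batch, keep the top fraction per query block, and attend only within the kept blocks at inference. Function names, the block size, and the top-k selection rule are illustrative assumptions, not the released code; the sequence length is assumed divisible by the block size.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibrate_block_mask(Q, K, block=64, keep_ratio=0.1):
    """Offline step: run dense attention once on a small calibration
    batch, average attention mass per (query block, key block) pair,
    and keep the heaviest key blocks for each query block."""
    n, d = Q.shape
    nb = n // block
    dense = softmax(Q @ K.T / np.sqrt(d))
    block_mass = dense.reshape(nb, block, nb, block).mean(axis=(1, 3))
    k = max(1, int(keep_ratio * nb))
    keep = np.argsort(block_mass, axis=1)[:, -k:]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask

def block_sparse_attention(Q, K, V, mask, block=64):
    """Online step: each query block attends only to the key blocks the
    calibration pass kept, skipping the remaining FLOPs entirely."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for qb in range(n // block):
        rows = slice(qb * block, (qb + 1) * block)
        cols = np.concatenate(
            [np.arange(kb * block, (kb + 1) * block) for kb in np.where(mask[qb])[0]]
        )
        out[rows] = softmax(Q[rows] @ K[cols].T / np.sqrt(d)) @ V[cols]
    return out
```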
- MLSys 2025 · Apr 2025 · Acceleration
Quantization-aware speculative decoding
Lila Saadat, Soroush Bahmani, Daria Volkov
Joint design of FP8 quantization and tree-structured speculative decoding. We characterize the failure modes of naive composition and propose a calibration procedure that yields a 4.2x throughput gain on Llama 3.1 70B with no measurable MT-Bench regression.
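For readers unfamiliar with the baseline being composed, here is a sketch of plain chain speculative decoding with the standard accept/resample rule. The tree structure, FP8 quantization, and calibration procedure are the paper's contribution and are not shown; the model interfaces (draft_logits_fn, target_logits_fn) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def speculative_decode(draft_logits_fn, target_logits_fn, prompt, n_new, k=4):
    """Chain (non-tree) speculative decoding. draft_logits_fn(ctx) returns
    next-token logits from the cheap model; target_logits_fn(ctx, proposal)
    returns k+1 rows of target logits -- one per proposal position plus one
    past it -- from a single batched forward pass."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        proposal, q_dists, ctx = [], [], list(seq)
        for _ in range(k):  # draft proposes k tokens autoregressively
            q = softmax(draft_logits_fn(ctx))
            t = int(rng.choice(len(q), p=q))
            proposal.append(t); q_dists.append(q); ctx.append(t)
        p_dists = [softmax(l) for l in target_logits_fn(seq, proposal)]
        for i, t in enumerate(proposal):
            p, q = p_dists[i], q_dists[i]
            if rng.random() < min(1.0, p[t] / q[t]):
                seq.append(t)           # accept the drafted token
            else:                        # reject: resample from the residual
                resid = np.maximum(p - q, 0.0)
                resid /= resid.sum()
                seq.append(int(rng.choice(len(resid), p=resid)))
                break
        else:  # every draft accepted: take a free token from the target
            seq.append(int(rng.choice(len(p_dists[-1]), p=p_dists[-1])))
    return seq[: len(prompt) + n_new]
```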
- OSDI 2025 · Jul 2025 · Systems
Vendor-neutral collective communication primitives
Wei Chen, Marisa Torres, Anthony Park, Soroush Bahmani
A unified abstraction over NCCL, RCCL, and oneCCL with backend-specific cost models. A single training script runs unmodified across NVIDIA, AMD, and Intel hardware, landing within 3% of the native library's performance on each platform.
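The shape of the abstraction is roughly this: one collective interface, one cost model per backend, and dispatch on predicted time. A minimal Python sketch follows; the interface names and the alpha-beta ring model are our illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

class CollectiveBackend(Protocol):
    """The contract each vendor library (NCCL, RCCL, oneCCL) is wrapped in."""
    name: str
    def allreduce(self, buf, op: str = "sum") -> None: ...
    def predict_s(self, nbytes: int, world: int) -> float: ...

@dataclass
class AlphaBetaRing:
    """Classic alpha-beta cost model for a ring allreduce: 2(w-1) steps,
    each paying per-step latency (alpha) plus bytes / bandwidth (beta)."""
    name: str
    alpha_s: float           # per-step latency, seconds
    beta_s_per_byte: float   # inverse link bandwidth, seconds per byte

    def predict_s(self, nbytes: int, world: int) -> float:
        return 2 * (world - 1) * (self.alpha_s + (nbytes / world) * self.beta_s_per_byte)

    def allreduce(self, buf, op: str = "sum") -> None:
        raise NotImplementedError("a real backend calls NCCL/RCCL/oneCCL here")

def pick_backend(backends: Sequence[CollectiveBackend], nbytes: int, world: int):
    """Dispatch: run the collective on whichever backend the cost model
    predicts will finish first for this message size and world size."""
    return min(backends, key=lambda b: b.predict_s(nbytes, world))
```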
- MLSys 2024 · May 2024 · Long-context
Prefix caching at inference time: a memory hierarchy view
Daria Volkov, Lila Saadat
We treat KV-cache reuse as a memory hierarchy problem. The result is a tiered caching policy that achieves a 7.8x throughput improvement on shared-prefix workloads (RAG, agents, code completion), measured on production traces.
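As a concrete picture of what "memory hierarchy" means here, the sketch below keeps a small hot tier (think GPU HBM) over a larger warm tier (host memory), keyed by a hash of the token prefix, with LRU eviction that demotes rather than drops. Tier sizes and the eviction policy are illustrative assumptions; the paper's policy is more elaborate.

```python
from collections import OrderedDict

class TieredPrefixCache:
    """Two-tier KV cache: a small hot tier backed by a larger warm tier.
    Entries are keyed by a hash of the token prefix; eviction is LRU
    within each tier, and hot evictions are demoted to warm, not dropped."""

    def __init__(self, hot_slots=4, warm_slots=64):
        self.hot = OrderedDict()
        self.warm = OrderedDict()
        self.hot_slots, self.warm_slots = hot_slots, warm_slots

    def get(self, prefix_tokens):
        key = hash(tuple(prefix_tokens))
        if key in self.hot:
            self.hot.move_to_end(key)        # refresh LRU position
            return self.hot[key]
        if key in self.warm:                 # hit in warm: promote to hot
            kv = self.warm.pop(key)
            self.put(prefix_tokens, kv)
            return kv
        return None                          # miss: caller recomputes the KV

    def put(self, prefix_tokens, kv):
        key = hash(tuple(prefix_tokens))
        self.hot[key] = kv
        self.hot.move_to_end(key)
        if len(self.hot) > self.hot_slots:   # demote the coldest hot entry
            old_key, old_kv = self.hot.popitem(last=False)
            self.warm[old_key] = old_kv
            if len(self.warm) > self.warm_slots:
                self.warm.popitem(last=False)
```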
320 citations
- SOSP 2024 · Nov 2024 · Systems
Tensor-parallel scheduling under heterogeneous hardware
Wei Chen, Anthony Park
Distributed training across mixed-vendor clusters, where each node has different peak FLOPs and bandwidth. We characterize the performance frontier and propose a runtime scheduler that reaches 92% of an oracle schedule's throughput on representative workloads.
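The simplest version of the idea is compute-proportional work splitting, sketched below: give each node a shard count proportional to its measured throughput, so the slowest node stops gating the step time. The paper's scheduler also models bandwidth; this sketch, with made-up throughput numbers, covers only the compute term.

```python
def partition_shards(total_shards, node_tflops):
    """Assign tensor-parallel shards proportionally to each node's
    measured throughput. Largest-remainder rounding keeps the total
    shard count exact."""
    total = sum(node_tflops)
    raw = [total_shards * t / total for t in node_tflops]
    base = [int(r) for r in raw]
    leftover = total_shards - sum(base)
    # Hand leftover shards to the nodes with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - base[i], reverse=True)
    for i in order[:leftover]:
        base[i] += 1
    return base

# e.g. 32 shards over three nodes with illustrative peak throughputs (TFLOPs)
print(partition_shards(32, [400.0, 700.0, 300.0]))  # -> [9, 16, 7]
```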
73 citations
- ICML 2024 · Jul 2024 · Acceleration
INT4 quantization without calibration drift
Lila Saadat, Daria Volkov, Soroush Bahmani
A regularized rounding scheme for post-training INT4 quantization. We provide bounds on quality regression and show that, contra prior reports, INT4 inference with the right rounding policy is competitive with FP8 on most production workloads.
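To make "regularized rounding" concrete: nearest rounding minimizes per-weight error, but a better objective is the layer's output error on a calibration batch, penalized for drifting from nearest rounding. The toy greedy pass below illustrates that objective; it is a stand-in for the paper's actual solver, and all names and the smoke test are ours.

```python
import numpy as np

def rtn_int4(w, scale):
    """Baseline: symmetric round-to-nearest onto the INT4 grid [-8, 7]."""
    return np.clip(np.round(w / scale), -8, 7)

def regularized_int4(w, scale, x, lam=0.1):
    """Toy output-aware rounding for one linear layer y = x @ w. Each
    weight may round down or up; a single greedy pass keeps whichever
    choice lowers output MSE on the calibration batch x, plus an L2
    penalty (lam) against drifting from nearest rounding."""
    q = rtn_int4(w, scale)
    nearest = q.copy()
    y_ref = x @ w                            # full-precision reference output
    for i, j in np.ndindex(*w.shape):
        lo = np.clip(np.floor(w[i, j] / scale), -8, 7)
        hi = np.clip(np.ceil(w[i, j] / scale), -8, 7)
        best_err, best_v = np.inf, q[i, j]
        for cand in (lo, hi):
            q[i, j] = cand
            err = np.mean((y_ref[:, j] - x @ (q[:, j] * scale)) ** 2)
            err += lam * scale**2 * (cand - nearest[i, j]) ** 2
            if err < best_err:
                best_err, best_v = err, cand
        q[i, j] = best_v
    return q

# smoke test on random data: output MSE of the quantized layer
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8)); x = rng.normal(size=(32, 16))
scale = np.abs(w).max() / 7
q = regularized_int4(w, scale, x)
print(np.mean((x @ w - x @ (q * scale)) ** 2))
```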
Reproducibility
Run the benchmarks yourself.
Every performance number we publish has a script and a seed. Open-source, no signup.