
Research · Papers

Peer-reviewed publications.

Selected work from the iframe.ai lab. Each paper links to a code release where one is available, a blog post explaining the implications for production, and the version of the runtime where it landed.

  1. NeurIPS 2024 · Dec 2024 · Long-context

    Sparse attention for million-token context windows

    Soroush Bahmani, Wei Chen, Lila Saadat, Anthony Park

    A learned sparsity scheme that maintains 99.1% of dense-attention accuracy on RULER while reducing FLOPs by 6.2x at 1M-token contexts. We show that the optimal pattern is workload-dependent and can be discovered with a small calibration set; an illustrative sketch of the calibration step appears after this list.

    412 citations · PDF · Code
  2. MLSys 2025 · Apr 2025 · Acceleration

    Quantization-aware speculative decoding

    Lila Saadat, Soroush Bahmani, Daria Volkov

    Joint design of FP8 quantization and tree-structured speculative decoding. We characterize the failure modes of naive composition and propose a calibration procedure that yields 4.2x throughput on Llama 3.1 70B with no measured MT-Bench regression; a simplified draft-and-verify loop is sketched below.

    187 citations · PDF · Code
  3. OSDI 2025 · Jul 2025 · Systems

    Vendor-neutral collective communication primitives

    Wei Chen, Marisa Torres, Anthony Park, Soroush Bahmani

    A unified abstraction over NCCL, RCCL, and oneCCL with backend-specific cost models. A single training script runs unmodified across NVIDIA, AMD, and Intel hardware, within 3% of each platform's native library; the shape of such an abstraction is sketched below.

    96 citations · PDF · Code
  4. MLSys 2024 · May 2024 · Long-context

    Prefix caching at inference time: a memory hierarchy view

    Daria Volkov, Lila Saadat

    We treat KV-cache reuse as a memory hierarchy problem. The result is a tiered caching policy that achieves a 7.8x throughput improvement on shared-prefix workloads (RAG, agents, code completion), measured on production traces; a toy two-tier cache appears after this list.

    320 citations · PDF
  5. SOSP 2024 · Nov 2024 · Systems

    Tensor-parallel scheduling under heterogeneous hardware

    Wei Chen, Anthony Park

    Distributed training across mixed-vendor clusters, where each node has different peak FLOPs and bandwidth. We characterize the performance frontier and propose a runtime scheduler that achieves 92% of oracle throughput on representative workloads; a proportional shard-splitting heuristic is sketched below.

    73 citations · PDF
  6. ICML 2024 · Jul 2024 · Acceleration

    INT4 quantization without calibration drift

    Lila Saadat, Daria Volkov, Soroush Bahmani

    A regularized rounding scheme for post-training INT4 quantization. We provide bounds on quality regression and show that, contra prior reports, INT4 inference is competitive with FP8 on most production workloads given the right rounding policy; an output-aware rounding pass appears after this list.

    245 citations · PDF · Code
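
Illustrative sketches

The sketches below are toy, non-authoritative illustrations of the core ideas in these papers. Function names, parameters, and constants are invented for the examples, and none of them reproduces the published implementations; those live in the linked code releases.

Sparse attention for million-token context windows. A minimal sketch of the calibration step, assuming a block-sparse pattern: run dense attention once on a small calibration sample, measure how much attention mass each key block receives, and keep only the heaviest key blocks per query block. The function name, block size, and keep ratio are all hypothetical.

```python
# Illustrative only: derive a block-sparse attention pattern from one dense pass
# over a small calibration sample.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibrate_block_mask(q, k, block=64, keep_ratio=0.25):
    """Pick which key blocks each query block attends to, from one dense pass."""
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))               # dense attention, calibration only
    nq, nk = q.shape[0] // block, k.shape[0] // block
    mass = scores[: nq * block, : nk * block]
    mass = mass.reshape(nq, block, nk, block).sum(axis=(1, 3))     # attention mass per (q-block, k-block)
    keep = max(1, int(keep_ratio * nk))
    mask = np.zeros((nq, nk), dtype=bool)
    top = np.argsort(mass, axis=1)[:, -keep:]                      # heaviest key blocks per query block
    np.put_along_axis(mask, top, True, axis=1)
    mask[np.arange(nq), np.minimum(np.arange(nq), nk - 1)] = True  # always keep the local block
    return mask

# Toy "calibration set": random activations stand in for real queries/keys.
rng = np.random.default_rng(0)
q = rng.standard_normal((512, 64)).astype(np.float32)
k = rng.standard_normal((512, 64)).astype(np.float32)
mask = calibrate_block_mask(q, k)
print(f"keeping {int(mask.sum())} of {mask.size} key blocks")
```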
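
Quantization-aware speculative decoding. A simplified draft-and-verify loop with toy stand-in models, showing only the standard accept/reject rule; the paper's tree-structured drafts, FP8 kernels, and calibration procedure are not represented here, and the model stand-ins are invented.

```python
# Illustrative only: the accept/reject core of speculative decoding with toy models.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

def toy_dist(ctx, sharpness):
    """Deterministic toy next-token distribution over VOCAB (stand-in for a real model)."""
    logits = np.sin(np.arange(VOCAB) * (1 + len(ctx))) * sharpness
    p = np.exp(logits - logits.max())
    return p / p.sum()

def speculative_step(ctx, draft_len=4):
    """Draft draft_len tokens with a cheap model, then verify them against the target model."""
    draft_p, drafted = [], []
    for _ in range(draft_len):
        p = toy_dist(ctx + drafted, sharpness=1.0)            # cheap draft model
        t = int(rng.choice(VOCAB, p=p))
        draft_p.append(p)
        drafted.append(t)
    accepted = []
    for i, t in enumerate(drafted):
        q = toy_dist(ctx + accepted, sharpness=3.0)           # target model on the accepted prefix
        if rng.random() < min(1.0, q[t] / draft_p[i][t]):     # standard accept rule
            accepted.append(t)
        else:
            resid = np.maximum(q - draft_p[i], 0.0)           # resample from the residual distribution
            accepted.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            break
    return accepted

print("accepted tokens:", speculative_step([1, 2, 3]))
```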
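
Vendor-neutral collective communication primitives. A sketch of what a unified collective interface with a backend-specific cost model can look like: the cost model picks between a ring and a tree algorithm, and concrete backends would wrap NCCL, RCCL, or oneCCL. The class names and alpha-beta cost formulas are assumptions for illustration, not the library's actual API.

```python
# Illustrative only: a common collective interface plus a per-backend cost model.
from dataclasses import dataclass

@dataclass
class CostModel:
    latency_us: float       # per-step fixed cost
    bandwidth_gbps: float   # sustained link bandwidth

    def allreduce_us(self, nbytes, world_size, algo):
        # Very rough alpha-beta model: ring does many small steps, tree does few big ones.
        if algo == "ring":
            steps, per_step = 2 * (world_size - 1), nbytes / world_size
        else:
            steps, per_step = 2 * max(world_size - 1, 1).bit_length(), nbytes
        return steps * (self.latency_us + per_step * 8 / (self.bandwidth_gbps * 1e3))

class Collective:
    """Common interface; real backends would wrap NCCL, RCCL, or oneCCL."""
    def __init__(self, world_size, cost_model):
        self.world_size, self.cost = world_size, cost_model

    def allreduce(self, buf):
        nbytes = len(buf) * 4   # assume fp32 elements
        algo = min(("ring", "tree"),
                   key=lambda a: self.cost.allreduce_us(nbytes, self.world_size, a))
        return self._allreduce(buf, algo)

    def _allreduce(self, buf, algo):
        raise NotImplementedError

class LocalBackend(Collective):
    """Stand-in backend so the sketch runs anywhere: pretends all ranks hold the same buffer."""
    def _allreduce(self, buf, algo):
        print(f"allreduce of {len(buf)} floats via {algo}")
        return [x * self.world_size for x in buf]

comm = LocalBackend(world_size=8, cost_model=CostModel(latency_us=5.0, bandwidth_gbps=200.0))
print(comm.allreduce([1.0, 2.0, 3.0]))
```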
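
Prefix caching at inference time. A toy two-tier prefix cache keyed by a hash of the token prefix: a small hot tier with LRU demotion into a larger second tier, and promotion back on a hit. Tier names, capacities, and the promotion policy are invented and stand in for the paper's tiered policy.

```python
# Illustrative only: two size-limited dicts stand in for GPU and host KV-cache tiers.
from collections import OrderedDict

class TieredPrefixCache:
    def __init__(self, gpu_slots=2, host_slots=4):
        self.tiers = {"gpu": (OrderedDict(), gpu_slots), "host": (OrderedDict(), host_slots)}

    @staticmethod
    def key(tokens):
        return hash(tuple(tokens))

    def get(self, tokens):
        k = self.key(tokens)
        for name, (tier, _) in self.tiers.items():
            if k in tier:
                tier.move_to_end(k)                    # refresh LRU position
                if name == "host":
                    self.put(tokens, tier.pop(k))      # promote back to the hot tier
                    return self.get(tokens)
                return tier[k]
        return None

    def put(self, tokens, kv_blocks):
        k = self.key(tokens)
        gpu, cap = self.tiers["gpu"]
        gpu[k] = kv_blocks
        gpu.move_to_end(k)
        if len(gpu) > cap:                             # demote the least-recently-used entry
            old_k, old_v = gpu.popitem(last=False)
            host, host_cap = self.tiers["host"]
            host[old_k] = old_v
            if len(host) > host_cap:
                host.popitem(last=False)               # evict entirely

cache = TieredPrefixCache()
cache.put([1, 2, 3], kv_blocks="kv(prefix=3)")
cache.put([1, 2, 3, 4], kv_blocks="kv(prefix=4)")
print(cache.get([1, 2, 3]))
```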
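
Tensor-parallel scheduling under heterogeneous hardware. A proportional shard-splitting heuristic: estimate each node's effective rate with a crude roofline (limited by either compute or link bandwidth) and split shards in proportion, rounding by largest remainder. The rate model, the workload constant, and the node figures are made up; the paper's runtime scheduler searches a much richer space.

```python
# Illustrative only: split tensor-parallel shards in proportion to effective node speed.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    peak_tflops: float   # made-up compute rating
    net_gbps: float      # made-up link bandwidth

def effective_rate(node, tflops_per_gbps=2.0):
    # Roofline-flavoured: a node is limited either by compute or by how fast it can
    # exchange activations; tflops_per_gbps is an invented workload constant.
    return min(node.peak_tflops, node.net_gbps * tflops_per_gbps)

def shard_split(nodes, total_shards):
    """Assign shard counts proportional to each node's effective rate (largest-remainder rounding)."""
    rates = [effective_rate(n) for n in nodes]
    raw = [total_shards * r / sum(rates) for r in rates]
    alloc = [int(x) for x in raw]
    for i in sorted(range(len(nodes)), key=lambda i: raw[i] - alloc[i], reverse=True):
        if sum(alloc) == total_shards:
            break
        alloc[i] += 1                     # hand leftovers to the largest fractional remainders
    return {n.name: a for n, a in zip(nodes, alloc)}

cluster = [Node("node-a", 400.0, 200.0), Node("node-b", 300.0, 400.0), Node("node-c", 150.0, 100.0)]
print(shard_split(cluster, total_shards=64))
```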
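
INT4 quantization without calibration drift. An output-aware rounding pass in the spirit of the idea: per-output-channel symmetric INT4 scales, with each weight's round-up or round-down choice picked greedily to minimize layer-output error on a tiny calibration batch. This is an AdaRound-flavored stand-in, not the paper's regularized objective or its bounds.

```python
# Illustrative only: post-training INT4 rounding driven by layer-output error.
import numpy as np

def quantize_int4(W, X, passes=2):
    """Symmetric per-output-channel INT4 with output-aware rounding. W: (in, out), X: (calib, in)."""
    scale = np.abs(W).max(axis=0) / 7.0                      # 7 = largest magnitude on the INT4 grid
    low = np.clip(np.floor(W / scale), -8, 6)                # round-down candidate on the integer grid
    up = low + 1                                             # round-up candidate
    Q = np.where(np.abs(W / scale - low) <= 0.5, low, up)    # start from round-to-nearest
    ref = X @ W                                              # calibration reference output
    for _ in range(passes):                                  # greedy coordinate passes over weights
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                best = np.sum((ref - X @ (Q * scale)) ** 2)
                for cand in (low[i, j], up[i, j]):
                    trial = Q.copy()
                    trial[i, j] = cand
                    err = np.sum((ref - X @ (trial * scale)) ** 2)
                    if err < best:
                        Q, best = trial, err
    return Q.astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4)).astype(np.float32)
X = rng.standard_normal((16, 8)).astype(np.float32)          # tiny calibration batch
Q, scale = quantize_int4(W, X)
rel_err = np.linalg.norm(X @ W - X @ (Q * scale)) / np.linalg.norm(X @ W)
print("relative output error:", round(float(rel_err), 4))
```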

Reproducibility

Run the benchmarks yourself.

Every performance number we publish has a script and a seed. Open-source, no signup.