

Reproducible. By design.

Every performance claim we publish ships with a script and a seed. We run the harnesses on a clean cluster once a quarter, publish the JSON, and host the raw output. If your numbers do not match ours, please open an issue.

Four harnesses

What we maintain.

  • github.com/iframe-ai/inference-bench

    Inference throughput harness

    End-to-end token throughput across input/output ratios, batch sizes, and contention patterns. Compares vLLM, TensorRT-LLM, SGLang, and our runtime on identical hardware.

    20x measured speedup at the published config

  • github.com/iframe-ai/longctx-evals

    Long-context evaluation harness

    RULER, needle-in-a-haystack, multi-doc QA, and code-completion across whole repositories. Scores correlate with what production users actually care about.

    99.1% of dense-attention quality at 6.2x compute reduction

  • github.com/iframe-ai/training-bench

    Multi-vendor training harness

    Same training script, three vendors. Reports tokens-per-second per dollar, MFU, and step-time variance on NVIDIA, AMD, and Intel hardware.

    Within 3% of native libraries on each vendor

  • github.com/iframe-ai/pricing-data

    Cost calculator data

    Monthly snapshots of AWS, Azure, and GCP list prices for the same SKUs we sell. The data behind the calculator on /pricing/calculator is published as CSV; a sanity-check sketch follows this list.

    Updated on the first business day of every month
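
Tokens-per-second per dollar is just measured throughput divided by a list price, so the pricing-data snapshots make the training-bench numbers easy to sanity-check by hand. A minimal sketch, assuming an illustrative provider,sku,region,usd_per_hour column layout, snapshot path, and throughput figure, none of which are taken from the repo:

sanity-check.sh
# ASSUMPTION: column layout, snapshot path, and throughput figure are all
# illustrative -- confirm the real schema in the pricing-data README
TPS=41800   # placeholder tokens/s from a training-bench run

USD_PER_HOUR=$(awk -F, '$1=="aws" && $2=="p5.48xlarge" {print $4; exit}' \
    snapshots/2026-04.csv)

# tokens/s divided by $/hour: throughput bought per dollar-hour
awk -v tps="$TPS" -v price="$USD_PER_HOUR" \
    'BEGIN { printf "%.0f tokens/s per dollar-hour\n", tps / price }'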

Run it

Five-line reproduction.

The published numbers on the home page, the inference page, and our research papers all come from the inference harness. The reproduction recipe is five lines of bash plus a clone: the same commands we run on our quarterly clean cluster.

The seed is hard-coded in the published config so the generated workload is bit-identical from run to run. Timing still varies with system noise, so the numbers we publish are the medians of five runs, with min/max bands shown in the JSON output.

reproduce.sh
# clone the inference harness
git clone https://github.com/iframe-ai/inference-bench
cd inference-bench

# install
uv sync

# run the published Llama-70B / B200 / 1K-in 1K-out config
uv run bench \
    --model llama-3.1-70b \
    --hardware b200 \
    --workload tokens-1k-in-1k-out \
    --runtime iframe \
    --seed 17

# results land in ./out/run-2026-04-26.json
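
The run artifact carries the median and the min/max band directly, so verification is a field lookup rather than a re-computation. A quick pull with jq; the field names below are assumptions for illustration (run jq 'keys' on the file once to see the real ones):

inspect.sh
# ASSUMPTION: field names are illustrative, not the harness's real schema
jq -r '"median: \(.throughput.median_tokens_per_s) tok/s  " +
       "band: [\(.throughput.min), \(.throughput.max)]"' \
    out/run-2026-04-26.json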

Practices

What we will not do.

We will not cherry-pick configs

Numbers are reported across input/output ratios that span the production envelope, not the single config where we look best. The harness emits a heatmap.
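
A sketch of the sweep behind that heatmap, assuming workload names extrapolate from the published tokens-1k-in-1k-out pattern (the real grid and names may differ):

sweep.sh
# ASSUMPTION: workload names are extrapolated from the published config;
# check the harness docs for the real grid
for in_len in 128 1k 4k 16k; do
  for out_len in 128 1k 4k; do
    uv run bench \
        --model llama-3.1-70b \
        --hardware b200 \
        --workload "tokens-${in_len}-in-${out_len}-out" \
        --runtime iframe \
        --seed 17
  done
done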

We will not change kernels between runs

Every published number identifies the runtime version that produced it. Once a number is published, the corresponding runtime version is preserved indefinitely.

We will not test against a strawman

Comparison runtimes use their published recommended configs, not their out-of-the-box defaults. If our lead on a workload is smaller than expected, we will say so.
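
Concretely, the same published config can be pointed at each comparison runtime. A sketch, with the caveat that the runtime identifiers below are guesses; only iframe appears in the published config:

baselines.sh
# ASSUMPTION: runtime identifiers other than `iframe` are illustrative
for rt in vllm trt-llm sglang; do
  uv run bench \
      --model llama-3.1-70b \
      --hardware b200 \
      --workload tokens-1k-in-1k-out \
      --runtime "$rt" \
      --seed 17
done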

Run the benchmarks. Verify the numbers.

Free trial credits cover the published configs.