

Reproducible. By design.

Every performance claim we publish ships with a script and a seed. We run the harnesses on a clean cluster once a quarter, publish the JSON, and host the raw output. If your numbers do not match ours, please open an issue.

Four harnesses

What we maintain.

  • github.com/iframe-ai/inference-bench

    Inference throughput harness

    End-to-end token throughput across input/output ratios, batch sizes, and contention patterns. Compares vLLM, TensorRT-LLM, SGLang, and our runtime on identical hardware.

    20x measured speedup at the published config

  • github.com/iframe-ai/longctx-evals

    Long-context evaluation harness

    RULER, needle-in-a-haystack, multi-doc QA, and code-completion across whole repositories. Scores correlate with what production users actually care about.

    99.1% of dense-attention quality at 6.2x compute reduction

  • github.com/iframe-ai/training-bench

    Multi-vendor training harness

    Same training script, three vendors. Reports tokens-per-second per dollar, MFU, and step-time variance on NVIDIA, AMD, and Intel hardware.

    Within 3% of native libraries on each vendor

  • github.com/iframe-ai/pricing-data

    Cost calculator data

    Monthly snapshots of AWS, Azure, and GCP list prices for the same SKUs we sell. The data behind the calculator on /pricing/calculator is published as CSV; a sanity-check sketch follows this list.

    Updated on the first business day of every month
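
Tokens-per-second per dollar is just measured throughput divided by a list price, so the pricing-data snapshots make the training-bench numbers easy to sanity-check by hand. A minimal sketch, assuming an illustrative provider,sku,region,usd_per_hour column layout, snapshot path, and throughput figure, none of which are taken from the repo:

sanity-check.sh
# ASSUMPTION: column layout, snapshot path, and throughput figure are all
# illustrative -- confirm the real schema in the pricing-data README
TPS=41800   # placeholder tokens/s from a training-bench run

USD_PER_HOUR=$(awk -F, '$1=="aws" && $2=="p5.48xlarge" {print $4; exit}' \
    snapshots/2026-04.csv)

# tokens/s divided by $/hour: throughput bought per dollar-hour
awk -v tps="$TPS" -v price="$USD_PER_HOUR" \
    'BEGIN { printf "%.0f tokens/s per dollar-hour\n", tps / price }'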

Run it

Five-line reproduction.

The published numbers on the home page, the inference page, and our research papers all come from the inference harness. The reproduction recipe is five lines of bash plus a clone: the same commands we run on our quarterly clean cluster.

The seed is hard-coded in the published config so the generated workload is bit-identical from run to run. Timing still varies with system noise, so the numbers we publish are the medians of five runs, with min/max bands shown in the JSON output.

reproduce.sh
# clone the inference harness
git clone https://github.com/iframe-ai/inference-bench
cd inference-bench

# install
uv sync

# run the published Llama-70B / B200 / 1K-in 1K-out config
uv run bench \
    --model llama-3.1-70b \
    --hardware b200 \
    --workload tokens-1k-in-1k-out \
    --runtime iframe \
    --seed 17

# results land in ./out/run-2026-04-26.json
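
The run artifact carries the median and the min/max band directly, so verification is a field lookup rather than a re-computation. A quick pull with jq; the field names below are assumptions for illustration (run jq 'keys' on the file once to see the real ones):

inspect.sh
# ASSUMPTION: field names are illustrative, not the harness's real schema
jq -r '"median: \(.throughput.median_tokens_per_s) tok/s  " +
       "band: [\(.throughput.min), \(.throughput.max)]"' \
    out/run-2026-04-26.json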

Practices

What we will not do.

We will not cherry-pick configs

Numbers are reported across input/output ratios that span the production envelope, not the single config where we look best. The harness emits a heatmap.
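
A sketch of the sweep behind that heatmap, assuming workload names extrapolate from the published tokens-1k-in-1k-out pattern (the real grid and names may differ):

sweep.sh
# ASSUMPTION: workload names are extrapolated from the published config;
# check the harness docs for the real grid
for in_len in 128 1k 4k 16k; do
  for out_len in 128 1k 4k; do
    uv run bench \
        --model llama-3.1-70b \
        --hardware b200 \
        --workload "tokens-${in_len}-in-${out_len}-out" \
        --runtime iframe \
        --seed 17
  done
done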

We will not change kernels between runs

Every published number identifies the runtime version that produced it. Once a number is published, the corresponding runtime version is preserved indefinitely.

We will not test against a strawman

Comparison runtimes use their published recommended configs, not their out-of-the-box defaults. If our lead on a workload is smaller than expected, we will say so.
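
Concretely, the same published config can be pointed at each comparison runtime. A sketch, with the caveat that the runtime identifiers below are guesses; only iframe appears in the published config:

baselines.sh
# ASSUMPTION: runtime identifiers other than `iframe` are illustrative
for rt in vllm trt-llm sglang; do
  uv run bench \
      --model llama-3.1-70b \
      --hardware b200 \
      --workload tokens-1k-in-1k-out \
      --runtime "$rt" \
      --seed 17
done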

Run the benchmarks. Verify the numbers.

Free trial credits cover the published configs.