Research · Acceleration

20× on the same hardware.

Our managed inference product runs 10-20× faster than vLLM defaults on the same GPU. The improvement comes from six independent techniques composed into a single optimization pipeline, with quality regressions bounded by automated benchmarks before any change ships.

Headline result

Composed gains across six techniques.

  • End-to-end speedup: 20× (Llama 3.1 70B, 1K-in / 1K-out, B200)
  • Cost / 1M tokens: $0.32 input, $0.55 output (list price)
  • P99 / median latency: 1.4× (tail latency at 95% utilization)
  • MT-Bench regression: 0.00 (versus the full-precision baseline)

Six techniques

Each one published, each one in production.

  • Tree-structured speculative decoding

    A small draft model proposes multiple continuations in parallel and the target model verifies them in a single pass. We use tree topologies and a calibrated rejection sampler to reach 4-5× decode throughput on production workloads (a minimal sketch follows this list).

    4.7× tokens/sec
  • FP8 quantization with calibrated rounding

    Activation-aware quantization with per-channel scaling. Zero measured quality regression on MT-Bench across the Llama 3.1 family at FP8 (a per-channel scaling sketch follows this list).

    1.9× throughput
  • INT4 weight quantization

    A regularized rounding scheme from our ICML 2024 paper delivers INT4 inference at FP8 quality on workloads where prior INT4 schemes drop 4-6 MT-Bench points (a baseline sketch follows this list).

    2.4× throughput
  • Continuous batching with priority queueing

    Workload-aware queueing keeps tail latency bounded under load. P99 stays under 1.4× the median up to 95% utilization, versus an industry-typical 3-4× degradation (a scheduler sketch follows this list).

    1.6× throughput at SLO
  • Custom CUDA / HIP / SYCL kernels

    Fused attention, layer-norm, and rotary-embedding kernels for NVIDIA, AMD, and Intel hardware. Each backend comes within 3% of the native vendor library on our internal regression suite.

    1.4× kernel time
  • Vendor-neutral runtime

    A single source-compatible scheduler and collective library that runs unmodified across NVIDIA, AMD, and Intel. Customer choice of hardware does not require a port.

    3 vendors supported
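
Sketch: speculative decoding, reduced to a single branch. This is not the production tree-based implementation or the calibrated rejection sampler; it only shows the standard propose-then-verify loop with the usual acceptance rule. The draft_probs and target_probs callables are hypothetical stand-ins for the two models' next-token distributions.

```python
# Single-branch speculative decoding sketch (illustrative only).
# draft_probs(ctx) and target_probs(ctx) are hypothetical callables that
# return a next-token probability vector over the vocabulary.
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, k=4, rng=None):
    rng = rng or np.random.default_rng()

    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, q = [], []
    ctx = list(prefix)
    for _ in range(k):
        dist = draft_probs(ctx)                      # draft distribution q(. | ctx)
        tok = int(rng.choice(len(dist), p=dist))
        proposed.append(tok)
        q.append(dist)
        ctx.append(tok)

    # 2. Target model scores every proposed position (one batched forward
    #    pass in practice; written as a loop here for clarity).
    p = [target_probs(list(prefix) + proposed[:i]) for i in range(k)]

    # 3. Accept token i with probability min(1, p_i/q_i); on the first
    #    rejection, resample from the residual max(p - q, 0) and stop.
    accepted = []
    for i, tok in enumerate(proposed):
        if rng.random() < min(1.0, p[i][tok] / max(q[i][tok], 1e-12)):
            accepted.append(tok)
        else:
            residual = np.clip(p[i] - q[i], 0.0, None)
            residual = residual / residual.sum() if residual.sum() > 0 else p[i]
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```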
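
Sketch: per-channel FP8 weight scaling. The activation-aware calibration step is omitted; this only shows how one scale per output channel maps weights onto the E4M3 range. Function names are illustrative, not our API, and it assumes PyTorch 2.1+ for the float8_e4m3fn dtype.

```python
# Per-channel FP8 (E4M3) weight fake-quantization sketch (illustrative only).
# Requires PyTorch >= 2.1 for the float8_e4m3fn dtype; the activation-aware
# calibration step is not shown.
import torch

def quantize_fp8_per_channel(weight: torch.Tensor):
    """weight: [out_features, in_features] in fp16 / bf16 / fp32."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448 for E4M3
    # One scale per output channel so every row spans the full FP8 range.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / fp8_max
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(scale.dtype) * scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, s = quantize_fp8_per_channel(w)
reconstruction_error = (dequantize_fp8(w_fp8, s) - w).abs().mean()
```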
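
Sketch: a plain group-wise round-to-nearest INT4 baseline, shown for orientation only. This is not the regularized rounding scheme from the ICML 2024 paper; it is the kind of baseline that scheme improves on. The group size of 128 is an assumption.

```python
# Group-wise round-to-nearest INT4 baseline (illustrative only).
# This is NOT the regularized rounding scheme from the paper; it only shows
# the storage layout such schemes improve on. Group size 128 is assumed.
import torch

def quantize_int4_groupwise(weight: torch.Tensor, group_size: int = 128):
    """weight: [out_features, in_features]; in_features must divide by group_size."""
    out_f, in_f = weight.shape
    assert in_f % group_size == 0
    w = weight.reshape(out_f, in_f // group_size, group_size)
    # Symmetric 4-bit range: one scale per group, codes clamped to [-8, 7].
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7
    codes = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return codes.reshape(out_f, in_f), scale.squeeze(-1)

def dequantize_int4(codes: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    out_f, in_f = codes.shape
    w = codes.reshape(out_f, in_f // group_size, group_size).to(scale.dtype)
    return (w * scale.unsqueeze(-1)).reshape(out_f, in_f)
```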
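
Sketch: a toy continuous-batching scheduler with a priority queue. Real schedulers also track KV-cache memory, preemption, and per-tenant SLOs; this only shows how waiting requests join a running batch mid-flight in priority order. The names and token budget are illustrative assumptions.

```python
# Toy continuous-batching scheduler with priority queueing (illustrative only).
# Real schedulers also track KV-cache memory, preemption, and per-tenant SLOs.
# Request fields and MAX_BATCH_TOKENS are assumed names, not our API.
import heapq
import itertools
from dataclasses import dataclass, field

MAX_BATCH_TOKENS = 8192            # per-iteration token budget (assumed)
_arrival = itertools.count()

@dataclass(order=True)
class Request:
    priority: int                  # lower value = served first
    arrival: int = field(default_factory=lambda: next(_arrival))
    prompt_tokens: int = 0
    remaining: int = 0             # decode steps still owed

waiting: list[Request] = []        # heap ordered by (priority, arrival)
running: list[Request] = []

def submit(req: Request) -> None:
    heapq.heappush(waiting, req)

def step() -> None:
    """One scheduler iteration: admit new work, decode one token per sequence, retire."""
    budget = MAX_BATCH_TOKENS - sum(r.prompt_tokens for r in running)
    # Admit the highest-priority waiting requests while the budget allows,
    # so new work joins mid-flight instead of waiting for the batch to drain.
    while waiting and waiting[0].prompt_tokens <= budget:
        req = heapq.heappop(waiting)
        running.append(req)
        budget -= req.prompt_tokens
    for r in running:
        r.remaining -= 1           # stand-in for one decode step
    running[:] = [r for r in running if r.remaining > 0]
```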

Vendor neutrality

One runtime. Three vendors. No source changes.

NVIDIA

B300, B200, H200, H100, A100. NCCL backend, FlashAttention-3 kernels, Hopper / Blackwell-specific tensor-core paths.

Fully supported

AMD

MI300X. RCCL backend, ROCm 6.x, CK-derived attention kernels. Within 3% of NVIDIA's H200 on Llama 3.1 inference at the same precision.

Fully supported

Intel

Gaudi 2 and Gaudi 3. oneCCL backend, SynapseAI runtime, custom kernel pack. The same script that runs on an H200 runs unmodified on Gaudi.

Fully supported
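
A hypothetical sketch of what vendor-neutral backend selection looks like from the serving side: the same script, with only a lookup keyed by vendor changing. The class and field names are illustrative assumptions, not the product's actual API; the per-vendor details mirror the cards above.

```python
# Hypothetical backend-selection sketch: one serving script, three vendor
# stacks. Class and field names are illustrative assumptions, not the
# product's actual API; the per-vendor details mirror the cards above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    vendor: str
    collective: str                # communication library
    attention_kernels: str

BACKENDS = {
    "nvidia": Backend("NVIDIA", collective="NCCL",   attention_kernels="FlashAttention-3"),
    "amd":    Backend("AMD",    collective="RCCL",   attention_kernels="CK-derived"),
    "intel":  Backend("Intel",  collective="oneCCL", attention_kernels="custom kernel pack"),
}

def select_backend(device: str) -> Backend:
    """The only line that changes when a customer switches hardware."""
    try:
        return BACKENDS[device]
    except KeyError:
        raise ValueError(f"unsupported device '{device}'; expected one of {sorted(BACKENDS)}")
```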

Inference acceleration

20× is a feature, not a benchmark.

Run the same model and measure for yourself. Free trial credits, no commitment.