Research · Acceleration

20× on the same hardware.

Our managed inference product runs 10-20× faster than vLLM defaults on the same GPU. The improvement comes from six independent techniques composed into a single optimization pipeline, with quality regressions bounded by automated benchmarks before any change ships.

Headline result

Composed gains across six techniques.

  • End-to-end speedup: 20× (Llama 3.1 70B, 1K-in / 1K-out, B200)
  • Cost / 1M tokens: $0.32 input, $0.55 output (list price)
  • P99 / median latency: 1.4× (tail latency at 95% utilization)
  • MT-Bench regression: 0.00 (versus the full-precision baseline)

Six techniques

Each one published, each one in production.

  • Tree-structured speculative decoding

    A small draft model proposes multiple continuations in parallel and the target model verifies them in a single pass. We use tree topologies and a calibrated rejection sampler to reach 4-5× decode throughput on production workloads (a minimal sketch follows this list).

    4.7× tokens/sec
  • FP8 quantization with calibrated rounding

    Activation-aware quantization with per-channel scaling. Zero measured quality regression on MT-Bench across the Llama 3.1 family at FP8 (a per-channel scaling sketch follows this list).

    1.9× throughput
  • INT4 weight quantization

    A regularized rounding scheme from our ICML 2024 paper delivers INT4 inference at FP8 quality on workloads where prior INT4 schemes drop 4-6 MT-Bench points (a baseline sketch follows this list).

    2.4× throughput
  • Continuous batching with priority queueing

    Workload-aware queueing keeps tail latency bounded under load. P99 stays under 1.4× the median up to 95% utilization, versus an industry-typical 3-4× degradation (a scheduler sketch follows this list).

    1.6× throughput at SLO
  • Custom CUDA / HIP / SYCL kernels

    Fused attention, layer-norm, and rotary-embedding kernels for NVIDIA, AMD, and Intel hardware. Each backend comes within 3% of the native vendor library on our internal regression suite.

    1.4× kernel time
  • Vendor-neutral runtime

    A single source-compatible scheduler and collective library that runs unmodified across NVIDIA, AMD, and Intel. Customer choice of hardware does not require a port.

    3 vendors supported
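
Sketch: speculative decoding, reduced to a single branch. This is not the production tree-based implementation or the calibrated rejection sampler; it only shows the standard propose-then-verify loop with the usual acceptance rule. The draft_probs and target_probs callables are hypothetical stand-ins for the two models' next-token distributions.

```python
# Single-branch speculative decoding sketch (illustrative only).
# draft_probs(ctx) and target_probs(ctx) are hypothetical callables that
# return a next-token probability vector over the vocabulary.
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, k=4, rng=None):
    rng = rng or np.random.default_rng()

    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, q = [], []
    ctx = list(prefix)
    for _ in range(k):
        dist = draft_probs(ctx)                      # draft distribution q(. | ctx)
        tok = int(rng.choice(len(dist), p=dist))
        proposed.append(tok)
        q.append(dist)
        ctx.append(tok)

    # 2. Target model scores every proposed position (one batched forward
    #    pass in practice; written as a loop here for clarity).
    p = [target_probs(list(prefix) + proposed[:i]) for i in range(k)]

    # 3. Accept token i with probability min(1, p_i/q_i); on the first
    #    rejection, resample from the residual max(p - q, 0) and stop.
    accepted = []
    for i, tok in enumerate(proposed):
        if rng.random() < min(1.0, p[i][tok] / max(q[i][tok], 1e-12)):
            accepted.append(tok)
        else:
            residual = np.clip(p[i] - q[i], 0.0, None)
            residual = residual / residual.sum() if residual.sum() > 0 else p[i]
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```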
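
Sketch: per-channel FP8 weight scaling. The activation-aware calibration step is omitted; this only shows how one scale per output channel maps weights onto the E4M3 range. Function names are illustrative, not our API, and it assumes PyTorch 2.1+ for the float8_e4m3fn dtype.

```python
# Per-channel FP8 (E4M3) weight fake-quantization sketch (illustrative only).
# Requires PyTorch >= 2.1 for the float8_e4m3fn dtype; the activation-aware
# calibration step is not shown.
import torch

def quantize_fp8_per_channel(weight: torch.Tensor):
    """weight: [out_features, in_features] in fp16 / bf16 / fp32."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448 for E4M3
    # One scale per output channel so every row spans the full FP8 range.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / fp8_max
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(scale.dtype) * scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, s = quantize_fp8_per_channel(w)
reconstruction_error = (dequantize_fp8(w_fp8, s) - w).abs().mean()
```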
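
Sketch: a plain group-wise round-to-nearest INT4 baseline, shown for orientation only. This is not the regularized rounding scheme from the ICML 2024 paper; it is the kind of baseline that scheme improves on. The group size of 128 is an assumption.

```python
# Group-wise round-to-nearest INT4 baseline (illustrative only).
# This is NOT the regularized rounding scheme from the paper; it only shows
# the storage layout such schemes improve on. Group size 128 is assumed.
import torch

def quantize_int4_groupwise(weight: torch.Tensor, group_size: int = 128):
    """weight: [out_features, in_features]; in_features must divide by group_size."""
    out_f, in_f = weight.shape
    assert in_f % group_size == 0
    w = weight.reshape(out_f, in_f // group_size, group_size)
    # Symmetric 4-bit range: one scale per group, codes clamped to [-8, 7].
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7
    codes = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return codes.reshape(out_f, in_f), scale.squeeze(-1)

def dequantize_int4(codes: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    out_f, in_f = codes.shape
    w = codes.reshape(out_f, in_f // group_size, group_size).to(scale.dtype)
    return (w * scale.unsqueeze(-1)).reshape(out_f, in_f)
```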
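
Sketch: a toy continuous-batching scheduler with a priority queue. Real schedulers also track KV-cache memory, preemption, and per-tenant SLOs; this only shows how waiting requests join a running batch mid-flight in priority order. The names and token budget are illustrative assumptions.

```python
# Toy continuous-batching scheduler with priority queueing (illustrative only).
# Real schedulers also track KV-cache memory, preemption, and per-tenant SLOs.
# Request fields and MAX_BATCH_TOKENS are assumed names, not our API.
import heapq
import itertools
from dataclasses import dataclass, field

MAX_BATCH_TOKENS = 8192            # per-iteration token budget (assumed)
_arrival = itertools.count()

@dataclass(order=True)
class Request:
    priority: int                  # lower value = served first
    arrival: int = field(default_factory=lambda: next(_arrival))
    prompt_tokens: int = 0
    remaining: int = 0             # decode steps still owed

waiting: list[Request] = []        # heap ordered by (priority, arrival)
running: list[Request] = []

def submit(req: Request) -> None:
    heapq.heappush(waiting, req)

def step() -> None:
    """One scheduler iteration: admit new work, decode one token per sequence, retire."""
    budget = MAX_BATCH_TOKENS - sum(r.prompt_tokens for r in running)
    # Admit the highest-priority waiting requests while the budget allows,
    # so new work joins mid-flight instead of waiting for the batch to drain.
    while waiting and waiting[0].prompt_tokens <= budget:
        req = heapq.heappop(waiting)
        running.append(req)
        budget -= req.prompt_tokens
    for r in running:
        r.remaining -= 1           # stand-in for one decode step
    running[:] = [r for r in running if r.remaining > 0]
```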

Vendor neutrality

One runtime. Three vendors. No source changes.

NVIDIA

B300, B200, H200, H100, A100. NCCL backend, FlashAttention-3 kernels, Hopper / Blackwell-specific tensor-core paths.

Fully supported

AMD

MI300X. RCCL backend, ROCm 6.x, CK-derived attention kernels. Within 3% of NVIDIA's H200 on Llama 3.1 inference at the same precision.

Fully supported

Intel

Gaudi 2 and Gaudi 3. oneCCL backend, SynapseAI runtime, custom kernel pack. The same script that runs on an H200 runs unmodified on Gaudi.

Fully supported
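
A hypothetical sketch of what vendor-neutral backend selection looks like from the serving side: the same script, with only a lookup keyed by vendor changing. The class and field names are illustrative assumptions, not the product's actual API; the per-vendor details mirror the cards above.

```python
# Hypothetical backend-selection sketch: one serving script, three vendor
# stacks. Class and field names are illustrative assumptions, not the
# product's actual API; the per-vendor details mirror the cards above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    vendor: str
    collective: str                # communication library
    attention_kernels: str

BACKENDS = {
    "nvidia": Backend("NVIDIA", collective="NCCL",   attention_kernels="FlashAttention-3"),
    "amd":    Backend("AMD",    collective="RCCL",   attention_kernels="CK-derived"),
    "intel":  Backend("Intel",  collective="oneCCL", attention_kernels="custom kernel pack"),
}

def select_backend(device: str) -> Backend:
    """The only line that changes when a customer switches hardware."""
    try:
        return BACKENDS[device]
    except KeyError:
        raise ValueError(f"unsupported device '{device}'; expected one of {sorted(BACKENDS)}")
```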

Inference acceleration

20× is a feature, not a benchmark.

Run the same model and measure for yourself. Free trial credits, no commitment.