Research · Acceleration
20× on the same hardware.
Our managed inference product runs 10-20x faster than vLLM defaults on the same GPU. The improvement comes from six independent techniques composed into a single optimization pipeline, with quality regressions bounded by automated benchmarks before any change ships.
Headline result
Composed gains across six techniques.
Six techniques
Each one published, each one in production.
Tree-structured speculative decoding
A small draft model proposes multiple continuations in parallel; the target model verifies them in a single forward pass. We use tree topologies and a calibrated rejection sampler to reach 4-5x throughput on production workloads.
4.7x tokens/sec
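A minimal sketch of the draft-and-verify loop in PyTorch with Hugging Face models, assuming a single greedy chain: the tree topology and the calibrated rejection sampler are not shown, and the checkpoint names are placeholders rather than our production pair.

```python
# Single-chain speculative decoding sketch with greedy acceptance.
# Illustrative only: production uses tree-structured drafts and a
# calibrated rejection sampler; the checkpoints here are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT, TARGET = "gpt2", "gpt2-large"  # placeholder draft/target pair
tok = AutoTokenizer.from_pretrained(TARGET)
draft = AutoModelForCausalLM.from_pretrained(DRAFT).eval()
target = AutoModelForCausalLM.from_pretrained(TARGET).eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new: int = 64, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    produced = 0
    while produced < max_new:
        start = ids.shape[1]
        # 1) Draft model proposes k tokens, one greedy step at a time.
        draft_ids = ids
        for _ in range(k):
            logits = draft(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
        proposal = draft_ids[:, start:]

        # 2) Target model scores the whole proposal in one forward pass.
        logits = target(draft_ids).logits
        verify = logits[:, start - 1 : -1, :].argmax(-1)  # target's pick at each position

        # 3) Accept the longest agreeing prefix, then take one token
        #    from the target itself (a free correction or extension).
        agree = (verify == proposal).long()[0]
        n_accept = int(agree.cumprod(0).sum())
        bonus = logits[:, start - 1 + n_accept, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, proposal[:, :n_accept], bonus], dim=-1)
        produced += n_accept + 1
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("The capital of France is"))
```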
FP8 quantization with calibrated rounding

Activation-aware quantization with per-channel scaling. Zero measured quality regression on MT-Bench across the Llama 3.1 family at FP8.
1.9x throughput
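The core mechanic, sketched in PyTorch: one scale per output channel maps that channel's absolute maximum onto the E4M3 range. The activation-aware calibration that chooses the final scales is not shown; plain absmax scaling stands in for it here.

```python
# Per-channel FP8 (E4M3) weight quantization sketch in PyTorch.
# Illustrative only: absmax scaling in place of the activation-aware
# calibration described above.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_per_channel(w: torch.Tensor):
    """w: [out_features, in_features] weight matrix."""
    # One scale per output channel (row), from that row's absolute max.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_fp8_per_channel(w)
err = (dequantize(w_fp8, scale) - w).abs().mean() / w.abs().mean()
print(f"mean relative quantization error: {err:.4f}")
```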
INT4 weight quantization

Regularized rounding scheme from our ICML 2024 paper. INT4 inference at FP8 quality on the workloads where prior INT4 schemes drop 4-6 MT-Bench points.
2.4x throughput
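For orientation, a baseline group-wise INT4 quantizer in PyTorch using symmetric round-to-nearest. This is the starting point the regularized rounding improves on, not the scheme from the paper.

```python
# INT4 weight quantization sketch: symmetric round-to-nearest with
# group-wise scales. Baseline only; the regularized rounding scheme
# referenced above is not reproduced here.
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    """w: [out_features, in_features]; in_features divisible by group_size."""
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    # One scale per group, mapping the group's absmax to the INT4 limit (7).
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 7.0
    q = (groups / scale).round().clamp(-8, 7).to(torch.int8)  # values fit in 4 bits
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale, w.shape)
print(f"mean abs quantization error: {(w_hat - w).abs().mean():.5f}")
```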
Continuous batching with priority queueing

Workload-aware queueing keeps tail latency bounded under load. P99 stays under 1.4x the median up to 95% utilization, against an industry-typical 3-4x degradation.
1.6x throughput at SLO
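A toy Python scheduler loop showing the shape of the idea: finished sequences leave the batch immediately, and waiting requests are admitted by priority at every step. The priorities, token budget, and admission rule are placeholders, not the production policy.

```python
# Continuous batching with priority queueing: a toy scheduler loop.
# Illustrative only; policy parameters here are placeholders.
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: int                      # lower value = served first
    arrival: int                       # tie-break on arrival order
    prompt: str = field(compare=False)
    remaining: int = field(compare=False, default=32)  # tokens left to emit

class Scheduler:
    def __init__(self, token_budget: int = 8):
        self.queue: list[Request] = []     # waiting requests (min-heap)
        self.running: list[Request] = []   # current batch
        self.budget = token_budget         # max sequences decoded per step

    def submit(self, req: Request) -> None:
        heapq.heappush(self.queue, req)

    def step(self) -> None:
        # Admit the highest-priority waiting requests into free batch slots.
        while self.queue and len(self.running) < self.budget:
            self.running.append(heapq.heappop(self.queue))
        # Decode one token for every running sequence (model call elided).
        for req in self.running:
            req.remaining -= 1
        # Finished sequences leave immediately, freeing slots for the next
        # step instead of waiting for the whole batch to drain.
        self.running = [r for r in self.running if r.remaining > 0]

counter = itertools.count()
sched = Scheduler(token_budget=4)
for prio, prompt in [(0, "interactive"), (1, "batch job"), (0, "interactive 2")]:
    sched.submit(Request(prio, next(counter), prompt, remaining=8))
while sched.running or sched.queue:
    sched.step()
```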
Custom CUDA / HIP / SYCL kernels

Fused attention, layer-norm, and rotary-embedding kernels for NVIDIA, AMD, and Intel hardware. Each backend matches the native vendor library within 3% on our internal regression suite.
1.4x kernel time
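To illustrate what fusion buys, here is a normalization-plus-rotary chain written as separate PyTorch ops and then compiled into fewer launches with torch.compile. The production kernels are hand-written for each backend rather than generated this way, and the shapes below are arbitrary.

```python
# Kernel-fusion sketch: RMSNorm + rotary embedding as separate eager ops,
# then compiled so the elementwise chain runs in fewer kernel launches.
# Illustrative only; not the hand-written CUDA / HIP / SYCL kernels.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def rotary(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotate channel pairs: standard rotary position embedding.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def prologue(x, cos, sin, weight):
    # In eager mode this launches several elementwise kernels; compiled,
    # the chain is fused into far fewer launches.
    return rotary(rms_norm(x, weight), cos, sin)

fused_prologue = torch.compile(prologue)

x = torch.randn(1, 32, 128)
cos, sin = torch.randn(1, 32, 64), torch.randn(1, 32, 64)
weight = torch.ones(128)
ref = prologue(x, cos, sin, weight)
out = fused_prologue(x, cos, sin, weight)
print(f"max abs difference vs eager: {(ref - out).abs().max():.2e}")
```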
Vendor-neutral runtime

A single source-compatible scheduler and collective library that runs unmodified across NVIDIA, AMD, and Intel. A customer's choice of hardware does not require a port.
— vendors supported: 3
Vendor neutrality
One runtime. Three vendors. No source changes.
NVIDIA
B300, B200, H200, H100, A100. NCCL backend, FlashAttention-3 kernels, Hopper- and Blackwell-specific tensor-core paths.
AMD
MI300X. RCCL backend, ROCm 6.x, CK-derived attention kernels. Within 3% of NVIDIA's H200 on Llama 3.1 inference at the same precision.
Intel
Gaudi 2 and Gaudi 3. oneCCL backend, SynapseAI runtime, custom kernel pack. The same training script that runs on an H200 runs unmodified on Gaudi.
Inference acceleration
20x is a feature, not a benchmark.
Run the same model and measure for yourself. Free trial credits, no commitment.