
Inference

20× throughput vs vanilla serving

20× faster inference. Same hardware. Same models.

A managed inference stack — quantization, fine-tuning, optimized kernels, smart batching — productized from years of research. Run open-source models in production at a fraction of the cost per million tokens.

What you get

Three layers, automated.

The 20× headline number is not from one trick. It's the compound effect of work at each layer of the stack.

Quantization.

Built-in INT8 / FP8 / 4-bit paths with calibration suites. Quality preservation tested across the supported model catalog.
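
For intuition, here is the shape of what a calibration pass computes: a symmetric per-tensor INT8 scale taken from representative activations. This is a minimal numpy sketch of the technique, not iFrame's implementation; file and function names are illustrative.

calibrate_int8.py
import numpy as np

def int8_scale(calibration_batches):
    # Symmetric per-tensor scale: map the largest observed magnitude to 127.
    max_abs = max(float(np.abs(b).max()) for b in calibration_batches)
    return max_abs / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Calibrate on representative activations, then measure round-trip error.
calib = [np.random.randn(64, 4096).astype(np.float32) for _ in range(8)]
scale = int8_scale(calib)
x = np.random.randn(64, 4096).astype(np.float32)
print(np.abs(dequantize(quantize(x, scale), scale) - x).mean())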

Fine-tuning.

LoRA, QLoRA, and full fine-tuning on your data. Training runs are priced on the same math as raw GPU rentals: roughly 3× more affordable.
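
What a LoRA adapter amounts to, in plain Hugging Face PEFT terms. The managed runs configure this for you, so treat the values below as illustrative defaults, not iFrame's settings.

lora_sketch.py
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16,                                 # rank of the injected low-rank matrices
    lora_alpha=32,                        # scaling on the adapter's contribution
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of the weights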

Optimized serving.

Engine-level work on KV-cache management, scheduling, batching, and kernel-level dispatch. The 20× number lives here.
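
"Smart batching" here means continuous batching: requests join and leave the running batch at token boundaries instead of waiting for a full batch to drain. A toy scheduler to show the idea, nothing like the production engine:

continuous_batching.py
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining: int                 # tokens left to generate

def decode_step(batch):
    # Stand-in for one fused forward pass over the whole batch.
    for r in batch:
        r.remaining -= 1

def serve(queue, max_batch=32):
    active, steps = [], 0
    while queue or active:
        # Arrivals join at every token boundary; no one waits for a drain.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        decode_step(active)
        active = [r for r in active if r.remaining > 0]
        steps += 1
    return steps

# One 100-token job no longer holds thirty-one 5-token jobs hostage.
print(serve(deque([Request(100)] + [Request(5)] * 31)))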

Performance

Numbers from the open benchmark suite.

Methodology and raw data are downloadable. Re-run them on your own hardware.

Throughput: 20× vs the vLLM / TGI baseline on Llama-class models.

p99 latency: < 50 ms on production chat workloads, in-region.

Cost: industry-leading $/M tokens at the published list price.

Supported models

A curated open-source catalog. Plus your own.

We test, quantize, calibrate, and pin every model on this list to a known-good build. Bring your own and we'll do the same for it.

Family    | Sizes            | Max context | Modality
Llama 3.x | 8B · 70B · 405B  | 1M          | Text
Qwen 2.5  | 7B · 14B · 72B   | 1M          | Text · vision
Mixtral   | 8×7B · 8×22B     | 256K        | Text
DeepSeek  | V2 · V3 · R1     | 1M          | Text · reasoning
Mistral   | Small · Large    | 256K        | Text
Custom    | Bring your own   | Per-model   | Per-model

API

OpenAI-compatible. Drop-in replacement.

If you have OpenAI SDK code today, change the base URL and the model. Everything else works.

quickstart.py
from iframe import Inference

client = Inference(api_key="...")
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response.choices[0].message.content)

SDKs: Python, Node, Go, Rust. OpenAI SDK works unmodified — point it at api.iframe.ai.
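
The same quickstart with the stock OpenAI SDK repointed. The /v1 suffix on the base URL is an assumption; check the dashboard for the exact value.

openai_dropin.py
from openai import OpenAI

# Existing OpenAI SDK code, with only the base URL and model changed.
client = OpenAI(base_url="https://api.iframe.ai/v1", api_key="...")
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response.choices[0].message.content)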

Endpoints: chat completions, completions, embeddings, fine-tunes, batch.

Streaming: SSE and WebSocket. First-token latency is the latency you optimize for.
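
Measuring that first token, assuming the OpenAI-style stream=True flag carries over and chunks arrive over SSE; client is the one from the quickstart above.

streaming.py
import time

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Stream me a haiku."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta and ttft is None:
        ttft = time.perf_counter() - start   # time to first token
    print(delta, end="", flush=True)
print(f"\nTTFT: {ttft:.3f}s")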

Long context

Million tokens on standard hardware.

iFrame's long-context engine handles million-token-class inputs on the same SKUs that serve everyday chat. Patent registered. Tested at 1B+ tokens.


Run the API. Or run a fine-tune. Or both.

Sign up free, drop in your OpenAI SDK, and ship today. Talk to sales for dedicated capacity and custom models.