
Inference

20× throughput vs vanilla serving

20× faster inference. Same hardware. Same models.

A managed inference stack — quantization, fine-tuning, optimized kernels, smart batching — productized from years of research. Run open-source models in production at a fraction of the cost per million tokens.

What you get

Three layers, automated.

The 20× headline number is not from one trick. It's the compound effect of work at each layer of the stack.

Quantization.

Built-in INT8 / FP8 / 4-bit paths with calibration suites. Quality preservation tested across the supported model catalog.
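
For intuition, here is the shape of what a calibration pass computes: a symmetric per-tensor INT8 scale taken from representative activations. This is a minimal numpy sketch of the technique, not iFrame's implementation; file and function names are illustrative.

calibrate_int8.py
import numpy as np

def int8_scale(calibration_batches):
    # Symmetric per-tensor scale: map the largest observed magnitude to 127.
    max_abs = max(float(np.abs(b).max()) for b in calibration_batches)
    return max_abs / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Calibrate on representative activations, then measure round-trip error.
calib = [np.random.randn(64, 4096).astype(np.float32) for _ in range(8)]
scale = int8_scale(calib)
x = np.random.randn(64, 4096).astype(np.float32)
print(np.abs(dequantize(quantize(x, scale), scale) - x).mean())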

Fine-tuning.

LoRA, QLoRA, and full fine-tuning on your data. Training runs are priced on the same math as raw GPU rentals: roughly 3× more affordable.
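
What a LoRA adapter amounts to, in plain Hugging Face PEFT terms. The managed runs configure this for you, so treat the values below as illustrative defaults, not iFrame's settings.

lora_sketch.py
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16,                                 # rank of the injected low-rank matrices
    lora_alpha=32,                        # scaling on the adapter's contribution
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of the weights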

Optimized serving.

Engine-level work on KV-cache management, scheduling, batching, and kernel-level dispatch. The 20× number lives here.
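
"Smart batching" here means continuous batching: requests join and leave the running batch at token boundaries instead of waiting for a full batch to drain. A toy scheduler to show the idea, nothing like the production engine:

continuous_batching.py
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining: int                 # tokens left to generate

def decode_step(batch):
    # Stand-in for one fused forward pass over the whole batch.
    for r in batch:
        r.remaining -= 1

def serve(queue, max_batch=32):
    active, steps = [], 0
    while queue or active:
        # Arrivals join at every token boundary; no one waits for a drain.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        decode_step(active)
        active = [r for r in active if r.remaining > 0]
        steps += 1
    return steps

# One 100-token job no longer holds thirty-one 5-token jobs hostage.
print(serve(deque([Request(100)] + [Request(5)] * 31)))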

Performance

Numbers from the open benchmark suite.

Methodology and raw data are downloadable. Re-run them on your own hardware.

Throughput: 20× vs the vLLM / TGI baseline on Llama-class models.

p99 latency: < 50 ms on production chat workloads, in-region.

Cost: industry-leading $/M tokens at the published list price.

Supported models

A curated open-source catalog. Plus your own.

We test, quantize, calibrate, and pin every model on this list to a known-good build. Bring your own and we'll do the same for it.

Family    | Sizes            | Max context | Modality
Llama 3.x | 8B · 70B · 405B  | 1M          | Text
Qwen 2.5  | 7B · 14B · 72B   | 1M          | Text · vision
Mixtral   | 8×7B · 8×22B     | 256K        | Text
DeepSeek  | V2 · V3 · R1     | 1M          | Text · reasoning
Mistral   | Small · Large    | 256K        | Text
Custom    | Bring your own   | Per-model   | Per-model

API

OpenAI-compatible. Drop-in replacement.

If you have OpenAI SDK code today, change the base URL and the model. Everything else works.

quickstart.py
from iframe import Inference

client = Inference(api_key="...")
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response.choices[0].message.content)

SDKs: Python, Node, Go, Rust. OpenAI SDK works unmodified — point it at api.iframe.ai.
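
The same quickstart with the stock OpenAI SDK repointed. The /v1 suffix on the base URL is an assumption; check the dashboard for the exact value.

openai_dropin.py
from openai import OpenAI

# Existing OpenAI SDK code, with only the base URL and model changed.
client = OpenAI(base_url="https://api.iframe.ai/v1", api_key="...")
response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response.choices[0].message.content)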

Endpoints: chat completions, completions, embeddings, fine-tunes, batch.

Streaming: SSE and WebSocket. First-token latency is the latency you optimize for.
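
Measuring that first token, assuming the OpenAI-style stream=True flag carries over and chunks arrive over SSE; client is the one from the quickstart above.

streaming.py
import time

start = time.perf_counter()
ttft = None
stream = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Stream me a haiku."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta and ttft is None:
        ttft = time.perf_counter() - start   # time to first token
    print(delta, end="", flush=True)
print(f"\nTTFT: {ttft:.3f}s")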

Long context

Million tokens on standard hardware.

iFrame's long-context engine handles million-token-class inputs on the same SKUs that serve everyday chat. Patent registered. Tested at 1B+ tokens.


Run the API. Or run a fine-tune. Or both.

Sign up free, drop in your OpenAI SDK, and ship today. Talk to sales for dedicated capacity and custom models.