Inference
20× faster inference vs. vanilla serving. Same hardware. Same models.
A managed inference stack — quantization, fine-tuning, optimized kernels, smart batching — productized from years of research. Run open-source models in production at a fraction of the cost per million tokens.
What you get
Three layers, automated.
The 20× headline number is not from one trick. It's the compound effect of work at each layer of the stack.
Quantization.
Built-in INT8 / FP8 / 4-bit paths with calibration suites. Quality preservation tested across the supported model catalog.
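For intuition, here's what a symmetric INT8 round trip looks like. This is a toy NumPy sketch, not iFrame's calibration pipeline; the scale factor here plays the role a calibration suite computes from real activations and weights.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor INT8: calibrate a scale from the observed
    # value range, then snap every weight to the nearest INT8 level.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
```

Quality preservation is about keeping that round-trip error from compounding across layers, which is what the calibration suites test per model.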
Fine-tuning.
LoRA, QLoRA, and full fine-tuning on your data. Training runs are priced with the same 3×-more-affordable math as raw GPU rentals.
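A rough sketch of the idea behind LoRA: freeze the pretrained weight and train a low-rank update instead, cutting trainable parameters from d² to 2dr. Toy NumPy for illustration, not the training stack; all shapes are made up.

```python
import numpy as np

d, r = 4096, 16                    # model width, LoRA rank (r << d)
W = np.random.randn(d, d) * 0.01   # frozen pretrained weight
A = np.random.randn(r, d) * 0.01   # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-init so training starts at W

def forward(x: np.ndarray) -> np.ndarray:
    # Full fine-tuning would update all d*d entries of W;
    # LoRA trains only the 2*d*r entries of A and B.
    return x @ W.T + x @ A.T @ B.T

x = np.random.randn(1, d)
print(forward(x).shape)                                      # (1, 4096)
print("trainable:", A.size + B.size, "vs full:", W.size)     # 131072 vs 16777216
```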
Optimized serving.
Engine-level work on KV-cache management, scheduling, batching, and kernel-level dispatch. The 20× number lives here.
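Much of the serving win comes from continuous batching: finished sequences free their slot immediately, and queued requests join mid-flight instead of waiting for the longest request in a static batch. A toy scheduler to show the shape of the idea, not the engine; the Request class and its fields are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    max_tokens: int
    generated: int = 0        # stands in for a per-request KV cache

    def decode_one_token(self):
        self.generated += 1

    @property
    def done(self) -> bool:
        return self.generated >= self.max_tokens

class ContinuousBatcher:
    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.queue = deque()
        self.active = []

    def submit(self, req: Request):
        self.queue.append(req)

    def step(self):
        # Backfill free slots from the queue before every decode step.
        while self.queue and len(self.active) < self.max_batch:
            self.active.append(self.queue.popleft())
        for req in self.active:
            req.decode_one_token()
        # Finished sequences leave immediately; the GPU never idles on them.
        self.active = [r for r in self.active if not r.done]

batcher = ContinuousBatcher(max_batch=2)
for n in (3, 1, 2):
    batcher.submit(Request(max_tokens=n))
while batcher.active or batcher.queue:
    batcher.step()
```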
Performance
Numbers from the open benchmark suite.
Methodology and raw data are downloadable. Re-run them on your own hardware.
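As a starting point for your own measurement, here is a minimal throughput probe against the API. The client and model name come from the API section below; the OpenAI-style usage field is an assumption, so check the response schema.

```python
import time
from iframe import Inference  # same client as the API quickstart below

client = Inference(api_key="...")

t0 = time.perf_counter()
resp = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Write 200 words about GPUs."}],
)
elapsed = time.perf_counter() - t0

# Assumes the OpenAI-compatible `usage` field is populated (an assumption here).
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tok/s")
```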
Supported models
A curated open-source catalog. Plus your own.
We test, quantize, calibrate, and pin every model on this list to a known-good build. Bring your own and we'll do the same for it.
| Family | Sizes | Max context | Modality |
|---|---|---|---|
| Llama 3.x | 8B · 70B · 405B | 1M | Text |
| Qwen 2.5 | 7B · 14B · 72B | 1M | Text · vision |
| Mixtral | 8×7B · 8×22B | 256K | Text |
| DeepSeek | V2 · V3 · R1 | 1M | Text · reasoning |
| Mistral | Small · Large | 256K | Text |
| Custom | Bring your own | Per-model | Per-model |
API
OpenAI-compatible. Drop-in replacement.
If you have OpenAI SDK code today, change the base URL and the model. Everything else works.
```python
from iframe import Inference

client = Inference(api_key="...")

response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response.choices[0].message.content)
```

SDKs: Python, Node, Go, Rust. The OpenAI SDK works unmodified; point it at api.iframe.ai.
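Equivalently, with the stock OpenAI SDK. The exact base-URL path ("/v1") is an assumption here; check the API docs for the canonical value.

```python
from openai import OpenAI

# Drop-in: keep your existing OpenAI code, swap the base URL and model.
client = OpenAI(base_url="https://api.iframe.ai/v1", api_key="...")

response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response.choices[0].message.content)
```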
Endpoints: chat completions, completions, embeddings, fine-tunes, batch.
Streaming: SSE and WebSocket. First-token latency is the latency you optimize for.
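A sketch of measuring that first-token latency over a stream, assuming stream=True follows the OpenAI SDK's SSE semantics (an assumption, since the page doesn't show the streaming call):

```python
import time
from iframe import Inference

client = Inference(api_key="...")

t0 = time.perf_counter()
first_token = None
for chunk in client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Stream me a haiku."}],
    stream=True,  # assumed OpenAI-style SSE chunks
):
    delta = chunk.choices[0].delta.content or ""
    if first_token is None and delta:
        first_token = time.perf_counter() - t0  # time to first token
    print(delta, end="", flush=True)

print(f"\nfirst token after {first_token:.3f}s")
```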
Long context
Million tokens on standard hardware.
iFrame's long-context engine handles million-token-class inputs on the same SKUs that serve everyday chat. Patent registered. Tested at 1B+ tokens.
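In practice, a million-token request is just a very large message. A sketch under stated assumptions: the file name and question are placeholders, and a million tokens is roughly 4 MB of English text.

```python
from iframe import Inference

client = Inference(api_key="...")

# "corpus.txt" is a placeholder for your own million-token-class document.
with open("corpus.txt") as f:
    document = f.read()

response = client.chat.completions.create(
    model="llama-3.1-70b",  # the catalog above lists 1M max context for Llama 3.x
    messages=[
        {"role": "system", "content": "Answer using only the provided document."},
        {"role": "user", "content": document + "\n\nQ: Summarize the key findings."},
    ],
)
print(response.choices[0].message.content)
```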
FAQ
Inference questions.
Run the API. Or run a fine-tune. Or both.
Sign up free, drop in your OpenAI SDK, and ship today. Talk to sales for dedicated capacity and custom models.