
Inference pricing

Per-token pricing across the open-source catalog.

Input and output tokens are priced separately. Volume discounts kick in automatically. The methodology behind the 20× throughput claim is documented at /research/inference-acceleration.

Pricing table

By model and tier.

Model            Input $/M tokens   Output $/M tokens   Max context   Regions
Llama 3.1 8B     $0.05              $0.10               1M            All 7
Llama 3.1 70B    $0.32              $0.55               1M            All 7
Llama 3.1 405B   $1.85              $3.40               1M            4
Qwen 2.5 7B      $0.05              $0.10               1M            All 7
Qwen 2.5 72B     $0.32              $0.55               1M            All 7
Mixtral 8×7B     $0.18              $0.32               32K           All 7
Mixtral 8×22B    $0.55              $0.85               256K          All 7
DeepSeek V3      $0.45              $0.85               1M            All 7
DeepSeek R1      $0.55              $2.10               256K          5
Mistral Large    $0.65              $1.40               256K          All 7
Custom models    Quote              Quote               Per-model     Per-model

Refreshed weekly. Volume discounts apply automatically once spend crosses each tier's monthly threshold.
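As an example of reading the table, here is a minimal pre-discount cost estimate in Python. The rates are taken from the table above; the function name `estimate_cost` is illustrative, not part of any SDK:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_per_m: float, out_per_m: float) -> float:
    """Undiscounted USD cost: tokens are billed per million, input and output separately."""
    return (input_tokens / 1_000_000) * in_per_m + (output_tokens / 1_000_000) * out_per_m

# A workload of 2M input and 0.5M output tokens on Llama 3.1 70B
# ($0.32 / $0.55 per M from the table):
cost = estimate_cost(2_000_000, 500_000, in_per_m=0.32, out_per_m=0.55)
print(f"${cost:.3f}")  # $0.915
```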

Volume discounts

Automatic. No negotiation.

$1K/mo: 5% off
$10K/mo: 12% off
$100K/mo: 20% off
$1M+/mo: Custom
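The tiers above can be sketched as a simple threshold lookup (thresholds and rates from this page; the function name `volume_discount` is illustrative):

```python
def volume_discount(monthly_spend_usd: float):
    """Discount rate by monthly spend, per the published tiers.

    Returns None for the $1M+/mo tier, which is negotiated individually.
    """
    if monthly_spend_usd >= 1_000_000:
        return None  # Custom: contact sales
    if monthly_spend_usd >= 100_000:
        return 0.20
    if monthly_spend_usd >= 10_000:
        return 0.12
    if monthly_spend_usd >= 1_000:
        return 0.05
    return 0.0

print(volume_discount(12_000))  # 0.12
```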

Self-hosted vs serverless

Two ways to run.

Most teams start serverless. High-throughput, latency-sensitive, or compliance-bound workloads run dedicated.

Serverless

Per-token, shared capacity.

The pricing on this page. No infra to manage. Routes to nearest healthy region by default.

Dedicated

Your model. Your GPUs.

Reserved capacity for your own model on your own GPUs without sharing. Custom SLAs and region pinning.


API

OpenAI-compatible. One line to switch.

quickstart.py
from iframe import Inference

client = Inference(api_key="...")

response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Hello, world."}],
)
print(response.choices[0].message.content)

Point an existing OpenAI SDK at api.iframe.ai and pass an iFrame model identifier. The rest of your code is unchanged.

Streaming, tool use, and structured output all work; the API surface mirrors OpenAI's, with iFrame-specific extensions namespaced under iframe_*.

FAQ

Inference pricing questions.

Drop in your OpenAI SDK. Save on every token.

Sign up free, swap the base URL, ship today.