
Inference pricing

Per-token pricing across the open-source catalog.

Input and output tokens are priced separately. Volume discounts kick in automatically. The methodology behind the 20× throughput claim is documented at /research/inference-acceleration.

Pricing table

By model and tier.

Model            Input $/M tokens   Output $/M tokens   Max context   Regions
Llama 3.1 8B     $0.05              $0.10               1M            All 7
Llama 3.1 70B    $0.32              $0.55               1M            All 7
Llama 3.1 405B   $1.85              $3.40               1M            4
Qwen 2.5 7B      $0.05              $0.10               1M            All 7
Qwen 2.5 72B     $0.32              $0.55               1M            All 7
Mixtral 8×7B     $0.18              $0.32               32K           All 7
Mixtral 8×22B    $0.55              $0.85               256K          All 7
DeepSeek V3      $0.45              $0.85               1M            All 7
DeepSeek R1      $0.55              $2.10               256K          5
Mistral Large    $0.65              $1.40               256K          All 7
Custom models    Quote              Quote               Per-model     Per-model

Refreshed weekly. Volume discounts apply automatically once spend crosses each tier's monthly threshold.
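As an example of reading the table, here is a minimal pre-discount cost estimate in Python. The rates are taken from the table above; the function name `estimate_cost` is illustrative, not part of any SDK:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_per_m: float, out_per_m: float) -> float:
    """Undiscounted USD cost: tokens are billed per million, input and output separately."""
    return (input_tokens / 1_000_000) * in_per_m + (output_tokens / 1_000_000) * out_per_m

# A workload of 2M input and 0.5M output tokens on Llama 3.1 70B
# ($0.32 / $0.55 per M from the table):
cost = estimate_cost(2_000_000, 500_000, in_per_m=0.32, out_per_m=0.55)
print(f"${cost:.3f}")  # $0.915
```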

Volume discounts

Automatic. No negotiation.

$1K/mo: 5% off
$10K/mo: 12% off
$100K/mo: 20% off
$1M+/mo: Custom
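The tiers above can be sketched as a simple threshold lookup (thresholds and rates from this page; the function name `volume_discount` is illustrative):

```python
def volume_discount(monthly_spend_usd: float):
    """Discount rate by monthly spend, per the published tiers.

    Returns None for the $1M+/mo tier, which is negotiated individually.
    """
    if monthly_spend_usd >= 1_000_000:
        return None  # Custom: contact sales
    if monthly_spend_usd >= 100_000:
        return 0.20
    if monthly_spend_usd >= 10_000:
        return 0.12
    if monthly_spend_usd >= 1_000:
        return 0.05
    return 0.0

print(volume_discount(12_000))  # 0.12
```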

Self-hosted vs serverless

Two ways to run.

Most teams start serverless. High-throughput, latency-sensitive, or compliance-bound workloads run dedicated.

Serverless

Per-token, shared capacity.

The pricing on this page. No infra to manage. Routes to nearest healthy region by default.

Dedicated

Your model. Your GPUs.

Reserved capacity for your own model on your own GPUs without sharing. Custom SLAs and region pinning.


API

OpenAI-compatible. One line to switch.

quickstart.py
from iframe import Inference

client = Inference(api_key="...")

response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Hello, world."}],
)
print(response.choices[0].message.content)

Point an existing OpenAI SDK at api.iframe.ai and pass an iFrame model identifier. The rest of your code is unchanged.

Streaming, tool use, and structured output all work; the API surface mirrors OpenAI's, with iFrame-specific extensions namespaced under iframe_*.

FAQ

Inference pricing questions.

Drop in your OpenAI SDK. Save on every token.

Sign up free, swap the base URL, ship today.