Runtime
Inference Acceleration Engineer
Own end-to-end performance for one or more model families on our managed inference runtime. We are 10–20× faster than hyperscaler inference today; the bar is to widen that gap on every release.
The team
About the team
The runtime team is the engineering side of the research lab. Six engineers today. We write the kernels, the scheduler, and the serving layer for managed inference. The runtime ships into customer production weekly.
Reports to the head of runtime. One on-call rotation shared across runtime, cluster SRE, and customer engineering.
The role
What you'll do
Profile real customer workloads with Nsight, rocprof, and our internal tracer; turn the bottleneck into a paper-quality benchmark and a merged kernel.
Write fused, attention-aware kernels in CUDA and Triton (and ROCm/HIP for AMD MI300X) targeting H100, H200, B200, and B300.
Own one model family end-to-end: tokenizer through KV cache through speculative decoding, on every supported accelerator.
Design and ship the next round of long-context primitives — paged KV, ring attention, sliding-window cache eviction (a minimal sketch follows this list).
Co-author at least one external write-up per quarter: a blog post, a paper, or a kernel released to the open ecosystem.
Carry the runtime on-call rotation alongside cluster SRE and customer engineering — about one week per six.
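To make the sliding-window item concrete, here is a minimal single-sequence sketch: fixed-size KV blocks and the eviction rule that frees any block entirely behind the attention window. `PagedKVCache`, `BLOCK_TOKENS`, and `WINDOW_TOKENS` are illustrative names for this example, not our runtime's API; real paged KV also carries per-layer block tables, which this omits.

```python
from collections import deque

BLOCK_TOKENS = 16      # tokens per physical KV block (illustrative size)
WINDOW_TOKENS = 4096   # sliding attention window, in tokens

class PagedKVCache:
    """Per-sequence block list; frees any block that has fallen
    entirely outside the attention window."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = deque()           # resident blocks, oldest first
        self.evicted = 0                # blocks already returned to the pool
        self.seq_len = 0                # tokens written so far

    def append_token(self):
        # Grab a fresh block whenever the current one fills up.
        if self.seq_len % BLOCK_TOKENS == 0:
            self.blocks.append(self.free_blocks.pop())
        self.seq_len += 1
        self._evict_out_of_window()

    def _evict_out_of_window(self):
        # The oldest resident block covers tokens
        # [evicted * BLOCK_TOKENS, (evicted + 1) * BLOCK_TOKENS);
        # it is safe to free once that whole range precedes the window.
        window_start = max(0, self.seq_len - WINDOW_TOKENS)
        while self.blocks and (self.evicted + 1) * BLOCK_TOKENS <= window_start:
            self.free_blocks.append(self.blocks.popleft())
            self.evicted += 1

pool = list(range(1024))   # 1024 physical blocks shared across sequences
cache = PagedKVCache(pool)
for _ in range(8192):      # decode 8192 tokens
    cache.append_token()
assert len(cache.blocks) == WINDOW_TOKENS // BLOCK_TOKENS
```

The interesting work is everything this sketch hides: making the same rule hold under ring attention and batched decode without stalling the scheduler.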
The bar
What we're looking for
Five-plus years of systems-level performance work, with at least two of those years deep in GPU kernels.
Strong CUDA and at least one of Triton, CUTLASS, or HIP. Comfortable reading PTX and SASS when the profile demands it.
Track record of measurable speed-ups landed in production or in widely used open source — bring a PR or a perf graph.
Comfort with transformer internals: attention variants, KV cache shapes, quantization (FP8, INT4 / NF4), speculative decoding.
Empirical instincts. You design experiments, hold variables fixed, and write down the numbers before the next change.
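That last bullet fits in one screen of code. A minimal sketch of the discipline we mean, assuming PyTorch and CUDA events as the clock; the `bench` helper and the matmul workload are illustrative, not our internal tracer.

```python
import torch

def bench(fn, *args, warmup=10, iters=100):
    """Median time of one call in milliseconds, variables held fixed:
    same inputs, same stream, warmup discarded, CUDA events as the clock."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()  # wait for the kernel before reading events
        times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# Fix the shape and dtype, write the number down, then change one thing.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(f"fp16 4096x4096 matmul: {bench(torch.matmul, a, b):.3f} ms median")
```

Median rather than mean, because a single clock-frequency excursion should not move your baseline.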
Bonus
Nice to have, not required
Published kernels in vLLM, SGLang, TensorRT-LLM, FlashAttention, or similar.
Experience tuning collectives (NCCL / RCCL) and topology-aware scheduling.
AMD MI300X / MI325X experience — most candidates do not have this; we will weight it heavily.
Prior research experience or co-authorship on systems / ML papers.
Compensation
In writing, like everything else
We publish bands. We meet them. The number you see on the offer is the same number your future peers got at the same level. We do not negotiate; we level.
$220,000 – $360,000 USD (US), depending on level (IC4 / IC5 / IC6).
Meaningful early-stage equity, refreshed on tenure milestones, not review cycles.
Pay bands are published internally and shared on the manager call.
How to apply
One email is enough
Send a short note to careers@iframe.ai with the role title in the subject line. Include your CV or LinkedIn, one or two links to work you're proud of, and a sentence on why this role specifically. Hiring managers reply within five business days, regardless of outcome.
- 01
Application
A hiring manager reads every email and replies within five business days.
- 02
Manager call
30–45 minutes. Scope, role, mutual fit. We share the comp band on this call.
- 03
Technical loop
3–4 sessions on the same day. Real problems, no homework, no whiteboard riddles.
- 04
Offer
Same-week offer at the published band for your level. Start dates are flexible.
Also open
Other roles you might consider
- Research lab
Distributed Training Researcher
Lead one paper per quarter on multi-thousand-GPU pre-training. Co-appointment with a partner university available.
- Cluster & SRE
Cluster Site Reliability Engineer
Bring up B300 racks, drive InfiniBand fabric to spec, run capacity planning across seven regions. Pager included.
- Customer engineering
Customer Cluster Engineer
Embedded with a small portfolio of reserved-tier accounts. Distributed training perf, NCCL, kernel tuning.
One last thing
If this role isn't quite right but you'd be a fit at iframe.ai, write anyway.
Senior engineers and researchers can apply outside the listed roles. The bar is the same. The reply window is the same.