Runtime
Inference Acceleration Engineer
Own end-to-end performance for one or more model families on our managed inference runtime. We are 10–20× faster than hyperscaler inference today; the bar is to widen that gap on every release.
The team
About the team
The runtime team is the engineering side of the research lab. Six engineers today. We write the kernels, the scheduler, and the serving layer for managed inference. The runtime ships into customer production weekly.
Reports to the head of runtime. One on-call rotation shared across runtime, cluster SRE, and customer engineering.
The role
What you'll do
Profile real customer workloads with Nsight, rocprof, and our internal tracer; turn the bottleneck into a paper-quality benchmark and a merged kernel.
Write fused, attention-aware kernels in CUDA and Triton (and ROCm/HIP for AMD MI300X) targeting H100, H200, B200, and B300.
Own one model family end-to-end: tokenizer through KV cache through speculative decoding, on every supported accelerator.
Design and ship the next round of long-context primitives — paged KV, ring attention, sliding-window cache eviction (a minimal sketch follows this list).
Co-author at least one external write-up per quarter: a blog post, a paper, or a kernel released to the open ecosystem.
Carry the runtime on-call rotation alongside cluster SRE and customer engineering — about one week per six.
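To make the sliding-window item concrete, here is a minimal single-sequence sketch: fixed-size KV blocks and the eviction rule that frees any block entirely behind the attention window. `PagedKVCache`, `BLOCK_TOKENS`, and `WINDOW_TOKENS` are illustrative names for this example, not our runtime's API; real paged KV also carries per-layer block tables, which this omits.

```python
from collections import deque

BLOCK_TOKENS = 16      # tokens per physical KV block (illustrative size)
WINDOW_TOKENS = 4096   # sliding attention window, in tokens

class PagedKVCache:
    """Per-sequence block list; frees any block that has fallen
    entirely outside the attention window."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = deque()           # resident blocks, oldest first
        self.evicted = 0                # blocks already returned to the pool
        self.seq_len = 0                # tokens written so far

    def append_token(self):
        # Grab a fresh block whenever the current one fills up.
        if self.seq_len % BLOCK_TOKENS == 0:
            self.blocks.append(self.free_blocks.pop())
        self.seq_len += 1
        self._evict_out_of_window()

    def _evict_out_of_window(self):
        # The oldest resident block covers tokens
        # [evicted * BLOCK_TOKENS, (evicted + 1) * BLOCK_TOKENS);
        # it is safe to free once that whole range precedes the window.
        window_start = max(0, self.seq_len - WINDOW_TOKENS)
        while self.blocks and (self.evicted + 1) * BLOCK_TOKENS <= window_start:
            self.free_blocks.append(self.blocks.popleft())
            self.evicted += 1

pool = list(range(1024))   # 1024 physical blocks shared across sequences
cache = PagedKVCache(pool)
for _ in range(8192):      # decode 8192 tokens
    cache.append_token()
assert len(cache.blocks) == WINDOW_TOKENS // BLOCK_TOKENS
```

The interesting work is everything this sketch hides: making the same rule hold under ring attention and batched decode without stalling the scheduler.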
The bar
What we're looking for
Five-plus years of systems-level performance work, with at least two of those years deep in GPU kernels.
Strong CUDA and at least one of Triton, CUTLASS, or HIP. Comfortable reading PTX and SASS when the profile demands it.
Track record of measurable speed-ups landed in production or in widely used open source — bring a PR or a perf graph.
Comfort with transformer internals: attention variants, KV cache shapes, quantization (FP8, INT4 / NF4), speculative decoding.
Empirical instincts. You design experiments, hold variables fixed, and write down the numbers before the next change.
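That last bullet fits in one screen of code. A minimal sketch of the discipline we mean, assuming PyTorch and CUDA events as the clock; the `bench` helper and the matmul workload are illustrative, not our internal tracer.

```python
import torch

def bench(fn, *args, warmup=10, iters=100):
    """Median time of one call in milliseconds, variables held fixed:
    same inputs, same stream, warmup discarded, CUDA events as the clock."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()  # wait for the kernel before reading events
        times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# Fix the shape and dtype, write the number down, then change one thing.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
print(f"fp16 4096x4096 matmul: {bench(torch.matmul, a, b):.3f} ms median")
```

Median rather than mean, because a single clock-frequency excursion should not move your baseline.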
Bonus
Nice to have, not required
Published kernels in vLLM, SGLang, TensorRT-LLM, FlashAttention, or similar.
Experience tuning collectives (NCCL / RCCL) and topology-aware scheduling.
AMD MI300X / MI325X experience — most candidates do not have this; we will weight it heavily.
Prior research experience or co-authorship on systems / ML papers.
Compensation
In writing, like everything else
We publish bands. We meet them. The number you see on the offer is the same number your future peers got at the same level. We do not negotiate; we level.
$220,000 – $360,000 USD (US), depending on level (IC4 / IC5 / IC6).
Meaningful early-stage equity, refreshed on tenure milestones, not review cycles.
Pay bands are published internally and shared on the manager call.
How to apply
One email is enough
Send a short note to careers@iframe.ai with the role title in the subject line. Include your CV or LinkedIn, one or two links to work you're proud of, and a sentence on why this role specifically. Hiring managers reply within five business days, regardless of outcome.
- 01
Application
A hiring manager reads every email and replies within five business days.
- 02
Manager call
30–45 minutes. Scope, role, mutual fit. We share the comp band on this call.
- 03
Technical loop
3–4 sessions on the same day. Real problems, no homework, no whiteboard riddles.
- 04
Offer
Same-week offer at the published band for your level. Start dates are flexible.
Also open
Other roles you might consider
- Research lab
Distributed Training Researcher
Lead one paper per quarter on multi-thousand-GPU pre-training. Co-appointment with a partner university available.
- Cluster & SRE
Cluster Site Reliability Engineer
Bring up B300 racks, drive InfiniBand fabric to spec, run capacity planning across seven regions. Pager included.
- Customer engineering
Customer Cluster Engineer
Embedded with a small portfolio of reserved-tier accounts. Distributed training perf, NCCL, kernel tuning.
One last thing
If this role isn't quite right but you'd be a fit at iframe.ai, write anyway.
Senior engineers and researchers can apply outside the listed roles. The bar is the same. The reply window is the same.