Cluster & SRE

Cluster Site Reliability Engineer

Own the physical reality of the platform in one of our seven regions. You bring up new GPU racks, validate the InfiniBand fabric end-to-end, and keep the cluster running within its SLA. This is a hands-on role; you will see the hardware.

The team

About the team

Cluster SRE is six engineers across the seven regions. Each region has a primary and a secondary; you will be one of those for your region. The team coordinates daily, deploys weekly, and rotates a global pager.

Reports to the head of cluster engineering. Primary on a single region; rotates secondary for one neighboring region.

The role

What you'll do

Bring up new B200 / B300 / MI300X racks: cabling, ToR config, NCCL/RCCL all-reduce validation, MFU baseline tests.
Drive the InfiniBand fabric to spec (NDR or XDR, depending on the rack) and chase residual bit errors down to zero.
Run capacity planning across seven regions: forecast demand, model power and thermal headroom, work with procurement on lead times.
Own the regional incident response. P1 incidents page within fifteen minutes; resolution target is four hours.
Build and maintain the bring-up runbook so the second hire after you can do their first rack solo.
Carry the global pager about one week per six, alongside runtime and customer engineering.

The bar

What we're looking for

Five-plus years operating large compute clusters — supercomputing centers, hyperscaler infra, or HPC at a national lab count.
Deep InfiniBand and Ethernet RoCE experience: subnet manager tuning, fabric debugging, lossless networking.
Strong Linux fundamentals: kernel parameters, NUMA topology, cgroup limits.
Comfort writing Go or Python for tooling. We are not strict about which.
Calm under load. You will be the named person on a $50M-ARR account when something goes wrong.

Bonus

Nice to have, not required

DGX H100 / H200 / B200 bring-up history.
Experience with NVIDIA Bright / Base Command Manager.
Bare-metal provisioning systems: Tinkerbell, MAAS, Razor, or in-house equivalents.
Procurement / vendor management experience.

Compensation

In writing, like everything else

We publish bands. We meet them. The number you see on the offer is the same number your future peers got at the same level. We do not negotiate; we level.

Base

$210,000 – $340,000 USD (US Cologix regions) / equivalent in CA.

Equity

Meaningful early-stage equity, refreshed on tenure milestones.

Notes

Roles on site at Cologix regions outside the Bay Area and NYC carry a +5–10% pay differential to compensate for travel.

How to apply

One email is enough

Send a short note to careers@iframe.ai with the role title in the subject line. Include your CV or LinkedIn, one or two links to work you're proud of, and a sentence on why this role specifically. Hiring managers reply within five business days, regardless of outcome.

01 Application
A hiring manager reads every email. Reply within five business days.
02 Manager call
30–45 minutes. Scope, role, mutual fit. We share the comp band on this call.
03 Technical loop
3–4 sessions on the same day. Real problems, no homework, no whiteboard riddles.
04 Offer
Same-week offer at the published band for your level. Start dates are flexible.

Also open

Other roles you might consider

Inference Acceleration Engineer
Distributed Training Researcher
Customer Cluster Engineer

One last thing

If this role isn't quite right but you'd be a fit at iframe.ai, write anyway.

Senior engineers and researchers can apply outside the listed roles. The bar is the same. The reply window is the same.