Inference Engineer
San Francisco, CA · In-person · Full-time
We run vision models over every frame of video, in real time. That only works if the inference stack is unreasonably fast. We need someone who finds that problem fun.
What we need from you
You're the kind of engineer who reads a model architecture and immediately sees where it'll stall. You think about memory bandwidth the way most people think about lunch. You've written CUDA kernels that actually shipped, and you know the difference between a benchmark win and a production win.
Our customers deploy on their own hardware, which means we can't just optimize for one target. NVIDIA GPUs are the primary focus, but Apple Silicon, Trainium, and TPUs all matter. Deep knowledge of one or two is plenty to start.
You should probably apply if:
- You've made production inference workloads meaningfully faster, not just turned a benchmark green
- You know why your CUDA kernel is slow before you open Nsight
- You've used TensorRT, Core ML, XLA, or Neuron SDK in anger, and you can pick up whichever one you haven't
- You have opinions about when to fuse kernels and when not to
- You've shipped things you had to support on hardware you didn't choose
You should definitely not apply if:
- Your optimization strategy is “quantize it and ship it”
- You want to own a single platform and never touch anything else
- Remote work is non-negotiable (we're in-person in San Francisco because building hard things is a team sport)
What you'll actually do
- Own Moondream's inference performance end-to-end, from profiling to custom kernels to production deployment
- Build optimized backends across GPU, Apple Silicon, Trainium, and TPU targets
- Work with the research team so new architectures are fast by design, not as an afterthought
- Probably get into a heated argument about whether a 3% throughput regression is worth a cleaner abstraction (it isn't)
Details
- Location: San Francisco, CA (in-person)
- Stack: CUDA, C++, Python, and platform-specific toolchains (TensorRT, Core ML, XLA, Neuron SDK) as needed
- Compensation: $230k–$300k + meaningful equity
- Benefits: Health, dental, vision, paid parental leave, relocation support
Apply
Send your resume to hiring@moondream.ai