Regional hiring • Published • External employer
Black Forest Labs • Generative AI

Member of Technical Staff - Infrastructure Engineer

Location

Freiburg, Germany

Work type

Hybrid

Employment

Full Time

Experience

5-10 years

Compensation

$150K - $300K per year

Posted

13h ago

Summary and responsibilities

Role overview

Summary

This role involves building and maintaining the core infrastructure that powers Black Forest Labs' generative AI models. The engineer will scale and optimize compute clusters, design research platforms, and ensure the reliability and performance of large-scale distributed systems. Key responsibilities include collaborating with research teams, resolving performance bottlenecks, and evolving telemetry and monitoring systems.

We're looking for engineers to build and maintain the engine that powers our mission to develop visual intelligence. From maintaining and scaling clusters to building research platforms that accelerate the rate of innovation, this team works across great breadth and depth. We build the systems that make training runs lasting multiple weeks or months possible and that orchestrate resources both at scale and efficiently, enabling the next breakthrough model. If you’re obsessed with distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement, this team would be perfect for you.

What You’ll Work On

  • Maintain research infrastructure, ensure its health, and optimize components to extract peak performance from the system (on both the application and infrastructure sides).

  • Scale infrastructure to meet growing research demands while maintaining reliability and performance.

  • Collaborate with research teams to deeply understand their infrastructure needs, and design solutions that balance performance with cost efficiency.

  • Identify and resolve performance bottlenecks and capacity hotspots through deep analysis of distributed systems at scale.

  • Build and evolve telemetry and monitoring systems to provide deep visibility into infrastructure performance, utilization, and costs across our cloud and datacenter fleets.

  • Participate in on-call rotations and incident response to maintain system reliability.

Technical Focus

  • Python, Bash, Go

  • Kubernetes

  • NVIDIA GPU drivers and operators

  • OpenTelemetry (OTel), Prometheus

What We’re Looking For

  • Experience building or operating large-scale training platforms

  • Experience working with large-scale GPU compute clusters

  • Proven ability to debug performance and reliability issues across large distributed fleets

  • Strong problem-solving skills and ability to work independently

  • Strong communication skills and the ability to work effectively with both internal and external partners

  • Deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP

  • Experience with SLURM

How We Work Together

We’re a distributed team with real offices that people actually use. Depending on your role, you’ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We’ll cover reasonable travel costs to make this possible. We think in-person time matters, and we’ve structured things to make it accessible to all. We’ll discuss what this will look like for the role during our interview process.

Everything we do is grounded in four values:

  • Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.

  • Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.

  • Bold. We take the ambitious bet. We ship, we do not wait for conditions to be perfect.

  • Kind. People over politics. We treat each other with genuine warmth. Agency without empathy creates chaos.

If this sounds like work you’d enjoy, we’d love to hear from you.

Updated 13h ago

Candidate fit

Skills and qualifications

Additional skills

Python • 1+ yrs
Kubernetes • 1+ yrs
Nvidia GPU drivers • 1+ yrs
Prometheus • 1+ yrs
Distributed Systems • 1+ yrs
Infrastructure as Code • 1+ yrs
AWS • 1+ yrs
GCP • 1+ yrs

Experience

5-10 years

How this role is positioned

Role classification

Job domains

Software Engineering

Industries

Technology & IT

Employment

Full Time

Contract duration

Permanent

Hiring type

Direct

Global hiring

Location specific

Offer details

Compensation and benefits

Compensation

$150K - $300K per year

Visibility: Shared on listing
Currency: USD
Period: Yearly

Location, schedule, and role shape

Work setup

Work conditions

Primary location: Freiburg, Germany
Work type: Hybrid
Global hiring: No

Bandwidth profile

People: Medium (7/10)
Physical: Low (2/10)
Cognitive: High (9/10)
Execution: High (9/10)
Creativity: Medium (7/10)
Uncertainty: Medium (7/10)
Communication: High (8/10)

Context on the employer

Company snapshot

Company

Black Forest Labs

Team size

Growing team

Location

Freiburg, Germany

We’re the team behind Latent Diffusion, Stable Diffusion, and FLUX—foundational technologies that changed how the world creates images and video. We’re creating the generative models that power how people make images and video—tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we’re just getting started.
