Regional hiring • Published • External employer
NVIDIA
Semiconductor

Senior HPC AI Cluster Engineer

Location

France

Work type

Remote

Employment

Full Time

Experience

8+ years

Compensation

Compensation not disclosed

Posted

2d ago

Summary and responsibilities

Role overview

Summary

Design, implement, and maintain large-scale HPC/AI clusters, including monitoring, logging, and alerting. Manage Linux job/workload schedulers, develop CI/CD pipelines, and automate infrastructure deployment and management. Troubleshoot issues from bare metal to application level and document best practices.

NVIDIA is looking for an experienced HPC-AI Engineer to join the Networking Clusters Solutions Infrastructure team. We are focused on building supercomputers and AI clusters based on groundbreaking technologies. We are looking for an outstanding engineer to be a key player in the most exciting computing hardware and software and to contribute to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers, to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.

What you will be doing:

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting

  • Manage Linux job/workload schedulers and orchestration tools

  • Develop and maintain continuous integration and delivery pipelines

  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources

  • Deploy monitoring solutions for the servers, network and storage

  • Troubleshoot from the bottom up, across bare metal, operating system, software stack, and application level

  • Act as a technical resource: develop, redefine, and document standard practices to share with internal teams

  • Support Research & Development activities and engage in POCs/POVs for future improvements

What we need to see:

  • A degree in Computer Science, Engineering, or a related field and 8+ years of experience

  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software

  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)

  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) internals and networking (sockets, firewalld, iptables, Wireshark, etc.), ACLs and OS-level security protection, and common protocols such as TCP, DHCP, and DNS

  • Experience with multiple storage solutions such as Lustre, GPFS, Weka.io. Familiarity with newer and emerging storage technologies.

  • Python programming and bash scripting experience.

  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef

  • Deep knowledge of networking protocols such as InfiniBand and Ethernet

  • Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)

  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)

Ways to stand out from the crowd:

  • Knowledge of CPU and/or GPU architecture

  • Knowledge of Kubernetes and container-related microservice technologies

  • Experience with GPU-focused hardware/software (DGX, CUDA)

  • Experience with RDMA (InfiniBand or RoCE) fabrics

Updated 2d ago

Candidate fit

Skills and qualifications

Additional skills

HPC • 1+ yrs
AI • 1+ yrs
GPU Computing • 1+ yrs
Deep Learning • 1+ yrs
Linux • 1+ yrs
Networking • 1+ yrs
Troubleshooting • 1+ yrs
System Design • 1+ yrs
Monitoring • 1+ yrs
Logging • 1+ yrs
Alerting • 1+ yrs
Job Scheduling • 1+ yrs
Workload Orchestration • 1+ yrs
CI/CD • 1+ yrs
Automation • 1+ yrs
Configuration Management • 1+ yrs
Slurm • 1+ yrs
Kubernetes • 1+ yrs
Windows • 1+ yrs
Red Hat/CentOS • 1+ yrs
Ubuntu • 1+ yrs
Sockets • 1+ yrs
Firewalld • 1+ yrs
Iptables • 1+ yrs
Wireshark • 1+ yrs
ACLs • 1+ yrs
OS Level Security • 1+ yrs
TCP • 1+ yrs
DHCP • 1+ yrs
DNS • 1+ yrs
Lustre • 1+ yrs
GPFS • 1+ yrs
Weka.io • 1+ yrs
Python • 1+ yrs
Bash Scripting • 1+ yrs
Jenkins • 1+ yrs
Ansible • 1+ yrs
Puppet • 1+ yrs
Chef • 1+ yrs
InfiniBand • 1+ yrs
Ethernet • 1+ yrs
Virtual Systems • 1+ yrs
VMware • 1+ yrs
Hyper-V • 1+ yrs
KVM • 1+ yrs
Citrix • 1+ yrs
Cloud Computing • 1+ yrs
AWS • 1+ yrs
Azure • 1+ yrs
Google Cloud • 1+ yrs
CPU Architecture • 1+ yrs
GPU Architecture • 1+ yrs
Container Technologies • 1+ yrs
Microservices • 1+ yrs
DGX • 1+ yrs
CUDA • 1+ yrs
RDMA • 1+ yrs
RoCE • 1+ yrs

How this role is positioned

Role classification

Job domains

Software Engineering

Industries

Technology & IT

Employment

Full Time

Contract duration

Permanent

Hiring type

Direct

Global hiring

Location specific

Offer details

Compensation and benefits

Compensation

Compensation not disclosed

Visibility: Shared on listing
Currency: USD
Period: Yearly

Location, schedule, and role shape

Work setup

Work conditions

Primary location: France
Work type: Remote
Global hiring: No

Bandwidth profile

People: Medium (7/10)
Physical: Low (2/10)
Cognitive: High (9/10)
Execution: High (9/10)
Creativity: Medium (7/10)
Uncertainty: Medium (7/10)
Communication: High (8/10)

Context on the employer

Company snapshot

Company

NVIDIA

Team size

Growing team

Location

France

NVIDIA is a technology company focused on building supercomputers and AI clusters based on groundbreaking technologies, contributing to the latest breakthroughs in artificial intelligence and GPU computing.
