L2 Datacenter Support Engineer

Full-time

Mirantis

Mirantis helps organizations ship code faster on public and private clouds. The company provides a public cloud experience on any infrastructure from the data center to the edge. With Lens and the Mirantis Cloud Native Platform, Mirantis empowers a new breed of Kubernetes developers by removing infrastructure and operations complexity and providing one cohesive cloud experience for complete app and devops portability, a single pane of glass, and automated full-stack lifecycle management with continuous updates.

Mirantis serves many of the world’s leading enterprises, including Adobe, DocuSign, Liberty Mutual, PayPal, Reliance Jio, Societe Generale, Splunk, and Volkswagen. Learn more at .

Job Description

We are looking for an experienced L2 Engineer to operate and support high-performance AI infrastructure platforms, including NVIDIA GPU clusters, InfiniBand fabrics, and Kubernetes-based IaaS environments.

This role focuses on deep infrastructure expertise, ensuring performance, scalability, and reliability of the platform layer that powers AI workloads — without being responsible for the workloads themselves.

You will play a key role in bare metal lifecycle management, advanced InfiniBand troubleshooting, and platform stability, working closely with engineering teams to operate cutting-edge infrastructure at scale.

Key responsibilities:

Troubleshoot and maintain InfiniBand fabrics, including performance tuning, link issues, and topology validation.
Act as the escalation point for L1 for complex infrastructure and hardware issues.
Own and maintain accurate infrastructure modeling, IPAM, and source-of-truth data in NetBox.
Own InfiniBand fabric management and advanced troubleshooting, utilizing Verity for configuration, monitoring, and optimization of high-performance interconnects.
Diagnose and resolve issues across GPU servers, networking, storage, and Kubernetes platforms.
Perform deep hardware and system-level diagnostics (GPUs, PCIe, NICs, firmware, etc.).
Support Kubernetes platform stability (node health, networking, scheduling issues).
Contribute to automation of provisioning and operational workflows.
Lead incident response, root cause analysis (RCA), and post-incident improvements.
Collaborate with vendors and internal engineering teams on complex issues.
Support infrastructure upgrades, firmware management, and capacity expansion.

Qualifications

Required Skills & Experience:

3–6+ years of experience in infrastructure operations, datacenter engineering, or cloud platforms.
Strong Linux systems expertise.
Hands-on experience with bare metal provisioning systems and lifecycle management.
Strong experience with InfiniBand networking (troubleshooting, performance, fabric management using UFM).
Experience with IPAM/DCIM tools such as NetBox, and with Ethernet network configuration and validation using Verity.
Solid understanding of datacenter networking, storage, and hardware architecture.
Working knowledge of Kubernetes in production environments.
Strong troubleshooting skills across hardware and distributed systems.

Preferred Qualifications:

Experience with NVIDIA GPU platforms and accelerated computing infrastructure.
Familiarity with automation tools (Terraform, Ansible, etc.).
Exposure to OpenStack (optional).
Experience with observability stacks (Prometheus, Grafana, ELK).

Success in this role:

Rapid resolution of complex infrastructure and networking issues.
High reliability and performance of InfiniBand and GPU infrastructure.
Scalable and efficient bare metal provisioning processes.
Strong contribution to automation and operational excellence.
Trusted escalation point and technical leader within the team.

Additional Information

We offer:

Work with an established Silicon Valley leader in the cloud infrastructure industry;
Work with exceptionally passionate, talented and engaging colleagues, helping Fortune 500 and Global 2000 customers implement next-generation cloud technologies;
Be a part of cutting-edge, open-source innovation;
Thrive in the high-energy environment of a young company where openness, collaboration, risk-taking, and continuous growth are valued;
Professional development and training;
Attend conferences and working groups;
Company outings, happy hours, hackathons, and tech talks;
Receive a competitive compensation package with a strong benefits plan.

We are a Leader for Container Management in G2 (#2 after AWS)!


Job posting added 3 days ago