Industrial AI Cloud - Infrastructure Engineer (REF5500F)

Deutsche Telekom

Budapest, Hungary

Engineering, English

Job description:

As Hungary's most attractive employer in 2025 (according to Randstad's representative survey), Deutsche Telekom IT Solutions is a subsidiary of the Deutsche Telekom Group. The company provides a wide portfolio of IT and telecommunications services with more than 5300 employees. We serve hundreds of large corporate customers in Germany and other European countries.
DT-ITS received the Best in Educational Cooperation award from HIPA in 2019 and was acknowledged as the Most Ethical Multinational Company in 2019. The company continuously develops its four sites in Budapest, Debrecen, Pécs and Szeged and is looking for skilled IT professionals to join its team.

General description/ Purpose
NVIDIA and Deutsche Telekom are jointly developing the world's first industrial AI cloud for European manufacturers. This AI factory in Germany will host 10,000 GPUs across NVIDIA DGX B200 systems and RTX Pro Servers. Deutsche Telekom provides secure, sovereign and fast infrastructure, including data centers, operations, security, and AI solutions.
Role Overview
We are seeking an Infrastructure Engineer to build, automate, and operate the compute, network, and storage environment of the Industrial AI Cloud. In this role you will provision and maintain servers, manage networking (at the server OS level) and storage, automate deployments, implement monitoring, and ensure reliable day-to-day operations of large-scale GPU clusters. You'll work with and coordinate between multiple teams to deliver and continuously improve infrastructure services following ITIL processes.
Key Responsibilities
* Coordinate Operations with Data Center Teams: Coordinate and support hardware lifecycle activities (installs, GPU upgrades, storage expansion, firmware updates) and manage server/network interconnections and related documentation (NetBox).
* Server & Node Management: Provision and maintain bare-metal servers and GPU nodes (PXE boot, OS installs, firmware updates).
* Design & Operate the NVIDIA AI-related infrastructure stack.
* Automation & IaC: Develop and maintain Ansible playbooks and Terraform configurations to automate provisioning, configuration, and deployments.
* OS & Firmware Management: Maintain Debian-based environments, apply patches, and manage firmware upgrades at scale.
* Identity & Access Management: Integrate and maintain Keycloak, Entra ID / CAIMAN, and AD for user authentication and authorization.
* Run AI & HPC Workloads: Support and operate distributed AI workloads within bare metal hosts and Kubernetes environments.
* Monitoring & Observability: Operate Prometheus and Grafana stacks for proactive infrastructure monitoring and alerting.
* Storage Administration: Manage high-performance storage environments (WEKA by Hitachi).
* ITIL Processes: Follow and improve incident, problem, and change management workflows; document runbooks and standard operating procedures. Adhere to ZERO Outage guidelines.
* Consult and provide project deliverables to fulfil the project scope, with a focus on the NVIDIA technology stack.
What We Offer
* Work on Europe's first industrial AI cloud with cutting-edge technologies.
* Direct collaboration with NVIDIA and Deutsche Telekom experts.
* Hybrid working model, training opportunities, and career progression.

Candidate requirements:

Required Skills and Qualifications
* Experience in hardware installation, maintenance, and operations.
* Advanced proficiency with Linux (Debian preferred) in production environments.
* Hands-on experience with Infrastructure-as-Code (Ansible, Terraform); Redfish desirable.
* Knowledge of NVIDIA GPU-accelerated server platforms.
* Knowledge of the NVIDIA AI software stack related to GPU orchestration.
* Knowledge of GPU-based cloud platform software stacks, including their dependencies on the underlying layers.
* Solid understanding of networking fundamentals (IP, routing, VLANs, DNS, firewalls and L1, L2).
* Experience with identity and access management systems (Keycloak, Entra ID, LDAP).
* Familiarity with monitoring stacks (Prometheus, Grafana).
* Knowledge of high-performance storage systems (WEKA by Hitachi is an advantage).
* Working knowledge of ITIL processes (incident, problem, change).
* Strong troubleshooting and operational support skills in a 24/7, mission-critical environment.
Preferred Attributes
* Experience with large GPU clusters, HPC, or data center environments.
* Knowledge of sovereign cloud and data security/compliance requirements.
* Familiarity with Terraform and GitOps for infrastructure changes.