Site Reliability Engineer (SRE) @MarshTech

MMC

Cluj-Napoca, Romania

Engineering, English

Job description:

Company:
Mercer

Description:

We are seeking a talented Site Reliability Engineer (SRE) to join our Innovation team. This role will be based in Cluj, Romania. It is a hybrid role that requires working in the office at least three days a week.

As a Site Reliability Engineer, you will play a critical role in driving our observability and reliability strategy. You will inherit and transform a fragmented observability landscape into a scalable, enterprise-grade platform. This role focuses on building deep system visibility, enabling distributed tracing across services, and improving system reliability through modern engineering practices. You will work at the intersection of infrastructure, application performance, and AI/LLM-backed systems.

We will count on you to:
* Lead the transformation of observability capabilities across the platform, moving from fragmented tooling to a cohesive, enterprise-grade ecosystem.
* Design and implement end-to-end observability solutions using OpenTelemetry (instrumentation, collectors, exporters).
* Build and enhance distributed tracing capabilities, enabling visibility across upstream and downstream services.
* Define and operationalize SLIs/SLOs to improve system reliability and performance.
* Implement and manage APM solutions (preferably Datadog) for real-time monitoring and alerting.
* Analyze system performance, identify bottlenecks, and proactively recommend improvements.
* Collaborate with engineering teams to instrument services and improve debuggability.
* Read, debug, and propose improvements to application code (Python / TypeScript).
* Develop automation scripts using Bash/shell to improve operational efficiency.
* Support and optimize Kubernetes-based infrastructure and containerized workloads.
* Contribute to and enhance CI/CD pipelines (GitHub Actions) for reliability and scalability.
* Ensure strong implementation of cloud-native architectures (AWS preferred).
* Apply networking fundamentals (DNS, load balancing, TLS) to troubleshoot and optimize systems.
* Operate and optimize LLM-backed services, focusing on latency, cost, and performance.
* Monitor AI-specific metrics such as token throughput, inference latency, and quality drift.
* Support systems leveraging vector databases and embedding pipelines.
* Own reliability and performance outcomes for critical systems and services.
* Lead incident analysis and postmortems, focusing on learning and system improvement rather than blame.
* Partner with product, platform, and AI teams to ensure systems are observable, scalable, resilient, and cost-efficient.
* Identify gaps in observability and reliability practices and propose pragmatic, high-impact solutions.
* Build tooling and dashboards that provide actionable insights into system health and performance.
* Continuously improve operational processes, reducing toil and increasing automation.
* Foster a culture of ownership, accountability, and continuous improvement across engineering teams.

What you need to have:
* Proven experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
* Strong hands-on experience with observability tools and frameworks, especially OpenTelemetry.
* Experience with APM platforms (Datadog preferred).
* Solid understanding of distributed systems and tracing methodologies.
* Proficiency in Python and/or TypeScript for debugging and code improvements.
* Strong scripting skills using Bash or shell scripting.
* Experience with Kubernetes and container orchestration.
* Familiarity with CI/CD pipelines, particularly GitHub Actions.
* Deep understanding of cloud platforms (AWS preferred).
* Strong knowledge of networking fundamentals (DNS, load balancing, TLS).
* Experience operating or supporting LLM/AI-driven systems.

What makes you stand out?
* Experience working with Go or familiarity with Go-based ecosystems.
* Hands-on experience with AI/LLM observability and performance optimization.
* Knowledge of vector databases and embedding pipelines.
* Strong problem-solving mindset with the ability to navigate complex, ambiguous systems.
* Bias toward action: comfortable working in imperfect environments and driving improvements.
* Collaborative mindset with a focus on solutions rather than blame.
* Ability to clearly communicate technical insights and influence engineering decisions.

Why join our team:
* We help you be your best through professional development opportunities, interesting work, and supportive leaders;
* We foster a vibrant and inclusive culture where you can work with talented colleagues to create new solutions and have an impact for colleagues, clients, and communities;
* Our scale enables us to provide a range of career opportunities, as well as benefits and rewards to enhance your well-being;
* A yearly budget and the opportunity to build your flexible benefits package (up to 20% of your annual salary);
* 30+ days off (25 legal days off, 1 extra day off on your birthday, public holiday replacement days, extra buy/sell from your benefits budget);
* Performance Bonus scheme;
* Matching charity contributions, charity days off, and the Pay it Forward charity challenge;
* Core benefits - Pension, Life and Medical Insurance, Meal Vouchers, Travel Insurance.

Marsh (NYSE: MRSH) is a global leader in risk, reinsurance and capital, people and investments, and management consulting, advising clients in 130 countries. With annual revenue of over $27 billion and more than 95,000 colleagues, Marsh helps build the confidence to thrive through the power of perspective. For more information, visit corporate.marsh.com, or follow us on LinkedIn and X.

Marsh is committed to creating a diverse, inclusive and flexible work environment. We aim to attract and retain the best people and embrace diversity of age, background, disability, ethnic origin, family duties, gender orientation or expression, marital status, nationality, parental status, personal or social status, political affiliation, race, religion and beliefs, sex/gender, sexual orientation or expression, skin color, or any other characteristic protected by applicable law.

Marsh is committed to hybrid work, which includes the flexibility of working remotely and the collaboration, connections and professional development benefits of working together in the office. All Marsh colleagues are expected to be in their local office or working onsite with clients at least three days per week. Office-based teams will identify at least one "anchor day" per week on which their full team will be together in person.