Site Reliability Engineer (AI)
Madiff
Remote
15.07.2026.
Job Description
This is a remote position.
We are looking for a Senior Site Reliability Engineer to support advanced AI platforms responsible for production-grade applications and pipelines. The role focuses on building and maintaining reliability, scalability, and operational excellence across multiple AI-driven systems.
The engineer will work on a central operational layer for monitoring and managing AI workloads, improving system stability, and reducing incidents. This is a hands-on role requiring direct involvement in diagnosing production issues, implementing fixes, and optimising monitoring, alerting, and CI/CD processes.
The position requires close collaboration with engineering teams to improve release quality, standardise telemetry, and ensure stable and predictable system behaviour in a distributed cloud environment.
Responsibilities
- Build and maintain a central monitoring and alerting layer for AI applications and pipelines
- Define and implement SLIs, alerts, and operational dashboards
- Manage incidents, including triage, coordination, root cause analysis, and prevention
- Standardise telemetry across systems, including latency, throughput, and failures
- Optimise CI/CD pipelines and introduce quality gates for reliability
- Work closely with engineering teams to reduce recurring issues and improve stability
Requirements
- 5+ years of experience in SRE, Platform, or Production Engineering
- Strong hands-on experience with Kubernetes and production environments
- Experience with Azure and Azure DevOps
- Experience with monitoring tools such as Datadog
- Strong understanding of incident management and root cause analysis
- Ability to build practical monitoring and alerting systems
Nice to have
- Experience with AI or LLM pipelines
- Experience building monitoring platforms across multiple systems
- Experience with Grafana
- Experience working in large-scale or distributed environments
Expectations
- Strong ownership mindset and accountability for system stability
- Proactive approach to identifying risks and improvements
- Hands-on engineer who works directly with systems rather than only coordinating
- Comfortable working in dynamic and evolving environments
Benefits
- Solid, competitive salary
- Work in a multinational environment on international projects
- Comprehensive healthcare
- Long-term B2B contract with a stable project pipeline
- Work model: fully remote