Service Reliability Lead
Grafana, AWS CloudWatch, PagerDuty, AWS EKS
3 days ago
devops
SPD Technology
About the Position
At SPD Technology, we bring together like-minded people who want their work to deliver value and who share a commitment to high performance and to building custom, cutting-edge tech solutions that drive clients' growth. We support our people with a culture of excellence and give them the opportunity to take ownership and contribute at every level.
Responsibilities
- Own the L2/L3/L4 escalation path for all incidents, serving as the senior technical point of contact and coordinating with third-party vendors when needed
- Ensure incident acknowledgement and resolution in line with SLA targets
- Make real-time decisions on hotfixes, rollbacks, and configuration changes under pressure
- Build and maintain the on-call rotation, ensuring zero coverage gaps
- Manage workarounds through to permanent resolution and maintain the escalation matrix for the client
- Deliver an operational monitoring dashboard
- Configure PagerDuty for automated alerting and on-call escalation aligned to SLA targets
- Maintain instrumentation across availability, latency, and error rate metrics per service tier
- Instrument and validate SLA clocks across response, workaround, and resolution targets
- Prepare monthly service credit calculations and service performance reports
- Provide metrics evidence during any client dispute review
- Deliver monthly reports covering incident volumes, SLA performance, RCA status, and risk log
- Author Root Cause Analysis documents within 5 days of incident resolution
- Identify recurring patterns and monitor for Service Improvement Plan triggers
- Design and implement SIPs with corrective actions, owners, and delivery timelines
- Proactively reduce incident frequency and improve mean time to resolution
- Operate in line with the AWS Shared Responsibility Model
- Distinguish SPD-caused from third-party failures and maintain evidence for availability exclusion claims
- Coordinate planned and urgent maintenance windows with the client
Requirements
- 5–8 years in production operations / SRE
- Hands-on incident command experience
- AWS operational depth (CloudWatch, EKS, RDS, networking)
- Monitoring stack: Grafana, CloudWatch, PagerDuty
- PCI DSS awareness
- RCA authorship and structured problem-solving
- SLA management and service credit mechanics
- Experience with hypercare / go-live stabilisation periods
- Experience in fintech or payment systems
Benefits
- Flexible working schedule with fully remote work
- Stable workload and income with provided laptops and licensed software
- Performance and merit reviews
- Personal development plans, individual learning through the corporate library, and support for public speaking
- Company-wide tech and cultural events and CSR initiatives