Service Reliability Lead

Grafana, AWS CloudWatch, PagerDuty, AWS EKS
3 days ago
devops
SPD Technology

Work format: remote

About the position

At SPD Technology, we bring together like-minded people who want their work to deliver value and who share a commitment to high performance and to custom, cutting-edge tech solutions that drive clients’ growth. We foster a culture of excellence and give our people the opportunity to take ownership and contribute at every level.

Responsibilities

  • Own the L2/L3/L4 escalation path for all incidents, serving as the senior technical point of contact and coordinating with third-party vendors when needed
  • Ensure incident acknowledgement and resolution in line with SLA targets
  • Make real-time decisions on hotfixes, rollbacks, and configuration changes under pressure
  • Build and maintain the on-call rotation ensuring zero coverage gaps
  • Manage workarounds through to permanent resolution and maintain the escalation matrix for the client
  • Deliver an operational monitoring dashboard
  • Configure PagerDuty for automated alerting and on-call escalation aligned to SLA targets
  • Maintain instrumentation across availability, latency, and error rate metrics per service tier
  • Instrument and validate SLA clocks across response, workaround, and resolution targets
  • Prepare monthly service credit calculations and service performance reports
  • Provide metrics evidence during any client dispute review
  • Deliver monthly reports covering incident volumes, SLA performance, RCA status, and risk log
  • Author Root Cause Analysis documents within 5 days of incident resolution
  • Identify recurring patterns and monitor for Service Improvement Plan triggers
  • Design and implement SIPs with corrective actions, owners, and delivery timelines
  • Proactively reduce incident frequency and improve mean time to resolution
  • Operate in line with the AWS Shared Responsibility Model
  • Distinguish SPD-caused from third-party failures and maintain evidence for availability exclusion claims
  • Coordinate planned and urgent maintenance windows with the client

Requirements

  • 5–8 years in production operations / SRE
  • Hands-on incident command experience
  • AWS operational depth (CloudWatch, EKS, RDS, networking)
  • Monitoring stack: Grafana, CloudWatch, PagerDuty
  • PCI DSS awareness
  • RCA authorship and structured problem-solving
  • SLA management and service credit mechanics
  • Experience with hypercare / go-live stabilisation periods
  • Experience in fintech or payment systems

Benefits

  • Flexible working schedule with fully remote work
  • Stable workload and income, with laptops and licensed software provided
  • Performance and merit reviews
  • Personal development plans and individual learning through the corporate library, plus support for public speaking
  • Company-wide tech and cultural events and CSR initiatives