Service Health

We had a full outage last week that was deeply rattling for our team. Before this, we hadn't had any major disruption to our services since the prior year. I think it caught most of us by surprise which is never what you want. It ate majorly into our SLA budget for the year and it was a reminder of two very important perennial truths for our business:

We are a critical service for our customers. Like a utility, they expect us to be up and running all the time. Owning pipes is a blessing and a curse.
There is always a bottle-neck lurking in the shadows that we haven't yet discovered. We need to find it before it finds us.

In response to last week, we've been combing through our services and double checking our monitors, alerts, bottle-necks, and failure modes. I'll use this space to collect my list of things a high quality, high availability service should have:

This list is only really useful for those deploying Python WSGI services on Kubernetes.

Automated Integration Tests: We make use of pytest, dependency injection and mocking to test the functionality of our code. We strive to not write "change detection tests".
Synthetic End-to-end Testing: We use DataDog Synthetics for this. It's a great way to test the availability of our services from multiple locations around the world.
Fast CI/CD with GitHub Actions, Docker, Kubernetes and ArgoCD
Basic Observability (we use DataDog for most of this)
- Logs
- Request Tracing
- Exception Tracking (we use Sentry for this)
- Memory, CPU, Disk, Network metrics
SLO(s): We're in the process of defining these for our services. We're using DataDog for this.
Health Monitors, Alerts, Escalation Policies and Runbooks: We use PagerDuty for this.
Load Testing: In the past I've used Locust for this, but it's only useful for local testing. I'm looking for a cloud-based solution that can simulate real-world traffic patterns. I've heard good things about k6 and Artillery.
Rate Limiting: We use Flask-Limiter and Redis for this.
Load Balancing: We use NGINX and Gunicorn for this.
Automatic Scaling Policies: We use HPA for this.