DNS failover
Auto-switching DNS to a backup origin when the primary fails health checks. Cheap multi-region failover, but DNS TTLs cap how fast it can be.
DNS failover is a feature where your DNS provider continuously health-checks your origins and, if the primary stops responding, swaps the DNS answer to a backup. Users start hitting the backup as soon as their resolver's cache expires.
The classic setup:
- Primary origin: 203.0.113.10 (US East).
- Backup origin: 203.0.113.20 (US West, idle or hot-warm).
- DNS provider health-checks https://203.0.113.10/health every 30 seconds.
- After 3 failed checks: the DNS provider stops returning 203.0.113.10. Starts returning 203.0.113.20.
- Resolvers eventually pick up the new answer (limited by TTL).
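The decision logic in the setup above can be sketched as a small consecutive-failure counter. This is a minimal illustration, not any provider's actual implementation: the `FailoverState` class and its method names are hypothetical, and it assumes failback happens on the first passing check, whereas real providers often require several passing checks before switching back.

```python
from dataclasses import dataclass

PRIMARY = "203.0.113.10"  # US East, from the setup above
BACKUP = "203.0.113.20"   # US West, from the setup above


@dataclass
class FailoverState:
    """Hypothetical tracker for consecutive health-check failures."""

    failure_threshold: int = 3
    consecutive_failures: int = 0

    def record_check(self, healthy: bool) -> str:
        """Record one health-check result and return the IP to answer with."""
        if healthy:
            # Assumption: a single passing check resets the counter (instant failback).
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.current_answer()

    def current_answer(self) -> str:
        """Return the backup once the failure threshold is reached, else the primary."""
        if self.consecutive_failures >= self.failure_threshold:
            return BACKUP
        return PRIMARY
```

Calling `record_check(False)` three times in a row flips the answer to the backup on the third call; a later `record_check(True)` flips it back.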
What this protects against
- Whole-region outages.
- Origin server crashes.
- Network partitions that cut your primary region off from the DNS provider's health checkers (and, usually, from your users).
What it doesn't protect against
- DNS-level outages (if your DNS provider goes down, neither answer is reachable). Mitigate with secondary DNS.
- Resolver-side cache freshness. If a resolver cached the old answer with a 1-hour TTL, that resolver's users still hit the dead origin for up to 1 hour.
- Application-level failures (origin returns 500s but health check passes). Health check has to validate real behavior, not just "the port is open."
TTL trade-offs
Failover speed is gated by TTL:
| TTL | Failover speed | Trade-off |
|---|---|---|
| 60s | Fast | More DNS queries, more load on DNS, higher cost |
| 300s | Reasonable | Standard for failover-critical records |
| 3600s | Slow | Use only for stable records |
| 86400s | Too slow | Don't use for anything that might fail over |
Set failover-critical records (apex, www, API) to 60-300s. Stable records (MX, mail-auth, BIMI) can sit at 3600s.
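To make the trade-off concrete, here is a rough worst-case estimate that adds detection time (check interval times failure threshold) to the record's TTL. It is a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes resolvers honor the TTL exactly and ignores check timeouts and client-side caching.

```python
def worst_case_failover_seconds(check_interval_s: int,
                                failure_threshold: int,
                                ttl_s: int) -> int:
    """Approximate the longest time a user could keep hitting a dead origin."""
    # Detection: the origin can fail right after a passing check, so the
    # provider needs roughly `failure_threshold` further intervals to notice.
    detection_s = check_interval_s * failure_threshold
    # Propagation: a resolver may have cached the old answer just before the
    # switch, so it can keep serving it for up to one full TTL.
    return detection_s + ttl_s
```

With the classic setup (30-second checks, 3 failures), a 60s TTL works out to about 150 seconds end to end and a 300s TTL to about 390 seconds. At 3600s the TTL dominates completely, which is why long TTLs are a poor fit for records that might fail over.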
Health check semantics
Don't just check that the origin is reachable. Check that the origin is doing its job:
GET /healthz HTTP/1.1
Host: api.example.com
Expected response:
HTTP 200
Body: {"status":"ok","db":"up","queue":"up"}
If the response says "db":"down", the health check fails and the backup region takes over.
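A checker that validates the body, not just the status code, might look like the sketch below. The field names come from the example response above; the `is_healthy` function and the `REQUIRED_COMPONENTS` tuple are assumptions made for illustration, and managed providers typically expose this as a configurable string or status match rather than custom code.

```python
import json

# Components that must report "up" (taken from the example response above).
REQUIRED_COMPONENTS = ("db", "queue")


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only if the origin reports that it is doing its job."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body (for example an error page) counts as unhealthy.
        return False
    if not isinstance(payload, dict):
        return False
    if payload.get("status") != "ok":
        return False
    # Every required dependency must explicitly report "up".
    return all(payload.get(name) == "up" for name in REQUIRED_COMPONENTS)
```

Treating a missing field the same as "down" is a deliberate choice here: a health endpoint that stops reporting a dependency is more likely broken than healthy.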
Practical implementation paths
- Amazon Route 53 health checks + failover record sets. Built-in, ~$0.50/health-check/month.
- Cloudflare Load Balancing. Health checks, multi-pool weighted distribution, ~$5/month/zone plus per-check.
- NS1. Real-time, fast-failover, more expensive.
- DIY with cron + DNS API. Cheap, brittle. Don't do this for production.
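As one illustration of the first option, the sketch below builds the kind of change batch that Route 53 failover record sets use: one PRIMARY record tied to a health check and one SECONDARY record. The helper name, the record name `api.example.com`, the set identifiers, and the health check ID are placeholders, and the field layout should be checked against the Route 53 API reference before use. The code only builds the dictionary; it makes no AWS calls.

```python
from typing import Optional


def failover_record(name: str, ip: str, role: str, set_id: str,
                    ttl: int = 60,
                    health_check_id: Optional[str] = None) -> dict:
    """Build one UPSERT change for a failover A record (hypothetical helper)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Changes": [
        # Placeholder health check ID; a real one comes from creating the check.
        failover_record("api.example.com", "203.0.113.10", "PRIMARY",
                        "us-east", health_check_id="HEALTH-CHECK-ID-PLACEHOLDER"),
        failover_record("api.example.com", "203.0.113.20", "SECONDARY",
                        "us-west"),
    ]
}
```

With boto3, a batch like this would be submitted through the Route 53 client's `change_resource_record_sets` call, passing the hosted zone ID and `ChangeBatch=change_batch`.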
In a SaaS
For a custom-domain SaaS, you control the edge layer. Failover within your edge is typically handled at the load balancer level (instant failover within a region) and via anycast at the network level (regional failure routes traffic to other PoPs). DNS-level failover is a backstop, not the primary mechanism. Useful for catastrophe (a whole edge cluster down) but not for routine recovery.