Auto-switching DNS to a backup origin when the primary fails health checks. Cheap multi-region failover, but DNS TTLs cap how fast it can be.

DNS failover is a feature where your DNS provider continuously health-checks your origins and, if the primary stops responding, swaps the DNS answer to a backup. Users start hitting the backup as soon as their resolver's cache expires.

The classic setup:

Primary origin: 203.0.113.10 (US East).
Backup origin: 203.0.113.20 (US West, idle or hot-warm).
DNS provider health-checks https://203.0.113.10/health every 30 seconds.
After 3 failed checks, the DNS provider stops returning 203.0.113.10 and starts returning 203.0.113.20.
Resolvers eventually pick up the new answer (limited by TTL).
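
The switching logic in the steps above can be sketched as a small state machine. This is a minimal illustration, not any provider's actual implementation. It assumes the 3 failures must be consecutive and that the provider fails back as soon as one check passes; neither is stated above, and real providers usually make both configurable. The names FailoverState and record_check are hypothetical.

```python
from dataclasses import dataclass

PRIMARY = "203.0.113.10"
BACKUP = "203.0.113.20"
FAILURE_THRESHOLD = 3  # failed checks before switching, as in the setup above


@dataclass
class FailoverState:
    """Tracks health-check results and decides which IP to answer with."""

    consecutive_failures: int = 0

    def record_check(self, healthy: bool) -> str:
        """Record one health-check result and return the IP to serve."""
        if healthy:
            # Assumption: a single passing check resets the counter (instant fail-back).
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.answer()

    def answer(self) -> str:
        """Return the backup IP once the failure threshold is reached."""
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            return BACKUP
        return PRIMARY


state = FailoverState()
# One passing check, three failures, then recovery.
answers = [state.record_check(ok) for ok in (True, False, False, False, True)]
```

In this run the answer stays on the primary through the first two failures, flips to the backup on the third, and returns to the primary when the check passes again.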

What this protects against

Whole-region outages.
Origin server crashes.
Network partitions that cut your primary region off from the internet, as seen by the provider's health checkers.

What it doesn't protect against

DNS-level outages (if your DNS provider goes down, no answer is served at all, primary or backup). Mitigate with secondary DNS.
Resolver-side cache freshness. If a resolver cached the old answer with a 1-hour TTL, that resolver's users still hit the dead origin for up to 1 hour.
Application-level failures that the health check misses (origin returns 500s to users while the check still passes). The health check has to validate real behavior, not just "the port is open."

TTL trade-offs

Failover speed is gated by TTL:

TTL	Failover speed	Trade-off
60s	Fast	More DNS queries, more load on DNS, higher cost
300s	Reasonable	Standard for failover-critical records
3600s	Slow	Use only for stable records
86400s	Too slow	Don't use for anything that might fail over

Set failover-critical records (apex, www, API) to 60-300s. Stable records (MX, mail-auth, BIMI) can sit at 3600s.
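
To make the trade-off concrete, here is a rough worst-case estimate using the classic setup's numbers (30-second checks, 3 failures to trip). It is a back-of-the-envelope sketch: it assumes detection takes about interval × failures and that an unlucky resolver cached the old answer just before the switch. It ignores check timeouts, propagation inside the DNS provider, and resolvers that clamp or ignore TTLs. The function name is hypothetical.

```python
def worst_case_failover_seconds(check_interval_s: int, failures_needed: int, ttl_s: int) -> int:
    """Rough upper bound: time to detect the failure plus one full TTL of stale cache."""
    detection_s = check_interval_s * failures_needed
    return detection_s + ttl_s


# TTLs from the table above, with 30-second checks and a 3-failure threshold.
estimates = {ttl: worst_case_failover_seconds(30, 3, ttl) for ttl in (60, 300, 3600, 86400)}
for ttl, total in estimates.items():
    print(f"TTL {ttl}s -> up to ~{total}s before all users reach the backup")
```

With a 60s TTL the estimate is about 150 seconds, with 300s about 390 seconds, and with 3600s over an hour. Below a few minutes of TTL, detection time starts to dominate.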

Health check semantics

Don't just check that the origin is reachable. Check that the origin is doing its job:

GET /healthz HTTP/1.1
Host: api.example.com

Expected response:
HTTP 200
Body: {"status":"ok","db":"up","queue":"up"}

If the response says "db":"down", the health check fails and the backup region takes over.
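
A checker that applies these semantics could look like the sketch below. It assumes the /healthz body has exactly the shape shown above ("status", "db", "queue"); the function name and the list of required components are illustrative. Some managed health checkers only inspect the status code or search for a substring, so it may be safer for the origin to also return a non-200 status when a dependency is down.

```python
import json

# Assumption: these are the dependencies the /healthz body reports on.
REQUIRED_COMPONENTS = ("db", "queue")


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only if the origin reports itself and its dependencies as up."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body means the endpoint is not behaving as expected.
        return False
    if not isinstance(payload, dict) or payload.get("status") != "ok":
        return False
    # Every required dependency must be explicitly reported as "up".
    return all(payload.get(name) == "up" for name in REQUIRED_COMPONENTS)
```

A missing key counts as a failure here, so an origin that silently drops "queue" from its response is treated as unhealthy rather than given the benefit of the doubt.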

Practical implementation paths

Amazon Route 53 health checks + failover record sets. Built-in, ~$0.50/health-check/month.
Cloudflare Load Balancing. Health checks, multi-pool weighted distribution, ~$5/month/zone plus per-check.
NS1. Real-time, fast-failover, more expensive.
DIY with cron + DNS API. Cheap, brittle. Don't do this for production.
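
As an illustration of the first option, the sketch below builds the change batch for a Route 53 failover record pair in plain Python. It uses the record name and IPs from the examples above; the health check ID is a placeholder, the helper name is hypothetical, and the 60s TTL is one choice from the table. Nothing here talks to AWS: the resulting dictionary is what you would pass as ChangeBatch to boto3's change_resource_record_sets, along with your own hosted zone ID.

```python
from typing import Optional


def failover_record(name: str, ip: str, role: str, ttl: int = 60,
                    health_check_id: Optional[str] = None) -> dict:
    """Build one UPSERT change for a failover A record (role: PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",  # must be unique per record in the set
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Placeholder ID: substitute the ID of a health check you have actually created.
PLACEHOLDER_HEALTH_CHECK_ID = "00000000-0000-0000-0000-000000000000"

change_batch = {
    "Changes": [
        failover_record("api.example.com.", "203.0.113.10", "PRIMARY",
                        health_check_id=PLACEHOLDER_HEALTH_CHECK_ID),
        failover_record("api.example.com.", "203.0.113.20", "SECONDARY"),
    ]
}
# To apply (not run here):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your zone id>", ChangeBatch=change_batch)
```

Only the primary carries a health check in this sketch; whether the secondary should have one too depends on whether you would rather serve a possibly broken backup or no answer at all.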

In a SaaS

For a custom-domain SaaS, you control the edge layer. Failover within your edge is typically handled at the load balancer level (instant failover within a region) and via anycast at the network level (regional failure routes traffic to other PoPs). DNS-level failover is a backstop, not the primary mechanism. Useful for catastrophe (a whole edge cluster down) but not for routine recovery.