DNS failover

Auto-switching DNS to a backup origin when the primary fails health checks. Cheap multi-region failover, but DNS TTLs cap how fast it can be.

DNS failover is a feature where your DNS provider continuously health-checks your origins and, if the primary stops responding, swaps the DNS answer to a backup. Users start hitting the backup as soon as their resolver's cache expires.

The classic setup:

  • Primary origin: 203.0.113.10 (US East).
  • Backup origin: 203.0.113.20 (US West, idle or hot-warm).
  • DNS provider health-checks https://203.0.113.10/healthz every 30 seconds.
  • After 3 failed checks, the DNS provider stops returning 203.0.113.10 and starts returning 203.0.113.20.
  • Resolvers eventually pick up the new answer (limited by TTL).
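The switching logic in this setup can be sketched as a small counter. This is a minimal illustration, not any provider's actual implementation: the class name FailoverState is hypothetical, and it assumes the failures must be consecutive and that a single passing check fails back to the primary. Real providers usually make both behaviors configurable.

```python
from dataclasses import dataclass

PRIMARY = "203.0.113.10"
BACKUP = "203.0.113.20"
FAILURE_THRESHOLD = 3  # failed checks before switching, as in the setup above


@dataclass
class FailoverState:
    """Hypothetical helper: counts failed checks and picks the DNS answer."""

    consecutive_failures: int = 0

    def record_check(self, healthy: bool) -> None:
        # Assumption: one passing check resets the counter (immediate failback).
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def answer(self) -> str:
        # Serve the backup only once the failure threshold has been reached.
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            return BACKUP
        return PRIMARY


state = FailoverState()
answers = []
# One result per 30-second check: healthy, three failures, then recovery.
for healthy in [True, False, False, False, True]:
    state.record_check(healthy)
    answers.append(state.answer())
print(answers)
```

The answer only flips on the third failed check. Resolvers that cached the primary address before that point keep using it until their TTL runs out.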

What this protects against

  • Whole-region outages.
  • Origin server crashes.
  • Network partitions that cut your primary region off from the DNS provider's health checkers (and typically from users too).

What it doesn't protect against

  • DNS-level outages. If your DNS provider goes down, neither answer can be served. Mitigate with secondary DNS.
  • Resolver-side cache freshness. If a resolver cached the old answer with a 1-hour TTL, that resolver's users still hit the dead origin for up to 1 hour.
  • Application-level failures (origin returns 500s but the health check passes). The health check has to validate real behavior, not just "the port is open."

TTL trade-offs

Failover speed is gated by TTL:

TTL       Failover speed   Trade-off
60s       Fast             More DNS queries, more load on DNS, higher cost
300s      Reasonable       Standard for failover-critical records
3600s     Slow             Use only for stable records
86400s    Too slow         Don't use for anything that might fail over

Set failover-critical records (apex, www, API) to 60-300s. Stable records (MX, mail-auth, BIMI) can sit at 3600s.
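To see how TTL dominates the total, here is a rough back-of-the-envelope calculation. It assumes worst-case failover time is about the detection time (check interval times failed checks needed) plus one full TTL. The function name is hypothetical, and the model ignores health-check timeouts, resolvers that clamp or extend TTLs, and client-side caching.

```python
def worst_case_failover_seconds(check_interval_s: int, failures_needed: int, ttl_s: int) -> int:
    """Rough upper bound: time to detect the failure plus one full TTL of cached answers."""
    detection_s = check_interval_s * failures_needed
    return detection_s + ttl_s


# 30-second checks and a 3-failure threshold, as in the classic setup.
for ttl in (60, 300, 3600, 86400):
    total = worst_case_failover_seconds(30, 3, ttl)
    print(f"TTL {ttl:>5}s -> up to {total}s ({total / 60:.1f} min)")
```

Under these assumptions a 60s TTL gives about 150 seconds end to end, while a 3600s TTL leaves some users on the dead origin for just over an hour. Below a few minutes of TTL, the health-check interval starts to matter as much as the TTL itself.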

Health check semantics

Don't just check that the origin is reachable. Check that the origin is doing its job:

GET /healthz HTTP/1.1
Host: api.example.com

Expected response:
HTTP 200
Body: {"status":"ok","db":"up","queue":"up"}

If the response says "db":"down", the health check fails and the backup region takes over.
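The pass/fail decision for a response like the one above might look like the sketch below. The function name is_healthy is hypothetical, and the rule that every key other than "status" names a dependency that must report "up" is an assumption based on the example body, not a standard.

```python
import json


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only if the origin answered 200 and every dependency reports "up"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not JSON counts as a failed check.
        return False
    if not isinstance(payload, dict) or payload.get("status") != "ok":
        return False
    # Assumption: every key other than "status" is a dependency that must be "up".
    return all(value == "up" for key, value in payload.items() if key != "status")


print(is_healthy(200, '{"status":"ok","db":"up","queue":"up"}'))
print(is_healthy(200, '{"status":"ok","db":"down","queue":"up"}'))
```

Most managed health checkers only match on status code or a substring of the body, so in practice it is often simpler to have /healthz itself return a non-200 status when a dependency is down.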

Practical implementation paths

  • Amazon Route 53 health checks + failover record sets. Built-in, ~$0.50/health-check/month.
  • Cloudflare Load Balancing. Health checks, multi-pool weighted distribution, ~$5/month/zone plus per-check.
  • NS1. Real-time, fast-failover, more expensive.
  • DIY with cron + DNS API. Cheap, brittle. Don't do this for production.
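As an illustration of the first option, the sketch below builds the change batch for a Route 53 primary/secondary failover pair. It only constructs the dictionary. In a real setup it would be passed to boto3's route53 client via change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch). The function name, the record name api.example.com., and the health check ID are placeholders, not values from a real account.

```python
def failover_change_batch(name: str, primary_ip: str, backup_ip: str,
                          health_check_id: str, ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch with a PRIMARY and a SECONDARY failover A record."""

    def record(role: str, ip: str, set_id: str, extra: dict) -> dict:
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,  # must be unique among records sharing a name and type
            "Failover": role,         # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        rrset.update(extra)
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "DNS failover pair (illustrative)",
        "Changes": [
            # The primary is tied to a health check; when it fails, the secondary is served.
            record("PRIMARY", primary_ip, "primary-us-east", {"HealthCheckId": health_check_id}),
            record("SECONDARY", backup_ip, "backup-us-west", {}),
        ],
    }


batch = failover_change_batch(
    "api.example.com.", "203.0.113.10", "203.0.113.20",
    health_check_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
)
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```

Keeping the batch construction separate from the API call makes the record layout easy to review before anything touches a live zone.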

In a SaaS

For a custom-domain SaaS, you control the edge layer. Failover within your edge is typically handled at the load balancer level (instant failover within a region) and via anycast at the network level (a regional failure routes traffic to other PoPs). DNS-level failover is a backstop, not the primary mechanism. It is useful for catastrophes (a whole edge cluster down) but not for routine recovery.
