
How to Monitor 10,000 Customer SSL Certificates Without Your On-Call Crying at 2am

Jonathan Geiger
Tags: operations, ssl, monitoring, custom domains, playbook, on-call

You can monitor one SSL certificate with openssl s_client and a cron job. The lessons start at 50, hurt at 500, and break things at 5,000. This is the playbook we'd hand a new engineer on day one if we hadn't already learned it the slow way.

Why naive monitoring fails

A reasonable engineer's first instinct: cron job hits every domain hourly, parses the cert, alerts if expiry is under 30 days. Works fine through ~200 hostnames. By 2,000 you're hitting a wall:

  • You're paying for the probes. TLS handshake + cert parse is roughly 30-60ms per domain. 2,000 domains hourly = ~50,000 handshakes a day. Bandwidth, compute, and IP reputation start to matter.
  • You're noise-flooding your alerts. A transient DNS hiccup at one customer's nameservers fires expiry-unknown for that domain. With one customer it's a curiosity. With 2,000, you get five false alarms a day.
  • You don't know which fires deserve a page. Every alert looks the same. Customer canceled their CNAME? Page. ACME slow today? Page. Cert legitimately about to expire? Page. By month two, on-call has muted you.

What changes at scale: you stop asking "is this domain healthy?" per probe, and start asking "is the fleet healthy?" with sampling — plus targeted probes when something specific looks wrong.
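One way to blunt the transient-failure noise described above is to require several consecutive failed probes before anything fires. This is a minimal sketch under assumptions: `FailureTracker` and the default threshold of 3 are hypothetical and not taken from any particular monitoring tool.

```typescript
// Hypothetical helper: raise an alert only after N consecutive failed probes.
class FailureTracker {
  private readonly threshold: number;
  private readonly streaks = new Map<string, number>();

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Record one probe result. Returns true exactly once per streak,
  // at the moment the streak reaches the threshold.
  record(hostname: string, success: boolean): boolean {
    if (success) {
      this.streaks.delete(hostname); // any success resets the streak
      return false;
    }
    const streak = (this.streaks.get(hostname) ?? 0) + 1;
    this.streaks.set(hostname, streak);
    return streak === this.threshold;
  }
}
```

A single DNS hiccup never reaches the threshold, while a domain that is genuinely down fires once and then stays quiet until it recovers.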

The five failure modes worth instrumenting

In rough order of how often we see them:

  1. DNS drift — customer changed their CNAME to point somewhere else (or removed it). Your domain row still says "verified" but they're serving from another provider now. Cert renewal will silently start failing.
  2. ACME challenge failure — Let's Encrypt can't reach /.well-known/acme-challenge/... on the customer's hostname. Their CDN is probably caching the challenge endpoint, or they re-enabled Cloudflare proxy. Renewal fails until somebody intervenes.
  3. Chain broken / hostname mismatch — cert is valid but doesn't match the SNI. Usually a config error during a migration.
  4. Hard expiry — renewal didn't run on time. By the time you notice this from a probe, you're already serving an expired cert to real users.
  5. Origin returning 5xx during validation — your own backend went down during the LE retry window. Customer-facing certs are fine, but new domains can't validate.

Each has a distinct cause and a distinct fix. Don't merge their alerts.
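To keep those five alerts separate in code, one option is to give each failure mode its own code and return all that apply. The sketch below assumes a hypothetical `Observation` shape; the field names and alert codes are illustrative, not a real API.

```typescript
// Hypothetical observation shape; every field name here is an assumption.
interface Observation {
  dnsPointsToUs: boolean;
  acmeChallengeReachable: boolean;
  chainValid: boolean;
  hostnameMatchesSan: boolean;
  certNotAfter: Date;
  originStatus: number; // HTTP status from our own backend during validation
}

type FailureMode =
  | "DNS_DRIFT"
  | "ACME_CHALLENGE_FAILURE"
  | "CHAIN_OR_HOSTNAME_MISMATCH"
  | "HARD_EXPIRY"
  | "ORIGIN_5XX";

// Returns every failure mode present so each can be routed on its own.
function classify(obs: Observation, now: Date = new Date()): FailureMode[] {
  const modes: FailureMode[] = [];
  if (!obs.dnsPointsToUs) modes.push("DNS_DRIFT");
  if (!obs.acmeChallengeReachable) modes.push("ACME_CHALLENGE_FAILURE");
  if (!obs.chainValid || !obs.hostnameMatchesSan) modes.push("CHAIN_OR_HOSTNAME_MISMATCH");
  if (obs.certNotAfter.getTime() <= now.getTime()) modes.push("HARD_EXPIRY");
  if (obs.originStatus >= 500 && obs.originStatus <= 599) modes.push("ORIGIN_5XX");
  return modes;
}
```

Returning a list rather than a single value means a domain with both DNS drift and a failing origin produces two distinct alerts instead of one merged one.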

The 30-7-1 expiry ladder

Most teams alert on a single threshold ("cert expires in < 30 days"). At scale that loses information. Use three:

| Bucket | Signal | Action |
| --- | --- | --- |
| 30 days out | ticket-worthy | Open a ticket, route to your on-call queue, don't page. ACME should have renewed by now; investigate why it didn't. |
| 7 days out | escalate to chat | Alert a Slack channel during business hours and automatically attempt a forced renewal; if that fails, page the next business day. |
| 1 day out | page immediately | Real outage incoming. Page the on-call engineer right now. |

The trick is that the 30-day signal isn't an emergency — it's a backlog warning. Most failures show up here, and you have plenty of time. The 1-day signal is when something is genuinely on fire.

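// monitor.ts (simplified)
// streamCerts(), page(), slack(), and ticket() are assumed to be defined elsewhere.
for await (const { hostname, expiresAt } of streamCerts()) {
  // new Date(...) accepts a Date, an epoch-ms number, or an ISO string.
  const days = (new Date(expiresAt).getTime() - Date.now()) / 86_400_000;
  if (days < 1) await page("CERT_HARD_EXPIRY", hostname); // also covers already-expired certs
  else if (days < 7) await slack("CERT_EXPIRING_SOON", hostname);
  else if (days < 30) await ticket("CERT_RENEWAL_LATE", hostname);
}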
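Run as written, the loop above would re-alert on every pass for as long as a domain stays in a bucket. One way to avoid that, sketched here as an assumption rather than as part of the original monitor, is to compute the bucket as a pure function and notify only when a domain moves to a more severe rung. The names `expiryBucket` and `shouldNotify` are hypothetical.

```typescript
type Bucket = "ok" | "ticket" | "slack" | "page";

const DAY_MS = 86_400_000;

// Map time-to-expiry onto the 30-7-1 ladder.
function expiryBucket(expiresAt: Date, now: Date): Bucket {
  const days = (expiresAt.getTime() - now.getTime()) / DAY_MS;
  if (days < 1) return "page";
  if (days < 7) return "slack";
  if (days < 30) return "ticket";
  return "ok";
}

const severity: Record<Bucket, number> = { ok: 0, ticket: 1, slack: 2, page: 3 };

// Notify only when the domain has moved to a more severe rung than last time.
function shouldNotify(previous: Bucket, current: Bucket): boolean {
  return severity[current] > severity[previous];
}
```

Storing the last bucket per hostname and calling `shouldNotify` before each alert means a domain generates at most three notifications on its way down the ladder.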

DNS drift: the silent killer

Customer adds the CNAME, domain goes live, three months pass. They redesign their site, get advice to "use Cloudflare," flip the orange cloud on, forget you exist. The cert was renewing fine until the day Cloudflare started intercepting /.well-known/acme-challenge/... and serving its own 404.

You won't catch this from a cert probe — the cert is still valid for another 60 days. The signal is in the DNS:

// dns-monitor.ts
import { Resolver } from "node:dns/promises";
const resolver = new Resolver();
resolver.setServers(["1.1.1.1", "8.8.8.8"]); // bypass your local cache

async function detectDrift(hostname: string) {
  const cnames = await resolver.resolveCname(hostname).catch(() => []);
  const ips    = await resolver.resolve4(hostname).catch(() => []);

  const expectedCnameTargets = ["edge.domainee.dev"];
  const expectedEdgeIps      = await getOurEdgeIps(); // cached, refresh hourly

  const cnamePointsToUs = cnames.some((c) => expectedCnameTargets.includes(c));
  const ipPointsToUs    = ips.some((ip) => expectedEdgeIps.has(ip));

  if (!cnamePointsToUs && !ipPointsToUs) {
    return { drift: true, observedCnames: cnames, observedIps: ips };
  }
  return { drift: false };
}

Run this once a day per domain, not hourly. Most drift takes 24+ hours of being broken before it matters. Detected drift goes into a Slack channel — not a page, because the customer almost certainly did this themselves.
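One caveat with `detectDrift` as written: `.catch(() => [])` treats a resolver timeout the same as "no record," so a transient DNS failure can look like drift. A possible refinement, sketched here as an assumption, is to separate definitive empty answers from inconclusive ones. `ENODATA` and `ENOTFOUND` are error codes documented for Node's `dns` module; `settleLookup` and `LookupOutcome` are hypothetical names.

```typescript
// err.code values that mean the resolver answered and the record or name does not exist.
const DEFINITIVE_EMPTY = new Set(["ENODATA", "ENOTFOUND"]);

type LookupOutcome =
  | { kind: "records"; values: string[] }
  | { kind: "empty" } // the resolver answered: nothing there
  | { kind: "inconclusive" }; // timeout, SERVFAIL, refused: retry before judging

// Wrap a lookup promise so resolver trouble is not mistaken for drift.
async function settleLookup(lookup: Promise<string[]>): Promise<LookupOutcome> {
  try {
    return { kind: "records", values: await lookup };
  } catch (err) {
    const code = (err as { code?: string } | null)?.code ?? "";
    return DEFINITIVE_EMPTY.has(code) ? { kind: "empty" } : { kind: "inconclusive" };
  }
}
```

With this in place, `settleLookup(resolver.resolveCname(hostname))` can replace the bare `.catch`, and drift is reported only when neither the CNAME nor the A lookup came back inconclusive.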

Sampling instead of probing every domain

At 10,000 domains, you don't probe every one hourly. You probe a sliding window — say 5% per hour, full rotation every 20 hours — plus a targeted probe whenever a specific signal demands it (cert renewal attempted, DNS state changed, customer hit your dashboard).

This catches genuine issues fast (most failures persist for hours, so you'll see them within 20 hours) while shedding 95% of the load.

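function pickProbeSet(allDomains: Domain[], sampleRate = 0.05) {
  // Deterministic shard: probe a domain when hash(hostname) % slots === current slot.
  const slots = Math.round(1 / sampleRate); // 20 slots at 5%; rounding keeps the modulus an integer
  const slot = Math.floor(Date.now() / 3_600_000) % slots; // advances once per hour
  // hashSlot is assumed to map a hostname to a stable slot in [0, slots).
  return allDomains.filter((d) => hashSlot(d.hostname, sampleRate) === slot);
}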

Important: shard deterministically (hash-based, not random). The same domain is probed in the same slot every cycle, which makes "this domain hasn't been probed in 24h" a meaningful alert on its own.
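The `hashSlot` helper is not shown above. One plausible implementation, offered as a sketch and not as the version actually used, is a 32-bit FNV-1a hash reduced modulo the slot count. The signature is assumed from the `hashSlot(d.hostname, sampleRate)` call in `pickProbeSet`.

```typescript
// FNV-1a, 32-bit. Stable across processes and restarts, unlike a random pick.
// Operates on UTF-16 code units, which equal bytes for ASCII hostnames.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force an unsigned result
}

// Assumed signature, matching the hashSlot(hostname, sampleRate) call site.
function hashSlot(hostname: string, sampleRate: number): number {
  const slots = Math.round(1 / sampleRate); // 20 slots at a 5% sample rate
  // Lowercase first so "Shop.Example.com" and "shop.example.com" share a slot.
  return fnv1a(hostname.toLowerCase()) % slots;
}
```

Any stable hash would do here; FNV-1a is chosen only because it is short and dependency-free. The property that matters is that a hostname lands in the same slot on every machine and after every deploy.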

What to log on every probe

The bare minimum, structured:

  • hostname
  • probedAt
  • success (boolean)
  • tlsHandshakeMs
  • certSubject (first SAN)
  • certIssuer (CN of issuer)
  • certNotBefore
  • certNotAfter
  • chainValid (boolean)
  • hostnameMatchesSan (boolean)
  • responseStatus (from a HEAD request after handshake)
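Of the fields above, `hostnameMatchesSan` is the least obvious to compute. The sketch below assumes the certificate comes from Node's `tls` module, where the peer certificate exposes SANs as a single `subjectaltname` string such as `DNS:a.example, DNS:*.b.example`. The helper names are hypothetical, and only the basic single-label wildcard rule is handled.

```typescript
// Extract DNS names from a subjectaltname string, ignoring other entry types.
function parseDnsSans(subjectaltname: string): string[] {
  return subjectaltname
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.startsWith("DNS:"))
    .map((entry) => entry.slice(4).toLowerCase());
}

// A wildcard covers exactly one left-most label: *.example.com matches
// a.example.com but not example.com or a.b.example.com.
function hostnameMatchesSan(hostname: string, sans: string[]): boolean {
  const host = hostname.toLowerCase().replace(/\.$/, ""); // drop a trailing dot
  return sans.some((san) => {
    if (!san.startsWith("*.")) return san === host;
    const dot = host.indexOf(".");
    return dot > 0 && host.slice(dot + 1) === san.slice(2);
  });
}
```

In production it may be simpler to lean on Node's built-in `tls.checkServerIdentity`, which implements the full matching rules; the version here is mainly useful for logging the result as a plain boolean.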

You'll be glad you have all of these when an incident hits. The two queries that pull weight during a real outage:

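-- What did we see in the last hour for this customer's domains?
-- Match the apex and its subdomains only; a bare '%acme.com' would also match 'notacme.com'.
SELECT * FROM probes
WHERE (hostname = 'acme.com' OR hostname LIKE '%.acme.com')
  AND probed_at > now() - interval '1 hour';

-- Across all domains, has anything started failing in the last 15 minutes?
-- cert_issuer is assumed to be the column holding the certIssuer field logged above.
SELECT cert_issuer, count(*) AS failures FROM probes
WHERE success = false
  AND probed_at > now() - interval '15 minutes'
GROUP BY cert_issuer
ORDER BY failures DESC;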

The second query catches issuer-wide problems early. Last year we caught a Let's Encrypt staging outage 12 minutes before they posted about it.

Page-worthy vs ticket-worthy

Most teams over-page. The default should be ticket-worthy; pages should be reserved for things that are currently breaking real user traffic.

| Signal | Action |
| --- | --- |
| Cert expiry < 24h, renewal failed | Page |
| Cert expiry < 24h, renewal succeeded, not yet propagated | Ticket |
| TLS handshake failing > 5 min for one domain | Slack, escalate to page if > 30 min |
| TLS handshake failing on > 5 domains in 5 min (regional pattern) | Page |
| DNS drift detected on a single domain | Slack |
| DNS drift on > 10 domains in 10 min | Page (something systemic) |
| ACME challenge 5xx rate > 2% | Page |
| Origin reachability < 95% during validation window | Ticket (might be your origin) |

The pattern: single-customer issues are tickets unless they're imminent. Fleet-wide patterns are pages. The exception is "we're already serving an expired cert to real users" — that's always a page.
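The fleet-wide rows in that table come down to counting distinct affected domains inside a time window. Here is a minimal sketch of two of them, assuming failure events are available as an in-memory list; `FailureEvent` and the function names are hypothetical, while the thresholds are taken from the table.

```typescript
interface FailureEvent {
  hostname: string;
  at: number; // epoch milliseconds
}

type Action = "page" | "slack";

// Count distinct hostnames with at least one event in the window ending at `now`.
function distinctHostsInWindow(events: FailureEvent[], now: number, windowMs: number): number {
  const hosts = new Set<string>();
  for (const e of events) {
    if (e.at > now - windowMs && e.at <= now) hosts.add(e.hostname);
  }
  return hosts.size;
}

// DNS drift: Slack for isolated cases, page when > 10 domains drift within 10 minutes.
function routeDnsDrift(events: FailureEvent[], now: number): Action {
  return distinctHostsInWindow(events, now, 10 * 60_000) > 10 ? "page" : "slack";
}

// TLS handshake failures: page when > 5 domains fail within 5 minutes.
function routeHandshakeFailures(events: FailureEvent[], now: number): Action {
  return distinctHostsInWindow(events, now, 5 * 60_000) > 5 ? "page" : "slack";
}
```

Counting distinct hostnames instead of raw events matters: one flapping domain that fails twenty times in a window is still a single-customer issue.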

How Domainee handles it

We monitor every domain on the edge — TLS health, DNS drift, renewal status, origin reachability — and fire webhooks on every state change. You wire one endpoint and stop running cron jobs.

If you're rolling your own, the playbook above is the one we wish someone had handed us at our 1,000th hostname. If you'd rather skip to "it just works," sign up at /sign-up. First 50 domains free.