How to Monitor 10,000 Customer SSL Certificates Without Your On-Call Crying at 2am
You can monitor one SSL certificate with openssl s_client and a cron job. The lessons start at 50, hurt at 500, and break things at 5,000. This is the playbook we'd hand a new engineer on day one if we hadn't already learned it the slow way.
Why naive monitoring fails
A reasonable engineer's first instinct: cron job hits every domain hourly, parses the cert, alerts if expiry is under 30 days. Works fine through ~200 hostnames. By 2,000 you're hitting a wall:
- You're paying for the probes. A TLS handshake plus cert parse takes roughly 30-60ms per domain. 2,000 domains probed hourly is 48,000 handshakes a day. Bandwidth, compute, and IP reputation start to matter.
- You're noise-flooding your alerts. A transient DNS hiccup at one customer's nameservers fires expiry-unknown for that domain. With one customer it's a curiosity. With 2,000, you get five false alarms a day.
- You don't know which fires deserve a page. Every alert looks the same. Customer canceled their CNAME? Page. ACME slow today? Page. Cert legitimately about to expire? Page. By month two, on-call has muted you.
What changes at scale: you stop asking "is this domain healthy?" per probe, and start asking "is the fleet healthy?" with sampling — plus targeted probes when something specific looks wrong.
The five failure modes worth instrumenting
In rough order of how often we see them:
- DNS drift — customer changed their CNAME to point somewhere else (or removed it). Your domain row still says "verified" but they're serving from another provider now. Cert renewal will silently start failing.
- ACME challenge failure — Let's Encrypt can't reach /.well-known/acme-challenge/... on the customer's hostname. Their CDN is probably caching the challenge endpoint, or they re-enabled Cloudflare proxy. Renewal fails until somebody intervenes.
- Chain broken / hostname mismatch — cert is valid but doesn't match the SNI. Usually a config error during a migration.
- Hard expiry — renewal didn't run on time. By the time you notice this from a probe, you're already serving an expired cert to real users.
- Origin returning 5xx during validation — your own backend went down during the LE retry window. Customer-facing certs are fine, but new domains can't validate.
Each has a distinct cause and a distinct fix. Don't merge their alerts.
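One way to keep the five alerts separate is to classify each observation into its own named failure mode before anything is routed. The sketch below is a minimal illustration, not our production code: the `Observation` shape, its field names, and the mode names are all hypothetical.

```typescript
// Hypothetical summary of what one probe cycle observed for a domain.
interface Observation {
  dnsPointsToUs: boolean;          // outcome of a DNS drift check
  acmeChallengeReachable: boolean; // could the CA fetch the challenge token?
  chainValid: boolean;
  hostnameMatchesSan: boolean;
  daysUntilExpiry: number;         // negative once the cert has expired
  originStatus: number;            // HTTP status from the origin during validation
}

type FailureMode =
  | "DNS_DRIFT"
  | "ACME_CHALLENGE_FAILURE"
  | "CHAIN_OR_HOSTNAME_MISMATCH"
  | "HARD_EXPIRY"
  | "ORIGIN_5XX";

// Returns every failure mode that applies, so each one can fire its own alert.
function classify(o: Observation): FailureMode[] {
  const modes: FailureMode[] = [];
  if (!o.dnsPointsToUs) modes.push("DNS_DRIFT");
  if (!o.acmeChallengeReachable) modes.push("ACME_CHALLENGE_FAILURE");
  if (!o.chainValid || !o.hostnameMatchesSan) modes.push("CHAIN_OR_HOSTNAME_MISMATCH");
  if (o.daysUntilExpiry <= 0) modes.push("HARD_EXPIRY");
  if (o.originStatus >= 500 && o.originStatus <= 599) modes.push("ORIGIN_5XX");
  return modes;
}
```

Returning a list rather than a single value means a domain that has drifted and also expired produces two alerts with two different owners, instead of one merged alert that hides the second cause.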
The 30-7-1 expiry ladder
Most teams alert on a single threshold ("cert expires in < 30 days"). At scale that loses information. Use three:
| Bucket | Signal | Action |
|---|---|---|
| 30 days out | ticket-worthy | Open a ticket, route to your on-call queue, don't page. ACME should have renewed by now; investigate why it didn't. |
| 7 days out | escalate to chat | Slack channel alert during business hours. Automatically trigger a forced renewal; if that fails, page the next business day. |
| 1 day out | page immediately | Real outage incoming. Page the on-call engineer right now. |
The trick is that the 30-day signal isn't an emergency — it's a backlog warning. Most failures show up here, and you have plenty of time. The 1-day signal is when something is genuinely on fire.
// monitor.ts (simplified)
for await (const { hostname, expiresAt } of streamCerts()) {
  const days = (expiresAt - Date.now()) / 86_400_000;
  if (days < 1) await page("CERT_HARD_EXPIRY", hostname);
  else if (days < 7) await slack("CERT_EXPIRING_SOON", hostname);
  else if (days < 30) await ticket("CERT_RENEWAL_LATE", hostname);
}
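If you want to unit-test the ladder's boundaries, it can help to pull the bucketing out of the loop into a pure function that takes the current time as an argument. This is a sketch under assumptions: `expiryBucket` is a hypothetical name, and it assumes expiry is stored as epoch milliseconds.

```typescript
type ExpiryBucket =
  | { action: "page"; code: "CERT_HARD_EXPIRY" }
  | { action: "slack"; code: "CERT_EXPIRING_SOON" }
  | { action: "ticket"; code: "CERT_RENEWAL_LATE" }
  | { action: "none" };

const MS_PER_DAY = 86_400_000;

// Pure version of the 30-7-1 ladder: no I/O, and the clock is passed in.
function expiryBucket(expiresAtMs: number, nowMs: number): ExpiryBucket {
  const days = (expiresAtMs - nowMs) / MS_PER_DAY;
  if (days < 1) return { action: "page", code: "CERT_HARD_EXPIRY" };
  if (days < 7) return { action: "slack", code: "CERT_EXPIRING_SOON" };
  if (days < 30) return { action: "ticket", code: "CERT_RENEWAL_LATE" };
  return { action: "none" };
}
```

Already-expired certs produce a negative `days` value and land in the page bucket, which matches the rule that serving an expired cert is always a page.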
DNS drift: the silent killer
Customer adds the CNAME, domain goes live, three months pass. They redesign their site, get advice to "use Cloudflare," flip the orange cloud on, forget you exist. The cert was renewing fine until the day Cloudflare started intercepting /.well-known/acme-challenge/... and serving its own 404.
You won't catch this from a cert probe — the cert is still valid for another 60 days. The signal is in the DNS:
// dns-monitor.ts
import { Resolver } from "node:dns/promises";

const resolver = new Resolver();
resolver.setServers(["1.1.1.1", "8.8.8.8"]); // bypass your local cache

async function detectDrift(hostname: string) {
  const cnames = await resolver.resolveCname(hostname).catch(() => []);
  const ips = await resolver.resolve4(hostname).catch(() => []);
  const expectedCnameTargets = ["edge.domainee.dev"];
  const expectedEdgeIps = await getOurEdgeIps(); // cached, refresh hourly
  const cnamePointsToUs = cnames.some((c) => expectedCnameTargets.includes(c));
  const ipPointsToUs = ips.some((ip) => expectedEdgeIps.has(ip));
  if (!cnamePointsToUs && !ipPointsToUs) {
    return { drift: true, observedCnames: cnames, observedIps: ips };
  }
  return { drift: false };
}
Run this once a day per domain, not hourly. Most drift takes 24+ hours of being broken before it matters. Detected drift goes into a Slack channel — not a page, because the customer almost certainly did this themselves.
Sampling instead of probing every domain
At 10,000 domains, you don't probe every one hourly. You probe a sliding window — say 5% per hour, full rotation every 20 hours — plus a targeted probe whenever a specific signal demands it (cert renewal attempted, DNS state changed, customer hit your dashboard).
This catches genuine issues fast (most failures persist for hours; you'll see them within 20) while shedding 95% of the load.
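The arithmetic behind that claim is easy to check. The helper names below are hypothetical, and the sketch assumes the hourly slots are disjoint so that every domain is probed exactly once per rotation.

```typescript
// Handshakes per day when a given fraction of the fleet is probed each hour.
function handshakesPerDay(domains: number, fractionPerHour: number): number {
  return Math.round(domains * fractionPerHour * 24);
}

// Hours until every domain has been probed once, assuming disjoint hourly slots.
function hoursPerFullRotation(fractionPerHour: number): number {
  return Math.round(1 / fractionPerHour);
}

const everyDomainHourly = handshakesPerDay(10_000, 1);  // 240000
const sampled = handshakesPerDay(10_000, 0.05);         // 12000
const rotationHours = hoursPerFullRotation(0.05);       // 20
const loadShed = 1 - sampled / everyDomainHourly;       // about 0.95
```

At 10,000 domains that is 240,000 handshakes a day for hourly probing versus 12,000 with a 5% window, before counting any targeted probes.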
function pickProbeSet(allDomains: Domain[], sampleRate = 0.05) {
  // deterministic shard: hash(hostname) % slotCount === currentSlot
  const slotCount = Math.round(1 / sampleRate); // 0.05 -> 20 hourly slots
  const slot = Math.floor(Date.now() / 3_600_000) % slotCount;
  return allDomains.filter((d) => hashSlot(d.hostname, sampleRate) === slot);
}
Important: deterministic sharding (hash-based, not random). Same domain probes in the same slot every cycle, which makes "this domain hasn't been probed in 24h" a meaningful alert on its own.
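`hashSlot` is left undefined above. Here is one possible implementation, a sketch rather than what we actually run: it uses a 32-bit FNV-1a hash (our choice for illustration; any stable hash works) and assumes hostnames are ASCII, which holds for punycode-encoded names. The `isStale` helper is also hypothetical and shows the "not probed in 24h" check.

```typescript
// 32-bit FNV-1a over the string's char codes (equivalent to bytes for ASCII).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// Maps a hostname to a stable slot in [0, slotCount).
function hashSlot(hostname: string, sampleRate: number): number {
  const slotCount = Math.round(1 / sampleRate);
  return fnv1a(hostname.toLowerCase()) % slotCount;
}

// True when a domain has gone longer than maxAgeMs without a probe.
function isStale(lastProbedAtMs: number, nowMs: number, maxAgeMs = 24 * 3_600_000): boolean {
  return nowMs - lastProbedAtMs > maxAgeMs;
}
```

Lowercasing before hashing keeps `Shop.Example.com` and `shop.example.com` in the same slot. With a 20-hour rotation, a 24-hour staleness threshold leaves four hours of slack before a missed slot turns into an alert.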
What to log on every probe
The bare minimum, structured:
- hostname
- probedAt
- success (boolean)
- tlsHandshakeMs
- certSubject (first SAN)
- certIssuer (CN of issuer)
- certNotBefore
- certNotAfter
- chainValid (boolean)
- hostnameMatchesSan (boolean)
- responseStatus (from a HEAD request after handshake)
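Typed out, the fields above might look like the interface below. The timestamp encoding and the nullable `responseStatus` are assumptions on our part. The `matchesSan` helper is a hypothetical sketch of how `hostnameMatchesSan` could be computed, using the common rule that a wildcard covers exactly one leftmost label.

```typescript
interface ProbeRecord {
  hostname: string;
  probedAt: string;              // assumed ISO 8601 timestamp
  success: boolean;
  tlsHandshakeMs: number;
  certSubject: string;           // first SAN
  certIssuer: string;            // CN of issuer
  certNotBefore: string;
  certNotAfter: string;
  chainValid: boolean;
  hostnameMatchesSan: boolean;
  responseStatus: number | null; // assumed null when the HEAD request never completed
}

// Checks a hostname against a cert's SAN list, including "*.example.com" entries.
function matchesSan(hostname: string, sans: string[]): boolean {
  const host = hostname.toLowerCase().replace(/\.$/, ""); // drop trailing dot
  return sans.some((san) => {
    const pattern = san.toLowerCase();
    if (!pattern.startsWith("*.")) return pattern === host;
    // A wildcard stands in for exactly one leftmost label.
    const suffix = pattern.slice(1); // e.g. ".example.com"
    const firstDot = host.indexOf(".");
    return firstDot > 0 && host.slice(firstDot) === suffix;
  });
}
```

Under this rule `*.example.com` matches `shop.example.com` but neither `example.com` nor `a.b.example.com`, which is the mismatch that tends to surface during migrations.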
You'll be glad you have all of these when an incident hits. The two queries that pull weight during a real outage:
-- What did we see in the last hour for this customer's domains?
SELECT * FROM probes
WHERE (hostname = 'acme.com' OR hostname LIKE '%.acme.com')
  AND probed_at > now() - interval '1 hour';
-- Across all domains, has anything started failing in the last 15 minutes?
SELECT issuer, count(*) FROM probes
WHERE success = false
  AND probed_at > now() - interval '15 minutes'
GROUP BY issuer;
The second query catches issuer-wide problems early. Last year we caught a Let's Encrypt staging outage 12 minutes before they posted about it.
Page-worthy vs ticket-worthy
Most teams over-page. The default should be ticket-worthy; pages should be reserved for things that are currently breaking real user traffic.
| Signal | Action |
|---|---|
| Cert expiry < 24h, renewal failed | Page |
| Cert expiry < 24h, renewal succeeded, not yet propagated | Ticket |
| TLS handshake failing > 5 min for one domain | Slack, escalate to page if > 30 min |
| TLS handshake failing on > 5 domains within 5 min (regional pattern) | Page |
| DNS drift detected on a single domain | Slack |
| DNS drift on > 10 domains in 10 min | Page (something systemic) |
| ACME challenge 5xx rate > 2% | Page |
| Origin reachability < 95% during validation window | Ticket (might be your origin) |
The pattern: single-domain issues go to a ticket or Slack unless expiry is imminent. Fleet-wide patterns are pages. The exception is "we're already serving an expired cert to real users": that's always a page.
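The single-domain versus fleet-wide split can be sketched as a sliding window that counts distinct affected hostnames. `FleetWindow` is a hypothetical name and this is a minimal in-memory illustration: it assumes events arrive in roughly chronological order and that one process sees all of them.

```typescript
type Route = "slack" | "page";

class FleetWindow {
  private events: { hostname: string; atMs: number }[] = [];
  private readonly windowMs: number;
  private readonly pageAbove: number;

  constructor(windowMs: number, pageAbove: number) {
    this.windowMs = windowMs;
    this.pageAbove = pageAbove;
  }

  // Records one failure and decides where it should be routed.
  record(hostname: string, atMs: number): Route {
    this.events.push({ hostname, atMs });
    // Drop events that have aged out of the window.
    this.events = this.events.filter((e) => atMs - e.atMs <= this.windowMs);
    // Count distinct hostnames so one flapping domain can't trigger a page.
    const distinct = new Set(this.events.map((e) => e.hostname)).size;
    return distinct > this.pageAbove ? "page" : "slack";
  }
}

// "DNS drift on > 10 domains in 10 min" from the table above.
const driftWindow = new FleetWindow(10 * 60_000, 10);
```

Each drift detection calls `driftWindow.record(hostname, Date.now())`. The first ten distinct domains inside the window go to Slack, and the eleventh flips the route to a page.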
How Domainee handles it
We monitor every domain on the edge — TLS health, DNS drift, renewal status, origin reachability — and fire webhooks on every state change. You wire one endpoint and stop running cron jobs.
If you're rolling your own, the playbook above is the one we wish someone had handed us at our 1,000th hostname. If you'd rather skip to "it just works," sign up at /sign-up. First 50 domains free.