Axelspire

Part of the Certificate Automation Guide

High Availability and Disaster Recovery for PKI

Your certificates can't take a day off. Neither can the infrastructure that manages them.

When a certificate expires unexpectedly, the outage is immediate and visible. When your PKI infrastructure fails, the outage is slower but potentially worse — you can't issue new certificates, you can't revoke compromised ones, and you may not be able to validate existing ones.

High availability and disaster recovery for PKI are insurance policies most enterprises buy after their first major incident. The smart ones buy before.


High Availability vs Disaster Recovery

These terms are often used interchangeably. They're not the same thing.

High availability (HA) is about continuous uptime during normal operations. Redundant components, automatic failover, no single points of failure. HA answers: "What happens when a server dies?"

Disaster recovery (DR) is about restoration after catastrophic failure. Backup sites, data recovery, business continuity. DR answers: "What happens when the data centre floods?"

PKI needs both, but they require different architectural decisions.


High Availability for PKI

HA for PKI means certificate services remain available when individual components fail.

What needs to stay available:

Certificate validation (OCSP, CRL distribution). Every TLS connection validates certificates. If validation fails, connections fail. OCSP responders and CRL distribution points must be highly available.

Certificate issuance. Less critical for short-term outages — you can survive hours without issuing new certificates. But if you're doing short-lived certificates (24-hour TTLs in a Vault PKI model), issuance availability becomes critical.

Certificate management platforms. Venafi, Keyfactor, your automation infrastructure. If the platform is down, automation stops. Manual workarounds are possible but painful.

HA patterns:

Active-active for validation. Multiple OCSP responders behind a load balancer. CRL distribution through CDN or replicated file servers. Validation traffic is read-only, so active-active is straightforward.

Active-passive for issuance. CA infrastructure is typically stateful (database, key material). Active-passive with automated failover is simpler than active-active and sufficient for most issuance workloads.

Geographic distribution. OCSP responders in multiple regions. CRL caches close to consumers. Issuance may be centralised, but validation should be distributed.
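The active-active validation pattern can be reduced to a simple health decision: the pool is fine while at least two responders answer, degraded when only one is left, and down when none are. A minimal sketch, with illustrative responder names and thresholds that are assumptions, not taken from any product:

```python
# Sketch: classify the health of an active-active OCSP responder pool,
# as a load balancer or monitoring check might. Names are illustrative.

def pool_status(responders: dict[str, bool]) -> str:
    """healthy = redundancy intact, degraded = one responder left,
    down = validation outage (TLS handshakes will start failing)."""
    healthy = sum(1 for up in responders.values() if up)
    if healthy == 0:
        return "down"
    if healthy == 1:
        return "degraded"  # still serving, but the next failure is an outage
    return "healthy"

print(pool_status({"ocsp-eu-1": True, "ocsp-eu-2": True, "ocsp-us-1": False}))
# → healthy
```

The useful signal is "degraded": the service still works, which is exactly why silent loss of redundancy goes unnoticed without an explicit check.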


Disaster Recovery for PKI

DR for PKI means restoring certificate services after infrastructure destruction.

What needs to be recoverable:

CA private keys. Obviously. If you lose the CA private key, you need to reissue everything that CA signed. This is why root CA keys belong in HSMs with backup procedures.

Certificate database. Issued certificates, revocation status, validity information. Losing this means losing the ability to revoke or validate.

Configuration and policies. Certificate profiles, automation rules, integration configurations. Rebuilding from scratch is painful and error-prone.

DR patterns:

HSM key backup. HSMs support backup to cloned HSMs or encrypted files. Test your restore procedure. Quarterly. Untested backups aren't backups.

Database replication. Synchronous replication to the DR site for zero data loss (RPO=0). Asynchronous for lower operational complexity with some data loss risk.

Infrastructure as code. CA configuration, certificate profiles, automation rules — all version controlled and deployable. If you can't rebuild from scratch with a script, your DR plan is manual.
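The key-restore test above has a concrete shape: sign a challenge with the restored key and verify it against the CA certificate's public key. A minimal sketch using the Python cryptography library, where a freshly generated RSA key stands in for the key restored from HSM backup (an assumption so the sketch is self-contained):

```python
# Sketch: verify a restored CA private key can still sign and matches the
# CA certificate. A generated key stands in for the restored key here.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

restored_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
ca_public_key = restored_key.public_key()  # in reality: taken from the CA certificate

challenge = b"dr-drill-challenge"  # any fresh test message
signature = restored_key.sign(challenge, padding.PKCS1v15(), hashes.SHA256())

# verify() raises InvalidSignature if the restored key does not match the cert
ca_public_key.verify(signature, challenge, padding.PKCS1v15(), hashes.SHA256())
print("restored key signs and matches the CA certificate")
```

Run this against the actual restored key material during the quarterly drill, not against a stand-in; the point is to exercise the real HSM restore path.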


The Business Case for PKI HA/DR

PKI infrastructure is "invisible until it breaks." Executives don't see the value of redundancy until they're in a war room explaining why the website is down.

Calculate your exposure:

Annual PKI Risk Exposure = 
  (Probability of outage) × (Hours of downtime) × (Cost per hour)

For an e-commerce company:

  • Probability of significant PKI incident: 15-25% per year (industry data)
  • Average downtime without HA/DR: 4-8 hours
  • Revenue impact: £50K-£500K per hour depending on scale
  • Reputation and customer trust impact: harder to quantify, often larger

A £500K investment in HA/DR pays for itself the first time it prevents a single 8-hour outage at £100K/hour.

The costs nobody budgets:

Incident response labour. 40+ person-hours per significant PKI incident. Senior engineers, managers, executives. Not cheap.

Root cause analysis. Post-incident reviews, compliance documentation, customer communications. Weeks of effort.

Expedited remediation. Emergency CA ceremonies, weekend HSM vendor support, rushed changes with elevated risk.

Audit findings. Incidents generate compliance findings. Findings require remediation projects. Projects consume budget.

The infrastructure investment is visible. The absence of incidents is invisible. This is why PKI HA/DR is chronically underfunded.


HA/DR for Specific Platforms

Microsoft Active Directory Certificate Services

ADCS HA options are limited by Microsoft's architecture.

Enterprise CAs can be clustered using Windows Failover Clustering. The CA role moves between nodes on failure. This provides HA for issuance.

OCSP responders should be deployed on multiple servers with load balancing. Microsoft's Online Responder supports array configuration.

CRL distribution should use redundant CDP locations — DFS replicated shares, IIS arrays, or external CDN.

DR for ADCS means backing up the CA database and private keys, and having procedures to restore onto new infrastructure. The CA certificate itself is the critical asset — as long as clients trust it, you can restore operations.

HashiCorp Vault PKI

Vault supports multi-node clusters with integrated Raft storage or a Consul backend. HA is a core Vault capability; the replication features below require Vault Enterprise.

Cluster configuration. 3 or 5 node clusters across availability zones. Vault handles leader election and failover.

Performance replication. Enterprise feature for geographic distribution. Performance standbys can serve PKI read requests.

Disaster recovery replication. Asynchronous replication to a DR cluster in another region. Manual promotion on primary failure.

For Vault PKI specifically, note that the PKI secrets engine stores its CA in Vault's encrypted storage. Vault's DR protects the CA key material. HSM integration (Vault Enterprise) adds hardware-backed key protection.
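During failover drills it helps to know which role each Vault node is playing. Vault's /v1/sys/health endpoint signals this through HTTP status codes; the mapping below follows Vault's documented codes, but verify them against your Vault version. The classifier function is a monitoring-side convenience, not part of Vault itself:

```python
# Sketch: interpret Vault /v1/sys/health HTTP status codes to identify
# a node's role during failover. Check these against your Vault version.

HEALTH_CODES = {
    200: "active",
    429: "unsealed standby",
    472: "DR secondary (active)",
    473: "performance standby",
    501: "not initialised",
    503: "sealed",
}

def vault_role(status_code: int) -> str:
    return HEALTH_CODES.get(status_code, f"unexpected status {status_code}")

print(vault_role(200))  # → active
print(vault_role(472))  # → DR secondary (active)
```

Polling each cluster member's health endpoint and mapping the codes this way gives a one-glance view of who is active, who is standing by, and whether the DR cluster has been promoted.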

Venafi

Venafi Trust Protection Platform supports HA through clustering.

Platform clustering. Multiple TPP servers behind a load balancer. Shared database (SQL Server with availability groups recommended).

Distributed agents. Venafi agents connect to any available platform server. Agent HA is built-in.

HSM considerations. If using HSM-backed trust anchors, HSM availability becomes part of the PKI HA story. Luna HSM clustering or HSM-as-a-Service addresses this.

For DR, Venafi's database contains certificate inventory, policy configuration, and automation state. Replicate the database to the DR site; restore platform servers from configuration management.

Keyfactor

Keyfactor Command supports similar patterns to Venafi.

Application server clustering. Multiple Command servers, load balanced, shared database.

Database HA. SQL Server availability groups, PostgreSQL streaming replication, or cloud-managed database with built-in HA.

EJBCA HA (if using Keyfactor's CA). EJBCA supports active-passive clustering with database replication.

3AM

3AM is designed for resilience from the ground up.

Distributed architecture. Discovery agents, issuance bridge, intelligence layer all designed for horizontal scaling and geographic distribution.

99.99% SLA. Built-in HA/DR with automatic failover. Certificate operations continue even during infrastructure events.

Zero-touch recovery. If something fails, it recovers without manual intervention. The goal is operations that require oversight, not operations that require heroes.


Recovery Objectives: RTO and RPO

Recovery Time Objective (RTO): How long can you be down?

For certificate validation: minutes. Users can't tolerate failed TLS connections. OCSP must recover quickly or have always-on redundancy.

For certificate issuance: hours to days, depending on certificate lifetime. If your certificates live for a year, being unable to issue for 8 hours is annoying, not catastrophic. If your certificates live for 24 hours, issuance is on the critical path.

For certificate management: days. You can survive without the management platform longer than you can survive without the operational infrastructure. But longer outages create backlogs and risk.
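The issuance RTO logic above can be made concrete: if certificates renew when roughly a third of their lifetime remains (a common ACME-style policy, assumed here rather than a standard), then the renewal window bounds how long issuance can be down before live certificates start expiring. A minimal sketch:

```python
# Sketch: rough upper bound on tolerable issuance downtime, assuming
# renewal begins with about a third of certificate lifetime remaining.
# The renew fraction is an illustrative assumption, not a standard.
from datetime import timedelta

def max_issuance_outage(cert_lifetime: timedelta, renew_fraction: float = 1/3) -> timedelta:
    """An outage longer than the renewal window risks live certs expiring."""
    return cert_lifetime * renew_fraction

print(max_issuance_outage(timedelta(days=365)))  # ~4 months of slack
print(max_issuance_outage(timedelta(hours=24)))  # ~8 hours: issuance is critical path
```

The same arithmetic explains why moving to short-lived certificates silently tightens your issuance RTO: the certificates change, and the infrastructure requirement changes with them.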

Recovery Point Objective (RPO): How much data can you lose?

For CA private keys: zero. You cannot lose CA keys. HSM backup procedures must guarantee this.

For certificate database: minutes to hours. Losing recent issuance records is painful but recoverable — you can reissue. Losing revocation records is worse — you can't prove what was revoked.

For configuration: hours to days. If infrastructure is code, you rebuild from source control. The risk is configuration drift — production state that's not in source control.


Testing Your HA/DR

Untested HA/DR is hope, not infrastructure.

Test quarterly:

Failover testing. Kill a server, verify services fail over. Time the recovery. Document the process.

DR drills. Simulate primary site loss. Can you actually restore to DR? How long does it take? What breaks that nobody expected?

Key recovery. Restore CA keys from HSM backup. Verify the restored keys can sign. This is the test nobody does until they need it.

Runbook validation. Follow the documented procedures exactly. Where are they wrong? Where are they incomplete?
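Timing the recovery, as the failover test above asks, is worth scripting rather than eyeballing. A minimal sketch: poll a health probe until the service answers again and report the elapsed time. The probe is any callable you supply (an OCSP query, an issuance API call); a fake probe stands in here so the sketch is self-contained:

```python
# Sketch: measure recovery time during a failover drill. `probe` returns
# True once the service answers; the fake probe below is a stand-in.
import time

def time_recovery(probe, timeout_s: float = 300, interval_s: float = 1.0) -> float:
    """Poll until the probe succeeds; return seconds elapsed, or raise."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if probe():
            return time.monotonic() - start
        time.sleep(interval_s)
    raise TimeoutError("service did not recover within the drill window")

# Fake probe: 'recovers' on the third poll
attempts = iter([False, False, True])
print(f"recovered in {time_recovery(lambda: next(attempts), interval_s=0.01):.2f}s")
```

Recording the measured number each quarter turns "we think failover is fast" into a trend line, and makes regressions in recovery time visible before an incident does.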

Chaos engineering for PKI:

If you're mature enough, intentionally introduce failures in production. Netflix-style chaos engineering applied to certificate infrastructure. Most organisations aren't ready for this, but it's the gold standard.


Common HA/DR Failures

"The backup was there, but nobody tested restore." HSM backup procedures that were never validated. Database backups that can't actually be restored. Recovery documentation that's outdated.

"Failover worked, but nobody knew it happened." Monitoring gaps that hide silent failures. Primary systems fail over to secondary, secondary runs for weeks, secondary fails and there's no tertiary.

"DR site couldn't handle production load." DR infrastructure that's undersized because it's "just for emergencies." Emergency happens, DR can't cope.

"Configuration drift." Primary and secondary diverged over time. Failover introduces subtle differences. Production behaviour changes unexpectedly.

"The DR site is in the same failure domain." "Geographic redundancy" that's actually two racks in the same data centre. Same power, same network, same flood zone.


Building Resilient Certificate Infrastructure

HA/DR is necessary but not sufficient. It protects against infrastructure failure, not operational failure.

Most certificate outages aren't caused by server crashes or data centre floods. They're caused by certificates nobody knew about expiring, renewal processes that failed silently, and changes that broke automation.

3AM addresses both layers:

Infrastructure resilience. Built-in HA/DR, 99.99% availability, automatic recovery.

Operational resilience. Visibility into your entire certificate estate, predictive analytics that identify risks before they become incidents, dependency mapping that shows blast radius.

The goal isn't just "infrastructure stays up." It's "certificates don't cause outages." Those require different solutions.


Assess Your PKI Resilience

How confident are you in your current HA/DR posture? When did you last test failover? Do you know your actual RTO and RPO?

3AM includes certificate infrastructure assessment as part of discovery. See your resilience gaps alongside your certificate inventory.

Book a discovery conversation →

Further Reading