Certificate Expiry Outage Response: Emergency Runbook & Prevention
Respond to certificate outages and expired-certificate incidents: an emergency runbook, triage procedures, renewal-failure handling (including Certbot renew exit codes), and prevention measures so expiry stops causing incidents.
Expired Certificate Outages
Certificate expiry outages are preventable disasters that occur when certificates exceed their validity period. Despite being entirely predictable (every certificate has a known expiration date), they remain one of the most common causes of production incidents. This page covers emergency response procedures, root cause analysis, and prevention strategies to eliminate expiry-related outages.
Emergency first step: Identify expired certificates, issue emergency replacements, and deploy immediately. Prevention requires monitoring, automation, and organizational accountability.
Overview
Certificate expiry is unique among infrastructure failures: it’s completely predictable yet continues to cause major outages across the industry. High-profile incidents include LinkedIn (2023), Microsoft Teams (2023), Spotify (2022), and Ericsson’s cellular network outage (2018) affecting millions of users.
The paradox: organizations know exactly when certificates will expire, yet still experience outages. This stems from organizational failures, not technical limitations.
Anatomy of an Expiry Outage
Timeline of a Typical Incident
T-90 days: Certificate approaches expiration
- Monitoring alerts generated (if monitoring exists)
- Alerts potentially ignored, filtered, or routed incorrectly
- No action taken
T-30 days: Escalation threshold
- Higher-severity alerts should trigger
- Ownership unclear: who is responsible for renewal?
- Competing priorities delay action
T-7 days: Critical threshold
- Emergency procedures should activate
- Change freeze policies may block deployment
- Testing requirements conflict with urgency
T-0 (Expiry): Certificate expires
- TLS handshakes begin failing
- Clients reject connections
- Service becomes unavailable
- War room activated
T+0 to T+4 hours: Emergency response
- Identify expired certificate (harder than expected)
- Issue emergency certificate
- Navigate change control processes
- Deploy across all affected systems
- Validate restoration
Why Expiry Outages Happen
Organizational failures:
- No ownership: “Someone else’s job” mentality
- Alert fatigue: Too many low-priority alerts drown critical ones
- Process gaps: Manual renewal processes don’t scale
- Change control conflicts: Security policies block emergency deployments
- Knowledge silos: Only one person knows how to renew specific certificates
Technical failures:
- Discovery gaps: Unknown certificates that can’t be renewed
- Deployment complexity: Renewal requires coordinating dozens of systems
- Testing requirements: Can’t validate renewals without production-like environment
- Automation failures: Automated renewal fails silently weeks before expiry
Communication failures:
- Stakeholder notifications: Service owners unaware of expiring certificates
- Cross-team dependencies: Certificate used by multiple teams
- Vendor coordination: Third-party systems need advance notice
- Documentation gaps: Renewal procedures outdated or incomplete
Emergency Response Procedures
Phase 1: Immediate Triage (0-15 minutes)
Step 1: Confirm certificate expiry
```bash
# Quick verification of expired certificate
echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null | \
  openssl x509 -noout -dates

# Output shows:
# notBefore=Jan  1 00:00:00 2024 GMT
# notAfter=Nov  8 23:59:59 2025 GMT
# If notAfter is in the past, the certificate has expired

# Check current time vs expiry
current_time=$(date -u +%s)
cert_expiry=$(date -d "Nov 8 23:59:59 2025" +%s)
if [ "$current_time" -gt "$cert_expiry" ]; then
  echo "EXPIRED: Certificate expired $(( (current_time - cert_expiry) / 86400 )) days ago"
fi
```

Step 2: Assess blast radius
```python
def assess_outage_scope(expired_cert_fingerprint: str) -> OutageScope:
    """
    Determine which services are affected by the expired certificate
    """
    scope = OutageScope()

    # Query inventory for all locations using this certificate
    locations = certificate_inventory.find_by_fingerprint(
        expired_cert_fingerprint
    )

    for location in locations:
        # Check if the service is down
        health_status = check_service_health(location.hostname, location.port)

        if health_status.down:
            scope.affected_services.append({
                'hostname': location.hostname,
                'port': location.port,
                'service_name': location.service_name,
                'criticality': location.criticality,
                'user_count': location.user_count,
                'revenue_per_hour': location.revenue_per_hour,
                'last_working': health_status.last_successful_check
            })

    # Estimate business impact
    scope.affected_users = sum(s['user_count'] for s in scope.affected_services)
    scope.revenue_impact_per_hour = sum(
        s['revenue_per_hour'] for s in scope.affected_services
    )

    return scope
```

Step 3: Activate incident response
```yaml
incident_response:
  severity: P1  # Certificate expiry affecting production is always P1

  immediate_actions:
    - page: platform-sre-oncall
    - page: security-oncall
    - notify: service-owner
    - create: incident-channel  # e.g. #incident-cert-expiry-2025-11-09

  communication_plan:
    internal:
      - post_to: "#incidents"
      - notify: engineering-leadership
      - update: status_page

    external:
      - update: customer_status_page
      - notify: enterprise_customers  # If contractual SLA breach

  roles:
    incident_commander: platform-sre-oncall
    technical_lead: pki-team-lead
    communications: customer-support-lead
```

Phase 2: Emergency Certificate Issuance (15-45 minutes)
Option A: Standard CA (fastest for known certificates)
```bash
# Generate CSR from existing private key (if available)
openssl req -new \
  -key /backup/api.example.com.key \
  -out emergency-renewal.csr \
  -subj "/CN=api.example.com" \
  -addext "subjectAltName=DNS:api.example.com,DNS:www.api.example.com"

# Submit to CA (automated ACME if available)
certbot certonly \
  --manual \
  --preferred-challenges dns \
  --domain api.example.com \
  --domain www.api.example.com \
  --csr emergency-renewal.csr

# Or manual submission to an enterprise CA
curl -X POST https://ca.corp.example.com/issue \
  -H "Authorization: Bearer $EMERGENCY_TOKEN" \
  -F "csr=@emergency-renewal.csr" \
  -F "profile=tls-server-emergency" \
  -F "validity_days=90" \
  -o new-cert.pem
```

Option B: Generate new keypair (if private key unavailable)
```python
from typing import List, Tuple

from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

def generate_emergency_certificate(
    common_name: str,
    san_list: List[str],
) -> Tuple[bytes, bytes]:
    """
    Generate an emergency private key and CSR.
    Use only as a last resort when the CA is unavailable.
    """
    # Generate private key
    private_key = rsa.generate_private_key(
        public_exponent=65537,
        key_size=2048,  # Use 2048 for speed in an emergency
    )

    # Build CSR
    csr = x509.CertificateSigningRequestBuilder().subject_name(
        x509.Name([
            x509.NameAttribute(NameOID.COMMON_NAME, common_name),
            x509.NameAttribute(NameOID.ORGANIZATION_NAME, "Emergency"),
        ])
    ).add_extension(
        x509.SubjectAlternativeName([
            x509.DNSName(name) for name in san_list
        ]),
        critical=False,
    ).sign(private_key, hashes.SHA256())

    # Serialize private key
    private_key_pem = private_key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption()  # Emergency: no passphrase
    )

    # Serialize CSR
    csr_pem = csr.public_bytes(serialization.Encoding.PEM)

    return private_key_pem, csr_pem

# Usage in an emergency
private_key, csr = generate_emergency_certificate(
    common_name="api.example.com",
    san_list=["api.example.com", "www.api.example.com"]
)

# Save for emergency CA submission
with open("/tmp/emergency.key", "wb") as f:
    f.write(private_key)
with open("/tmp/emergency.csr", "wb") as f:
    f.write(csr)
```

Option C: Temporary self-signed certificate (absolute last resort)
```bash
# ONLY use if:
# 1. CA completely unavailable
# 2. Clients can temporarily accept self-signed
# 3. Will be replaced with a CA-signed certificate within 24 hours

openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout emergency-selfsigned.key \
  -out emergency-selfsigned.crt \
  -days 7 \
  -subj "/CN=api.example.com/O=Emergency Temporary" \
  -addext "subjectAltName=DNS:api.example.com"

# WARNING: Clients will reject this unless configured to trust it
# Document this as technical debt requiring an immediate fix
```

Phase 3: Emergency Deployment (45-90 minutes)
Deployment strategies by risk level:
High-risk (revenue-critical services):
```bash
#!/bin/bash
# emergency-cert-deploy.sh - Coordinated deployment for critical services

set -euo pipefail

CERT_FILE="/emergency/new-cert.pem"
KEY_FILE="/emergency/new-cert.key"
CHAIN_FILE="/emergency/ca-chain.pem"

# Validation before deployment
echo "=== Pre-deployment validation ==="
openssl x509 -in "$CERT_FILE" -noout -subject -dates -fingerprint

# Verify certificate and key match
cert_modulus=$(openssl x509 -noout -modulus -in "$CERT_FILE" | openssl md5)
key_modulus=$(openssl rsa -noout -modulus -in "$KEY_FILE" | openssl md5)

if [ "$cert_modulus" != "$key_modulus" ]; then
  echo "ERROR: Certificate and key don't match!"
  exit 1
fi

# Verify certificate not expired
if ! openssl x509 -checkend 0 -noout -in "$CERT_FILE"; then
  echo "ERROR: New certificate is already expired!"
  exit 1
fi

# Backup existing certificates
echo "=== Backing up existing certificates ==="
BACKUP_DIR="/backup/certs/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

for target in "${DEPLOYMENT_TARGETS[@]}"; do
  ssh "$target" "cp /etc/ssl/certs/server.crt $BACKUP_DIR/server.crt.$target"
  ssh "$target" "cp /etc/ssl/private/server.key $BACKUP_DIR/server.key.$target"
done

# Deploy to canary first
echo "=== Deploying to canary ==="
CANARY="${DEPLOYMENT_TARGETS[0]}"

scp "$CERT_FILE" "$CANARY:/etc/ssl/certs/server.crt"
scp "$KEY_FILE" "$CANARY:/etc/ssl/private/server.key"
scp "$CHAIN_FILE" "$CANARY:/etc/ssl/certs/ca-chain.pem"

# Reload service on canary
ssh "$CANARY" "systemctl reload nginx"

# Wait and verify canary
echo "=== Verifying canary ==="
sleep 10

if ! curl -vI "https://$CANARY" 2>&1 | grep -q "SSL certificate verify ok"; then
  echo "ERROR: Canary verification failed!"
  # Rollback canary
  ssh "$CANARY" "cp $BACKUP_DIR/server.crt.$CANARY /etc/ssl/certs/server.crt"
  ssh "$CANARY" "cp $BACKUP_DIR/server.key.$CANARY /etc/ssl/private/server.key"
  ssh "$CANARY" "systemctl reload nginx"
  exit 1
fi

echo "=== Canary successful, deploying to fleet ==="

# Deploy to remaining targets in parallel
for target in "${DEPLOYMENT_TARGETS[@]:1}"; do
  (
    scp "$CERT_FILE" "$target:/etc/ssl/certs/server.crt"
    scp "$KEY_FILE" "$target:/etc/ssl/private/server.key"
    scp "$CHAIN_FILE" "$target:/etc/ssl/certs/ca-chain.pem"
    ssh "$target" "systemctl reload nginx"

    # Verify each target
    if curl -vI "https://$target" 2>&1 | grep -q "SSL certificate verify ok"; then
      echo "✓ $target deployed successfully"
    else
      echo "✗ $target verification failed"
    fi
  ) &
done

wait

echo "=== Deployment complete ==="
```

Medium-risk (internal services):
```python
import asyncio
import aiohttp
from typing import List, Dict

async def deploy_certificate_parallel(
    targets: List[Dict],
    cert_path: str,
    key_path: str
) -> Dict[str, bool]:
    """
    Deploy certificate to multiple targets in parallel
    """
    results = {}

    async def deploy_to_target(target: Dict) -> bool:
        """Deploy to a single target"""
        try:
            # Copy certificate files
            await run_ssh_command(
                target['hostname'],
                "cat > /tmp/new-cert.pem",
                stdin=open(cert_path, 'rb')
            )

            await run_ssh_command(
                target['hostname'],
                "cat > /tmp/new-key.pem",
                stdin=open(key_path, 'rb')
            )

            # Install certificates
            await run_ssh_command(
                target['hostname'],
                "sudo cp /tmp/new-cert.pem /etc/ssl/certs/server.crt && "
                "sudo cp /tmp/new-key.pem /etc/ssl/private/server.key && "
                "sudo systemctl reload nginx"
            )

            # Verify
            async with aiohttp.ClientSession() as session:
                async with session.get(
                    f"https://{target['hostname']}/health",
                    timeout=aiohttp.ClientTimeout(total=10)
                ) as response:
                    return response.status == 200

        except Exception as e:
            logger.error(f"Deployment to {target['hostname']} failed: {e}")
            return False

    # Deploy in parallel
    tasks = [deploy_to_target(target) for target in targets]
    results_list = await asyncio.gather(*tasks, return_exceptions=True)

    # Map results
    for target, result in zip(targets, results_list):
        results[target['hostname']] = (
            result if isinstance(result, bool) else False
        )

    return results
```

Phase 4: Verification and Monitoring (90-120 minutes)
Comprehensive verification checklist:
```python
class EmergencyDeploymentVerification:
    """
    Verify emergency certificate deployment was successful
    """

    def verify_all(self, deployment: EmergencyDeployment) -> VerificationReport:
        """
        Run all verification checks
        """
        report = VerificationReport()

        # 1. Certificate validity
        report.add_check(
            "Certificate Validity",
            self.verify_certificate_valid(deployment.cert_path)
        )

        # 2. Certificate matches private key
        report.add_check(
            "Key Match",
            self.verify_key_match(deployment.cert_path, deployment.key_path)
        )

        # 3. Trust chain complete
        report.add_check(
            "Trust Chain",
            self.verify_trust_chain(deployment.cert_path, deployment.chain_path)
        )

        # 4. All targets serving new certificate
        report.add_check(
            "Deployment Coverage",
            self.verify_all_targets_updated(deployment.targets)
        )

        # 5. TLS handshake successful
        report.add_check(
            "TLS Handshake",
            self.verify_tls_handshake(deployment.targets)
        )

        # 6. No certificate warnings
        report.add_check(
            "No Warnings",
            self.verify_no_warnings(deployment.targets)
        )

        # 7. Application health restored
        report.add_check(
            "Application Health",
            self.verify_application_health(deployment.targets)
        )

        # 8. Error rates normalized
        report.add_check(
            "Error Rates",
            self.verify_error_rates_normal(deployment.service_name)
        )

        return report

    def verify_certificate_valid(self, cert_path: str) -> CheckResult:
        """Verify certificate is currently valid"""
        cert = load_certificate(cert_path)
        now = datetime.now(timezone.utc)

        if now < cert.not_before:
            return CheckResult(
                passed=False,
                message=f"Certificate not yet valid (starts {cert.not_before})"
            )

        if now > cert.not_after:
            return CheckResult(
                passed=False,
                message=f"Certificate expired at {cert.not_after}"
            )

        days_until_expiry = (cert.not_after - now).days

        return CheckResult(
            passed=True,
            message=f"Certificate valid for {days_until_expiry} more days",
            details={'expires': cert.not_after.isoformat()}
        )

    def verify_all_targets_updated(
        self,
        targets: List[str]
    ) -> CheckResult:
        """Verify all targets are serving the new certificate"""
        new_fingerprint = self.get_certificate_fingerprint(self.cert_path)

        mismatches = []
        for target in targets:
            live_fingerprint = self.get_live_certificate_fingerprint(target)

            if live_fingerprint != new_fingerprint:
                mismatches.append({
                    'target': target,
                    'expected': new_fingerprint,
                    'actual': live_fingerprint
                })

        if mismatches:
            return CheckResult(
                passed=False,
                message=f"{len(mismatches)} targets not serving new certificate",
                details={'mismatches': mismatches}
            )

        return CheckResult(
            passed=True,
            message=f"All {len(targets)} targets serving new certificate"
        )
```

Post-deployment monitoring:
```yaml
monitoring_checklist:
  immediate:
    - metric: tls_handshake_success_rate
      threshold: "> 99.9%"
      duration: 15_minutes

    - metric: http_error_rate_5xx
      threshold: "< 0.1%"
      duration: 15_minutes

    - metric: certificate_expiry_seconds
      threshold: "> 7_days"
      alert_if_below: true

  sustained:
    - metric: application_request_latency_p99
      threshold: "< baseline + 10%"
      duration: 1_hour

    - metric: customer_reported_issues
      threshold: "< 5"
      duration: 4_hours

    - metric: ssl_verification_errors
      threshold: "0"
      duration: 4_hours
```

Phase 5: Communication and Documentation
Incident timeline documentation:
```markdown
# Incident Report: Certificate Expiry - api.example.com

## Timeline (All times UTC)

**2025-11-09 14:23** - Certificate expired
**2025-11-09 14:31** - First customer reports received
**2025-11-09 14:35** - Monitoring alerts triggered (8 min delay)
**2025-11-09 14:42** - Incident declared, war room activated
**2025-11-09 14:50** - Expired certificate identified
**2025-11-09 15:05** - Emergency CSR generated and submitted
**2025-11-09 15:23** - New certificate issued by CA
**2025-11-09 15:35** - Certificate deployed to canary
**2025-11-09 15:42** - Canary validated, fleet deployment started
**2025-11-09 16:08** - All targets updated and verified
**2025-11-09 16:30** - Service fully restored, incident closed
**2025-11-09 18:00** - Post-mortem scheduled

## Impact

- **Duration**: 1 hour 45 minutes (14:23 - 16:08 UTC)
- **Affected Services**: api.example.com (REST API)
- **User Impact**: ~15,000 API requests failed
- **Revenue Impact**: Estimated $45,000 (based on transaction volume)
- **Customer Reports**: 37 support tickets

## Root Cause

Certificate api.example.com expired on 2025-11-09 14:23 UTC.

**Contributing factors**:
1. Automated renewal failed 45 days prior (logs show CA rate limit)
2. Backup manual process not executed
3. Monitoring alerts were routed to an unmaintained email alias
4. Certificate not included in weekly expiry reports (discovery gap)

## Resolution

Emergency certificate issued and deployed across all 23 API gateway instances.

## Follow-up Actions

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add cert to discovery inventory | PKI Team | 2025-11-11 | Done |
| Fix monitoring alert routing | SRE Team | 2025-11-12 | In Progress |
| Implement auto-renewal retry logic | Platform Team | 2025-11-20 | Planned |
| Audit all certificates for discovery gaps | Security Team | 2025-11-30 | Planned |
```

Root Cause Analysis
The Five Whys
Problem: Production certificate expired causing outage
Why 1: Why did the certificate expire?
- Renewal process was not executed
Why 2: Why was renewal not executed?
- Automated renewal failed 45 days earlier
Why 3: Why did automated renewal fail?
- CA rate limit exceeded due to retry loop bug
Why 4: Why didn’t backup manual process trigger?
- Manual process depends on email alerts to unmaintained alias
Why 5: Why were alerts going to unmaintained alias?
- Team reorganization changed ownership, documentation not updated
Root cause: Organizational failure to maintain alert routing and backup processes, compounded by insufficient monitoring of automation health.
Common Root Causes
Technical:
```python
from dataclasses import dataclass

@dataclass
class TechnicalRootCause:
    """Common technical causes of expiry outages"""

    # Automation failures
    automation_failure = "ACME client stopped working, fails silently"
    rate_limiting = "Hit CA rate limits, renewal blocked"
    credential_expiry = "API credentials for CA expired"

    # Discovery failures
    unknown_certificate = "Certificate not in inventory system"
    shadow_it = "Certificate created outside standard process"
    forgotten_certificate = "Certificate on decommissioned-but-running system"

    # Deployment failures
    deployment_complexity = "Renewal requires coordinating 50+ systems"
    change_freeze = "Change control policy blocked deployment"
    testing_requirements = "No test environment to validate renewal"
```

Organizational:
```python
@dataclass
class OrganizationalRootCause:
    """Common organizational causes"""

    # Ownership failures
    unclear_ownership = "No one knows who owns this certificate"
    team_turnover = "Person who knew how to renew left company"
    cross_team_dependencies = "Certificate used by 3 teams, coordination failed"

    # Process failures
    manual_processes = "Renewal requires manual steps that were skipped"
    competing_priorities = "Renewal deprioritized for feature work"
    alert_fatigue = "Too many false positives, real alert ignored"

    # Communication failures
    notification_gaps = "Stakeholders not notified of upcoming expiry"
    documentation_rot = "Runbook outdated, incorrect procedures"
    knowledge_silos = "Only one person knows renewal procedure"
```

Prevention Strategies
Layer 1: Monitoring and Alerting
Multi-tier alert strategy:
```yaml
alert_tiers:
  # Tier 1: Early warning (90 days)
  early_warning:
    trigger: 90_days_before_expiry
    severity: info
    action: create_renewal_ticket
    recipients:
      - certificate_owner
      - pki_team

  # Tier 2: Action required (60 days)
  action_required:
    trigger: 60_days_before_expiry
    severity: warning
    action: escalate_to_manager
    recipients:
      - certificate_owner
      - owner_manager
      - pki_team
    frequency: weekly

  # Tier 3: Urgent (30 days)
  urgent:
    trigger: 30_days_before_expiry
    severity: high
    action: page_oncall
    recipients:
      - certificate_owner
      - owner_manager
      - director_infrastructure
      - pki_team
    frequency: daily

  # Tier 4: Critical (14 days)
  critical:
    trigger: 14_days_before_expiry
    severity: critical
    action: emergency_process
    recipients:
      - all_engineering
      - vp_engineering
      - cto
    frequency: every_6_hours

  # Tier 5: Emergency (7 days)
  emergency:
    trigger: 7_days_before_expiry
    severity: p1
    action: immediate_renewal_required
    recipients:
      - incident_response_team
      - executive_team
    frequency: every_hour
```

Alert validation:
```python
class AlertValidation:
    """
    Ensure alerts are actually working
    """

    def validate_alert_pipeline(self):
        """
        Test alert delivery end-to-end
        """
        # Create test certificate expiring in 89 days
        test_cert = self.create_test_certificate(days_valid=89)

        # Register in monitoring
        self.register_certificate(test_cert)

        # Wait for 90-day alert
        alert_received = self.wait_for_alert(
            cert_id=test_cert.id,
            threshold="90_days",
            timeout=timedelta(hours=2)
        )

        if not alert_received:
            raise AlertPipelineFailure(
                "90-day alert not received within 2 hours"
            )

        # Validate alert content
        assert alert_received.severity == "info"
        assert alert_received.contains_renewal_instructions
        assert test_cert.owner in alert_received.recipients

        # Cleanup test certificate
        self.remove_certificate(test_cert)
```

Layer 2: Automation
Certbot renew exit codes: When automated renewal uses Certbot, a non-zero certbot renew exit code usually means a CA rate limit, an HTTP/DNS validation failure, or a configuration error. Use certbot renew --dry-run to test against the staging environment without consuming production rate limits, and see Certbot Renewal Automation for exit-code meanings and deploy hooks.
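Whatever client is used, the failure-handling pattern is the same: any non-zero exit from the renewal command must be surfaced and escalated, never swallowed. A minimal sketch, with `false` standing in for the real certbot renew invocation so the control flow is runnable anywhere:

```shell
#!/bin/sh
# Stand-in for the real renewal command so this sketch runs without certbot.
# In production this would be something like:
#   renew() { certbot renew --quiet --deploy-hook "systemctl reload nginx"; }
renew() { false; }

if renew; then
  echo "renewal succeeded"
else
  rc=$?
  # Surface the failure loudly: log it, page on-call, open a ticket.
  echo "renewal FAILED with exit code $rc - escalating" >&2
fi
```

Run from cron or a systemd timer, the else branch is where the paging and ticketing hooks belong.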
Automated renewal with retry logic:
```python
class ResilientRenewalSystem:
    """
    Certificate renewal with comprehensive error handling
    """

    def __init__(self):
        self.max_retries = 5
        self.retry_delays = [300, 900, 3600, 14400, 86400]  # 5m, 15m, 1h, 4h, 24h
        self.notification_system = NotificationSystem()

    async def renew_certificate(
        self,
        cert: Certificate
    ) -> RenewalResult:
        """
        Attempt certificate renewal with retry logic
        """
        attempt = 0
        last_error = None

        while attempt < self.max_retries:
            try:
                # Check if already renewed
                if self.recently_renewed(cert):
                    return RenewalResult(
                        success=True,
                        message="Certificate already renewed"
                    )

                # Pre-flight checks
                self.validate_renewal_prerequisites(cert)

                # Attempt renewal
                new_cert = await self.request_new_certificate(cert)

                # Validate new certificate
                self.validate_new_certificate(new_cert, cert)

                # Stage for deployment
                await self.stage_certificate(new_cert)

                # Success
                await self.notification_system.notify_success(cert, new_cert)

                return RenewalResult(
                    success=True,
                    certificate=new_cert,
                    attempt=attempt + 1
                )

            except RateLimitError as e:
                # Rate limited by CA
                last_error = e
                await self.notification_system.notify_rate_limit(
                    cert, attempt, self.retry_delays[attempt]
                )

                if attempt < self.max_retries - 1:
                    await asyncio.sleep(self.retry_delays[attempt])
                    attempt += 1
                else:
                    break

            except CAUnavailableError as e:
                # CA temporarily down
                last_error = e
                await self.notification_system.notify_ca_unavailable(
                    cert, attempt
                )

                if attempt < self.max_retries - 1:
                    await asyncio.sleep(self.retry_delays[attempt])
                    attempt += 1
                else:
                    break

            except ValidationError as e:
                # Validation failed (likely a configuration issue)
                last_error = e
                await self.notification_system.notify_validation_failure(
                    cert, str(e)
                )
                # Don't retry validation errors
                break

            except Exception as e:
                # Unexpected error
                last_error = e
                await self.notification_system.notify_unexpected_error(
                    cert, str(e)
                )

                if attempt < self.max_retries - 1:
                    await asyncio.sleep(self.retry_delays[attempt])
                    attempt += 1
                else:
                    break

        # All retries exhausted
        await self.notification_system.notify_renewal_failure(
            cert, last_error, attempt + 1
        )

        # Escalate to a human
        await self.create_urgent_ticket(cert, last_error)

        return RenewalResult(
            success=False,
            error=last_error,
            attempts=attempt + 1
        )
```

Layer 3: Organizational Controls
Ownership assignment:
```yaml
certificate_ownership:
  assignment_rules:
    # Automatic assignment based on certificate attributes
    - condition: subject_cn.endswith('.api.example.com')
      owner_team: platform-api-team
      backup_owner: platform-sre

    - condition: subject_cn.endswith('.internal.example.com')
      owner_team: infrastructure-team
      backup_owner: security-team

    - condition: usage == 'code_signing'
      owner_team: security-team
      backup_owner: release-engineering

  ownership_validation:
    frequency: quarterly
    process:
      - export_all_certificates_with_owners
      - send_to_each_team_lead
      - require_acknowledgment_within_14_days
      - escalate_unacknowledged_to_director

  handoff_process:
    when_person_leaves:
      - identify_owned_certificates
      - reassign_to_team
      - update_documentation
      - test_renewal_procedures
      - confirm_new_owner_trained
```

Change control exemptions:
```yaml
emergency_exemptions:
  certificate_expiry:
    automatic_approval:
      conditions:
        - certificate_expires_within: 7_days
        - service_criticality: high_or_critical
        - certificate_type: production

      requirements:
        - notify: change_advisory_board
        - document: emergency_justification
        - require: post_incident_review

      deployment_restrictions:
        - must_use: canary_deployment
        - must_verify: before_full_rollout
        - must_monitor: for_4_hours_minimum

    expedited_approval:
      conditions:
        - certificate_expires_within: 30_days
        - normal_change_window_conflict: true

      process:
        - submit: expedited_change_request
        - approval_required_within: 4_hours
        - approved_by: service_owner_and_security_lead
```

Layer 4: Testing and Validation
Chaos engineering for certificate expiry:
```python
class CertificateExpiryGameDay:
    """
    Practice certificate expiry scenarios
    """

    def run_game_day(self):
        """
        Execute game day exercise
        """
        # Select non-critical test service
        test_service = self.select_test_service(
            criticality="low",
            has_redundancy=True
        )

        # Replace certificate with one expiring in 1 hour
        original_cert = test_service.get_certificate()
        expiring_cert = self.create_short_lived_certificate(
            template=original_cert,
            validity=timedelta(hours=1)
        )

        # Deploy expiring certificate
        test_service.deploy_certificate(expiring_cert)

        print(f"""
        === Certificate Expiry Game Day ===

        Service: {test_service.name}
        Expires: {expiring_cert.not_after}
        Time until expiry: 1 hour

        Objectives:
        1. Detect expiring certificate via monitoring
        2. Generate emergency renewal
        3. Deploy renewed certificate
        4. Restore service to normal

        Success criteria:
        - Detection within 5 minutes of alert threshold
        - Renewal completed before expiry
        - Service experiences < 1 minute downtime

        Starting game day timer...
        """)

        # Monitor team response
        response_metrics = self.monitor_team_response(
            test_service,
            expiring_cert,
            timeout=timedelta(hours=2)
        )

        # Generate report
        self.generate_game_day_report(response_metrics)

        # Restore original certificate
        test_service.deploy_certificate(original_cert)
```

Lessons from Major Incidents
Case Study: LinkedIn (2023)
What happened:
- Certificate expired during business hours
- Global outage affecting millions of users
- Duration: Several hours
Contributing factors:
- Automated renewal process had undiscovered bug
- Backup manual process not executed
- Monitoring alerts didn’t escalate appropriately
Lessons learned:
- Test automated renewal regularly in production-like environment
- Require manual validation that automated renewal succeeded
- Implement escalation for failures in automated processes
Case Study: Ericsson (2018)
What happened:
- Software certificate expired in mobile network equipment
- Affected 32 million mobile subscribers across UK and Japan
- Duration: 12+ hours
Contributing factors:
- Certificate embedded in software
- No monitoring for embedded certificate expiry
- Global deployment meant rolling back affected millions
Lessons learned:
- Inventory ALL certificates, including embedded in software/firmware
- Monitor certificate expiry in deployed code, not just infrastructure
- Test certificate rotation in production-representative environment
Case Study: Microsoft Teams (2023)
What happened:
- Authentication certificate expired
- Users unable to access Teams
- Duration: ~4 hours
Contributing factors:
- Certificate in authentication flow
- Complex deployment requiring coordination across multiple services
Lessons learned:
- Authentication certificates require extra scrutiny (affect all users)
- Practice emergency deployment procedures for authentication services
- Maintain emergency communication channels that don’t depend on affected service
Runbook: Certificate Expiry Response
# Runbook: Certificate Expiry Incident Response
## Detection
**Symptoms**:
- TLS handshake failures
- "Certificate expired" errors in logs
- HTTP 502/503 errors from load balancers
- Monitoring alerts: "certificate_expired"
**Verification**:
```bash
# Check certificate expiry
echo | openssl s_client -connect $HOSTNAME:443 -servername $HOSTNAME 2>/dev/null | openssl x509 -noout -dates

# Check current time vs expiry
date -u
```

Immediate Actions (First 15 minutes)
- Declare incident: P1 severity
- Page: Platform SRE, Security team, Service owner
- Create: Incident channel (#incident-cert-expiry-YYYYMMDD)
- Identify: Which certificate expired
- Assess: Blast radius - what services affected
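The identify-and-assess steps can be sketched with the standard library alone, assuming the certificate inventory can be exported as (hostname, notAfter) pairs — all names and dates below are illustrative:

```python
# Sketch: triage an exported certificate inventory during the first 15 minutes.
# Rows are (hostname, notAfter-string) pairs in openssl's date format.
import ssl
import time

def triage(inventory, now=None):
    """Split the inventory into expired vs still-valid certificates."""
    now = time.time() if now is None else now
    expired, valid = [], []
    for hostname, not_after in inventory:
        # ssl.cert_time_to_seconds parses the openssl notAfter format into epoch seconds
        if ssl.cert_time_to_seconds(not_after) <= now:
            expired.append(hostname)
        else:
            valid.append(hostname)
    return expired, valid

inventory = [
    ("api.example.com", "Nov  8 23:59:59 2025 GMT"),
    ("www.example.com", "Jan  1 00:00:00 2030 GMT"),
]
# Fixed "now" (epoch 1767225600 = 2026-01-01 UTC) for a deterministic demo
expired, valid = triage(inventory, now=1767225600.0)
print(expired)  # ['api.example.com']
```

The expired list feeds directly into the blast-radius assessment: each expired hostname is cross-referenced against service criticality before choosing a resolution option below.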
Resolution Steps
Option 1: Cached Valid Certificate Available

```bash
# Check if a valid certificate exists in backup
find /backup/certs -name "*$HOSTNAME*" -mtime -90

# Verify backup certificate is valid
openssl x509 -in /backup/certs/hostname.crt -noout -dates

# Deploy backup certificate
./scripts/deploy-certificate.sh --cert /backup/certs/hostname.crt --key /backup/certs/hostname.key
```

Option 2: Request Emergency Certificate
```bash
# Generate CSR
openssl req -new -key /etc/ssl/private/server.key -out emergency.csr -subj "/CN=$HOSTNAME"

# Submit to CA (ACME or manual)
certbot certonly --csr emergency.csr

# Deploy new certificate
./scripts/deploy-certificate.sh --cert ./0001_cert.pem --key /etc/ssl/private/server.key
```

Option 3: Self-Signed Temporary (Last Resort)
```bash
# Generate temporary self-signed certificate (7-day validity)
openssl req -x509 -newkey rsa:2048 -nodes -keyout temp.key -out temp.crt -days 7 -subj "/CN=$HOSTNAME"

# Deploy temporary certificate
./scripts/deploy-certificate.sh --cert temp.crt --key temp.key

# CREATE URGENT TICKET: Replace temporary with CA-signed within 24 hours
```

Verification
```bash
# Verify certificate deployed
echo | openssl s_client -connect $HOSTNAME:443 -servername $HOSTNAME 2>/dev/null | openssl x509 -noout -subject -dates

# Verify service health
curl -I https://$HOSTNAME/health

# Check error rates returned to normal
./scripts/check-metrics.sh --service $SERVICE_NAME --metric error_rate_5xx
```

Communication
- Update incident channel every 15 minutes
- Update status page
- Notify stakeholders when resolved
- Schedule post-mortem within 48 hours
Post-Incident
- Document timeline in incident ticket
- Identify root cause
- Create prevention action items
- Update runbook with lessons learned
- Conduct post-mortem meeting
## Conclusion
Certificate expiry outages are preventable through proper monitoring, automation, and organizational discipline. The technical solutions exist and are well-understood; failures are almost always organizational. Prevention requires:
1. **Comprehensive discovery**: Can't renew what you don't know exists
2. **Reliable monitoring**: Alerts must reach responsible parties
3. **Automated renewal**: Manual processes don't scale
4. **Clear ownership**: Someone must be accountable
5. **Regular testing**: Practice emergency response procedures
The investment in prevention is orders of magnitude smaller than the cost of outages. A certificate expiry outage is always a failure of process, never a failure of technology.
## Related Documentation
**Troubleshooting**:
- [Common PKI Misconfigurations](/vault/troubleshooting/common-misconfigurations/) - Configuration errors leading to renewal failures
- [Chain Validation Errors](/vault/troubleshooting/chain-validation-errors/) - Certificate verification issues post-renewal
- [Performance Bottlenecks](/vault/troubleshooting/performance-bottlenecks/) - Renewal performance at scale
**Operations & Automation**:
- [Certificate Lifecycle Management](/vault/operations/certificate-lifecycle-management/) - Comprehensive lifecycle automation
- [Monitoring and Alerting](/vault/operations/monitoring-and-alerting/) - Expiration monitoring strategies
- [Renewal Automation](/vault/operations/renewal-automation/) - Automated renewal implementation
- [Certbot Renewal Automation](/vault/acme-clients/certbot-renewal-automation/) - Certbot-specific renewal setup