
Disaster Recovery Guide

Comprehensive procedures for recovering MCPSafe services from failures and maintaining business continuity.

  • Max RTO (full system): 30 min
  • Max RPO (database): 1 hour
  • Backup frequency: every 6 hours
  • Target uptime: 99.9%

Key Terms

RTO (Recovery Time Objective)

Maximum acceptable time to restore service after a failure.

RPO (Recovery Point Objective)

Maximum acceptable data loss measured in time (e.g., 1 hour of data).

Recovery Procedures

Database Recovery

Severity: Critical

Procedures for recovering the PostgreSQL database from backups or point-in-time recovery.

RTO: < 15 minutes
RPO: < 1 hour

Recovery Steps

  1. Assess the nature of the database failure (corruption, data loss, hardware failure)
  2. Stop all services connected to the database to prevent further writes
  3. Identify the most recent valid backup or WAL position for recovery
  4. Restore from the latest backup using pg_restore or point-in-time recovery
  5. Validate data integrity after restoration
  6. Restart dependent services and verify functionality

Commands

# Stop services
docker-compose stop api scanner

# Restore PostgreSQL from backup
pg_restore -h localhost -U postgres -d mcpsafe \
  --clean --if-exists backup_latest.dump

# For point-in-time recovery (if WAL archiving is enabled), set in
# postgresql.conf rather than the shell:
#   recovery_target_time = '2024-01-15 10:30:00'

# Verify data integrity
psql -h localhost -U postgres -d mcpsafe \
  -c "SELECT COUNT(*) FROM servers;"

# Restart services
docker-compose up -d api scanner
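
For point-in-time recovery, the target settings live in the PostgreSQL server configuration, not the shell. A minimal sketch for PostgreSQL 12 and later, assuming WAL segments are archived to /backups/wal (both paths are assumptions to adapt to your layout):

```ini
# postgresql.conf (PostgreSQL 12+) recovery settings; paths are assumptions
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2024-01-15 10:30:00'
recovery_target_action = 'promote'
```

After restoring the base backup into the data directory, create an empty recovery.signal file there and start the server; PostgreSQL replays archived WAL up to the target time and then promotes to normal operation.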

Redis Cache Recovery

Severity: High

Procedures for recovering Redis cache and job queue functionality.

RTO: < 5 minutes
RPO: N/A (cache)

Recovery Steps

  1. Identify the Redis failure type (memory, persistence, connectivity)
  2. Check Redis AOF and RDB persistence files for recovery options
  3. Restart the Redis container or restore from persistence files
  4. Clear stale cache entries if needed
  5. Verify job queue functionality
  6. Monitor for any queued jobs that need reprocessing

Commands

# Check Redis health
docker exec mcpsafe-redis redis-cli ping

# Restart Redis container
docker-compose restart redis

# If data recovery needed from AOF
docker exec mcpsafe-redis redis-cli BGREWRITEAOF

# Clear all cache (if starting fresh)
docker exec mcpsafe-redis redis-cli FLUSHALL

# Verify Redis connectivity
docker exec mcpsafe-redis redis-cli INFO replication

Scanner Service Recovery

Severity: High

Procedures for recovering the security scanner service.

RTO: < 10 minutes
RPO: N/A (stateless)

Recovery Steps

  1. Check scanner container logs for error diagnosis
  2. Verify Redis connectivity (the scanner depends on Redis for its job queue)
  3. Restart the scanner container
  4. Verify the health endpoint is responding
  5. Test a sample scan to confirm functionality
  6. Check for any queued scans that need reprocessing

Commands

# Check scanner logs
docker logs mcpsafe-scanner --tail 100

# Check scanner health from inside the container (its health check covers the Redis connection)
docker exec mcpsafe-scanner curl -f http://localhost:8001/health

# Restart scanner
docker-compose restart scanner

# Test health endpoint
curl http://localhost:8001/health

# Test a sample scan
curl -X POST http://localhost:8001/scan \
  -H "Content-Type: application/json" \
  -d '{"url": "https://github.com/test/repo"}'

Full System Recovery

Severity: Critical

Complete system recovery procedure for catastrophic failures.

RTO: < 30 minutes
RPO: < 1 hour

Recovery Steps

  1. Assess the scope of the failure and affected components
  2. Provision new infrastructure if hardware failure occurred
  3. Restore the PostgreSQL database from the latest backup
  4. Restore Redis data or accept a cache rebuild
  5. Deploy application containers with verified images
  6. Run integration tests to verify system functionality
  7. Restore DNS/load balancer configuration if affected
  8. Notify stakeholders of recovery completion

Commands

# Pull latest verified images
docker-compose pull

# Restore database first (most critical)
pg_restore -h localhost -U postgres -d mcpsafe \
  --clean --if-exists /backups/latest.dump

# Start all services
docker-compose up -d

# Wait for health checks
sleep 30

# Verify all services are healthy
docker-compose ps

# Run smoke tests
curl http://localhost:3000/health
curl http://localhost:3001/health
curl http://localhost:8001/health

# Check database connectivity
docker exec mcpsafe-postgres pg_isready -U postgres
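
A fixed "sleep 30" either wastes time or returns before services are actually up. A small retry helper, sketched here in bash (the function name and timings are ours, not part of MCPSafe):

```shell
#!/usr/bin/env bash
# wait_for_health CMD RETRIES DELAY
# Re-run CMD until it succeeds, up to RETRIES attempts, sleeping DELAY
# seconds between attempts. Returns 0 on success, 1 if it never succeeds.
wait_for_health() {
  local cmd="$1" retries="$2" delay="$3" i
  for ((i = 1; i <= retries; i++)); do
    if eval "$cmd"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_health "curl -fsS http://localhost:8001/health" 10 3 polls the scanner for up to 30 seconds and exits as soon as the endpoint responds.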

Backup Schedule

Component     | Frequency        | Retention        | Type
PostgreSQL    | Every 6 hours    | 30 days          | Full backup + WAL archiving
Redis         | Every hour (AOF) | 24 hours         | AOF + RDB snapshots
Configuration | On change        | Indefinite (Git) | Version controlled
Secrets       | On change        | Indefinite       | Encrypted vault backup
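
The retention windows above can be enforced with a small pruning helper run from cron. The function below is a sketch; the directory layout and the .dump naming convention are assumptions:

```shell
#!/usr/bin/env bash
# prune_backups DIR DAYS
# Delete *.dump backup files in DIR whose modification time is older than
# DAYS days, enforcing the retention window from the backup schedule.
prune_backups() {
  local dir="$1" days="$2"
  find "$dir" -name '*.dump' -type f -mtime "+$days" -delete
}
```

A wrapper script that sources this function and runs prune_backups /backups/postgres 30 from a daily cron entry would keep the 30-day PostgreSQL window.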

Service Degradation Procedures

When full recovery is not immediately possible, use these procedures to maintain partial service availability.

Database Overload

Symptoms

  • Slow query responses
  • Connection pool exhaustion
  • High CPU usage

Mitigation Actions

  • Enable read replicas if available
  • Implement connection pooling (PgBouncer)
  • Temporarily disable non-critical features
  • Scale up database resources
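
For the PgBouncer option, a minimal pgbouncer.ini sketch; the database name matches this guide, but the pool sizes and file paths are assumptions to tune for your workload:

```ini
; pgbouncer.ini - minimal sketch; pool sizes and paths are assumptions
[databases]
mcpsafe = host=localhost port=5432 dbname=mcpsafe

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 200
```

Pointing the api and scanner connection strings at port 6432 keeps the PostgreSQL connection count bounded even when client connections spike; transaction pooling gives the best reuse for short queries.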

Scanner Service Failure

Symptoms

  • Scan requests timing out
  • Health checks failing
  • High error rates

Mitigation Actions

  • Route traffic to backup scanner instance
  • Queue incoming scans for later processing
  • Display maintenance message to users
  • Investigate and restart failed containers

Redis Failure

Symptoms

  • Cache misses
  • Job queue not processing
  • Session issues

Mitigation Actions

  • Fall back to direct database queries
  • Queue jobs in memory temporarily
  • Restart Redis with persistence recovery
  • Clear and rebuild cache if necessary

Network Partition

Symptoms

  • Intermittent connectivity
  • Service-to-service failures
  • DNS issues

Mitigation Actions

  • Identify affected network segments
  • Fail over to healthy availability zones
  • Enable circuit breakers for failing services
  • Communicate with cloud provider if infrastructure issue

Emergency Response

Incident Classification

  • P0: Complete service outage (all-hands response)
  • P1: Major feature unavailable (primary on-call response)
  • P2: Degraded performance (standard escalation)

Response Checklist

  1. Acknowledge the incident and assess severity
  2. Notify relevant stakeholders
  3. Begin investigation and mitigation
  4. Document actions taken in the incident log
  5. Communicate status updates every 15 minutes for P0/P1
  6. Complete a post-incident review within 48 hours

Recovery Testing

Regular Testing Schedule

Regular testing ensures recovery procedures work when needed.

Backup Verification

Weekly automated restore tests to staging environment.
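
An automated restore test should fail loudly if the restored data looks empty. One way to script that final check, sketched in bash (the helper name, threshold, and staging host are illustrative):

```shell
#!/usr/bin/env bash
# check_min_rows MIN CMD...
# Run CMD (expected to print a single row count) and fail unless the
# count is at least MIN. Useful as the last step of a restore test.
check_min_rows() {
  local min="$1"; shift
  local count
  count=$("$@" | tr -d '[:space:]')
  [ "$count" -ge "$min" ]
}
```

For example: check_min_rows 1000 psql -At -h staging-db -U postgres -d mcpsafe -c "SELECT COUNT(*) FROM servers;" fails the test run if the restored servers table holds fewer than 1000 rows.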


Failover Drills

Quarterly full failover exercises to secondary region.


Runbook Reviews

Monthly review and update of recovery procedures.


Full DR Test

Annual complete disaster recovery simulation.
