Disaster Recovery Guide
Comprehensive procedures for recovering MCPSafe services from failures and maintaining business continuity.
Key Terms
RTO (Recovery Time Objective)
Maximum acceptable time to restore service after a failure.
RPO (Recovery Point Objective)
Maximum acceptable data loss measured in time (e.g., 1 hour of data).
Recovery Procedures
Database Recovery
Procedures for recovering the PostgreSQL database from backups or point-in-time recovery.
Recovery Steps
- 1. Assess the nature of the database failure (corruption, data loss, hardware failure)
- 2. Stop all services connected to the database to prevent further writes
- 3. Identify the most recent valid backup or WAL position for recovery
- 4. Restore from the latest backup using pg_restore or point-in-time recovery
- 5. Validate data integrity after restoration
- 6. Restart dependent services and verify functionality
Commands
# Stop services
docker-compose stop api scanner
# Restore PostgreSQL from backup
pg_restore -h localhost -U postgres -d mcpsafe \
--clean --if-exists backup_latest.dump
# For point-in-time recovery (if WAL archiving is enabled):
# set the target in postgresql.conf and create a recovery.signal file (PostgreSQL 12+)
recovery_target_time = '2024-01-15 10:30:00'
# Verify data integrity
psql -h localhost -U postgres -d mcpsafe \
-c "SELECT COUNT(*) FROM servers;"
# Restart services
docker-compose up -d api scanner
Redis Cache Recovery
Procedures for recovering Redis cache and job queue functionality.
Recovery Steps
- 1. Identify the Redis failure type (memory, persistence, connectivity)
- 2. Check Redis AOF and RDB persistence files for recovery options
- 3. Restart Redis container or restore from persistence files
- 4. Clear stale cache entries if needed
- 5. Verify job queue functionality
- 6. Monitor for any queued jobs that need reprocessing
Commands
# Check Redis health
docker exec mcpsafe-redis redis-cli ping
# Restart Redis container
docker-compose restart redis
# Redis replays the AOF automatically on restart; after recovery,
# compact it with a background rewrite
docker exec mcpsafe-redis redis-cli BGREWRITEAOF
# Clear all cache (if starting fresh)
docker exec mcpsafe-redis redis-cli FLUSHALL
# Verify Redis connectivity
docker exec mcpsafe-redis redis-cli INFO replication
Scanner Service Recovery
Procedures for recovering the security scanner service.
Recovery Steps
- 1. Check scanner container logs for error diagnosis
- 2. Verify Redis connectivity (scanner depends on Redis for job queue)
- 3. Restart the scanner container
- 4. Verify health endpoint is responding
- 5. Test a sample scan to confirm functionality
- 6. Check for any queued scans that need reprocessing
Commands
# Check scanner logs
docker logs mcpsafe-scanner --tail 100
# Verify Redis connectivity via the scanner's health endpoint
docker exec mcpsafe-scanner curl -f http://localhost:8001/health
# Restart scanner
docker-compose restart scanner
# Test health endpoint
curl http://localhost:8001/health
# Test a sample scan
curl -X POST http://localhost:8001/scan \
-H "Content-Type: application/json" \
-d '{"url": "https://github.com/test/repo"}'
Full System Recovery
Complete system recovery procedure for catastrophic failures.
Recovery Steps
- 1. Assess the scope of the failure and affected components
- 2. Provision new infrastructure if hardware failure occurred
- 3. Restore PostgreSQL database from latest backup
- 4. Restore Redis data or accept cache rebuild
- 5. Deploy application containers with verified images
- 6. Run integration tests to verify system functionality
- 7. Restore DNS/load balancer configuration if affected
- 8. Notify stakeholders of recovery completion
Commands
# Pull latest verified images
docker-compose pull
# Restore database first (most critical)
pg_restore -h localhost -U postgres -d mcpsafe \
--clean --if-exists /backups/latest.dump
# Start all services
docker-compose up -d
# Wait for health checks
sleep 30
# Verify all services are healthy
docker-compose ps
# Run smoke tests
curl http://localhost:3000/health
curl http://localhost:3001/health
curl http://localhost:8001/health
# Check database connectivity
docker exec mcpsafe-postgres pg_isready -U postgres
Backup Schedule
| Component | Frequency | Retention | Type |
|---|---|---|---|
| PostgreSQL | Every 6 hours | 30 days | Full backup + WAL archiving |
| Redis | Every hour (AOF) | 24 hours | AOF + RDB snapshots |
| Configuration | On change | Indefinite (Git) | Version controlled |
| Secrets | On change | Indefinite | Encrypted vault backup |
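The PostgreSQL row of the schedule above can be automated with a small cron-driven script. This is a minimal sketch under stated assumptions: the backup directory `/backups`, the database name `mcpsafe`, local trust/peer authentication for the `postgres` user, and the 30-day retention from the table are all placeholders to adapt.

```shell
#!/usr/bin/env bash
# Sketch: scheduled PostgreSQL backup with retention pruning.
# Paths, credentials, and retention are assumptions; adjust to your setup.
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/backups}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"

# Generate a timestamped dump filename, e.g. mcpsafe_20240115T103000.dump
backup_filename() {
  local ts="${1:-$(date -u +%Y%m%dT%H%M%S)}"
  echo "mcpsafe_${ts}.dump"
}

# Take a custom-format dump suitable for pg_restore
run_backup() {
  pg_dump -h localhost -U postgres -Fc mcpsafe \
    > "${BACKUP_DIR}/$(backup_filename)"
}

# Delete dumps older than the retention window
prune_old_backups() {
  find "$BACKUP_DIR" -name 'mcpsafe_*.dump' -mtime "+${RETENTION_DAYS}" -delete
}
```

A crontab entry such as `0 */6 * * * /usr/local/bin/pg_backup.sh` would match the every-6-hours cadence; WAL archiving is configured separately in postgresql.conf (`archive_mode`, `archive_command`).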
Service Degradation Procedures
When full recovery is not immediately possible, use these procedures to maintain partial service availability.
Database Overload
Symptoms
- Slow query responses
- Connection pool exhaustion
- High CPU usage
Mitigation Actions
- Enable read replicas if available
- Implement connection pooling (PgBouncer)
- Temporarily disable non-critical features
- Scale up database resources
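For the PgBouncer mitigation, a minimal transaction-pooling configuration could look like the fragment below. Hostnames, ports, pool sizes, and the auth file path are assumptions, not values from this deployment.

```ini
; Sketch of a minimal pgbouncer.ini for transaction pooling
; in front of the mcpsafe database (values are assumptions)
[databases]
mcpsafe = host=localhost port=5432 dbname=mcpsafe

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
```

Applications then connect to port 6432 instead of 5432; transaction pooling caps the number of real PostgreSQL connections while admitting many more clients.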
Scanner Service Failure
Symptoms
- Scan requests timing out
- Health checks failing
- High error rates
Mitigation Actions
- Route traffic to backup scanner instance
- Queue incoming scans for later processing
- Display maintenance message to users
- Investigate and restart failed containers
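The "investigate and restart" action above usually needs a retry loop rather than a single health probe, since the scanner takes time to come back. A small generic wrapper (attempt counts and the health URL are assumptions) can be sketched as:

```shell
#!/usr/bin/env bash
# Sketch: retry a command with a fixed delay before giving up.
# Usage: retry <max_attempts> <delay_seconds> <command...>
retry() {
  local max="$1" delay="$2"
  shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1   # exhausted attempts; caller escalates
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: restart the scanner, then wait for its health endpoint
# docker-compose restart scanner
# retry 10 3 curl -fsS http://localhost:8001/health
```

If `retry` returns non-zero, escalate per the incident classification below rather than restarting in a loop.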
Redis Failure
Symptoms
- Cache misses
- Job queue not processing
- Session issues
Mitigation Actions
- Fall back to direct database queries
- Queue jobs in memory temporarily
- Restart Redis with persistence recovery
- Clear and rebuild cache if necessary
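One way to coordinate the database fallback is a degraded-mode flag that a monitor script flips when Redis stops answering, and that the application (hypothetically) polls to bypass the cache. The flag path and container name are assumptions:

```shell
#!/usr/bin/env bash
# Sketch: flip a degraded-mode flag when Redis is unhealthy, so the
# application can fall back to direct database queries.
# FLAG path and container name are assumptions.
FLAG="${FLAG:-/tmp/mcpsafe_degraded}"

redis_healthy() {
  docker exec mcpsafe-redis redis-cli ping 2>/dev/null | grep -q PONG
}

set_degraded_mode() {
  if [ "$1" = "on" ]; then touch "$FLAG"; else rm -f "$FLAG"; fi
}

degraded() { [ -f "$FLAG" ]; }

# Example monitor loop (run from cron or a sidecar):
# if redis_healthy; then set_degraded_mode off; else set_degraded_mode on; fi
```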
Network Partition
Symptoms
- Intermittent connectivity
- Service-to-service failures
- DNS issues
Mitigation Actions
- Identify affected network segments
- Fail over to healthy availability zones
- Enable circuit breakers for failing services
- Communicate with cloud provider if infrastructure issue
Emergency Response
Incident Classification
- P0: Complete service outage - All hands response
- P1: Major feature unavailable - Primary on-call response
- P2: Degraded performance - Standard escalation
Response Checklist
- 1. Acknowledge the incident and assess severity
- 2. Notify relevant stakeholders
- 3. Begin investigation and mitigation
- 4. Document actions taken in incident log
- 5. Communicate status updates every 15 minutes for P0/P1
- 6. Complete post-incident review within 48 hours
Recovery Testing
Regular Testing Schedule
Regular testing ensures recovery procedures work when needed.
Backup Verification
Weekly automated restore tests to staging environment.
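A weekly restore test can do more than confirm `pg_restore` exits cleanly: recording row counts at backup time and comparing them after the staging restore catches silently truncated dumps. This sketch assumes a `staging-db` host, a `mcpsafe_staging` database, and a `.counts` file written alongside each dump; all are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: verify a restored staging database against counts recorded
# at backup time. Hostnames and file layout are assumptions.

# compare_counts <expected> <actual>: succeed only if they match and are non-zero
compare_counts() {
  [ "$1" -eq "$2" ] && [ "$1" -gt 0 ]
}

# Example flow:
# pg_restore -h staging-db -U postgres -d mcpsafe_staging \
#   --clean --if-exists /backups/latest.dump
# expected=$(cat /backups/latest.counts)     # recorded when the dump was taken
# actual=$(psql -h staging-db -U postgres -d mcpsafe_staging \
#   -tAc "SELECT COUNT(*) FROM servers;")
# compare_counts "$expected" "$actual" || echo "restore verification FAILED" >&2
```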
Failover Drills
Quarterly full failover exercises to secondary region.
Runbook Reviews
Monthly review and update of recovery procedures.
Full DR Test
Annual complete disaster recovery simulation.