
Disaster Recovery Guide

Comprehensive procedures for recovering MCPSafe services from failures and maintaining business continuity.

  • Max RTO (full system): 30 min
  • Max RPO (database): 1 hour
  • Backup frequency: every 6 hours
  • Target uptime: 99.9%

Key Terms

RTO (Recovery Time Objective)

Maximum acceptable time to restore service after a failure.

RPO (Recovery Point Objective)

Maximum acceptable data loss measured in time (e.g., 1 hour of data).

Recovery Procedures

Database Recovery

Severity: Critical

Procedures for recovering the PostgreSQL database from backups or point-in-time recovery.

RTO: < 15 minutes
RPO: < 1 hour

Recovery Steps

  1. Assess the nature of the database failure (corruption, data loss, hardware failure)
  2. Stop all services connected to the database to prevent further writes
  3. Identify the most recent valid backup or WAL position for recovery
  4. Restore from the latest backup using pg_restore or point-in-time recovery
  5. Validate data integrity after restoration
  6. Restart dependent services and verify functionality

Commands

# Stop services
docker-compose stop api scanner

# Restore PostgreSQL from backup
pg_restore -h localhost -U postgres -d mcpsafe \
  --clean --if-exists backup_latest.dump

# For point-in-time recovery (if WAL archiving is enabled), set in
# postgresql.conf rather than the shell:
#   recovery_target_time = '2024-01-15 10:30:00'

# Verify data integrity
psql -h localhost -U postgres -d mcpsafe \
  -c "SELECT COUNT(*) FROM servers;"

# Restart services
docker-compose up -d api scanner
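
For point-in-time recovery, the target settings live in the PostgreSQL server configuration, not the shell. A minimal sketch for PostgreSQL 12 and later, assuming WAL segments are archived to /backups/wal (both paths are assumptions to adapt to your layout):

```ini
# postgresql.conf (PostgreSQL 12+) recovery settings; paths are assumptions
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2024-01-15 10:30:00'
recovery_target_action = 'promote'
```

After restoring the base backup into the data directory, create an empty recovery.signal file there and start the server; PostgreSQL replays archived WAL up to the target time and then promotes to normal operation.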

Redis Cache Recovery

Severity: High

Procedures for recovering Redis cache and job queue functionality.

RTO: < 5 minutes
RPO: N/A (cache)

Recovery Steps

  1. Identify the Redis failure type (memory, persistence, connectivity)
  2. Check Redis AOF and RDB persistence files for recovery options
  3. Restart the Redis container or restore from persistence files
  4. Clear stale cache entries if needed
  5. Verify job queue functionality
  6. Monitor for any queued jobs that need reprocessing

Commands

# Check Redis health
docker exec mcpsafe-redis redis-cli ping

# Restart Redis container
docker-compose restart redis

# If data recovery needed from AOF
docker exec mcpsafe-redis redis-cli BGREWRITEAOF

# Clear all cache (if starting fresh)
docker exec mcpsafe-redis redis-cli FLUSHALL

# Verify Redis connectivity
docker exec mcpsafe-redis redis-cli INFO replication

Scanner Service Recovery

Severity: High

Procedures for recovering the security scanner service.

RTO: < 10 minutes
RPO: N/A (stateless)

Recovery Steps

  1. Check scanner container logs for error diagnosis
  2. Verify Redis connectivity (the scanner depends on Redis for its job queue)
  3. Restart the scanner container
  4. Verify the health endpoint is responding
  5. Test a sample scan to confirm functionality
  6. Check for any queued scans that need reprocessing

Commands

# Check scanner logs
docker logs mcpsafe-scanner --tail 100

# Check scanner health from inside the container (its health check covers the Redis connection)
docker exec mcpsafe-scanner curl -f http://localhost:8001/health

# Restart scanner
docker-compose restart scanner

# Test health endpoint
curl http://localhost:8001/health

# Test a sample scan
curl -X POST http://localhost:8001/scan \
  -H "Content-Type: application/json" \
  -d '{"url": "https://github.com/test/repo"}'

Full System Recovery

Severity: Critical

Complete system recovery procedure for catastrophic failures.

RTO: < 30 minutes
RPO: < 1 hour

Recovery Steps

  1. Assess the scope of the failure and affected components
  2. Provision new infrastructure if hardware failure occurred
  3. Restore the PostgreSQL database from the latest backup
  4. Restore Redis data or accept a cache rebuild
  5. Deploy application containers with verified images
  6. Run integration tests to verify system functionality
  7. Restore DNS/load balancer configuration if affected
  8. Notify stakeholders of recovery completion

Commands

# Pull latest verified images
docker-compose pull

# Restore database first (most critical)
pg_restore -h localhost -U postgres -d mcpsafe \
  --clean --if-exists /backups/latest.dump

# Start all services
docker-compose up -d

# Wait for health checks
sleep 30

# Verify all services are healthy
docker-compose ps

# Run smoke tests
curl http://localhost:3000/health
curl http://localhost:3001/health
curl http://localhost:8001/health

# Check database connectivity
docker exec mcpsafe-postgres pg_isready -U postgres
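
A fixed "sleep 30" either wastes time or returns before services are actually up. A small retry helper, sketched here in bash (the function name and timings are ours, not part of MCPSafe):

```shell
#!/usr/bin/env bash
# wait_for_health CMD RETRIES DELAY
# Re-run CMD until it succeeds, up to RETRIES attempts, sleeping DELAY
# seconds between attempts. Returns 0 on success, 1 if it never succeeds.
wait_for_health() {
  local cmd="$1" retries="$2" delay="$3" i
  for ((i = 1; i <= retries; i++)); do
    if eval "$cmd"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_health "curl -fsS http://localhost:8001/health" 10 3 polls the scanner for up to 30 seconds and exits as soon as the endpoint responds.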

Backup Schedule

Component     | Frequency        | Retention        | Type
PostgreSQL    | Every 6 hours    | 30 days          | Full backup + WAL archiving
Redis         | Every hour (AOF) | 24 hours         | AOF + RDB snapshots
Configuration | On change        | Indefinite (Git) | Version controlled
Secrets       | On change        | Indefinite       | Encrypted vault backup
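
The retention windows above can be enforced with a small pruning helper run from cron. The function below is a sketch; the directory layout and the .dump naming convention are assumptions:

```shell
#!/usr/bin/env bash
# prune_backups DIR DAYS
# Delete *.dump backup files in DIR whose modification time is older than
# DAYS days, enforcing the retention window from the backup schedule.
prune_backups() {
  local dir="$1" days="$2"
  find "$dir" -name '*.dump' -type f -mtime "+$days" -delete
}
```

A wrapper script that sources this function and runs prune_backups /backups/postgres 30 from a daily cron entry would keep the 30-day PostgreSQL window.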

Service Degradation Procedures

When full recovery is not immediately possible, use these procedures to maintain partial service availability.

Database Overload

Symptoms

  • Slow query responses
  • Connection pool exhaustion
  • High CPU usage

Mitigation Actions

  • Enable read replicas if available
  • Implement connection pooling (PgBouncer)
  • Temporarily disable non-critical features
  • Scale up database resources
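
For the PgBouncer option, a minimal pgbouncer.ini sketch; the database name matches this guide, but the pool sizes and file paths are assumptions to tune for your workload:

```ini
; pgbouncer.ini - minimal sketch; pool sizes and paths are assumptions
[databases]
mcpsafe = host=localhost port=5432 dbname=mcpsafe

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 200
```

Pointing the api and scanner connection strings at port 6432 keeps the PostgreSQL connection count bounded even when client connections spike; transaction pooling gives the best reuse for short queries.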

Scanner Service Failure

Symptoms

  • Scan requests timing out
  • Health checks failing
  • High error rates

Mitigation Actions

  • Route traffic to backup scanner instance
  • Queue incoming scans for later processing
  • Display maintenance message to users
  • Investigate and restart failed containers

Redis Failure

Symptoms

  • Cache misses
  • Job queue not processing
  • Session issues

Mitigation Actions

  • Fall back to direct database queries
  • Queue jobs in memory temporarily
  • Restart Redis with persistence recovery
  • Clear and rebuild cache if necessary

Network Partition

Symptoms

  • Intermittent connectivity
  • Service-to-service failures
  • DNS issues

Mitigation Actions

  • Identify affected network segments
  • Fail over to healthy availability zones
  • Enable circuit breakers for failing services
  • Communicate with cloud provider if infrastructure issue

Emergency Response

Incident Classification

  • P0: Complete service outage (all-hands response)
  • P1: Major feature unavailable (primary on-call response)
  • P2: Degraded performance (standard escalation)

Response Checklist

  1. Acknowledge the incident and assess severity
  2. Notify relevant stakeholders
  3. Begin investigation and mitigation
  4. Document actions taken in the incident log
  5. Communicate status updates every 15 minutes for P0/P1
  6. Complete a post-incident review within 48 hours

Recovery Testing

Regular Testing Schedule

Regular testing ensures recovery procedures work when needed.

Backup Verification

Weekly automated restore tests to staging environment.
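
An automated restore test should fail loudly if the restored data looks empty. One way to script that final check, sketched in bash (the helper name, threshold, and staging host are illustrative):

```shell
#!/usr/bin/env bash
# check_min_rows MIN CMD...
# Run CMD (expected to print a single row count) and fail unless the
# count is at least MIN. Useful as the last step of a restore test.
check_min_rows() {
  local min="$1"; shift
  local count
  count=$("$@" | tr -d '[:space:]')
  [ "$count" -ge "$min" ]
}
```

For example: check_min_rows 1000 psql -At -h staging-db -U postgres -d mcpsafe -c "SELECT COUNT(*) FROM servers;" fails the test run if the restored servers table holds fewer than 1000 rows.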


Failover Drills

Quarterly full failover exercises to secondary region.


Runbook Reviews

Monthly review and update of recovery procedures.


Full DR Test

Annual complete disaster recovery simulation.
