Production Checklist

Complete this checklist before deploying OpenPrime to production.

Security

Authentication & Authorization

  • Keycloak hardened

    • Admin console secured (IP allowlist or VPN)
    • Admin password changed from default
    • Brute force protection enabled
    • Password policies configured
  • HTTPS everywhere

    • Valid SSL certificates installed
    • HTTP redirects to HTTPS
    • HSTS headers enabled
  • CORS configured

    • Only allowed origins specified
    • No wildcards in production
  • Rate limiting enabled

    • API rate limits configured
    • Login attempt limits set
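
The CORS items above can be checked automatically before a rollout. The following is a minimal sketch, not OpenPrime's actual configuration format: the helper name `validate_cors_origins` and the example origins are hypothetical, and it assumes the allowed origins are available as a plain list of strings.

```python
from urllib.parse import urlparse


def validate_cors_origins(origins):
    """Return a list of problems found in a CORS origin allowlist.

    Hypothetical helper: flags an empty list, any wildcard entry,
    and any entry that is not an explicit https:// origin.
    """
    problems = []
    if not origins:
        problems.append("no allowed origins specified")
    for origin in origins:
        if "*" in origin:
            problems.append(f"wildcard not allowed: {origin}")
            continue
        parsed = urlparse(origin)
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append(f"not an explicit https origin: {origin}")
    return problems
```

An empty result means the list passes; a CI step could fail the deployment whenever the list is non-empty.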

Secrets Management

  • Encryption key secured

    • 32-byte random key generated
    • Stored in secret manager (Vault, AWS Secrets Manager, etc.)
    • Key rotation procedure documented
  • Database credentials

    • Strong passwords generated
    • Stored in secret manager
    • Not in environment files
  • No secrets in code

    • Git history reviewed
    • .env files gitignored
    • Secrets scanning enabled
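
One way to produce the 32-byte random key mentioned above is Python's standard `secrets` module, which draws from the operating system's cryptographically secure random source. This is a sketch under assumptions: the function name `generate_encryption_key` is hypothetical, and base64 is only one possible encoding, so confirm the format OpenPrime expects before storing the key.

```python
import base64
import secrets


def generate_encryption_key():
    """Generate a 32-byte random key and return it base64-encoded.

    Assumption: base64 text is a convenient form for pasting into a
    secret manager; the encoding OpenPrime expects is not stated here.
    """
    key = secrets.token_bytes(32)  # 32 bytes = 256 bits from the OS CSPRNG
    return base64.b64encode(key).decode("ascii")
```

Generate the key on a trusted machine and write it directly to the secret manager rather than to a file or shell history.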

Network Security

  • Database not publicly accessible

    • Private subnet only
    • Security groups configured
  • Network policies

    • Pod-to-pod communication restricted
    • Egress rules defined
  • WAF configured (if applicable)

    • SQL injection protection
    • XSS protection
    • Rate limiting

Infrastructure

High Availability

  • Multiple replicas

    • Frontend: 2+ replicas
    • Backend: 3+ replicas
    • Database: Primary + replicas
  • Multi-zone deployment

    • Pods spread across availability zones
    • Database in multi-AZ configuration
  • Pod disruption budgets

    • Minimum available pods defined
    • Rolling update strategy configured

Scalability

  • Autoscaling configured

    • HPA for frontend/backend
    • CPU/memory thresholds defined
    • Max replicas set appropriately
  • Resource limits

    • CPU requests/limits set
    • Memory requests/limits set
    • Tested under load

Database

  • Backups configured

    • Automated daily backups
    • Point-in-time recovery enabled
    • Backup retention policy (30+ days)
    • Backup restoration tested
  • Connection pooling

    • PgBouncer or similar configured
    • Max connections appropriate
  • Monitoring

    • Slow query logging enabled
    • Connection monitoring
    • Storage alerts

Monitoring & Observability

Logging

  • Centralized logging

    • All services log to central location
    • Log retention policy defined
    • Sensitive data filtered from logs
  • Log levels appropriate

    • Production: info level
    • Debug logs disabled

Metrics

  • Application metrics

    • Request latency tracked
    • Error rates monitored
    • Custom business metrics
  • Infrastructure metrics

    • CPU/memory utilization
    • Disk usage
    • Network I/O

Alerting

  • Critical alerts configured

    • Service down
    • High error rate (>1%)
    • High latency (p99 > 2s)
    • Database connection failures
  • Alert routing

    • On-call schedule defined
    • Escalation policy configured
    • PagerDuty/Opsgenie integrated
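
To make the two numeric thresholds above concrete (error rate above 1%, p99 latency above 2s), here is a small sketch of how they could be evaluated over one window of request data. In practice these rules normally live in the monitoring system; the function names and alert labels below are hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from what a given metrics backend computes.

```python
def p99(latencies_s):
    """Nearest-rank 99th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(latencies_s)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def critical_alerts(total_requests, failed_requests, latencies_s):
    """Return the names of the threshold alerts that fire for one window."""
    alerts = []
    # High error rate: strictly more than 1% of requests failed.
    if total_requests > 0 and failed_requests / total_requests > 0.01:
        alerts.append("high_error_rate")
    # High latency: p99 strictly above 2 seconds.
    if latencies_s and p99(latencies_s) > 2.0:
        alerts.append("high_latency")
    return alerts
```

For example, 11 failures out of 1000 requests crosses the 1% threshold, while exactly 10 failures does not.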

Health Checks

  • Liveness probes

    • All services have liveness probes
    • Appropriate thresholds set
  • Readiness probes

    • All services have readiness probes
    • Dependencies checked
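
The difference between the two probe types can be sketched as follows. This is an illustrative outline, not OpenPrime's actual health endpoints: the function names and the shape of `dependency_checks` are assumptions. It follows the common practice of keeping liveness free of dependency checks, so that a database outage marks a pod as not ready instead of triggering a restart.

```python
def liveness():
    """Liveness: report that the process is running; checks no dependencies."""
    return {"status": "ok"}


def readiness(dependency_checks):
    """Readiness: ready only if every dependency check passes.

    dependency_checks maps a dependency name to a zero-argument
    callable that returns True when the dependency is reachable.
    """
    failed = []
    for name, check in dependency_checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            ok = False
        if not ok:
            failed.append(name)
    return {"ready": not failed, "failed": failed}
```

A web handler would typically map `ready: False` to an HTTP 503 response so the orchestrator stops routing traffic to the pod.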

Performance

Load Testing

  • Load tests completed

    • Expected peak load tested
    • 2x peak load tested
    • Response times acceptable
  • Stress tests completed

    • Breaking point identified
    • Graceful degradation verified

Optimization

  • Database queries optimized

    • Indexes in place
    • Slow queries identified and fixed
    • N+1 queries eliminated
  • Caching configured

    • Static assets cached (CDN)
    • API responses cached where appropriate

Operations

Deployment

  • CI/CD pipeline

    • Automated tests run
    • Security scanning
    • Automated deployment
  • Rollback procedure

    • Quick rollback tested
    • Database rollback plan
    • Documented procedure
  • Blue-green or canary deployments

    • Zero-downtime deployments
    • Traffic shifting capability

Disaster Recovery

  • Recovery plan documented

    • RTO defined (Recovery Time Objective)
    • RPO defined (Recovery Point Objective)
    • Step-by-step procedures
  • DR tested

    • Backup restoration tested
    • Failover tested
    • Recovery time measured

Documentation

  • Runbooks created

    • Common issues documented
    • Escalation procedures
    • Contact information
  • Architecture documented

    • Network diagrams
    • Data flow diagrams
    • Integration points

Compliance

Data Protection

  • Data classification

    • Sensitive data identified
    • Encryption requirements met
  • Retention policies

    • Data retention defined
    • Deletion procedures documented

Audit

  • Audit logging

    • User actions logged
    • Admin actions logged
    • Logs tamper-proof
  • Access reviews

    • Regular access reviews scheduled
    • Principle of least privilege applied

Final Verification

Pre-Launch

  • All checklist items completed
  • Security review completed
  • Load testing passed
  • DR test completed
  • Team trained on operations

Launch Day

  • Monitoring dashboards ready
  • On-call schedule confirmed
  • Rollback plan ready
  • Communication plan ready
  • Support channels ready