Calejo Control Adapter - Operations & Maintenance Guide

Overview

This guide provides comprehensive procedures for daily operations, monitoring, troubleshooting, and maintenance of the Calejo Control Adapter system.

Daily Operations

System Startup and Shutdown

Normal Startup Procedure

# Start all services
docker-compose up -d

# Verify services are running
docker-compose ps

# Check health status
curl http://localhost:8080/api/v1/health
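
Containers started with `docker-compose up -d` can take a few seconds to accept requests, so a single `curl` issued right after startup may fail even when nothing is wrong. A minimal sketch of a retry loop follows. `wait_for` is a hypothetical helper, not part of the project's scripts, and it assumes the health endpoint answers with an HTTP error status while the service is unhealthy.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for 30 2 curl -fsS -o /dev/null http://localhost:8080/api/v1/health` polls the endpoint every 2 seconds for up to 30 attempts; the `-f` flag makes `curl` exit non-zero on HTTP error responses.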

Graceful Shutdown Procedure

# Stop services gracefully
docker-compose down

# Verify all services stopped
docker-compose ps

Emergency Shutdown

# Immediate shutdown (use only in emergencies)
docker-compose down --timeout 0

Daily Health Checks

Automated Health Monitoring

# Run automated health check
./scripts/health-check.sh

# Check specific components
curl http://localhost:8080/api/v1/health/detailed

Manual Health Verification

# Check database connectivity
psql "${DATABASE_URL}" -c "SELECT 1;"

# Check protocol servers
opcua-client connect opc.tcp://localhost:4840
modbus-tcp read 127.0.0.1 502 40001 10
curl http://localhost:8080/api/v1/status

Performance Monitoring

Key Performance Indicators

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Response Time | < 100ms | > 500ms |
| CPU Usage | < 70% | > 90% |
| Memory Usage | < 80% | > 95% |
| Database Connections | < 50% of max | > 80% of max |
| Network Latency | < 10ms | > 50ms |
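
The target and alert threshold for each indicator can be applied mechanically to a reading. The sketch below assumes integer readings in the same unit as the table row. `classify_metric` is a hypothetical helper, and the intermediate "warning" band between target and alert threshold is an assumption, since the table defines only the two limits.

```shell
# Hypothetical helper: compare an integer reading with a KPI's target and
# alert threshold. Prints "ok", "warning", or "alert".
# The "warning" band is an assumption; the table defines only the two limits.
classify_metric() {
  local value="$1" target="$2" alert="$3"
  if [ "$value" -gt "$alert" ]; then
    echo "alert"
  elif [ "$value" -lt "$target" ]; then
    echo "ok"
  else
    echo "warning"
  fi
}
```

With the CPU usage row, `classify_metric 85 70 90` prints `warning`, because 85 is neither below the 70% target nor above the 90% alert threshold.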

Performance Monitoring Commands

# Monitor system resources
docker stats

# Check application performance
curl http://localhost:8080/api/v1/metrics

# Monitor database performance
psql "${DATABASE_URL}" -c "SELECT * FROM pg_stat_activity;"

Monitoring & Alerting

Real-time Monitoring

Application Monitoring

# View application logs in real-time
docker-compose logs -f control-adapter

# Show only warning, error, and critical log entries
docker-compose logs -f control-adapter | grep -E "(ERROR|WARNING|CRITICAL)"

# Check service status
systemctl status calejo-control-adapter

Database Monitoring

# Monitor database performance
psql "${DATABASE_URL}" -c "SELECT * FROM pg_stat_database WHERE datname='calejo';"

# Check connection pool
psql "${DATABASE_URL}" -c "SELECT count(*) FROM pg_stat_activity WHERE datname='calejo';"

Alert Configuration

Email Alerts

# Email alert configuration
alerts:
  email:
    enabled: true
    smtp_server: smtp.example.com
    smtp_port: 587
    from_address: alerts@calejo.com
    to_addresses:
      - operations@calejo.com
      - engineering@calejo.com

SMS Alerts

# SMS alert configuration
alerts:
  sms:
    enabled: true
    provider: twilio
    account_sid: ${TWILIO_ACCOUNT_SID}
    auth_token: ${TWILIO_AUTH_TOKEN}
    from_number: "+1234567890"
    to_numbers:
      - "+1234567891"
      - "+1234567892"

Webhook Alerts

# Webhook alert configuration
alerts:
  webhook:
    enabled: true
    url: https://monitoring.example.com/webhook
    secret: ${WEBHOOK_SECRET}

Alert Severity Levels

| Severity | Description | Response Time | Notification Channels |
|----------|-------------|---------------|-----------------------|
| Critical | System failure, safety violation | Immediate (< 15 min) | SMS, Email, Webhook |
| High | Performance degradation, security event | Urgent (< 1 hour) | Email, Webhook |
| Medium | Configuration issues, warnings | Standard (< 4 hours) | Email |
| Low | Informational events | Routine (< 24 hours) | Dashboard only |
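
The severity-to-channel routing in the table can be expressed as a simple lookup. This is only an illustrative sketch: `channels_for_severity` is a hypothetical helper, and how the adapter itself routes alerts is not described in this guide.

```shell
# Hypothetical helper: print the notification channels for a severity level,
# following the Alert Severity Levels table.
channels_for_severity() {
  # Accept any letter case by lowercasing the input first.
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    critical) echo "sms email webhook" ;;
    high)     echo "email webhook" ;;
    medium)   echo "email" ;;
    low)      echo "dashboard" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

An unrecognized severity returns a non-zero status instead of silently selecting no channel, so a calling script can treat a typo as an error.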

Maintenance Procedures

Regular Maintenance Tasks

Daily Tasks

# Check system health
./scripts/health-check.sh

# Review error logs
docker-compose logs control-adapter --since "24h" | grep ERROR

# Verify backups
ls -la /var/backup/calejo/
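
Listing the directory shows which backups exist but leaves it to the operator to notice a stale one. As a sketch, the freshness check could be automated as below. `has_recent_backup` is a hypothetical helper, and it relies on the `-maxdepth` and `-mmin` options of GNU and BSD `find`, which are not part of POSIX.

```shell
# Hypothetical helper: succeed if DIR directly contains at least one regular
# file modified within the last MAX_AGE_MINUTES minutes.
has_recent_backup() {
  local dir="$1" max_age_minutes="$2"
  [ -n "$(find "$dir" -maxdepth 1 -type f -mmin "-${max_age_minutes}" 2>/dev/null | head -n 1)" ]
}
```

For a daily backup, `has_recent_backup /var/backup/calejo 1440 || echo "no backup in the last 24 hours"` flags a missed run, assuming the backup files are written directly into that directory.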

Weekly Tasks

# Database maintenance
psql "${DATABASE_URL}" -c "VACUUM ANALYZE;"

# Log cleanup: delete log files older than 7 days
find /var/log/calejo -name "*.log" -mtime +7 -delete

# Backup verification
./scripts/verify-backup.sh latest-backup.tar.gz

Monthly Tasks

# Security updates
docker-compose pull
docker-compose build --no-cache

# Performance analysis
./scripts/performance-analysis.sh

# Compliance audit
./scripts/compliance-audit.sh

Backup and Recovery

Automated Backups

# Create full backup
./scripts/backup-full.sh

# Create configuration-only backup
./scripts/backup-config.sh

# Create database-only backup
./scripts/backup-database.sh

Backup Schedule

| Backup Type | Frequency | Retention | Location |
|-------------|-----------|-----------|----------|
| Full System | Daily | 7 days | /var/backup/calejo/ |
| Database | Hourly | 24 hours | /var/backup/calejo/database/ |
| Configuration | Weekly | 4 weeks | /var/backup/calejo/config/ |
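
The retention periods in the schedule imply that older backups are pruned. Whether the backup scripts already do this is not stated here, so the following is only a sketch of how retention could be enforced. `prune_older_than` is a hypothetical helper that uses the GNU/BSD `find` options `-maxdepth`, `-mmin`, and `-delete`.

```shell
# Hypothetical helper: delete regular files directly inside DIR whose
# modification time is more than MINUTES minutes in the past.
# Prints each deleted path so the run can be logged.
prune_older_than() {
  local dir="$1" minutes="$2"
  find "$dir" -maxdepth 1 -type f -mmin "+${minutes}" -print -delete
}
```

Expressed in minutes, the schedule corresponds to `prune_older_than /var/backup/calejo 10080` (7 days), `prune_older_than /var/backup/calejo/database 1440` (24 hours), and `prune_older_than /var/backup/calejo/config 40320` (4 weeks). Because `-maxdepth 1` does not descend into subdirectories, pruning the full-system directory leaves the `database/` and `config/` contents alone.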

Recovery Procedures

# Full system recovery
./scripts/restore-full.sh /var/backup/calejo/calejo-backup-20231026.tar.gz

# Database recovery
./scripts/restore-database.sh /var/backup/calejo/database/backup.sql

# Configuration recovery
./scripts/restore-config.sh /var/backup/calejo/config/config-backup.tar.gz

Software Updates

Update Procedure

# 1. Create backup
./scripts/backup-full.sh

# 2. Stop services
docker-compose down

# 3. Update application
git pull origin main

# 4. Rebuild services
docker-compose build --no-cache

# 5. Start services
docker-compose up -d

# 6. Verify update
./scripts/health-check.sh
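
When the steps are pasted into a terminal one after another, a failed step (for example, the backup) does not stop the later ones from running. A minimal sketch of a fail-fast wrapper is shown below. `run_steps` is a hypothetical helper, and it assumes each command or script exits with a non-zero status on failure.

```shell
# Hypothetical helper: run each argument as a shell command, in order,
# and stop at the first one that fails.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! sh -c "${step}"; then
      echo "FAILED: ${step}" >&2
      return 1
    fi
  done
}
```

The update procedure then becomes `run_steps "./scripts/backup-full.sh" "docker-compose down" "git pull origin main" "docker-compose build --no-cache" "docker-compose up -d" "./scripts/health-check.sh"`. If the call returns non-zero, the last `==>` line shows which step failed, which helps decide whether the rollback procedure is needed.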

Rollback Procedure

# 1. Stop services
docker-compose down

# 2. Restore from backup
./scripts/restore-full.sh /var/backup/calejo/calejo-backup-pre-update.tar.gz

# 3. Start services
docker-compose up -d

# 4. Verify rollback
./scripts/health-check.sh

Troubleshooting

Common Issues and Solutions

Database Connection Issues

Symptoms:

  • "Connection refused" errors
  • Slow response times
  • Connection pool exhaustion

Solutions:

# Check PostgreSQL status
systemctl status postgresql

# Verify connection parameters
psql "${DATABASE_URL}" -c "SELECT version();"

# Check connection pool
psql "${DATABASE_URL}" -c "SELECT count(*) FROM pg_stat_activity;"
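
A raw connection count is easier to judge as a percentage of the server's limit. The sketch below does the arithmetic. `pool_usage_percent` is a hypothetical helper, and the two inputs are assumed to come from `pg_stat_activity` and the `max_connections` setting.

```shell
# Hypothetical helper: integer percentage of connections in use,
# rounded down.
pool_usage_percent() {
  local current="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "max must be positive" >&2
    return 1
  fi
  echo $((current * 100 / max))
}
```

The inputs can be fetched with `psql "${DATABASE_URL}" -At -c "SELECT count(*) FROM pg_stat_activity;"` and `psql "${DATABASE_URL}" -At -c "SHOW max_connections;"`, where `-At` prints the bare value without headers. Note that `pg_stat_activity` may also list server background processes, so the count can slightly overstate client connections.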

Protocol Server Issues

OPC UA Server Problems:

# Test OPC UA connectivity
opcua-client connect opc.tcp://localhost:4840

# Check OPC UA logs
docker-compose logs control-adapter | grep opcua

# Verify certificate validity
openssl x509 -in /app/certs/server.pem -text -noout

Modbus TCP Issues:

# Test Modbus connectivity
modbus-tcp read 127.0.0.1 502 40001 10

# Check Modbus logs
docker-compose logs control-adapter | grep modbus

# Verify port availability
netstat -tulpn | grep :502

Performance Issues

High CPU Usage:

# Identify resource usage
docker stats

# Check for runaway processes
ps aux | grep python

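# Analyze database queries (requires the pg_stat_statements extension;
# on PostgreSQL 13 and later the column is named total_exec_time)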
psql "${DATABASE_URL}" -c "SELECT query, calls, total_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"

Memory Issues:

# Check memory usage
free -h

# Monitor application memory
docker stats control-adapter

# Look for memory-related messages (such as OOM kills) in the Docker daemon log
journalctl -u docker --since "1 hour ago" | grep -i memory

Diagnostic Tools

Log Analysis

# View recent errors
docker-compose logs control-adapter --since "1h" | grep -E "(ERROR|CRITICAL)"

# Search for specific patterns
docker-compose logs control-adapter | grep -i "connection"

# Export logs for analysis
docker-compose logs control-adapter > application-logs-$(date +%Y%m%d).log
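
Before reading individual entries, a per-severity count gives a quick picture of how noisy the period was. The following sketch assumes log lines contain the level as one of the literal words `ERROR`, `WARNING`, or `CRITICAL`, as the `grep` patterns in this guide suggest. `count_by_level` is a hypothetical helper.

```shell
# Hypothetical helper: read log lines on stdin and print one
# "LEVEL COUNT" line per severity level that occurs, sorted by level name.
count_by_level() {
  awk '
    # Count each line once, under the first level keyword it contains.
    match($0, /ERROR|WARNING|CRITICAL/) { n[substr($0, RSTART, RLENGTH)]++ }
    END { for (lvl in n) print lvl, n[lvl] }
  ' | sort
}
```

For example, `docker-compose logs control-adapter --since "1h" | count_by_level` summarizes the last hour. Lines without any of the three keywords are ignored.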

Performance Analysis

# Run performance tests
./scripts/performance-test.sh

# Generate performance report
./scripts/performance-report.sh

# Monitor real-time performance
./scripts/monitor-performance.sh

Security Analysis

# Run security scan
./scripts/security-scan.sh

# Check compliance status
./scripts/compliance-check.sh

# Audit user activity
./scripts/audit-report.sh

Security Operations

Access Control

User Management

# List current users
curl -H "Authorization: Bearer ${TOKEN}" http://localhost:8080/api/v1/users

# Create new user
curl -X POST -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  -d '{"username":"newuser","role":"operator","email":"user@example.com"}' \
  http://localhost:8080/api/v1/users

# Deactivate user
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" \
  http://localhost:8080/api/v1/users/user123

Role Management

# View role permissions
curl -H "Authorization: Bearer ${TOKEN}" http://localhost:8080/api/v1/roles

# Update role permissions
curl -X PUT -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  -d '{"permissions":["read_pump_status","emergency_stop"]}' \
  http://localhost:8080/api/v1/roles/operator

Security Monitoring

Audit Log Review

# View recent security events
psql "${DATABASE_URL}" -c "SELECT * FROM compliance_audit_log WHERE severity IN ('HIGH','CRITICAL') ORDER BY timestamp DESC LIMIT 10;"

# Generate security report
./scripts/security-report.sh

# Monitor failed login attempts
psql "${DATABASE_URL}" -c "SELECT COUNT(*) FROM compliance_audit_log WHERE event_type='INVALID_AUTHENTICATION' AND timestamp > NOW() - INTERVAL '1 hour';"

Certificate Management

# Check certificate expiration
openssl x509 -in /app/certs/server.pem -enddate -noout

# Rotate certificates
./scripts/rotate-certificates.sh

# Verify certificate chain
openssl verify -CAfile /app/certs/ca.crt /app/certs/server.pem
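
The `-enddate` output has to be read and compared with today's date by hand. For scripted checks, `openssl x509` also offers `-checkend`, which sets the exit status according to whether the certificate expires within a given number of seconds. The sketch below wraps it; `cert_valid_for_days` is a hypothetical helper, and the 30-day window in the usage note is an assumed policy, not one stated in this guide.

```shell
# Hypothetical helper: succeed if the certificate in CERT_FILE is still
# valid DAYS days from now, fail if it expires before then.
cert_valid_for_days() {
  local cert_file="$1" days="$2"
  openssl x509 -in "$cert_file" -noout -checkend $((days * 86400)) >/dev/null
}
```

For example, `cert_valid_for_days /app/certs/server.pem 30 || echo "certificate expires within 30 days"` can run from a scheduled job so that rotation is planned before the certificate lapses.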

Compliance Operations

Regulatory Compliance

IEC 62443 Compliance

# Generate compliance report
./scripts/iec62443-report.sh

# Verify security controls
./scripts/security-controls-check.sh

# Audit trail verification
./scripts/audit-trail-verification.sh

ISO 27001 Compliance

# ISO 27001 controls check
./scripts/iso27001-check.sh

# Risk assessment
./scripts/risk-assessment.sh

# Security policy compliance
./scripts/security-policy-check.sh

Documentation and Reporting

Compliance Reports

# Generate monthly compliance report
./scripts/generate-compliance-report.sh

# Export audit logs
./scripts/export-audit-logs.sh

# Create security assessment
./scripts/security-assessment.sh

Emergency Procedures

Emergency Stop Operations

Manual Emergency Stop

# Activate emergency stop for station
curl -X POST -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  -d '{"reason":"Emergency maintenance","operator":"operator001"}' \
  http://localhost:8080/api/v1/pump-stations/station001/emergency-stop

# Clear emergency stop
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" \
  http://localhost:8080/api/v1/pump-stations/station001/emergency-stop

System Recovery

# Check emergency stop status
curl -H "Authorization: Bearer ${TOKEN}" \
  http://localhost:8080/api/v1/pump-stations/station001/emergency-stop-status

# Verify system recovery
./scripts/emergency-recovery-check.sh

Disaster Recovery

Full System Recovery

# 1. Stop all services
docker-compose down

# 2. Restore from latest backup
./scripts/restore-full.sh /var/backup/calejo/calejo-backup-latest.tar.gz

# 3. Start services
docker-compose up -d

# 4. Verify recovery
./scripts/health-check.sh
./scripts/emergency-recovery-verification.sh

Database Recovery

# 1. Stop database-dependent services
docker-compose stop control-adapter

# 2. Restore database
./scripts/restore-database.sh /var/backup/calejo/database/backup-latest.sql

# 3. Start services
docker-compose up -d

# 4. Verify data integrity
./scripts/database-integrity-check.sh

These procedures cover routine operation, maintenance, and recovery of the Calejo Control Adapter system. Follow them as documented, and keep every operational change under change control.