# Calejo Control Adapter - Operations & Maintenance Guide
## Overview
This guide provides comprehensive procedures for daily operations, monitoring, troubleshooting, and maintenance of the Calejo Control Adapter system.
## Daily Operations
### System Startup and Shutdown
#### Normal Startup Procedure
```bash
# Start all services
docker-compose up -d
# Verify services are running
docker-compose ps
# Check health status
curl http://localhost:8080/api/v1/health
```
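Containers can show as running before the application inside them is ready to answer requests, so a single `curl` issued right after `docker-compose up -d` may fail spuriously. The following is a minimal sketch of a retry helper for that case. The function name `retry_until` is hypothetical and not part of the Calejo scripts, and the example assumes the health endpoint returns an HTTP error status until the service is ready.

```bash
#!/usr/bin/env bash
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Hypothetical helper for illustration; not part of the Calejo scripts.
retry_until() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while true; do
    # Success: stop retrying immediately.
    if "$@"; then
      return 0
    fi
    # Out of attempts: report and fail.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (assumes the health endpoint used in this guide):
# retry_until 30 2 curl -fsS -o /dev/null http://localhost:8080/api/v1/health
```

With `curl -f`, HTTP responses of 400 or above count as failures, so the loop keeps polling until the endpoint answers successfully or the attempt budget runs out.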
#### Graceful Shutdown Procedure
```bash
# Stop services gracefully
docker-compose down
# Verify all services stopped
docker-compose ps
```
#### Emergency Shutdown
```bash
# Immediate shutdown (use only in emergencies)
docker-compose down --timeout 0
```
### Daily Health Checks
#### Automated Health Monitoring
```bash
# Run automated health check
./scripts/health-check.sh
# Check specific components
curl http://localhost:8080/api/v1/health/detailed
```
#### Manual Health Verification
```bash
# Check database connectivity
psql "${DATABASE_URL}" -c "SELECT 1;"
# Check protocol servers
opcua-client connect opc.tcp://localhost:4840
modbus-tcp read 127.0.0.1 502 40001 10
curl http://localhost:8080/api/v1/status
```
### Performance Monitoring
#### Key Performance Indicators
| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| **Response Time** | < 100ms | > 500ms |
| **CPU Usage** | < 70% | > 90% |
| **Memory Usage** | < 80% | > 95% |
| **Database Connections** | < 50% of max | > 80% of max |
| **Network Latency** | < 10ms | > 50ms |
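
The table leaves a gap between each target and its alert threshold. As a rough sketch, a check script could classify a reading into three bands. The function name `classify_metric` and the intermediate `WARN` band are assumptions made for illustration; the table itself only defines the target and the alert threshold. Values are integers (for example milliseconds or whole percent).

```bash
# classify_metric VALUE TARGET ALERT
# Prints OK when VALUE is below TARGET, ALERT when VALUE exceeds ALERT,
# and WARN for anything in between (the WARN band is an assumption).
# Hypothetical helper; integer inputs only.
classify_metric() {
  local value=$1 target=$2 alert=$3
  if [ "$value" -gt "$alert" ]; then
    echo "ALERT"
  elif [ "$value" -lt "$target" ]; then
    echo "OK"
  else
    echo "WARN"
  fi
}

# Example: a 250 ms response time against the 100 ms / 500 ms row
# classify_metric 250 100 500
```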
#### Performance Monitoring Commands
```bash
# Monitor system resources
docker stats
# Check application performance
curl http://localhost:8080/api/v1/metrics
# Monitor database performance
psql "${DATABASE_URL}" -c "SELECT * FROM pg_stat_activity;"
```
## Monitoring & Alerting
### Real-time Monitoring
#### Application Monitoring
```bash
# View application logs in real-time
docker-compose logs -f control-adapter
# Monitor specific components
docker-compose logs -f control-adapter | grep -E "(ERROR|WARNING|CRITICAL)"
# Check service status
systemctl status calejo-control-adapter
```
#### Database Monitoring
```bash
# Monitor database performance
psql "${DATABASE_URL}" -c "SELECT * FROM pg_stat_database WHERE datname='calejo';"
# Check connection pool
psql "${DATABASE_URL}" -c "SELECT count(*) FROM pg_stat_activity WHERE datname='calejo';"
```
### Alert Configuration
#### Email Alerts
```yaml
# Email alert configuration
alerts:
  email:
    enabled: true
    smtp_server: smtp.example.com
    smtp_port: 587
    from_address: alerts@calejo.com
    to_addresses:
      - operations@calejo.com
      - engineering@calejo.com
```
#### SMS Alerts
```yaml
# SMS alert configuration
alerts:
  sms:
    enabled: true
    provider: twilio
    account_sid: ${TWILIO_ACCOUNT_SID}
    auth_token: ${TWILIO_AUTH_TOKEN}
    # Quote phone numbers so YAML does not parse them as integers
    from_number: "+1234567890"
    to_numbers:
      - "+1234567891"
      - "+1234567892"
```
#### Webhook Alerts
```yaml
# Webhook alert configuration
alerts:
  webhook:
    enabled: true
    url: https://monitoring.example.com/webhook
    secret: ${WEBHOOK_SECRET}
```
### Alert Severity Levels
| Severity | Description | Response Time | Notification Channels |
|----------|-------------|---------------|----------------------|
| **Critical** | System failure, safety violation | Immediate (< 15 min) | SMS, Email, Webhook |
| **High** | Performance degradation, security event | Urgent (< 1 hour) | Email, Webhook |
| **Medium** | Configuration issues, warnings | Standard (< 4 hours) | Email |
| **Low** | Informational events | Routine (< 24 hours) | Dashboard only |
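
For scripts that dispatch notifications, the routing in the table can be expressed as a small lookup. This is a sketch only: the function name `channels_for_severity` and the lower-case channel identifiers are illustrative assumptions, not names taken from the adapter's configuration.

```bash
# channels_for_severity SEVERITY
# Prints the space-separated notification channels for a severity level,
# following the table above. Hypothetical helper for illustration.
channels_for_severity() {
  local severity
  # Normalise to lower case so "Critical" and "CRITICAL" both match.
  severity=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$severity" in
    critical) echo "sms email webhook" ;;
    high)     echo "email webhook" ;;
    medium)   echo "email" ;;
    low)      echo "dashboard" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example:
# for channel in $(channels_for_severity High); do echo "notify via $channel"; done
```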
## Maintenance Procedures
### Regular Maintenance Tasks
#### Daily Tasks
```bash
# Check system health
./scripts/health-check.sh
# Review error logs
docker-compose logs --since 24h control-adapter | grep ERROR
# Verify backups
ls -la /var/backup/calejo/
```
#### Weekly Tasks
```bash
# Database maintenance
psql "${DATABASE_URL}" -c "VACUUM ANALYZE;"
# Delete application logs older than 7 days
find /var/log/calejo -name "*.log" -mtime +7 -delete
# Backup verification
./scripts/verify-backup.sh latest-backup.tar.gz
```
#### Monthly Tasks
```bash
# Security updates
docker-compose pull
docker-compose build --no-cache
# Performance analysis
./scripts/performance-analysis.sh
# Compliance audit
./scripts/compliance-audit.sh
```
### Backup and Recovery
#### Automated Backups
```bash
# Create full backup
./scripts/backup-full.sh
# Create configuration-only backup
./scripts/backup-config.sh
# Create database-only backup
./scripts/backup-database.sh
```
#### Backup Schedule
| Backup Type | Frequency | Retention | Location |
|-------------|-----------|-----------|----------|
| **Full System** | Daily | 7 days | /var/backup/calejo/ |
| **Database** | Hourly | 24 hours | /var/backup/calejo/database/ |
| **Configuration** | Weekly | 4 weeks | /var/backup/calejo/config/ |
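
The retention windows above can be encoded so that a cleanup job makes the same decision every time. The sketch below assumes the short type names `full`, `database`, and `config`; both those names and the two functions are hypothetical and may not match what the backup scripts use.

```bash
# retention_hours TYPE
# Prints the retention window in hours for a backup type, per the table above.
# Hypothetical helper; the type names are assumptions.
retention_hours() {
  case "$1" in
    full)     echo $((7 * 24)) ;;      # 7 days
    database) echo 24 ;;               # 24 hours
    config)   echo $((4 * 7 * 24)) ;;  # 4 weeks
    *)        echo "unknown backup type: $1" >&2; return 1 ;;
  esac
}

# is_expired TYPE AGE_HOURS
# Exit status 0 when a backup of TYPE that is AGE_HOURS old is past retention,
# 1 when it should be kept, 2 when TYPE is unknown.
is_expired() {
  local limit
  limit=$(retention_hours "$1") || return 2
  [ "$2" -gt "$limit" ]
}

# Example (GNU stat assumed for the file modification time):
# age=$(( ( $(date +%s) - $(stat -c %Y "$file") ) / 3600 ))
# is_expired database "$age" && echo "$file is past retention"
```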
#### Recovery Procedures
```bash
# Full system recovery
./scripts/restore-full.sh /var/backup/calejo/calejo-backup-20231026.tar.gz
# Database recovery
./scripts/restore-database.sh /var/backup/calejo/database/backup.sql
# Configuration recovery
./scripts/restore-config.sh /var/backup/calejo/config/config-backup.tar.gz
```
### Software Updates
#### Update Procedure
```bash
# 1. Create backup
./scripts/backup-full.sh
# 2. Stop services
docker-compose down
# 3. Update application
git pull origin main
# 4. Rebuild services
docker-compose build --no-cache
# 5. Start services
docker-compose up -d
# 6. Verify update
./scripts/health-check.sh
```
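The update procedure is only safe if each step succeeds before the next one starts; for example, services should not be stopped when the backup in step 1 has failed. One way to enforce this is a small runner that aborts on the first failing step. This is a sketch under that assumption, and `run_steps` is a hypothetical name, not an existing script.

```bash
# run_steps "COMMAND 1" "COMMAND 2" ...
# Runs each command string in order and stops at the first failure.
# Hypothetical helper for illustration.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! bash -c "$step"; then
      echo "step failed, aborting: ${step}" >&2
      return 1
    fi
  done
}

# Example mirroring the procedure above (commented out so nothing runs here):
# run_steps \
#   "./scripts/backup-full.sh" \
#   "docker-compose down" \
#   "git pull origin main" \
#   "docker-compose build --no-cache" \
#   "docker-compose up -d" \
#   "./scripts/health-check.sh"
```

If any step fails, the runner returns a non-zero status and the remaining steps are skipped, which leaves the operator free to decide whether to retry or roll back.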
#### Rollback Procedure
```bash
# 1. Stop services
docker-compose down
# 2. Restore from backup
./scripts/restore-full.sh /var/backup/calejo/calejo-backup-pre-update.tar.gz
# 3. Start services
docker-compose up -d
# 4. Verify rollback
./scripts/health-check.sh
```
## Troubleshooting
### Common Issues and Solutions
#### Database Connection Issues
**Symptoms**:
- "Connection refused" errors
- Slow response times
- Connection pool exhaustion

**Solutions**:
```bash
# Check PostgreSQL status
systemctl status postgresql
# Verify connection parameters
psql "${DATABASE_URL}" -c "SELECT version();"
# Check connection pool
psql "${DATABASE_URL}" -c "SELECT count(*) FROM pg_stat_activity;"
```
#### Protocol Server Issues
**OPC UA Server Problems**:
```bash
# Test OPC UA connectivity
opcua-client connect opc.tcp://localhost:4840
# Check OPC UA logs
docker-compose logs control-adapter | grep opcua
# Verify certificate validity
openssl x509 -in /app/certs/server.pem -text -noout
```
**Modbus TCP Issues**:
```bash
# Test Modbus connectivity
modbus-tcp read 127.0.0.1 502 40001 10
# Check Modbus logs
docker-compose logs control-adapter | grep modbus
# Verify port availability
netstat -tulpn | grep :502
```
#### Performance Issues
**High CPU Usage**:
```bash
# Identify resource usage
docker stats
# Check for runaway processes
ps aux | grep python
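# Analyze database queries (requires the pg_stat_statements extension;
# on PostgreSQL 13 and later the column is named total_exec_time instead of total_time)
psql "${DATABASE_URL}" -c "SELECT query, calls, total_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"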
```
**Memory Issues**:
```bash
# Check memory usage
free -h
# Monitor application memory
docker stats control-adapter
# Search Docker daemon logs for memory-related messages
journalctl -u docker --since "1 hour ago" | grep -i memory
```
### Diagnostic Tools
#### Log Analysis
```bash
# View recent errors
docker-compose logs control-adapter --since "1h" | grep -E "(ERROR|CRITICAL)"
# Search for specific patterns
docker-compose logs control-adapter | grep -i "connection"
# Export logs for analysis
docker-compose logs control-adapter > application-logs-$(date +%Y%m%d).log
```
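When triaging, a per-level count is often more useful than a raw list of matching lines. The sketch below tallies log lines by level from standard input. It assumes the level names `CRITICAL`, `ERROR`, and `WARNING` appear literally in each log line, as the `grep` patterns in this guide imply; `count_levels` is a hypothetical helper.

```bash
# count_levels
# Reads log text on stdin and prints one "LEVEL count" line per level.
# A line is counted once, under the most severe level it mentions.
# Hypothetical helper for illustration.
count_levels() {
  awk '
    /CRITICAL/ { c["CRITICAL"]++; next }
    /ERROR/    { c["ERROR"]++;    next }
    /WARNING/  { c["WARNING"]++;  next }
    END {
      printf "CRITICAL %d\nERROR %d\nWARNING %d\n", c["CRITICAL"], c["ERROR"], c["WARNING"]
    }
  '
}

# Example:
# docker-compose logs --no-color control-adapter | count_levels
```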
#### Performance Analysis
```bash
# Run performance tests
./scripts/performance-test.sh
# Generate performance report
./scripts/performance-report.sh
# Monitor real-time performance
./scripts/monitor-performance.sh
```
#### Security Analysis
```bash
# Run security scan
./scripts/security-scan.sh
# Check compliance status
./scripts/compliance-check.sh
# Audit user activity
./scripts/audit-report.sh
```
## Security Operations
### Access Control
#### User Management
```bash
# List current users
curl -H "Authorization: Bearer ${TOKEN}" http://localhost:8080/api/v1/users
# Create new user
curl -X POST -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
-d '{"username":"newuser","role":"operator","email":"user@example.com"}' \
http://localhost:8080/api/v1/users
# Deactivate user
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" \
http://localhost:8080/api/v1/users/user123
```
#### Role Management
```bash
# View role permissions
curl -H "Authorization: Bearer ${TOKEN}" http://localhost:8080/api/v1/roles
# Update role permissions
curl -X PUT -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
-d '{"permissions":["read_pump_status","emergency_stop"]}' \
http://localhost:8080/api/v1/roles/operator
```
### Security Monitoring
#### Audit Log Review
```bash
# View recent security events
psql "${DATABASE_URL}" -c "SELECT * FROM compliance_audit_log WHERE severity IN ('HIGH','CRITICAL') ORDER BY timestamp DESC LIMIT 10;"
# Generate security report
./scripts/security-report.sh
# Monitor failed login attempts
psql "${DATABASE_URL}" -c "SELECT COUNT(*) FROM compliance_audit_log WHERE event_type='INVALID_AUTHENTICATION' AND timestamp > NOW() - INTERVAL '1 hour';"
```
#### Certificate Management
```bash
# Check certificate expiration
openssl x509 -in /app/certs/server.pem -enddate -noout
# Rotate certificates
./scripts/rotate-certificates.sh
# Verify certificate chain
openssl verify -CAfile /app/certs/ca.crt /app/certs/server.pem
```
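For scheduled checks, the `notAfter=` line printed by `openssl x509 -enddate -noout` can be converted into a number of days remaining. The following is a sketch that assumes GNU `date` (for the `-d` option); `days_until_expiry` is a hypothetical helper and the 30-day threshold in the example is an arbitrary illustration.

```bash
# days_until_expiry "notAfter=<date>" [NOW_EPOCH]
# Prints the whole number of days between NOW_EPOCH (default: now) and the
# certificate end date. The result is zero or negative once the certificate
# has expired. Requires GNU date. Hypothetical helper for illustration.
days_until_expiry() {
  local end_date=${1#notAfter=}   # strip the "notAfter=" prefix if present
  local now=${2:-$(date +%s)}
  local end_epoch
  end_epoch=$(date -u -d "$end_date" +%s) || return 1
  echo $(( (end_epoch - now) / 86400 ))
}

# Example:
# days=$(days_until_expiry "$(openssl x509 -in /app/certs/server.pem -enddate -noout)")
# [ "$days" -lt 30 ] && echo "certificate expires in ${days} days"
```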
## Compliance Operations
### Regulatory Compliance
#### IEC 62443 Compliance
```bash
# Generate compliance report
./scripts/iec62443-report.sh
# Verify security controls
./scripts/security-controls-check.sh
# Audit trail verification
./scripts/audit-trail-verification.sh
```
#### ISO 27001 Compliance
```bash
# ISO 27001 controls check
./scripts/iso27001-check.sh
# Risk assessment
./scripts/risk-assessment.sh
# Security policy compliance
./scripts/security-policy-check.sh
```
### Documentation and Reporting
#### Compliance Reports
```bash
# Generate monthly compliance report
./scripts/generate-compliance-report.sh
# Export audit logs
./scripts/export-audit-logs.sh
# Create security assessment
./scripts/security-assessment.sh
```
## Emergency Procedures
### Emergency Stop Operations
#### Manual Emergency Stop
```bash
# Activate emergency stop for station
curl -X POST -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
-d '{"reason":"Emergency maintenance","operator":"operator001"}' \
http://localhost:8080/api/v1/pump-stations/station001/emergency-stop
# Clear emergency stop
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" \
http://localhost:8080/api/v1/pump-stations/station001/emergency-stop
```
#### System Recovery
```bash
# Check emergency stop status
curl -H "Authorization: Bearer ${TOKEN}" \
http://localhost:8080/api/v1/pump-stations/station001/emergency-stop-status
# Verify system recovery
./scripts/emergency-recovery-check.sh
```
### Disaster Recovery
#### Full System Recovery
```bash
# 1. Stop all services
docker-compose down
# 2. Restore from latest backup
./scripts/restore-full.sh /var/backup/calejo/calejo-backup-latest.tar.gz
# 3. Start services
docker-compose up -d
# 4. Verify recovery
./scripts/health-check.sh
./scripts/emergency-recovery-verification.sh
```
#### Database Recovery
```bash
# 1. Stop database-dependent services
docker-compose stop control-adapter
# 2. Restore database
./scripts/restore-database.sh /var/backup/calejo/database/backup-latest.sql
# 3. Start services
docker-compose up -d
# 4. Verify data integrity
./scripts/database-integrity-check.sh
```
---
*This operations and maintenance guide provides comprehensive procedures for managing the Calejo Control Adapter system. Always follow documented procedures and maintain proper change control for all operational activities.*