# Calejo Control Adapter - Safety Framework
## Overview
The Calejo Control Adapter implements a multi-layer safety framework designed to prevent equipment damage and operational hazards, and to keep pump stations operating reliably under all conditions, including system failures, communication loss, and cyber attacks.
**Safety Philosophy**: "Safety First" - All setpoints must pass through safety enforcement before reaching SCADA systems.
## Multi-Layer Safety Architecture
### Three-Layer Safety Model
```
┌─────────────────────────────────────────────────────────┐
│ Layer 3: Optimization Constraints (Calejo Optimize) │
│ - Economic optimization bounds: 25-45 Hz │
│ - Energy efficiency constraints │
│ - Production optimization limits │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Layer 2: Station Safety Limits (Control Adapter) │
│ - Database-enforced limits: 20-50 Hz │
│ - Rate of change limiting │
│ - Emergency stop integration │
│ - Failsafe mechanisms │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Layer 1: Physical Hard Limits (PLC/VFD) │
│ - Hardware-enforced limits: 15-55 Hz │
│ - Physical safety mechanisms │
│ - Equipment protection │
└─────────────────────────────────────────────────────────┘
```
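As a rough illustration of how the three ranges nest, the sketch below clamps a requested speed through each layer in turn. The Hz ranges are taken from the diagram; the names `LAYERS`, `clamp`, and `apply_layers` are hypothetical and do not come from the adapter's code.

```python
from typing import List, Tuple

# (layer name, min Hz, max Hz), ordered from the optimizer down to the hardware.
# The ranges are the ones shown in the diagram above.
LAYERS: List[Tuple[str, float, float]] = [
    ("Layer 3: optimization constraints", 25.0, 45.0),
    ("Layer 2: station safety limits", 20.0, 50.0),
    ("Layer 1: physical hard limits", 15.0, 55.0),
]


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_layers(setpoint_hz: float, layers: List[Tuple[str, float, float]] = LAYERS) -> float:
    """Pass a setpoint through each layer's range in order."""
    for _name, low, high in layers:
        setpoint_hz = clamp(setpoint_hz, low, high)
    return setpoint_hz


print(apply_layers(60.0))              # all three layers active
print(apply_layers(60.0, LAYERS[1:]))  # Layer 3 bypassed, Layer 2 still bounds the value
```

Because each range contains the one above it, an inner layer only changes a value when the layer above it has failed or been bypassed.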
## Safety Components
### 1. Safety Limit Enforcer (`src/core/safety.py`)
#### Purpose
The Safety Limit Enforcer is the **LAST line of defense** before setpoints are exposed to SCADA systems. ALL setpoints MUST pass through this enforcer.
#### Key Features
- **Multi-Layer Limit Enforcement**:
  - Hard operational limits (speed, level, power, flow)
  - Rate of change limiting
  - Emergency stop integration
  - Failsafe mode activation
- **Safety Limit Types**:
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafetyLimits:
    hard_min_speed_hz: float            # Minimum speed limit (Hz)
    hard_max_speed_hz: float            # Maximum speed limit (Hz)
    hard_min_level_m: Optional[float]   # Minimum level limit (meters)
    hard_max_level_m: Optional[float]   # Maximum level limit (meters)
    hard_max_power_kw: Optional[float]  # Maximum power limit (kW)
    max_speed_change_hz_per_min: float  # Rate of change limit (Hz per minute)
```
#### Enforcement Process
```python
from typing import List, Tuple


def enforce_setpoint(station_id: str, pump_id: str, setpoint: float) -> Tuple[float, List[str]]:
    """
    Enforce safety limits on setpoint.

    Returns:
        Tuple of (enforced_setpoint, violations)
        - enforced_setpoint: Safe setpoint (clamped if necessary)
        - violations: List of safety violations (for logging/alerting)
    """
    # The limits, emergency stop flag, and previous setpoint referenced below
    # are the values held for (station_id, pump_id).
    enforced_setpoint = setpoint  # start from the requested value
    violations: List[str] = []

    # 1. Check emergency stop first (highest priority)
    if emergency_stop_active:
        return (0.0, ["EMERGENCY_STOP_ACTIVE"])

    # 2. Enforce hard speed limits
    if setpoint < hard_min_speed_hz:
        enforced_setpoint = hard_min_speed_hz
        violations.append("BELOW_MIN_SPEED")
    elif setpoint > hard_max_speed_hz:
        enforced_setpoint = hard_max_speed_hz
        violations.append("ABOVE_MAX_SPEED")

    # 3. Enforce rate of change limits
    rate_violation = check_rate_of_change(previous_setpoint, enforced_setpoint)
    if rate_violation:
        enforced_setpoint = limit_rate_of_change(previous_setpoint, enforced_setpoint)
        violations.append("RATE_OF_CHANGE_VIOLATION")

    # 4. Return safe setpoint
    return (enforced_setpoint, violations)
```
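The same sequence can be tried out in isolation with the self-contained sketch below. It is not the code in `src/core/safety.py`: the `PumpState` container, the `enforce` function, and the per-call `max_step_hz` limit are assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PumpState:
    """Per-pump data the enforcer needs (hypothetical container)."""
    hard_min_speed_hz: float
    hard_max_speed_hz: float
    max_step_hz: float            # assumed: largest change allowed per enforcement call
    previous_setpoint: float
    emergency_stop_active: bool = False


def enforce(state: PumpState, setpoint: float) -> Tuple[float, List[str]]:
    """Apply emergency stop, hard speed limits, then the step limit."""
    if state.emergency_stop_active:
        # previous_setpoint is left untouched; restart ramping is out of scope here.
        return 0.0, ["EMERGENCY_STOP_ACTIVE"]

    violations: List[str] = []
    enforced = setpoint
    if enforced < state.hard_min_speed_hz:
        enforced = state.hard_min_speed_hz
        violations.append("BELOW_MIN_SPEED")
    elif enforced > state.hard_max_speed_hz:
        enforced = state.hard_max_speed_hz
        violations.append("ABOVE_MAX_SPEED")

    step = enforced - state.previous_setpoint
    if abs(step) > state.max_step_hz:
        direction = 1.0 if step > 0 else -1.0
        enforced = state.previous_setpoint + direction * state.max_step_hz
        violations.append("RATE_OF_CHANGE_VIOLATION")

    state.previous_setpoint = enforced  # remember what was actually released
    return enforced, violations


state = PumpState(hard_min_speed_hz=20.0, hard_max_speed_hz=50.0,
                  max_step_hz=0.5, previous_setpoint=40.0)
print(enforce(state, 60.0))  # (40.5, ['ABOVE_MAX_SPEED', 'RATE_OF_CHANGE_VIOLATION'])
```

A request of 60 Hz is first clamped to 50 Hz and then held to a 0.5 Hz step from the previous 40 Hz, so both violations are reported for logging.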
### 2. Emergency Stop Manager (`src/core/emergency_stop.py`)
#### Purpose
Provides a manual override for emergency situations. An active emergency stop takes priority over all other controls.
#### Emergency Stop Levels
1. **Station-Level Emergency Stop**:
   - Stops all pumps in a station
   - Activated by station operators
   - Requires manual reset
2. **Pump-Level Emergency Stop**:
   - Stops individual pumps
   - Activated for specific equipment issues
   - Individual reset capability
#### Emergency Stop Features
- **Immediate Action**: Setpoints forced to 0 Hz immediately
- **Audit Logging**: All emergency operations logged
- **Manual Reset**: Requires explicit operator action to clear
- **Status Monitoring**: Real-time emergency stop status
- **Integration**: Seamless integration with safety framework
#### Emergency Stop API
```python
from typing import Optional


class EmergencyStopManager:
    def activate_emergency_stop(self, station_id: str, pump_id: Optional[str] = None):
        """Activate emergency stop for station or specific pump."""

    def clear_emergency_stop(self, station_id: str, pump_id: Optional[str] = None):
        """Clear emergency stop condition."""

    def is_emergency_stop_active(self, station_id: str, pump_id: Optional[str] = None) -> bool:
        """Check if emergency stop is active."""
```
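To make the relationship between station-level and pump-level stops concrete, here is a minimal in-memory sketch with the same three method names. The class name `InMemoryEmergencyStopManager` and the set-based storage are assumptions; the real manager presumably persists state and writes audit entries, which this sketch leaves out.

```python
from typing import Optional, Set, Tuple


class InMemoryEmergencyStopManager:
    """Toy stand-in that keeps active stops in a set (hypothetical)."""

    def __init__(self) -> None:
        # Each entry is (station_id, pump_id); pump_id None marks a station-level stop.
        self._active: Set[Tuple[str, Optional[str]]] = set()

    def activate_emergency_stop(self, station_id: str, pump_id: Optional[str] = None) -> None:
        self._active.add((station_id, pump_id))

    def clear_emergency_stop(self, station_id: str, pump_id: Optional[str] = None) -> None:
        # Clearing one level never clears the other; each needs its own reset.
        self._active.discard((station_id, pump_id))

    def is_emergency_stop_active(self, station_id: str, pump_id: Optional[str] = None) -> bool:
        # A station-level stop covers every pump in that station.
        if (station_id, None) in self._active:
            return True
        return pump_id is not None and (station_id, pump_id) in self._active


manager = InMemoryEmergencyStopManager()
manager.activate_emergency_stop("station_001")                 # station-level stop
print(manager.is_emergency_stop_active("station_001", "p1"))   # True: covered by the station stop
```

In this sketch a pump-level reset has no effect while a station-level stop is still active, which matches the separate reset requirements listed above.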
### 3. Database Watchdog (`src/monitoring/watchdog.py`)
#### Purpose
Monitors database connectivity and activates failsafe mode if updates stop, so that stale or unsafe setpoints are not passed on.
#### Watchdog Features
- **Periodic Health Checks**: Continuous database connectivity monitoring
- **Failsafe Activation**: Automatic activation on connectivity loss
- **Graceful Degradation**: Safe fallback to default setpoints
- **Alert Generation**: Immediate notification on watchdog activation
- **Auto-Recovery**: Automatic recovery when connectivity restored
#### Watchdog Configuration
```python
class DatabaseWatchdog:
    def __init__(self, db_client, alert_manager, timeout_seconds: int):
        """
        Args:
            timeout_seconds: Time without updates before failsafe activation
        """
```
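The timeout logic can be sketched without a database or alert manager. The class below is an assumption-laden illustration, not the `DatabaseWatchdog` implementation: `WatchdogSketch`, `record_update`, `check`, and the injectable `clock` parameter are hypothetical names chosen so the behaviour can be exercised deterministically.

```python
import time
from typing import Callable


class WatchdogSketch:
    """Tracks the time since the last successful update (hypothetical)."""

    def __init__(self, timeout_seconds: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock                # injectable so tests can control time
        self._last_update = clock()
        self.failsafe_active = False

    def record_update(self) -> None:
        """Call after each successful database update."""
        self._last_update = self._clock()
        self.failsafe_active = False       # auto-recovery once updates resume

    def check(self) -> bool:
        """Return True if failsafe mode should be active."""
        if self._clock() - self._last_update > self.timeout_seconds:
            self.failsafe_active = True
        return self.failsafe_active


# Drive the watchdog with a fake clock instead of waiting 20 minutes.
now = [0.0]
watchdog = WatchdogSketch(timeout_seconds=1200, clock=lambda: now[0])
now[0] = 1201.0
print(watchdog.check())   # True: more than 1200 s without an update
watchdog.record_update()
print(watchdog.check())   # False: an update arrived, failsafe cleared
```

Using a monotonic clock rather than wall-clock time is a design choice in this sketch: it keeps the timeout unaffected by system clock adjustments.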
### 4. Rate of Change Limiting
#### Purpose
Prevents sudden speed changes that could damage pumps or cause operational issues.
#### Implementation
```python
def check_rate_of_change(self, previous_setpoint: float, new_setpoint: float) -> bool:
    """Check if rate of change exceeds limits."""
    # Scaling by 60 treats the difference as a per-second change, i.e. it
    # assumes setpoints are evaluated once per second.
    change_per_minute = abs(new_setpoint - previous_setpoint) * 60
    return change_per_minute > self.max_speed_change_hz_per_min

def limit_rate_of_change(self, previous_setpoint: float, new_setpoint: float) -> float:
    """Limit setpoint change to safe rate."""
    max_change = self.max_speed_change_hz_per_min / 60  # Convert to per-second
    if new_setpoint > previous_setpoint:
        return min(new_setpoint, previous_setpoint + max_change)
    else:
        return max(new_setpoint, previous_setpoint - max_change)
```
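The per-second conversion above only holds when setpoints arrive once per second. If the update interval varies, one possible variant scales the allowed change by the elapsed time. The function below is a sketch of that idea; the name `limit_rate_by_elapsed` and the `elapsed_seconds` parameter are assumptions and not part of the documented implementation.

```python
def limit_rate_by_elapsed(
    previous_hz: float,
    requested_hz: float,
    max_change_hz_per_min: float,
    elapsed_seconds: float,
) -> float:
    """Limit a setpoint change to what the rate limit allows over elapsed_seconds."""
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must not be negative")
    # Allowed change grows linearly with the time since the previous setpoint.
    max_change = max_change_hz_per_min * elapsed_seconds / 60.0
    delta = requested_hz - previous_hz
    if delta > max_change:
        return previous_hz + max_change
    if delta < -max_change:
        return previous_hz - max_change
    return requested_hz


# With a 30 Hz/min limit, 10 seconds allow a change of 5 Hz.
print(limit_rate_by_elapsed(30.0, 45.0, 30.0, 10.0))  # 35.0
```

With `elapsed_seconds=1.0` this reduces to the fixed per-second step used in `limit_rate_of_change`.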
## Safety Configuration
### Database Schema for Safety Limits
```sql
-- Safety limits table
CREATE TABLE safety_limits (
    station_id VARCHAR(50) NOT NULL,
    pump_id VARCHAR(50) NOT NULL,
    hard_min_speed_hz DECIMAL(5,2) NOT NULL,
    hard_max_speed_hz DECIMAL(5,2) NOT NULL,
    hard_min_level_m DECIMAL(6,2),
    hard_max_level_m DECIMAL(6,2),
    hard_max_power_kw DECIMAL(8,2),
    max_speed_change_hz_per_min DECIMAL(5,2) NOT NULL,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (station_id, pump_id)
);

-- Emergency stop status table
CREATE TABLE emergency_stop_status (
    station_id VARCHAR(50) NOT NULL,
    pump_id VARCHAR(50),  -- NULL for a station-level stop
    active BOOLEAN NOT NULL DEFAULT FALSE,
    activated_at TIMESTAMP,
    activated_by VARCHAR(100),
    reason TEXT
);

-- A PRIMARY KEY cannot contain an expression or a nullable column, so one row
-- per station/pump is enforced with a unique expression index instead
-- (PostgreSQL-style syntax; the index name is illustrative).
CREATE UNIQUE INDEX emergency_stop_status_key
    ON emergency_stop_status (station_id, (COALESCE(pump_id, 'STATION')));
```
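As a hedged illustration of how the limits might be read back, the sketch below creates a reduced copy of `safety_limits` in an in-memory SQLite database and looks up one pump. SQLite is used only because it ships with Python; the production database, the `load_speed_limits` helper, and the `pump_001` identifier are assumptions.

```python
import sqlite3
from typing import Optional, Tuple

# Reduced copy of the safety_limits table, keeping only the speed columns.
SCHEMA = """
CREATE TABLE safety_limits (
    station_id VARCHAR(50) NOT NULL,
    pump_id VARCHAR(50) NOT NULL,
    hard_min_speed_hz DECIMAL(5,2) NOT NULL,
    hard_max_speed_hz DECIMAL(5,2) NOT NULL,
    max_speed_change_hz_per_min DECIMAL(5,2) NOT NULL,
    PRIMARY KEY (station_id, pump_id)
)
"""


def load_speed_limits(
    conn: sqlite3.Connection, station_id: str, pump_id: str
) -> Optional[Tuple[float, float, float]]:
    """Return (min Hz, max Hz, max Hz/min) for a pump, or None if no row exists."""
    row = conn.execute(
        "SELECT hard_min_speed_hz, hard_max_speed_hz, max_speed_change_hz_per_min "
        "FROM safety_limits WHERE station_id = ? AND pump_id = ?",
        (station_id, pump_id),
    ).fetchone()
    if row is None:
        return None
    # Convert explicitly: SQLite may hand back whole numbers as int.
    return float(row[0]), float(row[1]), float(row[2])


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT INTO safety_limits VALUES (?, ?, ?, ?, ?)",
    ("station_001", "pump_001", 25.0, 48.0, 30.0),
)
print(load_speed_limits(conn, "station_001", "pump_001"))
print(load_speed_limits(conn, "station_001", "missing_pump"))
```

Returning `None` for a missing row forces the caller to decide what to do when a pump has no configured limits, rather than silently running without them.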
### Configuration Parameters
#### Safety Limits Configuration
```yaml
safety_limits:
  default_hard_min_speed_hz: 20.0
  default_hard_max_speed_hz: 50.0
  default_max_speed_change_hz_per_min: 30.0
  # Per-station overrides
  station_overrides:
    station_001:
      hard_min_speed_hz: 25.0
      hard_max_speed_hz: 48.0
    station_002:
      hard_min_speed_hz: 22.0
      hard_max_speed_hz: 52.0
```
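One plausible way to combine the defaults with a station's overrides is sketched below. The merge rule (an override replaces the default with the same name, and anything not overridden falls back to the default) is an assumption, as are the `resolve_limits` helper and the use of a plain dictionary in place of the parsed YAML.

```python
from typing import Any, Dict

# Dictionary mirroring the safety_limits section of the YAML above.
SAFETY_LIMITS: Dict[str, Any] = {
    "default_hard_min_speed_hz": 20.0,
    "default_hard_max_speed_hz": 50.0,
    "default_max_speed_change_hz_per_min": 30.0,
    "station_overrides": {
        "station_001": {"hard_min_speed_hz": 25.0, "hard_max_speed_hz": 48.0},
        "station_002": {"hard_min_speed_hz": 22.0, "hard_max_speed_hz": 52.0},
    },
}


def resolve_limits(config: Dict[str, Any], station_id: str) -> Dict[str, float]:
    """Start from the default_* values, then apply the station's overrides."""
    prefix = "default_"
    limits = {
        key[len(prefix):]: value
        for key, value in config.items()
        if key.startswith(prefix)
    }
    limits.update(config.get("station_overrides", {}).get(station_id, {}))
    if limits["hard_min_speed_hz"] >= limits["hard_max_speed_hz"]:
        raise ValueError(f"min speed must be below max speed for {station_id}")
    return limits


print(resolve_limits(SAFETY_LIMITS, "station_001"))
print(resolve_limits(SAFETY_LIMITS, "station_999"))  # no override: defaults apply
```

Validating the merged result at load time catches an inverted range before it ever reaches the enforcer.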
#### Watchdog Configuration
```yaml
watchdog:
  timeout_seconds: 1200  # 20 minutes
  check_interval_seconds: 60
  failsafe_setpoints:
    default_speed_hz: 30.0
    station_overrides:
      station_001: 35.0
      station_002: 28.0
```
## Safety Procedures
### Emergency Stop Procedures
#### Activation Procedure
1. **Operator Action**:
   - Access emergency stop control via REST API or dashboard
   - Select station and/or specific pump
   - Provide reason for emergency stop
   - Confirm activation
2. **System Response**:
   - Immediate setpoint override to 0 Hz
   - Audit log entry with timestamp and operator
   - Alert notification to configured channels
   - Safety status update in all protocol servers
#### Clearance Procedure
1. **Operator Action**:
   - Access emergency stop control
   - Verify safe conditions for restart
   - Clear emergency stop condition
   - Confirm clearance
2. **System Response**:
   - Resume normal setpoint calculation
   - Audit log entry for clearance
   - Alert notification of system restoration
   - Safety status update
### Failsafe Mode Activation
#### Automatic Activation Conditions
1. **Database Connectivity Loss**:
   - Watchdog timeout exceeded
   - No successful database updates
   - Automatic failsafe activation
2. **Safety Framework Failure**:
   - Safety limit enforcer unresponsive
   - Emergency stop manager failure
   - Component health check failures
#### Failsafe Behavior
- **Default Setpoints**: Pre-configured safe setpoints
- **Limited Functionality**: Basic operational mode
- **Alert Generation**: Immediate notification of failsafe activation
- **Auto-Recovery**: Automatic return to normal operation when safe
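The switch between normal and failsafe setpoints might look like the sketch below. The values are the `failsafe_setpoints` from the watchdog configuration; the helper names `failsafe_setpoint_hz` and `select_setpoint_hz` are hypothetical.

```python
from typing import Any, Dict

# Values taken from the failsafe_setpoints section of the watchdog configuration.
FAILSAFE_SETPOINTS: Dict[str, Any] = {
    "default_speed_hz": 30.0,
    "station_overrides": {"station_001": 35.0, "station_002": 28.0},
}


def failsafe_setpoint_hz(station_id: str, config: Dict[str, Any] = FAILSAFE_SETPOINTS) -> float:
    """Pre-configured safe speed for a station, falling back to the default."""
    return config["station_overrides"].get(station_id, config["default_speed_hz"])


def select_setpoint_hz(station_id: str, optimized_hz: float, failsafe_active: bool) -> float:
    """Use the failsafe speed while failsafe mode is active, else the optimized one."""
    if failsafe_active:
        return failsafe_setpoint_hz(station_id)
    return optimized_hz


print(select_setpoint_hz("station_001", 42.0, failsafe_active=True))   # 35.0
print(select_setpoint_hz("station_001", 42.0, failsafe_active=False))  # 42.0
```

Whichever value is selected would still be expected to pass through the safety limit enforcer before it is released.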
## Safety Testing & Validation
### Unit Testing
```python
class TestSafetyFramework:
    def test_emergency_stop_override(self):
        """Test that emergency stop overrides all other controls."""

    def test_speed_limit_enforcement(self):
        """Test that speed limits are properly enforced."""

    def test_rate_of_change_limiting(self):
        """Test that rate of change limits are enforced."""

    def test_failsafe_activation(self):
        """Test failsafe mode activation on watchdog timeout."""
```
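The stubs above carry no assertions. As a sketch of how one of them could be filled in, the example below tests speed clamping in the pytest style of plain `assert` statements. The `clamp_speed` function is a hypothetical stand-in for the enforcer under test, with the 20-50 Hz station limits used as defaults.

```python
def clamp_speed(setpoint_hz: float, min_hz: float = 20.0, max_hz: float = 50.0) -> float:
    """Stand-in for the enforcer under test (hypothetical)."""
    return max(min_hz, min(max_hz, setpoint_hz))


class TestSpeedLimitEnforcement:
    def test_below_minimum_is_raised_to_minimum(self):
        assert clamp_speed(10.0) == 20.0

    def test_above_maximum_is_lowered_to_maximum(self):
        assert clamp_speed(60.0) == 50.0

    def test_value_in_range_is_unchanged(self):
        assert clamp_speed(35.5) == 35.5

    def test_boundaries_are_accepted(self):
        # The limits themselves are valid setpoints and must not be altered.
        assert clamp_speed(20.0) == 20.0
        assert clamp_speed(50.0) == 50.0
```

Covering both boundaries explicitly guards against off-by-one style mistakes such as using `<=` where `<` was intended.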
### Integration Testing
```python
class TestSafetyIntegration:
    def test_end_to_end_safety_workflow(self):
        """Test complete safety workflow from optimization to SCADA."""

    def test_emergency_stop_integration(self):
        """Test emergency stop integration with all components."""

    def test_watchdog_integration(self):
        """Test watchdog integration with alert system."""
```
### Validation Procedures
#### Safety Validation Checklist
- [ ] All setpoints pass through safety enforcer
- [ ] Emergency stop overrides all controls
- [ ] Rate of change limits are enforced
- [ ] Failsafe mode activates on connectivity loss
- [ ] Audit logging captures all safety events
- [ ] Alert system notifies on safety violations
#### Performance Validation
- **Response Time**: Safety enforcement < 10ms per setpoint
- **Emergency Stop**: Immediate activation (< 100ms)
- **Watchdog**: Connectivity loss detected within the configured `timeout_seconds`
- **Recovery**: Graceful recovery from failure conditions
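The response time target could be checked with a small timing harness such as the sketch below. The helper `worst_case_latency_ms` and the trivial clamp used as the measured function are assumptions; a real measurement would call the actual enforcer on representative hardware.

```python
import time
from typing import Callable


def worst_case_latency_ms(func: Callable[[], object], iterations: int = 1000) -> float:
    """Run func repeatedly and return the slowest single call in milliseconds."""
    worst = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        worst = max(worst, time.perf_counter() - start)
    return worst * 1000.0


def sample_enforcement() -> float:
    """Stand-in for one enforcement call: clamp 60 Hz to the 20-50 Hz range."""
    return max(20.0, min(50.0, 60.0))


latency_ms = worst_case_latency_ms(sample_enforcement)
budget_ms = 10.0  # target from the Response Time item above
print(f"worst case: {latency_ms:.4f} ms, within budget: {latency_ms < budget_ms}")
```

Reporting the worst observed call rather than the mean is deliberate here, since a safety budget is about the slowest case and an average can hide occasional long pauses.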
## Safety Compliance & Certification
### Regulatory Compliance
#### IEC 61508 / IEC 61511
- **Safety Integrity Level (SIL)**: Designed for SIL 2 requirements
- **Fault Tolerance**: Redundant safety mechanisms
- **Failure Analysis**: Comprehensive failure mode analysis
- **Safety Validation**: Rigorous testing and validation
#### Industry Standards
- **Water/Wastewater**: Compliance with industry safety standards
- **Municipal Operations**: Alignment with municipal safety requirements
- **Equipment Protection**: Protection of pump and motor equipment
### Safety Certification Process
#### Documentation Requirements
- Safety Requirements Specification (SRS)
- Safety Manual
- Validation Test Reports
- Safety Case Documentation
#### Testing & Validation
- Safety Function Testing
- Failure Mode Testing
- Integration Testing
- Operational Testing
## Safety Monitoring & Reporting
### Real-Time Safety Monitoring
#### Safety Status Dashboard
- Current safety limits for each pump
- Emergency stop status
- Rate of change monitoring
- Watchdog status
- Safety violation history
#### Safety Metrics
- Safety enforcement statistics
- Emergency stop activations
- Rate of change violations
- Failsafe mode activations
- Response time metrics
### Safety Reporting
#### Daily Safety Reports
- Safety violations summary
- Emergency stop events
- System health status
- Compliance metrics
#### Compliance Reports
- Safety framework performance
- Regulatory compliance status
- Certification maintenance
- Audit trail verification
## Incident Response & Recovery
### Safety Incident Response
#### Incident Classification
- **Critical**: Equipment damage risk or safety hazard
- **Major**: Operational impact or safety violation
- **Minor**: Safety system warnings or alerts
#### Response Procedures
1. **Immediate Action**: Activate emergency stop if required
2. **Investigation**: Analyze safety violation details
3. **Correction**: Implement corrective actions
4. **Documentation**: Complete incident report
5. **Prevention**: Update safety procedures if needed
### System Recovery
#### Recovery Procedures
- Verify safety system integrity
- Clear emergency stop conditions
- Resume normal operations
- Monitor system performance
- Validate safety enforcement
---
*This safety framework documentation provides comprehensive guidance on the safety mechanisms, procedures, and compliance requirements for the Calejo Control Adapter. All safety-critical operations must follow these documented procedures.*