Calejo Control Adapter - Safety Framework

Overview

The Calejo Control Adapter implements a multi-layer safety framework designed to prevent equipment damage and operational hazards, and to keep pump station operation reliable under all conditions, including system failures, communication loss, and cyber attacks.

Safety Philosophy: "Safety First" - All setpoints must pass through safety enforcement before reaching SCADA systems.

Multi-Layer Safety Architecture

Three-Layer Safety Model

┌─────────────────────────────────────────────────────────┐
│  Layer 3: Optimization Constraints (Calejo Optimize)    │
│  - Economic optimization bounds: 25-45 Hz               │
│  - Energy efficiency constraints                        │
│  - Production optimization limits                       │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│  Layer 2: Station Safety Limits (Control Adapter)       │
│  - Database-enforced limits: 20-50 Hz                   │
│  - Rate of change limiting                              │
│  - Emergency stop integration                           │
│  - Failsafe mechanisms                                  │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│  Layer 1: Physical Hard Limits (PLC/VFD)                │
│  - Hardware-enforced limits: 15-55 Hz                   │
│  - Physical safety mechanisms                           │
│  - Equipment protection                                 │
└─────────────────────────────────────────────────────────┘
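
The three ranges in the diagram are nested: each layer's band sits inside the band of the layer below it. A minimal sketch of a sanity check for that property follows. The `LAYERS` table and the `layers_are_nested` helper are hypothetical names used for illustration, not part of the adapter.

```python
from typing import List, Tuple

# (name, min_hz, max_hz), ordered from outermost (hardware) to innermost.
# The values are the ranges quoted in the diagram above.
LAYERS: List[Tuple[str, float, float]] = [
    ("physical", 15.0, 55.0),
    ("station", 20.0, 50.0),
    ("optimization", 25.0, 45.0),
]


def layers_are_nested(layers: List[Tuple[str, float, float]]) -> bool:
    """Return True if every layer's range lies within the previous layer's range."""
    for (_, outer_lo, outer_hi), (_, inner_lo, inner_hi) in zip(layers, layers[1:]):
        if not (outer_lo <= inner_lo <= inner_hi <= outer_hi):
            return False
    return True
```

A check like this could run at startup so that a misconfigured inner layer (for example, a station limit wider than the hardware limit) is caught before any setpoint is served.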

Safety Components

1. Safety Limit Enforcer (src/core/safety.py)

Purpose

The Safety Limit Enforcer is the LAST line of defense before setpoints are exposed to SCADA systems. ALL setpoints MUST pass through this enforcer.

Key Features

  • Multi-Layer Limit Enforcement:

    • Hard operational limits (speed, level, power, flow)
    • Rate of change limiting
    • Emergency stop integration
    • Failsafe mode activation
  • Safety Limit Types:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SafetyLimits:
        hard_min_speed_hz: float           # Minimum speed limit (Hz)
        hard_max_speed_hz: float           # Maximum speed limit (Hz)
        hard_min_level_m: Optional[float]  # Minimum level limit (meters)
        hard_max_level_m: Optional[float]  # Maximum level limit (meters)
        hard_max_power_kw: Optional[float] # Maximum power limit (kW)
        max_speed_change_hz_per_min: float # Rate of change limit (Hz per minute)
    

Enforcement Process

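from typing import List, Tuple

def enforce_setpoint(station_id: str, pump_id: str, setpoint: float) -> Tuple[float, List[str]]:
    """
    Enforce safety limits on setpoint.
    
    Returns:
        Tuple of (enforced_setpoint, violations)
        - enforced_setpoint: Safe setpoint (clamped if necessary)
        - violations: List of safety violations (for logging/alerting)
    """
    # Note: emergency_stop_active, the hard_* limits and previous_setpoint
    # stand for per-pump state; how they are looked up is omitted here.

    # Start from the requested value so that an in-range setpoint passes through
    enforced_setpoint = setpoint
    violations: List[str] = []
    
    # 1. Check emergency stop first (highest priority)
    if emergency_stop_active:
        return (0.0, ["EMERGENCY_STOP_ACTIVE"])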
    
    # 2. Enforce hard speed limits
    if setpoint < hard_min_speed_hz:
        enforced_setpoint = hard_min_speed_hz
        violations.append("BELOW_MIN_SPEED")
    elif setpoint > hard_max_speed_hz:
        enforced_setpoint = hard_max_speed_hz
        violations.append("ABOVE_MAX_SPEED")
    
    # 3. Enforce rate of change limits
    rate_violation = check_rate_of_change(previous_setpoint, enforced_setpoint)
    if rate_violation:
        enforced_setpoint = limit_rate_of_change(previous_setpoint, enforced_setpoint)
        violations.append("RATE_OF_CHANGE_VIOLATION")
    
    # 4. Return safe setpoint
    return (enforced_setpoint, violations)
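
The enforcement order above can be exercised in isolation. The following is a self-contained sketch, not the adapter's actual implementation: `SpeedLimits`, `enforce`, and the per-call `max_step_hz` parameter are assumptions introduced here so that the example runs without a database or pump state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpeedLimits:
    hard_min_speed_hz: float
    hard_max_speed_hz: float
    max_step_hz: float  # assumption: largest allowed change per call


def enforce(setpoint: float, limits: SpeedLimits,
            previous_setpoint: Optional[float] = None,
            emergency_stop_active: bool = False) -> Tuple[float, List[str]]:
    """Apply the enforcement steps in order and report what was changed."""
    # 1. Emergency stop wins over everything else.
    if emergency_stop_active:
        return 0.0, ["EMERGENCY_STOP_ACTIVE"]

    violations: List[str] = []
    enforced = setpoint

    # 2. Clamp to the hard speed limits.
    if enforced < limits.hard_min_speed_hz:
        enforced = limits.hard_min_speed_hz
        violations.append("BELOW_MIN_SPEED")
    elif enforced > limits.hard_max_speed_hz:
        enforced = limits.hard_max_speed_hz
        violations.append("ABOVE_MAX_SPEED")

    # 3. Limit the step relative to the previous setpoint, if one is known.
    if previous_setpoint is not None:
        step = enforced - previous_setpoint
        if abs(step) > limits.max_step_hz:
            direction = 1.0 if step > 0 else -1.0
            enforced = previous_setpoint + direction * limits.max_step_hz
            violations.append("RATE_OF_CHANGE_VIOLATION")

    return enforced, violations
```

With `SpeedLimits(20.0, 50.0, 0.5)`, a request of 60 Hz with no previous setpoint returns `(50.0, ['ABOVE_MAX_SPEED'])`. One interaction is worth checking in any real implementation: because rate limiting runs after clamping, a previous setpoint outside the band (for example 0 Hz after an emergency stop) can pull the result back below `hard_min_speed_hz`. How the adapter handles that restart case is not specified in this document.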

2. Emergency Stop Manager (src/core/emergency_stop.py)

Purpose

Provides a manual override for emergency situations. An active emergency stop takes priority over all other controls.

Emergency Stop Levels

  1. Station-Level Emergency Stop:

    • Stops all pumps in a station
    • Activated by station operators
    • Requires manual reset
  2. Pump-Level Emergency Stop:

    • Stops individual pumps
    • Activated for specific equipment issues
    • Individual reset capability

Emergency Stop Features

  • Immediate Action: Setpoints are forced to 0 Hz as soon as the stop is activated
  • Audit Logging: All emergency operations logged
  • Manual Reset: Requires explicit operator action to clear
  • Status Monitoring: Real-time emergency stop status
  • Integration: Seamless integration with safety framework

Emergency Stop API

class EmergencyStopManager:
    def activate_emergency_stop(self, station_id: str, pump_id: Optional[str] = None):
        """Activate emergency stop for station or specific pump."""
        
    def clear_emergency_stop(self, station_id: str, pump_id: Optional[str] = None):
        """Clear emergency stop condition."""
        
    def is_emergency_stop_active(self, station_id: str, pump_id: Optional[str] = None) -> bool:
        """Check if emergency stop is active."""

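To illustrate how the station-level and pump-level stops might interact, here is a minimal in-memory sketch with the same method names as the API above. The class name `InMemoryEmergencyStops`, the set-based storage, and the rule that clearing a station-level stop leaves pump-level stops in place are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple


class InMemoryEmergencyStops:
    """Tracks active stops as (station_id, pump_id) pairs.

    A pump_id of None denotes a station-level stop.
    """

    def __init__(self) -> None:
        self._active: Set[Tuple[str, Optional[str]]] = set()

    def activate_emergency_stop(self, station_id: str, pump_id: Optional[str] = None) -> None:
        self._active.add((station_id, pump_id))

    def clear_emergency_stop(self, station_id: str, pump_id: Optional[str] = None) -> None:
        # Clears only the exact stop named; other stops stay active (assumption).
        self._active.discard((station_id, pump_id))

    def is_emergency_stop_active(self, station_id: str, pump_id: Optional[str] = None) -> bool:
        # A station-level stop covers every pump in that station.
        if (station_id, None) in self._active:
            return True
        return pump_id is not None and (station_id, pump_id) in self._active
```

Keeping the two levels as separate entries means a pump that was stopped for an equipment fault stays stopped when an unrelated station-level stop is cleared.
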
3. Database Watchdog (src/monitoring/watchdog.py)

Purpose

Monitors database connectivity and activates failsafe mode if updates stop, so that stale or unsafe setpoints are not served.

Watchdog Features

  • Periodic Health Checks: Continuous database connectivity monitoring
  • Failsafe Activation: Automatic activation on connectivity loss
  • Graceful Degradation: Safe fallback to default setpoints
  • Alert Generation: Immediate notification on watchdog activation
  • Auto-Recovery: Automatic recovery when connectivity is restored

Watchdog Configuration

class DatabaseWatchdog:
    def __init__(self, db_client, alert_manager, timeout_seconds: int):
        """
        Args:
            timeout_seconds: Time without updates before failsafe activation
        """

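The timeout logic can be sketched independently of the database client and alert manager. The class below is a hypothetical stand-in (`UpdateWatchdog`, `record_update`, and `failsafe_required` are names invented here). It takes an injectable clock so that the timeout can be tested without waiting.

```python
import time
from typing import Callable


class UpdateWatchdog:
    """Reports when no update has been recorded for longer than the timeout."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_update = clock()

    def record_update(self) -> None:
        """Call whenever a fresh update is read successfully."""
        self._last_update = self._clock()

    def failsafe_required(self) -> bool:
        """True once the time since the last update exceeds the timeout."""
        return self._clock() - self._last_update > self._timeout
```

`time.monotonic` is used as the default clock because it is not affected by wall-clock adjustments, which matters for a timer that guards a safety fallback.
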
4. Rate of Change Limiting

Purpose

Prevents sudden speed changes that could damage pumps or cause operational issues.

Implementation

def check_rate_of_change(self, previous_setpoint: float, new_setpoint: float) -> bool:
    """Check if rate of change exceeds limits.

    The factor of 60 treats the difference between two setpoints as a
    per-second change, i.e. it assumes one setpoint evaluation per second.
    """
    change_per_minute = abs(new_setpoint - previous_setpoint) * 60
    return change_per_minute > self.max_speed_change_hz_per_min

def limit_rate_of_change(self, previous_setpoint: float, new_setpoint: float) -> float:
    """Limit setpoint change to safe rate (same one-evaluation-per-second assumption)."""
    max_change = self.max_speed_change_hz_per_min / 60  # Convert to per-second
    if new_setpoint > previous_setpoint:
        return min(new_setpoint, previous_setpoint + max_change)
    else:
        return max(new_setpoint, previous_setpoint - max_change)
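
When setpoints are not evaluated exactly once per second, the allowed step has to scale with the time that actually elapsed. The function below is a sketch of that variant. `limit_step` and its `elapsed_s` parameter are hypothetical and not taken from `src/core/safety.py`.

```python
def limit_step(previous_hz: float, requested_hz: float,
               max_change_hz_per_min: float, elapsed_s: float) -> float:
    """Clamp requested_hz so the change over elapsed_s stays within the limit."""
    # Allowed change grows linearly with the elapsed time.
    max_change = max_change_hz_per_min * elapsed_s / 60.0
    low = previous_hz - max_change
    high = previous_hz + max_change
    return min(max(requested_hz, low), high)
```

With the default limit of 30 Hz per minute, a jump from 30 Hz to 40 Hz is cut to 30.5 Hz after one second and to 35.0 Hz after ten seconds, while a small change such as 30 Hz to 30.2 Hz passes through unchanged.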

Safety Configuration

Database Schema for Safety Limits

-- Safety limits table
CREATE TABLE safety_limits (
    station_id VARCHAR(50) NOT NULL,
    pump_id VARCHAR(50) NOT NULL,
    hard_min_speed_hz DECIMAL(5,2) NOT NULL,
    hard_max_speed_hz DECIMAL(5,2) NOT NULL,
    hard_min_level_m DECIMAL(6,2),
    hard_max_level_m DECIMAL(6,2),
    hard_max_power_kw DECIMAL(8,2),
    max_speed_change_hz_per_min DECIMAL(5,2) NOT NULL,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (station_id, pump_id)
);

-- Emergency stop status table
CREATE TABLE emergency_stop_status (
    station_id VARCHAR(50) NOT NULL,
    pump_id VARCHAR(50),
    active BOOLEAN NOT NULL DEFAULT FALSE,
    activated_at TIMESTAMP,
    activated_by VARCHAR(100),
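    reason TEXT
);

-- A PRIMARY KEY cannot contain an expression or a nullable column, so
-- uniqueness (with pump_id NULL meaning the station-level row) is enforced
-- with a unique expression index instead (PostgreSQL syntax shown).
CREATE UNIQUE INDEX emergency_stop_status_key
    ON emergency_stop_status (station_id, COALESCE(pump_id, 'STATION'));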
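
The intent of keying on `COALESCE(pump_id, 'STATION')` is that each station has at most one station-level row (where `pump_id` is NULL) and at most one row per pump. The sketch below demonstrates that behavior with a unique expression index in an in-memory SQLite database. SQLite, the reduced column set, the index name, and the `insert_stop` helper are assumptions chosen so the example runs without a database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emergency_stop_status (
    station_id VARCHAR(50) NOT NULL,
    pump_id VARCHAR(50),
    active BOOLEAN NOT NULL DEFAULT 0
);
CREATE UNIQUE INDEX emergency_stop_status_key
    ON emergency_stop_status (station_id, COALESCE(pump_id, 'STATION'));
""")


def insert_stop(station_id, pump_id):
    """Insert an active stop row; return False if the key already exists."""
    try:
        conn.execute(
            "INSERT INTO emergency_stop_status (station_id, pump_id, active) "
            "VALUES (?, ?, 1)",
            (station_id, pump_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

A plain `UNIQUE (station_id, pump_id)` constraint would not be enough here, because NULL values are treated as distinct and several station-level rows could coexist. The sentinel approach has one caveat: a pump whose identifier is literally `STATION` would collide with the station-level row.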

Configuration Parameters

Safety Limits Configuration

safety_limits:
  default_hard_min_speed_hz: 20.0
  default_hard_max_speed_hz: 50.0
  default_max_speed_change_hz_per_min: 30.0
  
  # Per-station overrides
  station_overrides:
    station_001:
      hard_min_speed_hz: 25.0
      hard_max_speed_hz: 48.0
    station_002:
      hard_min_speed_hz: 22.0
      hard_max_speed_hz: 52.0
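
One way to read this configuration is that the `default_*` values apply to every station and a `station_overrides` entry replaces only the keys it names. The sketch below implements that reading. The `CONFIG` dictionary mirrors the YAML above, and `resolve_limits` is a hypothetical helper.

```python
from typing import Any, Dict

CONFIG: Dict[str, Any] = {
    "default_hard_min_speed_hz": 20.0,
    "default_hard_max_speed_hz": 50.0,
    "default_max_speed_change_hz_per_min": 30.0,
    "station_overrides": {
        "station_001": {"hard_min_speed_hz": 25.0, "hard_max_speed_hz": 48.0},
        "station_002": {"hard_min_speed_hz": 22.0, "hard_max_speed_hz": 52.0},
    },
}


def resolve_limits(config: Dict[str, Any], station_id: str) -> Dict[str, float]:
    """Start from the default_* values, then apply any per-station override."""
    limits = {
        key[len("default_"):]: value
        for key, value in config.items()
        if key.startswith("default_")
    }
    limits.update(config.get("station_overrides", {}).get(station_id, {}))
    return limits
```

Under this reading, `station_001` resolves to 25.0 to 48.0 Hz and keeps the default rate limit of 30.0 Hz per minute, while a station without an override gets all three defaults.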

Watchdog Configuration

watchdog:
  timeout_seconds: 1200  # 20 minutes
  check_interval_seconds: 60
  failsafe_setpoints:
    default_speed_hz: 30.0
    station_overrides:
      station_001: 35.0
      station_002: 28.0
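
The sketch below shows how these values might be consumed: picking the failsafe speed for a station and estimating how long a connectivity loss can go unnoticed. Both helpers are hypothetical, and the detection estimate assumes the watchdog polls once every `check_interval_seconds`.

```python
from typing import Any, Dict

WATCHDOG: Dict[str, Any] = {
    "timeout_seconds": 1200,
    "check_interval_seconds": 60,
    "failsafe_setpoints": {
        "default_speed_hz": 30.0,
        "station_overrides": {"station_001": 35.0, "station_002": 28.0},
    },
}


def failsafe_speed_hz(watchdog: Dict[str, Any], station_id: str) -> float:
    """Return the station's failsafe speed, falling back to the default."""
    setpoints = watchdog["failsafe_setpoints"]
    overrides = setpoints.get("station_overrides", {})
    return overrides.get(station_id, setpoints["default_speed_hz"])


def worst_case_detection_s(watchdog: Dict[str, Any]) -> int:
    """Timeout plus one polling interval (assumes periodic polling)."""
    return watchdog["timeout_seconds"] + watchdog["check_interval_seconds"]
```

With the values above, the worst case under the polling assumption is 1260 seconds (21 minutes) between the last good update and failsafe activation.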

Safety Procedures

Emergency Stop Procedures

Activation Procedure

  1. Operator Action:

    • Access emergency stop control via REST API or dashboard
    • Select station and/or specific pump
    • Provide reason for emergency stop
    • Confirm activation
  2. System Response:

    • Immediate setpoint override to 0 Hz
    • Audit log entry with timestamp and operator
    • Alert notification to configured channels
    • Safety status update in all protocol servers

Clearance Procedure

  1. Operator Action:

    • Access emergency stop control
    • Verify safe conditions for restart
    • Clear emergency stop condition
    • Confirm clearance
  2. System Response:

    • Resume normal setpoint calculation
    • Audit log entry for clearance
    • Alert notification of system restoration
    • Safety status update
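
Both procedures end with an audit log entry. A minimal sketch of what such an entry could carry is shown below. `AuditEntry` and `AuditLog` are hypothetical names, the fields loosely follow the `emergency_stop_status` columns, and requiring a reason only on activation is an assumption drawn from the activation steps above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime
    action: str              # "ACTIVATE" or "CLEAR"
    station_id: str
    pump_id: Optional[str]   # None for a station-level stop
    operator: str
    reason: str


class AuditLog:
    def __init__(self) -> None:
        self.entries: List[AuditEntry] = []

    def record(self, action: str, station_id: str, pump_id: Optional[str],
               operator: str, reason: str = "") -> AuditEntry:
        """Append an entry; activation without a reason is rejected."""
        if action == "ACTIVATE" and not reason.strip():
            raise ValueError("a reason is required to activate an emergency stop")
        entry = AuditEntry(datetime.now(timezone.utc), action,
                           station_id, pump_id, operator, reason)
        self.entries.append(entry)
        return entry
```

Entries are frozen and timestamped in UTC so that a recorded event cannot be modified in place and can be ordered consistently across stations.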

Failsafe Mode Activation

Automatic Activation Conditions

  1. Database Connectivity Loss:

    • Watchdog timeout exceeded
    • No successful database updates
    • Automatic failsafe activation
  2. Safety Framework Failure:

    • Safety limit enforcer unresponsive
    • Emergency stop manager failure
    • Component health check failures

Failsafe Behavior

  • Default Setpoints: Pre-configured safe setpoints
  • Limited Functionality: Basic operational mode
  • Alert Generation: Immediate notification of failsafe activation
  • Auto-Recovery: Automatic return to normal operation when safe

Safety Testing & Validation

Unit Testing

class TestSafetyFramework:
    def test_emergency_stop_override(self):
        """Test that emergency stop overrides all other controls."""
        
    def test_speed_limit_enforcement(self):
        """Test that speed limits are properly enforced."""
        
    def test_rate_of_change_limiting(self):
        """Test that rate of change limits are enforced."""
        
    def test_failsafe_activation(self):
        """Test failsafe mode activation on watchdog timeout."""

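The test stubs above could be filled in along the following lines. This sketch uses the standard `unittest` module and a stand-in `clamp_speed` function, since the real enforcer's constructor and dependencies are not shown in this document. Both the stand-in and the limit values passed to it are assumptions.

```python
import unittest


def clamp_speed(setpoint_hz: float, min_hz: float = 20.0, max_hz: float = 50.0,
                emergency_stop: bool = False) -> float:
    """Stand-in for the enforcer under test (hypothetical helper)."""
    if emergency_stop:
        return 0.0
    return min(max(setpoint_hz, min_hz), max_hz)


class TestSpeedLimits(unittest.TestCase):
    def test_emergency_stop_override(self):
        """An active emergency stop forces 0 Hz regardless of the request."""
        self.assertEqual(clamp_speed(40.0, emergency_stop=True), 0.0)

    def test_speed_limit_enforcement(self):
        """Requests below, inside, and above the band are handled correctly."""
        for requested, expected in [(10.0, 20.0), (35.0, 35.0), (60.0, 50.0)]:
            with self.subTest(requested=requested):
                self.assertEqual(clamp_speed(requested), expected)


def run_tests() -> bool:
    """Run the suite programmatically and report overall success."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSpeedLimits)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Covering the value just outside each limit as well as one value inside the band makes it harder for an off-by-one comparison to slip through.
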
Integration Testing

class TestSafetyIntegration:
    def test_end_to_end_safety_workflow(self):
        """Test complete safety workflow from optimization to SCADA."""
        
    def test_emergency_stop_integration(self):
        """Test emergency stop integration with all components."""
        
    def test_watchdog_integration(self):
        """Test watchdog integration with alert system."""

Validation Procedures

Safety Validation Checklist

  • All setpoints pass through safety enforcer
  • Emergency stop overrides all controls
  • Rate of change limits are enforced
  • Failsafe mode activates on connectivity loss
  • Audit logging captures all safety events
  • Alert system notifies on safety violations

Performance Validation

  • Response Time: Safety enforcement < 10ms per setpoint
  • Emergency Stop: Immediate activation (< 100ms)
  • Watchdog: Connectivity loss detected within the configured timeout_seconds (plus at most one check_interval_seconds, assuming periodic checks)
  • Recovery: Graceful recovery from failure conditions
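
The 10 ms budget for safety enforcement can be checked with a simple timing harness. The sketch below is illustrative only: `mean_latency_ms` is a hypothetical helper, and the callable passed to it would be the real enforcement call in practice.

```python
import time
from typing import Callable


def mean_latency_ms(fn: Callable[[], object], iterations: int = 1000) -> float:
    """Return the mean wall-clock time per call of fn, in milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / iterations
```

For example, `mean_latency_ms(lambda: min(max(60.0, 20.0), 50.0))` times a bare clamp. A mean hides outliers, so validation against a hard budget should also look at the maximum or a high percentile of individual call times.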

Safety Compliance & Certification

Regulatory Compliance

IEC 61508 / IEC 61511

  • Safety Integrity Level (SIL): Designed for SIL 2 requirements
  • Fault Tolerance: Redundant safety mechanisms
  • Failure Analysis: Comprehensive failure mode analysis
  • Safety Validation: Rigorous testing and validation

Industry Standards

  • Water/Wastewater: Compliance with industry safety standards
  • Municipal Operations: Alignment with municipal safety requirements
  • Equipment Protection: Protection of pump and motor equipment

Safety Certification Process

Documentation Requirements

  • Safety Requirements Specification (SRS)
  • Safety Manual
  • Validation Test Reports
  • Safety Case Documentation

Testing & Validation

  • Safety Function Testing
  • Failure Mode Testing
  • Integration Testing
  • Operational Testing

Safety Monitoring & Reporting

Real-Time Safety Monitoring

Safety Status Dashboard

  • Current safety limits for each pump
  • Emergency stop status
  • Rate of change monitoring
  • Watchdog status
  • Safety violation history

Safety Metrics

  • Safety enforcement statistics
  • Emergency stop activations
  • Rate of change violations
  • Failsafe mode activations
  • Response time metrics
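
Several of these metrics can be derived directly from the violation lists returned by the enforcer. The sketch below counts violations by type. `summarize_violations` is a hypothetical helper, and the violation names are the ones used in the enforcement process earlier in this document.

```python
from collections import Counter
from typing import Iterable, List


def summarize_violations(events: Iterable[List[str]]) -> Counter:
    """Count how often each violation type occurs across enforcement calls."""
    counts: Counter = Counter()
    for violations in events:
        counts.update(violations)
    return counts
```

Feeding it one list per enforcement call, such as `[["ABOVE_MAX_SPEED"], [], ["RATE_OF_CHANGE_VIOLATION"]]`, yields per-type totals that can back the dashboard counters or the daily report.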

Safety Reporting

Daily Safety Reports

  • Safety violations summary
  • Emergency stop events
  • System health status
  • Compliance metrics

Compliance Reports

  • Safety framework performance
  • Regulatory compliance status
  • Certification maintenance
  • Audit trail verification

Incident Response & Recovery

Safety Incident Response

Incident Classification

  • Critical: Equipment damage risk or safety hazard
  • Major: Operational impact or safety violation
  • Minor: Safety system warnings or alerts

Response Procedures

  1. Immediate Action: Activate emergency stop if required
  2. Investigation: Analyze safety violation details
  3. Correction: Implement corrective actions
  4. Documentation: Complete incident report
  5. Prevention: Update safety procedures if needed

System Recovery

Recovery Procedures

  • Verify safety system integrity
  • Clear emergency stop conditions
  • Resume normal operations
  • Monitor system performance
  • Validate safety enforcement

This safety framework documentation provides comprehensive guidance on the safety mechanisms, procedures, and compliance requirements for the Calejo Control Adapter. All safety-critical operations must follow these documented procedures.