central/docs/BUG-CADENCE-DECREASE.md

211 lines
6.7 KiB
Markdown
Raw Permalink Normal View History

2026-05-16 21:27:30 +00:00
# Bug Investigation: Cadence Decrease Hot-Reload
**Date:** 2026-05-16
**Component:** central-supervisor
**File:** `supervisor.py`
---
## 1. Reproduction
### Test Case: Decrease 60s → 30s
```
Tlast (poll completed): 04:18:24Z
Config change applied: 04:18:30Z (approx)
Expected next poll: 04:18:54Z (Tlast + 30s)
Actual next poll: 04:19:24Z (Tlast + 60s - OLD cadence)
Subsequent polls: Also at 60s intervals
```
### Log Evidence
```json
{"ts": "...", "msg": "Rescheduled adapter", "adapter": "nws", "old_cadence_s": 60, "new_cadence_s": 30, "next_poll": "2026-05-16T04:18:54+00:00"}
```
- "Rescheduled adapter" log fires with **correct** calculated next_poll
- Actual poll occurs at OLD cadence time
- Subsequent polls continue at OLD cadence
### Contrast: Increase 60s → 90s (WORKS)
```
Tlast: 03:16:34Z
Config change: 03:16:36Z
Expected next poll: 03:18:04Z (Tlast + 90s)
Actual next poll: 03:18:04Z ✅
```
---
## 2. Root Cause
### Location
`supervisor.py` lines 395-450 (`_reschedule_adapter`) and lines 144-181 (`_run_adapter_loop`)
### The Bug
The `cancel_event.set()` call in `_reschedule_adapter` does not reliably wake the `asyncio.wait_for()` in the adapter loop when the cadence is **decreased**.
### Why It Happens
1. **Event handler holds lock during signal:**
```python
# _on_config_change (line 466)
async with self._lock:
new_config = await self._config_source.get_adapter(adapter_name)
# ...
await self._reschedule_adapter(adapter_name, new_config) # sets cancel_event here
```
2. **Reschedule updates config then signals:**
```python
# _reschedule_adapter
state.config = new_config # Line 420
state.adapter.cadence_s = new_cadence # Line 423
# ... logging ...
state.cancel_event.set() # Line 450 - inside lock context
```
3. **Asyncio event delivery delay:**
The `asyncio.Event.set()` queues a wakeup for waiting tasks, but the signal delivery is subject to asyncio's task scheduler. When called from within an `async with` block, the event may not be processed until the current task yields or the lock context exits.
4. **Timing difference between increase and decrease:**
- **Increase (60→90):** Loop has ~30-50s remaining sleep. Event signal arrives well before timeout.
- **Decrease (90→60):** Loop may be ~10s from timeout. By the time event signal is processed, timeout has already fired.
5. **Why subsequent polls use old cadence:**
When the loop times out naturally (rather than being woken by event), it proceeds to poll. After poll completes, `state.last_completed_poll` is updated. The loop then reads `state.config.cadence_s` for the NEXT iteration — but if `state.config` was somehow not durably updated (or there's a stale reference), it uses the old value.
**Alternative theory:** The `state.config = new_config` assignment creates a new config object, but the loop may be reading from a captured reference to the old object if there's any closure behavior we're not seeing.
---
## 3. Proposed Fix
### Option A: Force immediate reschedule (Recommended)
Move the cancel logic OUTSIDE the lock, and use a more aggressive wake pattern:
```python
async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None:
state = self._adapter_states.get(name)
if state is None or not state.is_running:
await self._start_adapter(new_config)
return
old_cadence = state.config.cadence_s
new_cadence = new_config.cadence_s
# Update config atomically
state.config = new_config
state.adapter.cadence_s = new_cadence
# ... (NWS-specific updates, logging) ...
# Cancel and wait for acknowledgment
state.cancel_event.set()
await asyncio.sleep(0) # Force task switch to process event
```
### Option B: Stop and restart the loop task
For cadence changes, stop the current loop task and create a new one:
```python
async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None:
state = self._adapter_states.get(name)
if state is None:
await self._start_adapter(new_config)
return
# Preserve last_completed_poll
preserved_poll = state.last_completed_poll
# Stop current loop
await self._stop_adapter(name)
# Update config
state.config = new_config
state.last_completed_poll = preserved_poll
# Restart loop
await self._start_adapter(new_config)
```
### Option C: Double-signal pattern
Set the event, yield, then set again to ensure delivery:
```python
state.cancel_event.set()
await asyncio.sleep(0)
state.cancel_event.set() # Redundant but ensures visibility
```
---
## 4. Test Gap
### Missing Tests
The test file `test_config_source_new.py` only tests ConfigSource behavior (list, get, protocol compliance). There are **no tests** for:
1. `_reschedule_adapter` interrupting a sleeping loop
2. Cadence decrease being applied mid-sleep
3. Cadence increase being applied mid-sleep
4. Rate-limit guarantee after reschedule
5. `cancel_event` mechanism in isolation
### Recommended Tests
```python
@pytest.mark.asyncio
async def test_cadence_decrease_applies_immediately():
"""Cadence decrease should wake sleeping loop and reschedule."""
# Setup: Adapter polling at 60s cadence
# Action: Change cadence to 30s while sleeping
# Assert: Next poll at last_poll + 30s, not last_poll + 60s
@pytest.mark.asyncio
async def test_cadence_increase_applies_on_next_cycle():
"""Cadence increase should wake sleeping loop and extend wait."""
# Setup: Adapter polling at 60s cadence
# Action: Change cadence to 90s while sleeping
# Assert: Next poll at last_poll + 90s
@pytest.mark.asyncio
async def test_cancel_event_wakes_sleeping_loop():
"""cancel_event.set() should interrupt asyncio.wait_for()."""
# Unit test for the event mechanism in isolation
```
---
## 5. State at End
### LXC State (Reverted)
- **Cadence in DB:** 60s ✅
- **Actual poll interval:** 60s ✅
- **Supervisor restarted:** 2026-05-16T04:43:40Z
- **Verified polls:**
```
04:43:40.964 - First poll after restart
04:44:41.171 - Second poll (61s later) ✅
```
### Mitigation Until Fix
After any cadence change (especially decrease), verify actual poll intervals. If incorrect, restart supervisor:
```bash
systemctl restart central-supervisor
```
---
## Summary
| Item | Details |
|------|---------|
| **Bug** | Cadence decrease hot-reload doesn't apply without restart |
| **Root cause** | `cancel_event.set()` inside lock context has delayed delivery |
| **Affects** | Cadence decreases only; increases work correctly |
| **Workaround** | Restart supervisor after cadence decrease |
| **Fix effort** | Low - add `await asyncio.sleep(0)` after event.set() |
| **Test coverage** | None for hot-reload mechanism |