mirror of
https://github.com/zvx-echo6/central.git
synced 2026-05-21 18:14:44 +02:00
environment.md: - Documents CT104 as the active development location - Lists SSH access, repository paths, and service commands - Notes that cortex clone is parked, matt-desktop has no clones BUG-CADENCE-DECREASE.md: - Full investigation of the cadence-decrease hot-reload bug - Root cause analysis: cancel_event.set() inside lock context - Proposed fix (Option A - structural) - Test gap identification - Production verification steps Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
211 lines
6.9 KiB
Markdown
211 lines
6.9 KiB
Markdown
# Bug Investigation: Cadence Decrease Hot-Reload
|
|
|
|
**Date:** 2026-05-16
|
|
**Component:** central-supervisor
|
|
**File:** `supervisor.py`
|
|
|
|
---
|
|
|
|
## 1. Reproduction
|
|
|
|
### Test Case: Decrease 60s → 30s
|
|
```
|
|
Tlast (poll completed): 04:18:24Z
|
|
Config change applied: 04:18:30Z (approx)
|
|
Expected next poll: 04:18:54Z (Tlast + 30s)
|
|
Actual next poll: 04:19:24Z (Tlast + 60s - OLD cadence)
|
|
Subsequent polls: Also at 60s intervals
|
|
```
|
|
|
|
### Log Evidence
|
|
```json
|
|
{"ts": "...", "msg": "Rescheduled adapter", "adapter": "nws", "old_cadence_s": 60, "new_cadence_s": 30, "next_poll": "2026-05-16T04:18:54+00:00"}
|
|
```
|
|
- "Rescheduled adapter" log fires with **correct** calculated next_poll
|
|
- Actual poll occurs at OLD cadence time
|
|
- Subsequent polls continue at OLD cadence
|
|
|
|
### Contrast: Increase 60s → 90s (WORKS)
|
|
```
|
|
Tlast: 03:16:34Z
|
|
Config change: 03:16:36Z
|
|
Expected next poll: 03:18:04Z (Tlast + 90s)
|
|
Actual next poll: 03:18:04Z ✅
|
|
```
|
|
|
|
---
|
|
|
|
## 2. Root Cause
|
|
|
|
### Location
|
|
`supervisor.py` lines 395-450 (`_reschedule_adapter`) and lines 144-181 (`_run_adapter_loop`)
|
|
|
|
### The Bug
|
|
The `cancel_event.set()` call in `_reschedule_adapter` does not reliably wake the `asyncio.wait_for()` in the adapter loop when the cadence is **decreased**.
|
|
|
|
### Why It Happens
|
|
|
|
1. **Event handler holds lock during signal:**
|
|
```python
|
|
# _on_config_change (line 466)
|
|
async with self._lock:
|
|
new_config = await self._config_source.get_adapter(adapter_name)
|
|
# ...
|
|
await self._reschedule_adapter(adapter_name, new_config) # sets cancel_event here
|
|
```
|
|
|
|
2. **Reschedule updates config then signals:**
|
|
```python
|
|
# _reschedule_adapter
|
|
state.config = new_config # Line 420
|
|
state.adapter.cadence_s = new_cadence # Line 423
|
|
# ... logging ...
|
|
state.cancel_event.set() # Line 450 - inside lock context
|
|
```
|
|
|
|
3. **Asyncio event delivery delay:**
|
|
The `asyncio.Event.set()` queues a wakeup for waiting tasks, but the signal delivery is subject to asyncio's task scheduler. When called from within an `async with` block, the event may not be processed until the current task yields or the lock context exits.
|
|
|
|
4. **Timing difference between increase and decrease:**
|
|
- **Increase (60→90):** Loop has ~30-50s remaining sleep. Event signal arrives well before timeout.
|
|
- **Decrease (90→60):** Loop may be ~10s from timeout. By the time event signal is processed, timeout has already fired.
|
|
|
|
5. **Why subsequent polls use old cadence:**
|
|
When the loop times out naturally (rather than being woken by event), it proceeds to poll. After poll completes, `state.last_completed_poll` is updated. The loop then reads `state.config.cadence_s` for the NEXT iteration — but if `state.config` was somehow not durably updated (or there's a stale reference), it uses the old value.
|
|
|
|
**Alternative theory:** The `state.config = new_config` assignment creates a new config object, but the loop may be reading from a captured reference to the old object if there's any closure behavior we're not seeing.
|
|
|
|
---
|
|
|
|
## 3. Proposed Fix
|
|
|
|
### Option A: Force immediate reschedule (Recommended)
|
|
|
|
Move the cancel logic OUTSIDE the lock, and use a more aggressive wake pattern:
|
|
|
|
```python
|
|
async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None:
|
|
state = self._adapter_states.get(name)
|
|
if state is None or not state.is_running:
|
|
await self._start_adapter(new_config)
|
|
return
|
|
|
|
old_cadence = state.config.cadence_s
|
|
new_cadence = new_config.cadence_s
|
|
|
|
# Update config atomically
|
|
state.config = new_config
|
|
state.adapter.cadence_s = new_cadence
|
|
|
|
# ... (NWS-specific updates, logging) ...
|
|
|
|
# Cancel and wait for acknowledgment
|
|
state.cancel_event.set()
|
|
await asyncio.sleep(0) # Force task switch to process event
|
|
```
|
|
|
|
### Option B: Stop and restart the loop task
|
|
|
|
For cadence changes, stop the current loop task and create a new one:
|
|
|
|
```python
|
|
async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None:
|
|
state = self._adapter_states.get(name)
|
|
if state is None:
|
|
await self._start_adapter(new_config)
|
|
return
|
|
|
|
# Preserve last_completed_poll
|
|
preserved_poll = state.last_completed_poll
|
|
|
|
# Stop current loop
|
|
await self._stop_adapter(name)
|
|
|
|
# Update config
|
|
state.config = new_config
|
|
state.last_completed_poll = preserved_poll
|
|
|
|
# Restart loop
|
|
await self._start_adapter(new_config)
|
|
```
|
|
|
|
### Option C: Double-signal pattern
|
|
|
|
Set the event, yield, then set again to ensure delivery:
|
|
|
|
```python
|
|
state.cancel_event.set()
|
|
await asyncio.sleep(0)
|
|
state.cancel_event.set() # Redundant but ensures visibility
|
|
```
|
|
|
|
---
|
|
|
|
## 4. Test Gap
|
|
|
|
### Missing Tests
|
|
|
|
The test file `test_config_source_new.py` only tests ConfigSource behavior (list, get, protocol compliance). There are **no tests** for:
|
|
|
|
1. `_reschedule_adapter` interrupting a sleeping loop
|
|
2. Cadence decrease being applied mid-sleep
|
|
3. Cadence increase being applied mid-sleep
|
|
4. Rate-limit guarantee after reschedule
|
|
5. `cancel_event` mechanism in isolation
|
|
|
|
### Recommended Tests
|
|
|
|
```python
|
|
@pytest.mark.asyncio
|
|
async def test_cadence_decrease_applies_immediately():
|
|
"""Cadence decrease should wake sleeping loop and reschedule."""
|
|
# Setup: Adapter polling at 60s cadence
|
|
# Action: Change cadence to 30s while sleeping
|
|
# Assert: Next poll at last_poll + 30s, not last_poll + 60s
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_cadence_increase_applies_on_next_cycle():
|
|
"""Cadence increase should wake sleeping loop and extend wait."""
|
|
# Setup: Adapter polling at 60s cadence
|
|
# Action: Change cadence to 90s while sleeping
|
|
# Assert: Next poll at last_poll + 90s
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_cancel_event_wakes_sleeping_loop():
|
|
"""cancel_event.set() should interrupt asyncio.wait_for()."""
|
|
# Unit test for the event mechanism in isolation
|
|
```
|
|
|
|
---
|
|
|
|
## 5. State at End
|
|
|
|
### LXC State (Reverted)
|
|
- **Cadence in DB:** 60s ✅
|
|
- **Actual poll interval:** 60s ✅
|
|
- **Supervisor restarted:** 2026-05-16T04:43:40Z
|
|
- **Verified polls:**
|
|
```
|
|
04:43:40.964 - First poll after restart
|
|
04:44:41.171 - Second poll (61s later) ✅
|
|
```
|
|
|
|
### Mitigation Until Fix
|
|
After any cadence change (especially decrease), verify actual poll intervals. If incorrect, restart supervisor:
|
|
```bash
|
|
systemctl restart central-supervisor
|
|
```
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
| Item | Details |
|
|
|------|---------|
|
|
| **Bug** | Cadence decrease hot-reload doesn't apply without restart |
|
|
| **Root cause** | `cancel_event.set()` inside lock context has delayed delivery |
|
|
| **Affects** | Cadence decreases only; increases work correctly |
|
|
| **Workaround** | Restart supervisor after cadence decrease |
|
|
| **Fix effort** | Low - add `await asyncio.sleep(0)` after event.set() |
|
|
| **Test coverage** | None for hot-reload mechanism |
|
|
|