central/docs/BUG-CADENCE-DECREASE.md
Matt Johnson d210c980fd docs: add environment reference and bug investigation report
environment.md:
- Documents CT104 as the active development location
- Lists SSH access, repository paths, and service commands
- Notes that cortex clone is parked, matt-desktop has no clones

BUG-CADENCE-DECREASE.md:
- Full investigation of the cadence-decrease hot-reload bug
- Root cause analysis: cancel_event.set() inside lock context
- Proposed fix (Option A - structural)
- Test gap identification
- Production verification steps

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-05-16 05:59:53 +00:00

6.9 KiB

Bug Investigation: Cadence Decrease Hot-Reload

Date: 2026-05-16 Component: central-supervisor File: supervisor.py


1. Reproduction

Test Case: Decrease 60s → 30s

Tlast (poll completed): 04:18:24Z
Config change applied:  04:18:30Z (approx)
Expected next poll:     04:18:54Z (Tlast + 30s)
Actual next poll:       04:19:24Z (Tlast + 60s - OLD cadence)
Subsequent polls:       Also at 60s intervals

Log Evidence

{"ts": "...", "msg": "Rescheduled adapter", "adapter": "nws", "old_cadence_s": 60, "new_cadence_s": 30, "next_poll": "2026-05-16T04:18:54+00:00"}
  • "Rescheduled adapter" log fires with correct calculated next_poll
  • Actual poll occurs at OLD cadence time
  • Subsequent polls continue at OLD cadence

Contrast: Increase 60s → 90s (WORKS)

Tlast:              03:16:34Z
Config change:      03:16:36Z
Expected next poll: 03:18:04Z (Tlast + 90s)
Actual next poll:   03:18:04Z ✅

2. Root Cause

Location

supervisor.py lines 395-450 (_reschedule_adapter) and lines 144-181 (_run_adapter_loop)

The Bug

The cancel_event.set() call in _reschedule_adapter does not reliably wake the asyncio.wait_for() in the adapter loop when the cadence is decreased.

Why It Happens

  1. Event handler holds lock during signal:

    # _on_config_change (line 466)
    async with self._lock:
        new_config = await self._config_source.get_adapter(adapter_name)
        # ...
        await self._reschedule_adapter(adapter_name, new_config)  # sets cancel_event here
    
  2. Reschedule updates config then signals:

    # _reschedule_adapter
    state.config = new_config     # Line 420
    state.adapter.cadence_s = new_cadence  # Line 423
    # ... logging ...
    state.cancel_event.set()      # Line 450 - inside lock context
    
  3. Asyncio event delivery delay: The asyncio.Event.set() queues a wakeup for waiting tasks, but the signal delivery is subject to asyncio's task scheduler. When called from within an async with block, the event may not be processed until the current task yields or the lock context exits.

  4. Timing difference between increase and decrease:

    • Increase (60→90): Loop has ~30-50s remaining sleep. Event signal arrives well before timeout.
    • Decrease (90→60): Loop may be ~10s from timeout. By the time event signal is processed, timeout has already fired.
  5. Why subsequent polls use old cadence: When the loop times out naturally (rather than being woken by event), it proceeds to poll. After poll completes, state.last_completed_poll is updated. The loop then reads state.config.cadence_s for the NEXT iteration — but if state.config was somehow not durably updated (or there's a stale reference), it uses the old value.

    Alternative theory: The state.config = new_config assignment creates a new config object, but the loop may be reading from a captured reference to the old object if there's any closure behavior we're not seeing.


3. Proposed Fix

Move the cancel logic OUTSIDE the lock, and use a more aggressive wake pattern:

async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None:
    state = self._adapter_states.get(name)
    if state is None or not state.is_running:
        await self._start_adapter(new_config)
        return

    old_cadence = state.config.cadence_s
    new_cadence = new_config.cadence_s

    # Update config atomically
    state.config = new_config
    state.adapter.cadence_s = new_cadence

    # ... (NWS-specific updates, logging) ...

    # Cancel and wait for acknowledgment
    state.cancel_event.set()
    await asyncio.sleep(0)  # Force task switch to process event

Option B: Stop and restart the loop task

For cadence changes, stop the current loop task and create a new one:

async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None:
    state = self._adapter_states.get(name)
    if state is None:
        await self._start_adapter(new_config)
        return

    # Preserve last_completed_poll
    preserved_poll = state.last_completed_poll

    # Stop current loop
    await self._stop_adapter(name)

    # Update config
    state.config = new_config
    state.last_completed_poll = preserved_poll

    # Restart loop
    await self._start_adapter(new_config)

Option C: Double-signal pattern

Set the event, yield, then set again to ensure delivery:

state.cancel_event.set()
await asyncio.sleep(0)
state.cancel_event.set()  # Redundant but ensures visibility

4. Test Gap

Missing Tests

The test file test_config_source_new.py only tests ConfigSource behavior (list, get, protocol compliance). There are no tests for:

  1. _reschedule_adapter interrupting a sleeping loop
  2. Cadence decrease being applied mid-sleep
  3. Cadence increase being applied mid-sleep
  4. Rate-limit guarantee after reschedule
  5. cancel_event mechanism in isolation
@pytest.mark.asyncio
async def test_cadence_decrease_applies_immediately():
    """Cadence decrease should wake sleeping loop and reschedule."""
    # Setup: Adapter polling at 60s cadence
    # Action: Change cadence to 30s while sleeping
    # Assert: Next poll at last_poll + 30s, not last_poll + 60s

@pytest.mark.asyncio
async def test_cadence_increase_applies_on_next_cycle():
    """Cadence increase should wake sleeping loop and extend wait."""
    # Setup: Adapter polling at 60s cadence
    # Action: Change cadence to 90s while sleeping
    # Assert: Next poll at last_poll + 90s

@pytest.mark.asyncio
async def test_cancel_event_wakes_sleeping_loop():
    """cancel_event.set() should interrupt asyncio.wait_for()."""
    # Unit test for the event mechanism in isolation

5. State at End

LXC State (Reverted)

  • Cadence in DB: 60s
  • Actual poll interval: 60s
  • Supervisor restarted: 2026-05-16T04:43:40Z
  • Verified polls:
    04:43:40.964 - First poll after restart
    04:44:41.171 - Second poll (61s later) ✅
    

Mitigation Until Fix

After any cadence change (especially decrease), verify actual poll intervals. If incorrect, restart supervisor:

systemctl restart central-supervisor

Summary

Item Details
Bug Cadence decrease hot-reload doesn't apply without restart
Root cause cancel_event.set() inside lock context has delayed delivery
Affects Cadence decreases only; increases work correctly
Workaround Restart supervisor after cadence decrease
Fix effort Low - add await asyncio.sleep(0) after event.set()
Test coverage None for hot-reload mechanism