diff --git a/docs/BUG-CADENCE-DECREASE.md b/docs/BUG-CADENCE-DECREASE.md new file mode 100644 index 0000000..4d45e5f --- /dev/null +++ b/docs/BUG-CADENCE-DECREASE.md @@ -0,0 +1,211 @@ +# Bug Investigation: Cadence Decrease Hot-Reload + +**Date:** 2026-05-16 +**Component:** central-supervisor +**File:** `supervisor.py` + +--- + +## 1. Reproduction + +### Test Case: Decrease 60s → 30s +``` +Tlast (poll completed): 04:18:24Z +Config change applied: 04:18:30Z (approx) +Expected next poll: 04:18:54Z (Tlast + 30s) +Actual next poll: 04:19:24Z (Tlast + 60s - OLD cadence) +Subsequent polls: Also at 60s intervals +``` + +### Log Evidence +```json +{"ts": "...", "msg": "Rescheduled adapter", "adapter": "nws", "old_cadence_s": 60, "new_cadence_s": 30, "next_poll": "2026-05-16T04:18:54+00:00"} +``` +- "Rescheduled adapter" log fires with **correct** calculated next_poll +- Actual poll occurs at OLD cadence time +- Subsequent polls continue at OLD cadence + +### Contrast: Increase 60s → 90s (WORKS) +``` +Tlast: 03:16:34Z +Config change: 03:16:36Z +Expected next poll: 03:18:04Z (Tlast + 90s) +Actual next poll: 03:18:04Z ✅ +``` + +--- + +## 2. Root Cause + +### Location +`supervisor.py` lines 395-450 (`_reschedule_adapter`) and lines 144-181 (`_run_adapter_loop`) + +### The Bug +The `cancel_event.set()` call in `_reschedule_adapter` does not reliably wake the `asyncio.wait_for()` in the adapter loop when the cadence is **decreased**. + +### Why It Happens + +1. **Event handler holds lock during signal:** + ```python + # _on_config_change (line 466) + async with self._lock: + new_config = await self._config_source.get_adapter(adapter_name) + # ... + await self._reschedule_adapter(adapter_name, new_config) # sets cancel_event here + ``` + +2. **Reschedule updates config then signals:** + ```python + # _reschedule_adapter + state.config = new_config # Line 420 + state.adapter.cadence_s = new_cadence # Line 423 + # ... logging ... + state.cancel_event.set() # Line 450 - inside lock context + ``` + +3. **Asyncio event delivery delay:** + The `asyncio.Event.set()` queues a wakeup for waiting tasks, but the signal delivery is subject to asyncio's task scheduler. When called from within an `async with` block, the event may not be processed until the current task yields or the lock context exits. + +4. **Timing difference between increase and decrease:** + - **Increase (60→90):** Loop has ~30-50s remaining sleep. Event signal arrives well before timeout. + - **Decrease (90→60):** Loop may be ~10s from timeout. By the time event signal is processed, timeout has already fired. + +5. **Why subsequent polls use old cadence:** + When the loop times out naturally (rather than being woken by event), it proceeds to poll. After poll completes, `state.last_completed_poll` is updated. The loop then reads `state.config.cadence_s` for the NEXT iteration — but if `state.config` was somehow not durably updated (or there's a stale reference), it uses the old value. + + **Alternative theory:** The `state.config = new_config` assignment creates a new config object, but the loop may be reading from a captured reference to the old object if there's any closure behavior we're not seeing. + +--- + +## 3. Proposed Fix + +### Option A: Force immediate reschedule (Recommended) + +Move the cancel logic OUTSIDE the lock, and use a more aggressive wake pattern: + +```python +async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None: + state = self._adapter_states.get(name) + if state is None or not state.is_running: + await self._start_adapter(new_config) + return + + old_cadence = state.config.cadence_s + new_cadence = new_config.cadence_s + + # Update config atomically + state.config = new_config + state.adapter.cadence_s = new_cadence + + # ... (NWS-specific updates, logging) ... + + # Cancel and wait for acknowledgment + state.cancel_event.set() + await asyncio.sleep(0) # Force task switch to process event +``` + +### Option B: Stop and restart the loop task + +For cadence changes, stop the current loop task and create a new one: + +```python +async def _reschedule_adapter(self, name: str, new_config: AdapterConfig) -> None: + state = self._adapter_states.get(name) + if state is None: + await self._start_adapter(new_config) + return + + # Preserve last_completed_poll + preserved_poll = state.last_completed_poll + + # Stop current loop + await self._stop_adapter(name) + + # Update config + state.config = new_config + state.last_completed_poll = preserved_poll + + # Restart loop + await self._start_adapter(new_config) +``` + +### Option C: Double-signal pattern + +Set the event, yield, then set again to ensure delivery: + +```python +state.cancel_event.set() +await asyncio.sleep(0) +state.cancel_event.set() # Redundant but ensures visibility +``` + +--- + +## 4. Test Gap + +### Missing Tests + +The test file `test_config_source_new.py` only tests ConfigSource behavior (list, get, protocol compliance). There are **no tests** for: + +1. `_reschedule_adapter` interrupting a sleeping loop +2. Cadence decrease being applied mid-sleep +3. Cadence increase being applied mid-sleep +4. Rate-limit guarantee after reschedule +5. `cancel_event` mechanism in isolation + +### Recommended Tests + +```python +@pytest.mark.asyncio +async def test_cadence_decrease_applies_immediately(): + """Cadence decrease should wake sleeping loop and reschedule.""" + # Setup: Adapter polling at 60s cadence + # Action: Change cadence to 30s while sleeping + # Assert: Next poll at last_poll + 30s, not last_poll + 60s + +@pytest.mark.asyncio +async def test_cadence_increase_applies_on_next_cycle(): + """Cadence increase should wake sleeping loop and extend wait.""" + # Setup: Adapter polling at 60s cadence + # Action: Change cadence to 90s while sleeping + # Assert: Next poll at last_poll + 90s + +@pytest.mark.asyncio +async def test_cancel_event_wakes_sleeping_loop(): + """cancel_event.set() should interrupt asyncio.wait_for().""" + # Unit test for the event mechanism in isolation +``` + +--- + +## 5. State at End + +### LXC State (Reverted) +- **Cadence in DB:** 60s ✅ +- **Actual poll interval:** 60s ✅ +- **Supervisor restarted:** 2026-05-16T04:43:40Z +- **Verified polls:** + ``` + 04:43:40.964 - First poll after restart + 04:44:41.171 - Second poll (61s later) ✅ + ``` + +### Mitigation Until Fix +After any cadence change (especially decrease), verify actual poll intervals. If incorrect, restart supervisor: +```bash +systemctl restart central-supervisor +``` + +--- + +## Summary + +| Item | Details | +|------|---------| +| **Bug** | Cadence decrease hot-reload doesn't apply without restart | +| **Root cause** | `cancel_event.set()` inside lock context has delayed delivery | +| **Affects** | Cadence decreases only; increases work correctly | +| **Workaround** | Restart supervisor after cadence decrease | +| **Fix effort** | Low - add `await asyncio.sleep(0)` after event.set() | +| **Test coverage** | None for hot-reload mechanism | + diff --git a/docs/environment.md b/docs/environment.md new file mode 100644 index 0000000..b2f04f0 --- /dev/null +++ b/docs/environment.md @@ -0,0 +1,96 @@ +# Central Data Hub - Environment Reference + +## Development Locations + +### Active Development: CT104 (Central LXC) + +All development work happens on the Central LXC container: + +| Property | Value | +|----------|-------| +| **Hostname** | `central` | +| **Tailscale IP** | `100.64.0.12` | +| **LAN IP** | `192.168.1.104` | +| **SSH access** | `zvx@central` or `zvx@100.64.0.12` | +| **Repository path** | `/opt/central` | +| **Python venv** | `/opt/central/.venv` | +| **Services** | `central-supervisor`, `central-archive` | + +### Parked Clone: Cortex + +The cortex VM at `/home/zvx/projects/central` contains a clone that is +**not actively used for development**. It may be retired in the future. +Do not make changes there. + +### Local Workstation: matt-desktop + +The Windows workstation (matt-desktop) has no Central repository clones. +The directory `C:\Users\mtthw\central_work\` is scratch space only and +should not be used for commits. + +## Repository + +| Property | Value | +|----------|-------| +| **Origin** | `git@github.com:zvx-echo6/central.git` | +| **Main branch** | `main` | +| **Default user** | `central` (on CT104) | + +## Services + +### central-supervisor + +The main adapter scheduler and event publisher. Polls upstream APIs, +normalizes events, and publishes to NATS JetStream. + +```bash +# Status +systemctl status central-supervisor + +# Logs +journalctl -u central-supervisor -f + +# Restart (requires sudo) +sudo systemctl restart central-supervisor +``` + +### central-archive + +Consumes events from NATS JetStream and archives to PostgreSQL/TimescaleDB. + +```bash +# Status +systemctl status central-archive + +# Logs +journalctl -u central-archive -f +``` + +## Database + +PostgreSQL 16 with TimescaleDB runs on CT104: + +```bash +# Connect as central user +psql -h localhost -U central -d central + +# Check adapter config +SELECT name, cadence_s, enabled FROM config.adapters; + +# Check recent events +SELECT id, time, category FROM events ORDER BY time DESC LIMIT 10; +``` + +## SSH Access from Windows + +From matt-desktop, connect via Tailscale: + +```bash +# Direct connection +ssh zvx@100.64.0.12 + +# Using hostname (if Tailscale DNS configured) +ssh zvx@central +``` + +Note: The `zvx` user requires password for sudo operations.