From 3b5c24c7e700dc86fdff0c0b1021dd1fed449127 Mon Sep 17 00:00:00 2001
From: Matt
Date: Mon, 27 Apr 2026 02:08:28 +0000
Subject: [PATCH] =?UTF-8?q?checkpoint:=20pre-audit=20working=20tree=20stat?=
 =?UTF-8?q?e=20=E2=80=94=204=20untracked=20design=20docs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 KIWIX-INTEGRATION-v2.md                 | 365 ++++++++++++++++++++++
 NAV-INTEGRATION-v3.md                   | 399 ++++++++++++++++++++++++
 NAV-INTEGRATION-v4.md                   | 363 +++++++++++++++++++++
 phases/phase-6d-peertube-acquisition.md |  54 ++++
 4 files changed, 1181 insertions(+)
 create mode 100644 KIWIX-INTEGRATION-v2.md
 create mode 100644 NAV-INTEGRATION-v3.md
 create mode 100644 NAV-INTEGRATION-v4.md
 create mode 100644 phases/phase-6d-peertube-acquisition.md

diff --git a/KIWIX-INTEGRATION-v2.md b/KIWIX-INTEGRATION-v2.md
new file mode 100644
index 0000000..1bb291e
--- /dev/null
+++ b/KIWIX-INTEGRATION-v2.md
@@ -0,0 +1,365 @@
+# RECON × Kiwix — ZIM Integration Design (v2)
+
+**Status:** Draft v2 (corrected post-stress-test)
+**Date:** 2026-04-16
+**Depends on:** RECON v1.0.0 (master, CT 130)
+
+---
+
+## 1. Goal
+
+Integrate Kiwix into RECON as a first-class knowledge source. Users manage ZIM files through the RECON dashboard — uploading directly or pulling from the Kiwix catalog. Kiwix-serve provides a browsable web interface for all loaded ZIMs. RECON detects new ZIMs and runs tiered ingestion: classifying the source, extracting articles via python-libzim, generating metadata, and embedding into Qdrant for Aurora semantic search.
+
+**Future portable deployment:** The full system is built on CT 130/cortex. A stripped-down offline copy (dense-only, reduced dimensions, int8 quantization) will later be packaged for a Raspberry Pi 5 8GB with 512GB NVMe. This is a future packaging exercise — design decisions should not block it, but we build for full capability first.
+
+---
+
+## 2. Where Everything Lives
+
+All on CT 130 (192.168.1.130), running as `zvx`.
+
+| Component | Location |
+|-----------|----------|
+| kiwix-serve | Installed via Kiwix PPA (`ppa:kiwixteam/release`) or Docker (`ghcr.io/kiwix/kiwix-serve:3.8.2`). **NOT `apt install kiwix-tools`** — Ubuntu 24.04 ships 3.5.0 (2023), missing required OPDS v2 endpoints. Need ≥3.7.0. |
+| kiwix-manage | Same package as kiwix-serve. Note: may be deprecated in favor of directory-serving in future libkiwix versions. Design for both. |
+| ZIM file storage | `/mnt/kiwix/` (separate bind-mount from data host `/mnt/data/kiwix/`, same SSD as library but NOT inside library — library is curated human-browsable PDFs) |
+| Kiwix library XML | `/mnt/kiwix/library.xml` (auto-managed, treat as legacy intermediate) |
+| ZIM tracking DB | `recon.db` (new tables, see §6) |
+| Ingestion code | `/opt/recon/lib/zim_pipeline.py` (new module) |
+| Dashboard UI | `recon.echo6.co` (new Kiwix management page) |
+
+---
+
+## 3. Service Architecture
+
+### 3.1 kiwix-serve (browsing + OPDS catalog)
+
+Runs as a companion systemd unit. Provides:
+- **Web browsing UI** for all loaded ZIMs (Wikipedia, Stack Exchange, etc. — fully browsable with images if using `_maxi` variants)
+- **OPDS v2 catalog** at `/catalog/v2/entries` for RECON to enumerate loaded ZIMs
+- **Keyword search** via `/search` endpoint (HTML/XML only — **no JSON support exists**)
+
+```
+kiwix-serve --library /mnt/kiwix/library.xml \
+  --port 8430 \
+  --address 0.0.0.0 \
+  --threads 4 \
+  --nodatealias \
+  --blockexternal
+```
+
+- Bound to 0.0.0.0 — browsable from the local network (kiwix.echo6.co or similar)
+- Port 8430 (next to RECON dashboard on 8420)
+- `--blockexternal` prevents outbound link navigation from served content
+- **NO `--monitorLibrary`** — documented bugs (zombie processes, high CPU). Use SIGHUP on demand: when RECON adds a new ZIM, it sends `kill -HUP ` to trigger library reload.
+
+### 3.2 python-libzim (article extraction)
+
+kiwix-serve is for **browsing only**.
All RAG article extraction uses `python-libzim` (PyPI package `libzim` v3.9.0) directly. This is faster, gives full control over filtering, and avoids the no-JSON problem with kiwix-serve's search API.
+
+Key API patterns:
+```python
+from libzim.reader import Archive
+
+zim = Archive("path/to/file.zim")
+
+# Reliable article count (NOT zim.article_count which is inflated 2-3×):
+counter_meta = zim.get_metadata("Counter").decode()
+# Returns: "text/html=6467891;image/webp=3211054;text/css=23"
+html_count = int(dict(p.rsplit("=", 1) for p in counter_meta.split(";"))["text/html"])
+
+# ZIM metadata:
+title = zim.get_metadata("Title").decode()
+description = zim.get_metadata("Description").decode()
+language = zim.get_metadata("Language").decode()
+
+# Article iteration:
+for i in range(zim.entry_count):
+    entry = zim._get_entry_by_id(i)
+    if entry.is_redirect:
+        continue
+    item = entry.get_item()
+    if item.mimetype != "text/html":  # mimetype is on Item, not Entry
+        continue
+    content = bytes(item.content)
+    # process content...
+```
+
+**Important:** Modern ZIMs (2022+) use the "new namespace scheme" — flat paths, no `A/`/`I/`/`M/` prefixes. Do not use `iter_by_namespace('A')`.
+
+### 3.3 RECON ZIM Monitor Daemon
+
+New thread in `recon.service` (daemon #8). Polls kiwix-serve's OPDS catalog (`/catalog/v2/entries`) every 60 seconds, compares against `zim_sources` table in `recon.db`, and queues new ZIMs for ingestion. Also detects removed ZIMs.
+
+**Do not trust OPDS `articleCount`** — it's inflated 2-3× for large ZIMs. Use python-libzim's `Counter` metadata for accurate counts after detection.
+
+---
+
+## 4. ZIM Acquisition (How ZIMs Get In)
+
+### 4.1 Direct Upload
+
+User uploads a `.zim` file through the RECON dashboard. Dashboard saves to `/mnt/kiwix/`, runs `kiwix-manage library.xml add `, sends SIGHUP to kiwix-serve. ZIM monitor detects on next poll.
+
+### 4.2 Catalog Pull
+
+RECON dashboard exposes a "Browse Kiwix Catalog" page.
Queries public OPDS catalog at `https://library.kiwix.org/catalog/v2/entries` with filters (lang, category, search). **Response is Atom XML only** — no JSON. User picks a ZIM, confirms, RECON downloads via torrent (preferred for large files) or direct HTTP to `/mnt/kiwix/`.
+
+### 4.3 ZIM Variant Strategy
+
+| Variant | Content | Use Case |
+|---------|---------|----------|
+| `_maxi` | Full text + all images | Browsing via kiwix-serve. Plan for this as default. |
+| `_nopic` | Full text, no images | RAG-only (if disk constrained). ~50% smaller. |
+| `_mini` | Intro + infobox only | Not useful for RAG. |
+
+**Start small.** Prove the pipeline with Wikivoyage, iFixit, or a focused Stack Exchange before tackling Wikipedia. Stack ZIMs and monitor storage/RAM budget as you go.
+
+---
+
+## 5. Tiered Ingestion Pipeline
+
+When the ZIM monitor detects a new ZIM, it:
+
+1. **Reads ZIM metadata** via python-libzim (`Counter`, `Title`, `Description`, `Language`, `Tags`, `Name`)
+2. **Classifies the source** based on ZIM name/tag patterns
+3. **Routes to the appropriate tier**
+4.
**Samples first** if the source is unknown (see §5.4)
+
+### 5.1 Tier 1 — Known Large Sources (Deterministic Extractors)
+
+**Trigger:** Known source name patterns, any article count
+**Enrichment:** None — structural metadata from HTML
+**Cost:** Zero
+
+| ZIM Pattern | Domain/Subdomain | Metadata Source |
+|-------------|-----------------|-----------------|
+| `wikipedia_*` | Reference/Wikipedia | Title, categories, infobox fields, section headings |
+| `wiktionary_*` | Reference/Wiktionary | Word, part of speech, definitions |
+| `wikibooks_*` | Reference/Wikibooks | Book title, chapter, subject |
+| `wikisource_*` | Reference/Wikisource | Work title, author, year |
+| `wikiversity_*` | Reference/Wikiversity | Course, subject area |
+| `wikivoyage_*` | Reference/Wikivoyage | Destination, region, travel topic |
+| `stack_exchange_*` | Reference/StackExchange | Tags, vote score, accepted answer flag |
+| `devdocs_*` | Reference/DevDocs | Language, framework, API |
+| `ifixit_*` | Maintenance/Repair | Device, category, difficulty |
+
+### 5.2 Tier 2 — Unknown Mid/Large Sources (Local Qwen3 Enrichment)
+
+**Trigger:** >10K articles AND no Tier 1 extractor match
+**Enrichment:** Local Ollama Qwen3 8B (`aurora` model)
+**Cost:** Zero (cortex compute time only)
+
+Lightweight prompt per article → JSON with domain, subdomain, summary, keywords.
+
+### 5.3 Tier 3 — Small Unknown Sources (Gemini Enrichment)
+
+**Trigger:** ≤10K articles AND no Tier 1 extractor match
+**Enrichment:** Gemini API (existing RECON enrichment pipeline)
+**Cost:** Low (small article count, few dollars max)
+
+### 5.4 The Gate — Unknown Source Review
+
+When a ZIM doesn't match any Tier 1 pattern AND exceeds Tier 3 threshold:
+
+1. Extract random sample of 50 articles
+2. Log to review queue in `recon.db`
+3. Dashboard notification: *"New ZIM: obscure_wiki.zim — 47,832 articles — no known extractor — sample ready"*
+4. Ingestion **paused** until user reviews and approves
+5.
User can: approve for Tier 2, assign domain/subdomain, reject, or write a new extractor
+
+---
+
+## 6. Database Schema
+
+New tables in `/opt/recon/data/recon.db`:
+
+### zim_sources
+```sql
+CREATE TABLE zim_sources (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    zim_filename TEXT NOT NULL UNIQUE,
+    zim_path TEXT NOT NULL,
+    zim_uuid TEXT,
+    title TEXT,
+    description TEXT,
+    language TEXT,
+    category TEXT,
+    article_count INTEGER DEFAULT 0,  -- from Counter metadata, NOT OPDS
+    ingestion_tier INTEGER,
+    status TEXT DEFAULT 'detected',   -- detected|sampling|review|ingesting|complete|error|rejected
+    processed_count INTEGER DEFAULT 0,
+    skipped_count INTEGER DEFAULT 0,
+    error_count INTEGER DEFAULT 0,
+    domain TEXT,
+    subdomain TEXT,
+    detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    started_at TIMESTAMP,
+    completed_at TIMESTAMP,
+    last_checkpoint TEXT
+);
+```
+
+### zim_samples
+```sql
+CREATE TABLE zim_samples (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    zim_source_id INTEGER REFERENCES zim_sources(id),
+    article_path TEXT NOT NULL,
+    article_title TEXT,
+    text_preview TEXT,
+    metadata_json TEXT,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+```
+
+### zim_articles
+```sql
+CREATE TABLE zim_articles (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    zim_source_id INTEGER REFERENCES zim_sources(id),
+    article_path TEXT NOT NULL,
+    article_title TEXT,
+    qdrant_point_ids TEXT,
+    status TEXT DEFAULT 'pending',  -- pending|embedded|skipped|error
+    processed_at TIMESTAMP,
+    UNIQUE(zim_source_id, article_path)
+);
+```
+
+---
+
+## 7. Article Processing Pipeline
+
+```
+ZIM Entry
+  ├─ Is redirect? ────────────────────► SKIP
+  ├─ mimetype != text/html? ──────────► SKIP
+  ├─ Clean text < 200 chars?
─────────► SKIP (stub)
+  ▼
+HTML → Clean Text (lxml, not BeautifulSoup — 10× faster)
+  ▼
+Metadata Extraction (per tier)
+  ▼
+Chunking (~512 tokens, ~50 token overlap)
+  ▼
+Embedding (bge-m3 dense + sparse via cortex:8090/8091)
+  ▼
+Qdrant Upsert → recon_knowledge_hybrid
+  Payload: source_type, zim_file, zim_source_id, article_title,
+           article_path, domain, subdomain, chunk_index,
+           total_chunks, keywords, language
+  ▼
+Checkpoint (update zim_articles + zim_sources.processed_count)
+```
+
+### Batching & Backpressure
+- Batch size: 100 articles (configurable)
+- Sleep between batches: 1s (configurable)
+- ZIM ingestion runs at lower priority than real-time PDF/stream processing
+- Progress logging every 1000 articles
+- Resumable via `last_checkpoint`
+- **Sparse upserts are slower** — known issue where on-disk sparse indexing progressively degrades. Budget for this.
+
+### Realistic Scale Estimates
+
+| ZIM | Articles | Est. Chunks | Embedding Time (RTX 3090) |
+|-----|----------|-------------|---------------------------|
+| Appropedia EN | ~30K | ~60K | ~10 min |
+| iFixit EN | ~90K | ~180K | ~25 min |
+| Stack Overflow | ~500K | ~1.5M | ~3.5 hr |
+| Wikipedia EN nopic | ~5M (non-stub) | ~10M | ~24 hr |
+| Wikipedia EN maxi | Same text, +images | ~10M | ~24 hr (same — images not embedded) |
+
+---
+
+## 8. Systemd Integration
+
+```ini
+# /etc/systemd/system/kiwix.service
+[Unit]
+Description=Kiwix-serve for RECON
+After=network.target
+PartOf=recon.service
+
+[Service]
+User=zvx
+ExecStart=/usr/bin/kiwix-serve \
+  --library /mnt/kiwix/library.xml \
+  --port 8430 \
+  --address 0.0.0.0 \
+  --threads 4 \
+  --nodatealias \
+  --blockexternal
+ExecReload=kill -HUP $MAINPID
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Reload after adding ZIM: `systemctl reload kiwix`
+
+---
+
+## 9. Pre-Implementation Checks
+
+Before writing any code:
+
+1.
**Verify cortex:8091 sparse embeddings** — If it's running TEI or Infinity, sparse vectors may be silently broken (dense-only output). Only native FlagEmbedding supports bge-m3 sparse. This affects the entire existing RECON pipeline, not just Kiwix.
+
+2. **Check Qdrant upgrade path** — If Qdrant needs upgrading, must go stepwise: 1.14 → 1.15 → 1.16. Direct jumps corrupt data.
+
+3. **Check cortex RAM** — Determines whether 10-30M additional vectors need quantization config changes.
+
+---
+
+## 10. Implementation Phases
+
+### Phase 1 — Foundation
+- Install kiwix-tools via PPA on CT 130
+- Create `/mnt/kiwix/` directory (owned by zvx)
+- Set up kiwix.service systemd unit
+- Create DB schema
+- Download a small test ZIM (Appropedia EN — ~495MB maxi)
+- Register via kiwix-manage, verify browsable at port 8430
+- Implement ZIM monitor daemon (OPDS poll → zim_sources)
+
+### Phase 2 — Extraction & Embedding Pipeline
+- Implement python-libzim article extraction with lxml
+- Implement article filtering (redirects, stubs, non-HTML)
+- Implement chunking (reuse existing RECON logic)
+- Implement embedding + Qdrant upsert with ZIM payload schema
+- Implement checkpointing and resume
+- Implement Tier 1 Wikipedia extractor as proof of concept
+- End-to-end test with Appropedia
+
+### Phase 3 — Tiered Enrichment & Gate
+- Tier 2 local Qwen3 enrichment path
+- Tier 3 Gemini enrichment routing
+- Source classification and routing logic
+- Review gate (sampling, dashboard notification)
+- Test with an unknown ZIM
+
+### Phase 4 — Dashboard UI
+- Kiwix library page (loaded ZIMs, status, progress)
+- Upload ZIM form
+- Catalog browser (OPDS query + download)
+- Review queue
+- Stats integration
+
+### Phase 5 — Scale Up
+- Additional Tier 1 extractors (Stack Exchange, DevDocs, iFixit)
+- Wikipedia EN (start with nopic, upgrade to maxi when confident)
+- ZIM version management (replace old vectors when new ZIM version arrives)
+
+---
+
+## 11. Open Questions
+
+1.
**ZIM version dedup** — When `wikipedia_en_2026-06.zim` replaces `wikipedia_en_2026-03.zim`, purge old vectors by `zim_source_id` filter + delete, then re-embed. Atomic cutover or incremental?
+
+2. **Qdrant collection strategy** — Same `recon_knowledge_hybrid` collection, or separate `recon_kiwix_hybrid`? Same collection means unified search. Separate means independent scaling but requires query fanout.
+
+3. **Pi deployment packaging** — Future exercise. Dense-only, Matryoshka 256-dim, int8 quant. ~3.8GB RAM for 15M vectors. Proven viable, not designed yet.
diff --git a/NAV-INTEGRATION-v3.md b/NAV-INTEGRATION-v3.md
new file mode 100644
index 0000000..2a70554
--- /dev/null
+++ b/NAV-INTEGRATION-v3.md
@@ -0,0 +1,399 @@
+# NAV-INTEGRATION-v3.md — Echo6 Navi Module
+
+**Status:** Active
+**Created:** 2026-04-17
+**Updated:** 2026-04-18
+**Author:** Matt + Claude
+**Repo:** forge.echo6.co/matt/refactored-recon
+
+---
+
+## System Context
+
+- **RECON** — Backend of everything. Acquires, processes, enriches, embeds, files. Manages all data including Navi datasets.
+- **Aurora** — Eyes, ears, and mouth. Speaks to humans via Open WebUI or mesh. Queries tools, synthesizes answers.
+- **Navi** — Navigation module. Routing, geocoding, tiles, mesh bridge, weather, trail/land overlays.
+- **Kiwix** — Offline internet. ZIM files via kiwix-serve.
+- **Library** — files.echo6.co. PDFs across 21 domains.
+
+---
+
+## Infrastructure (Post VM Migration)
+
+- **VM 130** — 192.168.1.130 on data node (192.168.1.240)
+- **OS:** Ubuntu 24.04 LTS
+- **RAM:** 16 GB
+- **vCPU:** 4
+- **Boot disk:** 80-100 GB
+- **Docker:** Yes (available post-migration)
+- **User:** zvx
+- **Mounts:**
+  - /mnt/library/ — existing ~67 GB library (bind-mount from data host /mnt/data/library/)
+  - /mnt/nav/ — NEW ~200 GB nav data (bind-mount from data host /mnt/data/nav/)
+
+All services run on VM 130. No separate LXC for Navi.
+
+---
+
+## Priority Tiers
+
+### HIGH — The Core Proposition
+1.
Aurora gives turn-by-turn directions
+2. Meshtastic waypoint delivery native to the Meshtastic app
+3. Mesh bridge logs node position data (breadcrumb trails for "get me home")
+
+### MEDIUM — Enhanced Capabilities
+4. Self-hosted weather (Open-Meteo)
+5. Web frontend for directions (OSM tiles + topo/elevation)
+6. Selectable offline region downloads for designated AO
+7. Trail and land ownership overlays (USFS/BLM/PAD-US — free federal data, OnX alternative)
+
+### LOW — Future Enrichment
+8. Wilderness data layers (foraging, water, fauna)
+
+---
+
+## Architecture
+
+```
+VM 130 (192.168.1.130)
+├── RECON (existing)
+│   ├── recon.service (7 daemon threads)
+│   ├── /opt/recon/ (Python venv, SQLite, Flask)
+│   ├── recon.echo6.co :8420
+│   └── files.echo6.co :8888 (nginx)
+│
+├── Navi Services
+│   ├── Valhalla (Docker) :8002 — routing
+│   ├── Photon (native Java) :2322 — geocoding
+│   ├── nginx :8440 — PMTiles + MapLibre frontend (Phase M)
+│   └── Open-Meteo (Docker or .deb) :8080 (Phase M)
+│
+├── Aurora Integration
+│   ├── /opt/recon/lib/nav_tools.py — route(), reverse_geocode(), weather()
+│   └── Aurora pipe function tool registration
+│
+├── Mesh Integration
+│   ├── /opt/recon/services/mesh_bridge.py — nav-bridge.service
+│   ├── TCPInterface → Meshtastic gateway :4403
+│   ├── Waypoint + text delivery
+│   └── Position logging (breadcrumb trails per node)
+│
+└── External Dependencies (unchanged)
+    ├── Qdrant — cortex 192.168.1.150:6333
+    ├── bge-m3 dense — cortex:8090
+    ├── bge-m3 sparse — cortex:8091
+    └── Meshtastic gateway — [IP TBD]:4403
+```
+
+---
+
+## Storage Layout
+
+```
+/mnt/nav/
+├── sources/
+│   ├── idaho-latest.osm.pbf          # ~250 MB (start here)
+│   └── us-latest.osm.pbf             # ~11 GB (expand later)
+├── valhalla/                         # Docker volume
+│   └── (tiles built automatically)   # ~2 GB Idaho, ~60 GB CONUS
+├── photon/
+│   └── photon_data/                  # ~95 GB planet, or smaller US-only
+├── tiles/                            # Phase M
+│   ├── basemap.pmtiles               # ~15-20 GB CONUS or extract
+│   ├── contours.pmtiles              # ~5-10 GB
+│   ├──
usfs_trails.pmtiles           # ~150-300 MB
+│   ├── usfs_mvum_roads.pmtiles       # ~250-450 MB
+│   ├── padus.pmtiles                 # ~400-800 MB
+│   ├── blm_sma.pmtiles               # ~80-150 MB
+│   └── idfg_hunt_units.pmtiles       # ~10 MB
+├── overlays/                         # Raw shapefiles/GeoJSON (source)
+│   ├── S_USA.TrailNFS_Publish/
+│   ├── S_USA.Road_MVUM/
+│   ├── S_USA.Trail_MVUM/
+│   ├── PADUS4_1/
+│   ├── BLM_SMA/
+│   ├── IDFG/
+│   └── IDL/
+├── weather/                          # Phase M
+│   └── open-meteo-data/              # ~50-100 GB
+├── frontend/                         # Phase M
+│   ├── index.html
+│   ├── style.json
+│   └── sw.js
+└── docker-compose.yml
+```
+
+---
+
+## Decision Log
+
+| # | Decision | Rationale |
+|---|----------|-----------|
+| D1 | **Everything on VM 130** | RECON is the backend of everything. No separate LXC. |
+| D2 | **VM not LXC** | Docker needed for Valhalla. Migrated from CT 130 LXC. |
+| D3 | **Valhalla via Docker** | C++ build from source is fragile. Docker is the community standard. |
+| D4 | **Photon native (Java jar)** | Simple JRE + jar, no Docker needed. Systemd unit. |
+| D5 | **Idaho-first, expand later** | Validate the stack on a 250 MB PBF before committing to 11 GB CONUS. |
+| D6 | **US-only Photon** | 3-4 GB heap vs 8 GB for planet. Sufficient for prepper nav use case. |
+| D7 | **Trail data as display overlay, not routed** | USFS trails don't merge cleanly into OSM PBF. OnX does it the same way. |
+| D8 | **No private parcel ownership** | Irrelevant for nav use case. Saves $80K/yr Regrid license. |
+| D9 | **Log mesh node positions** | Enables "how do I get back to base camp?" from breadcrumb history. |
+| D10 | **Private mesh channel for nav** | Avoid polluting public LongFast. MEDIUM_FAST or SHORT_FAST preset. |
+
+---
+
+## HIGH PRIORITY PHASES
+
+---
+
+### Phase H1: Routing + Geocoding Infrastructure
+
+**Goal:** Valhalla and Photon running on VM 130, answering API calls.
+
+**H1a — Valhalla (Idaho-first):**
+1. Create directory structure under /mnt/nav/
+2. Download idaho-latest.osm.pbf from Geofabrik (~250 MB)
+3.
Deploy Valhalla via Docker (ghcr.io/valhalla/valhalla-scripted:latest)
+4. Build tiles from Idaho PBF (auto on first start)
+5. Validate: Buhl → Boise route with narrative maneuvers
+6. Test pedestrian + bicycle costing modes
+
+**H1b — Photon:**
+1. Install JRE on VM 130
+2. Download Photon jar + planet dump (or US-only extract)
+3. Configure systemd unit for Photon on port 2322
+4. Validate forward geocoding: "Buhl Idaho"
+5. Validate reverse geocoding: 42.6, -114.46
+
+**H1c — Expand to CONUS (after validation):**
+1. Download us-latest.osm.pbf (~11 GB)
+2. Build Valhalla tiles on cortex (needs 32 GB peak RAM), rsync to /mnt/nav/valhalla/
+3. Swap Valhalla to use CONUS tiles
+4. Re-validate routing
+
+**Validation:** Buhl → Boise returns JSON with maneuvers including verbal_succinct_transition_instruction. Photon resolves "Twin Falls Idaho" to correct coordinates.
+
+---
+
+### Phase H2: Aurora Nav Tools
+
+**Goal:** Ask Aurora "How do I get from Buhl to Boise?" and get turn-by-turn.
+
+**Prerequisites:** Phase H1a+H1b complete.
+
+**Tasks:**
+1. Create /opt/recon/lib/nav_tools.py
+   - route(origin, destination, mode) → geocode via Photon → route via Valhalla → formatted result
+   - reverse_geocode(lat, lon) → Photon reverse lookup
+2. Register route tool in Aurora's Open WebUI Pipe Function
+3. Handle natural language variations
+4. Format response: summary + numbered maneuvers
+
+**Validation:** In Open WebUI, ask Aurora: "How do I get from Buhl to Boise?" → get directions.
+
+---
+
+### Phase H3: Meshtastic Mesh Bridge + Waypoints + Position Logging
+
+**Goal:** Text `nav Boise` on a Meshtastic radio → receive waypoint pins + text directions. All node positions logged for breadcrumb trails.
+
+**Prerequisites:** Phase H2 complete. Meshtastic gateway accessible via TCP.
+
+**Pre-build decisions needed:**
+- Gateway node hardware + IP
+- Private channel PSK
+- Position logging schema (SQLite table on VM 130)
+
+**Components:**
+
+**mesh_bridge.py daemon:**
+- TCPInterface to gateway :4403
+- Command handlers: `nav `, `n`/`next`, `cancel`, `where am i`, `home` (route back to base/earliest position)
+- Waypoint delivery with sliding window (3 at a time, 6s pacing)
+- Route state per node with 24h TTL
+
+**Position logger:**
+- Subscribe to POSITION_APP packets
+- Store: node_id, lat, lon, altitude, speed, heading, timestamp
+- SQLite table at /opt/recon/data/positions.db (or same recon.db)
+- Enables "get me home" by looking up earliest position or pre-registered base camp
+- Retention: configurable, default 30 days
+
+**Waypoint compression:**
+| Maneuver | Name (≤30 chars) | Icon |
+|----------|------------------|------|
+| Start | `START Buhl` | 📍 |
+| Left | `L Main St 0.3` | ⬅️ |
+| Right | `R Oak Ave 1.2` | ➡️ |
+| Straight | `S US-93 N 4.7` | ⬆️ |
+| Destination | `DST Twin Falls` | 🏁 |
+
+**Meshtastic Gateway Config:**
+- Role: CLIENT or ROUTER_CLIENT
+- TCP Server: Enabled :4403
+- Channel 0: Public LongFast (untouched)
+- Channel 1: Private nav, PSK-protected, MEDIUM_FAST
+- GPS: Enabled
+
+**Validation:**
+1. From handheld, text `nav Twin Falls` on channel 1
+2. Receive text summary + 3 waypoint pins on Meshtastic app map
+3. Text `n` → next waypoints
+4. Text `home` → route back to earliest logged position
+5. Verify positions.db is accumulating node position data
+
+---
+
+## MEDIUM PRIORITY PHASES
+
+Start after HIGH phases are validated in the field.
+
+---
+
+### Phase M1: Weather — Open-Meteo
+
+**Goal:** Aurora answers "what's the weather?" offline. Mesh command: `wx`.
+
+- Deploy Open-Meteo (Docker or .deb) with GFS surface subset
+- ~50-100 GB storage, syncs while internet available
+- Add weather() tool to nav_tools.py
+- Add `wx` command to mesh_bridge
+
+---
+
+### Phase M2: Web Frontend + Tile Serving
+
+**Goal:** navi.echo6.co — self-hosted map UI with directions, search, layers.
+
+- PMTiles basemap via Planetiler or Protomaps extract
+- MapLibre GL JS frontend with Valhalla routing + Photon search
+- Layer switcher: street / topo / satellite(NAIP AO-only)
+- GPS tracking dot
+- Start with Headway, evaluate fork vs custom
+
+---
+
+### Phase M3: Offline Region Downloads
+
+**Goal:** RECON dashboard region picker → triggers tile + routing build.
+
+- v1: Geofabrik region tree dropdown (simple)
+- v2: Draw-a-bbox on map (complex)
+- Pipeline: osmium extract → planetiler → valhalla_build_tiles
+- Store in /mnt/nav/regions/{name}/
+
+---
+
+### Phase M4: Trail and Land Overlays (OnX Alternative)
+
+**Goal:** Public land boundaries, USFS trails, MVUM access on the map.
+
+**Data sources (all free, public domain):**
+| Dataset | Source | Size (CONUS) |
+|---------|--------|-------------|
+| USFS NFS Trails | data.fs.usda.gov | ~150-300 MB as PMTiles |
+| USFS MVUM Roads | data.fs.usda.gov | ~250-450 MB as PMTiles |
+| USFS MVUM Trails | data.fs.usda.gov | ~50-100 MB as PMTiles |
+| PAD-US 4.1 Fee | usgs.gov | ~400-800 MB as PMTiles |
+| BLM SMA | gbp-blm-egis.hub.arcgis.com | ~80-150 MB as PMTiles |
+| IDFG Hunt Units | idfg.idaho.gov | ~10 MB as PMTiles |
+| IDL State Trust | idl.idaho.gov | ~20 MB as PMTiles |
+| IDPR OHV/Snow Trails | idpr-data-idaho.hub.arcgis.com | ~10 MB as PMTiles |
+
+**Conversion pipeline:**
+```
+Download shapefile → ogr2ogr (reproject to EPSG:4326, select fields)
+→ tippecanoe (GeoJSON → PMTiles) → serve via nginx
+```
+
+**MapLibre styling:**
+- BLM = yellow, USFS = green, NPS = purple, FWS = orange
+- State = blue, Private = transparent, Wilderness = dark green overlay
+- Motorized trails = solid red, non-motorized = dashed green
+- MVUM open roads = green, seasonal = orange, closed = red
+
+**Keep as display-only overlays.** Do NOT merge into OSM PBF for routing.
+
+**Refresh schedule:** PAD-US annually, BLM SMA quarterly, USFS per-unit, IDFG annually. Automate with cron.
+
+---
+
+## LOW PRIORITY PHASES
+
+### Phase L1: Wilderness Data
+
+Defer until HIGH and MEDIUM are stable. See NAV-INTEGRATION-v2.md for details.
+
+---
+
+## Dependency Graph
+
+```
+Phase H1a (Valhalla Idaho)
+  │
+  ├── Phase H1b (Photon)
+  │     │
+  │     └── Phase H2 (Aurora nav tools)
+  │           │
+  │           ├── Phase H3 (Mesh bridge + waypoints + position logging)
+  │           │
+  │           ├── Phase M1 (Weather) — independent of H3
+  │           │
+  │           └── Phase L1 (Wilderness) — independent
+  │
+  ├── Phase H1c (Expand to CONUS) — independent of H2
+  │
+  ├── Phase M2 (Web frontend) — needs H1, enhanced by M4
+  │
+  ├── Phase M3 (Region downloads) — needs H1
+  │
+  └── Phase M4 (Trail/land overlays) — independent of routing, pairs with M2
+
+H1a → H1b → H2 → H3 is the critical path.
+M phases can proceed in any order after H1.
+```
+
+---
+
+## Estimated Effort
+
+| Phase | CC Sessions | Blocked By |
+|-------|------------|------------|
+| H1a: Valhalla (Idaho) | 1 | /mnt/nav/ mount |
+| H1b: Photon | 0.5 | H1a |
+| H1c: Expand to CONUS | 0.5 | H1a validated |
+| H2: Aurora nav tools | 1 | H1a + H1b |
+| H3: Mesh bridge | 2-3 | H2 + Meshtastic hardware |
+| M1: Weather | 1 | H1 |
+| M2: Web frontend | 2-4 | H1 |
+| M3: Region downloads | 2-3 | H1 |
+| M4: Trail/land overlays | 1-2 | tippecanoe installed |
+| L1: Wilderness | 2-3 | H2 |
+
+**HIGH total: 5-6 CC sessions**
+**MEDIUM total: 6-10 CC sessions**
+
+---
+
+## Open Decisions (Resolve Before H3)
+
+1. **Meshtastic gateway:** Which physical node? What IP?
+2. **Private nav channel PSK:** Generate before H3
+3. **Position logging location:** Separate positions.db or table in recon.db?
+4. **Photon scope:** Planet dump (95 GB) or US-only extract?
+5. **Base camp concept:** Pre-registered waypoint, or earliest position in log?
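
Open decisions 3 and 5 interact: wherever the position log ends up, the `home` command has to resolve either a pre-registered base camp or the earliest logged fix for a node. The sketch below shows one way that lookup might work with SQLite. The `positions` table, its column names, and the `resolve_home` / `purge_old` helpers are all hypothetical; the columns simply mirror the fields listed for the Phase H3 position logger.

```python
import sqlite3
import time

# Hypothetical schema: mirrors the fields listed for the position logger.
SCHEMA = """
CREATE TABLE IF NOT EXISTS positions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    node_id TEXT NOT NULL,
    lat REAL NOT NULL,
    lon REAL NOT NULL,
    altitude REAL,
    speed REAL,
    heading REAL,
    ts INTEGER NOT NULL            -- Unix seconds
);
CREATE INDEX IF NOT EXISTS idx_positions_node_ts ON positions(node_id, ts);
"""


def open_db(path=":memory:"):
    """Open (or create) the position log; works for positions.db or recon.db."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_position(conn, node_id, lat, lon, ts=None,
                 altitude=None, speed=None, heading=None):
    """Append one fix for a node."""
    stamp = int(ts if ts is not None else time.time())
    conn.execute(
        "INSERT INTO positions (node_id, lat, lon, altitude, speed, heading, ts) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (node_id, lat, lon, altitude, speed, heading, stamp),
    )
    conn.commit()


def resolve_home(conn, node_id, base_camps=None):
    """Prefer a pre-registered base camp; fall back to the earliest logged fix."""
    if base_camps and node_id in base_camps:
        return base_camps[node_id]
    row = conn.execute(
        "SELECT lat, lon FROM positions WHERE node_id = ? "
        "ORDER BY ts ASC, id ASC LIMIT 1",
        (node_id,),
    ).fetchone()
    return (row[0], row[1]) if row else None


def purge_old(conn, retention_days=30, now=None):
    """Delete fixes older than the retention window; returns rows removed."""
    current = int(now if now is not None else time.time())
    cutoff = current - retention_days * 86400
    cur = conn.execute("DELETE FROM positions WHERE ts < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

One consequence worth weighing: with a retention window, the "earliest position" drifts forward as old fixes are purged, so a pre-registered base camp may be the more stable answer for `home`.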
+
+---
+
+## Risk Register
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| Valhalla CONUS tile build OOMs | H1c blocked | Build on cortex, rsync to data node |
+| Photon 95 GB download fails | H1b delayed | wget --continue, checksum verify |
+| Gateway TCP flaky | H3 degraded | Reconnect loop with backoff |
+| No GPS fix on field handheld | H3 edge case | Graceful error + manual coord input |
+| Valhalla can't route backcountry trails | Trail nav limited | Trails are display-only anyway; use OSM foot paths where they exist |
+| USFS trail data quality varies by forest | Overlay gaps | Supplement with user GPS tracks over time |
+| VM 130 RAM pressure | Service degradation | Monitor, bump to 20 GB if needed (headroom exists) |
diff --git a/NAV-INTEGRATION-v4.md b/NAV-INTEGRATION-v4.md
new file mode 100644
index 0000000..46f03c3
--- /dev/null
+++ b/NAV-INTEGRATION-v4.md
@@ -0,0 +1,363 @@
+# NAV-INTEGRATION-v4.md — Echo6 Navi Module
+
+**Status:** Active
+**Created:** 2026-04-17
+**Updated:** 2026-04-19
+**Author:** Matt + Claude
+**Repo:** forge.echo6.co/matt/refactored-recon
+
+---
+
+## System Context
+
+- **RECON** — Backend of everything. Acquires, processes, enriches, embeds, files. Manages all data including Navi datasets.
+- **Aurora** — Eyes, ears, and mouth. Speaks to humans via Open WebUI or mesh. Queries tools, synthesizes answers.
+- **Navi** — Navigation module. Routing, geocoding, tiles, mesh bridge, weather, trail/land overlays.
+- **Kiwix** — Offline internet. ZIM files via kiwix-serve. Aurora integration via tool-callable search.
+- **Library** — files.echo6.co. PDFs across 21 domains.
+
+---
+
+## Infrastructure
+
+- **VM 130** — 192.168.1.130 on data node (192.168.1.240)
+- **OS:** Ubuntu 24.04 LTS (migrated from LXC to VM for Docker support)
+- **RAM:** 16 GB
+- **vCPU:** 4
+- **Boot disk:** 80-100 GB
+- **Docker:** Yes
+- **User:** zvx
+- **Mounts:**
+  - /mnt/library/ — ~67 GB library (virtiofs from data host /mnt/data/library/)
+  - /mnt/nav/ — ~200 GB nav data (virtiofs from data host /mnt/data/nav/)
+  - /mnt/kiwix/ — Kiwix ZIM storage (virtiofs from data host /mnt/data/kiwix/)
+  - /mnt/nas/ — pi-nas NFS share (192.168.1.245)
+
+---
+
+## Completed Work
+
+### Phase H1a: Valhalla Routing Engine ✅
+- **Docker:** `ghcr.io/valhalla/valhalla-scripted:latest` v3.6.3
+- **Port:** 8002
+- **Data:** Idaho PBF, 540 tiles built
+- **Validated:** Buhl → Boise (127.1 mi, 144 min, auto + pedestrian costing)
+- **Branch:** feature/navi on forge.echo6.co/matt/recon
+
+### Phase H1b: Photon Geocoding ✅
+- **Service:** systemd unit, Java jar, Xmx10g
+- **Port:** 2322
+- **Data:** Full planet import, 281.4M documents, 85 GB index
+- **Validated:** Forward + reverse geocoding, worldwide coverage
+
+### Phase H2: Aurora Nav Tools ✅
+- **Files:**
+  - /opt/recon/lib/nav_tools.py — route(), reverse_geocode()
+  - /opt/recon/lib/aurora_nav_tool.py — Open WebUI tool wrapper
+- **Registered:** Navigation tool visible in Open WebUI
+- **Tested:** 5/5 tests passing
+
+### Phase H2b: Semantic Query Router ✅
+- **Files:**
+  - /opt/recon/lib/query_router.py — standalone router (38 example queries, 4 routes)
+  - recon_rag_tool.py v4.2.0 on cortex — router gate integrated into inlet()
+- **Routes:**
+  - nav_route (0.735 confidence) → Valhalla directions, skip RAG
+  - nav_reverse_geocode (0.871) → Photon reverse, skip RAG
+  - direct_answer (0.877) → pass through, skip RAG
+  - rag_search (0.751) → full RAG pipeline
+- **Safety:** TEI down → falls to RAG. Nav fails → falls to RAG.
+- **Expandable:** Adding new routes = embed 10-20 examples, compute centroid, add to dict.
No retraining.
+
+### Pi-nas Country Index ✅
+- **File:** /export/data/nav/photon-country-index.txt (also as .md)
+- **Coverage:** 282.5M lines indexed by country code
+- **US records:** 52.8M starting at line 122,520,379
+- **Use:** Enables regional Photon builds for Pi deployment
+
+---
+
+## Active Services on VM 130
+
+| Service | Type | Port | Status |
+|---------|------|------|--------|
+| recon.service | systemd (native) | 8420 (dashboard), 8888 (files) | Running |
+| Valhalla | Docker | 8002 | Running (540 tiles, Idaho) |
+| Photon | systemd (Java) | 2322 | Running (281M docs, 85 GB) |
+
+---
+
+## Git Repos
+
+| Repo | What | Language | Branch |
+|------|------|----------|--------|
+| `matt/recon` | Backend, pipeline, nav_tools, mesh_bridge, router | Python | feature/navi |
+| `matt/refactored-recon` | Design docs, bibles, plans | Markdown | main |
+| `matt/navi` | Web map frontend (TO CREATE) | JS/HTML/CSS | — |
+| `matt/navi-mobile` | Ferrostar Android app (FUTURE) | Kotlin | — |
+
+---
+
+## Priority Tiers (Revised)
+
+### HIGH — Core Proposition (DONE)
+1. ✅ Aurora gives turn-by-turn directions
+2. ✅ Semantic router for intelligent tool selection
+3. Meshtastic waypoint delivery (BLOCKED — needs hardware decisions)
+
+### MEDIUM — Enhanced Capabilities (IN PROGRESS)
+4. Address book in RECON — pre-Photon geocoding for saved locations
+5. Netsyms Address DB — 160M USPS-validated addresses, SQLite
+6. Web frontend — navi.echo6.co (NEW REPO: matt/navi)
+7. Ferrostar mobile app — Android turn-by-turn pointed at Valhalla
+8. TomTom traffic integration — routing-level, not just overlay
+9. Self-hosted weather (Open-Meteo)
+10. Selectable offline region downloads for designated AO
+11. Trail and land ownership overlays (USFS/BLM/PAD-US)
+
+### LOW — Future Enrichment
+12.
Wilderness data layers (foraging, water, fauna) + +--- + +## Next Phases (Ordered) + +### Phase M-AB: Address Book + Netsyms Download + +**Goal:** Saved locations resolve instantly without geocoding. Netsyms provides USPS-validated address precision. + +**Address Book:** +- YAML or SQLite in /opt/recon/data/ or /opt/recon/config/ +- Checked BEFORE Photon in nav_tools geocoding chain +- Feeds: Aurora web ("how do I get home"), mesh bridge ("nav home"), Navi frontend (starred locations) +- Structure: key, name, lat, lon, address, aliases[] +- Future expansion: contacts, callsigns, frequencies (RECON-managed rolodex) + +**Netsyms:** +- Download: https://dl.netsyms.net/gis/addresses/2025/AddressDatabase2025.zip (11 GB compressed, 35 GB uncompressed) +- SHA256: 3deb85a37c6a4d027dd35fcbc1084e577b06a95be471042acc17fe21dedc3d8e +- Format: SQLite, 160M addresses (US + Canada), USPS ZIP+4 validated +- Schema: zipcode, number, street, street2, city, state, plus4, country, latitude, longitude, source +- Source data: National Address Database + OpenAddresses.io + USPS ZIP+4 +- License: Public domain (facts cannot be copyrighted under US law) +- Comparison: Photon (OSM crowd-sourced) vs Netsyms (government records). Test side-by-side. +- Pi deployment: "lite" version at 6.7 GB compressed, no lat/lon, optimized for autocomplete + +**Photon vs Netsyms vs Address Book — geocoding chain:** +1. Address book → exact match on saved locations (instant, zero network) +2. Netsyms SQLite → street address precision (local query, no service) +3. Photon → place names, POIs, worldwide ("Boise Airport", "Sawtooth NF") + +### Phase M-WEB: Web Frontend (navi.echo6.co) + +**Goal:** Google Maps-style web experience with self-hosted tiles, routing, search. 
+ +**Repo:** forge.echo6.co/matt/navi (new) +**Deploy:** nginx on VM 130, port 8440, navi.echo6.co + +**Stack:** +- MapLibre GL JS — map rendering +- PMTiles — vector tiles served by nginx with Range Requests +- Valhalla API at :8002 — routing with polyline + maneuvers +- Photon API at :2322 — search/geocode +- Netsyms SQLite — address autocomplete (via small API or direct query) +- TomTom traffic overlay — raster tiles via API key (visual only for now) +- Address book locations — starred markers on map + +**MVP Features:** +- Search bar (Photon forward geocode) +- Click-to-route (two points → Valhalla → draw polyline + maneuver list) +- Mode selector (auto, pedestrian, bicycle) +- GPS dot (browser Geolocation API) +- TomTom traffic tile overlay +- Responsive mobile layout + +**Future Features:** +- Layer switcher (street / topo / satellite) +- Saved locations from address book +- Turn-by-turn panel with voice (Web Speech API) +- Offline PWA with Service Worker caching +- USFS trail / BLM land / PAD-US overlays +- Contour/hillshade overlay + +**Tile Data (needs building):** +- Basemap PMTiles from Planetiler or `pmtiles extract` from Protomaps daily builds +- Start with Idaho regional extract, expand to CONUS +- Contours from SRTM via phyghtmap (future) + +### Phase M-APP: Ferrostar Mobile App + +**Goal:** Native Android turn-by-turn navigation app pointed at your Valhalla. + +**Repo:** forge.echo6.co/matt/navi-mobile (future) + +**Stack:** +- Ferrostar SDK (BSD license, Rust core, Kotlin/Jetpack Compose UI) +- Built-in Valhalla route provider: `WellKnownRouteProvider.Valhalla("http://192.168.1.130:8002/route/v1", "auto")` +- MapLibre Native for map rendering +- GPS snapping, off-route detection, automatic rerouting +- Voice guidance via SpokenInstructionObserver +- Your PMTiles for offline map tiles + +**Effort:** Fork the Ferrostar demo app, swap three URLs (Valhalla, tile source, search). Half-day to functional, a few days to polish. 
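Both the web frontend's click-to-route and the Ferrostar app end up sending Valhalla the same kind of request. A minimal sketch of that request body, assuming Valhalla's standard `/route` JSON fields (`locations`, `costing`, `units`). The helper name `build_route_request` and the sample coordinates are illustrative and not taken from `nav_tools.py`.

```python
import json

# Costing modes exposed by the MVP mode selector.
ALLOWED_COSTING = {"auto", "pedestrian", "bicycle"}


def build_route_request(start, end, costing="auto", units="miles"):
    """Build a Valhalla /route request body from two (lat, lon) tuples.

    Hypothetical helper for illustration; not part of the existing code.
    """
    if costing not in ALLOWED_COSTING:
        raise ValueError(f"unsupported costing: {costing}")
    return {
        "locations": [
            {"lat": start[0], "lon": start[1]},
            {"lat": end[0], "lon": end[1]},
        ],
        "costing": costing,
        "units": units,
    }


# Approximate coordinates for Buhl and Boise, used only as sample input.
payload = build_route_request((42.599, -114.759), (43.615, -116.202))
body = json.dumps(payload)  # JSON string to POST to the Valhalla service on :8002
```

Keeping the payload builder free of network calls means the same function can back the web API, the mesh bridge, and unit tests.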
+ +### Phase M-TRAFFIC: TomTom → Valhalla Traffic Integration + +**Goal:** Valhalla routes based on real-time traffic, not just static road speeds. + +**Architecture:** +1. Cron job (every 5 min while internet available) → poll TomTom Traffic Flow Segments API for AO +2. Mapping service → translate TomTom segment IDs to Valhalla edge IDs (built once via `valhalla_ways_to_edges`, stored as lookup table) +3. Traffic tile writer → pack speeds into Valhalla's binary traffic tile format → update traffic.tar +4. Valhalla hot-reloads traffic data without restart + +**TomTom free tier:** 2,500 requests/day — sufficient for single AO polled every 5 min during waking hours. + +**Key insight:** Traffic-aware routing is a peacetime feature. Grid-down = no TomTom feed = Valhalla routes without traffic = fine (traffic patterns irrelevant without grid). + +**Effort:** 2-3 CC sessions. Requires understanding Valhalla's internal edge ID system and binary traffic tile format. + +**Reference:** Christian Beiwinkel's "Ultimate Guide to Traffic in Valhalla" and his "Valhalla Orbis Tools" for TomTom integration. + +### Phase H3: Meshtastic Mesh Bridge + Waypoints + Position Logging + +**BLOCKED on hardware decisions:** +1. Which physical Meshtastic node is the gateway? +2. Gateway IP on LAN? +3. Private nav channel PSK + +**Architecture (Option B — Aurora-mediated):** +``` +Mesh DM → mesh_bridge receives text + → Injects sender GPS + context + → Sends to Aurora (via Ollama API or Open WebUI API) + → Semantic router classifies intent + → Aurora calls appropriate tool (nav, RAG, weather, etc.) + → Response formatted + compressed + → Waypoints + text sent back over mesh DM +``` + +Fast-path bypass for `n`, `next`, `cancel`, `sitrep` — skip Aurora. 
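The fast-path bypass can be sketched as a small classifier that runs before anything is sent to Aurora. This is only an illustration: the function name `classify_mesh_message` is hypothetical, and treating commands as case-insensitive with surrounding whitespace ignored is an assumption, not a decided behavior.

```python
# Commands that skip Aurora entirely (the fast-path list above).
FAST_PATH_COMMANDS = {"n", "next", "cancel", "sitrep"}


def classify_mesh_message(text):
    """Return ("fast", command) for bypass commands, else ("aurora", text).

    Assumption: commands match case-insensitively after trimming whitespace.
    """
    normalized = text.strip().lower()
    if normalized in FAST_PATH_COMMANDS:
        return ("fast", normalized)
    return ("aurora", text.strip())
```

Anything that is not an exact command match, including a sentence that merely contains the word "next", still goes through the semantic router.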
+ +**Position logging:** +- Subscribe to POSITION_APP packets +- SQLite table: node_id, lat, lon, altitude, speed, heading, timestamp +- Enables: "get me home" (earliest position), breadcrumb trails, team tracking + +**Waypoint delivery:** +- Sliding window (3 at a time, 6s pacing on MEDIUM_FAST) +- Compressed names ≤30 chars ("L Main St 0.3") +- Emoji icons per maneuver type +- 24h TTL expiry +- Private channel to avoid polluting public LongFast + +**Gateway config:** +- Role: CLIENT or ROUTER_CLIENT +- TCP: Enabled :4403 +- Channel 0: Public LongFast +- Channel 1: Private nav, PSK-protected, MEDIUM_FAST +- GPS: Enabled + +--- + +## Data Sources Available + +### Geocoding +| Source | Records | Size | Coverage | Best For | +|--------|---------|------|----------|----------| +| Photon (running) | 281M | 85 GB | Planet | Place names, POIs, worldwide | +| Netsyms (to download) | 160M | 35 GB | US + Canada | Street addresses, USPS precision | +| OpenAddresses.io | 600M+ | Varies | Worldwide | Future international expansion | +| Address book (to build) | User-defined | <1 MB | Personal | Saved locations, zero latency | + +### Trail and Land Data (Phase M future) +| Dataset | Source | License | Size as PMTiles | +|---------|--------|---------|----------------| +| USFS NFS Trails | data.fs.usda.gov | Public domain | ~150-300 MB | +| USFS MVUM Roads | data.fs.usda.gov | Public domain | ~250-450 MB | +| USFS MVUM Trails | data.fs.usda.gov | Public domain | ~50-100 MB | +| PAD-US 4.1 | usgs.gov | Public domain | ~400-800 MB | +| BLM SMA | gbp-blm-egis.hub.arcgis.com | Public domain | ~80-150 MB | +| IDFG Hunt Units | idfg.idaho.gov | Free | ~10 MB | +| IDL State Trust | idl.idaho.gov | Free | ~20 MB | +| IDPR OHV/Snow Trails | idpr-data-idaho.hub.arcgis.com | Free | ~10 MB | + +**Conversion pipeline:** shapefile → ogr2ogr (EPSG:4326) → tippecanoe → PMTiles → nginx +**Keep as display-only overlays.** Do NOT merge into OSM PBF for routing. 
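The conversion pipeline above could be driven by a helper that only builds the two command lines, leaving execution to the caller. This is a sketch under assumptions: the function name, layer name, and output directory are hypothetical, and writing `.pmtiles` directly from tippecanoe assumes a recent felt/tippecanoe build with PMTiles output support.

```python
from pathlib import Path


def build_overlay_commands(shapefile, layer, out_dir):
    """Return [ogr2ogr_cmd, tippecanoe_cmd] for one overlay layer.

    Hypothetical helper; paths and layer names are illustrative only.
    """
    out_dir = Path(out_dir)
    geojson = out_dir / f"{layer}.geojson"
    pmtiles = out_dir / f"{layer}.pmtiles"

    # Step 1: reproject the shapefile to EPSG:4326 and write GeoJSON.
    reproject = [
        "ogr2ogr", "-f", "GeoJSON", "-t_srs", "EPSG:4326",
        str(geojson), str(shapefile),
    ]
    # Step 2: tile the GeoJSON; -zg lets tippecanoe pick a max zoom.
    tile = [
        "tippecanoe", "-o", str(pmtiles), "-l", layer,
        "-zg", "--drop-densest-as-needed", str(geojson),
    ]
    return [reproject, tile]


commands = build_overlay_commands(
    "usfs_trails.shp", "usfs_trails", "/mnt/nav/overlays"
)
```

Each command list can then be passed to `subprocess.run(cmd, check=True)` in order, with the resulting `.pmtiles` file dropped into the directory nginx serves.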
+ +--- + +## Pi 5 Deployment Target + +Everything built on the homelab should have a "Pi profile" — regional extracts, configurable scope, no 16+ GB RAM dependency. + +| Component | Homelab | Pi 5 (8 GB) | +|-----------|---------|-------------| +| Valhalla | CONUS (60 GB, Docker) | Regional (2-5 GB, native or Docker) | +| Geocoding | Photon planet (85 GB, JVM 10g) | Netsyms SQLite (35 GB, zero overhead) | +| Place names | Photon | Lightweight OSM place extract (~500 MB) | +| Tiles | CONUS PMTiles (15-20 GB) | Regional PMTiles (2-5 GB) | +| LLM | Qwen3 8B on cortex GPU | 1-3B model on CPU (slow but functional) | +| Embeddings | bge-m3 on cortex GPU | Dense-only, 256-dim, int8 | +| Meshtastic | TCP to gateway | USB serial to node | +| Weather | Open-Meteo (50-100 GB) | Subset or skip | +| Storage | 1 TB SSD shared | 1-2 TB NVMe | + +--- + +## Semantic Router — Current Routes + +| Route | Confidence | Action | Expandable | +|-------|-----------|--------|------------| +| nav_route | 0.735 | Valhalla directions, skip RAG | ✅ | +| nav_reverse_geocode | 0.871 | Photon reverse, skip RAG | ✅ | +| direct_answer | 0.877 | Pass through, skip RAG | ✅ | +| rag_search | 0.751 | Full RAG pipeline | ✅ | +| kiwix_search | — | PLANNED: kiwix-serve search | Add 10-20 examples | +| weather | — | PLANNED: Open-Meteo query | Add 10-20 examples | +| mesh_sitrep | — | PLANNED: node positions summary | Add 10-20 examples | + +--- + +## Storage Budget (Shared 938 GB SSD) + +| Mount | Current | Planned Additions | Total | +|-------|---------|-------------------|-------| +| /mnt/library/ | 67 GB | — | 67 GB | +| /mnt/kiwix/ | ~40 GB | +110 GB (Wikipedia + others) | ~150 GB | +| /mnt/nav/ | ~111 GB (Valhalla 2 GB + Photon 85 GB + sources + PBF) | +35 GB Netsyms, +20 GB PMTiles, +60 GB CONUS Valhalla | ~226 GB | +| Other host data | ~400 GB | — | ~400 GB | +| **Total** | | | **~843 GB** | +| **Free** | | | **~95 GB** | + +**Action needed:** Trim Photon to US-only to reclaim ~70 GB, bringing free 
to ~165 GB. Or accept planet coverage and manage tightly. + +--- + +## Risk Register + +| Risk | Impact | Mitigation | +|------|--------|------------| +| Valhalla CONUS tile build OOMs | Expansion blocked | Build on cortex, rsync to data node | +| Photon 85 GB crowds out Kiwix ZIMs | Storage pressure | Trim to US-only (~15 GB), reclaim ~70 GB | +| TomTom free tier rate limit | Traffic routing degraded | Single AO polling only, cache aggressively | +| Meshtastic gateway TCP flaky | H3 degraded | Reconnect loop with backoff | +| Ferrostar SDK breaking changes (pre-1.0) | App maintenance | Pin SDK version, update deliberately | +| Netsyms lacks place names/POIs | "Boise Airport" fails | Keep Photon as fallback for non-address queries | +| Address book file not synced to cortex | OWUI can't use saved locations | Inline in Valve JSON or add HTTP endpoint | + +--- + +## Estimated Effort (Remaining) + +| Phase | CC Sessions | Blocked By | +|-------|------------|------------| +| M-AB: Address book + Netsyms | 1 | Nothing | +| M-WEB: Web frontend MVP | 2-3 | PMTiles built | +| M-APP: Ferrostar mobile | 1-2 | PMTiles built | +| M-TRAFFIC: TomTom → Valhalla | 2-3 | M-WEB validated | +| H3: Mesh bridge | 2-3 | Meshtastic hardware | +| Trail/land overlays | 1-2 | tippecanoe installed | +| Weather (Open-Meteo) | 1 | Nothing | +| Photon trim to US-only | 0.5 | Nothing | +| Expand Valhalla to CONUS | 0.5 | Cortex RAM for build | diff --git a/phases/phase-6d-peertube-acquisition.md b/phases/phase-6d-peertube-acquisition.md new file mode 100644 index 0000000..cec527c --- /dev/null +++ b/phases/phase-6d-peertube-acquisition.md @@ -0,0 +1,54 @@ +# Phase 6d: PeerTube Acquisition Module + +**Date:** 2026-04-15 +**Commit:** 277110d (refactor branch) +**Status:** Complete + +## What Changed + +Created `lib/acquisition/peertube.py` — a new module that polls PeerTube for +video transcripts and writes them as flat file pairs into `data/acquired/stream/` +for the dispatcher to pick up. 
This replaces the `peertube_scanner_loop` removed +in Phase 5c-1. + +### New File: `lib/acquisition/peertube.py` (~170 lines) + +- `_build_known_sets(db)` — queries catalogue for `source='stream.echo6.co'`, builds UUID + title dedup sets +- `list_new_videos(db, config)` — calls `get_videos()`, filters against known sets, checks captions with rate limiting +- `acquire_one(video, caption_path, config)` — fetches VTT, converts to text, writes `.tmp` files, hashes, renames atomically +- `acquire_batch(db, config)` — orchestrates list + acquire, returns `{acquired, skipped, errors}` +- `acquisition_loop(stop_event, db, config, interval)` — service loop, polls every `interval` seconds + +### Edited: `recon.py` + +- `cmd_service()`: Added `peertube-acq` thread running `acquisition_loop` (interval from config, default 1800s) +- `cmd_ingest_peertube()`: Replaced legacy `ingest_channel`/`ingest_all` with `acquire_batch` +- Simplified argparse: removed `--channel`, `--since`, `--enrich`, `--process`; kept `--stats` + +### Edited: `config.yaml` + +- Added `poll_interval: 1800` under `peertube:` section + +## Architecture + +``` +PeerTube API → list_new_videos (dedup) → acquire_one (fetch VTT, hash, write) + → data/acquired/stream/{hash}.txt + {hash}.meta.json + → dispatcher _find_pairs() → transcript_processor pre_flight() + → enrich → embed → complete +``` + +## Key Design Decisions + +1. **No DB writes in acquisition** — `acquire_one` only writes files. `pre_flight()` handles catalogue registration. +2. **Atomic writes** — `.tmp` suffix during writes, rename meta first then content. Dispatcher only sees complete pairs. +3. **Two dedup cohorts** — UUID set (from URL paths) and title set (from filename column) cover both legacy and new catalogue entries. +4. **Rate limiting** — 0.5s delay between caption API calls to avoid PeerTube 429s. 
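Design decisions 1 and 2 can be illustrated with a simplified sketch of the write path. It is not the code in `lib/acquisition/peertube.py`: the function name is hypothetical, hashing the text in memory with MD5 is an assumption (suggested only by the 32-character hashes), and the real `acquire_one` hashes after writing the `.tmp` files.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def write_pair_atomically(text, meta, out_dir):
    """Write {hash}.txt + {hash}.meta.json so readers never see a partial pair.

    Hypothetical sketch; MD5 over the text is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()

    txt_tmp = out_dir / f"{digest}.txt.tmp"
    meta_tmp = out_dir / f"{digest}.meta.json.tmp"

    # Write both files under .tmp names that a pair scan would ignore.
    txt_tmp.write_text(text, encoding="utf-8")
    meta_tmp.write_text(json.dumps(meta), encoding="utf-8")

    # Rename meta first, content last: once the content file is visible,
    # its metadata is already in place.
    os.replace(meta_tmp, out_dir / f"{digest}.meta.json")
    os.replace(txt_tmp, out_dir / f"{digest}.txt")
    return digest


with tempfile.TemporaryDirectory() as tmp:
    digest = write_pair_atomically("hello", {"title": "demo"}, tmp)
    names = sorted(p.name for p in Path(tmp).iterdir())
```

`os.replace` is atomic when source and destination are on the same filesystem, which is why the `.tmp` files live in the target directory rather than in a system temp location.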
+ +## Verification + +- Import/compile: OK +- Dry run: `list_new_videos` returns new videos not in catalogue +- Real acquisition: hash `a8893f3757295e347cb5b529cae350ff` acquired and dispatched (returned 'duplicate' — already in catalogue from legacy ingest, confirming dedup works) +- Service restart: 7 threads, `peertube-acq` in thread list, 0 errors in 90-second window +- CLI: `recon ingest-peertube --stats` still works, `recon ingest-peertube` uses new path