Commit graph

58 commits

Author SHA1 Message Date
a4288c0cd8 Add contacts/phone book system with per-user scoping
New files:
- lib/auth.py: Authentik forward-auth helpers (get_user_id, @require_auth)
- lib/contacts.py: ContactsDB with CRUD, soft delete, restore, purge, find_nearby
- lib/contacts_api.py: Flask Blueprint with 9 API endpoints at /api/contacts
- templates/knowledge/deleted_contacts.html: Dashboard recovery page

Modified:
- lib/api.py: Register contacts_bp, add KNOWLEDGE_SUBNAV entry, /deleted-contacts route
- config/profiles: has_contacts feature flag (true for home, false for pi profiles)

Separate SQLite DB at data/contacts.db. Per-user isolation via X-Authentik-Username.
Home/Work labels enforced unique per user. Haversine proximity queries (75m default).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 05:29:54 +00:00
095bf8c2af Add Google Places (New) tertiary enrichment for business POIs
Fills opening_hours, phone, and website gaps when OSM + Overture data
is incomplete. Only fires for business-class POIs (amenity, shop, tourism,
leisure, office, craft). Daily API call cap with SQLite tracking.
cache_put now preserves google columns across cache refreshes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 04:08:12 +00:00
620f99c762 Add business_intent_poi_boost reranker signal
When a query contains no road-type keywords (st, blvd, ave, etc.),
boost amenity/shop/tourism/leisure/office/craft results (+3.0) and
penalize highway/route results (-4.0). This fixes searches like
"starbucks twin falls" where a named service road outranked the
actual business POI due to Photon position tiebreaking.

Also fixes:
- Intent classifier now recognizes full state names ("idaho" not
  just "ID") for LOCALITY classification
- Locality-type Photon results now populate _city from name field
  so they participate in locality_fuzz scoring
- Trace logging expanded to all candidates with osm_key/value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 19:39:37 +00:00
d460f0e202 Fix type classifier: POI check takes precedence over street_address
Businesses with housenumbers (e.g. M&W Markets at 130 US-30) were
classified as street_address because the housenumber check fired before
the osm_key check. Reorder so osm_key in amenity/shop/tourism/leisure/office
is evaluated first, ensuring businesses get type=poi regardless of
whether they have a street address. Also adds office to the POI key set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 19:08:04 +00:00
65693d15aa Add Overture Maps POI enrichment layer for place details
Ingests 20.9M North America places from Overture Maps Foundation
(release 2026-04-15.0) into PostgreSQL. Enriches /api/place responses
with phone, website, and brand data via spatial + fuzzy name matching
when OSM extratags are sparse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 16:51:25 +00:00
2121ee4936 Add place detail proxy with Nominatim-first routing and Overpass fallback
New /api/place/<osm_type>/<osm_id> endpoint returns cleaned OSM tag data
for PlaceDetail panel enrichment. Routes to local Nominatim (Idaho coverage)
first, falls back to Overpass public API for out-of-region queries. Responses
cached in SQLite (data/place_cache.db) with no expiry.

New modules: lib/place_detail.py (proxy + cache), lib/osm_categories.py
(~50 category humanization mappings). Profile YAMLs updated with
place_details config block and has_nominatim_details flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 03:06:51 +00:00
64605b38bb Add TomTom traffic proxy and update profiles for hillshade/traffic layers
- Add /api/traffic/flow proxy route to hide TomTom API key from frontend
- Add tileset_hillshade and traffic config blocks to all three profiles
- Flip has_hillshade and has_traffic_overlay flags in home and regional profiles
- Minimal profile has config blocks but flags remain false (dormant)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 00:52:04 +00:00
e6b81db520 feat(navi): deployment profiles + /api/config endpoint
Add profile-driven config infrastructure:
- config/profiles/{home,regional_pi,minimal_pi}.yaml templates
- lib/deployment_config.py loader (reads RECON_PROFILE env var)
- GET /api/config returns active profile as JSON (5min cache)

Frontend reads this on startup to determine tile source, defaults,
and feature flags. No existing behavior changed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 23:35:39 +00:00
d4c5c371ca Merge feature/navi-integration: Navi backend (address book, Netsyms, geocoding chain, reverse endpoint) 2026-04-20 22:40:03 +00:00
ac69e2761d feat(navi): add /api/reverse endpoint for map-click reverse geocoding
Accepts lat/lon query params, calls Photon /reverse, returns same
response shape as /api/geocode. Returns 200 with empty results on
no match (graceful degradation for ocean/unmapped areas).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 21:26:35 +00:00
87b230dcba feat(navi): structured geocode with usaddress parsing and reranker
Add lib/geocode.py — multi-source retrieval pipeline:
- usaddress CRF parsing with intent classification
- Netsyms structured lookup (uses raw street abbreviations)
- Photon /structured + /api freetext retrieval
- Weighted 10-signal reranker (housenumber, street fuzz, locality,
  source authority, etc.)
- match_code annotations + address book proximity labeling
- Trace log at /tmp/geocode_rerank_trace.log

nav_tools.py now delegates geocode() to the new module.
Tests updated: US address queries correctly return Netsyms results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 16:29:47 +00:00
c76d63b785 refactor(navi): Photon-first geocoding with ranked results
Inverts the /api/geocode chain. Photon is now the primary search
engine; the hand-rolled Netsyms free-text parser is removed.
Address book short-circuits nicknames only ("home", "work") —
full-address queries flow through Photon and address book
entries within 75m annotate matching results with labeled_as.
Coordinate strings detected before search.

Response shape: /api/geocode now returns a ranked candidates
list (always 200 OK, empty list if no match). No more 404 for
unmatched queries. Users can type messy input — wrong case,
missing punctuation, abbreviations, typos — and get results
or close matches.

Netsyms preserved at /api/netsyms/lookup for direct access.
USPS plus4 enrichment of Photon street-address hits is a
planned follow-up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 15:48:03 +00:00
a14501347b fix(navi): address book prefix+boundary match for longer queries
lookup() previously did exact-alias-only matching, so "214 north st
filer" missed the home entry with alias "214 north st". Extend to
match when the query begins with an alias followed by a word
boundary, and when an alias appears as a contiguous token sequence
inside the query. Short aliases ("home") keep matching exactly and
also match with trailing text.

Fixes the UX case where typing a known full address falls through
to Netsyms instead of short-circuiting to address_book.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 07:54:32 +00:00
dfab388769 feat(navi): add netsyms tier-2 geocoding + geocode API
Add Netsyms AddressDatabase2025 (159M US+CA addresses) as tier-2
in the geocode chain: address_book → netsyms → photon.

- lib/netsyms.py: SQLite lookup module (lazy, read-only, thread-safe)
- lib/netsyms_api.py: Flask blueprints for /api/netsyms/* and /api/geocode
- lib/netsyms_test.py: 7 test cases (street, free-text, zipcode, health)
- lib/nav_tools.py: new geocode() with consistent {name,lat,lon,source,raw}
- lib/api.py: register netsyms_bp and geocode_bp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 07:24:09 +00:00
23483e8198 feat(navi): address book with geocoding integration
- YAML-backed saved locations (config/address_book.yaml)
- Exact/partial alias matching with case-insensitive lookup
- Flask blueprint: /api/address_book/lookup, /api/address_book/list
- Geocoder short-circuits Photon when address book has exact match
- Test suite for lookup behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 04:02:11 +00:00
3243f2f252 feat(navi): semantic query router for intelligent tool selection - Phase H2b
Add centroid-based query classifier that routes Aurora queries to the
appropriate handler (nav_route, nav_reverse_geocode, direct_answer,
rag_search) before the RAG pipeline runs. Uses TEI embeddings against
pre-computed route centroids from 38 example queries.

- query_router.py: standalone module with lazy centroid init
- query_router_test.py: 7-query test suite (all passing)
- Corresponding recon_rag_tool.py v4.2.0 deployed to Open WebUI DB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 23:50:35 +00:00
9841c38011 fix(navi): format tool output as human-readable directions 2026-04-19 22:42:17 +00:00
a9510b5ed9 feat(navi): add nav_tools with route() and reverse_geocode() - Phase H2
- nav_tools.py: route() geocodes via Photon, routes via Valhalla, returns
  summary/maneuvers/polyline. reverse_geocode() for coordinate lookups.
  Supports auto/pedestrian/bicycle/truck modes.
- nav_tools_test.py: 5 live tests against local Photon (2322) and Valhalla (8002)
- aurora_nav_tool.py: Open WebUI Tool exposing get_directions to Aurora LLM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 22:14:26 +00:00
c5283ece3e Merge feature/scraper: Zimit-based web scraper
Replaces wget/SingleFile/Playwright crawl backends with Zimit (openZIM
Docker crawler). Produces ZIM files directly — no zimwriterfs step.
Validated with meshtastic.org (3400+ page Docusaurus site).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 19:37:04 +00:00
5f5bcedab9 Fix progress regex and SIGHUP/scan_zims race condition
- Parse Browsertrix "crawled":N JSON format instead of "N pages"
- Add 3s delay between SIGHUP to kiwix-serve and scan_zims() call
  so the OPDS catalog is reloaded before we query it for linking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 19:35:42 +00:00
9692044790 Fix progress parsing for Browsertrix JSON log format
Parse "crawled":N from Browsertrix crawlStatus JSON logs instead of
looking for "N pages" pattern. Also check stdout (not just stderr).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 19:33:50 +00:00
b035ba3f20 Fix Zimit: add required --name flag for warc2zim
warc2zim (called internally by zimit) requires --name for ZIM metadata.
Without it, argument validation fails with exit code 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 14:30:42 +00:00
76076fc4ab Fix Zimit CLI: add subcommand, correct flag names, fix container cleanup
- Must pass `zimit` as command after image name (entrypoint execs args)
- --url → --seeds, --name removed, --lang → --zim-lang, --workers → -w
- Remove --rm so docker logs work after exit, manually rm container

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 14:13:34 +00:00
8945c82e3f Replace wget/SingleFile/Playwright backends with Zimit
- Zimit Docker container handles all site types (static, SPA, JS redirects)
- Removed: _detect_crawl_mode, _crawl_wget, _crawl_singlefile, preflight logic
- Added: _crawl_zimit() with Docker lifecycle management
- Simplified pipeline: submit → Zimit crawl → kiwix-manage register → done
- No more zimwriterfs step — Zimit produces ZIM directly
- Dashboard UI simplified: removed crawl mode dropdown
- Config simplified: removed reject patterns, preflight, singlefile sections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 14:06:23 +00:00
f0b160ef7c Extract _full_zim_cleanup helper, add SIGHUP + scrape_jobs cleanup
- Extract shared _full_zim_cleanup(source_id) from api_kiwix_remove
- Add SIGHUP to kiwix-serve after kiwix-manage remove
- Delete linked scrape_jobs rows during ZIM removal
- Update api_scraper_delete to do full ZIM cleanup when applicable
- Set chromium_path for single-file browser crawl support
- Add status.db to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 02:28:49 +00:00
45c3bb8d56 Add scraper job queue management (delete, clear failed)
New API endpoints: DELETE single job, clear all failed/cancelled.
Dashboard now shows Delete buttons on completed/failed jobs,
Retry+Delete on failed jobs, and a Clear Failed bulk action.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 21:03:39 +00:00
1ce9a3731f Add scraper dashboard UI under Kiwix tab
New /kiwix/scraper page with submit form (URL, title, language,
crawl mode), stats cards, and auto-refreshing jobs table with
cancel/retry actions. Kiwix section now has Library/Scraper subnav.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 20:47:17 +00:00
45b954fccc Fix ZIM filename collisions by appending job ID
Format: {domain}_{lang}_{YYYY-MM}_{job_id}.zim
Prevents zimwriterfs failures when the same domain is scraped
multiple times in the same month.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 20:17:53 +00:00
125602fa13 Fix SingleFile CLI: remove invalid --crawl-delay flag
SingleFile CLI has no --crawl-delay option. The invalid flag caused the
process to print help and exit with no output. Added --crawl-no-parent
and --crawl-replace-URLs instead. Removed unused crawl_delay config key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 19:28:03 +00:00
da50e5f0b8 Add scraper Phase 2: smart crawl mode detection + browser fallback
- Pre-flight detection: wget + Playwright probe to auto-detect if site
  needs browser rendering (JS apps, parking page redirects)
- SingleFile CLI crawl backend for JS-rendered sites
- crawl_mode column in scrape_jobs (static/browser/redirect/auto)
- API: optional crawl_mode param on submit, cleared on retry
- Config: rate_limit_delay 2.0→0.5, /api/ reject pattern, preflight
  + singlefile config sections
- Prerequisites: Node.js 22, single-file-cli, Playwright + Chromium

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 18:26:43 +00:00
491a4350fc Merge feature/kiwix: Kiwix ZIM integration
- ZIM monitor + kiwix-serve foundation (Phase 1)
- Batch article ingestion pipeline (Phase 2)
- Dashboard tab, wiki.echo6.co citations
- Language filter for non-English articles
- Status badge + progress column fixes
- Download URL generation fix (/content/ prefix, full ZIM name)
2026-04-18 00:07:00 +00:00
b250d0c257 Fix Kiwix download URL generation in embedder
- Add /content/ prefix to wiki.echo6.co URLs (required by kiwix-serve)
- Stop stripping ZIM flavor/date suffix (e.g. _maxi_2025-11) from filename
- Use str.removesuffix instead of regex to strip only .zim extension

Before: https://wiki.echo6.co/appropedia_en_all/Article
After:  https://wiki.echo6.co/content/appropedia_en_all_maxi_2025-11/Article

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 00:06:52 +00:00
a40ce47127 Fix progress column to show Qdrant count for completed sources
Complete sources now show "19,344 in Qdrant" instead of misleading
extraction counts. Each status gets contextual progress display:
complete → X in Qdrant, processing → X/Y in Qdrant (%),
extracting → X/Y extracted, detected → dash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 15:31:01 +00:00
fed02186fa Fix Kiwix status badges to reflect full pipeline state
Status was showing COMPLETE after ZIM extraction finished, even when
documents were still queued for enrichment/embedding. Now computes
effective_status by checking actual pipeline state per-source:

- DETECTED: ingest not enabled (gray)
- EXTRACTING: ZIM processor running (blue)
- PROCESSING: extracted but docs still in enricher/embedder queue (amber)
- COMPLETE: all docs fully enriched and embedded in Qdrant (green)

Also fixed _build_kiwix_sources pipeline query to filter by category
per-source instead of returning global kiwix stats for every source.

Progress column now shows "X / Y in Qdrant" when processing, or
"X / Y extracted" otherwise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 15:22:44 +00:00
6f2a1d206e Add langdetect language filter to enricher + purge non-English ZIM articles
- Install langdetect package for content-level language detection
- Add _check_language() to enricher.py: reads first 1500 chars of first
  page, detects language via langdetect, skips if not in allowed list
- Configurable via config.yaml pipeline.language_filter and
  pipeline.allowed_languages (default: en only)
- Catches non-English content from ANY source (PDF, web, ZIM, PeerTube)
  before burning Gemini API quota on enrichment
- Add scan_zims retry logic (3 attempts, 2s delay) for upload handler
- Purged 6,483 stale non-English zim_articles rows from DB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 14:37:13 +00:00
501004ecf1 Filter non-English articles from ZIM ingestion
Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.)
before text extraction to avoid wasting Gemini enrichment on translations.
Uses path-based regex matching against ISO 639 language codes.

~5,276 non-English articles already ingested from Appropedia (top: es=837,
zh=765, ru=475, fr=433, ko=407). Purge decision deferred.
2026-04-17 07:30:30 +00:00
2635160887 Kiwix integration: ZIM processor, dashboard tab, wiki.echo6.co citations
- ZIM processor: extract articles from ZIM files, feed into existing enrichment pipeline
- Dashboard: Kiwix tab with library table, ingest toggle, upload, remove
- kiwix-serve on port 8430, wiki.echo6.co behind Authentik
- Citation URLs point to wiki.echo6.co/{zimname}/{article_path}
- Dashboard shows WIKI type badge for ZIM-sourced content
- Appropedia EN (19,445 articles) fully ingested as proof of concept
2026-04-17 07:00:24 +00:00
c60aa5e80d Phase 2: ZIM processor — batch article ingestion pipeline
Adds lib/processors/zim_processor.py which opens a ZIM file via
python-libzim, iterates HTML articles, strips to clean text (lxml),
and feeds each article into the existing RECON enrichment pipeline.

Key features:
- HTML to text via lxml (strips nav/footer/script/style)
- Filters redirects, non-HTML entries, stubs (<200 chars)
- Content hash dedup against existing catalogue
- Creates processing dirs with page files and meta.json
- Registers articles as "extracted" for automatic enrichment
- Checkpointing via zim_sources.last_checkpoint for resume
- Configurable batch size and delay for rate control
- Standalone CLI: python3 -m lib.processors.zim_processor

Tested: 100 Appropedia articles processed in 3s, enricher picks
them up automatically via the existing pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 02:03:12 +00:00
7c1af0f063 Phase 1: Kiwix foundation — ZIM monitor and kiwix-serve setup
- Add lib/zim_monitor.py: polls kiwix-serve OPDS v2 catalog, detects
  new ZIMs, reads accurate article count from python-libzim Counter
  metadata (not inflated OPDS count), inserts into zim_sources table.
  Idempotent on re-run, marks removed ZIMs.
- DB schema: zim_sources, zim_samples, zim_articles tables (created
  via sqlite3, not in migrations — matches existing RECON pattern)
- kiwix-tools 3.7.0 installed from binary tarball at /opt/recon/bin/
  (Ubuntu 24.04 apt ships 3.5.0 which lacks OPDS v2)
- kiwix.service systemd unit on port 8430
- python-libzim 3.9.0 installed
- Test ZIM: Appropedia EN maxi (496 MB, 19,445 articles)
- Add bin/ to .gitignore (binary tarball, not source)
2026-04-16 23:39:34 +00:00
8d54ff165d Merge refactor branch: RECON v1.0.0 v1.0.0
This merge integrates the complete refactor effort spanning Phases 0-6k,
bringing RECON from its initial baseline into production-grade form.

Pipeline architecture
---------------------
- Phases 0-2: foundation cleanup, removed dead code, standardized logging
- Phase 3: dispatcher rewrite — watches data/acquired/<subfolder>/ for
  {hash}.txt + {hash}.meta.json pairs, atomic .tmp rename, idempotent
- Phase 4: content processors for PDF (PyPDF2 -> pdftotext -> Tesseract ->
  Gemini Vision fallback chain), transcript, and text formats
- Phase 5: enrichment, embedding, and filing daemons split into
  independently restartable threads

PeerTube acquisition
--------------------
- Phase 6a-6c: PeerTube channel watcher, caption acquisition with rate
  limiting (429 handling), 0.5s rate_limit_delay enforced
- Phase 6d: multi-instance support
- Phase 6e: rewired then reverted dashboard PeerTube endpoint to live
  in acquisition module

Format handling & library cleanup
---------------------------------
- Phase 6f: text processor for .txt ingestion
- Phase 6f-2: format normalizer in dispatcher
- Phase 6g-6j: library reorg — ghost domain cleanup, SCL moved to
  dedicated domain folder, pi-nas fully decommissioned as a storage
  target (NFS-only now), ~73 GB reclaimed
- Phase 6k: hash-identical dedup — 2,477 duplicate PDFs removed,
  22.05 GB freed, catalogue/documents/Qdrant payloads updated
  coherently, 226 empty domain subdirs pruned
- 16,340 transcripts remain un-filed pending title-match review

Dashboard & metadata
--------------------
- Gemini "null" string bug fixed in pdf_processor metadata voting
- Dashboard upload migrated to pipeline with multi-format support

State at release
----------------
- 7 daemon threads: dispatcher, enrich, embed, filing, peertube-acq,
  progress, dashboard
- 29,201 documents in catalogue / documents tables (UNIQUE on hash PK)
- ~2.1M Qdrant vectors in recon_knowledge_hybrid (cortex:6333)
- ~67 GB library on /mnt/library (NFS from pi-nas)
- files.echo6.co serving 9,397 deduped PDFs
- recon.echo6.co dashboard + API on :8420

See cleanup-log.md for the full backlog and resolution history.
2026-04-16 18:20:25 +00:00
e6224cb279 Migrate dashboard upload to pipeline with multi-format support
Upload handler now writes files to the appropriate hopper subfolder
instead of copying directly to /mnt/library/:
- .pdf -> acquired/pdf/
- .txt -> acquired/text/
- .epub, .doc, .docx, .mobi -> acquired/pdf/ (dispatcher format
  normalizer converts to PDF before processing)

The dispatcher picks up files and routes through the appropriate
processor (pdf_processor or text_processor) for full metadata
voting, domain classification, and canonical filing.

Changes to api_upload() / _process_upload():
- Relaxed extension check: PDF, TXT, EPUB, DOC, DOCX, MOBI
- Routes to correct hopper subfolder by extension
- Writes meta.json sidecar with original filename and category hint
- Removed: direct library copy, add_to_catalogue, queue_document
- Added: hopper-level dedup check (catches rapid re-uploads)
- Kept: catalogue dedup check for immediate user feedback

Changes to api_upload_status():
- Added fallback: checks acquired/ and processing/ dirs if hash
  not yet in documents table (covers gap between upload and
  dispatcher pickup)

Template updated: accept attribute and help text now reflect
multi-format support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 02:18:45 +00:00
999cf37626 Fix: Gemini "null" string bug in pdf_processor metadata voting
Same fix as text_processor — Gemini sometimes returns the literal
string "null" instead of JSON null for empty metadata fields. The
voting logic and Gemini extraction now both treat "null" strings
as None.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:30:59 +00:00
f4659d155f Phase 6f-2: format normalizer in dispatcher
Adds _normalize_formats() to the dispatcher that converts non-standard
document formats to PDF before dispatch. Supports:
- .epub, .mobi -> PDF via ebook-convert (Calibre)
- .doc, .docx -> PDF via LibreOffice headless

Called per-subfolder before _find_pairs() so _find_pairs() only ever
sees standard content files. Conversion failures are logged and
skipped -- the original file stays in acquired/ for manual review.

Also converts 3 staged epub files and cleans up _staging/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:08:19 +00:00
62539861f2 Phase 6f: text processor for .txt file ingestion
New processor: lib/processors/text_processor.py
Handles plain text files (.txt) as primary source documents.

Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight()
-> enrich -> embed -> filing worker -> library/Domain/Subdomain/

Metadata extraction via two-source vote:
- Source A: filename parsing (title from filename)
- Source B: Gemini LLM extraction (title/author/edition/year from
  first 3 pages of text)

Page splitting reuses chunk_text() from lib/web_scraper.py.
Filing behavior matches PDFs (files to library, not organized
in-place like transcripts).

Config: adds text: text_processor to pipeline.dispatch map.
New hopper subfolder: data/acquired/text/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 22:39:31 +00:00
7fe7d03583 Revert "Phase 6e: rewire dashboard PeerTube endpoint to acquisition module"
This reverts commit 7e42528d2f.
2026-04-15 03:20:46 +00:00
7e42528d2f Phase 6e: rewire dashboard PeerTube endpoint to acquisition module
Replace legacy ingest_channel/ingest_all imports with acquire_batch
from lib.acquisition.peertube. The endpoint now writes flat file pairs
to the hopper and lets the dispatcher handle processing, matching the
Phase 6d architecture. Removes channel/since/process parameters that
were tied to the old direct-ingest path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:15:41 +00:00
277110d999 Phase 6d: PeerTube acquisition module + service thread
New lib/acquisition/peertube.py replaces the removed peertube_scanner_loop.
Polls PeerTube API every 30min, dedupes against catalogue (UUID + title),
writes flat file pairs to data/acquired/stream/ for the dispatcher.

- acquire_batch(): one-shot find-and-acquire with rate limiting
- acquisition_loop(): service thread wrapper (interval from config)
- list_new_videos(): dedup via _build_known_sets() against catalogue
- acquire_one(): fetch VTT, convert, write .tmp then rename atomically

cmd_service(): added peertube-acq daemon thread
cmd_ingest_peertube(): rewired to use acquire_batch(), drops --channel/
  --since/--enrich/--process (dispatcher handles full pipeline)
config.yaml: added peertube.poll_interval: 1800

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:08:51 +00:00
efae4023f6 Phase 6c: remove vestigial extract worker, dead crawler, .bak files
recon.py:
- Remove extract stage_loop thread from cmd_service(). Confirmed
  vestigial: 0 queued items, silent logs over 24+ hour run. The new
  processors do extraction inline in pre_flight().
- Remove cmd_crawl CLI subcommand and its argparse registration.
- Clean up associated imports and variables.

Deleted:
- lib/crawler.py (432 lines) -- old web crawler subsystem, only
  referenced by the removed CLI subcommand.
- 24 .bak files (untracked pre-edit safety backups, originals
  preserved in git history).

Investigation found the four old loop function definitions
(scanner_loop, peertube_scanner_loop, crawler_scheduler_loop,
organizer_loop) were already deleted in Phase 5c-1.

Modules investigated and KEPT:
- lib/web_scraper.py -- exports chunk_text() used by transcript_processor
- lib/new_pipeline.py -- active Stream B library management CLI tool
- lib/peertube_scraper.py -- only mechanism for transcript ingestion
- lib/extractor.py -- would activate for new PDFs via cmd_run CLI

Service restart verified: 6 threads (dispatcher, enrich, embed,
filing, progress, dashboard), no extract worker, zero errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 23:46:00 +00:00
70b80cb312 Phase 6b: fix dashboard Untitled/WEB bug for transcripts
Two bugs in the Recently Completed table:

1. Title showed "Untitled" for all transcripts because the dashboard
   read documents.book_title (populated by PDF metadata voting) which
   is NULL for transcripts. Fixed by COALESCE(book_title, filename)
   in the SQL query -- falls back to catalogue.filename which holds
   the real video title.

2. Type showed "WEB" for all transcripts because the type CASE
   expression only had web and pdf branches, with web matching any
   http% path -- and transcript paths are PeerTube watch URLs.
   Fixed by adding a transcript branch keyed on catalogue.source =
   stream.echo6.co, evaluated before the web branch.

Also adds badge-transcript CSS (purple) and JS rendering case.
Applied consistently to both the Recently Completed and Sources
table queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 23:05:29 +00:00
df29d598d3 Phase 6a: transcripts mark organized in-place, skip filing
Transcripts are derived text from PeerTube videos, not primary source
files. They do not belong in library/Domain/Subdomain/ like PDFs.

Change: transcript_processor.pre_flight() now sets organized_at =
CURRENT_TIMESTAMP at the end of successful processing, marking the
transcript as organized in place. The watch URL remains in
catalogue.path and Qdrant download_url so users clicking search
results go to the PeerTube video.

The filing workers path LIKE filter naturally excludes transcripts
since their documents.path is the watch URL, not a filesystem path.
No filing worker changes needed.

Back-fills 2,260 drain items from Phase 5c-2 via one-time SQL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 22:49:21 +00:00