Commit graph

60 commits

Author SHA1 Message Date
9c5b0520f9 Add PAD-US public land classification lookup
Integrates USGS PAD-US 4.0 (651k features) into a local PostGIS database
for point-in-polygon land ownership queries. Adds /api/landclass endpoint
returning classifications, public/private status, and management hierarchy.

- lib/landclass.py: connection pool, lookup_landclass(), domain label maps
- lib/api.py: GET /api/landclass?lat=&lon= (feature-flag gated)
- home.yaml: enable has_landclass flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 15:36:37 +00:00
3280e34718 Add Nav-I dashboard section with restore-as conflict resolution
- Create Nav-I top-level section in dashboard navigation
- Move Deleted Contacts from Knowledge subnav to Nav-I
- Add Nav-I landing page with card grid (deleted count, API keys stub)
- Add /nav-i/api-keys placeholder page
- Add restore-as endpoint for Home/Work conflict resolution
- Conflict modal in deleted contacts template for label rename on restore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 06:26:25 +00:00
a4288c0cd8 Add contacts/phone book system with per-user scoping
New files:
- lib/auth.py: Authentik forward-auth helpers (get_user_id, @require_auth)
- lib/contacts.py: ContactsDB with CRUD, soft delete, restore, purge, find_nearby
- lib/contacts_api.py: Flask Blueprint with 9 API endpoints at /api/contacts
- templates/knowledge/deleted_contacts.html: Dashboard recovery page

Modified:
- lib/api.py: Register contacts_bp, add KNOWLEDGE_SUBNAV entry, /deleted-contacts route
- config/profiles: has_contacts feature flag (true for home, false for pi profiles)

Separate SQLite DB at data/contacts.db. Per-user isolation via X-Authentik-Username.
Home/Work labels enforced unique per user. Haversine proximity queries (75m default).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 05:29:54 +00:00
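The Haversine proximity query mentioned in a4288c0cd8 can be sketched as below. This is a minimal illustration, not the code in lib/contacts.py: the contact record shape (dicts with "lat" and "lon") and the function signatures are assumptions, and only the 75 m default comes from the commit message.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_nearby(contacts, lat, lon, radius_m=75.0):
    """Return (distance_m, contact) pairs within radius_m, nearest first.

    `contacts` is a hypothetical list of dicts with "lat" and "lon" keys.
    """
    hits = []
    for contact in contacts:
        d = haversine_m(lat, lon, contact["lat"], contact["lon"])
        if d <= radius_m:
            hits.append((d, contact))
    hits.sort(key=lambda pair: pair[0])  # sort on distance only
    return hits
```

A linear scan like this is fine for a per-user phone book; a larger table would likely want a bounding-box prefilter in SQL before computing exact distances.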
095bf8c2af Add Google Places (New) tertiary enrichment for business POIs
Fills opening_hours, phone, and website gaps when OSM + Overture data
is incomplete. Only fires for business-class POIs (amenity, shop, tourism,
leisure, office, craft). Daily API call cap with SQLite tracking.
cache_put now preserves google columns across cache refreshes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 04:08:12 +00:00
620f99c762 Add business_intent_poi_boost reranker signal
When a query contains no road-type keywords (st, blvd, ave, etc.),
boost amenity/shop/tourism/leisure/office/craft results (+3.0) and
penalize highway/route results (-4.0). This fixes searches like
"starbucks twin falls" where a named service road outranked the
actual business POI due to Photon position tiebreaking.

Also fixes:
- Intent classifier now recognizes full state names ("idaho" not
  just "ID") for LOCALITY classification
- Locality-type Photon results now populate _city from name field
  so they participate in locality_fuzz scoring
- Trace logging expanded to all candidates with osm_key/value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 19:39:37 +00:00
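The business_intent_poi_boost signal from 620f99c762 might look roughly like this. The +3.0 and -4.0 weights and the OSM key sets come from the commit message; the road keyword list beyond "st, blvd, ave", the tokenizer, and the function signature are assumptions made for illustration.

```python
import re

# Assumed keyword list: the commit only names "st, blvd, ave, etc."
ROAD_KEYWORDS = {"st", "street", "blvd", "ave", "avenue", "rd", "road", "hwy", "highway"}
BUSINESS_KEYS = {"amenity", "shop", "tourism", "leisure", "office", "craft"}
ROAD_KEYS = {"highway", "route"}


def business_intent_poi_boost(query, osm_key):
    """Score adjustment for one candidate, given the raw query and its osm_key."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    if tokens & ROAD_KEYWORDS:
        return 0.0   # query looks like an address, so the signal stays neutral
    if osm_key in BUSINESS_KEYS:
        return 3.0   # boost business POIs
    if osm_key in ROAD_KEYS:
        return -4.0  # penalize roads and routes
    return 0.0
```

With the query "starbucks twin falls", an amenity candidate gains 3.0 while a highway candidate loses 4.0, which is enough to outweigh a position tiebreak in favor of the service road.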
d460f0e202 Fix type classifier: POI check takes precedence over street_address
Businesses with housenumbers (e.g. M&W Markets at 130 US-30) were
classified as street_address because the housenumber check fired before
the osm_key check. Reorder so osm_key in amenity/shop/tourism/leisure/office
is evaluated first, ensuring businesses get type=poi regardless of
whether they have a street address. Also adds office to the POI key set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 19:08:04 +00:00
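The reordering described in d460f0e202 comes down to which check runs first. A minimal sketch, assuming a candidate is a dict with "osm_key" and "housenumber" fields and that anything else falls into a hypothetical "other" type:

```python
POI_KEYS = {"amenity", "shop", "tourism", "leisure", "office"}


def classify_type(props):
    """Classify a geocoder candidate; the POI check runs before the housenumber check."""
    if props.get("osm_key") in POI_KEYS:
        return "poi"             # businesses win even when they carry a housenumber
    if props.get("housenumber"):
        return "street_address"
    return "other"               # placeholder for the remaining types
```

A shop at 130 US-30 now classifies as poi, while a plain address point with the same housenumber still classifies as street_address.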
65693d15aa Add Overture Maps POI enrichment layer for place details
Ingests 20.9M North America places from Overture Maps Foundation
(release 2026-04-15.0) into PostgreSQL. Enriches /api/place responses
with phone, website, and brand data via spatial + fuzzy name matching
when OSM extratags are sparse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 16:51:25 +00:00
2121ee4936 Add place detail proxy with Nominatim-first routing and Overpass fallback
New /api/place/<osm_type>/<osm_id> endpoint returns cleaned OSM tag data
for PlaceDetail panel enrichment. Routes to local Nominatim (Idaho coverage)
first, falls back to Overpass public API for out-of-region queries. Responses
cached in SQLite (data/place_cache.db) with no expiry.

New modules: lib/place_detail.py (proxy + cache), lib/osm_categories.py
(~50 category humanization mappings). Profile YAMLs updated with
place_details config block and has_nominatim_details flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 03:06:51 +00:00
64605b38bb Add TomTom traffic proxy and update profiles for hillshade/traffic layers
- Add /api/traffic/flow proxy route to hide TomTom API key from frontend
- Add tileset_hillshade and traffic config blocks to all three profiles
- Flip has_hillshade and has_traffic_overlay flags in home and regional profiles
- Minimal profile has config blocks but flags remain false (dormant)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 00:52:04 +00:00
e6b81db520 feat(navi): deployment profiles + /api/config endpoint
Add profile-driven config infrastructure:
- config/profiles/{home,regional_pi,minimal_pi}.yaml templates
- lib/deployment_config.py loader (reads RECON_PROFILE env var)
- GET /api/config returns active profile as JSON (5min cache)

Frontend reads this on startup to determine tile source, defaults,
and feature flags. No existing behavior changed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 23:35:39 +00:00
d4c5c371ca Merge feature/navi-integration: Navi backend (address book, Netsyms, geocoding chain, reverse endpoint)
2026-04-20 22:40:03 +00:00
ac69e2761d feat(navi): add /api/reverse endpoint for map-click reverse geocoding
Accepts lat/lon query params, calls Photon /reverse, returns same
response shape as /api/geocode. Returns 200 with empty results on
no match (graceful degradation for ocean/unmapped areas).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 21:26:35 +00:00
87b230dcba feat(navi): structured geocode with usaddress parsing and reranker
Add lib/geocode.py — multi-source retrieval pipeline:
- usaddress CRF parsing with intent classification
- Netsyms structured lookup (uses raw street abbreviations)
- Photon /structured + /api freetext retrieval
- Weighted 10-signal reranker (housenumber, street fuzz, locality,
  source authority, etc.)
- match_code annotations + address book proximity labeling
- Trace log at /tmp/geocode_rerank_trace.log

nav_tools.py now delegates geocode() to the new module.
Tests updated: US address queries correctly return Netsyms results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 16:29:47 +00:00
c76d63b785 refactor(navi): Photon-first geocoding with ranked results
Inverts the /api/geocode chain. Photon is now the primary search
engine; the hand-rolled Netsyms free-text parser is removed.
Address book short-circuits nicknames only ("home", "work") —
full-address queries flow through Photon and address book
entries within 75m annotate matching results with labeled_as.
Coordinate strings detected before search.

Response shape: /api/geocode now returns a ranked candidates
list (always 200 OK, empty list if no match). No more 404 for
unmatched queries. Users can type messy input — wrong case,
missing punctuation, abbreviations, typos — and get results
or close matches.

Netsyms preserved at /api/netsyms/lookup for direct access.
USPS plus4 enrichment of Photon street-address hits is a
planned follow-up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 15:48:03 +00:00
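c76d63b785 notes that coordinate strings are detected before search. One way to do that is sketched below; the accepted input format (two decimal numbers separated by a comma or a space, latitude first) is an assumption, since the commit does not say which formats are recognized.

```python
import re

# Two signed decimal numbers separated by a comma and/or whitespace.
_COORD_RE = re.compile(
    r"^\s*(-?\d{1,3}(?:\.\d+)?)\s*[, ]\s*(-?\d{1,3}(?:\.\d+)?)\s*$"
)


def parse_coordinates(query):
    """Return (lat, lon) if the query is a plausible coordinate pair, else None."""
    m = _COORD_RE.match(query)
    if not m:
        return None
    lat, lon = float(m.group(1)), float(m.group(2))
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon
    return None  # numeric, but outside valid lat/lon ranges
```

Running this check first lets a pasted coordinate pair skip the text search entirely.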
a14501347b fix(navi): address book prefix+boundary match for longer queries
lookup() previously did exact-alias-only matching, so "214 north st
filer" missed the home entry with alias "214 north st". Extend to
match when the query begins with an alias followed by a word
boundary, and when an alias appears as a contiguous token sequence
inside the query. Short aliases ("home") keep matching exactly and
also match with trailing text.

Fixes the UX case where typing a known full address falls through
to Netsyms instead of short-circuiting to address_book.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 07:54:32 +00:00
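The matching rules in a14501347b can be sketched with token comparison, which gives the word-boundary behavior for free. This is an illustration rather than the actual lookup() code; in particular, treating single-word aliases as prefix-only is an assumption drawn from the note that short aliases match "exactly and also ... with trailing text".

```python
import re


def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def alias_matches(query, alias):
    """True if alias matches query exactly, as a prefix, or as a contiguous token run."""
    q, a = _tokens(query), _tokens(alias)
    if not a or len(a) > len(q):
        return False
    if q[:len(a)] == a:
        return True   # exact match, or prefix ending on a word boundary
    if len(a) == 1:
        return False  # assumption: short aliases only match at the start
    # Multi-word alias appearing later in the query as a contiguous sequence.
    return any(q[i:i + len(a)] == a for i in range(1, len(q) - len(a) + 1))
```

Because comparison is per token, "2140 north st" does not match the alias "214 north st", while "214 north st filer" does.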
dfab388769 feat(navi): add netsyms tier-2 geocoding + geocode API
Add Netsyms AddressDatabase2025 (159M US+CA addresses) as tier-2
in the geocode chain: address_book → netsyms → photon.

- lib/netsyms.py: SQLite lookup module (lazy, read-only, thread-safe)
- lib/netsyms_api.py: Flask blueprints for /api/netsyms/* and /api/geocode
- lib/netsyms_test.py: 7 test cases (street, free-text, zipcode, health)
- lib/nav_tools.py: new geocode() with consistent {name,lat,lon,source,raw}
- lib/api.py: register netsyms_bp and geocode_bp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 07:24:09 +00:00
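The tiered chain introduced in dfab388769 (address_book, then netsyms, then photon) is a first-hit-wins fallback. A rough sketch follows; the result keys match the {name,lat,lon,source,raw} shape from the commit, but the three lookup functions are stand-ins, not the real modules. Note that c76d63b785 later inverted this order.

```python
def geocode(query, tiers):
    """Try each (source, lookup) tier in order and return the first hit."""
    for source, lookup in tiers:
        hit = lookup(query)
        if hit:
            return {
                "name": hit["name"],
                "lat": hit["lat"],
                "lon": hit["lon"],
                "source": source,
                "raw": hit,
            }
    return None


# Hypothetical stand-ins for the real lookups.
def address_book_lookup(q):
    return {"name": "Home", "lat": 42.0, "lon": -114.0} if q.lower() == "home" else None


def netsyms_lookup(q):
    return None  # pretend the address database has no match


def photon_lookup(q):
    return {"name": q.title(), "lat": 42.5, "lon": -114.4}


TIERS = [
    ("address_book", address_book_lookup),
    ("netsyms", netsyms_lookup),
    ("photon", photon_lookup),
]
```

Keeping the tiers as data makes it cheap to reorder them or to drop one for a given deployment.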
23483e8198 feat(navi): address book with geocoding integration
- YAML-backed saved locations (config/address_book.yaml)
- Exact/partial alias matching with case-insensitive lookup
- Flask blueprint: /api/address_book/lookup, /api/address_book/list
- Geocoder short-circuits Photon when address book has exact match
- Test suite for lookup behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 04:02:11 +00:00
3243f2f252 feat(navi): semantic query router for intelligent tool selection - Phase H2b
Add centroid-based query classifier that routes Aurora queries to the
appropriate handler (nav_route, nav_reverse_geocode, direct_answer,
rag_search) before the RAG pipeline runs. Uses TEI embeddings against
pre-computed route centroids from 38 example queries.

- query_router.py: standalone module with lazy centroid init
- query_router_test.py: 7-query test suite (all passing)
- Corresponding recon_rag_tool.py v4.2.0 deployed to Open WebUI DB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 23:50:35 +00:00
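Centroid-based routing as described in 3243f2f252 can be illustrated with toy vectors. The real module uses TEI embeddings of 38 example queries; here the two-dimensional vectors, the use of cosine similarity, and the function names are all assumptions chosen to keep the sketch self-contained.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route(query_vec, centroids):
    """Return the route name whose centroid is most similar to the query vector."""
    return max(centroids, key=lambda name: cosine(query_vec, centroids[name]))


# Toy stand-ins for embedded example queries, one list per route.
CENTROIDS = {
    "nav_route": centroid([[1.0, 0.0], [0.9, 0.1]]),
    "rag_search": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
```

Computing the centroids once, lazily, means each incoming query costs one embedding call plus a handful of dot products.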
9841c38011 fix(navi): format tool output as human-readable directions
2026-04-19 22:42:17 +00:00
a9510b5ed9 feat(navi): add nav_tools with route() and reverse_geocode() - Phase H2
- nav_tools.py: route() geocodes via Photon, routes via Valhalla, returns
  summary/maneuvers/polyline. reverse_geocode() for coordinate lookups.
  Supports auto/pedestrian/bicycle/truck modes.
- nav_tools_test.py: 5 live tests against local Photon (2322) and Valhalla (8002)
- aurora_nav_tool.py: Open WebUI Tool exposing get_directions to Aurora LLM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 22:14:26 +00:00
c5283ece3e Merge feature/scraper: Zimit-based web scraper
Replaces wget/SingleFile/Playwright crawl backends with Zimit (openZIM
Docker crawler). Produces ZIM files directly — no zimwriterfs step.
Validated with meshtastic.org (3400+ page Docusaurus site).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 19:37:04 +00:00
5f5bcedab9 Fix progress regex and SIGHUP/scan_zims race condition
- Parse Browsertrix "crawled":N JSON format instead of "N pages"
- Add 3s delay between SIGHUP to kiwix-serve and scan_zims() call
  so the OPDS catalog is reloaded before we query it for linking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 19:35:42 +00:00
9692044790 Fix progress parsing for Browsertrix JSON log format
Parse "crawled":N from Browsertrix crawlStatus JSON logs instead of
looking for "N pages" pattern. Also check stdout (not just stderr).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 19:33:50 +00:00
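The progress fix in 9692044790 amounts to matching the "crawled":N field rather than an "N pages" phrase. A minimal sketch, assuming log lines arrive one at a time as text; the sample line shape used in the comment is illustrative and not copied from real crawler output.

```python
import re

# Matches the crawled counter wherever it appears in a JSON log line.
_CRAWLED_RE = re.compile(r'"crawled"\s*:\s*(\d+)')


def pages_crawled(log_line):
    """Return the crawled page count from a log line, or None if absent."""
    m = _CRAWLED_RE.search(log_line)
    return int(m.group(1)) if m else None


# Illustrative line: {"context":"crawlStatus","details":{"crawled":12,"total":340}}
```

A regex keeps the parser tolerant of partial or non-JSON lines on stdout and stderr; parsing each line with json.loads would be stricter but needs error handling for everything that is not JSON.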
b035ba3f20 Fix Zimit: add required --name flag for warc2zim
warc2zim (called internally by zimit) requires --name for ZIM metadata.
Without it, argument validation fails with exit code 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 14:30:42 +00:00
76076fc4ab Fix Zimit CLI: add subcommand, correct flag names, fix container cleanup
- Must pass `zimit` as command after image name (entrypoint execs args)
- --url → --seeds, --name removed, --lang → --zim-lang, --workers → -w
- Remove --rm so docker logs work after exit, manually rm container

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 14:13:34 +00:00
8945c82e3f Replace wget/SingleFile/Playwright backends with Zimit
- Zimit Docker container handles all site types (static, SPA, JS redirects)
- Removed: _detect_crawl_mode, _crawl_wget, _crawl_singlefile, preflight logic
- Added: _crawl_zimit() with Docker lifecycle management
- Simplified pipeline: submit → Zimit crawl → kiwix-manage register → done
- No more zimwriterfs step — Zimit produces ZIM directly
- Dashboard UI simplified: removed crawl mode dropdown
- Config simplified: removed reject patterns, preflight, singlefile sections

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 14:06:23 +00:00
f0b160ef7c Extract _full_zim_cleanup helper, add SIGHUP + scrape_jobs cleanup
- Extract shared _full_zim_cleanup(source_id) from api_kiwix_remove
- Add SIGHUP to kiwix-serve after kiwix-manage remove
- Delete linked scrape_jobs rows during ZIM removal
- Update api_scraper_delete to do full ZIM cleanup when applicable
- Set chromium_path for single-file browser crawl support
- Add status.db to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-19 02:28:49 +00:00
45c3bb8d56 Add scraper job queue management (delete, clear failed)
New API endpoints: DELETE single job, clear all failed/cancelled.
Dashboard now shows Delete buttons on completed/failed jobs,
Retry+Delete on failed jobs, and a Clear Failed bulk action.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 21:03:39 +00:00
1ce9a3731f Add scraper dashboard UI under Kiwix tab
New /kiwix/scraper page with submit form (URL, title, language,
crawl mode), stats cards, and auto-refreshing jobs table with
cancel/retry actions. Kiwix section now has Library/Scraper subnav.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 20:47:17 +00:00
45b954fccc Fix ZIM filename collisions by appending job ID
Format: {domain}_{lang}_{YYYY-MM}_{job_id}.zim
Prevents zimwriterfs failures when the same domain is scraped
multiple times in the same month.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 20:17:53 +00:00
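The filename format from 45b954fccc can be generated as follows. Only the {domain}_{lang}_{YYYY-MM}_{job_id}.zim pattern comes from the commit; the function name and the choice to pass the date in are assumptions, and any sanitizing of the domain is left out.

```python
from datetime import date


def zim_filename(domain, lang, job_id, today=None):
    """Build {domain}_{lang}_{YYYY-MM}_{job_id}.zim; job_id keeps names unique."""
    today = today or date.today()
    return f"{domain}_{lang}_{today:%Y-%m}_{job_id}.zim"
```

Two jobs for the same domain in the same month now differ in the job_id component, so the second run no longer collides with the first.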
125602fa13 Fix SingleFile CLI: remove invalid --crawl-delay flag
SingleFile CLI has no --crawl-delay option. The invalid flag caused the
process to print help and exit with no output. Added --crawl-no-parent
and --crawl-replace-URLs instead. Removed unused crawl_delay config key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 19:28:03 +00:00
da50e5f0b8 Add scraper Phase 2: smart crawl mode detection + browser fallback
- Pre-flight detection: wget + Playwright probe to auto-detect if site
  needs browser rendering (JS apps, parking page redirects)
- SingleFile CLI crawl backend for JS-rendered sites
- crawl_mode column in scrape_jobs (static/browser/redirect/auto)
- API: optional crawl_mode param on submit, cleared on retry
- Config: rate_limit_delay 2.0→0.5, /api/ reject pattern, preflight
  + singlefile config sections
- Prerequisites: Node.js 22, single-file-cli, Playwright + Chromium

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 18:26:43 +00:00
491a4350fc Merge feature/kiwix: Kiwix ZIM integration
- ZIM monitor + kiwix-serve foundation (Phase 1)
- Batch article ingestion pipeline (Phase 2)
- Dashboard tab, wiki.echo6.co citations
- Language filter for non-English articles
- Status badge + progress column fixes
- Download URL generation fix (/content/ prefix, full ZIM name)
2026-04-18 00:07:00 +00:00
b250d0c257 Fix Kiwix download URL generation in embedder
- Add /content/ prefix to wiki.echo6.co URLs (required by kiwix-serve)
- Stop stripping ZIM flavor/date suffix (e.g. _maxi_2025-11) from filename
- Use str.removesuffix instead of regex to strip only .zim extension

Before: https://wiki.echo6.co/appropedia_en_all/Article
After:  https://wiki.echo6.co/content/appropedia_en_all_maxi_2025-11/Article

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-18 00:06:52 +00:00
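The URL fix in b250d0c257 can be sketched as below. The base URL, the /content/ prefix, and the use of str.removesuffix come from the commit; the function name and the percent-encoding of the article path are assumptions.

```python
from urllib.parse import quote

WIKI_BASE = "https://wiki.echo6.co"


def kiwix_article_url(zim_filename, article_path):
    """Build a kiwix-serve URL, keeping the full ZIM name minus the .zim extension."""
    book = zim_filename.removesuffix(".zim")  # Python 3.9+; strips only the extension
    return f"{WIKI_BASE}/content/{book}/{quote(article_path)}"
```

Unlike a regex that trims everything after the language code, removesuffix leaves the flavor and date suffix (such as _maxi_2025-11) intact.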
a40ce47127 Fix progress column to show Qdrant count for completed sources
Complete sources now show "19,344 in Qdrant" instead of misleading
extraction counts. Each status gets contextual progress display:
complete → X in Qdrant, processing → X/Y in Qdrant (%),
extracting → X/Y extracted, detected → dash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 15:31:01 +00:00
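The per-status progress text from a40ce47127 maps cleanly onto a small formatter. This sketch assumes lowercase status strings, integer counters, a floor-rounded percentage, and a plain hyphen for the "dash" case; none of those details are stated in the commit.

```python
def progress_label(status, extracted=0, total=0, in_qdrant=0):
    """Return the progress column text for one Kiwix source."""
    if status == "complete":
        return f"{in_qdrant:,} in Qdrant"
    if status == "processing":
        pct = 100 * in_qdrant // total if total else 0
        return f"{in_qdrant:,}/{total:,} in Qdrant ({pct}%)"
    if status == "extracting":
        return f"{extracted:,}/{total:,} extracted"
    return "-"  # detected: nothing to report yet
```

For a completed source with 19,344 vectors this yields "19,344 in Qdrant", matching the example in the commit message.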
fed02186fa Fix Kiwix status badges to reflect full pipeline state
Status was showing COMPLETE after ZIM extraction finished, even when
documents were still queued for enrichment/embedding. Now computes
effective_status by checking actual pipeline state per-source:

- DETECTED: ingest not enabled (gray)
- EXTRACTING: ZIM processor running (blue)
- PROCESSING: extracted but docs still in enricher/embedder queue (amber)
- COMPLETE: all docs fully enriched and embedded in Qdrant (green)

Also fixed _build_kiwix_sources pipeline query to filter by category
per-source instead of returning global kiwix stats for every source.

Progress column now shows "X / Y in Qdrant" when processing, or
"X / Y extracted" otherwise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 15:22:44 +00:00
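The effective_status computation from fed02186fa can be read as a short decision ladder. The four states and their meanings come from the commit; the input parameters are hypothetical, since the real code derives them from per-source pipeline queries.

```python
def effective_status(ingest_enabled, extraction_running, docs_total, docs_embedded):
    """Derive the displayed status for one ZIM source from pipeline state."""
    if not ingest_enabled:
        return "DETECTED"      # gray: ZIM seen, ingest not enabled
    if extraction_running:
        return "EXTRACTING"    # blue: ZIM processor still running
    if docs_embedded < docs_total:
        return "PROCESSING"    # amber: docs still in enricher/embedder queue
    return "COMPLETE"          # green: everything embedded in Qdrant
```

The key change is the third branch: extraction finishing is no longer enough to report COMPLETE while documents are still queued downstream.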
6f2a1d206e Add langdetect language filter to enricher + purge non-English ZIM articles
- Install langdetect package for content-level language detection
- Add _check_language() to enricher.py: reads first 1500 chars of first
  page, detects language via langdetect, skips if not in allowed list
- Configurable via config.yaml pipeline.language_filter and
  pipeline.allowed_languages (default: en only)
- Catches non-English content from ANY source (PDF, web, ZIM, PeerTube)
  before burning Gemini API quota on enrichment
- Add scan_zims retry logic (3 attempts, 2s delay) for upload handler
- Purged 6,483 stale non-English zim_articles rows from DB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 14:37:13 +00:00
501004ecf1 Filter non-English articles from ZIM ingestion
Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.)
before text extraction to avoid wasting Gemini enrichment on translations.
Uses path-based regex matching against ISO 639 language codes.

~5,276 non-English articles already ingested from Appropedia (top: es=837,
zh=765, ru=475, fr=433, ko=407). Purge decision deferred.
2026-04-17 07:30:30 +00:00
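The path-based filter in 501004ecf1 can be approximated with an anchored regex. The language list below is a small illustrative subset, not the full ISO 639 set the commit refers to, and the function name is an assumption.

```python
import re

# Illustrative subset; the commit matches against ISO 639 language codes.
LANG_CODES = ("es", "fr", "pl", "zh", "ru", "ko", "de", "pt", "it", "ja")

# A translation path ends in "/<code>", e.g. "Solar_cooker/es".
_TRANSLATION_RE = re.compile(r"/(?:%s)$" % "|".join(LANG_CODES))


def is_translation(article_path):
    """True if the article path carries a MediaWiki translation suffix."""
    return _TRANSLATION_RE.search(article_path) is not None
```

Anchoring on the final path segment avoids false positives on subpages such as "Guides/press", where a language code only appears as part of a longer word.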
2635160887 Kiwix integration: ZIM processor, dashboard tab, wiki.echo6.co citations
- ZIM processor: extract articles from ZIM files, feed into existing enrichment pipeline
- Dashboard: Kiwix tab with library table, ingest toggle, upload, remove
- kiwix-serve on port 8430, wiki.echo6.co behind Authentik
- Citation URLs point to wiki.echo6.co/{zimname}/{article_path}
- Dashboard shows WIKI type badge for ZIM-sourced content
- Appropedia EN (19,445 articles) fully ingested as proof of concept
2026-04-17 07:00:24 +00:00
c60aa5e80d Phase 2: ZIM processor — batch article ingestion pipeline
Adds lib/processors/zim_processor.py which opens a ZIM file via
python-libzim, iterates HTML articles, strips to clean text (lxml),
and feeds each article into the existing RECON enrichment pipeline.

Key features:
- HTML to text via lxml (strips nav/footer/script/style)
- Filters redirects, non-HTML entries, stubs (<200 chars)
- Content hash dedup against existing catalogue
- Creates processing dirs with page files and meta.json
- Registers articles as "extracted" for automatic enrichment
- Checkpointing via zim_sources.last_checkpoint for resume
- Configurable batch size and delay for rate control
- Standalone CLI: python3 -m lib.processors.zim_processor

Tested: 100 Appropedia articles processed in 3s, enricher picks
them up automatically via the existing pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 02:03:12 +00:00
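Two of the filters listed in c60aa5e80d, stub rejection and content-hash dedup, can be combined in one pass. The 200-character threshold comes from the commit; SHA-256 as the hash and the (path, text) input shape are assumptions.

```python
import hashlib

MIN_CHARS = 200  # articles shorter than this are treated as stubs


def select_articles(articles, known_hashes):
    """Yield (path, text, digest) for articles that are neither stubs nor duplicates.

    `articles` is an iterable of (path, text); `known_hashes` holds digests
    already present in the catalogue.
    """
    seen = set(known_hashes)
    for path, text in articles:
        if len(text) < MIN_CHARS:
            continue  # stub
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already ingested, or seen earlier in this run
        seen.add(digest)
        yield path, text, digest
```

Because the function is a generator, a large ZIM can be streamed through it without holding every article in memory.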
7c1af0f063 Phase 1: Kiwix foundation — ZIM monitor and kiwix-serve setup
- Add lib/zim_monitor.py: polls kiwix-serve OPDS v2 catalog, detects
  new ZIMs, reads accurate article count from python-libzim Counter
  metadata (not inflated OPDS count), inserts into zim_sources table.
  Idempotent on re-run, marks removed ZIMs.
- DB schema: zim_sources, zim_samples, zim_articles tables (created
  via sqlite3, not in migrations — matches existing RECON pattern)
- kiwix-tools 3.7.0 installed from binary tarball at /opt/recon/bin/
  (Ubuntu 24.04 apt ships 3.5.0 which lacks OPDS v2)
- kiwix.service systemd unit on port 8430
- python-libzim 3.9.0 installed
- Test ZIM: Appropedia EN maxi (496 MB, 19,445 articles)
- Add bin/ to .gitignore (binary tarball, not source)
2026-04-16 23:39:34 +00:00
8d54ff165d Merge refactor branch: RECON v1.0.0
This merge integrates the complete refactor effort spanning Phases 0-6k,
bringing RECON from its initial baseline into production-grade form.

Pipeline architecture
---------------------
- Phases 0-2: foundation cleanup, removed dead code, standardized logging
- Phase 3: dispatcher rewrite — watches data/acquired/<subfolder>/ for
  {hash}.txt + {hash}.meta.json pairs, atomic .tmp rename, idempotent
- Phase 4: content processors for PDF (PyPDF2 -> pdftotext -> Tesseract ->
  Gemini Vision fallback chain), transcript, and text formats
- Phase 5: enrichment, embedding, and filing daemons split into
  independently restartable threads

PeerTube acquisition
--------------------
- Phase 6a-6c: PeerTube channel watcher, caption acquisition with rate
  limiting (429 handling), 0.5s rate_limit_delay enforced
- Phase 6d: multi-instance support
- Phase 6e: rewired the dashboard PeerTube endpoint to the acquisition
  module, then reverted the change

Format handling & library cleanup
---------------------------------
- Phase 6f: text processor for .txt ingestion
- Phase 6f-2: format normalizer in dispatcher
- Phase 6g-6j: library reorg — ghost domain cleanup, SCL moved to
  dedicated domain folder, pi-nas fully decommissioned as a storage
  target (NFS-only now), ~73 GB reclaimed
- Phase 6k: hash-identical dedup — 2,477 duplicate PDFs removed,
  22.05 GB freed, catalogue/documents/Qdrant payloads updated
  coherently, 226 empty domain subdirs pruned
- 16,340 transcripts remain un-filed pending title-match review

Dashboard & metadata
--------------------
- Gemini "null" string bug fixed in pdf_processor metadata voting
- Dashboard upload migrated to pipeline with multi-format support

State at release
----------------
- 7 daemon threads: dispatcher, enrich, embed, filing, peertube-acq,
  progress, dashboard
- 29,201 documents in catalogue / documents tables (UNIQUE on hash PK)
- ~2.1M Qdrant vectors in recon_knowledge_hybrid (cortex:6333)
- ~67 GB library on /mnt/library (NFS from pi-nas)
- files.echo6.co serving 9,397 deduped PDFs
- recon.echo6.co dashboard + API on :8420

See cleanup-log.md for the full backlog and resolution history.
2026-04-16 18:20:25 +00:00
e6224cb279 Migrate dashboard upload to pipeline with multi-format support
Upload handler now writes files to the appropriate hopper subfolder
instead of copying directly to /mnt/library/:
- .pdf -> acquired/pdf/
- .txt -> acquired/text/
- .epub, .doc, .docx, .mobi -> acquired/pdf/ (dispatcher format
  normalizer converts to PDF before processing)

The dispatcher picks up files and routes through the appropriate
processor (pdf_processor or text_processor) for full metadata
voting, domain classification, and canonical filing.

Changes to api_upload() / _process_upload():
- Relaxed extension check: PDF, TXT, EPUB, DOC, DOCX, MOBI
- Routes to correct hopper subfolder by extension
- Writes meta.json sidecar with original filename and category hint
- Removed: direct library copy, add_to_catalogue, queue_document
- Added: hopper-level dedup check (catches rapid re-uploads)
- Kept: catalogue dedup check for immediate user feedback

Changes to api_upload_status():
- Added fallback: checks acquired/ and processing/ dirs if hash
  not yet in documents table (covers gap between upload and
  dispatcher pickup)

Template updated: accept attribute and help text now reflect
multi-format support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 02:18:45 +00:00
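The extension routing in e6224cb279 reduces to a lookup table. The mapping itself is taken from the commit message; the function name and the error raised for unsupported types are assumptions.

```python
from pathlib import Path

# Non-PDF document formats land in acquired/pdf/ because the dispatcher's
# format normalizer converts them to PDF before processing.
HOPPER_BY_EXT = {
    ".pdf": "pdf",
    ".txt": "text",
    ".epub": "pdf",
    ".doc": "pdf",
    ".docx": "pdf",
    ".mobi": "pdf",
}


def hopper_subfolder(filename):
    """Return the hopper path for an uploaded file, based on its extension."""
    ext = Path(filename).suffix.lower()
    try:
        return f"acquired/{HOPPER_BY_EXT[ext]}"
    except KeyError:
        raise ValueError(f"unsupported upload type: {ext or '(none)'}") from None
```

Keeping the table in one place means the extension check and the routing decision cannot drift apart.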
999cf37626 Fix: Gemini "null" string bug in pdf_processor metadata voting
Same fix as text_processor — Gemini sometimes returns the literal
string "null" instead of JSON null for empty metadata fields. The
voting logic and Gemini extraction now both treat "null" strings
as None.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:30:59 +00:00
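The fix in 999cf37626 is a normalization step applied before voting. A minimal sketch, assuming metadata arrives as a flat dict; ignoring case and surrounding whitespace is an extra assumption beyond what the commit states.

```python
def clean_null(value):
    """Map the literal string "null" to None; pass every other value through."""
    if isinstance(value, str) and value.strip().lower() == "null":
        return None
    return value


def normalize_metadata(meta):
    """Apply clean_null to every field of an extracted metadata dict."""
    return {key: clean_null(val) for key, val in meta.items()}
```

Without this step a field such as author="null" counts as a real vote and can beat an empty field from another source.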
f4659d155f Phase 6f-2: format normalizer in dispatcher
Adds _normalize_formats() to the dispatcher that converts non-standard
document formats to PDF before dispatch. Supports:
- .epub, .mobi -> PDF via ebook-convert (Calibre)
- .doc, .docx -> PDF via LibreOffice headless

Called per-subfolder before _find_pairs() so _find_pairs() only ever
sees standard content files. Conversion failures are logged and
skipped -- the original file stays in acquired/ for manual review.

Also converts 3 staged epub files and cleans up _staging/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:08:19 +00:00
62539861f2 Phase 6f: text processor for .txt file ingestion
New processor: lib/processors/text_processor.py
Handles plain text files (.txt) as primary source documents.

Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight()
-> enrich -> embed -> filing worker -> library/Domain/Subdomain/

Metadata extraction via two-source vote:
- Source A: filename parsing (title from filename)
- Source B: Gemini LLM extraction (title/author/edition/year from
  first 3 pages of text)

Page splitting reuses chunk_text() from lib/web_scraper.py.
Filing behavior matches PDFs (files to library, not organized
in-place like transcripts).

Config: adds text: text_processor to pipeline.dispatch map.
New hopper subfolder: data/acquired/text/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 22:39:31 +00:00
7fe7d03583 Revert "Phase 6e: rewire dashboard PeerTube endpoint to acquisition module"
This reverts commit 7e42528d2f.
2026-04-15 03:20:46 +00:00
7e42528d2f Phase 6e: rewire dashboard PeerTube endpoint to acquisition module
Replace legacy ingest_channel/ingest_all imports with acquire_batch
from lib.acquisition.peertube. The endpoint now writes flat file pairs
to the hopper and lets the dispatcher handle processing, matching the
Phase 6d architecture. Removes channel/since/process parameters that
were tied to the old direct-ingest path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:15:41 +00:00
277110d999 Phase 6d: PeerTube acquisition module + service thread
New lib/acquisition/peertube.py replaces the removed peertube_scanner_loop.
Polls PeerTube API every 30min, dedupes against catalogue (UUID + title),
writes flat file pairs to data/acquired/stream/ for the dispatcher.

- acquire_batch(): one-shot find-and-acquire with rate limiting
- acquisition_loop(): service thread wrapper (interval from config)
- list_new_videos(): dedup via _build_known_sets() against catalogue
- acquire_one(): fetch VTT, convert, write .tmp then rename atomically

cmd_service(): added peertube-acq daemon thread
cmd_ingest_peertube(): rewired to use acquire_batch(), drops --channel/
  --since/--enrich/--process (dispatcher handles full pipeline)
config.yaml: added peertube.poll_interval: 1800

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 03:08:51 +00:00
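The "write .tmp then rename atomically" step in 277110d999 can be sketched as follows. The {hash}.txt and {hash}.meta.json names follow the pair convention from the v1.0.0 merge notes; the function name, argument list, and write order are assumptions.

```python
import json
import os
from pathlib import Path


def write_pair_atomically(dest_dir, content_hash, text, meta):
    """Write {hash}.txt and {hash}.meta.json so readers never see partial files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    payloads = (
        (f"{content_hash}.txt", text),
        (f"{content_hash}.meta.json", json.dumps(meta)),
    )
    for name, payload in payloads:
        tmp = dest / (name + ".tmp")
        tmp.write_text(payload, encoding="utf-8")
        os.replace(tmp, dest / name)  # atomic when both paths are on one filesystem
```

A watcher that ignores *.tmp names will only ever pick up fully written files, which is what lets the dispatcher poll the hopper without locking.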
efae4023f6 Phase 6c: remove vestigial extract worker, dead crawler, .bak files
recon.py:
- Remove extract stage_loop thread from cmd_service(). Confirmed
  vestigial: 0 queued items, silent logs over 24+ hour run. The new
  processors do extraction inline in pre_flight().
- Remove cmd_crawl CLI subcommand and its argparse registration.
- Clean up associated imports and variables.

Deleted:
- lib/crawler.py (432 lines) -- old web crawler subsystem, only
  referenced by the removed CLI subcommand.
- 24 .bak files (untracked pre-edit safety backups, originals
  preserved in git history).

Investigation found the four old loop function definitions
(scanner_loop, peertube_scanner_loop, crawler_scheduler_loop,
organizer_loop) were already deleted in Phase 5c-1.

Modules investigated and KEPT:
- lib/web_scraper.py -- exports chunk_text() used by transcript_processor
- lib/new_pipeline.py -- active Stream B library management CLI tool
- lib/peertube_scraper.py -- only mechanism for transcript ingestion
- lib/extractor.py -- would activate for new PDFs via cmd_run CLI

Service restart verified: 6 threads (dispatcher, enrich, embed,
filing, progress, dashboard), no extract worker, zero errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 23:46:00 +00:00