matt/recon

mirror of https://github.com/zvx-echo6/recon.git synced 2026-05-20 14:44:54 +02:00

Author	SHA1	Message	Date
Matt	9c5b0520f9	Add PAD-US public land classification lookup Integrates USGS PAD-US 4.0 (651k features) into a local PostGIS database for point-in-polygon land ownership queries. Adds /api/landclass endpoint returning classifications, public/private status, and management hierarchy. - lib/landclass.py: connection pool, lookup_landclass(), domain label maps - lib/api.py: GET /api/landclass?lat=&lon= (feature-flag gated) - home.yaml: enable has_landclass flag Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-22 15:36:37 +00:00
Matt	3280e34718	Add Nav-I dashboard section with restore-as conflict resolution - Create Nav-I top-level section in dashboard navigation - Move Deleted Contacts from Knowledge subnav to Nav-I - Add Nav-I landing page with card grid (deleted count, API keys stub) - Add /nav-i/api-keys placeholder page - Add restore-as endpoint for Home/Work conflict resolution - Conflict modal in deleted contacts template for label rename on restore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-22 06:26:25 +00:00
Matt	a4288c0cd8	Add contacts/phone book system with per-user scoping New files: - lib/auth.py: Authentik forward-auth helpers (get_user_id, @require_auth) - lib/contacts.py: ContactsDB with CRUD, soft delete, restore, purge, find_nearby - lib/contacts_api.py: Flask Blueprint with 9 API endpoints at /api/contacts - templates/knowledge/deleted_contacts.html: Dashboard recovery page Modified: - lib/api.py: Register contacts_bp, add KNOWLEDGE_SUBNAV entry, /deleted-contacts route - config/profiles: has_contacts feature flag (true for home, false for pi profiles) Separate SQLite DB at data/contacts.db. Per-user isolation via X-Authentik-Username. Home/Work labels enforced unique per user. Haversine proximity queries (75m default). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-22 05:29:54 +00:00
Matt	095bf8c2af	Add Google Places (New) tertiary enrichment for business POIs Fills opening_hours, phone, and website gaps when OSM + Overture data is incomplete. Only fires for business-class POIs (amenity, shop, tourism, leisure, office, craft). Daily API call cap with SQLite tracking. cache_put now preserves google columns across cache refreshes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-22 04:08:12 +00:00
Matt	620f99c762	Add business_intent_poi_boost reranker signal When a query contains no road-type keywords (st, blvd, ave, etc.), boost amenity/shop/tourism/leisure/office/craft results (+3.0) and penalize highway/route results (-4.0). This fixes searches like "starbucks twin falls" where a named service road outranked the actual business POI due to Photon position tiebreaking. Also fixes: - Intent classifier now recognizes full state names ("idaho" not just "ID") for LOCALITY classification - Locality-type Photon results now populate _city from name field so they participate in locality_fuzz scoring - Trace logging expanded to all candidates with osm_key/value Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 19:39:37 +00:00
Matt	d460f0e202	Fix type classifier: POI check takes precedence over street_address Businesses with housenumbers (e.g. M&W Markets at 130 US-30) were classified as street_address because the housenumber check fired before the osm_key check. Reorder so osm_key in amenity/shop/tourism/leisure/office is evaluated first, ensuring businesses get type=poi regardless of whether they have a street address. Also adds office to the POI key set. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 19:08:04 +00:00
Matt	65693d15aa	Add Overture Maps POI enrichment layer for place details Ingests 20.9M North America places from Overture Maps Foundation (release 2026-04-15.0) into PostgreSQL. Enriches /api/place responses with phone, website, and brand data via spatial + fuzzy name matching when OSM extratags are sparse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 16:51:25 +00:00
Matt	2121ee4936	Add place detail proxy with Nominatim-first routing and Overpass fallback New /api/place/<osm_type>/<osm_id> endpoint returns cleaned OSM tag data for PlaceDetail panel enrichment. Routes to local Nominatim (Idaho coverage) first, falls back to Overpass public API for out-of-region queries. Responses cached in SQLite (data/place_cache.db) with no expiry. New modules: lib/place_detail.py (proxy + cache), lib/osm_categories.py (~50 category humanization mappings). Profile YAMLs updated with place_details config block and has_nominatim_details flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 03:06:51 +00:00
Matt	64605b38bb	Add TomTom traffic proxy and update profiles for hillshade/traffic layers - Add /api/traffic/flow proxy route to hide TomTom API key from frontend - Add tileset_hillshade and traffic config blocks to all three profiles - Flip has_hillshade and has_traffic_overlay flags in home and regional profiles - Minimal profile has config blocks but flags remain false (dormant) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 00:52:04 +00:00
Matt	e6b81db520	feat(navi): deployment profiles + /api/config endpoint Add profile-driven config infrastructure: - config/profiles/{home,regional_pi,minimal_pi}.yaml templates - lib/deployment_config.py loader (reads RECON_PROFILE env var) - GET /api/config returns active profile as JSON (5min cache) Frontend reads this on startup to determine tile source, defaults, and feature flags. No existing behavior changed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 23:35:39 +00:00
Matt	d4c5c371ca	Merge feature/navi-integration: Navi backend (address book, Netsyms, geocoding chain, reverse endpoint)	2026-04-20 22:40:03 +00:00
Matt	ac69e2761d	feat(navi): add /api/reverse endpoint for map-click reverse geocoding Accepts lat/lon query params, calls Photon /reverse, returns same response shape as /api/geocode. Returns 200 with empty results on no match (graceful degradation for ocean/unmapped areas). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 21:26:35 +00:00
Matt	87b230dcba	feat(navi): structured geocode with usaddress parsing and reranker Add lib/geocode.py — multi-source retrieval pipeline: - usaddress CRF parsing with intent classification - Netsyms structured lookup (uses raw street abbreviations) - Photon /structured + /api freetext retrieval - Weighted 10-signal reranker (housenumber, street fuzz, locality, source authority, etc.) - match_code annotations + address book proximity labeling - Trace log at /tmp/geocode_rerank_trace.log nav_tools.py now delegates geocode() to the new module. Tests updated: US address queries correctly return Netsyms results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 16:29:47 +00:00
Matt	c76d63b785	refactor(navi): Photon-first geocoding with ranked results Inverts the /api/geocode chain. Photon is now the primary search engine; the hand-rolled Netsyms free-text parser is removed. Address book short-circuits nicknames only ("home", "work") — full-address queries flow through Photon and address book entries within 75m annotate matching results with labeled_as. Coordinate strings detected before search. Response shape: /api/geocode now returns a ranked candidates list (always 200 OK, empty list if no match). No more 404 for unmatched queries. Users can type messy input — wrong case, missing punctuation, abbreviations, typos — and get results or close matches. Netsyms preserved at /api/netsyms/lookup for direct access. USPS plus4 enrichment of Photon street-address hits is a planned follow-up. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 15:48:03 +00:00
Matt	a14501347b	fix(navi): address book prefix+boundary match for longer queries lookup() previously did exact-alias-only matching, so "214 north st filer" missed the home entry with alias "214 north st". Extend to match when the query begins with an alias followed by a word boundary, and when an alias appears as a contiguous token sequence inside the query. Short aliases ("home") keep matching exactly and also match with trailing text. Fixes the UX case where typing a known full address falls through to Netsyms instead of short-circuiting to address_book. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 07:54:32 +00:00
Matt	dfab388769	feat(navi): add netsyms tier-2 geocoding + geocode API Add Netsyms AddressDatabase2025 (159M US+CA addresses) as tier-2 in the geocode chain: address_book → netsyms → photon. - lib/netsyms.py: SQLite lookup module (lazy, read-only, thread-safe) - lib/netsyms_api.py: Flask blueprints for /api/netsyms/* and /api/geocode - lib/netsyms_test.py: 7 test cases (street, free-text, zipcode, health) - lib/nav_tools.py: new geocode() with consistent {name,lat,lon,source,raw} - lib/api.py: register netsyms_bp and geocode_bp Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 07:24:09 +00:00
Matt	23483e8198	feat(navi): address book with geocoding integration - YAML-backed saved locations (config/address_book.yaml) - Exact/partial alias matching with case-insensitive lookup - Flask blueprint: /api/address_book/lookup, /api/address_book/list - Geocoder short-circuits Photon when address book has exact match - Test suite for lookup behavior Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 04:02:11 +00:00
Matt	3243f2f252	feat(navi): semantic query router for intelligent tool selection - Phase H2b Add centroid-based query classifier that routes Aurora queries to the appropriate handler (nav_route, nav_reverse_geocode, direct_answer, rag_search) before the RAG pipeline runs. Uses TEI embeddings against pre-computed route centroids from 38 example queries. - query_router.py: standalone module with lazy centroid init - query_router_test.py: 7-query test suite (all passing) - Corresponding recon_rag_tool.py v4.2.0 deployed to Open WebUI DB Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 23:50:35 +00:00
Matt	9841c38011	fix(navi): format tool output as human-readable directions	2026-04-19 22:42:17 +00:00
Matt	a9510b5ed9	feat(navi): add nav_tools with route() and reverse_geocode() - Phase H2 - nav_tools.py: route() geocodes via Photon, routes via Valhalla, returns summary/maneuvers/polyline. reverse_geocode() for coordinate lookups. Supports auto/pedestrian/bicycle/truck modes. - nav_tools_test.py: 5 live tests against local Photon (2322) and Valhalla (8002) - aurora_nav_tool.py: Open WebUI Tool exposing get_directions to Aurora LLM Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 22:14:26 +00:00
Matt	c5283ece3e	Merge feature/scraper: Zimit-based web scraper Replaces wget/SingleFile/Playwright crawl backends with Zimit (openZIM Docker crawler). Produces ZIM files directly — no zimwriterfs step. Validated with meshtastic.org (3400+ page Docusaurus site). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 19:37:04 +00:00
Matt	5f5bcedab9	Fix progress regex and SIGHUP/scan_zims race condition - Parse Browsertrix "crawled":N JSON format instead of "N pages" - Add 3s delay between SIGHUP to kiwix-serve and scan_zims() call so the OPDS catalog is reloaded before we query it for linking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 19:35:42 +00:00
Matt	9692044790	Fix progress parsing for Browsertrix JSON log format Parse "crawled":N from Browsertrix crawlStatus JSON logs instead of looking for "N pages" pattern. Also check stdout (not just stderr). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 19:33:50 +00:00
Matt	b035ba3f20	Fix Zimit: add required --name flag for warc2zim warc2zim (called internally by zimit) requires --name for ZIM metadata. Without it, argument validation fails with exit code 2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:30:42 +00:00
Matt	76076fc4ab	Fix Zimit CLI: add subcommand, correct flag names, fix container cleanup - Must pass `zimit` as command after image name (entrypoint execs args) - --url → --seeds, --name removed, --lang → --zim-lang, --workers → -w - Remove --rm so docker logs work after exit, manually rm container Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:13:34 +00:00
Matt	8945c82e3f	Replace wget/SingleFile/Playwright backends with Zimit - Zimit Docker container handles all site types (static, SPA, JS redirects) - Removed: _detect_crawl_mode, _crawl_wget, _crawl_singlefile, preflight logic - Added: _crawl_zimit() with Docker lifecycle management - Simplified pipeline: submit → Zimit crawl → kiwix-manage register → done - No more zimwriterfs step — Zimit produces ZIM directly - Dashboard UI simplified: removed crawl mode dropdown - Config simplified: removed reject patterns, preflight, singlefile sections Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 14:06:23 +00:00
Matt	f0b160ef7c	Extract _full_zim_cleanup helper, add SIGHUP + scrape_jobs cleanup - Extract shared _full_zim_cleanup(source_id) from api_kiwix_remove - Add SIGHUP to kiwix-serve after kiwix-manage remove - Delete linked scrape_jobs rows during ZIM removal - Update api_scraper_delete to do full ZIM cleanup when applicable - Set chromium_path for single-file browser crawl support - Add status.db to .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-19 02:28:49 +00:00
Matt	45c3bb8d56	Add scraper job queue management (delete, clear failed) New API endpoints: DELETE single job, clear all failed/cancelled. Dashboard now shows Delete buttons on completed/failed jobs, Retry+Delete on failed jobs, and a Clear Failed bulk action. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 21:03:39 +00:00
Matt	1ce9a3731f	Add scraper dashboard UI under Kiwix tab New /kiwix/scraper page with submit form (URL, title, language, crawl mode), stats cards, and auto-refreshing jobs table with cancel/retry actions. Kiwix section now has Library/Scraper subnav. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 20:47:17 +00:00
Matt	45b954fccc	Fix ZIM filename collisions by appending job ID Format: {domain}_{lang}_{YYYY-MM}_{job_id}.zim Prevents zimwriterfs failures when the same domain is scraped multiple times in the same month. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 20:17:53 +00:00
Matt	125602fa13	Fix SingleFile CLI: remove invalid --crawl-delay flag SingleFile CLI has no --crawl-delay option. The invalid flag caused the process to print help and exit with no output. Added --crawl-no-parent and --crawl-replace-URLs instead. Removed unused crawl_delay config key. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 19:28:03 +00:00
Matt	da50e5f0b8	Add scraper Phase 2: smart crawl mode detection + browser fallback - Pre-flight detection: wget + Playwright probe to auto-detect if site needs browser rendering (JS apps, parking page redirects) - SingleFile CLI crawl backend for JS-rendered sites - crawl_mode column in scrape_jobs (static/browser/redirect/auto) - API: optional crawl_mode param on submit, cleared on retry - Config: rate_limit_delay 2.0→0.5, /api/ reject pattern, preflight + singlefile config sections - Prerequisites: Node.js 22, single-file-cli, Playwright + Chromium Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 18:26:43 +00:00
Matt	491a4350fc	Merge feature/kiwix: Kiwix ZIM integration - ZIM monitor + kiwix-serve foundation (Phase 1) - Batch article ingestion pipeline (Phase 2) - Dashboard tab, wiki.echo6.co citations - Language filter for non-English articles - Status badge + progress column fixes - Download URL generation fix (/content/ prefix, full ZIM name)	2026-04-18 00:07:00 +00:00
Matt	b250d0c257	Fix Kiwix download URL generation in embedder - Add /content/ prefix to wiki.echo6.co URLs (required by kiwix-serve) - Stop stripping ZIM flavor/date suffix (e.g. _maxi_2025-11) from filename - Use str.removesuffix instead of regex to strip only .zim extension Before: https://wiki.echo6.co/appropedia_en_all/Article After: https://wiki.echo6.co/content/appropedia_en_all_maxi_2025-11/Article Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-18 00:06:52 +00:00
Matt	a40ce47127	Fix progress column to show Qdrant count for completed sources Complete sources now show "19,344 in Qdrant" instead of misleading extraction counts. Each status gets contextual progress display: complete → X in Qdrant, processing → X/Y in Qdrant (%), extracting → X/Y extracted, detected → dash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 15:31:01 +00:00
Matt	fed02186fa	Fix Kiwix status badges to reflect full pipeline state Status was showing COMPLETE after ZIM extraction finished, even when documents were still queued for enrichment/embedding. Now computes effective_status by checking actual pipeline state per-source: - DETECTED: ingest not enabled (gray) - EXTRACTING: ZIM processor running (blue) - PROCESSING: extracted but docs still in enricher/embedder queue (amber) - COMPLETE: all docs fully enriched and embedded in Qdrant (green) Also fixed _build_kiwix_sources pipeline query to filter by category per-source instead of returning global kiwix stats for every source. Progress column now shows "X / Y in Qdrant" when processing, or "X / Y extracted" otherwise. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 15:22:44 +00:00
Matt	6f2a1d206e	Add langdetect language filter to enricher + purge non-English ZIM articles - Install langdetect package for content-level language detection - Add _check_language() to enricher.py: reads first 1500 chars of first page, detects language via langdetect, skips if not in allowed list - Configurable via config.yaml pipeline.language_filter and pipeline.allowed_languages (default: en only) - Catches non-English content from ANY source (PDF, web, ZIM, PeerTube) before burning Gemini API quota on enrichment - Add scan_zims retry logic (3 attempts, 2s delay) for upload handler - Purged 6,483 stale non-English zim_articles rows from DB Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 14:37:13 +00:00
Matt	501004ecf1	Filter non-English articles from ZIM ingestion Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.) before text extraction to avoid wasting Gemini enrichment on translations. Uses path-based regex matching against ISO 639 language codes. ~5,276 non-English articles already ingested from Appropedia (top: es=837, zh=765, ru=475, fr=433, ko=407). Purge decision deferred.	2026-04-17 07:30:30 +00:00
Matt	2635160887	Kiwix integration: ZIM processor, dashboard tab, wiki.echo6.co citations - ZIM processor: extract articles from ZIM files, feed into existing enrichment pipeline - Dashboard: Kiwix tab with library table, ingest toggle, upload, remove - kiwix-serve on port 8430, wiki.echo6.co behind Authentik - Citation URLs point to wiki.echo6.co/{zimname}/{article_path} - Dashboard shows WIKI type badge for ZIM-sourced content - Appropedia EN (19,445 articles) fully ingested as proof of concept	2026-04-17 07:00:24 +00:00
Matt	c60aa5e80d	Phase 2: ZIM processor — batch article ingestion pipeline Adds lib/processors/zim_processor.py which opens a ZIM file via python-libzim, iterates HTML articles, strips to clean text (lxml), and feeds each article into the existing RECON enrichment pipeline. Key features: - HTML to text via lxml (strips nav/footer/script/style) - Filters redirects, non-HTML entries, stubs (<200 chars) - Content hash dedup against existing catalogue - Creates processing dirs with page files and meta.json - Registers articles as "extracted" for automatic enrichment - Checkpointing via zim_sources.last_checkpoint for resume - Configurable batch size and delay for rate control - Standalone CLI: python3 -m lib.processors.zim_processor Tested: 100 Appropedia articles processed in 3s, enricher picks them up automatically via the existing pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 02:03:12 +00:00
Matt	7c1af0f063	Phase 1: Kiwix foundation — ZIM monitor and kiwix-serve setup - Add lib/zim_monitor.py: polls kiwix-serve OPDS v2 catalog, detects new ZIMs, reads accurate article count from python-libzim Counter metadata (not inflated OPDS count), inserts into zim_sources table. Idempotent on re-run, marks removed ZIMs. - DB schema: zim_sources, zim_samples, zim_articles tables (created via sqlite3, not in migrations — matches existing RECON pattern) - kiwix-tools 3.7.0 installed from binary tarball at /opt/recon/bin/ (Ubuntu 24.04 apt ships 3.5.0 which lacks OPDS v2) - kiwix.service systemd unit on port 8430 - python-libzim 3.9.0 installed - Test ZIM: Appropedia EN maxi (496 MB, 19,445 articles) - Add bin/ to .gitignore (binary tarball, not source)	2026-04-16 23:39:34 +00:00
Matt	8d54ff165d	Merge refactor branch: RECON v1.0.0 v1.0.0 This merge integrates the complete refactor effort spanning Phases 0-6k, bringing RECON from its initial baseline into production-grade form. Pipeline architecture --------------------- - Phases 0-2: foundation cleanup, removed dead code, standardized logging - Phase 3: dispatcher rewrite — watches data/acquired/<subfolder>/ for {hash}.txt + {hash}.meta.json pairs, atomic .tmp rename, idempotent - Phase 4: content processors for PDF (PyPDF2 -> pdftotext -> Tesseract -> Gemini Vision fallback chain), transcript, and text formats - Phase 5: enrichment, embedding, and filing daemons split into independently restartable threads PeerTube acquisition -------------------- - Phase 6a-6c: PeerTube channel watcher, caption acquisition with rate limiting (429 handling), 0.5s rate_limit_delay enforced - Phase 6d: multi-instance support - Phase 6e: rewired then reverted dashboard PeerTube endpoint to live in acquisition module Format handling & library cleanup --------------------------------- - Phase 6f: text processor for .txt ingestion - Phase 6f-2: format normalizer in dispatcher - Phase 6g-6j: library reorg — ghost domain cleanup, SCL moved to dedicated domain folder, pi-nas fully decommissioned as a storage target (NFS-only now), ~73 GB reclaimed - Phase 6k: hash-identical dedup — 2,477 duplicate PDFs removed, 22.05 GB freed, catalogue/documents/Qdrant payloads updated coherently, 226 empty domain subdirs pruned - 16,340 transcripts remain un-filed pending title-match review Dashboard & metadata -------------------- - Gemini "null" string bug fixed in pdf_processor metadata voting - Dashboard upload migrated to pipeline with multi-format support State at release ---------------- - 7 daemon threads: dispatcher, enrich, embed, filing, peertube-acq, progress, dashboard - 29,201 documents in catalogue / documents tables (UNIQUE on hash PK) - ~2.1M Qdrant vectors in recon_knowledge_hybrid (cortex:6333) - ~67 GB library on /mnt/library (NFS from pi-nas) - files.echo6.co serving 9,397 deduped PDFs - recon.echo6.co dashboard + API on :8420 See cleanup-log.md for the full backlog and resolution history.	2026-04-16 18:20:25 +00:00
Matt	e6224cb279	Migrate dashboard upload to pipeline with multi-format support Upload handler now writes files to the appropriate hopper subfolder instead of copying directly to /mnt/library/: - .pdf -> acquired/pdf/ - .txt -> acquired/text/ - .epub, .doc, .docx, .mobi -> acquired/pdf/ (dispatcher format normalizer converts to PDF before processing) The dispatcher picks up files and routes through the appropriate processor (pdf_processor or text_processor) for full metadata voting, domain classification, and canonical filing. Changes to api_upload() / _process_upload(): - Relaxed extension check: PDF, TXT, EPUB, DOC, DOCX, MOBI - Routes to correct hopper subfolder by extension - Writes meta.json sidecar with original filename and category hint - Removed: direct library copy, add_to_catalogue, queue_document - Added: hopper-level dedup check (catches rapid re-uploads) - Kept: catalogue dedup check for immediate user feedback Changes to api_upload_status(): - Added fallback: checks acquired/ and processing/ dirs if hash not yet in documents table (covers gap between upload and dispatcher pickup) Template updated: accept attribute and help text now reflect multi-format support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 02:18:45 +00:00
Matt	999cf37626	Fix: Gemini "null" string bug in pdf_processor metadata voting Same fix as text_processor — Gemini sometimes returns the literal string "null" instead of JSON null for empty metadata fields. The voting logic and Gemini extraction now both treat "null" strings as None. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 23:30:59 +00:00
Matt	f4659d155f	Phase 6f-2: format normalizer in dispatcher Adds _normalize_formats() to the dispatcher that converts non-standard document formats to PDF before dispatch. Supports: - .epub, .mobi -> PDF via ebook-convert (Calibre) - .doc, .docx -> PDF via LibreOffice headless Called per-subfolder before _find_pairs() so _find_pairs() only ever sees standard content files. Conversion failures are logged and skipped -- the original file stays in acquired/ for manual review. Also converts 3 staged epub files and cleans up _staging/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 23:08:19 +00:00
Matt	62539861f2	Phase 6f: text processor for .txt file ingestion New processor: lib/processors/text_processor.py Handles plain text files (.txt) as primary source documents. Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight() -> enrich -> embed -> filing worker -> library/Domain/Subdomain/ Metadata extraction via two-source vote: - Source A: filename parsing (title from filename) - Source B: Gemini LLM extraction (title/author/edition/year from first 3 pages of text) Page splitting reuses chunk_text() from lib/web_scraper.py. Filing behavior matches PDFs (files to library, not organized in-place like transcripts). Config: adds text: text_processor to pipeline.dispatch map. New hopper subfolder: data/acquired/text/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 22:39:31 +00:00
Matt	7fe7d03583	Revert "Phase 6e: rewire dashboard PeerTube endpoint to acquisition module" This reverts commit `7e42528d2f`.	2026-04-15 03:20:46 +00:00
Matt	7e42528d2f	Phase 6e: rewire dashboard PeerTube endpoint to acquisition module Replace legacy ingest_channel/ingest_all imports with acquire_batch from lib.acquisition.peertube. The endpoint now writes flat file pairs to the hopper and lets the dispatcher handle processing, matching the Phase 6d architecture. Removes channel/since/process parameters that were tied to the old direct-ingest path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 03:15:41 +00:00
Matt	277110d999	Phase 6d: PeerTube acquisition module + service thread New lib/acquisition/peertube.py replaces the removed peertube_scanner_loop. Polls PeerTube API every 30min, dedupes against catalogue (UUID + title), writes flat file pairs to data/acquired/stream/ for the dispatcher. - acquire_batch(): one-shot find-and-acquire with rate limiting - acquisition_loop(): service thread wrapper (interval from config) - list_new_videos(): dedup via _build_known_sets() against catalogue - acquire_one(): fetch VTT, convert, write .tmp then rename atomically cmd_service(): added peertube-acq daemon thread cmd_ingest_peertube(): rewired to use acquire_batch(), drops --channel/ --since/--enrich/--process (dispatcher handles full pipeline) config.yaml: added peertube.poll_interval: 1800 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 03:08:51 +00:00
Matt	efae4023f6	Phase 6c: remove vestigial extract worker, dead crawler, .bak files recon.py: - Remove extract stage_loop thread from cmd_service(). Confirmed vestigial: 0 queued items, silent logs over 24+ hour run. The new processors do extraction inline in pre_flight(). - Remove cmd_crawl CLI subcommand and its argparse registration. - Clean up associated imports and variables. Deleted: - lib/crawler.py (432 lines) -- old web crawler subsystem, only referenced by the removed CLI subcommand. - 24 .bak files (untracked pre-edit safety backups, originals preserved in git history). Investigation found the four old loop function definitions (scanner_loop, peertube_scanner_loop, crawler_scheduler_loop, organizer_loop) were already deleted in Phase 5c-1. Modules investigated and KEPT: - lib/web_scraper.py -- exports chunk_text() used by transcript_processor - lib/new_pipeline.py -- active Stream B library management CLI tool - lib/peertube_scraper.py -- only mechanism for transcript ingestion - lib/extractor.py -- would activate for new PDFs via cmd_run CLI Service restart verified: 6 threads (dispatcher, enrich, embed, filing, progress, dashboard), no extract worker, zero errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 23:46:00 +00:00
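
Commit b250d0c257 describes the Kiwix download URL fix only in prose, with one before/after pair. The sketch below shows one way that rule could look in code. It is not taken from the repository: the helper name `kiwix_article_url`, the `WIKI_BASE` constant, and the use of `urllib.parse.quote` are assumptions made for illustration. Only the host, the /content/ prefix, and the use of str.removesuffix come from the commit message.

```python
from urllib.parse import quote

# Host taken from the commit message; the constant name is hypothetical.
WIKI_BASE = "https://wiki.echo6.co"


def kiwix_article_url(zim_filename: str, article_path: str) -> str:
    """Hypothetical helper: build a kiwix-serve URL for one article."""
    # Strip only the .zim extension, so the flavor/date suffix
    # (e.g. _maxi_2025-11) stays part of the book name.
    book_name = zim_filename.removesuffix(".zim")
    # Per the commit message, kiwix-serve needs the /content/ prefix.
    # quote() percent-encodes unsafe characters and keeps "/" in paths.
    return f"{WIKI_BASE}/content/{quote(book_name)}/{quote(article_path)}"


print(kiwix_article_url("appropedia_en_all_maxi_2025-11.zim", "Article"))
# -> https://wiki.echo6.co/content/appropedia_en_all_maxi_2025-11/Article
```

str.removesuffix requires Python 3.9 or later. Unlike a regex over the whole filename, it removes the extension only when the name actually ends with ".zim" and leaves everything else untouched.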

60 commits