Commit graph

10 commits

Author SHA1 Message Date
501004ecf1 Filter non-English articles from ZIM ingestion
Skip articles with MediaWiki translation suffixes (/es, /fr, /pl, etc.)
before text extraction to avoid wasting Gemini enrichment on translations.
Uses path-based regex matching against ISO 639 language codes.

~5,276 non-English articles already ingested from Appropedia (top: es=837,
zh=765, ru=475, fr=433, ko=407). Purge decision deferred.
2026-04-17 07:30:30 +00:00
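
The path-based filter described in 501004ecf1 could look roughly like the sketch below. The names `LANG_CODES`, `LANG_SUFFIX_RE`, and `is_translation` are hypothetical, the code set is a small illustrative subset of ISO 639, and the exact regex used in the commit is not shown in the log.

```python
import re

# Illustrative subset only; the real filter would carry the full ISO 639 list.
LANG_CODES = {"es", "fr", "pl", "zh", "ru", "ko", "de", "pt", "it", "ja"}

# A trailing "/xx" or "/xx-variant" path segment, e.g. "/es" or "/zh-hans".
LANG_SUFFIX_RE = re.compile(r"/([a-z]{2,3})(?:-[a-z0-9]+)?$", re.IGNORECASE)


def is_translation(path: str) -> bool:
    """Return True when the article path ends in a known language suffix."""
    match = LANG_SUFFIX_RE.search(path)
    return bool(match) and match.group(1).lower() in LANG_CODES
```

Checking the captured code against a known set, rather than accepting any two-letter suffix, keeps titles such as "AC/DC" from being skipped by mistake.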
c60aa5e80d Phase 2: ZIM processor — batch article ingestion pipeline
Adds lib/processors/zim_processor.py which opens a ZIM file via
python-libzim, iterates HTML articles, strips to clean text (lxml),
and feeds each article into the existing RECON enrichment pipeline.

Key features:
- HTML to text via lxml (strips nav/footer/script/style)
- Filters redirects, non-HTML entries, stubs (<200 chars)
- Content hash dedup against existing catalogue
- Creates processing dirs with page files and meta.json
- Registers articles as "extracted" for automatic enrichment
- Checkpointing via zim_sources.last_checkpoint for resume
- Configurable batch size and delay for rate control
- Standalone CLI: python3 -m lib.processors.zim_processor

Tested: 100 Appropedia articles processed in 3s, enricher picks
them up automatically via the existing pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 02:03:12 +00:00
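
As a rough illustration of the stub filter and content-hash dedup listed in c60aa5e80d, the sketch below uses only the standard library. SHA-256 over whitespace-normalized text is an assumption (the log does not name the hash), and `content_hash` and `select_new_articles` are hypothetical names.

```python
import hashlib

MIN_CHARS = 200  # stub threshold quoted in the commit message


def content_hash(text: str) -> str:
    """Hash whitespace-normalized text (algorithm choice is an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_new_articles(articles, known_hashes):
    """Drop stubs and duplicates from an iterable of (path, text) pairs.

    Returns a list of (path, digest) for articles worth ingesting.
    """
    seen = set(known_hashes)
    accepted = []
    for path, text in articles:
        if len(text) < MIN_CHARS:
            continue  # stub: too short to be worth enriching
        digest = content_hash(text)
        if digest in seen:
            continue  # already in the catalogue or earlier in this batch
        seen.add(digest)
        accepted.append((path, digest))
    return accepted
```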
999cf37626 Fix: Gemini "null" string bug in pdf_processor metadata voting
Same fix as text_processor — Gemini sometimes returns the literal
string "null" instead of JSON null for empty metadata fields. The
voting logic and Gemini extraction now both treat "null" strings
as None.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:30:59 +00:00
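
A minimal sketch of the normalization described in 999cf37626, assuming the model response is a flat JSON object. `normalize_null` and `clean_metadata` are hypothetical names, and the case-insensitive, whitespace-tolerant match is an assumption beyond the literal "null" mentioned in the commit.

```python
import json


def normalize_null(value):
    """Map the literal string "null" to None; leave other values untouched."""
    if isinstance(value, str) and value.strip().lower() == "null":
        return None
    return value


def clean_metadata(raw_json: str) -> dict:
    """Parse a flat JSON object and normalize stringified nulls."""
    data = json.loads(raw_json)
    return {key: normalize_null(val) for key, val in data.items()}
```

Applying the same helper both at extraction time and inside the voting step means a stray "null" cannot win a vote as if it were a real title or author.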
62539861f2 Phase 6f: text processor for .txt file ingestion
New processor: lib/processors/text_processor.py
Handles plain text files (.txt) as primary source documents.

Pipeline: acquired/text/ -> dispatcher -> text_processor.pre_flight()
-> enrich -> embed -> filing worker -> library/Domain/Subdomain/

Metadata extraction via two-source vote:
- Source A: filename parsing (title from filename)
- Source B: Gemini LLM extraction (title/author/edition/year from
  first 3 pages of text)

Page splitting reuses chunk_text() from lib/web_scraper.py.
Filing behavior matches PDFs (files to library, not organized
in-place like transcripts).

Config: adds text: text_processor to pipeline.dispatch map.
New hopper subfolder: data/acquired/text/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 22:39:31 +00:00
df29d598d3 Phase 6a: mark transcripts organized in place, skip filing
Transcripts are derived text from PeerTube videos, not primary source
files. They do not belong in library/Domain/Subdomain/ like PDFs.

Change: transcript_processor.pre_flight() now sets organized_at =
CURRENT_TIMESTAMP at the end of successful processing, marking the
transcript as organized in place. The watch URL remains in
catalogue.path and Qdrant download_url so users clicking search
results go to the PeerTube video.

The filing worker's path LIKE filter naturally excludes transcripts
since their documents.path is the watch URL, not a filesystem path.
No filing worker changes needed.

Back-fills 2,260 drain items from Phase 5c-2 via one-time SQL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 22:49:21 +00:00
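
The interaction described in df29d598d3 can be illustrated with an in-memory SQLite table. This is a sketch under assumptions: the log does not say which database engine is used, the real LIKE pattern is not shown (`'/%'` stands in for "looks like a filesystem path"), and the sample paths and URL are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (hash TEXT PRIMARY KEY, path TEXT, organized_at TEXT)"
)
conn.executemany(
    "INSERT INTO documents (hash, path) VALUES (?, ?)",
    [
        ("aaa", "/data/acquired/pdf/manual.pdf"),      # primary source file
        ("bbb", "https://peertube.example/w/abc123"),  # transcript watch URL
    ],
)

# What pre_flight() is described as doing for a transcript: mark it in place.
conn.execute(
    "UPDATE documents SET organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
    ("bbb",),
)

# Hypothetical filing-worker query: unorganized rows with a filesystem path.
pending = [
    row[0]
    for row in conn.execute(
        "SELECT hash FROM documents "
        "WHERE organized_at IS NULL AND path LIKE '/%' ORDER BY hash"
    )
]
```

Here `pending` contains only the PDF row; the transcript is excluded twice over, once by its timestamp and once by its URL-shaped path.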
9fa60f9c86 Fix: stale cleanup in processors must fail loudly on permission errors
Phase 5c-2 failed because shutil.rmtree(ignore_errors=True) silently
failed to clean up root-owned legacy files in processing/{hash}/,
letting the processor proceed into a half-cleaned directory and then
crash on subsequent file writes.

Changes: removed ignore_errors=True, wrapped in try/except that logs
and re-raises, so the processor fails early and visibly if stale
cleanup fails.

Recovery from Phase 5c-2 failure.
2026-04-14 20:15:48 +00:00
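
The pattern described in 9fa60f9c86 might be sketched as follows. `clean_stale_dir` and the logger name are hypothetical; only `shutil.rmtree` without `ignore_errors=True`, wrapped in a try/except that logs and re-raises, comes from the commit message.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("processor")


def clean_stale_dir(path: Path) -> None:
    """Remove a stale processing directory, failing loudly on any error."""
    if not path.exists():
        return
    try:
        shutil.rmtree(path)  # no ignore_errors=True: errors must surface
    except OSError:
        # PermissionError is a subclass of OSError, so root-owned leftovers
        # land here; log with traceback, then stop the processor.
        log.exception("stale cleanup failed for %s", path)
        raise
```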
96e1e642c4 Phase 4: PDF processor with layered metadata extraction
- Add lib/processors/pdf_processor.py with full pre_flight pipeline
- Layered metadata: Source A (PDF dict), Source B (filename), Source C (Gemini)
- Field-by-field voting with provenance tracking (metadata_provenance column)
- Level-4 strict dedupe (title+author+edition+year)
- Content failures route to _review/rejected_pdfs/
- Level-4 duplicates route to _review/duplicate_quarantine/
- Full text extraction using existing extract_text_from_page fallback chain
- Schema: added metadata_provenance TEXT to documents table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 16:57:44 +00:00
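
One way to picture the field-by-field voting with provenance from 96e1e642c4 is the sketch below. The actual voting rule is not stated in the log; majority agreement with a fixed tie-break order is an assumption, as are the names `vote`, `PRIORITY`, and the source labels.

```python
from collections import Counter

FIELDS = ("title", "author", "edition", "year")
PRIORITY = ("gemini", "pdf_dict", "filename")  # hypothetical tie-break order


def _norm(value):
    return str(value).strip().lower()


def vote(sources):
    """Pick one value per field and record which sources agreed with it.

    sources maps a source name to a dict of extracted fields.
    Returns (merged, provenance).
    """
    merged, provenance = {}, {}
    for field in FIELDS:
        candidates = {
            name: src[field]
            for name, src in sources.items()
            if src.get(field) not in (None, "")
        }
        if not candidates:
            merged[field], provenance[field] = None, []
            continue
        counts = Counter(_norm(v) for v in candidates.values())
        order = [n for n in PRIORITY if n in candidates]
        order += [n for n in candidates if n not in PRIORITY]
        # max() keeps the first maximal item, so ties follow PRIORITY order.
        winner = max(order, key=lambda n: counts[_norm(candidates[n])])
        merged[field] = candidates[winner]
        provenance[field] = sorted(
            n for n, v in candidates.items()
            if _norm(v) == _norm(candidates[winner])
        )
    return merged, provenance
```

The provenance mapping could then be serialized (for example as JSON) into the `metadata_provenance` TEXT column; the storage format itself is not described in the log.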
9fe6a0a782 Phase 4: Phase 3 cleanup fixes
Fix 1.1: filing preserves source file extension instead of defaulting to .pdf
Fix 1.2: back-fixed soldering transcript from .pdf to .txt (hash 380dbc78)
Fix 1.3: dispatcher logs missing processor modules at DEBUG, not ERROR
Fix 1.4: transcript processor cleans stale processing/concepts dirs on entry
Also: dispatcher now handles solo content files without .meta.json sidecar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 16:39:57 +00:00
f69c04a0e3 Phase 3: fix page_count in transcript processor
Set page_count on documents row during pre_flight. Without this,
enricher comparison `page_count >= 3` fails with TypeError on NULL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:43:21 +00:00
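
The failure mode fixed in f69c04a0e3 is easy to reproduce in isolation: a NULL column arrives in Python as `None`, and `None >= 3` raises. The row dict and `enricher_check` below are hypothetical stand-ins for the real enricher code.

```python
row = {"hash": "abc", "page_count": None}  # page_count read from a NULL column


def enricher_check(row):
    """Stand-in for the enricher's `page_count >= 3` comparison."""
    return row["page_count"] >= 3


error = None
try:
    enricher_check(row)
except TypeError as exc:
    error = str(exc)  # comparison between NoneType and int is not supported

# The fix: pre_flight records the page count when it splits the transcript.
pages = ["page one", "page two", "page three", "page four"]
row["page_count"] = len(pages)
ok = enricher_check(row)
```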
66fadb7487 Phase 3: dispatcher, transcript processor, text_dir resolution
- lib/dispatcher.py: one-shot dispatcher that scans acquired/<type>/
  for content+sidecar pairs and routes to registered processors
- lib/processors/transcript_processor.py: pre_flight() for transcripts
  (hash, dedupe, split into pages, register in DB, set text_dir)
- lib/utils.py: resolve_text_dir() helper for text_dir column fallback
- lib/enricher.py: use resolve_text_dir() instead of hardcoded path
- lib/embedder.py: use resolve_text_dir() instead of hardcoded path
- lib/processors/__init__.py, lib/acquisition/__init__.py: package inits

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 15:39:42 +00:00
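
The one-shot dispatch loop from 66fadb7487 might look roughly like this. The sidecar naming (`<content filename>.meta.json`), the handler signature, and the names `find_pairs` and `dispatch` are assumptions; the log only says the dispatcher scans acquired/<type>/ for content+sidecar pairs and routes them to registered processors.

```python
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # assumed naming: "<content name>.meta.json"


def find_pairs(type_dir: Path):
    """Return (content, sidecar) pairs; content without a sidecar is skipped."""
    pairs = []
    for content in sorted(type_dir.iterdir()):
        if not content.is_file() or content.name.endswith(SIDECAR_SUFFIX):
            continue
        sidecar = content.with_name(content.name + SIDECAR_SUFFIX)
        if sidecar.exists():
            pairs.append((content, sidecar))
    return pairs


def dispatch(acquired: Path, processors: dict):
    """Route each pair under acquired/<type>/ to the processor for <type>."""
    results = []
    for type_dir in sorted(p for p in acquired.iterdir() if p.is_dir()):
        handler = processors.get(type_dir.name)
        if handler is None:
            continue  # no processor registered for this content type
        for content, sidecar in find_pairs(type_dir):
            results.append(handler(content, sidecar))
    return results
```

A call such as `dispatch(Path("data/acquired"), {"transcript": handler})` would hand each transcript and its sidecar to `handler`, where `handler` is any callable taking the two paths.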