matt/refactored-recon

mirror of https://github.com/zvx-echo6/refactored-recon.git synced 2026-05-20 06:34:34 +02:00

Matt 3b5c24c7e7 checkpoint: pre-audit working tree state — 4 untracked design docs

2026-04-27 02:08:28 +00:00

RECON × Kiwix — ZIM Integration Design (v2)

Status: Draft v2 (corrected post-stress-test)
Date: 2026-04-16
Depends on: RECON v1.0.0 (master, CT 130)

1. Goal

Integrate Kiwix into RECON as a first-class knowledge source. Users manage ZIM files through the RECON dashboard — uploading directly or pulling from the Kiwix catalog. Kiwix-serve provides a browsable web interface for all loaded ZIMs. RECON detects new ZIMs and runs tiered ingestion: classifying the source, extracting articles via python-libzim, generating metadata, and embedding into Qdrant for Aurora semantic search.

Future portable deployment: The full system is built on CT 130/cortex. A stripped-down offline copy (dense-only, reduced dimensions, int8 quantization) will later be packaged for a Raspberry Pi 5 8GB with 512GB NVMe. This is a future packaging exercise — design decisions should not block it, but we build for full capability first.

2. Where Everything Lives

All on CT 130 (192.168.1.130), running as zvx.

Component	Location
kiwix-serve	Installed via Kiwix PPA (`ppa:kiwixteam/release`) or Docker (`ghcr.io/kiwix/kiwix-serve:3.8.2`). NOT `apt install kiwix-tools` — Ubuntu 24.04 ships 3.5.0 (2023), missing required OPDS v2 endpoints. Need ≥3.7.0.
kiwix-manage	Same package as kiwix-serve. Note: may be deprecated in favor of directory-serving in future libkiwix versions. Design for both.
ZIM file storage	`/mnt/kiwix/` (its own bind mount, backed by `/mnt/data/kiwix/` on the data host; same SSD as the library but NOT inside it, because the library holds curated, human-browsable PDFs)
Kiwix library XML	`/mnt/kiwix/library.xml` (auto-managed, treat as legacy intermediate)
ZIM tracking DB	`recon.db` (new tables, see §6)
Ingestion code	`/opt/recon/lib/zim_pipeline.py` (new module)
Dashboard UI	`recon.echo6.co` (new Kiwix management page)

3. Service Architecture

3.1 kiwix-serve (browsing + OPDS catalog)

Runs as a companion systemd unit. Provides:

Web browsing UI for all loaded ZIMs (Wikipedia, Stack Exchange, etc. — fully browsable with images if using _maxi variants)
OPDS v2 catalog at /catalog/v2/entries for RECON to enumerate loaded ZIMs
Keyword search via /search endpoint (HTML/XML only — no JSON support exists)

kiwix-serve --library /mnt/kiwix/library.xml \
            --port 8430 \
            --address 0.0.0.0 \
            --threads 4 \
            --nodatealias \
            --blockexternal

Bound to 0.0.0.0 — browsable from the local network (kiwix.echo6.co or similar)
Port 8430 (next to RECON dashboard on 8420)
--blockexternal prevents outbound link navigation from served content
NO --monitorLibrary — documented bugs (zombie processes, high CPU). Use SIGHUP on demand: when RECON adds a new ZIM, it sends kill -HUP <pid> to trigger library reload.

3.2 python-libzim (article extraction)

kiwix-serve is for browsing only. All RAG article extraction uses python-libzim (PyPI package libzim v3.9.0) directly. This is faster, gives full control over filtering, and avoids the no-JSON problem with kiwix-serve's search API.

Key API patterns:

from libzim.reader import Archive

zim = Archive("path/to/file.zim")

# Reliable article count (NOT zim.article_count, which is inflated 2-3×):
counter_meta = zim.get_metadata("Counter").decode()
# Returns: "text/html=6467891;image/webp=3211054;text/css=23"
# parse_counter is a local helper (not part of libzim) that turns this string into a dict
html_count = parse_counter(counter_meta)["text/html"]

# ZIM metadata:
title = zim.get_metadata("Title").decode()
description = zim.get_metadata("Description").decode()
language = zim.get_metadata("Language").decode()

# Article iteration (_get_entry_by_id is a private method and may change between releases):
for i in range(zim.entry_count):
    entry = zim._get_entry_by_id(i)
    if entry.is_redirect:
        continue
    item = entry.get_item()
    if item.mimetype != "text/html":  # mimetype is on Item, not Entry
        continue
    content = bytes(item.content)
    # process content...

Important: Modern ZIMs (2022+) use the "new namespace scheme": flat paths, with no A/, I/, or M/ prefixes. Do not use iter_by_namespace('A').
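
The `parse_counter` call in the extraction snippet above is not a python-libzim function. A minimal sketch of such a helper follows, assuming the Counter string holds only `mimetype=count` pairs separated by semicolons; mimetypes that carry their own `;` parameters would need stricter parsing.

```python
def parse_counter(counter: str) -> dict[str, int]:
    """Parse a ZIM Counter metadata string into {mimetype: count}."""
    counts: dict[str, int] = {}
    for pair in counter.split(";"):
        # rpartition splits on the last "=", so the count is always the tail.
        mimetype, sep, value = pair.strip().rpartition("=")
        if not sep or not value.isdigit():
            continue  # skip empty or malformed segments
        counts[mimetype] = counts.get(mimetype, 0) + int(value)
    return counts


counts = parse_counter("text/html=6467891;image/webp=3211054;text/css=23")
print(counts["text/html"])
```

Malformed segments are skipped rather than raised so that one odd mimetype does not block detection of an otherwise valid ZIM.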

3.3 RECON ZIM Monitor Daemon

New thread in recon.service (daemon #8). Polls kiwix-serve's OPDS catalog (/catalog/v2/entries) every 60 seconds, compares against zim_sources table in recon.db, and queues new ZIMs for ingestion. Also detects removed ZIMs.

Do not trust OPDS articleCount — it's inflated 2-3× for large ZIMs. Use python-libzim's Counter metadata for accurate counts after detection.
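
The poll-and-compare step can be sketched with the standard library alone. This is an illustration, not the daemon itself: it assumes each OPDS entry carries a standard Atom `<id>` that can be matched against `zim_sources.zim_uuid`, and the sample feed is invented for the example. The names `parse_opds_entries` and `diff_catalog` are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_opds_entries(feed_xml: str) -> dict[str, str]:
    """Return {entry id: title} for every Atom <entry> in an OPDS feed."""
    root = ET.fromstring(feed_xml)
    entries: dict[str, str] = {}
    for entry in root.iter(f"{ATOM}entry"):
        entry_id = entry.findtext(f"{ATOM}id", default="").strip()
        title = entry.findtext(f"{ATOM}title", default="").strip()
        if entry_id:
            entries[entry_id] = title
    return entries


def diff_catalog(served: dict[str, str], known_ids: set[str]) -> tuple[list[str], list[str]]:
    """Compare served entries with the IDs already tracked in zim_sources."""
    added = sorted(set(served) - known_ids)
    removed = sorted(known_ids - set(served))
    return added, removed


# Invented sample feed; a real feed carries many more fields per entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:uuid:aaa</id><title>Appropedia</title></entry>
  <entry><id>urn:uuid:bbb</id><title>iFixit</title></entry>
</feed>"""

served = parse_opds_entries(SAMPLE_FEED)
added, removed = diff_catalog(served, known_ids={"urn:uuid:bbb", "urn:uuid:ccc"})
print("queue for ingestion:", added)
print("mark as removed:", removed)
```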

4. ZIM Acquisition (How ZIMs Get In)

4.1 Direct Upload

User uploads a .zim file through the RECON dashboard. Dashboard saves to /mnt/kiwix/, runs kiwix-manage library.xml add <file.zim>, sends SIGHUP to kiwix-serve. ZIM monitor detects on next poll.
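
A rough sketch of the post-upload step, assuming the dashboard process is allowed to run both commands. `build_register_commands` and `register_zim` are hypothetical names; the reload goes through the `kiwix` unit from §8, whose reload action sends SIGHUP.

```python
import subprocess
from pathlib import Path

LIBRARY_XML = Path("/mnt/kiwix/library.xml")


def build_register_commands(zim_path: Path, library: Path = LIBRARY_XML) -> list[list[str]]:
    """Commands that add a ZIM to the library and ask kiwix-serve to reload."""
    if zim_path.suffix != ".zim":
        raise ValueError(f"not a ZIM file: {zim_path}")
    return [
        ["kiwix-manage", str(library), "add", str(zim_path)],
        ["systemctl", "reload", "kiwix"],  # unit name taken from section 8
    ]


def register_zim(zim_path: Path, run=subprocess.run) -> None:
    """Run the commands in order; check=True raises on a non-zero exit."""
    for cmd in build_register_commands(zim_path):
        run(cmd, check=True)
```

Keeping command construction separate from execution lets the dashboard log or dry-run the exact commands before touching the live library.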

4.2 Catalog Pull

RECON dashboard exposes a "Browse Kiwix Catalog" page. Queries public OPDS catalog at https://library.kiwix.org/catalog/v2/entries with filters (lang, category, search). Response is Atom XML only — no JSON. User picks a ZIM, confirms, RECON downloads via torrent (preferred for large files) or direct HTTP to /mnt/kiwix/.
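
Building the catalog query could look like the sketch below. The `lang` and `category` filters come from the description above; the `q` (search) and `count` parameter names are assumptions that should be checked against the catalog before relying on them. `catalog_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

CATALOG_URL = "https://library.kiwix.org/catalog/v2/entries"


def catalog_query_url(lang=None, category=None, search=None, count=50) -> str:
    """Build an OPDS query URL, leaving out any filter that is not set."""
    params = {"lang": lang, "category": category, "q": search, "count": count}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{CATALOG_URL}?{query}"


print(catalog_query_url(lang="eng", category="wikivoyage"))
```

The response still has to be parsed as Atom XML, since the catalog offers no JSON.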

4.3 ZIM Variant Strategy

Variant	Content	Use Case
`_maxi`	Full text + all images	Browsing via kiwix-serve. Plan for this as default.
`_nopic`	Full text, no images	RAG-only (if disk constrained). ~50% smaller.
`_mini`	Intro + infobox only	Not useful for RAG.

Start small. Prove the pipeline with Wikivoyage, iFixit, or a focused Stack Exchange site before tackling Wikipedia. Add further ZIMs one at a time and monitor the storage and RAM budget as you go.

5. Tiered Ingestion Pipeline

When the ZIM monitor detects a new ZIM, it:

Reads ZIM metadata via python-libzim (Counter, Title, Description, Language, Tags, Name)
Classifies the source based on ZIM name/tag patterns
Routes to the appropriate tier
Samples first if the source is unknown (see §5.4)

5.1 Tier 1 — Known Large Sources (Deterministic Extractors)

Trigger: Known source name patterns, any article count
Enrichment: None — structural metadata from HTML
Cost: Zero

ZIM Pattern	Domain/Subdomain	Metadata Source
`wikipedia_*`	Reference/Wikipedia	Title, categories, infobox fields, section headings
`wiktionary_*`	Reference/Wiktionary	Word, part of speech, definitions
`wikibooks_*`	Reference/Wikibooks	Book title, chapter, subject
`wikisource_*`	Reference/Wikisource	Work title, author, year
`wikiversity_*`	Reference/Wikiversity	Course, subject area
`wikivoyage_*`	Reference/Wikivoyage	Destination, region, travel topic
`stack_exchange_*`	Reference/StackExchange	Tags, vote score, accepted answer flag
`devdocs_*`	Reference/DevDocs	Language, framework, API
`ifixit_*`	Maintenance/Repair	Device, category, difficulty

5.2 Tier 2 — Unknown Mid/Large Sources (Local Qwen3 Enrichment)

Trigger: >10K articles AND no Tier 1 extractor match
Enrichment: Local Ollama Qwen3 8B (aurora model)
Cost: Zero (cortex compute time only)

Lightweight prompt per article → JSON with domain, subdomain, summary, keywords.

5.3 Tier 3 — Small Unknown Sources (Gemini Enrichment)

Trigger: ≤10K articles AND no Tier 1 extractor match
Enrichment: Gemini API (existing RECON enrichment pipeline)
Cost: Low (small article count, few dollars max)

5.4 The Gate — Unknown Source Review

When a ZIM doesn't match any Tier 1 pattern AND exceeds the Tier 3 threshold (more than 10K articles):

Extract random sample of 50 articles
Log to review queue in recon.db
Dashboard notification: "New ZIM: obscure_wiki.zim — 47,832 articles — no known extractor — sample ready"
Ingestion paused until user reviews and approves
User can: approve for Tier 2, assign domain/subdomain, reject, or write a new extractor
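
The routing rules from §5.1 to §5.4 can be sketched as follows. This is one possible reading, not the final classifier: matching is done on the ZIM name with shell-style patterns, and `route_zim` and `pick_sample` are hypothetical names.

```python
import random
from fnmatch import fnmatchcase

TIER1_PATTERNS = [
    "wikipedia_*", "wiktionary_*", "wikibooks_*", "wikisource_*",
    "wikiversity_*", "wikivoyage_*", "stack_exchange_*", "devdocs_*", "ifixit_*",
]
TIER3_MAX_ARTICLES = 10_000
SAMPLE_SIZE = 50


def route_zim(zim_name: str, article_count: int) -> str:
    """Return 'tier1', 'tier3', or 'review' (the gate in front of Tier 2)."""
    if any(fnmatchcase(zim_name, pattern) for pattern in TIER1_PATTERNS):
        return "tier1"  # known source, any article count
    if article_count <= TIER3_MAX_ARTICLES:
        return "tier3"  # small unknown source
    return "review"     # large unknown source: pause until a user decides


def pick_sample(article_paths: list[str], seed=None) -> list[str]:
    """Random sample of up to 50 article paths for the review queue."""
    rng = random.Random(seed)
    return rng.sample(article_paths, min(SAMPLE_SIZE, len(article_paths)))


print(route_zim("wikipedia_en_all_nopic", 5_000_000))
print(route_zim("obscure_wiki", 47_832))
print(route_zim("small_handbook", 1_200))
```

Tier 2 never appears as a direct result here because, under this reading, a large unknown source only reaches Qwen3 enrichment after a user approves it at the gate.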

6. Database Schema

New tables in /opt/recon/data/recon.db:

zim_sources

CREATE TABLE zim_sources (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_filename    TEXT NOT NULL UNIQUE,
    zim_path        TEXT NOT NULL,
    zim_uuid        TEXT,
    title           TEXT,
    description     TEXT,
    language        TEXT,
    category        TEXT,
    article_count   INTEGER DEFAULT 0,      -- from Counter metadata, NOT OPDS
    ingestion_tier  INTEGER,
    status          TEXT DEFAULT 'detected', -- detected|sampling|review|ingesting|complete|error|rejected
    processed_count INTEGER DEFAULT 0,
    skipped_count   INTEGER DEFAULT 0,
    error_count     INTEGER DEFAULT 0,
    domain          TEXT,
    subdomain       TEXT,
    detected_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    started_at      TIMESTAMP,
    completed_at    TIMESTAMP,
    last_checkpoint TEXT
);

zim_samples

CREATE TABLE zim_samples (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id   INTEGER REFERENCES zim_sources(id),
    article_path    TEXT NOT NULL,
    article_title   TEXT,
    text_preview    TEXT,
    metadata_json   TEXT,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

zim_articles

CREATE TABLE zim_articles (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id   INTEGER REFERENCES zim_sources(id),
    article_path    TEXT NOT NULL,
    article_title   TEXT,
    qdrant_point_ids TEXT,
    status          TEXT DEFAULT 'pending', -- pending|embedded|skipped|error
    processed_at    TIMESTAMP,
    UNIQUE(zim_source_id, article_path)
);
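
A short usage sketch against an in-memory database, using an abbreviated copy of `zim_sources` (only the columns the example touches). Two points worth noting: SQLite enforces the `REFERENCES` clauses only when `PRAGMA foreign_keys` is switched on per connection, and the status values live in a comment, so the application has to validate them. `record_detected` and `set_status` are hypothetical helpers.

```python
import sqlite3

VALID_STATUSES = {"detected", "sampling", "review", "ingesting", "complete", "error", "rejected"}

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute(
    """CREATE TABLE zim_sources (
           id            INTEGER PRIMARY KEY AUTOINCREMENT,
           zim_filename  TEXT NOT NULL UNIQUE,
           zim_path      TEXT NOT NULL,
           article_count INTEGER DEFAULT 0,
           status        TEXT DEFAULT 'detected'
       )"""
)


def record_detected(conn, filename: str, path: str, article_count: int) -> int:
    """Insert a newly detected ZIM once; repeated polls return the same id."""
    conn.execute(
        "INSERT OR IGNORE INTO zim_sources (zim_filename, zim_path, article_count) VALUES (?, ?, ?)",
        (filename, path, article_count),
    )
    row = conn.execute("SELECT id FROM zim_sources WHERE zim_filename = ?", (filename,)).fetchone()
    return row[0]


def set_status(conn, source_id: int, status: str) -> None:
    """Update the status, rejecting values outside the documented set."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    conn.execute("UPDATE zim_sources SET status = ? WHERE id = ?", (status, source_id))


first = record_detected(conn, "appropedia_en.zim", "/mnt/kiwix/appropedia_en.zim", 30_000)
again = record_detected(conn, "appropedia_en.zim", "/mnt/kiwix/appropedia_en.zim", 30_000)
set_status(conn, first, "ingesting")
```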

7. Article Processing Pipeline

ZIM Entry
  ├─ Is redirect? ────────────────────► SKIP
  ├─ mimetype != text/html? ──────────► SKIP
  ├─ Clean text < 200 chars? ─────────► SKIP (stub)
  ▼
HTML → Clean Text (lxml, not BeautifulSoup — 10× faster)
  ▼
Metadata Extraction (per tier)
  ▼
Chunking (~512 tokens, ~50 token overlap)
  ▼
Embedding (bge-m3 dense + sparse via cortex:8090/8091)
  ▼
Qdrant Upsert → recon_knowledge_hybrid
  Payload: source_type, zim_file, zim_source_id, article_title,
           article_path, domain, subdomain, chunk_index, 
           total_chunks, keywords, language
  ▼
Checkpoint (update zim_articles + zim_sources.processed_count)
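
The stub filter and the overlapping chunker are the two steps that need no external service, so they are easy to sketch. The example below splits on whitespace as a stand-in for the real tokenizer, which is an assumption; RECON's existing chunking logic would replace it. `is_stub` and `chunk_tokens` are hypothetical names.

```python
MIN_CLEAN_CHARS = 200


def is_stub(clean_text: str) -> bool:
    """True when the cleaned article is too short to be worth embedding."""
    return len(clean_text.strip()) < MIN_CLEAN_CHARS


def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the article
    return chunks


tokens = [f"w{i}" for i in range(1000)]  # whitespace-style tokens as a stand-in
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the default settings each window starts 462 tokens after the previous one, so the last 50 tokens of a chunk reappear at the start of the next.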

Batching & Backpressure

Batch size: 100 articles (configurable)
Sleep between batches: 1s (configurable)
ZIM ingestion runs at lower priority than real-time PDF/stream processing
Progress logging every 1000 articles
Resumable via last_checkpoint
Sparse upserts are slower — known issue where on-disk sparse indexing progressively degrades. Budget for this.
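
A minimal sketch of the batch loop with pause and resume. It assumes entry indexes are processed in ascending order and that `last_checkpoint` stores the last index that was fully handled; `process_batch` and `save_checkpoint` are hypothetical callbacks standing in for the embedding and database steps.

```python
import time
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[int], size: int) -> Iterator[list[int]]:
    """Yield lists of up to `size` items."""
    batch: list[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(
    entry_ids: Iterable[int],
    process_batch: Callable[[list[int]], None],
    save_checkpoint: Callable[[int], None],
    start_after: int = -1,
    batch_size: int = 100,
    pause_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process entries after `start_after` in batches; return how many ran."""
    pending = (i for i in entry_ids if i > start_after)
    done = 0
    for batch in batched(pending, batch_size):
        process_batch(batch)
        save_checkpoint(batch[-1])  # checkpoint only after the batch succeeded
        done += len(batch)
        sleep(pause_s)  # backpressure: yield to higher-priority work
    return done
```

Writing the checkpoint after each batch means a crash repeats at most one batch, which is safe as long as the Qdrant upserts are idempotent.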

Realistic Scale Estimates

ZIM	Articles	Est. Chunks	Embedding Time (RTX 3090)
Appropedia EN	~30K	~60K	~10 min
iFixit EN	~90K	~180K	~25 min
Stack Overflow	~500K	~1.5M	~3.5 hr
Wikipedia EN nopic	~5M (non-stub)	~10M	~24 hr
Wikipedia EN maxi	Same text, +images	~10M	~24 hr (same — images not embedded)
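
As a sanity check, the table's own figures imply a fairly steady embedding rate. The snippet below only redoes that arithmetic from the chunk counts and times listed above; it measures nothing.

```python
# (estimated chunks, estimated seconds), copied from the table above
ESTIMATES = {
    "Appropedia EN": (60_000, 10 * 60),
    "iFixit EN": (180_000, 25 * 60),
    "Stack Overflow": (1_500_000, 3.5 * 3600),
    "Wikipedia EN nopic": (10_000_000, 24 * 3600),
}


def chunks_per_second(chunks: int, seconds: float) -> float:
    return chunks / seconds


rates = {name: chunks_per_second(c, s) for name, (c, s) in ESTIMATES.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} chunks/s")
```

All four rows work out to roughly 100 to 120 chunks per second, so the estimates are consistent with one another.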

8. Systemd Integration

# /etc/systemd/system/kiwix.service
[Unit]
Description=Kiwix-serve for RECON
After=network.target
PartOf=recon.service

[Service]
User=zvx
ExecStart=/usr/bin/kiwix-serve \
    --library /mnt/kiwix/library.xml \
    --port 8430 \
    --address 0.0.0.0 \
    --threads 4 \
    --nodatealias \
    --blockexternal
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Reload after adding ZIM: systemctl reload kiwix

9. Pre-Implementation Checks

Before writing any code:

Verify cortex:8091 sparse embeddings — If it's running TEI or Infinity, sparse vectors may be silently broken (dense-only output). Only native FlagEmbedding supports bge-m3 sparse. This affects the entire existing RECON pipeline, not just Kiwix.
Check Qdrant upgrade path — If Qdrant needs upgrading, must go stepwise: 1.14 → 1.15 → 1.16. Direct jumps corrupt data.
Check cortex RAM — Determines whether 10-30M additional vectors need quantization config changes.
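
For the Qdrant check, the stepwise rule can be expressed as a small helper that lists the intermediate releases to install in order. This is a sketch that assumes "stepwise" means one minor version at a time within the same major version; `upgrade_steps` is a hypothetical name.

```python
def upgrade_steps(current: str, target: str) -> list[str]:
    """List each minor version to install, in order, to reach the target."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major version changes are outside this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not supported")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_steps("1.14", "1.16"))
```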

10. Implementation Phases

Phase 1 — Foundation

Install kiwix-tools via PPA on CT 130
Create /mnt/kiwix/ directory (owned by zvx)
Set up kiwix.service systemd unit
Create DB schema
Download a small test ZIM (Appropedia EN — ~495MB maxi)
Register via kiwix-manage, verify browsable at port 8430
Implement ZIM monitor daemon (OPDS poll → zim_sources)

Phase 2 — Extraction & Embedding Pipeline

Implement python-libzim article extraction with lxml
Implement article filtering (redirects, stubs, non-HTML)
Implement chunking (reuse existing RECON logic)
Implement embedding + Qdrant upsert with ZIM payload schema
Implement checkpointing and resume
Implement Tier 1 Wikipedia extractor as proof of concept
End-to-end test with Appropedia

Phase 3 — Tiered Enrichment & Gate

Tier 2 local Qwen3 enrichment path
Tier 3 Gemini enrichment routing
Source classification and routing logic
Review gate (sampling, dashboard notification)
Test with an unknown ZIM

Phase 4 — Dashboard UI

Kiwix library page (loaded ZIMs, status, progress)
Upload ZIM form
Catalog browser (OPDS query + download)
Review queue
Stats integration

Phase 5 — Scale Up

Additional Tier 1 extractors (Stack Exchange, DevDocs, iFixit)
Wikipedia EN (start with nopic, upgrade to maxi when confident)
ZIM version management (replace old vectors when new ZIM version arrives)
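
Replacing old vectors relies on the `zim_source_id` payload field from §7. A sketch of the delete-by-filter request body for Qdrant's REST API is shown below; `purge_request` is a hypothetical helper that only builds the request, leaving the HTTP call and its error handling to the caller.

```python
import json


def purge_request(collection: str, zim_source_id: int) -> tuple[str, dict]:
    """Path and JSON body for deleting every point that belongs to one ZIM."""
    path = f"/collections/{collection}/points/delete"
    body = {
        "filter": {
            "must": [
                {"key": "zim_source_id", "match": {"value": zim_source_id}}
            ]
        }
    }
    return path, body


path, body = purge_request("recon_knowledge_hybrid", 42)
print("POST", path)
print(json.dumps(body, indent=2))
```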

11. Open Questions

ZIM version dedup — When wikipedia_en_2026-06.zim replaces wikipedia_en_2026-03.zim, purge old vectors by zim_source_id filter + delete, then re-embed. Atomic cutover or incremental?
Qdrant collection strategy — Same recon_knowledge_hybrid collection, or separate recon_kiwix_hybrid? Same collection means unified search. Separate means independent scaling but requires query fanout.
Pi deployment packaging — Future exercise. Dense-only, Matryoshka 256-dim, int8 quant. ~3.8GB RAM for 15M vectors. Proven viable, not designed yet.
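
The Pi figure can be reproduced with simple arithmetic. The sketch counts raw vector storage only, so index overhead and payloads come on top. The full-size comparison assumes float32 at bge-m3's 1024 dense dimensions.

```python
def vector_ram_gb(n_vectors: int, dims: int, bytes_per_dim: int) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return n_vectors * dims * bytes_per_dim / 1e9


pi_gb = vector_ram_gb(15_000_000, dims=256, bytes_per_dim=1)     # int8, 256-dim
full_gb = vector_ram_gb(15_000_000, dims=1024, bytes_per_dim=4)  # float32, 1024-dim
print(f"Pi profile: {pi_gb:.2f} GB, full precision: {full_gb:.2f} GB")
```

The reduced profile comes to 3.84 GB, which matches the ~3.8GB estimate and leaves some headroom on an 8GB board. The same vectors at full precision would need about 16 times as much.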
