RECON × Kiwix — ZIM Integration Design (v2)

Status: Draft v2 (corrected post-stress-test)
Date: 2026-04-16
Depends on: RECON v1.0.0 (master, CT 130)


1. Goal

Integrate Kiwix into RECON as a first-class knowledge source. Users manage ZIM files through the RECON dashboard — uploading directly or pulling from the Kiwix catalog. Kiwix-serve provides a browsable web interface for all loaded ZIMs. RECON detects new ZIMs and runs tiered ingestion: classifying the source, extracting articles via python-libzim, generating metadata, and embedding into Qdrant for Aurora semantic search.

Future portable deployment: The full system is built on CT 130/cortex. A stripped-down offline copy (dense-only, reduced dimensions, int8 quantization) will later be packaged for a Raspberry Pi 5 8GB with 512GB NVMe. This is a future packaging exercise — design decisions should not block it, but we build for full capability first.


2. Where Everything Lives

All on CT 130 (192.168.1.130), running as zvx.

| Component | Location |
|---|---|
| kiwix-serve | Installed via Kiwix PPA (ppa:kiwixteam/release) or Docker (ghcr.io/kiwix/kiwix-serve:3.8.2). NOT apt install kiwix-tools: Ubuntu 24.04 ships 3.5.0 (2023), which is missing the required OPDS v2 endpoints. Need ≥3.7.0. |
| kiwix-manage | Same package as kiwix-serve. Note: may be deprecated in favor of directory-serving in future libkiwix versions. Design for both. |
| ZIM file storage | /mnt/kiwix/ (separate bind-mount from data host /mnt/data/kiwix/; same SSD as the library but NOT inside it, because the library is curated human-browsable PDFs) |
| Kiwix library XML | /mnt/kiwix/library.xml (auto-managed, treat as legacy intermediate) |
| ZIM tracking DB | recon.db (new tables, see §6) |
| Ingestion code | /opt/recon/lib/zim_pipeline.py (new module) |
| Dashboard UI | recon.echo6.co (new Kiwix management page) |

3. Service Architecture

3.1 kiwix-serve (browsing + OPDS catalog)

Runs as a companion systemd unit. Provides:

  • Web browsing UI for all loaded ZIMs (Wikipedia, Stack Exchange, etc. — fully browsable with images if using _maxi variants)
  • OPDS v2 catalog at /catalog/v2/entries for RECON to enumerate loaded ZIMs
  • Keyword search via /search endpoint (HTML/XML only — no JSON support exists)
kiwix-serve --library /mnt/kiwix/library.xml \
            --port 8430 \
            --address 0.0.0.0 \
            --threads 4 \
            --nodatealiases \
            --blockexternal
  • Bound to 0.0.0.0 — browsable from the local network (kiwix.echo6.co or similar)
  • Port 8430 (next to RECON dashboard on 8420)
  • --blockexternal prevents outbound link navigation from served content
  • NO --monitorLibrary — documented bugs (zombie processes, high CPU). Use SIGHUP on demand: when RECON adds a new ZIM, it sends kill -HUP <pid> to trigger library reload.
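
The on-demand reload could be sketched as follows. This is a minimal sketch, assuming RECON already knows the kiwix-serve PID (for example from systemd's MainPID). The function name reload_kiwix_library and the injectable kill parameter are hypothetical and are not part of RECON or Kiwix.

```python
import os
import signal
from typing import Callable


def reload_kiwix_library(pid: int, kill: Callable[[int, int], None] = os.kill) -> None:
    """Ask a running kiwix-serve process to reload its library via SIGHUP.

    `kill` is injectable so the function can be exercised without a live process.
    """
    if pid <= 1:
        # pid 0 and negative pids address process groups; pid 1 is init.
        raise ValueError(f"refusing to signal pid {pid}")
    kill(pid, signal.SIGHUP)
```

When kiwix-serve runs under the systemd unit in §8, systemctl reload kiwix achieves the same thing without tracking the PID by hand.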

3.2 python-libzim (article extraction)

kiwix-serve is for browsing only. All RAG article extraction uses python-libzim (PyPI package libzim v3.9.0) directly. This is faster, gives full control over filtering, and avoids the no-JSON problem with kiwix-serve's search API.

Key API patterns:

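from libzim.reader import Archive


def parse_counter(counter: str) -> dict[str, int]:
    """Parse "mime=count;mime=count" into {mime: count}.

    Simple split; assumes no MIME type in the string contains ';'.
    """
    pairs = (p.rsplit("=", 1) for p in counter.split(";") if "=" in p)
    return {mime: int(count) for mime, count in pairs}


zim = Archive("path/to/file.zim")

# Reliable article count (NOT zim.article_count which is inflated 2-3×):
counter_meta = zim.get_metadata("Counter").decode()
# Returns: "text/html=6467891;image/webp=3211054;text/css=23"
html_count = parse_counter(counter_meta)["text/html"]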

# ZIM metadata:
title = zim.get_metadata("Title").decode()
description = zim.get_metadata("Description").decode()
language = zim.get_metadata("Language").decode()

# Article iteration:
for i in range(zim.entry_count):
    entry = zim._get_entry_by_id(i)
    if entry.is_redirect:
        continue
    item = entry.get_item()
    if item.mimetype != "text/html":  # mimetype is on Item, not Entry
        continue
    content = bytes(item.content)
    # process content...

Important: Modern ZIMs (2022+) use the "new namespace scheme": flat paths with no A/, I/, or M/ prefixes. Do not use iter_by_namespace('A').

3.3 RECON ZIM Monitor Daemon

New thread in recon.service (daemon #8). Polls kiwix-serve's OPDS catalog (/catalog/v2/entries) every 60 seconds, compares against zim_sources table in recon.db, and queues new ZIMs for ingestion. Also detects removed ZIMs.

Do not trust OPDS articleCount — it's inflated 2-3× for large ZIMs. Use python-libzim's Counter metadata for accurate counts after detection.
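
The detection step of the monitor (new and removed ZIMs) could look roughly like this. It is a sketch under assumptions: the catalog response is a standard Atom feed whose entries carry an id element, the whole catalog fits in one response (paging is not handled), and the set of known ids has already been read from zim_sources. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def catalog_ids(atom_xml: str) -> dict[str, str]:
    """Map each OPDS entry id to its title."""
    root = ET.fromstring(atom_xml)
    entries: dict[str, str] = {}
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip()
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip()
        if entry_id:
            entries[entry_id] = title
    return entries


def diff_catalog(atom_xml: str, known_ids: set[str]) -> tuple[set[str], set[str]]:
    """Return (new_ids, removed_ids) relative to the ids already tracked."""
    current = set(catalog_ids(atom_xml))
    return current - known_ids, known_ids - current
```

New ids would be inserted into zim_sources with status 'detected'. How removed ids are handled (flagging versus purging vectors) is left open here.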


4. ZIM Acquisition (How ZIMs Get In)

4.1 Direct Upload

User uploads a .zim file through the RECON dashboard. Dashboard saves to /mnt/kiwix/, runs kiwix-manage library.xml add <file.zim>, sends SIGHUP to kiwix-serve. ZIM monitor detects on next poll.
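
A minimal sketch of the registration step, assuming the dashboard has already written the upload into /mnt/kiwix/. The helper name build_register_command is hypothetical. It only validates the filename and builds the kiwix-manage argument list, so the caller decides how to run it.

```python
from pathlib import Path


def build_register_command(library_xml: str, zim_dir: str, filename: str) -> list[str]:
    """Build the `kiwix-manage <library> add <zim>` argv for an uploaded file.

    Rejects names containing directory components or lacking a .zim suffix,
    so an upload cannot register a file outside the ZIM directory.
    """
    name = Path(filename).name
    if name != filename or not name.endswith(".zim"):
        raise ValueError(f"not a plain .zim filename: {filename!r}")
    return ["kiwix-manage", library_xml, "add", str(Path(zim_dir) / name)]
```

The result can be passed to subprocess.run(cmd, check=True), followed by the SIGHUP reload.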

4.2 Catalog Pull

RECON dashboard exposes a "Browse Kiwix Catalog" page. Queries public OPDS catalog at https://library.kiwix.org/catalog/v2/entries with filters (lang, category, search). Response is Atom XML only — no JSON. User picks a ZIM, confirms, RECON downloads via torrent (preferred for large files) or direct HTTP to /mnt/kiwix/.
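
Building the catalog query might look like the sketch below. The lang and category filters come from the description above; the q (free-text search) and count (page size) parameter names are assumptions that should be checked against the Kiwix OPDS documentation before use. The function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

CATALOG_URL = "https://library.kiwix.org/catalog/v2/entries"


def catalog_query_url(
    lang: Optional[str] = None,
    category: Optional[str] = None,
    search: Optional[str] = None,
    count: int = 50,
) -> str:
    """Build an OPDS catalog URL, including only the filters that were supplied."""
    params: dict[str, object] = {}
    if lang:
        params["lang"] = lang
    if category:
        params["category"] = category
    if search:
        params["q"] = search  # assumed name of the free-text search parameter
    params["count"] = count  # assumed name of the page-size parameter
    return f"{CATALOG_URL}?{urlencode(params)}"
```

The response is Atom XML, so it must be parsed with an XML parser, not json.loads.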

4.3 ZIM Variant Strategy

| Variant | Content | Use Case |
|---|---|---|
| _maxi | Full text + all images | Browsing via kiwix-serve. Plan for this as default. |
| _nopic | Full text, no images | RAG-only (if disk constrained). ~50% smaller. |
| _mini | Intro + infobox only | Not useful for RAG. |

Start small. Prove the pipeline with Wikivoyage, iFixit, or a focused Stack Exchange before tackling Wikipedia. Stack ZIMs and monitor storage/RAM budget as you go.


5. Tiered Ingestion Pipeline

When the ZIM monitor detects a new ZIM, it:

  1. Reads ZIM metadata via python-libzim (Counter, Title, Description, Language, Tags, Name)
  2. Classifies the source based on ZIM name/tag patterns
  3. Routes to the appropriate tier
  4. Samples first if the source is unknown (see §5.4)

5.1 Tier 1 — Known Large Sources (Deterministic Extractors)

Trigger: Known source name patterns, any article count
Enrichment: None — structural metadata from HTML
Cost: Zero

| ZIM Pattern | Domain/Subdomain | Metadata Source |
|---|---|---|
| wikipedia_* | Reference/Wikipedia | Title, categories, infobox fields, section headings |
| wiktionary_* | Reference/Wiktionary | Word, part of speech, definitions |
| wikibooks_* | Reference/Wikibooks | Book title, chapter, subject |
| wikisource_* | Reference/Wikisource | Work title, author, year |
| wikiversity_* | Reference/Wikiversity | Course, subject area |
| wikivoyage_* | Reference/Wikivoyage | Destination, region, travel topic |
| stack_exchange_* | Reference/StackExchange | Tags, vote score, accepted answer flag |
| devdocs_* | Reference/DevDocs | Language, framework, API |
| ifixit_* | Maintenance/Repair | Device, category, difficulty |

5.2 Tier 2 — Unknown Mid/Large Sources (Local Qwen3 Enrichment)

Trigger: >10K articles AND no Tier 1 extractor match
Enrichment: Local Ollama Qwen3 8B (aurora model)
Cost: Zero (cortex compute time only)

Lightweight prompt per article → JSON with domain, subdomain, summary, keywords.

5.3 Tier 3 — Small Unknown Sources (Gemini Enrichment)

Trigger: ≤10K articles AND no Tier 1 extractor match
Enrichment: Gemini API (existing RECON enrichment pipeline)
Cost: Low (small article count, few dollars max)

5.4 The Gate — Unknown Source Review

When a ZIM doesn't match any Tier 1 pattern AND exceeds the Tier 3 threshold (more than 10K articles):

  1. Extract random sample of 50 articles
  2. Log to review queue in recon.db
  3. Dashboard notification: "New ZIM: obscure_wiki.zim — 47,832 articles — no known extractor — sample ready"
  4. Ingestion paused until user reviews and approves
  5. User can: approve for Tier 2, assign domain/subdomain, reject, or write a new extractor
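
The routing rules in §5.1 to §5.4 can be condensed into a small classifier. This is a sketch, not the final routing logic: the pattern list mirrors the Tier 1 table, the 10K threshold comes from the Tier 2/Tier 3 triggers, and the function name and return labels are hypothetical.

```python
from fnmatch import fnmatchcase

TIER1_PATTERNS = [
    "wikipedia_*", "wiktionary_*", "wikibooks_*", "wikisource_*",
    "wikiversity_*", "wikivoyage_*", "stack_exchange_*", "devdocs_*", "ifixit_*",
]
TIER3_MAX_ARTICLES = 10_000


def route_zim(zim_name: str, article_count: int) -> str:
    """Return 'tier1', 'tier3', or 'review' for a newly detected ZIM.

    Unknown sources above the Tier 3 threshold go to the review gate;
    they reach Tier 2 only after a user approves them.
    """
    if any(fnmatchcase(zim_name, pattern) for pattern in TIER1_PATTERNS):
        return "tier1"
    if article_count <= TIER3_MAX_ARTICLES:
        return "tier3"
    return "review"
```

The article_count passed in should be the Counter-derived HTML count, since the OPDS figure is inflated.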

6. Database Schema

New tables in /opt/recon/data/recon.db:

zim_sources

CREATE TABLE zim_sources (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_filename    TEXT NOT NULL UNIQUE,
    zim_path        TEXT NOT NULL,
    zim_uuid        TEXT,
    title           TEXT,
    description     TEXT,
    language        TEXT,
    category        TEXT,
    article_count   INTEGER DEFAULT 0,      -- from Counter metadata, NOT OPDS
    ingestion_tier  INTEGER,
    status          TEXT DEFAULT 'detected', -- detected|sampling|review|ingesting|complete|error|rejected
    processed_count INTEGER DEFAULT 0,
    skipped_count   INTEGER DEFAULT 0,
    error_count     INTEGER DEFAULT 0,
    domain          TEXT,
    subdomain       TEXT,
    detected_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    started_at      TIMESTAMP,
    completed_at    TIMESTAMP,
    last_checkpoint TEXT
);

zim_samples

CREATE TABLE zim_samples (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id   INTEGER REFERENCES zim_sources(id),
    article_path    TEXT NOT NULL,
    article_title   TEXT,
    text_preview    TEXT,
    metadata_json   TEXT,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

zim_articles

CREATE TABLE zim_articles (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id   INTEGER REFERENCES zim_sources(id),
    article_path    TEXT NOT NULL,
    article_title   TEXT,
    qdrant_point_ids TEXT,
    status          TEXT DEFAULT 'pending', -- pending|embedded|skipped|error
    processed_at    TIMESTAMP,
    UNIQUE(zim_source_id, article_path)
);
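
The UNIQUE(zim_source_id, article_path) constraint makes per-article writes idempotent, which is what resume relies on. A minimal sketch using SQLite's upsert syntax (requires SQLite 3.24 or newer) follows. Storing qdrant_point_ids as a JSON list is an assumption, since the schema only says TEXT, and the function name is hypothetical.

```python
import json
import sqlite3
from typing import Optional


def record_article(
    conn: sqlite3.Connection,
    zim_source_id: int,
    article_path: str,
    article_title: str,
    status: str,
    point_ids: Optional[list[str]] = None,
) -> None:
    """Insert or update one zim_articles row; re-processing overwrites the old status."""
    conn.execute(
        """
        INSERT INTO zim_articles
            (zim_source_id, article_path, article_title,
             qdrant_point_ids, status, processed_at)
        VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(zim_source_id, article_path) DO UPDATE SET
            qdrant_point_ids = excluded.qdrant_point_ids,
            status = excluded.status,
            processed_at = excluded.processed_at
        """,
        (zim_source_id, article_path, article_title,
         json.dumps(point_ids) if point_ids is not None else None, status),
    )
    conn.commit()
```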

7. Article Processing Pipeline

ZIM Entry
  ├─ Is redirect? ────────────────────► SKIP
  ├─ mimetype != text/html? ──────────► SKIP
  ├─ Clean text < 200 chars? ─────────► SKIP (stub)
  ▼
HTML → Clean Text (lxml, not BeautifulSoup — 10× faster)
  ▼
Metadata Extraction (per tier)
  ▼
Chunking (~512 tokens, ~50 token overlap)
  ▼
Embedding (bge-m3 dense + sparse via cortex:8090/8091)
  ▼
Qdrant Upsert → recon_knowledge_hybrid
  Payload: source_type, zim_file, zim_source_id, article_title,
           article_path, domain, subdomain, chunk_index, 
           total_chunks, keywords, language
  ▼
Checkpoint (update zim_articles + zim_sources.processed_count)
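
The chunking step (~512 tokens with ~50 token overlap) can be illustrated with a sliding window. This sketch operates on an already-tokenized list and is only an illustration: RECON's existing chunking logic is meant to be reused, and its tokenizer and boundary rules may differ. The function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            # This window already reached the end; avoid a trailing chunk
            # made up only of tokens covered by the overlap.
            break
    return chunks
```

The length of the returned list gives total_chunks for the Qdrant payload, and each chunk's position gives chunk_index.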

Batching & Backpressure

  • Batch size: 100 articles (configurable)
  • Sleep between batches: 1s (configurable)
  • ZIM ingestion runs at lower priority than real-time PDF/stream processing
  • Progress logging every 1000 articles
  • Resumable via last_checkpoint
  • Sparse upserts are slower — known issue where on-disk sparse indexing progressively degrades. Budget for this.
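
The batch-and-sleep loop could be sketched as below. The defaults mirror the values listed above (100 articles, 1s pause). The function names and the injectable sleep parameter are hypothetical, and checkpointing is assumed to happen inside the caller's process_batch.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable, batch_size: int = 100) -> Iterator[list]:
    """Yield successive lists of up to batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def run_batches(
    articles: Iterable,
    process_batch: Callable[[list], None],
    batch_size: int = 100,
    pause_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process articles batch by batch, pausing between batches; returns the number processed."""
    processed = 0
    for batch in batched(articles, batch_size):
        process_batch(batch)
        processed += len(batch)
        sleep(pause_s)  # backpressure: leave room for higher-priority work
    return processed
```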

Realistic Scale Estimates

| ZIM | Articles | Est. Chunks | Embedding Time (RTX 3090) |
|---|---|---|---|
| Appropedia EN | ~30K | ~60K | ~10 min |
| iFixit EN | ~90K | ~180K | ~25 min |
| Stack Overflow | ~500K | ~1.5M | ~3.5 hr |
| Wikipedia EN nopic | ~5M (non-stub) | ~10M | ~24 hr |
| Wikipedia EN maxi | Same text, +images | ~10M | ~24 hr (same; images are not embedded) |

8. Systemd Integration

# /etc/systemd/system/kiwix.service
[Unit]
Description=Kiwix-serve for RECON
After=network.target
PartOf=recon.service

[Service]
User=zvx
ExecStart=/usr/bin/kiwix-serve \
    --library /mnt/kiwix/library.xml \
    --port 8430 \
    --address 0.0.0.0 \
    --threads 4 \
    --nodatealiases \
    --blockexternal
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Reload after adding ZIM: systemctl reload kiwix


9. Pre-Implementation Checks

Before writing any code:

  1. Verify cortex:8091 sparse embeddings — If it's running TEI or Infinity, sparse vectors may be silently broken (dense-only output). Only native FlagEmbedding supports bge-m3 sparse. This affects the entire existing RECON pipeline, not just Kiwix.

  2. Check Qdrant upgrade path — If Qdrant needs upgrading, must go stepwise: 1.14 → 1.15 → 1.16. Direct jumps corrupt data.

  3. Check cortex RAM — Determines whether 10-30M additional vectors need quantization config changes.


10. Implementation Phases

Phase 1 — Foundation

  • Install kiwix-tools via PPA on CT 130
  • Create /mnt/kiwix/ directory (owned by zvx)
  • Set up kiwix.service systemd unit
  • Create DB schema
  • Download a small test ZIM (Appropedia EN — ~495MB maxi)
  • Register via kiwix-manage, verify browsable at port 8430
  • Implement ZIM monitor daemon (OPDS poll → zim_sources)

Phase 2 — Extraction & Embedding Pipeline

  • Implement python-libzim article extraction with lxml
  • Implement article filtering (redirects, stubs, non-HTML)
  • Implement chunking (reuse existing RECON logic)
  • Implement embedding + Qdrant upsert with ZIM payload schema
  • Implement checkpointing and resume
  • Implement Tier 1 Wikipedia extractor as proof of concept
  • End-to-end test with Appropedia

Phase 3 — Tiered Enrichment & Gate

  • Tier 2 local Qwen3 enrichment path
  • Tier 3 Gemini enrichment routing
  • Source classification and routing logic
  • Review gate (sampling, dashboard notification)
  • Test with an unknown ZIM

Phase 4 — Dashboard UI

  • Kiwix library page (loaded ZIMs, status, progress)
  • Upload ZIM form
  • Catalog browser (OPDS query + download)
  • Review queue
  • Stats integration

Phase 5 — Scale Up

  • Additional Tier 1 extractors (Stack Exchange, DevDocs, iFixit)
  • Wikipedia EN (start with nopic, upgrade to maxi when confident)
  • ZIM version management (replace old vectors when new ZIM version arrives)

11. Open Questions

  1. ZIM version dedup — When wikipedia_en_2026-06.zim replaces wikipedia_en_2026-03.zim, purge old vectors by zim_source_id filter + delete, then re-embed. Atomic cutover or incremental?

  2. Qdrant collection strategy — Same recon_knowledge_hybrid collection, or separate recon_kiwix_hybrid? Same collection means unified search. Separate means independent scaling but requires query fanout.

  3. Pi deployment packaging — Future exercise. Dense-only, Matryoshka 256-dim, int8 quant. ~3.8GB RAM for 15M vectors. Proven viable, not designed yet.
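
The ~3.8GB figure for the Pi deployment follows from raw vector storage alone, as the arithmetic below shows. It counts only the vector bytes (no HNSW index, payload, or process overhead), so actual memory use would be higher. The function name is hypothetical.

```python
def vector_ram_gb(n_vectors: int, dims: int, bytes_per_dim: int = 1) -> float:
    """Raw storage for n_vectors of `dims` components, in decimal gigabytes."""
    return n_vectors * dims * bytes_per_dim / 1e9


# 15M vectors, 256 dims, int8 (1 byte per dimension)
pi_estimate = vector_ram_gb(15_000_000, 256)

# Same vector count at bge-m3's full 1024 dims in float32 (4 bytes per dimension)
full_estimate = vector_ram_gb(15_000_000, 1024, bytes_per_dim=4)
```

Here pi_estimate comes to 3.84 GB and full_estimate to 61.44 GB, which is why the reduced-dimension int8 configuration is what makes the 8GB Pi plausible.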