RECON × Kiwix — ZIM Integration Design (v2)
Status: Draft v2 (corrected post-stress-test)
Date: 2026-04-16
Depends on: RECON v1.0.0 (master, CT 130)
1. Goal
Integrate Kiwix into RECON as a first-class knowledge source. Users manage ZIM files through the RECON dashboard — uploading directly or pulling from the Kiwix catalog. Kiwix-serve provides a browsable web interface for all loaded ZIMs. RECON detects new ZIMs and runs tiered ingestion: classifying the source, extracting articles via python-libzim, generating metadata, and embedding into Qdrant for Aurora semantic search.
Future portable deployment: The full system is built on CT 130/cortex. A stripped-down offline copy (dense-only, reduced dimensions, int8 quantization) will later be packaged for a Raspberry Pi 5 8GB with 512GB NVMe. This is a future packaging exercise — design decisions should not block it, but we build for full capability first.
2. Where Everything Lives
All on CT 130 (192.168.1.130), running as zvx.
| Component | Location |
|---|---|
| kiwix-serve | Installed via Kiwix PPA (ppa:kiwixteam/release) or Docker (ghcr.io/kiwix/kiwix-serve:3.8.2). NOT apt install kiwix-tools — Ubuntu 24.04 ships 3.5.0 (2023), missing required OPDS v2 endpoints. Need ≥3.7.0. |
| kiwix-manage | Same package as kiwix-serve. Note: may be deprecated in favor of directory-serving in future libkiwix versions. Design for both. |
| ZIM file storage | /mnt/kiwix/ (separate bind-mount from data host /mnt/data/kiwix/, same SSD as library but NOT inside library — library is curated human-browsable PDFs) |
| Kiwix library XML | /mnt/kiwix/library.xml (auto-managed, treat as legacy intermediate) |
| ZIM tracking DB | recon.db (new tables, see §6) |
| Ingestion code | /opt/recon/lib/zim_pipeline.py (new module) |
| Dashboard UI | recon.echo6.co (new Kiwix management page) |
3. Service Architecture
3.1 kiwix-serve (browsing + OPDS catalog)
Runs as a companion systemd unit. Provides:
- Web browsing UI for all loaded ZIMs (Wikipedia, Stack Exchange, etc.; fully browsable with images if using _maxi variants)
- OPDS v2 catalog at /catalog/v2/entries for RECON to enumerate loaded ZIMs
- Keyword search via /search endpoint (HTML/XML only; no JSON support exists)
kiwix-serve --library /mnt/kiwix/library.xml \
--port 8430 \
--address 0.0.0.0 \
--threads 4 \
--nodatealias \
--blockexternal
- Bound to 0.0.0.0 — browsable from the local network (kiwix.echo6.co or similar)
- Port 8430 (next to RECON dashboard on 8420)
- --blockexternal prevents outbound link navigation from served content
- NO --monitorLibrary: documented bugs (zombie processes, high CPU). Use SIGHUP on demand: when RECON adds a new ZIM, it sends kill -HUP <pid> to trigger a library reload.
3.2 python-libzim (article extraction)
kiwix-serve is for browsing only. All RAG article extraction uses python-libzim (PyPI package libzim v3.9.0) directly. This is faster, gives full control over filtering, and avoids the no-JSON problem with kiwix-serve's search API.
Key API patterns:
from libzim.reader import Archive

zim = Archive("path/to/file.zim")

# Reliable article count (NOT zim.article_count, which is inflated 2-3×):
counter_meta = zim.get_metadata("Counter").decode()
# Returns: "text/html=6467891;image/webp=3211054;text/css=23"
# parse_counter is a helper to be written (it is not part of libzim);
# it maps each mimetype in the Counter string to its count.
html_count = parse_counter(counter_meta)["text/html"]

# ZIM metadata (get_metadata returns bytes, hence .decode()):
title = zim.get_metadata("Title").decode()
description = zim.get_metadata("Description").decode()
language = zim.get_metadata("Language").decode()

# Article iteration:
for i in range(zim.entry_count):
    entry = zim._get_entry_by_id(i)
    if entry.is_redirect:
        continue
    item = entry.get_item()
    if item.mimetype != "text/html":  # mimetype is on Item, not Entry
        continue
    content = bytes(item.content)
    # process content...
Important: Modern ZIMs (2022+) use the "new namespace scheme": flat paths, no A/, I/, or M/ prefixes. Do not use iter_by_namespace('A').
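The extraction snippet above calls parse_counter, which python-libzim does not provide. A minimal sketch follows, assuming the Counter value is a semicolon-separated list of mimetype=count pairs as in the example string. The handling of mimetypes that themselves contain ";" or "=" (for example a parameterised type) is an assumption and should be checked against real ZIMs.

```python
def parse_counter(counter: str) -> dict[str, int]:
    """Parse a ZIM "Counter" metadata string into {mimetype: count}.

    Hypothetical helper (not part of libzim). A pair is only accepted when
    the text after the last "=" is an integer; anything else is treated as
    part of the next mimetype (e.g. "text/html;raw=true=5").
    """
    counts: dict[str, int] = {}
    pending = ""  # fragments of a mimetype that contained ";"
    for part in counter.split(";"):
        key, sep, value = part.rpartition("=")
        if sep and value.strip().isdigit():
            counts[(pending + key).strip()] = int(value)
            pending = ""
        else:
            pending += part + ";"
    return counts
```

With this helper, parse_counter(counter_meta).get("text/html", 0) gives the article count used for tier routing; .get avoids a KeyError on ZIMs that carry no HTML entries.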
3.3 RECON ZIM Monitor Daemon
New thread in recon.service (daemon #8). Polls kiwix-serve's OPDS catalog (/catalog/v2/entries) every 60 seconds, compares against zim_sources table in recon.db, and queues new ZIMs for ingestion. Also detects removed ZIMs.
Do not trust OPDS articleCount — it's inflated 2-3× for large ZIMs. Use python-libzim's Counter metadata for accurate counts after detection.
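The poll-and-compare step can be sketched as follows. This assumes the OPDS feed is standard Atom (entry elements with id and title children) and that the entry id is a usable key to match against zim_sources; both should be verified against a live kiwix-serve response. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_opds_entries(feed_xml: str) -> dict[str, str]:
    """Map each Atom entry's <id> to its <title>."""
    root = ET.fromstring(feed_xml)
    entries: dict[str, str] = {}
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip()
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip()
        if entry_id:
            entries[entry_id] = title
    return entries


def diff_sources(known_ids: set[str], catalog_ids: set[str]) -> tuple[set[str], set[str]]:
    """Return (new, removed) relative to what zim_sources already tracks."""
    return catalog_ids - known_ids, known_ids - catalog_ids
```

On each 60-second poll, the daemon would queue everything in the first set for ingestion and flag everything in the second set as removed.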
4. ZIM Acquisition (How ZIMs Get In)
4.1 Direct Upload
User uploads a .zim file through the RECON dashboard. Dashboard saves to /mnt/kiwix/, runs kiwix-manage library.xml add <file.zim>, sends SIGHUP to kiwix-serve. ZIM monitor detects on next poll.
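A rough sketch of the post-upload registration step, using the kiwix-manage invocation described above and the systemctl reload from §8 to deliver the SIGHUP. The function names are hypothetical, and it assumes the RECON process is permitted to run systemctl reload for the kiwix unit.

```python
import subprocess

LIBRARY_XML = "/mnt/kiwix/library.xml"


def registration_commands(zim_path: str, library: str = LIBRARY_XML) -> list[list[str]]:
    """Commands to register a ZIM and make kiwix-serve reload its library."""
    return [
        ["kiwix-manage", library, "add", zim_path],
        ["systemctl", "reload", "kiwix"],  # ExecReload sends SIGHUP
    ]


def register_zim(zim_path: str, run=subprocess.run) -> None:
    """Run the registration commands in order, raising on the first failure."""
    for argv in registration_commands(zim_path):
        run(argv, check=True)
```

Passing the runner as a parameter keeps the function testable without kiwix-tools installed; in production the default subprocess.run is used.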
4.2 Catalog Pull
RECON dashboard exposes a "Browse Kiwix Catalog" page. Queries public OPDS catalog at https://library.kiwix.org/catalog/v2/entries with filters (lang, category, search). Response is Atom XML only — no JSON. User picks a ZIM, confirms, RECON downloads via torrent (preferred for large files) or direct HTTP to /mnt/kiwix/.
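Building the catalog query might look like the sketch below. The lang and category filters come from the description above; the q (free-text search) and count parameter names are assumptions to confirm against the Kiwix catalog API before relying on them.

```python
from urllib.parse import urlencode

CATALOG_ENTRIES_URL = "https://library.kiwix.org/catalog/v2/entries"


def catalog_query_url(lang=None, category=None, search=None, count=50) -> str:
    """Build a catalog query URL; filters left as None are omitted."""
    params = {}
    if lang:
        params["lang"] = lang
    if category:
        params["category"] = category
    if search:
        params["q"] = search  # assumed parameter name
    params["count"] = count  # assumed parameter name
    return f"{CATALOG_ENTRIES_URL}?{urlencode(params)}"
```

The response still has to be parsed as Atom XML, since the catalog offers no JSON.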
4.3 ZIM Variant Strategy
| Variant | Content | Use Case |
|---|---|---|
| _maxi | Full text + all images | Browsing via kiwix-serve. Plan for this as default. |
| _nopic | Full text, no images | RAG-only (if disk constrained). ~50% smaller. |
| _mini | Intro + infobox only | Not useful for RAG. |
Start small. Prove the pipeline with Wikivoyage, iFixit, or a focused Stack Exchange before tackling Wikipedia. Stack ZIMs and monitor storage/RAM budget as you go.
5. Tiered Ingestion Pipeline
When the ZIM monitor detects a new ZIM, it:
- Reads ZIM metadata via python-libzim (Counter, Title, Description, Language, Tags, Name)
- Classifies the source based on ZIM name/tag patterns
- Routes to the appropriate tier
- Samples first if the source is unknown (see §5.4)
5.1 Tier 1 — Known Large Sources (Deterministic Extractors)
Trigger: Known source name patterns, any article count
Enrichment: None — structural metadata from HTML
Cost: Zero
| ZIM Pattern | Domain/Subdomain | Metadata Source |
|---|---|---|
| wikipedia_* | Reference/Wikipedia | Title, categories, infobox fields, section headings |
| wiktionary_* | Reference/Wiktionary | Word, part of speech, definitions |
| wikibooks_* | Reference/Wikibooks | Book title, chapter, subject |
| wikisource_* | Reference/Wikisource | Work title, author, year |
| wikiversity_* | Reference/Wikiversity | Course, subject area |
| wikivoyage_* | Reference/Wikivoyage | Destination, region, travel topic |
| stack_exchange_* | Reference/StackExchange | Tags, vote score, accepted answer flag |
| devdocs_* | Reference/DevDocs | Language, framework, API |
| ifixit_* | Maintenance/Repair | Device, category, difficulty |
5.2 Tier 2 — Unknown Mid/Large Sources (Local Qwen3 Enrichment)
Trigger: >10K articles AND no Tier 1 extractor match
Enrichment: Local Ollama Qwen3 8B (aurora model)
Cost: Zero (cortex compute time only)
Lightweight prompt per article → JSON with domain, subdomain, summary, keywords.
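Local models do not always return well-formed JSON, so the Tier 2 path likely needs a validation step before the result reaches the Qdrant payload. A minimal sketch, assuming domain, subdomain, and summary are strings and keywords is a list of strings; the exact field types and the function name are assumptions, since the prompt is not yet specified.

```python
import json
from typing import Optional


def parse_enrichment(raw: str) -> Optional[dict]:
    """Return the validated enrichment dict, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field in ("domain", "subdomain", "summary"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            return None
    keywords = data.get("keywords")
    if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
        return None
    # Keep only the expected fields so stray model output never reaches the payload.
    return {
        "domain": data["domain"].strip(),
        "subdomain": data["subdomain"].strip(),
        "summary": data["summary"].strip(),
        "keywords": keywords,
    }
```

A None result could be counted in zim_sources.error_count or retried, depending on how strict the pipeline should be.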
5.3 Tier 3 — Small Unknown Sources (Gemini Enrichment)
Trigger: ≤10K articles AND no Tier 1 extractor match
Enrichment: Gemini API (existing RECON enrichment pipeline)
Cost: Low (small article count, few dollars max)
5.4 The Gate — Unknown Source Review
When a ZIM doesn't match any Tier 1 pattern AND exceeds Tier 3 threshold:
- Extract random sample of 50 articles
- Log to review queue in recon.db
- Dashboard notification: "New ZIM: obscure_wiki.zim — 47,832 articles — no known extractor — sample ready"
- Ingestion paused until user reviews and approves
- User can: approve for Tier 2, assign domain/subdomain, reject, or write a new extractor
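The routing rules from §5.1 through §5.4 can be sketched as a single function. The name route_zim and its return shape are hypothetical; the patterns and the 10K threshold are taken from the tier definitions above. An unknown source above the threshold is returned as provisional Tier 2 with the review flag set, since ingestion must wait for approval.

```python
from fnmatch import fnmatchcase

TIER1_PATTERNS = (
    "wikipedia_*", "wiktionary_*", "wikibooks_*", "wikisource_*",
    "wikiversity_*", "wikivoyage_*", "stack_exchange_*", "devdocs_*",
    "ifixit_*",
)
TIER3_MAX_ARTICLES = 10_000


def route_zim(zim_name: str, html_count: int) -> tuple[int, bool]:
    """Return (tier, needs_review) for a newly detected ZIM."""
    if any(fnmatchcase(zim_name, pattern) for pattern in TIER1_PATTERNS):
        return 1, False  # known source, any article count
    if html_count <= TIER3_MAX_ARTICLES:
        return 3, False  # small unknown source: Gemini enrichment
    return 2, True  # large unknown source: sample and hold for review
```

html_count here is the text/html value from the Counter metadata, not the OPDS articleCount.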
6. Database Schema
New tables in /opt/recon/data/recon.db:
zim_sources
CREATE TABLE zim_sources (
id INTEGER PRIMARY KEY AUTOINCREMENT,
zim_filename TEXT NOT NULL UNIQUE,
zim_path TEXT NOT NULL,
zim_uuid TEXT,
title TEXT,
description TEXT,
language TEXT,
category TEXT,
article_count INTEGER DEFAULT 0, -- from Counter metadata, NOT OPDS
ingestion_tier INTEGER,
status TEXT DEFAULT 'detected', -- detected|sampling|review|ingesting|complete|error|rejected
processed_count INTEGER DEFAULT 0,
skipped_count INTEGER DEFAULT 0,
error_count INTEGER DEFAULT 0,
domain TEXT,
subdomain TEXT,
detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_checkpoint TEXT
);
zim_samples
CREATE TABLE zim_samples (
id INTEGER PRIMARY KEY AUTOINCREMENT,
zim_source_id INTEGER REFERENCES zim_sources(id),
article_path TEXT NOT NULL,
article_title TEXT,
text_preview TEXT,
metadata_json TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
zim_articles
CREATE TABLE zim_articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
zim_source_id INTEGER REFERENCES zim_sources(id),
article_path TEXT NOT NULL,
article_title TEXT,
qdrant_point_ids TEXT,
status TEXT DEFAULT 'pending', -- pending|embedded|skipped|error
processed_at TIMESTAMP,
UNIQUE(zim_source_id, article_path)
);
7. Article Processing Pipeline
ZIM Entry
├─ Is redirect? ────────────────────► SKIP
├─ mimetype != text/html? ──────────► SKIP
├─ Clean text < 200 chars? ─────────► SKIP (stub)
▼
HTML → Clean Text (lxml, not BeautifulSoup — 10× faster)
▼
Metadata Extraction (per tier)
▼
Chunking (~512 tokens, ~50 token overlap)
▼
Embedding (bge-m3 dense + sparse via cortex:8090/8091)
▼
Qdrant Upsert → recon_knowledge_hybrid
Payload: source_type, zim_file, zim_source_id, article_title,
article_path, domain, subdomain, chunk_index,
total_chunks, keywords, language
▼
Checkpoint (update zim_articles + zim_sources.processed_count)
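The stub filter and overlapping-window chunking steps can be illustrated as below. This is a sketch only: the plan is to reuse RECON's existing chunking logic, and whitespace-separated words stand in here for real tokenizer tokens, so chunk sizes are approximate. The function names are hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the article
    return chunks


def chunk_article(clean_text: str, min_chars: int = 200) -> list[str]:
    """Return chunk strings, or an empty list for stubs under min_chars."""
    if len(clean_text) < min_chars:
        return []
    return [" ".join(chunk) for chunk in chunk_tokens(clean_text.split())]
```

The early break prevents a trailing chunk that would consist only of tokens already covered by the previous window.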
Batching & Backpressure
- Batch size: 100 articles (configurable)
- Sleep between batches: 1s (configurable)
- ZIM ingestion runs at lower priority than real-time PDF/stream processing
- Progress logging every 1000 articles
- Resumable via last_checkpoint
- Sparse upserts are slower: there is a known issue where on-disk sparse indexing progressively degrades. Budget for this.
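A minimal sketch of the batch loop with checkpointing and resume. It assumes last_checkpoint stores the last processed entry index as text and that entries are visited in ascending order; neither is fixed by the schema. run_batches and the process callback are hypothetical names.

```python
import sqlite3
import time
from typing import Callable, Iterable


def _checkpoint(db: sqlite3.Connection, source_id: int, last_id: int, n: int) -> None:
    """Persist progress for one batch in a single commit."""
    db.execute(
        "UPDATE zim_sources SET last_checkpoint = ?, "
        "processed_count = processed_count + ? WHERE id = ?",
        (str(last_id), n, source_id),
    )
    db.commit()


def run_batches(db: sqlite3.Connection, source_id: int, entry_ids: Iterable[int],
                process: Callable[[int], None],
                batch_size: int = 100, sleep_s: float = 1.0) -> int:
    """Process entries after the stored checkpoint; return how many were processed."""
    row = db.execute(
        "SELECT last_checkpoint FROM zim_sources WHERE id = ?", (source_id,)
    ).fetchone()
    resume_after = int(row[0]) if row and row[0] is not None else -1
    done = in_batch = 0
    last_id = resume_after
    for entry_id in entry_ids:
        if entry_id <= resume_after:
            continue  # already handled in a previous run
        process(entry_id)
        last_id = entry_id
        done += 1
        in_batch += 1
        if in_batch == batch_size:
            _checkpoint(db, source_id, last_id, in_batch)
            in_batch = 0
            time.sleep(sleep_s)  # backpressure between batches
    if in_batch:
        _checkpoint(db, source_id, last_id, in_batch)
    return done
```

If the process dies mid-batch, at most one batch of work is repeated on restart, so the downstream Qdrant upsert should be idempotent for a given article.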
Realistic Scale Estimates
| ZIM | Articles | Est. Chunks | Embedding Time (RTX 3090) |
|---|---|---|---|
| Appropedia EN | ~30K | ~60K | ~10 min |
| iFixit EN | ~90K | ~180K | ~25 min |
| Stack Overflow | ~500K | ~1.5M | ~3.5 hr |
| Wikipedia EN nopic | ~5M (non-stub) | ~10M | ~24 hr |
| Wikipedia EN maxi | Same text, +images | ~10M | ~24 hr (same — images not embedded) |
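As a sanity check on the table, the chunk counts and times imply a sustained embedding throughput of roughly 100 to 120 chunks per second. The short calculation below derives that rate from the table's own figures; it adds no new measurements.

```python
# (estimated chunks, estimated seconds) taken from the table above
ESTIMATES = {
    "Appropedia EN": (60_000, 10 * 60),
    "iFixit EN": (180_000, 25 * 60),
    "Stack Overflow": (1_500_000, 3.5 * 3600),
    "Wikipedia EN nopic": (10_000_000, 24 * 3600),
}


def chunks_per_second(chunks: int, seconds: float) -> float:
    """Throughput implied by one row of the estimate table."""
    return chunks / seconds


def eta_hours(chunks: int, rate: float) -> float:
    """Hours needed to embed `chunks` at `rate` chunks per second."""
    return chunks / rate / 3600


for name, (chunks, seconds) in ESTIMATES.items():
    print(f"{name}: {chunks_per_second(chunks, seconds):.0f} chunks/s")
```

If the measured rate on cortex differs once sparse upserts are included, eta_hours gives a quick way to re-project the larger ZIMs.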
8. Systemd Integration
# /etc/systemd/system/kiwix.service
[Unit]
Description=Kiwix-serve for RECON
After=network.target
PartOf=recon.service
[Service]
User=zvx
ExecStart=/usr/bin/kiwix-serve \
--library /mnt/kiwix/library.xml \
--port 8430 \
--address 0.0.0.0 \
--threads 4 \
--nodatealias \
--blockexternal
ExecReload=kill -HUP $MAINPID
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Reload after adding ZIM: systemctl reload kiwix
9. Pre-Implementation Checks
Before writing any code:
- Verify cortex:8091 sparse embeddings. If it's running TEI or Infinity, sparse vectors may be silently broken (dense-only output). Only native FlagEmbedding supports bge-m3 sparse. This affects the entire existing RECON pipeline, not just Kiwix.
- Check Qdrant upgrade path. If Qdrant needs upgrading, it must go stepwise: 1.14 → 1.15 → 1.16. Direct jumps corrupt data.
- Check cortex RAM. This determines whether 10-30M additional vectors need quantization config changes.
10. Implementation Phases
Phase 1 — Foundation
- Install kiwix-tools via PPA on CT 130
- Create /mnt/kiwix/ directory (owned by zvx)
- Set up kiwix.service systemd unit
- Create DB schema
- Download a small test ZIM (Appropedia EN — ~495MB maxi)
- Register via kiwix-manage, verify browsable at port 8430
- Implement ZIM monitor daemon (OPDS poll → zim_sources)
Phase 2 — Extraction & Embedding Pipeline
- Implement python-libzim article extraction with lxml
- Implement article filtering (redirects, stubs, non-HTML)
- Implement chunking (reuse existing RECON logic)
- Implement embedding + Qdrant upsert with ZIM payload schema
- Implement checkpointing and resume
- Implement Tier 1 Wikipedia extractor as proof of concept
- End-to-end test with Appropedia
Phase 3 — Tiered Enrichment & Gate
- Tier 2 local Qwen3 enrichment path
- Tier 3 Gemini enrichment routing
- Source classification and routing logic
- Review gate (sampling, dashboard notification)
- Test with an unknown ZIM
Phase 4 — Dashboard UI
- Kiwix library page (loaded ZIMs, status, progress)
- Upload ZIM form
- Catalog browser (OPDS query + download)
- Review queue
- Stats integration
Phase 5 — Scale Up
- Additional Tier 1 extractors (Stack Exchange, DevDocs, iFixit)
- Wikipedia EN (start with nopic, upgrade to maxi when confident)
- ZIM version management (replace old vectors when new ZIM version arrives)
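For ZIM version management, purging the old vectors can be done with Qdrant's delete-by-filter on the zim_source_id payload field. The sketch below only builds the REST request (path and JSON body) for the points/delete endpoint; the function name is hypothetical, and whether to call it before or after re-embedding is the open cutover question in §11.

```python
import json


def delete_by_source_request(collection: str, zim_source_id: int) -> tuple[str, str]:
    """Build the path and JSON body for a Qdrant filtered point deletion."""
    path = f"/collections/{collection}/points/delete?wait=true"
    body = {
        "filter": {
            "must": [
                {"key": "zim_source_id", "match": {"value": zim_source_id}}
            ]
        }
    }
    return path, json.dumps(body)
```

The request would be sent as a POST to the Qdrant HTTP port. A payload index on zim_source_id is worth considering so the filter does not scan the whole collection.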
11. Open Questions
- ZIM version dedup: When wikipedia_en_2026-06.zim replaces wikipedia_en_2026-03.zim, purge old vectors by zim_source_id filter + delete, then re-embed. Atomic cutover or incremental?
- Qdrant collection strategy: Same recon_knowledge_hybrid collection, or separate recon_kiwix_hybrid? Same collection means unified search. Separate means independent scaling but requires query fanout.
- Pi deployment packaging: Future exercise. Dense-only, Matryoshka 256-dim, int8 quant. ~3.8GB RAM for 15M vectors. Proven viable, not designed yet.
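The ~3.8GB figure for the Pi build follows directly from the vector dimensions: 15M vectors × 256 dimensions × 1 byte (int8). The calculation below reproduces it and compares it with full-size bge-m3 dense vectors (1024 dimensions, float32). It covers raw vector storage only; HNSW index and payload overhead are not included.

```python
def vector_ram_bytes(n_vectors: int, dims: int, bytes_per_dim: int) -> int:
    """Raw storage for dense vectors, excluding index and payload overhead."""
    return n_vectors * dims * bytes_per_dim


N_VECTORS = 15_000_000

pi_int8 = vector_ram_bytes(N_VECTORS, 256, 1)        # Matryoshka 256-dim, int8
full_float32 = vector_ram_bytes(N_VECTORS, 1024, 4)  # bge-m3 dense, float32

print(f"Pi build:   {pi_int8 / 1e9:.2f} GB")
print(f"Full dense: {full_float32 / 1e9:.2f} GB")
```

The reduced build is 16× smaller in raw vector storage, which is what brings it within reach of an 8GB Pi 5.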