# RECON × Kiwix — ZIM Integration Design (v2)
**Status:** Draft v2 (corrected post-stress-test)
**Date:** 2026-04-16
**Depends on:** RECON v1.0.0 (master, CT 130)
---
## 1. Goal
Integrate Kiwix into RECON as a first-class knowledge source. Users manage ZIM files through the RECON dashboard — uploading directly or pulling from the Kiwix catalog. Kiwix-serve provides a browsable web interface for all loaded ZIMs. RECON detects new ZIMs and runs tiered ingestion: classifying the source, extracting articles via python-libzim, generating metadata, and embedding into Qdrant for Aurora semantic search.
**Future portable deployment:** The full system is built on CT 130/cortex. A stripped-down offline copy (dense-only, reduced dimensions, int8 quantization) will later be packaged for a Raspberry Pi 5 8GB with 512GB NVMe. This is a future packaging exercise — design decisions should not block it, but we build for full capability first.
---
## 2. Where Everything Lives
All on CT 130 (192.168.1.130), running as `zvx`.
| Component | Location |
|-----------|----------|
| kiwix-serve | Installed via Kiwix PPA (`ppa:kiwixteam/release`) or Docker (`ghcr.io/kiwix/kiwix-serve:3.8.2`). **NOT `apt install kiwix-tools`** — Ubuntu 24.04 ships 3.5.0 (2023), missing required OPDS v2 endpoints. Need ≥3.7.0. |
| kiwix-manage | Same package as kiwix-serve. Note: may be deprecated in favor of directory-serving in future libkiwix versions. Design for both. |
| ZIM file storage | `/mnt/kiwix/` (its own bind-mount from `/mnt/data/kiwix/` on the data host; on the same SSD as the library but NOT inside it, because the library holds only curated, human-browsable PDFs) |
| Kiwix library XML | `/mnt/kiwix/library.xml` (auto-managed, treat as legacy intermediate) |
| ZIM tracking DB | `recon.db` (new tables, see §6) |
| Ingestion code | `/opt/recon/lib/zim_pipeline.py` (new module) |
| Dashboard UI | `recon.echo6.co` (new Kiwix management page) |
---
## 3. Service Architecture
### 3.1 kiwix-serve (browsing + OPDS catalog)
Runs as a companion systemd unit. Provides:
- **Web browsing UI** for all loaded ZIMs (Wikipedia, Stack Exchange, etc. — fully browsable with images if using `_maxi` variants)
- **OPDS v2 catalog** at `/catalog/v2/entries` for RECON to enumerate loaded ZIMs
- **Keyword search** via `/search` endpoint (HTML/XML only — **no JSON support exists**)
```
kiwix-serve --library /mnt/kiwix/library.xml \
    --port 8430 \
    --address 0.0.0.0 \
    --threads 4 \
    --nodatealias \
    --blockexternal
```
- Bound to 0.0.0.0 — browsable from the local network (kiwix.echo6.co or similar)
- Port 8430 (next to RECON dashboard on 8420)
- `--blockexternal` prevents outbound link navigation from served content
- **NO `--monitorLibrary`** — documented bugs (zombie processes, high CPU). Use SIGHUP on demand: when RECON adds a new ZIM, it sends `kill -HUP <pid>` to trigger library reload.
### 3.2 python-libzim (article extraction)
kiwix-serve is for **browsing only**. All RAG article extraction uses `python-libzim` (PyPI package `libzim` v3.9.0) directly. This is faster, gives full control over filtering, and avoids the no-JSON problem with kiwix-serve's search API.
Key API patterns:
```python
from libzim.reader import Archive


def parse_counter(raw: str) -> dict[str, int]:
    """Minimal helper: 'text/html=12;image/webp=3' -> {'text/html': 12, ...}."""
    pairs = (p.rpartition("=") for p in raw.split(";") if "=" in p)
    return {mime: int(count) for mime, _, count in pairs if count.isdigit()}


zim = Archive("path/to/file.zim")

# Reliable article count (NOT zim.article_count, which is inflated 2-3×):
counter_meta = zim.get_metadata("Counter").decode()
# Returns e.g. "text/html=6467891;image/webp=3211054;text/css=23"
html_count = parse_counter(counter_meta)["text/html"]

# ZIM metadata:
title = zim.get_metadata("Title").decode()
description = zim.get_metadata("Description").decode()
language = zim.get_metadata("Language").decode()

# Article iteration (_get_entry_by_id is underscore-prefixed, so not a stable public API):
for i in range(zim.entry_count):
    entry = zim._get_entry_by_id(i)
    if entry.is_redirect:
        continue
    item = entry.get_item()
    if item.mimetype != "text/html":  # mimetype is on Item, not Entry
        continue
    content = bytes(item.content)
    # process content...
```
**Important:** Modern ZIMs (2022+) use the "new namespace scheme" — flat paths, no `A/`/`I/`/`M/` prefixes. Do not use `iter_by_namespace('A')`.
### 3.3 RECON ZIM Monitor Daemon
New thread in `recon.service` (daemon #8). Polls kiwix-serve's OPDS catalog (`/catalog/v2/entries`) every 60 seconds, compares against `zim_sources` table in `recon.db`, and queues new ZIMs for ingestion. Also detects removed ZIMs.
**Do not trust OPDS `articleCount`** — it's inflated 2-3× for large ZIMs. Use python-libzim's `Counter` metadata for accurate counts after detection.
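The poll-and-compare step can be sketched roughly as follows. This is a minimal illustration, not the monitor itself: it assumes each Atom `<entry>` in the feed carries a stable `<id>` that can serve as the comparison key, and the function names `parse_opds_entries` and `diff_catalog` are hypothetical. The sample feed is hand-written, not real kiwix-serve output.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_opds_entries(feed_xml: str) -> dict[str, str]:
    """Map each Atom <entry> id to its title (assumes ids are stable per ZIM)."""
    root = ET.fromstring(feed_xml)
    entries = {}
    for entry in root.iter(f"{ATOM}entry"):
        entry_id = entry.findtext(f"{ATOM}id", default="").strip()
        title = entry.findtext(f"{ATOM}title", default="").strip()
        if entry_id:
            entries[entry_id] = title
    return entries


def diff_catalog(catalog_ids: set[str], known_ids: set[str]) -> tuple[set[str], set[str]]:
    """Return (ZIMs to queue for ingestion, ZIMs that disappeared from the catalog)."""
    return catalog_ids - known_ids, known_ids - catalog_ids


# Hand-written sample feed, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:uuid:aaaa</id><title>Appropedia</title></entry>
  <entry><id>urn:uuid:bbbb</id><title>iFixit</title></entry>
</feed>"""

catalog = parse_opds_entries(SAMPLE_FEED)
# known_ids would come from the zim_sources table in recon.db.
added, removed = diff_catalog(set(catalog), {"urn:uuid:bbbb", "urn:uuid:cccc"})
print("new:", added, "removed:", removed)
```

In the real daemon the feed text would be fetched from kiwix-serve on port 8430 once per poll interval, and `known_ids` would be read from `zim_sources.zim_uuid`.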
---
## 4. ZIM Acquisition (How ZIMs Get In)
### 4.1 Direct Upload
User uploads a `.zim` file through the RECON dashboard. Dashboard saves to `/mnt/kiwix/`, runs `kiwix-manage library.xml add <file.zim>`, sends SIGHUP to kiwix-serve. ZIM monitor detects on next poll.
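A rough sketch of the upload handler's post-save steps is shown below. The helper names (`registration_commands`, `register_upload`) are hypothetical, and the sketch assumes the dashboard process is permitted to run `systemctl reload kiwix` (for example via a polkit or sudoers rule), which this document does not specify.

```python
import shutil
import subprocess
from pathlib import Path

ZIM_DIR = Path("/mnt/kiwix")
LIBRARY_XML = ZIM_DIR / "library.xml"


def registration_commands(zim_path: Path, library_xml: Path = LIBRARY_XML) -> list[list[str]]:
    """Commands to run, in order, once a ZIM file is in place."""
    return [
        ["kiwix-manage", str(library_xml), "add", str(zim_path)],
        ["systemctl", "reload", "kiwix"],  # the unit's ExecReload sends SIGHUP
    ]


def register_upload(uploaded: Path, run=subprocess.run) -> Path:
    """Move an uploaded file into ZIM_DIR, register it, and reload kiwix-serve.

    `run` is injectable so the flow can be exercised without touching the system.
    """
    if uploaded.suffix != ".zim":
        raise ValueError(f"not a ZIM file: {uploaded.name}")
    target = ZIM_DIR / uploaded.name
    shutil.move(str(uploaded), str(target))
    for cmd in registration_commands(target):
        run(cmd, check=True)  # raise if kiwix-manage or the reload fails
    return target
```

Passing arguments as a list (rather than a shell string) avoids quoting problems with unusual upload filenames.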
### 4.2 Catalog Pull
RECON dashboard exposes a "Browse Kiwix Catalog" page. Queries public OPDS catalog at `https://library.kiwix.org/catalog/v2/entries` with filters (lang, category, search). **Response is Atom XML only** — no JSON. User picks a ZIM, confirms, RECON downloads via torrent (preferred for large files) or direct HTTP to `/mnt/kiwix/`.
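Building the catalog query might look something like this. The query parameter names (`lang`, `category`, `q`, `count`) are assumptions based on the filters listed above and should be checked against the catalog's documentation before use; `catalog_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

CATALOG_URL = "https://library.kiwix.org/catalog/v2/entries"


def catalog_query_url(lang=None, category=None, search=None, count=50) -> str:
    """Build a filtered catalog URL; parameter names are assumed, not verified."""
    params = {"lang": lang, "category": category, "q": search, "count": count}
    # Drop unset filters so the URL only carries what the user selected.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{CATALOG_URL}?{query}"


print(catalog_query_url(lang="eng", category="wikivoyage"))
```

The response would then be parsed as Atom XML, since no JSON representation is available.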
### 4.3 ZIM Variant Strategy
| Variant | Content | Use Case |
|---------|---------|----------|
| `_maxi` | Full text + all images | Browsing via kiwix-serve. Plan for this as default. |
| `_nopic` | Full text, no images | RAG-only (if disk constrained). ~50% smaller. |
| `_mini` | Intro + infobox only | Not useful for RAG. |
**Start small.** Prove the pipeline with Wikivoyage, iFixit, or a focused Stack Exchange before tackling Wikipedia. Stack ZIMs and monitor storage/RAM budget as you go.
---
## 5. Tiered Ingestion Pipeline
When the ZIM monitor detects a new ZIM, it:
1. **Reads ZIM metadata** via python-libzim (`Counter`, `Title`, `Description`, `Language`, `Tags`, `Name`)
2. **Classifies the source** based on ZIM name/tag patterns
3. **Routes to the appropriate tier**
4. **Samples first** if the source is unknown (see §5.4)
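As an illustration of steps 2 to 4, the routing decision could be expressed along these lines, using the patterns and the 10K threshold from §5.1 through §5.4. The function name `classify` and the returned dictionary shape are hypothetical; the `status` values are taken from the `zim_sources` schema in §6.

```python
from fnmatch import fnmatch

# ZIM name pattern -> (domain, subdomain), copied from the Tier 1 table in §5.1.
TIER1_PATTERNS = {
    "wikipedia_*": ("Reference", "Wikipedia"),
    "wiktionary_*": ("Reference", "Wiktionary"),
    "wikibooks_*": ("Reference", "Wikibooks"),
    "wikisource_*": ("Reference", "Wikisource"),
    "wikiversity_*": ("Reference", "Wikiversity"),
    "wikivoyage_*": ("Reference", "Wikivoyage"),
    "stack_exchange_*": ("Reference", "StackExchange"),
    "devdocs_*": ("Reference", "DevDocs"),
    "ifixit_*": ("Maintenance", "Repair"),
}
TIER3_MAX_ARTICLES = 10_000


def classify(zim_name: str, html_count: int) -> dict:
    """Pick an ingestion tier and initial status for a newly detected ZIM."""
    for pattern, (domain, subdomain) in TIER1_PATTERNS.items():
        if fnmatch(zim_name, pattern):
            return {"tier": 1, "status": "ingesting", "domain": domain, "subdomain": subdomain}
    if html_count <= TIER3_MAX_ARTICLES:
        # Small unknown source: goes straight to Gemini enrichment.
        return {"tier": 3, "status": "ingesting", "domain": None, "subdomain": None}
    # Large unknown source: Tier 2 candidate, held at the review gate first.
    return {"tier": 2, "status": "sampling", "domain": None, "subdomain": None}


print(classify("wikivoyage_en_all_maxi", 30_000))
print(classify("obscure_wiki", 47_832))
```

`html_count` here is the `text/html` figure from the ZIM's `Counter` metadata, not the OPDS `articleCount`.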
### 5.1 Tier 1 — Known Large Sources (Deterministic Extractors)
**Trigger:** Known source name patterns, any article count
**Enrichment:** None — structural metadata from HTML
**Cost:** Zero
| ZIM Pattern | Domain/Subdomain | Metadata Source |
|-------------|-----------------|-----------------|
| `wikipedia_*` | Reference/Wikipedia | Title, categories, infobox fields, section headings |
| `wiktionary_*` | Reference/Wiktionary | Word, part of speech, definitions |
| `wikibooks_*` | Reference/Wikibooks | Book title, chapter, subject |
| `wikisource_*` | Reference/Wikisource | Work title, author, year |
| `wikiversity_*` | Reference/Wikiversity | Course, subject area |
| `wikivoyage_*` | Reference/Wikivoyage | Destination, region, travel topic |
| `stack_exchange_*` | Reference/StackExchange | Tags, vote score, accepted answer flag |
| `devdocs_*` | Reference/DevDocs | Language, framework, API |
| `ifixit_*` | Maintenance/Repair | Device, category, difficulty |
### 5.2 Tier 2 — Unknown Mid/Large Sources (Local Qwen3 Enrichment)
**Trigger:** >10K articles AND no Tier 1 extractor match
**Enrichment:** Local Ollama Qwen3 8B (`aurora` model)
**Cost:** Zero (cortex compute time only)
Lightweight prompt per article → JSON with domain, subdomain, summary, keywords.
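Because local model output is not guaranteed to be well-formed, the response probably needs validating before it reaches the payload. The sketch below is one possible shape, not the actual prompt: the prompt wording, the 2000-character truncation, and the names `build_prompt` and `parse_enrichment` are all assumptions.

```python
import json
from typing import Optional

REQUIRED_KEYS = ("domain", "subdomain", "summary", "keywords")


def build_prompt(title: str, text: str, max_chars: int = 2000) -> str:
    """Hypothetical lightweight prompt; truncates the article to bound cost."""
    return (
        "Classify this article. Reply with JSON only, using the keys "
        "domain, subdomain, summary, keywords (a list of strings).\n\n"
        f"Title: {title}\n\n{text[:max_chars]}"
    )


def parse_enrichment(raw: str) -> Optional[dict]:
    """Return the enrichment dict, or None if the model output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return None
    if not isinstance(data["keywords"], list):
        return None
    # Keep only the expected keys so stray fields never reach the payload.
    return {k: data[k] for k in REQUIRED_KEYS}
```

A `None` result could be counted in `zim_sources.error_count` or retried, depending on how strict the pipeline should be.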
### 5.3 Tier 3 — Small Unknown Sources (Gemini Enrichment)
**Trigger:** ≤10K articles AND no Tier 1 extractor match
**Enrichment:** Gemini API (existing RECON enrichment pipeline)
**Cost:** Low (small article count, few dollars max)
### 5.4 The Gate — Unknown Source Review
When a ZIM doesn't match any Tier 1 pattern AND exceeds the Tier 3 threshold (>10K articles), it is held for review before any Tier 2 ingestion starts:
1. Extract random sample of 50 articles
2. Log to review queue in `recon.db`
3. Dashboard notification: *"New ZIM: obscure_wiki.zim — 47,832 articles — no known extractor — sample ready"*
4. Ingestion **paused** until user reviews and approves
5. User can: approve for Tier 2, assign domain/subdomain, reject, or write a new extractor
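Step 1 could be sketched as below. Entry IDs in a ZIM also cover redirects, images, and stylesheets, so the sketch draws random IDs and keeps only those that pass a caller-supplied check. The name `sample_article_ids` and the `max_tries` cap are hypothetical; in practice `is_article` would wrap the redirect and mimetype tests from §3.2.

```python
import random


def sample_article_ids(entry_count, is_article, k=50, seed=None, max_tries=None):
    """Draw up to k distinct entry ids for which is_article(id) is true."""
    rng = random.Random(seed)
    max_tries = max_tries or k * 100  # cap so sparse ZIMs cannot loop forever
    seen, picked = set(), []
    for _ in range(max_tries):
        if len(picked) == k or len(seen) == entry_count:
            break
        i = rng.randrange(entry_count)
        if i in seen:
            continue
        seen.add(i)
        if is_article(i):
            picked.append(i)
    return picked


# Stand-in predicate for illustration: pretend even ids are HTML articles.
ids = sample_article_ids(1000, lambda i: i % 2 == 0, k=50, seed=1)
print(len(ids))
```

Drawing IDs lazily avoids materialising a list of millions of entries just to sample 50 of them.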
---
## 6. Database Schema
New tables in `/opt/recon/data/recon.db`:
### zim_sources
```sql
CREATE TABLE zim_sources (
id INTEGER PRIMARY KEY AUTOINCREMENT,
zim_filename TEXT NOT NULL UNIQUE,
zim_path TEXT NOT NULL,
zim_uuid TEXT,
title TEXT,
description TEXT,
language TEXT,
category TEXT,
article_count INTEGER DEFAULT 0, -- from Counter metadata, NOT OPDS
ingestion_tier INTEGER,
status TEXT DEFAULT 'detected', -- detected|sampling|review|ingesting|complete|error|rejected
processed_count INTEGER DEFAULT 0,
skipped_count INTEGER DEFAULT 0,
error_count INTEGER DEFAULT 0,
domain TEXT,
subdomain TEXT,
detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
last_checkpoint TEXT
);
```
### zim_samples
```sql
CREATE TABLE zim_samples (
id INTEGER PRIMARY KEY AUTOINCREMENT,
zim_source_id INTEGER REFERENCES zim_sources(id),
article_path TEXT NOT NULL,
article_title TEXT,
text_preview TEXT,
metadata_json TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
### zim_articles
```sql
CREATE TABLE zim_articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
zim_source_id INTEGER REFERENCES zim_sources(id),
article_path TEXT NOT NULL,
article_title TEXT,
qdrant_point_ids TEXT,
status TEXT DEFAULT 'pending', -- pending|embedded|skipped|error
processed_at TIMESTAMP,
UNIQUE(zim_source_id, article_path)
);
```
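The `UNIQUE(zim_source_id, article_path)` constraint makes checkpoint writes idempotent, which is what allows a crashed run to be replayed safely. A minimal sketch, using an in-memory database and a trimmed copy of the schema; it assumes `last_checkpoint` stores the path of the last processed article and that `processed_count` counts embedded articles, neither of which is fixed by this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for /opt/recon/data/recon.db
conn.executescript("""
CREATE TABLE zim_sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_filename TEXT NOT NULL UNIQUE,
    processed_count INTEGER DEFAULT 0,
    last_checkpoint TEXT
);
CREATE TABLE zim_articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id INTEGER REFERENCES zim_sources(id),
    article_path TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    processed_at TIMESTAMP,
    UNIQUE(zim_source_id, article_path)
);
""")


def record_article(conn, source_id: int, path: str, status: str) -> None:
    """Upsert one article row and refresh the source's checkpoint atomically."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(
            """INSERT INTO zim_articles (zim_source_id, article_path, status, processed_at)
               VALUES (?, ?, ?, CURRENT_TIMESTAMP)
               ON CONFLICT(zim_source_id, article_path)
               DO UPDATE SET status = excluded.status,
                             processed_at = excluded.processed_at""",
            (source_id, path, status),
        )
        conn.execute(
            """UPDATE zim_sources
               SET processed_count = (SELECT COUNT(*) FROM zim_articles
                                      WHERE zim_source_id = ? AND status = 'embedded'),
                   last_checkpoint = ?
               WHERE id = ?""",
            (source_id, path, source_id),
        )


source_id = conn.execute(
    "INSERT INTO zim_sources (zim_filename) VALUES (?)", ("appropedia_en.zim",)
).lastrowid
record_article(conn, source_id, "Solar_cooker", "embedded")
record_article(conn, source_id, "Solar_cooker", "embedded")  # replay after a crash
```

The `ON CONFLICT ... DO UPDATE` form needs SQLite 3.24 or newer.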
---
## 7. Article Processing Pipeline
```
ZIM Entry
  │
  ├─ Is redirect? ────────────────────► SKIP
  ├─ mimetype != text/html? ──────────► SKIP
  ├─ Clean text < 200 chars? ─────────► SKIP (stub)
  ▼
HTML → Clean Text (lxml, not BeautifulSoup: 10× faster)
  ▼
Metadata Extraction (per tier)
  ▼
Chunking (~512 tokens, ~50 token overlap)
  ▼
Embedding (bge-m3 dense + sparse via cortex:8090/8091)
  ▼
Qdrant Upsert → recon_knowledge_hybrid
    Payload: source_type, zim_file, zim_source_id, article_title,
             article_path, domain, subdomain, chunk_index,
             total_chunks, keywords, language
  ▼
Checkpoint (update zim_articles + zim_sources.processed_count)
```
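The chunking stage can be illustrated with a sliding window. This is only a sketch of the overlap arithmetic: it treats whitespace-separated words as a stand-in for model tokens, whereas the real pipeline is expected to reuse RECON's existing chunking logic. `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the article
    return chunks


words = [str(i) for i in range(1000)]  # placeholder for a cleaned article's words
chunks = chunk_tokens(words)
print([len(c) for c in chunks])  # [512, 512, 76]
```

Each chunk's position in the returned list maps directly to `chunk_index`, and the list length to `total_chunks`, in the Qdrant payload.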
### Batching & Backpressure
- Batch size: 100 articles (configurable)
- Sleep between batches: 1s (configurable)
- ZIM ingestion runs at lower priority than real-time PDF/stream processing
- Progress logging every 1000 articles
- Resumable via `last_checkpoint`
- **Sparse upserts are slower than dense ones**: this is a known issue in which on-disk sparse indexing degrades progressively. Budget extra ingestion time for it.
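The batching and pause behaviour listed above might be structured as follows. The names `batched` and `run_batches` are hypothetical, and `process_batch` stands in for the clean, chunk, embed, and upsert work on one batch.

```python
import time
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without loading the whole iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def run_batches(articles, process_batch, batch_size=100, pause_s=1.0,
                log_every=1000, sleep=time.sleep):
    """Process articles in batches, pausing between them to yield to other work."""
    done = 0
    for batch in batched(articles, batch_size):
        process_batch(batch)
        done += len(batch)
        if done % log_every < len(batch):  # true when a log_every multiple was crossed
            print(f"processed {done} articles")
        sleep(pause_s)
    return done
```

For example, `run_batches(range(250), handler)` would call `handler` with batches of 100, 100, and 50 items. Making `sleep` injectable keeps the loop testable without real delays.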
### Realistic Scale Estimates
| ZIM | Articles | Est. Chunks | Embedding Time (RTX 3090) |
|-----|----------|-------------|---------------------------|
| Appropedia EN | ~30K | ~60K | ~10 min |
| iFixit EN | ~90K | ~180K | ~25 min |
| Stack Overflow | ~500K | ~1.5M | ~3.5 hr |
| Wikipedia EN nopic | ~5M (non-stub) | ~10M | ~24 hr |
| Wikipedia EN maxi | Same text, +images | ~10M | ~24 hr (same — images not embedded) |
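As a sanity check, the throughput implied by each row can be worked out from the table's own numbers. The snippet below only restates those figures; `implied_rate` is a hypothetical helper.

```python
def implied_rate(chunks: int, minutes: float) -> float:
    """Chunks embedded per second implied by a (chunks, minutes) estimate."""
    return chunks / (minutes * 60)


# (estimated chunks, estimated minutes), copied from the table above.
estimates = {
    "Appropedia EN": (60_000, 10),
    "iFixit EN": (180_000, 25),
    "Stack Overflow": (1_500_000, 3.5 * 60),
    "Wikipedia EN nopic": (10_000_000, 24 * 60),
}
rates = {name: implied_rate(c, m) for name, (c, m) in estimates.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} chunks/s")
```

All four rows imply roughly 100 to 120 chunks per second, so the estimates are at least mutually consistent.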
---
## 8. Systemd Integration
```ini
# /etc/systemd/system/kiwix.service
[Unit]
Description=Kiwix-serve for RECON
After=network.target
PartOf=recon.service
[Service]
User=zvx
ExecStart=/usr/bin/kiwix-serve \
    --library /mnt/kiwix/library.xml \
    --port 8430 \
    --address 0.0.0.0 \
    --threads 4 \
    --nodatealias \
    --blockexternal
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
```
Reload after adding ZIM: `systemctl reload kiwix`
---
## 9. Pre-Implementation Checks
Before writing any code:
1. **Verify cortex:8091 sparse embeddings** — If it's running TEI or Infinity, sparse vectors may be silently broken (dense-only output). Only native FlagEmbedding supports bge-m3 sparse. This affects the entire existing RECON pipeline, not just Kiwix.
2. **Check Qdrant upgrade path** — If Qdrant needs upgrading, must go stepwise: 1.14 → 1.15 → 1.16. Direct jumps corrupt data.
3. **Check cortex RAM** — Determines whether 10-30M additional vectors need quantization config changes.
---
## 10. Implementation Phases
### Phase 1 — Foundation
- Install kiwix-tools via PPA on CT 130
- Create `/mnt/kiwix/` directory (owned by zvx)
- Set up kiwix.service systemd unit
- Create DB schema
- Download a small test ZIM (Appropedia EN — ~495MB maxi)
- Register via kiwix-manage, verify browsable at port 8430
- Implement ZIM monitor daemon (OPDS poll → zim_sources)
### Phase 2 — Extraction & Embedding Pipeline
- Implement python-libzim article extraction with lxml
- Implement article filtering (redirects, stubs, non-HTML)
- Implement chunking (reuse existing RECON logic)
- Implement embedding + Qdrant upsert with ZIM payload schema
- Implement checkpointing and resume
- Implement Tier 1 Wikipedia extractor as proof of concept
- End-to-end test with Appropedia
### Phase 3 — Tiered Enrichment & Gate
- Tier 2 local Qwen3 enrichment path
- Tier 3 Gemini enrichment routing
- Source classification and routing logic
- Review gate (sampling, dashboard notification)
- Test with an unknown ZIM
### Phase 4 — Dashboard UI
- Kiwix library page (loaded ZIMs, status, progress)
- Upload ZIM form
- Catalog browser (OPDS query + download)
- Review queue
- Stats integration
### Phase 5 — Scale Up
- Additional Tier 1 extractors (Stack Exchange, DevDocs, iFixit)
- Wikipedia EN (start with nopic, upgrade to maxi when confident)
- ZIM version management (replace old vectors when new ZIM version arrives)
---
## 11. Open Questions
1. **ZIM version dedup** — When `wikipedia_en_2026-06.zim` replaces `wikipedia_en_2026-03.zim`, purge old vectors by `zim_source_id` filter + delete, then re-embed. Atomic cutover or incremental?
2. **Qdrant collection strategy** — Same `recon_knowledge_hybrid` collection, or separate `recon_kiwix_hybrid`? Same collection means unified search. Separate means independent scaling but requires query fanout.
3. **Pi deployment packaging** — Future exercise. Dense-only, Matryoshka 256-dim, int8 quant. ~3.8GB RAM for 15M vectors. Proven viable, not designed yet.
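The ~3.8GB figure in question 3 follows from raw vector storage alone, as the short calculation below shows. It ignores index structures and payload, so it is a lower bound rather than a full memory budget; `vector_ram_gb` is a hypothetical helper.

```python
def vector_ram_gb(n_vectors: int, dims: int, bytes_per_dim: int = 1) -> float:
    """Raw vector storage in decimal GB (excludes index and payload overhead)."""
    return n_vectors * dims * bytes_per_dim / 1e9


# 15M vectors, 256 dimensions, int8 (1 byte per dimension).
print(vector_ram_gb(15_000_000, 256))  # 3.84
```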