# RECON × Kiwix — ZIM Integration Design (v2)

**Status:** Draft v2 (corrected post-stress-test)
**Date:** 2026-04-16
**Depends on:** RECON v1.0.0 (master, CT 130)

---

## 1. Goal

Integrate Kiwix into RECON as a first-class knowledge source. Users manage ZIM files through the RECON dashboard — uploading directly or pulling from the Kiwix catalog. Kiwix-serve provides a browsable web interface for all loaded ZIMs. RECON detects new ZIMs and runs tiered ingestion: classifying the source, extracting articles via python-libzim, generating metadata, and embedding into Qdrant for Aurora semantic search.

**Future portable deployment:** The full system is built on CT 130/cortex. A stripped-down offline copy (dense-only, reduced dimensions, int8 quantization) will later be packaged for a Raspberry Pi 5 8GB with 512GB NVMe. This is a future packaging exercise — design decisions should not block it, but we build for full capability first.

---

## 2. Where Everything Lives

All on CT 130 (192.168.1.130), running as `zvx`.

| Component | Location |
|-----------|----------|
| kiwix-serve | Installed via Kiwix PPA (`ppa:kiwixteam/release`) or Docker (`ghcr.io/kiwix/kiwix-serve:3.8.2`). **Do NOT use `apt install kiwix-tools`**: Ubuntu 24.04 ships 3.5.0 (2023), which lacks the required OPDS v2 endpoints. Version ≥3.7.0 is required. |
| kiwix-manage | Same package as kiwix-serve. Note: may be deprecated in favor of directory-serving in future libkiwix versions. Design for both. |
| ZIM file storage | `/mnt/kiwix/` (separate bind mount of `/mnt/data/kiwix/` on the data host; same SSD as the library but NOT inside it, because the library is reserved for curated, human-browsable PDFs) |
| Kiwix library XML | `/mnt/kiwix/library.xml` (auto-managed, treat as legacy intermediate) |
| ZIM tracking DB | `recon.db` (new tables, see §6) |
| Ingestion code | `/opt/recon/lib/zim_pipeline.py` (new module) |
| Dashboard UI | `recon.echo6.co` (new Kiwix management page) |

---

## 3. Service Architecture

### 3.1 kiwix-serve (browsing + OPDS catalog)

Runs as a companion systemd unit. Provides:
- **Web browsing UI** for all loaded ZIMs (Wikipedia, Stack Exchange, etc. — fully browsable with images if using `_maxi` variants)
- **OPDS v2 catalog** at `/catalog/v2/entries` for RECON to enumerate loaded ZIMs
- **Keyword search** via `/search` endpoint (HTML/XML only — **no JSON support exists**)

```
kiwix-serve --library /mnt/kiwix/library.xml \
  --port 8430 \
  --address 0.0.0.0 \
  --threads 4 \
  --nodatealias \
  --blockexternal
```

- Bound to 0.0.0.0 — browsable from the local network (kiwix.echo6.co or similar)
- Port 8430 (next to RECON dashboard on 8420)
- `--blockexternal` prevents outbound link navigation from served content
- **NO `--monitorLibrary`** — documented bugs (zombie processes, high CPU). Use SIGHUP on demand: when RECON adds a new ZIM, it sends `kill -HUP <pid>` to trigger library reload.

### 3.2 python-libzim (article extraction)

kiwix-serve is for **browsing only**. All RAG article extraction uses `python-libzim` (PyPI package `libzim` v3.9.0) directly. This is faster, gives full control over filtering, and avoids the no-JSON problem with kiwix-serve's search API.

Key API patterns:
```python
from libzim.reader import Archive

zim = Archive("path/to/file.zim")

# Reliable article count (NOT zim.article_count, which is inflated 2-3×):
counter_meta = zim.get_metadata("Counter").decode()
# Returns: "text/html=6467891;image/webp=3211054;text/css=23"
# parse_counter is a local helper (not part of libzim) that turns the
# "mimetype=count;..." string into a dict of integer counts.
html_count = parse_counter(counter_meta)["text/html"]

# ZIM metadata:
title = zim.get_metadata("Title").decode()
description = zim.get_metadata("Description").decode()
language = zim.get_metadata("Language").decode()

# Article iteration:
for i in range(zim.entry_count):
    entry = zim._get_entry_by_id(i)
    if entry.is_redirect:
        continue
    item = entry.get_item()
    if item.mimetype != "text/html":  # mimetype is on Item, not Entry
        continue
    content = bytes(item.content)
    # process content...
```

**Important:** Modern ZIMs (2022+) use the "new namespace scheme" — flat paths, no `A/`/`I/`/`M/` prefixes. Do not use `iter_by_namespace('A')`.
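
The extraction snippet calls `parse_counter`, which python-libzim does not provide. A minimal sketch of such a helper, assuming the `Counter` string is a plain `mimetype=count` list separated by semicolons, as in the example comment (MIME types that carry their own `;` parameters would need extra handling):

```python
def parse_counter(counter: str) -> dict[str, int]:
    """Parse a ZIM Counter string such as 'text/html=10;image/webp=3'.

    Assumption: entries are separated by ';' and no MIME type carries
    its own ';' parameter. Malformed entries are skipped.
    """
    counts: dict[str, int] = {}
    for part in counter.split(";"):
        # Split on the last '=' so the count is always the final field.
        mimetype, sep, value = part.rpartition("=")
        if not sep or not value.strip().isdigit():
            continue  # skip empty or malformed entries
        counts[mimetype.strip()] = int(value)
    return counts


sample = "text/html=6467891;image/webp=3211054;text/css=23"
html_count = parse_counter(sample)["text/html"]
```

A missing `text/html` key raises `KeyError` here; a caller that wants to tolerate ZIMs without HTML entries could use `.get("text/html", 0)` instead.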

### 3.3 RECON ZIM Monitor Daemon

Runs as a new thread in `recon.service` (daemon #8). It polls kiwix-serve's OPDS catalog (`/catalog/v2/entries`) every 60 seconds, compares the result against the `zim_sources` table in `recon.db`, and queues new ZIMs for ingestion. It also detects removed ZIMs.

**Do not trust OPDS `articleCount`** — it's inflated 2-3× for large ZIMs. Use python-libzim's `Counter` metadata for accurate counts after detection.
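
One polling cycle reduces to a set difference between the IDs in the OPDS feed and the IDs already tracked. A minimal sketch, assuming each Atom `<entry>` carries an `<id>` and a `<title>` (both standard Atom elements). The sample feed, the function names, and the choice to key on the entry ID are illustrative assumptions, not the RECON implementation:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up feed for illustration; a real response from
# /catalog/v2/entries carries many more fields per entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:uuid:aaaa</id><title>Appropedia</title></entry>
  <entry><id>urn:uuid:bbbb</id><title>iFixit</title></entry>
</feed>"""


def parse_entries(feed_xml: str) -> dict[str, str]:
    """Return {entry id: title} for every Atom entry in the feed."""
    root = ET.fromstring(feed_xml)
    entries = {}
    for entry in root.iter(f"{ATOM}entry"):
        entry_id = entry.findtext(f"{ATOM}id")
        if entry_id:
            entries[entry_id] = entry.findtext(f"{ATOM}title", default="")
    return entries


def diff_catalog(feed_ids: set[str], tracked_ids: set[str]) -> tuple[set[str], set[str]]:
    """Return (new, removed) relative to what zim_sources already tracks."""
    return feed_ids - tracked_ids, tracked_ids - feed_ids


served = parse_entries(SAMPLE_FEED)
tracked = {"urn:uuid:bbbb", "urn:uuid:cccc"}  # stand-in for rows in zim_sources
new_ids, removed_ids = diff_catalog(set(served), tracked)
```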

---

## 4. ZIM Acquisition (How ZIMs Get In)

### 4.1 Direct Upload

The user uploads a `.zim` file through the RECON dashboard. The dashboard saves it to `/mnt/kiwix/`, runs `kiwix-manage library.xml add <file.zim>`, and sends SIGHUP to kiwix-serve. The ZIM monitor detects the new file on its next poll.
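
The registration step after an upload can be sketched as below. The `kiwix-manage ... add` invocation and the SIGHUP come from the text; the function names, the example file name, and the way the kiwix-serve PID reaches the caller are assumptions for illustration:

```python
import os
import signal
import subprocess

LIBRARY_XML = "/mnt/kiwix/library.xml"


def build_add_command(zim_path: str, library: str = LIBRARY_XML) -> list[str]:
    """Build the kiwix-manage command that registers a ZIM in the library."""
    return ["kiwix-manage", library, "add", zim_path]


def register_zim(zim_path: str, kiwix_pid: int) -> None:
    """Add the ZIM to library.xml, then ask kiwix-serve to reload it.

    kiwix_pid is assumed to be known to the caller (for example read
    from systemd); how RECON tracks it is not specified in this design.
    """
    # check=True raises CalledProcessError if kiwix-manage fails,
    # so kiwix-serve is never reloaded after a failed registration.
    subprocess.run(build_add_command(zim_path), check=True)
    os.kill(kiwix_pid, signal.SIGHUP)


# Hypothetical file name, only to show the resulting argv.
cmd = build_add_command("/mnt/kiwix/appropedia_en.zim")
```

Passing the command as a list avoids shell quoting problems with uploaded file names.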

### 4.2 Catalog Pull

The RECON dashboard exposes a "Browse Kiwix Catalog" page. It queries the public OPDS catalog at `https://library.kiwix.org/catalog/v2/entries` with filters (lang, category, search). **Response is Atom XML only**; there is no JSON. The user picks a ZIM and confirms, then RECON downloads it to `/mnt/kiwix/` via torrent (preferred for large files) or direct HTTP.
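
A catalog query is an HTTP GET with the filters encoded as query parameters. The sketch below only builds the URL. The parameter names `lang`, `category`, `q`, and `count` are assumptions based on the filters listed above and should be checked against the catalog API of the deployed kiwix-serve version:

```python
from urllib.parse import urlencode

CATALOG_URL = "https://library.kiwix.org/catalog/v2/entries"


def catalog_query_url(lang=None, category=None, search=None, count=50) -> str:
    """Build the catalog URL, leaving out any filter the user did not set."""
    params = {"lang": lang, "category": category, "q": search, "count": count}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{CATALOG_URL}?{query}"


url = catalog_query_url(lang="eng", category="wikivoyage")
```

The response body is Atom XML, so it needs an XML parser rather than `json.loads`.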

### 4.3 ZIM Variant Strategy

| Variant | Content | Use Case |
|---------|---------|----------|
| `_maxi` | Full text + all images | Browsing via kiwix-serve. Plan for this as default. |
| `_nopic` | Full text, no images | RAG-only (if disk constrained). ~50% smaller. |
| `_mini` | Intro + infobox only | Not useful for RAG. |

**Start small.** Prove the pipeline with Wikivoyage, iFixit, or a focused Stack Exchange before tackling Wikipedia. Stack ZIMs and monitor storage/RAM budget as you go.

---

## 5. Tiered Ingestion Pipeline

When the ZIM monitor detects a new ZIM, it:

1. **Reads ZIM metadata** via python-libzim (`Counter`, `Title`, `Description`, `Language`, `Tags`, `Name`)
2. **Classifies the source** based on ZIM name/tag patterns
3. **Routes to the appropriate tier**
4. **Samples first** if the source is unknown (see §5.4)

### 5.1 Tier 1 — Known Large Sources (Deterministic Extractors)

**Trigger:** Known source name patterns, any article count
**Enrichment:** None — structural metadata from HTML
**Cost:** Zero

| ZIM Pattern | Domain/Subdomain | Metadata Source |
|-------------|-----------------|-----------------|
| `wikipedia_*` | Reference/Wikipedia | Title, categories, infobox fields, section headings |
| `wiktionary_*` | Reference/Wiktionary | Word, part of speech, definitions |
| `wikibooks_*` | Reference/Wikibooks | Book title, chapter, subject |
| `wikisource_*` | Reference/Wikisource | Work title, author, year |
| `wikiversity_*` | Reference/Wikiversity | Course, subject area |
| `wikivoyage_*` | Reference/Wikivoyage | Destination, region, travel topic |
| `stack_exchange_*` | Reference/StackExchange | Tags, vote score, accepted answer flag |
| `devdocs_*` | Reference/DevDocs | Language, framework, API |
| `ifixit_*` | Maintenance/Repair | Device, category, difficulty |

### 5.2 Tier 2 — Unknown Mid/Large Sources (Local Qwen3 Enrichment)

**Trigger:** >10K articles AND no Tier 1 extractor match
**Enrichment:** Local Ollama Qwen3 8B (`aurora` model)
**Cost:** Zero (cortex compute time only)

Lightweight prompt per article → JSON with domain, subdomain, summary, keywords.
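
Local model output is not guaranteed to be valid JSON, so the enrichment step needs a validation pass before anything reaches Qdrant. A minimal sketch, assuming the four fields named above. The prompt wording, the function names, and the fallback behaviour (returning `None`) are illustrative assumptions, and the call to Ollama itself is left out:

```python
import json
from typing import Optional

REQUIRED_TEXT_FIELDS = ("domain", "subdomain", "summary")


def build_prompt(title: str, text: str, max_chars: int = 2000) -> str:
    """Build a short classification prompt from the start of an article."""
    return (
        "Return only JSON with keys domain, subdomain, summary, keywords "
        "(keywords is a list of strings).\n"
        f"Title: {title}\nText: {text[:max_chars]}"
    )


def parse_enrichment(raw: str) -> Optional[dict]:
    """Return the validated metadata dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(not isinstance(data.get(k), str) for k in REQUIRED_TEXT_FIELDS):
        return None
    keywords = data.get("keywords")
    if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
        return None
    # Keep only the expected keys so stray model output never reaches the payload.
    result = {k: data[k] for k in REQUIRED_TEXT_FIELDS}
    result["keywords"] = keywords
    return result


good = parse_enrichment(
    '{"domain": "Reference", "subdomain": "Botany", '
    '"summary": "About ferns.", "keywords": ["fern", "spore"], "extra": 1}'
)
bad = parse_enrichment("Sure! Here is the JSON you asked for...")
```

Articles whose reply fails validation could be retried or counted in `error_count`; which of the two applies is left open here.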

### 5.3 Tier 3 — Small Unknown Sources (Gemini Enrichment)

**Trigger:** ≤10K articles AND no Tier 1 extractor match
**Enrichment:** Gemini API (existing RECON enrichment pipeline)
**Cost:** Low (small article count, few dollars max)

### 5.4 The Gate — Unknown Source Review

When a ZIM doesn't match any Tier 1 pattern AND exceeds Tier 3 threshold:

1. Extract random sample of 50 articles
2. Log to review queue in `recon.db`
3. Dashboard notification: *"New ZIM: obscure_wiki.zim — 47,832 articles — no known extractor — sample ready"*
4. Ingestion **paused** until user reviews and approves
5. User can: approve for Tier 2, assign domain/subdomain, reject, or write a new extractor
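
Taken together, the triggers in §5.1 to §5.4 amount to a small routing function. The sketch below is one reading of those rules: the pattern list is copied from the Tier 1 table and the 10K threshold from the Tier 2 and Tier 3 triggers, while the names `Route` and `route_zim` are hypothetical. Unknown large sources go to the review gate, and Tier 2 only starts once the user approves:

```python
from enum import Enum
from fnmatch import fnmatchcase

TIER1_PATTERNS = (
    "wikipedia_*", "wiktionary_*", "wikibooks_*", "wikisource_*",
    "wikiversity_*", "wikivoyage_*", "stack_exchange_*", "devdocs_*",
    "ifixit_*",
)
SMALL_SOURCE_LIMIT = 10_000  # Tier 3 applies at or below this article count


class Route(Enum):
    TIER1 = "tier1"    # deterministic extractor, no enrichment
    REVIEW = "review"  # gate: sample 50 articles, wait for approval
    TIER3 = "tier3"    # Gemini enrichment


def route_zim(zim_name: str, article_count: int) -> Route:
    """Pick the ingestion route from the ZIM name and its Counter-based count."""
    if any(fnmatchcase(zim_name, pattern) for pattern in TIER1_PATTERNS):
        return Route.TIER1
    if article_count <= SMALL_SOURCE_LIMIT:
        return Route.TIER3
    return Route.REVIEW


routes = [
    route_zim("wikipedia_en_all_maxi", 6_000_000),
    route_zim("obscure_wiki", 47_832),
    route_zim("obscure_wiki", 10_000),
]
```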

---

## 6. Database Schema

New tables in `/opt/recon/data/recon.db`:

### zim_sources
```sql
CREATE TABLE zim_sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_filename TEXT NOT NULL UNIQUE,
    zim_path TEXT NOT NULL,
    zim_uuid TEXT,
    title TEXT,
    description TEXT,
    language TEXT,
    category TEXT,
    article_count INTEGER DEFAULT 0,       -- from Counter metadata, NOT OPDS
    ingestion_tier INTEGER,
    status TEXT DEFAULT 'detected',        -- detected|sampling|review|ingesting|complete|error|rejected
    processed_count INTEGER DEFAULT 0,
    skipped_count INTEGER DEFAULT 0,
    error_count INTEGER DEFAULT 0,
    domain TEXT,
    subdomain TEXT,
    detected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    last_checkpoint TEXT
);
```

### zim_samples
```sql
CREATE TABLE zim_samples (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id INTEGER REFERENCES zim_sources(id),
    article_path TEXT NOT NULL,
    article_title TEXT,
    text_preview TEXT,
    metadata_json TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

### zim_articles
```sql
CREATE TABLE zim_articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id INTEGER REFERENCES zim_sources(id),
    article_path TEXT NOT NULL,
    article_title TEXT,
    qdrant_point_ids TEXT,
    status TEXT DEFAULT 'pending',         -- pending|embedded|skipped|error
    processed_at TIMESTAMP,
    UNIQUE(zim_source_id, article_path)
);
```

---

## 7. Article Processing Pipeline

```
ZIM Entry
  ├─ Is redirect? ────────────────────► SKIP
  ├─ mimetype != text/html? ──────────► SKIP
  ├─ Clean text < 200 chars? ─────────► SKIP (stub)
  ▼
HTML → Clean Text (lxml, not BeautifulSoup — 10× faster)
  ▼
Metadata Extraction (per tier)
  ▼
Chunking (~512 tokens, ~50 token overlap)
  ▼
Embedding (bge-m3 dense + sparse via cortex:8090/8091)
  ▼
Qdrant Upsert → recon_knowledge_hybrid
  Payload: source_type, zim_file, zim_source_id, article_title,
           article_path, domain, subdomain, chunk_index,
           total_chunks, keywords, language
  ▼
Checkpoint (update zim_articles + zim_sources.processed_count)
```
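
The chunking step can be sketched as a sliding window. This version counts whitespace-separated words as a stand-in for tokens, an assumption made to keep the example dependency-free. The real pipeline is expected to reuse RECON's existing chunker, and the function name `chunk_words` is hypothetical:

```python
def chunk_words(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 1000-word article: w0 w1 ... w999
article = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(article)
```

The early `break` avoids emitting a trailing chunk that would contain only words already covered by the previous window.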

### Batching & Backpressure
- Batch size: 100 articles (configurable)
- Sleep between batches: 1s (configurable)
- ZIM ingestion runs at lower priority than real-time PDF/stream processing
- Progress logging every 1000 articles
- Resumable via `last_checkpoint`
- **Sparse upserts are slower** — known issue where on-disk sparse indexing progressively degrades. Budget for this.
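
Resumability depends on the article rows and the source counters being updated in the same transaction, so that a crash never leaves them out of step. A minimal sketch using an in-memory SQLite database and a cut-down version of the §6 tables. Storing the last processed article path in `last_checkpoint` is an assumption, since the design does not define what the column holds, and the article paths are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zim_sources (
    id INTEGER PRIMARY KEY,
    processed_count INTEGER DEFAULT 0,
    last_checkpoint TEXT
);
CREATE TABLE zim_articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    zim_source_id INTEGER,
    article_path TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    UNIQUE(zim_source_id, article_path)
);
INSERT INTO zim_sources (id) VALUES (1);
""")


def checkpoint_batch(conn, source_id: int, paths: list[str]) -> None:
    """Record a finished batch atomically: article rows plus source counters."""
    if not paths:
        return
    with conn:  # commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT INTO zim_articles (zim_source_id, article_path, status) "
            "VALUES (?, ?, 'embedded') "
            "ON CONFLICT(zim_source_id, article_path) "
            "DO UPDATE SET status = 'embedded'",
            [(source_id, path) for path in paths],
        )
        # Recount instead of incrementing, so a replayed batch cannot inflate it.
        conn.execute(
            "UPDATE zim_sources SET processed_count = "
            "(SELECT COUNT(*) FROM zim_articles "
            " WHERE zim_source_id = ? AND status = 'embedded'), "
            "last_checkpoint = ? WHERE id = ?",
            (source_id, paths[-1], source_id),
        )


checkpoint_batch(conn, 1, ["Solar_cooker", "Rainwater_harvesting"])
checkpoint_batch(conn, 1, ["Rainwater_harvesting", "Biogas"])  # partly replayed batch
row = conn.execute(
    "SELECT processed_count, last_checkpoint FROM zim_sources WHERE id = 1"
).fetchone()
```

Because of the `UNIQUE(zim_source_id, article_path)` constraint, replaying a batch after a restart updates the existing rows rather than duplicating them.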

### Realistic Scale Estimates

| ZIM | Articles | Est. Chunks | Embedding Time (RTX 3090) |
|-----|----------|-------------|---------------------------|
| Appropedia EN | ~30K | ~60K | ~10 min |
| iFixit EN | ~90K | ~180K | ~25 min |
| Stack Overflow | ~500K | ~1.5M | ~3.5 hr |
| Wikipedia EN nopic | ~5M (non-stub) | ~10M | ~24 hr |
| Wikipedia EN maxi | Same text, +images | ~10M | ~24 hr (same — images not embedded) |
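
The estimates imply a fairly steady embedding rate of roughly 100 to 120 chunks per second, which is a quick way to sanity-check any new row. The arithmetic below only divides the table's chunk counts by its times; it does not model the sparse-upsert slowdown noted under Batching & Backpressure:

```python
# (estimated chunks, estimated seconds), taken from the table above
estimates = {
    "Appropedia EN": (60_000, 10 * 60),
    "iFixit EN": (180_000, 25 * 60),
    "Stack Overflow": (1_500_000, 3.5 * 3600),
    "Wikipedia EN nopic": (10_000_000, 24 * 3600),
}

# Implied throughput per source, rounded to whole chunks per second.
chunks_per_second = {
    name: round(chunks / seconds) for name, (chunks, seconds) in estimates.items()
}
```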

---

## 8. Systemd Integration

```ini
# /etc/systemd/system/kiwix.service
[Unit]
Description=Kiwix-serve for RECON
After=network.target
PartOf=recon.service

[Service]
User=zvx
ExecStart=/usr/bin/kiwix-serve \
    --library /mnt/kiwix/library.xml \
    --port 8430 \
    --address 0.0.0.0 \
    --threads 4 \
    --nodatealias \
    --blockexternal
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Reload after adding a ZIM: `systemctl reload kiwix`

---

## 9. Pre-Implementation Checks

Before writing any code:

1. **Verify cortex:8091 sparse embeddings** — If it's running TEI or Infinity, sparse vectors may be silently broken (dense-only output). Only native FlagEmbedding supports bge-m3 sparse. This affects the entire existing RECON pipeline, not just Kiwix.

2. **Check Qdrant upgrade path** — If Qdrant needs upgrading, the upgrade must go stepwise: 1.14 → 1.15 → 1.16. Direct jumps corrupt data.

3. **Check cortex RAM** — Determines whether 10-30M additional vectors need quantization config changes.
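
For check 3, raw vector storage is simply count × dimensions × bytes per component. The sketch below assumes 1024-dimensional float32 dense vectors (the bge-m3 dense output size) and ignores index overhead, payloads, and sparse vectors, so the real footprint will be higher. The last line applies the same formula to the portable configuration from §11 (256 dimensions, int8), which reproduces its ~3.8GB figure:

```python
def raw_vector_bytes(count: int, dims: int, bytes_per_component: int) -> int:
    """Raw storage for dense vectors, excluding index and payload overhead."""
    return count * dims * bytes_per_component


GB = 10**9  # decimal gigabytes

full_low = raw_vector_bytes(10_000_000, 1024, 4) / GB   # float32, 10M vectors
full_high = raw_vector_bytes(30_000_000, 1024, 4) / GB  # float32, 30M vectors
int8_high = raw_vector_bytes(30_000_000, 1024, 1) / GB  # int8, 30M vectors
pi_config = raw_vector_bytes(15_000_000, 256, 1) / GB   # §11 portable config
```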

---

## 10. Implementation Phases

### Phase 1 — Foundation
- Install kiwix-tools via PPA on CT 130
- Create `/mnt/kiwix/` directory (owned by zvx)
- Set up kiwix.service systemd unit
- Create DB schema
- Download a small test ZIM (Appropedia EN — ~495MB maxi)
- Register via kiwix-manage, verify browsable at port 8430
- Implement ZIM monitor daemon (OPDS poll → zim_sources)

### Phase 2 — Extraction & Embedding Pipeline
- Implement python-libzim article extraction with lxml
- Implement article filtering (redirects, stubs, non-HTML)
- Implement chunking (reuse existing RECON logic)
- Implement embedding + Qdrant upsert with ZIM payload schema
- Implement checkpointing and resume
- Implement Tier 1 Wikipedia extractor as proof of concept
- End-to-end test with Appropedia

### Phase 3 — Tiered Enrichment & Gate
- Tier 2 local Qwen3 enrichment path
- Tier 3 Gemini enrichment routing
- Source classification and routing logic
- Review gate (sampling, dashboard notification)
- Test with an unknown ZIM

### Phase 4 — Dashboard UI
- Kiwix library page (loaded ZIMs, status, progress)
- Upload ZIM form
- Catalog browser (OPDS query + download)
- Review queue
- Stats integration

### Phase 5 — Scale Up
- Additional Tier 1 extractors (Stack Exchange, DevDocs, iFixit)
- Wikipedia EN (start with nopic, upgrade to maxi when confident)
- ZIM version management (replace old vectors when new ZIM version arrives)

---

## 11. Open Questions

1. **ZIM version dedup** — When `wikipedia_en_2026-06.zim` replaces `wikipedia_en_2026-03.zim`, purge old vectors by `zim_source_id` filter + delete, then re-embed. Atomic cutover or incremental?

2. **Qdrant collection strategy** — Same `recon_knowledge_hybrid` collection, or separate `recon_kiwix_hybrid`? Same collection means unified search. Separate means independent scaling but requires query fanout.

3. **Pi deployment packaging** — Future exercise. Dense-only, Matryoshka 256-dim, int8 quant. ~3.8GB RAM for 15M vectors. Proven viable, not designed yet.
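
For open question 1, whichever cutover is chosen, the purge itself is a filtered delete. The sketch below builds the request body for Qdrant's REST endpoint `POST /collections/{collection}/points/delete`, filtering on the `zim_source_id` payload field from §7. The function name and the example ID are hypothetical, and the request is only constructed, not sent:

```python
import json


def purge_request(collection: str, zim_source_id: int) -> tuple[str, str]:
    """Return (path, JSON body) for deleting every point of one ZIM source."""
    path = f"/collections/{collection}/points/delete"
    body = {
        "filter": {
            "must": [{"key": "zim_source_id", "match": {"value": zim_source_id}}]
        }
    }
    return path, json.dumps(body)


# 42 is a placeholder for the zim_sources.id of the outdated ZIM.
path, body = purge_request("recon_knowledge_hybrid", 42)
```

If the old and new ZIM versions get distinct `zim_source_id` values, the new version could be embedded first and the old ID purged afterwards; that ordering is a suggestion, not a decision recorded in this design.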