mirror of
https://github.com/zvx-echo6/recon.git
synced 2026-05-20 14:44:54 +02:00
785 lines
27 KiB
Markdown
785 lines
27 KiB
Markdown
|
|
# RECON Project Bible v2.0
|
||
|
|
|
||
|
|
*Last updated: 2026-02-16*
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 1. Mission Statement
|
||
|
|
|
||
|
|
RECON (Reconnaissance, Extraction, Conceptualization, and Operationalization of kNowledge) is a knowledge extraction pipeline that processes PDFs and web content into structured concepts stored in a Qdrant vector database. These concepts power Aurora, the RAG-enabled AI assistant running on OpenWebUI.
|
||
|
|
|
||
|
|
**The core loop:** Content in (PDF/web) -> Text extracted -> Concepts enriched (Gemini) -> Vectors embedded (TEI/BGE-M3) -> Searchable knowledge (Qdrant) -> Aurora answers questions with citations.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 2. Infrastructure
|
||
|
|
|
||
|
|
### Hosts
|
||
|
|
|
||
|
|
| Host | IP (Tailscale) | Role |
|
||
|
|
|------|---------------|------|
|
||
|
|
| recon LXC | 100.64.0.24 (CT 130 on toc) | RECON application, dashboard, pipeline |
|
||
|
|
| cortex VM | 100.64.0.14 (VM 150 on toc) | Qdrant, TEI, Ollama, OpenWebUI |
|
||
|
|
| pi-nas | 100.64.0.21 (192.168.1.245) | NFS file server for PDF library |
|
||
|
|
| Contabo VPS | 100.64.0.1 (5.189.158.149) | Backup destination |
|
||
|
|
|
||
|
|
### Services on cortex (100.64.0.14)
|
||
|
|
|
||
|
|
| Service | Port | Purpose |
|
||
|
|
|---------|------|---------|
|
||
|
|
| Qdrant | 6333 | Vector database (recon_knowledge collection) |
|
||
|
|
| TEI (text-embeddings-inference) | 8090 | Embedding server (bge-m3, 1024-dim, ~1,711 emb/sec) |
|
||
|
|
| Ollama | 11434 | LLM server + fallback embeddings (~8 emb/sec) |
|
||
|
|
| OpenWebUI | 8080 | Aurora chat interface (ai.echo6.co) |
|
||
|
|
|
||
|
|
### Services on recon LXC (100.64.0.24)
|
||
|
|
|
||
|
|
| Service | Port | Purpose |
|
||
|
|
|---------|------|---------|
|
||
|
|
| RECON Dashboard | 8420 | Web UI + API for pipeline management |
|
||
|
|
| File Server | 8888 | PDF downloads (files.echo6.co) |
|
||
|
|
|
||
|
|
### NFS Mount
|
||
|
|
|
||
|
|
```
|
||
|
|
pi-nas:/export/library -> /mnt/library (22TB, rw, NFSv3)
|
||
|
|
```
|
||
|
|
|
||
|
|
Contains ~13,000+ PDFs across:
|
||
|
|
- `Survival-Companion-Library/` (~12,900 PDFs in ~220 subdirectories)
|
||
|
|
- `Army_Pubs/` (~160 military field manuals)
|
||
|
|
- Other: `Gaming/`, `Reference/`, `Technical/`
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 3. Architecture Overview
|
||
|
|
|
||
|
|
```
|
||
|
|
/mnt/library/ (NFS)
|
||
|
|
|
|
||
|
|
[recon scan]
|
||
|
|
|
|
||
|
|
catalogue (SQLite)
|
||
|
|
|
|
||
|
|
[recon queue]
|
||
|
|
|
|
||
|
|
+-----------+ [recon extract] +-----------+
|
||
|
|
| PyPDF2 |--> data/text/ | Gemini |
|
||
|
|
| pdftotext | {hash}/page_N.txt | Flash |
|
||
|
|
| tesseract | | | 4 keys |
|
||
|
|
+-----------+ [recon enrich] +-----------+
|
||
|
|
|
|
||
|
|
data/concepts/
|
||
|
|
{hash}/window_N.json
|
||
|
|
|
|
||
|
|
[recon embed]
|
||
|
|
|
|
||
|
|
+----------+-----------+
|
||
|
|
| TEI (primary) |
|
||
|
|
| bge-m3, 1024-dim |
|
||
|
|
| 1,711 emb/sec |
|
||
|
|
+----------+-----------+
|
||
|
|
|
|
||
|
|
Qdrant (cortex:6333)
|
||
|
|
recon_knowledge collection
|
||
|
|
|
|
||
|
|
Aurora (OpenWebUI)
|
||
|
|
RAG search + citations
|
||
|
|
```
|
||
|
|
|
||
|
|
### Web Content Path
|
||
|
|
|
||
|
|
```
|
||
|
|
URL(s) ──> [recon ingest-url / crawl]
|
||
|
|
|
|
||
|
|
trafilatura extraction
|
||
|
|
chunk into ~2000-word pages
|
||
|
|
|
|
||
|
|
data/text/{hash}/page_N.txt
|
||
|
|
(enters at "extracted" status)
|
||
|
|
|
|
||
|
|
[enrich] -> [embed]
|
||
|
|
(same as PDF path)
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 4. Pipeline Stages
|
||
|
|
|
||
|
|
### Status Flow
|
||
|
|
|
||
|
|
```
|
||
|
|
catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete
|
||
|
|
\-> failed
|
||
|
|
```
|
||
|
|
|
||
|
|
Web content enters at `extracted` status (text already extracted by trafilatura).
|
||
|
|
|
||
|
|
### Stage Details
|
||
|
|
|
||
|
|
| Stage | Tool | Input | Output | Speed |
|
||
|
|
|-------|------|-------|--------|-------|
|
||
|
|
| Scan | `recon scan` | /mnt/library/*.pdf | catalogue table | ~13K PDFs in ~30 min |
|
||
|
|
| Queue | `recon queue` | catalogue entries | documents table (status=queued) | Instant |
|
||
|
|
| Extract | `recon extract` | PDF files | data/text/{hash}/page_NNNN.txt | 4 workers, ~200/hr |
|
||
|
|
| Enrich | `recon enrich` | Text pages (10-page windows) | data/concepts/{hash}/window_N.json | 16 workers, 4 Gemini keys |
|
||
|
|
| Embed | `recon embed` | Concept JSONs | Qdrant vectors | TEI: 1,711 emb/sec |
|
||
|
|
|
||
|
|
### Extraction Fallback Chain
|
||
|
|
|
||
|
|
1. **PyPDF2** (fast, clean text) -> 2. **pdftotext** (handles complex layouts) -> 3. **Tesseract OCR** (scanned documents)
|
||
|
|
|
||
|
|
### Enrichment Details
|
||
|
|
|
||
|
|
- Model: `gemini-2.0-flash`
|
||
|
|
- Window size: 10 pages per API call (configurable)
|
||
|
|
- Workers: 16 concurrent (4 API keys x 4 workers each)
|
||
|
|
- Output format: JSON array of concept objects
|
||
|
|
- **CRITICAL**: Concept JSONs are saved to disk BEFORE any database operations
|
||
|
|
- Key rotation via `KeyRotator` class distributing across 4 Gemini API keys
|
||
|
|
|
||
|
|
### Embedding Details
|
||
|
|
|
||
|
|
- **Primary**: TEI at cortex:8090 (bge-m3 model, 1024 dimensions, ~1,711 embeddings/sec)
|
||
|
|
- **Fallback**: Ollama at cortex:11434 (bge-m3 model, ~8 embeddings/sec)
|
||
|
|
- Batch size: 128 embeddings per TEI request
|
||
|
|
- Distance metric: Cosine similarity
|
||
|
|
- **CRITICAL**: Dimensions are 1024 (bge-m3), NOT 384. Getting this wrong creates silent failures.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 5. Directory Structure
|
||
|
|
|
||
|
|
```
|
||
|
|
/opt/recon/ # Application root
|
||
|
|
recon.py # CLI entry point
|
||
|
|
config.yaml # Central configuration
|
||
|
|
.env # Gemini API keys (4 keys)
|
||
|
|
requirements.txt # Python dependencies
|
||
|
|
PROJECT-BIBLE.md # This file
|
||
|
|
README.md # Quick-start reference
|
||
|
|
run-full-pipeline.sh # Background pipeline runner
|
||
|
|
|
||
|
|
lib/ # Core modules
|
||
|
|
__init__.py
|
||
|
|
api.py # Flask web dashboard + API (port 8420)
|
||
|
|
crawler.py # Site crawler (sitemap + BFS link-following)
|
||
|
|
embedder.py # Concept -> vector embedding (TEI/Ollama -> Qdrant)
|
||
|
|
enricher.py # Text -> concept extraction (Gemini)
|
||
|
|
extractor.py # PDF -> text extraction (PyPDF2/pdftotext/OCR)
|
||
|
|
ingester.py # ARGUS intel feed intake
|
||
|
|
status.py # SQLite DB operations (catalogue + documents)
|
||
|
|
utils.py # Config, hashing, URL generation, logging
|
||
|
|
web_scraper.py # URL -> text extraction (trafilatura)
|
||
|
|
|
||
|
|
scripts/ # Operational scripts
|
||
|
|
backup.sh # Automated backup to Contabo (cron every 6h)
|
||
|
|
rebuild_qdrant.py # Nuclear recovery: re-embed all concepts
|
||
|
|
validate.py # Pipeline consistency validation
|
||
|
|
|
||
|
|
data/ # Pipeline data (on local disk)
|
||
|
|
recon.db # SQLite status database
|
||
|
|
text/ # Extracted text
|
||
|
|
{content_hash}/
|
||
|
|
meta.json # Document metadata
|
||
|
|
page_0001.txt # Page text (4-digit, 1-indexed)
|
||
|
|
page_0002.txt
|
||
|
|
...
|
||
|
|
concepts/ # Enriched concepts (**BACK THESE UP**)
|
||
|
|
{content_hash}/
|
||
|
|
window_1.json # Concept JSON array (10-page window)
|
||
|
|
window_2.json
|
||
|
|
...
|
||
|
|
intel/ # ARGUS intel feeds
|
||
|
|
|
||
|
|
logs/ # Application logs
|
||
|
|
recon.log # Main rotating log
|
||
|
|
backup.log # Backup operation log
|
||
|
|
backup_cron.log # Cron backup log
|
||
|
|
|
||
|
|
venv/ # Python virtual environment
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 6. Database Schema
|
||
|
|
|
||
|
|
### SQLite (data/recon.db)
|
||
|
|
|
||
|
|
Two tables in WAL mode with thread-local connections.
|
||
|
|
|
||
|
|
#### catalogue
|
||
|
|
|
||
|
|
| Column | Type | Description |
|
||
|
|
|--------|------|-------------|
|
||
|
|
| hash | TEXT PK | MD5 content hash |
|
||
|
|
| filename | TEXT | Original filename |
|
||
|
|
| path | TEXT | Full filesystem path |
|
||
|
|
| size_bytes | INTEGER | File size |
|
||
|
|
| source | TEXT | Top-level directory (e.g., "Survival-Companion-Library") |
|
||
|
|
| category | TEXT | Second-level directory (e.g., "Bushcraft") |
|
||
|
|
| status | TEXT | "catalogued" or "processed" |
|
||
|
|
| discovered_at | TEXT | ISO timestamp |
|
||
|
|
|
||
|
|
#### documents
|
||
|
|
|
||
|
|
| Column | Type | Description |
|
||
|
|
|--------|------|-------------|
|
||
|
|
| hash | TEXT PK | MD5 content hash |
|
||
|
|
| filename | TEXT | Original filename |
|
||
|
|
| path | TEXT | Full path or URL |
|
||
|
|
| size_bytes | INTEGER | File/content size |
|
||
|
|
| page_count | INTEGER | Number of text pages |
|
||
|
|
| book_title | TEXT | Gemini-extracted title |
|
||
|
|
| book_author | TEXT | Gemini-extracted author |
|
||
|
|
| status | TEXT | Pipeline status |
|
||
|
|
| pages_extracted | INTEGER | Pages extracted |
|
||
|
|
| concepts_extracted | INTEGER | Concepts generated |
|
||
|
|
| vectors_inserted | INTEGER | Vectors in Qdrant |
|
||
|
|
| error_message | TEXT | Last error (if failed) |
|
||
|
|
| retry_count | INTEGER | Failure retry count |
|
||
|
|
| created_at | TEXT | ISO timestamp |
|
||
|
|
| updated_at | TEXT | ISO timestamp |
|
||
|
|
|
||
|
|
### Qdrant (cortex:6333)
|
||
|
|
|
||
|
|
Collection: `recon_knowledge`
|
||
|
|
|
||
|
|
| Field | Type | Description |
|
||
|
|
|-------|------|-------------|
|
||
|
|
| vector | float[1024] | BGE-M3 embedding |
|
||
|
|
| doc_hash | keyword | Links to SQLite document |
|
||
|
|
| filename | keyword | Source filename |
|
||
|
|
| book_title | keyword | Document title |
|
||
|
|
| book_author | keyword | Author name |
|
||
|
|
| source_type | keyword | "document", "web", or "intel_feed" |
|
||
|
|
| download_url | keyword | files.echo6.co URL or source URL |
|
||
|
|
| content | text | Concept text (searchable) |
|
||
|
|
| summary | text | Concept summary |
|
||
|
|
| title | keyword | Concept title |
|
||
|
|
| domain | keyword | Knowledge domain |
|
||
|
|
| subdomain | keyword | Knowledge subdomain |
|
||
|
|
| keywords | keyword[] | Concept keywords |
|
||
|
|
| skill_level | keyword | beginner/intermediate/advanced/expert |
|
||
|
|
| key_facts | text[] | Key facts list |
|
||
|
|
| scenario_applicable | text[] | Applicable scenarios |
|
||
|
|
| cross_domain_tags | keyword[] | Cross-references |
|
||
|
|
| chapter | keyword | Source chapter |
|
||
|
|
| page_ref | keyword | Source page reference |
|
||
|
|
| notes | text | Additional notes |
|
||
|
|
| _window | integer | Source window number |
|
||
|
|
| _start_page | integer | Starting page in document |
|
||
|
|
| verification_status | keyword | "unverified" (default) |
|
||
|
|
| credibility_score | float | 0.7 (default) |
|
||
|
|
| language | keyword | "en" (default) |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 7. CLI Reference
|
||
|
|
|
||
|
|
```
|
||
|
|
recon <command> [options]
|
||
|
|
```
|
||
|
|
|
||
|
|
| Command | Description | Key Options |
|
||
|
|
|---------|-------------|-------------|
|
||
|
|
| `scan` | Scan library, catalogue new PDFs | `--path` |
|
||
|
|
| `queue` | Queue catalogued docs for processing | `--hash`, `--source`, `--category`, `--limit` |
|
||
|
|
| `extract` | Extract text from queued PDFs | `--workers` |
|
||
|
|
| `enrich` | Enrich extracted text via Gemini | `--workers`, `--limit` |
|
||
|
|
| `embed` | Embed concepts into Qdrant | `--workers`, `--limit` |
|
||
|
|
| `run` | Full pipeline (extract->enrich->embed) | `--workers`, `--enrich-workers`, `--limit` |
|
||
|
|
| `status` | Show pipeline status counts | |
|
||
|
|
| `catalogue` | Browse catalogue | `--sources`, `--categories`, `--source`, `--limit` |
|
||
|
|
| `failures` | Show failed documents | `--retry` |
|
||
|
|
| `search` | Semantic search | `query`, `--limit` |
|
||
|
|
| `upload` | Upload PDFs | `--file`, `--dir`, `--category` |
|
||
|
|
| `ingest-url` | Ingest web content | `url`, `--file`, `--category`, `--process` |
|
||
|
|
| `crawl` | Crawl a site | `url`, `--category`, `--include`, `--exclude`, `--max-pages`, `--dry-run`, `--process` |
|
||
|
|
| `validate` | Check pipeline consistency | `--deep` |
|
||
|
|
| `rebuild` | Rebuild Qdrant from concept JSONs | |
|
||
|
|
| `serve` | Start web dashboard (port 8420) | |
|
||
|
|
| `ingest` | Ingest ARGUS intel JSON | `--file`, `--directory` |
|
||
|
|
|
||
|
|
### Common Workflows
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Full library processing
|
||
|
|
recon scan && recon queue && recon run
|
||
|
|
|
||
|
|
# Ingest a single web page with full processing
|
||
|
|
recon ingest-url "https://example.com/article" --category "Reference" --process
|
||
|
|
|
||
|
|
# Dry-run crawl to preview URLs
|
||
|
|
recon crawl "https://docs.example.com" --include /docs/ --dry-run
|
||
|
|
|
||
|
|
# Full crawl with processing
|
||
|
|
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
|
||
|
|
|
||
|
|
# Upload a PDF
|
||
|
|
recon upload --file /path/to/document.pdf --category "Technical"
|
||
|
|
|
||
|
|
# Check what failed and retry
|
||
|
|
recon failures
|
||
|
|
recon failures --retry
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 8. Web Dashboard
|
||
|
|
|
||
|
|
### URL
|
||
|
|
|
||
|
|
```
|
||
|
|
http://100.64.0.24:8420
|
||
|
|
```
|
||
|
|
|
||
|
|
### Pages
|
||
|
|
|
||
|
|
| Route | Page | Description |
|
||
|
|
|-------|------|-------------|
|
||
|
|
| `/` | Dashboard | Knowledge base overview: document/concept/vector counts, source table, domain distribution bars, skill level breakdown, Qdrant health, recent completions, pipeline status |
|
||
|
|
| `/search` | Search | Semantic search with score bars, Web/PDF badges, download links |
|
||
|
|
| `/catalogue` | Catalogue | Browse all catalogued PDFs with source/category filters |
|
||
|
|
| `/upload` | Upload | PDF upload form with category datalist, recent uploads table |
|
||
|
|
| `/web-ingest` | Web Ingest | Two tabs: Single/Batch URL ingest, Site Crawl with preview |
|
||
|
|
| `/failures` | Failures | Failed documents with error messages and retry button |
|
||
|
|
|
||
|
|
### API Endpoints
|
||
|
|
|
||
|
|
| Method | Endpoint | Description |
|
||
|
|
|--------|----------|-------------|
|
||
|
|
| GET | `/api/search?q=...&limit=N` | Semantic search |
|
||
|
|
| GET | `/api/catalogue?source=...&limit=N` | Browse catalogue |
|
||
|
|
| GET | `/api/knowledge-stats` | Dashboard aggregation (totals, sources, domains, skills, Qdrant health) |
|
||
|
|
| POST | `/api/upload` | Upload PDF (multipart: file + category) |
|
||
|
|
| GET | `/api/upload/<hash>/status` | Check upload processing status |
|
||
|
|
| GET | `/api/upload/categories` | List available categories |
|
||
|
|
| POST | `/api/ingest-url` | Ingest single URL (json: url, category, process) |
|
||
|
|
| POST | `/api/ingest-urls` | Ingest multiple URLs (json: urls, category, process) |
|
||
|
|
| POST | `/api/crawl` | Crawl a site (json: url, category, include, exclude, max_pages, dry_run) |
|
||
|
|
| GET | `/api/crawl/<id>/status` | Poll crawl/pipeline progress |
|
||
|
|
| POST | `/api/failures/retry` | Re-queue all failed documents |
|
||
|
|
|
||
|
|
### Dashboard Features
|
||
|
|
|
||
|
|
- **Auto-refresh**: Every 30 seconds via JavaScript fetch
|
||
|
|
- **Knowledge cards**: Total documents, concepts, vectors, pages
|
||
|
|
- **Source table**: Per-source breakdown with document/concept/vector counts and PDF/WEB type badges
|
||
|
|
- **Domain distribution**: Horizontal bars showing top knowledge domains
|
||
|
|
- **Skill level breakdown**: beginner/intermediate/advanced/expert percentages
|
||
|
|
- **Qdrant health**: Connection status, points count, segments
|
||
|
|
- **Pipeline status**: Compact display of documents in each stage
|
||
|
|
- **Crawl polling**: Real-time stage tracking (ingesting -> enriching -> embedding)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 9. Concept JSON Schema
|
||
|
|
|
||
|
|
Each window file (`data/concepts/{hash}/window_N.json`) contains a JSON array of concept objects:
|
||
|
|
|
||
|
|
```json
|
||
|
|
[
|
||
|
|
{
|
||
|
|
"title": "Water Purification Methods",
|
||
|
|
"content": "Detailed text about the concept...",
|
||
|
|
"summary": "Brief summary of the concept",
|
||
|
|
"domain": "Survival",
|
||
|
|
"subdomain": "Water",
|
||
|
|
"keywords": ["purification", "filtration", "boiling"],
|
||
|
|
"skill_level": "beginner",
|
||
|
|
"key_facts": ["Boiling kills 99.9% of pathogens", "..."],
|
||
|
|
"scenario_applicable": ["wilderness survival", "disaster preparedness"],
|
||
|
|
"cross_domain_tags": ["health", "camping"],
|
||
|
|
"chapter": "Chapter 3",
|
||
|
|
"page_ref": "pp. 45-48",
|
||
|
|
"notes": "Additional context or caveats",
|
||
|
|
"_window": 1,
|
||
|
|
"_start_page": 1
|
||
|
|
}
|
||
|
|
]
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 10. Web Ingestion
|
||
|
|
|
||
|
|
### Single URL
|
||
|
|
|
||
|
|
```bash
|
||
|
|
recon ingest-url "https://example.com/article" --category "Reference" --process
|
||
|
|
```
|
||
|
|
|
||
|
|
Or via API:
|
||
|
|
```bash
|
||
|
|
curl -X POST http://100.64.0.24:8420/api/ingest-url \
|
||
|
|
-H "Content-Type: application/json" \
|
||
|
|
-d '{"url": "https://example.com/article", "category": "Reference", "process": true}'
|
||
|
|
```
|
||
|
|
|
||
|
|
### Site Crawl
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Preview what would be crawled
|
||
|
|
recon crawl "https://docs.example.com" --include /docs/ --dry-run
|
||
|
|
|
||
|
|
# Full crawl
|
||
|
|
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
|
||
|
|
```
|
||
|
|
|
||
|
|
### How It Works
|
||
|
|
|
||
|
|
1. **URL discovery** (crawler.py):
|
||
|
|
- Tries sitemap.xml first (preferred, finds all pages)
|
||
|
|
- Falls back to BFS link-following if no sitemap
|
||
|
|
- Filters by include/exclude patterns
|
||
|
|
|
||
|
|
2. **Content extraction** (web_scraper.py):
|
||
|
|
- Uses trafilatura for clean text extraction
|
||
|
|
- Chunks into ~2,000-word pages
|
||
|
|
- Same output format as PDF extractor: `data/text/{hash}/page_NNNN.txt`
|
||
|
|
- Content hash is MD5 of extracted text (deduplication)
|
||
|
|
|
||
|
|
3. **Pipeline integration**:
|
||
|
|
- Web content enters at `extracted` status (no PDF extraction needed)
|
||
|
|
- Enrichment and embedding proceed identically to PDF content
|
||
|
|
- Qdrant vectors get `source_type: "web"` and `download_url` pointing to source URL
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 11. Configuration Reference
|
||
|
|
|
||
|
|
### config.yaml
|
||
|
|
|
||
|
|
```yaml
|
||
|
|
# Root path for the PDF library (NFS mount from pi-nas)
|
||
|
|
library_root: /mnt/library
|
||
|
|
|
||
|
|
processing:
|
||
|
|
extract_workers: 4 # Concurrent PDF extraction threads
|
||
|
|
enrich_workers: 16 # Concurrent Gemini enrichment threads (4 keys x 4)
|
||
|
|
embed_workers: 4 # Concurrent embedding threads
|
||
|
|
enrich_window_size: 5 # Pages per enrichment window (sent to Gemini)
|
||
|
|
embed_batch_size: 500 # Vectors per Qdrant upsert batch
|
||
|
|
rate_limit_delay: 0.1 # Delay between Gemini API calls (seconds)
|
||
|
|
max_retries: 5 # Max retries for failed documents
|
||
|
|
|
||
|
|
embedding:
|
||
|
|
backend: tei # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
|
||
|
|
tei_host: 100.64.0.14 # TEI server (cortex)
|
||
|
|
tei_port: 8090 # TEI HTTP port
|
||
|
|
ollama_host: 100.64.0.14 # Ollama server (cortex) — fallback only
|
||
|
|
ollama_port: 11434 # Ollama HTTP port
|
||
|
|
model: bge-m3 # Embedding model name
|
||
|
|
dimensions: 1024 # CRITICAL: bge-m3 is 1024-dim, NOT 384
|
||
|
|
batch_size: 128 # Embeddings per TEI batch request
|
||
|
|
|
||
|
|
vector_db:
|
||
|
|
host: 100.64.0.14 # Qdrant server (cortex)
|
||
|
|
port: 6333 # Qdrant HTTP port
|
||
|
|
collection: recon_knowledge # Collection name
|
||
|
|
|
||
|
|
gemini:
|
||
|
|
model: gemini-2.0-flash # Gemini model for enrichment
|
||
|
|
response_mime_type: application/json # Force JSON output
|
||
|
|
|
||
|
|
web:
|
||
|
|
port: 8420 # Dashboard HTTP port
|
||
|
|
host: 0.0.0.0 # Bind to all interfaces
|
||
|
|
|
||
|
|
paths:
|
||
|
|
base: /opt/recon # Application root
|
||
|
|
data: /opt/recon/data # Data directory
|
||
|
|
text: /opt/recon/data/text # Extracted text output
|
||
|
|
concepts: /opt/recon/data/concepts # Enriched concept JSONs
|
||
|
|
intel: /opt/recon/data/intel # ARGUS intel feeds
|
||
|
|
logs: /opt/recon/logs # Log files
|
||
|
|
db: /opt/recon/data/recon.db # SQLite database
|
||
|
|
|
||
|
|
book_server:
|
||
|
|
base_url: https://files.echo6.co # Public URL prefix for PDF downloads
|
||
|
|
strip_prefix: /mnt/library # Path prefix to strip when generating URLs
|
||
|
|
|
||
|
|
upload_paths: # Category -> filesystem path mapping for uploads
|
||
|
|
Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
|
||
|
|
Military Doctrine: /mnt/library/Army_Pubs/Uploads
|
||
|
|
Gaming: /mnt/library/Gaming
|
||
|
|
Reference: /mnt/library/Reference
|
||
|
|
Technical: /mnt/library/Technical
|
||
|
|
default: /mnt/library # Fallback for unknown categories
|
||
|
|
|
||
|
|
web_scraper:
|
||
|
|
words_per_page: 2000 # Target words per page chunk
|
||
|
|
fetch_timeout: 30 # HTTP request timeout (seconds)
|
||
|
|
rate_limit_delay: 1.0 # Delay between URL fetches (seconds)
|
||
|
|
max_batch_size: 50 # Max URLs per batch ingest
|
||
|
|
user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
|
||
|
|
|
||
|
|
crawler:
|
||
|
|
user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
|
||
|
|
fetch_timeout: 30 # HTTP request timeout (seconds)
|
||
|
|
rate_limit_delay: 1.0 # Delay between page fetches (seconds)
|
||
|
|
max_pages: 500 # Max pages to discover per crawl
|
||
|
|
max_depth: 3 # Max link-following depth (BFS only)
|
||
|
|
default_exclude: # URL patterns to always skip
|
||
|
|
- /search
|
||
|
|
- /404
|
||
|
|
- /login
|
||
|
|
- /signup
|
||
|
|
- /auth/
|
||
|
|
- /api/
|
||
|
|
- /assets/
|
||
|
|
- /static/
|
||
|
|
```
|
||
|
|
|
||
|
|
### .env
|
||
|
|
|
||
|
|
```
|
||
|
|
GEMINI_KEY_1=<key>
|
||
|
|
GEMINI_KEY_2=<key>
|
||
|
|
GEMINI_KEY_3=<key>
|
||
|
|
GEMINI_KEY_4=<key>
|
||
|
|
```
|
||
|
|
|
||
|
|
Four Gemini API keys rotated across 16 enrichment workers via `KeyRotator`.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 12. Aurora RAG Integration
|
||
|
|
|
||
|
|
Aurora is the RAG-enabled AI assistant running on OpenWebUI (ai.echo6.co).
|
||
|
|
|
||
|
|
### How It Works
|
||
|
|
|
||
|
|
1. User asks a question in OpenWebUI
|
||
|
|
2. Aurora's OpenWebUI function/filter embeds the query via TEI (cortex:8090)
|
||
|
|
3. Searches Qdrant `recon_knowledge` collection for similar concepts
|
||
|
|
4. Top results are injected into the prompt as context
|
||
|
|
5. JOSIEFIED Qwen3 8B generates an answer with citations
|
||
|
|
6. Citations include `download_url` links (PDF files via files.echo6.co, web content via source URL)
|
||
|
|
|
||
|
|
### Key Components
|
||
|
|
|
||
|
|
- **Embedding**: Same TEI endpoint + bge-m3 model as RECON pipeline (ensures vector compatibility)
|
||
|
|
- **Search**: Cosine similarity, top-5 results by default
|
||
|
|
- **LLM**: `goekdenizguelmez/JOSIEFIED-Qwen3:8b` on Ollama (cortex:11434)
|
||
|
|
- **Citations**: Each result includes `download_url` — either `https://files.echo6.co/...` for PDFs or the original URL for web content
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 13. Backup & Recovery
|
||
|
|
|
||
|
|
### Automated Backups
|
||
|
|
|
||
|
|
**Script**: `/opt/recon/scripts/backup.sh`
|
||
|
|
**Destination**: Contabo VPS (`root@100.64.0.1:/opt/backups/recon/`)
|
||
|
|
**Schedule** (cron):
|
||
|
|
- Every 6 hours: Full backup (concepts, text, DB, config, intel)
|
||
|
|
- Every 2 hours (off-hours): SQLite DB snapshot only
|
||
|
|
|
||
|
|
### What's Backed Up
|
||
|
|
|
||
|
|
| Component | Size | Priority | Notes |
|
||
|
|
|-----------|------|----------|-------|
|
||
|
|
| data/concepts/ | ~11M | **CRITICAL** | $130+ of Gemini API work |
|
||
|
|
| data/text/ | ~203M | High | Hours to regenerate |
|
||
|
|
| data/recon.db | ~6.5M | **CRITICAL** | All pipeline state |
|
||
|
|
| config.yaml + .env | ~2K | Important | Configuration |
|
||
|
|
| data/intel/ | ~4K | Low | Intel feed data |
|
||
|
|
|
||
|
|
### What's NOT Backed Up
|
||
|
|
|
||
|
|
- **Qdrant vectors**: Rebuilt from concept JSONs in ~10 minutes via `recon rebuild`
|
||
|
|
- **PDF library**: Lives on pi-nas NFS, backed up separately
|
||
|
|
- **venv/**: Recreated from requirements.txt
|
||
|
|
|
||
|
|
### Recovery Procedures
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Restore from backup
|
||
|
|
scp -r root@100.64.0.1:/opt/backups/recon/concepts/ /opt/recon/data/concepts/
|
||
|
|
scp -r root@100.64.0.1:/opt/backups/recon/text/ /opt/recon/data/text/
|
||
|
|
scp root@100.64.0.1:/opt/backups/recon/recon_LATEST.db /opt/recon/data/recon.db
|
||
|
|
|
||
|
|
# Rebuild Qdrant vectors from concept JSONs
|
||
|
|
cd /opt/recon && source venv/bin/activate
|
||
|
|
python3 scripts/rebuild_qdrant.py
|
||
|
|
# Type REBUILD when prompted
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 14. Embedding Performance
|
||
|
|
|
||
|
|
### TEI (Primary) vs Ollama (Fallback)
|
||
|
|
|
||
|
|
| Metric | TEI (cortex:8090) | Ollama (cortex:11434) |
|
||
|
|
|--------|-------------------|----------------------|
|
||
|
|
| Speed | ~1,711 emb/sec | ~8 emb/sec |
|
||
|
|
| Model | bge-m3 | bge-m3 |
|
||
|
|
| Dimensions | 1024 | 1024 |
|
||
|
|
| Batch size | 128 | 1 |
|
||
|
|
| Cosine similarity | 0.999900 | 0.999900 |
|
||
|
|
|
||
|
|
TEI is ~214x faster than Ollama for embeddings. Always use TEI unless it's down.
|
||
|
|
|
||
|
|
### Qdrant Configuration
|
||
|
|
|
||
|
|
- Collection: `recon_knowledge`
|
||
|
|
- Distance: Cosine
|
||
|
|
- HNSW indexing threshold: 20,000 (below this, brute-force search is used)
|
||
|
|
- Current state: Brute-force (under 20K vectors) — this is normal and performant at current scale
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 15. Content Hashing
|
||
|
|
|
||
|
|
- **PDF content**: `MD5(file_bytes)` — stable across renames, detects exact duplicates
|
||
|
|
- **Web content**: `MD5(extracted_text)` — deduplicates by content, not URL
|
||
|
|
- Hash is used as the primary key in both SQLite tables and as the directory name for text/concept storage
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 16. Source Type Handling
|
||
|
|
|
||
|
|
| Source | Path Format | source_type | download_url | Badge |
|
||
|
|
|--------|-------------|-------------|--------------|-------|
|
||
|
|
| PDF | `/mnt/library/...` | document | `https://files.echo6.co/...` | PDF |
|
||
|
|
| Web | `https://...` | web | Original URL | Web |
|
||
|
|
| Intel | JSON feed | intel_feed | — | — |
|
||
|
|
|
||
|
|
The `generate_download_url()` function in utils.py handles the routing:
|
||
|
|
- URLs starting with `http://` or `https://` are returned as-is
|
||
|
|
- File paths are converted to `files.echo6.co` URLs
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 17. Lessons Learned
|
||
|
|
|
||
|
|
### RECON Rebuild Lessons
|
||
|
|
|
||
|
|
1. **Verify infrastructure before writing code.** Check Qdrant, TEI, Ollama connectivity first.
|
||
|
|
2. **Dimensions are 1024, NOT 384.** BGE-M3 uses 1024-dimensional vectors. This caused silent failures in early builds.
|
||
|
|
3. **TEI >> Ollama for embeddings.** 1,711 vs 8 embeddings/sec. A 214x speedup that makes batch processing viable.
|
||
|
|
4. **Dynamic discovery over hardcoded paths.** Let the pipeline discover what's on disk rather than maintaining static file lists.
|
||
|
|
5. **Web content uses the same pipeline.** After text extraction, web and PDF content follow identical enrichment and embedding paths.
|
||
|
|
6. **Sitemap > link-following.** Sitemaps discover all pages reliably; BFS link-following misses orphaned pages and is slower.
|
||
|
|
7. **Save to disk before DB operations.** Concept JSONs are written to disk first, then the database is updated. This means recovery is always possible from the JSON files.
|
||
|
|
8. **NFS over large file sets is slow.** Scanning 13K PDFs over NFS takes ~30 minutes due to MD5 hashing over the network. Plan accordingly.
|
||
|
|
|
||
|
|
### Operational Gotchas
|
||
|
|
|
||
|
|
- `recon scan` can appear stuck on large PDFs over NFS — it's hashing, not hung
|
||
|
|
- Some PDFs have corrupt metadata that crashes PyPDF2 — the extractor catches this and falls back
|
||
|
|
- Gemini rate limits hit with 16 workers — the `KeyRotator` distributes across 4 keys to mitigate
|
||
|
|
- `iptables-persistent` hangs on interactive prompts in LXC containers — use manual persistence
|
||
|
|
- The recon LXC has no tmux/screen — use `nohup` for long-running background tasks
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 18. Monitoring
|
||
|
|
|
||
|
|
### Pipeline Status
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Quick status
|
||
|
|
recon status
|
||
|
|
|
||
|
|
# Dashboard
|
||
|
|
http://100.64.0.24:8420
|
||
|
|
|
||
|
|
# Tail logs
|
||
|
|
tail -f /opt/recon/logs/recon.log
|
||
|
|
|
||
|
|
# Pipeline run log (when running full background pipeline)
|
||
|
|
tail -f /opt/recon/pipeline.log
|
||
|
|
```
|
||
|
|
|
||
|
|
### Health Checks
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Qdrant
|
||
|
|
curl -s http://100.64.0.14:6333/collections/recon_knowledge | python3 -m json.tool
|
||
|
|
|
||
|
|
# TEI
|
||
|
|
curl -s http://100.64.0.14:8090/info
|
||
|
|
|
||
|
|
# Ollama
|
||
|
|
curl -s http://100.64.0.14:11434/api/tags | python3 -m json.tool
|
||
|
|
|
||
|
|
# NFS mount
|
||
|
|
df -h /mnt/library
|
||
|
|
|
||
|
|
# Backup logs
|
||
|
|
tail -20 /opt/recon/logs/backup.log
|
||
|
|
```
|
||
|
|
|
||
|
|
### Validation
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Quick validation
|
||
|
|
recon validate
|
||
|
|
|
||
|
|
# Deep validation (checks all files on disk)
|
||
|
|
recon validate --deep
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 19. Current State
|
||
|
|
|
||
|
|
*As of 2026-02-16*
|
||
|
|
|
||
|
|
### Pipeline Progress
|
||
|
|
|
||
|
|
| Status | Count |
|
||
|
|
|--------|-------|
|
||
|
|
| Catalogued | 10,162 |
|
||
|
|
| Queued | 8,982 |
|
||
|
|
| Extracted | 872 |
|
||
|
|
| Complete | 302 |
|
||
|
|
| Failed | 2 |
|
||
|
|
|
||
|
|
### Vector Database
|
||
|
|
|
||
|
|
- Qdrant points: 4,661 (3,144 PDF + 1,517 web)
|
||
|
|
- Segments: 8
|
||
|
|
- Indexing: Brute-force (under 20K threshold)
|
||
|
|
|
||
|
|
### Active Processing
|
||
|
|
|
||
|
|
Full pipeline running in background via `nohup` — extracting through the 8,982 queued documents. Expected to take ~40 hours for full extract -> enrich -> embed cycle.
|
||
|
|
|
||
|
|
### Backups
|
||
|
|
|
||
|
|
- Schedule: Every 6 hours (full) + every 2 hours (DB only)
|
||
|
|
- Destination: Contabo VPS (`/opt/backups/recon/`)
|
||
|
|
- Last verified: 2026-02-16 (220M total backup size)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## 20. Dependencies
|
||
|
|
|
||
|
|
### System Packages
|
||
|
|
|
||
|
|
- Python 3.11+
|
||
|
|
- pdftotext (poppler-utils)
|
||
|
|
- tesseract-ocr
|
||
|
|
- sqlite3
|
||
|
|
|
||
|
|
### Python Packages (key)
|
||
|
|
|
||
|
|
| Package | Version | Purpose |
|
||
|
|
|---------|---------|---------|
|
||
|
|
| Flask | 3.1.2 | Web dashboard |
|
||
|
|
| google-generativeai | 0.8.6 | Gemini API for enrichment |
|
||
|
|
| qdrant-client | 1.16.2 | Vector database client |
|
||
|
|
| PyPDF2 | 3.0.1 | PDF text extraction |
|
||
|
|
| trafilatura | 2.0.0 | Web content extraction |
|
||
|
|
| beautifulsoup4 | 4.14.3 | HTML parsing for crawler |
|
||
|
|
| lxml | 6.0.2 | XML/HTML parsing |
|
||
|
|
| pytesseract | 0.3.13 | OCR fallback |
|
||
|
|
| requests | 2.32.5 | HTTP client |
|
||
|
|
| PyYAML | 6.0.3 | Config file parsing |
|
||
|
|
|
||
|
|
Full list in `requirements.txt`.
|