matt/recon

mirror of https://github.com/zvx-echo6/recon.git synced 2026-05-20 06:34:40 +02:00

Matt 563c16bb71 Initial commit: RECON codebase baseline

Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-14 14:57:23 +00:00


RECON Project Bible v2.0

Last updated: 2026-02-16

1. Mission Statement

RECON (Reconnaissance, Extraction, Conceptualization, and Operationalization of kNowledge) is a knowledge extraction pipeline that processes PDFs and web content into structured concepts stored in a Qdrant vector database. These concepts power Aurora, the RAG-enabled AI assistant running on OpenWebUI.

The core loop: Content in (PDF/web) -> Text extracted -> Concepts enriched (Gemini) -> Vectors embedded (TEI/BGE-M3) -> Searchable knowledge (Qdrant) -> Aurora answers questions with citations.

2. Infrastructure

Hosts

Host	IP (Tailscale)	Role
recon LXC	100.64.0.24 (CT 130 on toc)	RECON application, dashboard, pipeline
cortex VM	100.64.0.14 (VM 150 on toc)	Qdrant, TEI, Ollama, OpenWebUI
pi-nas	100.64.0.21 (192.168.1.245)	NFS file server for PDF library
Contabo VPS	100.64.0.1 (5.189.158.149)	Backup destination

Services on cortex (100.64.0.14)

Service	Port	Purpose
Qdrant	6333	Vector database (recon_knowledge collection)
TEI (text-embeddings-inference)	8090	Embedding server (bge-m3, 1024-dim, ~1,711 emb/sec)
Ollama	11434	LLM server + fallback embeddings (~8 emb/sec)
OpenWebUI	8080	Aurora chat interface (ai.echo6.co)

Services on recon LXC (100.64.0.24)

Service	Port	Purpose
RECON Dashboard	8420	Web UI + API for pipeline management
File Server	8888	PDF downloads (files.echo6.co)

NFS Mount

pi-nas:/export/library -> /mnt/library (22TB, rw, NFSv3)

Contains ~13,000+ PDFs across:

Survival-Companion-Library/ (~12,900 PDFs in ~220 subdirectories)
Army_Pubs/ (~160 military field manuals)
Other: Gaming/, Reference/, Technical/

3. Architecture Overview

                    /mnt/library/ (NFS)
                         |
                    [recon scan]
                         |
                    catalogue (SQLite)
                         |
                    [recon queue]
                         |
    +-----------+   [recon extract]   +-----------+
    |  PyPDF2   |-->  data/text/      |  Gemini   |
    | pdftotext |   {hash}/page_N.txt |  Flash    |
    | tesseract |        |            |  4 keys   |
    +-----------+   [recon enrich]    +-----------+
                         |
                    data/concepts/
                    {hash}/window_N.json
                         |
                    [recon embed]
                         |
              +----------+-----------+
              |   TEI (primary)      |
              |   bge-m3, 1024-dim   |
              |   1,711 emb/sec      |
              +----------+-----------+
                         |
                    Qdrant (cortex:6333)
                    recon_knowledge collection
                         |
                    Aurora (OpenWebUI)
                    RAG search + citations

Web Content Path

    URL(s) ──> [recon ingest-url / crawl]
                         |
                    trafilatura extraction
                    chunk into ~2000-word pages
                         |
                    data/text/{hash}/page_N.txt
                    (enters at "extracted" status)
                         |
                    [enrich] -> [embed]
                    (same as PDF path)

4. Pipeline Stages

Status Flow

catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete
                                                                                    \-> failed

Web content enters at extracted status (text already extracted by trafilatura).
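The status flow can be sketched as an ordered list with a helper that advances a document one step. This is a minimal illustration, not the code in status.py; the names `PIPELINE_ORDER`, `next_status`, and `entry_status` are hypothetical, and `failed` is treated here as a terminal status outside the success path.

```python
# Hypothetical sketch of the status flow; names are illustrative only.
PIPELINE_ORDER = [
    "catalogued", "queued", "extracting", "extracted",
    "enriching", "enriched", "embedding", "complete",
]
FAILED = "failed"  # assumed terminal status, set when a stage gives up


def next_status(current: str) -> str:
    """Return the status that follows `current` on the success path."""
    idx = PIPELINE_ORDER.index(current)  # ValueError for unknown statuses
    if idx == len(PIPELINE_ORDER) - 1:
        raise ValueError(f"{current!r} is terminal")
    return PIPELINE_ORDER[idx + 1]


def entry_status(source_type: str) -> str:
    """Web content skips PDF extraction and enters at 'extracted'."""
    return "extracted" if source_type == "web" else "catalogued"
```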

Stage Details

Stage	Tool	Input	Output	Speed
Scan	`recon scan`	/mnt/library/*.pdf	catalogue table	~13K PDFs in ~30 min
Queue	`recon queue`	catalogue entries	documents table (status=queued)	Instant
Extract	`recon extract`	PDF files	data/text/{hash}/page_NNNN.txt	4 workers, ~200/hr
Enrich	`recon enrich`	Text pages (10-page windows)	data/concepts/{hash}/window_N.json	16 workers, 4 Gemini keys
Embed	`recon embed`	Concept JSONs	Qdrant vectors	TEI: 1,711 emb/sec

Extraction Fallback Chain

1. PyPDF2 (fast, clean text) -> 2. pdftotext (handles complex layouts) -> 3. Tesseract OCR (scanned documents)
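A fallback chain like this can be written as a loop over extractor callables that stops at the first usable result. The sketch below uses stand-in functions rather than real PyPDF2, pdftotext, or Tesseract calls; `extract_with_fallback` and the `min_chars` threshold are assumptions, not part of extractor.py.

```python
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], str]  # takes a PDF path, returns extracted text


def extract_with_fallback(
    pdf_path: str,
    extractors: Iterable[Tuple[str, Extractor]],
    min_chars: int = 1,
) -> Tuple[Optional[str], str]:
    """Try each extractor in order and return (tool_name, text) from the
    first one whose output, after stripping surrounding whitespace, has at
    least `min_chars` characters. Returns (None, "") if every tool fails."""
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            continue  # e.g. corrupt metadata: move on to the next tool
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""


# Stand-in extractors for illustration only.
def broken(path: str) -> str:
    raise ValueError("corrupt metadata")


def empty(path: str) -> str:
    return "   "


def ocr(path: str) -> str:
    return "scanned page text"


chain = [("pypdf2", broken), ("pdftotext", empty), ("tesseract", ocr)]
result = extract_with_fallback("example.pdf", chain)
```

Keeping the chain as data makes the order easy to change or extend without touching the loop.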

Enrichment Details

Model: gemini-2.0-flash
Window size: 10 pages per API call (configurable via processing.enrich_window_size; the config.yaml listing in section 11 sets it to 5)
Workers: 16 concurrent (4 API keys x 4 workers each)
Output format: JSON array of concept objects
CRITICAL: Concept JSONs are saved to disk BEFORE any database operations
Key rotation via KeyRotator class distributing across 4 Gemini API keys
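To make the windowing concrete, here is a small sketch of how page ranges could be grouped into windows. The function name `page_windows` is hypothetical; the 1-indexed numbering is chosen to line up with the `_window` and `_start_page` fields of the concept JSON.

```python
from typing import Iterator, Tuple


def page_windows(page_count: int, window_size: int = 10) -> Iterator[Tuple[int, int, int]]:
    """Yield (window_number, start_page, end_page), 1-indexed and inclusive."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    starts = range(1, page_count + 1, window_size)
    for window, start in enumerate(starts, start=1):
        yield window, start, min(start + window_size - 1, page_count)


# A 25-page document yields two full windows and one partial window.
windows = list(page_windows(25))
```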

Embedding Details

Primary: TEI at cortex:8090 (bge-m3 model, 1024 dimensions, ~1,711 embeddings/sec)
Fallback: Ollama at cortex:11434 (bge-m3 model, ~8 embeddings/sec)
Batch size: 128 embeddings per TEI request
Distance metric: Cosine similarity
CRITICAL: Dimensions are 1024 (bge-m3), NOT 384. Getting this wrong creates silent failures.
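One way to guard against the dimension mismatch is to check every returned vector before it is written to Qdrant. The sketch below shows batching at 128 texts per request plus that check; `batched` and `check_dimensions` are illustrative names, and no TEI or Qdrant call is made here.

```python
from typing import Iterator, List, Sequence

EXPECTED_DIM = 1024  # bge-m3
BATCH_SIZE = 128     # embeddings per TEI request


def batched(texts: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Split texts into consecutive batches of at most `size` items."""
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def check_dimensions(vectors: Sequence[Sequence[float]], expected: int = EXPECTED_DIM) -> None:
    """Raise instead of letting a wrong-sized vector reach the database."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected:
            raise ValueError(f"vector {i} has {len(vec)} dims, expected {expected}")
```

Failing loudly at this point turns the silent failure described above into an immediate error.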

5. Directory Structure

/opt/recon/                          # Application root
  recon.py                           # CLI entry point
  config.yaml                        # Central configuration
  .env                               # Gemini API keys (4 keys)
  requirements.txt                   # Python dependencies
  PROJECT-BIBLE.md                   # This file
  README.md                          # Quick-start reference
  run-full-pipeline.sh               # Background pipeline runner

  lib/                               # Core modules
    __init__.py
    api.py                           # Flask web dashboard + API (port 8420)
    crawler.py                       # Site crawler (sitemap + BFS link-following)
    embedder.py                      # Concept -> vector embedding (TEI/Ollama -> Qdrant)
    enricher.py                      # Text -> concept extraction (Gemini)
    extractor.py                     # PDF -> text extraction (PyPDF2/pdftotext/OCR)
    ingester.py                      # ARGUS intel feed intake
    status.py                        # SQLite DB operations (catalogue + documents)
    utils.py                         # Config, hashing, URL generation, logging
    web_scraper.py                   # URL -> text extraction (trafilatura)

  scripts/                           # Operational scripts
    backup.sh                        # Automated backup to Contabo (cron every 6h)
    rebuild_qdrant.py                # Nuclear recovery: re-embed all concepts
    validate.py                      # Pipeline consistency validation

  data/                              # Pipeline data (on local disk)
    recon.db                         # SQLite status database
    text/                            # Extracted text
      {content_hash}/
        meta.json                    # Document metadata
        page_0001.txt                # Page text (4-digit, 1-indexed)
        page_0002.txt
        ...
    concepts/                        # Enriched concepts (**BACK THESE UP**)
      {content_hash}/
        window_1.json                # Concept JSON array (10-page window)
        window_2.json
        ...
    intel/                           # ARGUS intel feeds

  logs/                              # Application logs
    recon.log                        # Main rotating log
    backup.log                       # Backup operation log
    backup_cron.log                  # Cron backup log

  venv/                              # Python virtual environment

6. Database Schema

SQLite (data/recon.db)

Two tables. The database runs in WAL mode, and each thread uses its own (thread-local) connection.

catalogue

Column	Type	Description
hash	TEXT PK	MD5 content hash
filename	TEXT	Original filename
path	TEXT	Full filesystem path
size_bytes	INTEGER	File size
source	TEXT	Top-level directory (e.g., "Survival-Companion-Library")
category	TEXT	Second-level directory (e.g., "Bushcraft")
status	TEXT	"catalogued" or "processed"
discovered_at	TEXT	ISO timestamp

documents

Column	Type	Description
hash	TEXT PK	MD5 content hash
filename	TEXT	Original filename
path	TEXT	Full path or URL
size_bytes	INTEGER	File/content size
page_count	INTEGER	Number of text pages
book_title	TEXT	Gemini-extracted title
book_author	TEXT	Gemini-extracted author
status	TEXT	Pipeline status
pages_extracted	INTEGER	Pages extracted
concepts_extracted	INTEGER	Concepts generated
vectors_inserted	INTEGER	Vectors in Qdrant
error_message	TEXT	Last error (if failed)
retry_count	INTEGER	Failure retry count
created_at	TEXT	ISO timestamp
updated_at	TEXT	ISO timestamp
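The two tables above could be created along these lines. This is a sketch built only from the listed columns: constraints, defaults, and indexes in the real status.py are unknown, `get_conn` is a hypothetical helper, and it assumes one database file per process.

```python
import sqlite3
import threading

SCHEMA = """
CREATE TABLE IF NOT EXISTS catalogue (
    hash          TEXT PRIMARY KEY,
    filename      TEXT,
    path          TEXT,
    size_bytes    INTEGER,
    source        TEXT,
    category      TEXT,
    status        TEXT,
    discovered_at TEXT
);
CREATE TABLE IF NOT EXISTS documents (
    hash               TEXT PRIMARY KEY,
    filename           TEXT,
    path               TEXT,
    size_bytes         INTEGER,
    page_count         INTEGER,
    book_title         TEXT,
    book_author        TEXT,
    status             TEXT,
    pages_extracted    INTEGER,
    concepts_extracted INTEGER,
    vectors_inserted   INTEGER,
    error_message      TEXT,
    retry_count        INTEGER,
    created_at         TEXT,
    updated_at         TEXT
);
"""

_local = threading.local()  # holds one connection per thread


def get_conn(db_path: str) -> sqlite3.Connection:
    """Return this thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA journal_mode=WAL")  # readers do not block the writer
        conn.executescript(SCHEMA)
        _local.conn = conn
    return conn
```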

Qdrant (cortex:6333)

Collection: recon_knowledge

Field	Type	Description
vector	float[1024]	BGE-M3 embedding
doc_hash	keyword	Links to SQLite document
filename	keyword	Source filename
book_title	keyword	Document title
book_author	keyword	Author name
source_type	keyword	"document", "web", or "intel_feed"
download_url	keyword	files.echo6.co URL or source URL
content	text	Concept text (searchable)
summary	text	Concept summary
title	keyword	Concept title
domain	keyword	Knowledge domain
subdomain	keyword	Knowledge subdomain
keywords	keyword[]	Concept keywords
skill_level	keyword	beginner/intermediate/advanced/expert
key_facts	text[]	Key facts list
scenario_applicable	text[]	Applicable scenarios
cross_domain_tags	keyword[]	Cross-references
chapter	keyword	Source chapter
page_ref	keyword	Source page reference
notes	text	Additional notes
_window	integer	Source window number
_start_page	integer	Starting page in document
verification_status	keyword	"unverified" (default)
credibility_score	float	0.7 (default)
language	keyword	"en" (default)
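As an illustration of how a concept object and its document metadata might be merged into one Qdrant payload, consider the sketch below. `build_payload` is a hypothetical helper, not the embedder.py implementation; the three defaults are the ones listed in the table.

```python
from typing import Any, Dict

SOURCE_TYPES = {"document", "web", "intel_feed"}
PAYLOAD_DEFAULTS = {
    "verification_status": "unverified",
    "credibility_score": 0.7,
    "language": "en",
}


def build_payload(
    concept: Dict[str, Any],
    *,
    doc_hash: str,
    filename: str,
    book_title: str,
    book_author: str,
    source_type: str,
    download_url: str,
) -> Dict[str, Any]:
    """Combine defaults, concept fields, and document-level fields."""
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unknown source_type: {source_type!r}")
    payload = dict(PAYLOAD_DEFAULTS)
    payload.update(concept)  # title, content, domain, _window, _start_page, ...
    payload.update(
        doc_hash=doc_hash,
        filename=filename,
        book_title=book_title,
        book_author=book_author,
        source_type=source_type,
        download_url=download_url,
    )
    return payload
```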

7. CLI Reference

recon <command> [options]

Command	Description	Key Options
`scan`	Scan library, catalogue new PDFs	`--path`
`queue`	Queue catalogued docs for processing	`--hash`, `--source`, `--category`, `--limit`
`extract`	Extract text from queued PDFs	`--workers`
`enrich`	Enrich extracted text via Gemini	`--workers`, `--limit`
`embed`	Embed concepts into Qdrant	`--workers`, `--limit`
`run`	Full pipeline (extract->enrich->embed)	`--workers`, `--enrich-workers`, `--limit`
`status`	Show pipeline status counts
`catalogue`	Browse catalogue	`--sources`, `--categories`, `--source`, `--limit`
`failures`	Show failed documents	`--retry`
`search`	Semantic search	`query`, `--limit`
`upload`	Upload PDFs	`--file`, `--dir`, `--category`
`ingest-url`	Ingest web content	`url`, `--file`, `--category`, `--process`
`crawl`	Crawl a site	`url`, `--category`, `--include`, `--exclude`, `--max-pages`, `--dry-run`, `--process`
`validate`	Check pipeline consistency	`--deep`
`rebuild`	Rebuild Qdrant from concept JSONs
`serve`	Start web dashboard (port 8420)
`ingest`	Ingest ARGUS intel JSON	`--file`, `--directory`

Common Workflows

# Full library processing
recon scan && recon queue && recon run

# Ingest a single web page with full processing
recon ingest-url "https://example.com/article" --category "Reference" --process

# Dry-run crawl to preview URLs
recon crawl "https://docs.example.com" --include /docs/ --dry-run

# Full crawl with processing
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process

# Upload a PDF
recon upload --file /path/to/document.pdf --category "Technical"

# Check what failed and retry
recon failures
recon failures --retry

8. Web Dashboard

URL

http://100.64.0.24:8420

Pages

Route	Page	Description
`/`	Dashboard	Knowledge base overview: document/concept/vector counts, source table, domain distribution bars, skill level breakdown, Qdrant health, recent completions, pipeline status
`/search`	Search	Semantic search with score bars, Web/PDF badges, download links
`/catalogue`	Catalogue	Browse all catalogued PDFs with source/category filters
`/upload`	Upload	PDF upload form with category datalist, recent uploads table
`/web-ingest`	Web Ingest	Two tabs: Single/Batch URL ingest, Site Crawl with preview
`/failures`	Failures	Failed documents with error messages and retry button

API Endpoints

Method	Endpoint	Description
GET	`/api/search?q=...&limit=N`	Semantic search
GET	`/api/catalogue?source=...&limit=N`	Browse catalogue
GET	`/api/knowledge-stats`	Dashboard aggregation (totals, sources, domains, skills, Qdrant health)
POST	`/api/upload`	Upload PDF (multipart: file + category)
GET	`/api/upload/<hash>/status`	Check upload processing status
GET	`/api/upload/categories`	List available categories
POST	`/api/ingest-url`	Ingest single URL (json: url, category, process)
POST	`/api/ingest-urls`	Ingest multiple URLs (json: urls, category, process)
POST	`/api/crawl`	Crawl a site (json: url, category, include, exclude, max_pages, dry_run)
GET	`/api/crawl/<id>/status`	Poll crawl/pipeline progress
POST	`/api/failures/retry`	Re-queue all failed documents

Dashboard Features

Auto-refresh: Every 30 seconds via JavaScript fetch
Knowledge cards: Total documents, concepts, vectors, pages
Source table: Per-source breakdown with document/concept/vector counts and PDF/WEB type badges
Domain distribution: Horizontal bars showing top knowledge domains
Skill level breakdown: beginner/intermediate/advanced/expert percentages
Qdrant health: Connection status, points count, segments
Pipeline status: Compact display of documents in each stage
Crawl polling: Real-time stage tracking (ingesting -> enriching -> embedding)

9. Concept JSON Schema

Each window file (data/concepts/{hash}/window_N.json) contains a JSON array of concept objects:

[
  {
    "title": "Water Purification Methods",
    "content": "Detailed text about the concept...",
    "summary": "Brief summary of the concept",
    "domain": "Survival",
    "subdomain": "Water",
    "keywords": ["purification", "filtration", "boiling"],
    "skill_level": "beginner",
    "key_facts": ["Boiling kills 99.9% of pathogens", "..."],
    "scenario_applicable": ["wilderness survival", "disaster preparedness"],
    "cross_domain_tags": ["health", "camping"],
    "chapter": "Chapter 3",
    "page_ref": "pp. 45-48",
    "notes": "Additional context or caveats",
    "_window": 1,
    "_start_page": 1
  }
]
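A window file can be sanity-checked against this shape before embedding. The sketch below assumes every key in the example is expected on every concept, which the source does not state; `validate_concept` and `load_window` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Assumption: all keys from the example above are expected on each concept.
EXPECTED_KEYS = {
    "title", "content", "summary", "domain", "subdomain", "keywords",
    "skill_level", "key_facts", "scenario_applicable", "cross_domain_tags",
    "chapter", "page_ref", "notes", "_window", "_start_page",
}
SKILL_LEVELS = {"beginner", "intermediate", "advanced", "expert"}
LIST_KEYS = ("keywords", "key_facts", "scenario_applicable", "cross_domain_tags")


def validate_concept(concept: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - concept.keys())]
    if concept.get("skill_level") not in SKILL_LEVELS:
        problems.append(f"unexpected skill_level: {concept.get('skill_level')!r}")
    for key in LIST_KEYS:
        if key in concept and not isinstance(concept[key], list):
            problems.append(f"{key} should be a list")
    return problems


def load_window(path: Path) -> List[Dict[str, Any]]:
    """Read one window_N.json file and confirm it holds a JSON array."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return data
```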

10. Web Ingestion

Single URL

recon ingest-url "https://example.com/article" --category "Reference" --process

Or via API:

curl -X POST http://100.64.0.24:8420/api/ingest-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article", "category": "Reference", "process": true}'

Site Crawl

# Preview what would be crawled
recon crawl "https://docs.example.com" --include /docs/ --dry-run

# Full crawl
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process

How It Works

URL discovery (crawler.py):
- Tries sitemap.xml first (preferred, finds all pages)
- Falls back to BFS link-following if no sitemap
- Filters by include/exclude patterns
Content extraction (web_scraper.py):
- Uses trafilatura for clean text extraction
- Chunks into ~2,000-word pages
- Same output format as PDF extractor: data/text/{hash}/page_NNNN.txt
- Content hash is MD5 of extracted text (deduplication)
Pipeline integration:
- Web content enters at extracted status (no PDF extraction needed)
- Enrichment and embedding proceed identically to PDF content
- Qdrant vectors get source_type: "web" and download_url pointing to source URL
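The chunking step can be illustrated with a plain word-count split. This is a simplified sketch: the real web_scraper.py may respect paragraph boundaries, and `chunk_words` and `page_filename` are hypothetical names.

```python
from typing import List


def chunk_words(text: str, words_per_page: int = 2000) -> List[str]:
    """Split text into pages of at most `words_per_page` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def page_filename(page_number: int) -> str:
    """Four-digit, 1-indexed page file name, matching the PDF extractor output."""
    return f"page_{page_number:04d}.txt"


# 4,500 words become two full pages and one 500-word page.
pages = chunk_words("word " * 4500)
names = [page_filename(n) for n in range(1, len(pages) + 1)]
```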

11. Configuration Reference

config.yaml

# Root path for the PDF library (NFS mount from pi-nas)
library_root: /mnt/library

processing:
  extract_workers: 4        # Concurrent PDF extraction threads
  enrich_workers: 16         # Concurrent Gemini enrichment threads (4 keys x 4)
  embed_workers: 4           # Concurrent embedding threads
  enrich_window_size: 5      # Pages per enrichment window (sent to Gemini)
  embed_batch_size: 500      # Vectors per Qdrant upsert batch
  rate_limit_delay: 0.1      # Delay between Gemini API calls (seconds)
  max_retries: 5             # Max retries for failed documents

embedding:
  backend: tei               # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
  tei_host: 100.64.0.14      # TEI server (cortex)
  tei_port: 8090             # TEI HTTP port
  ollama_host: 100.64.0.14   # Ollama server (cortex) — fallback only
  ollama_port: 11434         # Ollama HTTP port
  model: bge-m3              # Embedding model name
  dimensions: 1024           # CRITICAL: bge-m3 is 1024-dim, NOT 384
  batch_size: 128            # Embeddings per TEI batch request

vector_db:
  host: 100.64.0.14          # Qdrant server (cortex)
  port: 6333                 # Qdrant HTTP port
  collection: recon_knowledge  # Collection name

gemini:
  model: gemini-2.0-flash    # Gemini model for enrichment
  response_mime_type: application/json  # Force JSON output

web:
  port: 8420                 # Dashboard HTTP port
  host: 0.0.0.0              # Bind to all interfaces

paths:
  base: /opt/recon           # Application root
  data: /opt/recon/data      # Data directory
  text: /opt/recon/data/text  # Extracted text output
  concepts: /opt/recon/data/concepts  # Enriched concept JSONs
  intel: /opt/recon/data/intel  # ARGUS intel feeds
  logs: /opt/recon/logs      # Log files
  db: /opt/recon/data/recon.db  # SQLite database

book_server:
  base_url: https://files.echo6.co  # Public URL prefix for PDF downloads
  strip_prefix: /mnt/library  # Path prefix to strip when generating URLs

upload_paths:                 # Category -> filesystem path mapping for uploads
  Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
  Military Doctrine: /mnt/library/Army_Pubs/Uploads
  Gaming: /mnt/library/Gaming
  Reference: /mnt/library/Reference
  Technical: /mnt/library/Technical
  default: /mnt/library      # Fallback for unknown categories

web_scraper:
  words_per_page: 2000       # Target words per page chunk
  fetch_timeout: 30          # HTTP request timeout (seconds)
  rate_limit_delay: 1.0      # Delay between URL fetches (seconds)
  max_batch_size: 50         # Max URLs per batch ingest
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"

crawler:
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
  fetch_timeout: 30          # HTTP request timeout (seconds)
  rate_limit_delay: 1.0      # Delay between page fetches (seconds)
  max_pages: 500             # Max pages to discover per crawl
  max_depth: 3               # Max link-following depth (BFS only)
  default_exclude:            # URL patterns to always skip
    - /search
    - /404
    - /login
    - /signup
    - /auth/
    - /api/
    - /assets/
    - /static/

.env

GEMINI_KEY_1=<key>
GEMINI_KEY_2=<key>
GEMINI_KEY_3=<key>
GEMINI_KEY_4=<key>

Four Gemini API keys rotated across 16 enrichment workers via KeyRotator.
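A thread-safe round-robin rotator is one plausible shape for this. The sketch below is not the project's KeyRotator: the class body, the `next_key` method, and `keys_from_env` are assumptions, and the real class may pin workers to keys rather than cycling.

```python
import itertools
import os
import threading
from typing import List, Mapping, Sequence


class KeyRotator:
    """Hand out API keys in round-robin order, safely across threads."""

    def __init__(self, keys: Sequence[str]) -> None:
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(list(keys))
        self._lock = threading.Lock()

    def next_key(self) -> str:
        with self._lock:  # itertools.cycle is not safe to share unguarded
            return next(self._cycle)


def keys_from_env(env: Mapping[str, str] = os.environ, count: int = 4) -> List[str]:
    """Collect GEMINI_KEY_1..GEMINI_KEY_<count>, skipping unset entries."""
    names = (f"GEMINI_KEY_{i}" for i in range(1, count + 1))
    return [env[name] for name in names if env.get(name)]
```

Each enrichment worker would call `next_key()` before a request, so 16 workers spread evenly over 4 keys.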

12. Aurora RAG Integration

Aurora is the RAG-enabled AI assistant running on OpenWebUI (ai.echo6.co).

How It Works

1. User asks a question in OpenWebUI
2. Aurora's OpenWebUI function/filter embeds the query via TEI (cortex:8090)
3. Searches Qdrant recon_knowledge collection for similar concepts
4. Top results are injected into the prompt as context
5. JOSIEFIED Qwen3 8B generates an answer with citations
6. Citations include download_url links (PDF files via files.echo6.co, web content via source URL)

Key Components

Embedding: Same TEI endpoint + bge-m3 model as RECON pipeline (ensures vector compatibility)
Search: Cosine similarity, top-5 results by default
LLM: goekdenizguelmez/JOSIEFIED-Qwen3:8b on Ollama (cortex:11434)
Citations: Each result includes download_url — either https://files.echo6.co/... for PDFs or the original URL for web content
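The search step amounts to ranking stored vectors by cosine similarity to the query vector and keeping the best five. In production Qdrant does this; the pure-Python sketch below only illustrates the ranking, with hypothetical names `cosine` and `top_k` and toy two-dimensional vectors.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    points: Sequence[Tuple[Sequence[float], Dict[str, Any]]],
    k: int = 5,
) -> List[Tuple[float, Dict[str, Any]]]:
    """Return the k (score, payload) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), payload) for vec, payload in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


points = [
    ([1.0, 0.0], {"title": "a"}),
    ([0.0, 1.0], {"title": "b"}),
    ([1.0, 1.0], {"title": "c"}),
]
best = top_k([1.0, 0.1], points, k=2)
```

Because both sides use the same bge-m3 model, query and stored vectors live in the same space and the scores are comparable.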

13. Backup & Recovery

Automated Backups

Script: /opt/recon/scripts/backup.sh
Destination: Contabo VPS (root@100.64.0.1:/opt/backups/recon/)
Schedule (cron):

Every 6 hours: Full backup (concepts, text, DB, config, intel)
Every 2 hours (off-hours): SQLite DB snapshot only

What's Backed Up

Component	Size	Priority	Notes
data/concepts/	~11M	CRITICAL	$130+ of Gemini API work
data/text/	~203M	High	Hours to regenerate
data/recon.db	~6.5M	CRITICAL	All pipeline state
config.yaml + .env	~2K	Important	Configuration
data/intel/	~4K	Low	Intel feed data

What's NOT Backed Up

Qdrant vectors: Rebuilt from concept JSONs in ~10 minutes via recon rebuild
PDF library: Lives on pi-nas NFS, backed up separately
venv/: Recreated from requirements.txt

Recovery Procedures

# Restore from backup (copy into data/ so an existing concepts/ or text/
# directory is merged instead of being nested one level deeper)
scp -r root@100.64.0.1:/opt/backups/recon/concepts /opt/recon/data/
scp -r root@100.64.0.1:/opt/backups/recon/text /opt/recon/data/
scp root@100.64.0.1:/opt/backups/recon/recon_LATEST.db /opt/recon/data/recon.db

# Rebuild Qdrant vectors from concept JSONs
cd /opt/recon && source venv/bin/activate
python3 scripts/rebuild_qdrant.py
# Type REBUILD when prompted

14. Embedding Performance

TEI (Primary) vs Ollama (Fallback)

Metric	TEI (cortex:8090)	Ollama (cortex:11434)
Speed	~1,711 emb/sec	~8 emb/sec
Model	bge-m3	bge-m3
Dimensions	1024	1024
Batch size	128	1
Cosine similarity	0.999900	0.999900

TEI is ~214x faster than Ollama for embeddings. Always use TEI unless it's down.
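The speedup figure follows directly from the two throughput numbers. The short calculation below reproduces it and estimates embedding time for the 4,661 points reported under Current State, ignoring network and Qdrant upsert overhead (an assumption).

```python
TEI_RATE = 1711    # embeddings/sec, from the table above
OLLAMA_RATE = 8    # embeddings/sec, from the table above


def seconds_to_embed(count: int, rate: float) -> float:
    """Time to embed `count` items at `rate` embeddings per second."""
    return count / rate


speedup = TEI_RATE / OLLAMA_RATE
points = 4661  # Qdrant point count reported under "Current State"
tei_seconds = seconds_to_embed(points, TEI_RATE)
ollama_minutes = seconds_to_embed(points, OLLAMA_RATE) / 60

print(f"speedup: ~{speedup:.0f}x")
print(f"TEI: {tei_seconds:.1f} s, Ollama: {ollama_minutes:.1f} min")
```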

Qdrant Configuration

Collection: recon_knowledge
Distance: Cosine
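HNSW indexing threshold: 20,000 (Qdrant's indexing_threshold, which is expressed in kilobytes of vector data per segment rather than as a vector count; below it, brute-force search is used)
Current state: Brute-force (segments are under the threshold), which is normal and performant at current scale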

15. Content Hashing

PDF content: MD5(file_bytes) — stable across renames, detects exact duplicates
Web content: MD5(extracted_text) — deduplicates by content, not URL
Hash is used as the primary key in both SQLite tables and as the directory name for text/concept storage
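Both hashing rules can be expressed with `hashlib`. The sketch below streams file bytes so large PDFs are not loaded into memory at once; the function names and the chunk size are assumptions, as is UTF-8 encoding for web text.

```python
import hashlib


def md5_file(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's bytes, read in chunks (1 MiB by default)."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_text(text: str) -> str:
    """MD5 of extracted web text, encoded as UTF-8 (assumed encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Renaming a PDF leaves `md5_file` unchanged, and two URLs serving identical extracted text produce the same `md5_text`, which is what makes the hash usable as a deduplicating primary key.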

16. Source Type Handling

Source	Path Format	source_type	download_url	Badge
PDF	`/mnt/library/...`	document	`https://files.echo6.co/...`	PDF
Web	`https://...`	web	Original URL	Web
Intel	JSON feed	intel_feed	—	—

The generate_download_url() function in utils.py handles the routing:

URLs starting with http:// or https:// are returned as-is
File paths are converted to files.echo6.co URLs
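The routing rule can be sketched as follows. This is an illustration rather than the code in utils.py: the function name `sketch_download_url` is hypothetical, the defaults come from the book_server section of config.yaml, and percent-encoding the path is an assumption.

```python
from urllib.parse import quote


def sketch_download_url(
    path: str,
    base_url: str = "https://files.echo6.co",
    strip_prefix: str = "/mnt/library",
) -> str:
    """Return web URLs unchanged; map library file paths to the file server."""
    if path.startswith(("http://", "https://")):
        return path
    relative = path[len(strip_prefix):] if path.startswith(strip_prefix) else path
    # Percent-encode spaces and other unsafe characters, keeping "/" intact.
    return base_url.rstrip("/") + "/" + quote(relative.lstrip("/"))
```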

17. Lessons Learned

RECON Rebuild Lessons

Verify infrastructure before writing code. Check Qdrant, TEI, Ollama connectivity first.
Dimensions are 1024, NOT 384. BGE-M3 uses 1024-dimensional vectors. This caused silent failures in early builds.
TEI >> Ollama for embeddings. 1,711 vs 8 embeddings/sec. A 214x speedup that makes batch processing viable.
Dynamic discovery over hardcoded paths. Let the pipeline discover what's on disk rather than maintaining static file lists.
Web content uses the same pipeline. After text extraction, web and PDF content follow identical enrichment and embedding paths.
Sitemap > link-following. Sitemaps discover all pages reliably; BFS link-following misses orphaned pages and is slower.
Save to disk before DB operations. Concept JSONs are written to disk first, then the database is updated. This means recovery is always possible from the JSON files.
NFS over large file sets is slow. Scanning 13K PDFs over NFS takes ~30 minutes due to MD5 hashing over the network. Plan accordingly.
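The "save to disk before DB operations" lesson can be sketched as an atomic file write followed by the database update. The helper name `save_window` and the exact UPDATE statement are assumptions; only the ordering reflects the source.

```python
import json
import os
import sqlite3
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_window(
    concepts_dir: str,
    doc_hash: str,
    window: int,
    concepts: List[Dict[str, Any]],
    conn: sqlite3.Connection,
) -> Path:
    """Write window_N.json first, then record progress in SQLite."""
    out_dir = Path(concepts_dir) / doc_hash
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"window_{window}.json"

    # Write to a temp file in the same directory, then rename atomically,
    # so a crash never leaves a half-written JSON file behind.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(concepts, fh)
    os.replace(tmp_path, target)

    # Only after the file is safely on disk is the database touched.
    conn.execute(
        "UPDATE documents SET concepts_extracted = "
        "COALESCE(concepts_extracted, 0) + ? WHERE hash = ?",
        (len(concepts), doc_hash),
    )
    conn.commit()
    return target
```

If the database update fails, the JSON file is still present, so the state can be rebuilt from disk.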

Operational Gotchas

recon scan can appear stuck on large PDFs over NFS — it's hashing, not hung
Some PDFs have corrupt metadata that crashes PyPDF2 — the extractor catches this and falls back
Gemini rate limits hit with 16 workers — the KeyRotator distributes across 4 keys to mitigate
iptables-persistent hangs on interactive prompts in LXC containers — use manual persistence
The recon LXC has no tmux/screen — use nohup for long-running background tasks

18. Monitoring

Pipeline Status

# Quick status
recon status

# Dashboard (open in a browser)
# http://100.64.0.24:8420

# Tail logs
tail -f /opt/recon/logs/recon.log

# Pipeline run log (when running full background pipeline)
tail -f /opt/recon/pipeline.log

Health Checks

# Qdrant
curl -s http://100.64.0.14:6333/collections/recon_knowledge | python3 -m json.tool

# TEI
curl -s http://100.64.0.14:8090/info

# Ollama
curl -s http://100.64.0.14:11434/api/tags | python3 -m json.tool

# NFS mount
df -h /mnt/library

# Backup logs
tail -20 /opt/recon/logs/backup.log

Validation

# Quick validation
recon validate

# Deep validation (checks all files on disk)
recon validate --deep

19. Current State

As of 2026-02-16

Pipeline Progress

Status	Count
Catalogued	10,162
Queued	8,982
Extracted	872
Complete	302
Failed	2

Vector Database

Qdrant points: 4,661 (3,144 PDF + 1,517 web)
Segments: 8
Indexing: Brute-force (under 20K threshold)

Active Processing

Full pipeline running in background via nohup — extracting through the 8,982 queued documents. Expected to take ~40 hours for full extract -> enrich -> embed cycle.

Backups

Schedule: Every 6 hours (full) + every 2 hours (DB only)
Destination: Contabo VPS (/opt/backups/recon/)
Last verified: 2026-02-16 (220M total backup size)

20. Dependencies

System Packages

Python 3.11+
pdftotext (poppler-utils)
tesseract-ocr
sqlite3

Python Packages (key)

Package	Version	Purpose
Flask	3.1.2	Web dashboard
google-generativeai	0.8.6	Gemini API for enrichment
qdrant-client	1.16.2	Vector database client
PyPDF2	3.0.1	PDF text extraction
trafilatura	2.0.0	Web content extraction
beautifulsoup4	4.14.3	HTML parsing for crawler
lxml	6.0.2	XML/HTML parsing
pytesseract	0.3.13	OCR fallback
requests	2.32.5	HTTP client
PyYAML	6.0.3	Config file parsing

Full list in requirements.txt.
