Initial commit: RECON codebase baseline

Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Matt 2026-04-14 14:57:23 +00:00
commit 563c16bb71
59 changed files with 18327 additions and 0 deletions

26
.gitignore vendored Normal file

@ -0,0 +1,26 @@
# Python
venv/
__pycache__/
*.pyc
*.pyo
# Secrets
.env
# Runtime data
data/
logs/
pipeline.log
recon.db
# Backups
*.bak
*.bak-*
*.bak.*
*.bak2.*
# Junk
-.png
# OS
.DS_Store

785
PROJECT-BIBLE.md Normal file

@ -0,0 +1,785 @@
# RECON Project Bible v2.0
*Last updated: 2026-02-16*
---
## 1. Mission Statement
RECON (Reconnaissance, Extraction, Conceptualization, and Operationalization of kNowledge) is a knowledge extraction pipeline that processes PDFs and web content into structured concepts stored in a Qdrant vector database. These concepts power Aurora, the RAG-enabled AI assistant running on OpenWebUI.
**The core loop:** Content in (PDF/web) -> Text extracted -> Concepts enriched (Gemini) -> Vectors embedded (TEI/BGE-M3) -> Searchable knowledge (Qdrant) -> Aurora answers questions with citations.
---
## 2. Infrastructure
### Hosts
| Host | IP (Tailscale) | Role |
|------|---------------|------|
| recon LXC | 100.64.0.24 (CT 130 on toc) | RECON application, dashboard, pipeline |
| cortex VM | 100.64.0.14 (VM 150 on toc) | Qdrant, TEI, Ollama, OpenWebUI |
| pi-nas | 100.64.0.21 (192.168.1.245) | NFS file server for PDF library |
| Contabo VPS | 100.64.0.1 (5.189.158.149) | Backup destination |
### Services on cortex (100.64.0.14)
| Service | Port | Purpose |
|---------|------|---------|
| Qdrant | 6333 | Vector database (recon_knowledge collection) |
| TEI (text-embeddings-inference) | 8090 | Embedding server (bge-m3, 1024-dim, ~1,711 emb/sec) |
| Ollama | 11434 | LLM server + fallback embeddings (~8 emb/sec) |
| OpenWebUI | 8080 | Aurora chat interface (ai.echo6.co) |
### Services on recon LXC (100.64.0.24)
| Service | Port | Purpose |
|---------|------|---------|
| RECON Dashboard | 8420 | Web UI + API for pipeline management |
| File Server | 8888 | PDF downloads (files.echo6.co) |
### NFS Mount
```
pi-nas:/export/library -> /mnt/library (22TB, rw, NFSv3)
```
Contains ~13,000+ PDFs across:
- `Survival-Companion-Library/` (~12,900 PDFs in ~220 subdirectories)
- `Army_Pubs/` (~160 military field manuals)
- Other: `Gaming/`, `Reference/`, `Technical/`
---
## 3. Architecture Overview
```
                     /mnt/library/ (NFS)
                            |
                      [recon scan]
                            |
                    catalogue (SQLite)
                            |
                      [recon queue]
                            |
  +-----------+     [recon extract]      +-----------+
  | PyPDF2    |-->  data/text/           | Gemini    |
  | pdftotext |     {hash}/page_N.txt    | Flash     |
  | tesseract |             |            | 4 keys    |
  +-----------+     [recon enrich]       +-----------+
                            |
                      data/concepts/
                   {hash}/window_N.json
                            |
                      [recon embed]
                            |
                +----------+-----------+
                | TEI (primary)        |
                | bge-m3, 1024-dim     |
                | 1,711 emb/sec        |
                +----------+-----------+
                            |
                   Qdrant (cortex:6333)
                recon_knowledge collection
                            |
                    Aurora (OpenWebUI)
                  RAG search + citations
```
### Web Content Path
```
URL(s) ──> [recon ingest-url / crawl]
                      |
            trafilatura extraction
          chunk into ~2000-word pages
                      |
          data/text/{hash}/page_N.txt
         (enters at "extracted" status)
                      |
              [enrich] -> [embed]
              (same as PDF path)
```
---
## 4. Pipeline Stages
### Status Flow
```
catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete
\-> failed
```
Web content enters at `extracted` status (text already extracted by trafilatura).
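The status flow can be read as a small state machine. The sketch below is illustrative only: `PIPELINE_ORDER`, `next_status`, and `is_valid_transition` are hypothetical names, and it assumes that any non-terminal status may move to `failed`, which the diagram does not spell out.

```python
# Hypothetical transition table derived from the status flow diagram.
PIPELINE_ORDER = [
    "catalogued", "queued", "extracting", "extracted",
    "enriching", "enriched", "embedding", "complete",
]


def next_status(current: str) -> str:
    """Return the status that follows `current` on the happy path."""
    idx = PIPELINE_ORDER.index(current)  # raises ValueError for unknown statuses
    if idx == len(PIPELINE_ORDER) - 1:
        raise ValueError("'complete' is terminal")
    return PIPELINE_ORDER[idx + 1]


def is_valid_transition(current: str, new: str) -> bool:
    """Allow the next happy-path step, or 'failed' from any non-terminal status."""
    if current not in PIPELINE_ORDER or current == "complete":
        return False
    if new == "failed":  # assumption: failure is possible at every stage
        return True
    return new == next_status(current)
```

Under this reading, web content would simply be inserted with `extracted` as its first status, and `next_status("extracted")` yields `enriching`.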
### Stage Details
| Stage | Tool | Input | Output | Speed |
|-------|------|-------|--------|-------|
| Scan | `recon scan` | /mnt/library/*.pdf | catalogue table | ~13K PDFs in ~30 min |
| Queue | `recon queue` | catalogue entries | documents table (status=queued) | Instant |
| Extract | `recon extract` | PDF files | data/text/{hash}/page_NNNN.txt | 4 workers, ~200/hr |
| Enrich | `recon enrich` | Text pages (10-page windows) | data/concepts/{hash}/window_N.json | 16 workers, 4 Gemini keys |
| Embed | `recon embed` | Concept JSONs | Qdrant vectors | TEI: 1,711 emb/sec |
### Extraction Fallback Chain
1. **PyPDF2** (fast, clean text) -> 2. **pdftotext** (handles complex layouts) -> 3. **Tesseract OCR** (scanned documents)
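A fallback chain like this can be expressed as an ordered list of extractor callables. The following is a minimal sketch, not the code in `extractor.py`: `extract_with_fallback` and the `min_chars` threshold for deciding that a result is usable are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each extractor takes a PDF path and returns the extracted text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    pdf_path: str,
    extractors: Sequence[Tuple[str, Extractor]],
    min_chars: int = 50,  # assumed threshold for "usable" text
) -> Tuple[Optional[str], str]:
    """Try each extractor in order; return (name, text) from the first usable result."""
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            # A crash in one backend (e.g. corrupt metadata) should not stop the chain.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    # No backend produced usable text.
    return None, ""
```

The caller would pass the three backends in order of cost, cheapest first, so OCR only runs when the faster methods return little or nothing.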
### Enrichment Details
- Model: `gemini-2.0-flash`
- Window size: 10 pages per API call (configurable)
- Workers: 16 concurrent (4 API keys x 4 workers each)
- Output format: JSON array of concept objects
- **CRITICAL**: Concept JSONs are saved to disk BEFORE any database operations
- Key rotation via `KeyRotator` class distributing across 4 Gemini API keys
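The `KeyRotator` class itself is not shown in this document. As a rough idea of what distributing 16 workers across 4 keys could look like, here is a thread-safe round-robin sketch; the class name `RoundRobinKeys` and its `next_key` method are hypothetical and may differ from the real implementation.

```python
import itertools
import threading
from typing import Iterable


class RoundRobinKeys:
    """Hand out API keys in a fixed rotation, safely across threads."""

    def __init__(self, keys: Iterable[str]):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        # The lock keeps concurrent workers from advancing the iterator at once.
        with self._lock:
            return next(self._cycle)
```

With four keys and sixteen workers calling `next_key()` before each request, load spreads evenly at roughly four workers per key.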
### Embedding Details
- **Primary**: TEI at cortex:8090 (bge-m3 model, 1024 dimensions, ~1,711 embeddings/sec)
- **Fallback**: Ollama at cortex:11434 (bge-m3 model, ~8 embeddings/sec)
- Batch size: 128 embeddings per TEI request
- Distance metric: Cosine similarity
- **CRITICAL**: Dimensions are 1024 (bge-m3), NOT 384. Getting this wrong creates silent failures.
---
## 5. Directory Structure
```
/opt/recon/                      # Application root
  recon.py                       # CLI entry point
  config.yaml                    # Central configuration
  .env                           # Gemini API keys (4 keys)
  requirements.txt               # Python dependencies
  PROJECT-BIBLE.md               # This file
  README.md                      # Quick-start reference
  run-full-pipeline.sh           # Background pipeline runner
  lib/                           # Core modules
    __init__.py
    api.py                       # Flask web dashboard + API (port 8420)
    crawler.py                   # Site crawler (sitemap + BFS link-following)
    embedder.py                  # Concept -> vector embedding (TEI/Ollama -> Qdrant)
    enricher.py                  # Text -> concept extraction (Gemini)
    extractor.py                 # PDF -> text extraction (PyPDF2/pdftotext/OCR)
    ingester.py                  # ARGUS intel feed intake
    status.py                    # SQLite DB operations (catalogue + documents)
    utils.py                     # Config, hashing, URL generation, logging
    web_scraper.py               # URL -> text extraction (trafilatura)
  scripts/                       # Operational scripts
    backup.sh                    # Automated backup to Contabo (cron every 6h)
    rebuild_qdrant.py            # Nuclear recovery: re-embed all concepts
    validate.py                  # Pipeline consistency validation
  data/                          # Pipeline data (on local disk)
    recon.db                     # SQLite status database
    text/                        # Extracted text
      {content_hash}/
        meta.json                # Document metadata
        page_0001.txt            # Page text (4-digit, 1-indexed)
        page_0002.txt
        ...
    concepts/                    # Enriched concepts (**BACK THESE UP**)
      {content_hash}/
        window_1.json            # Concept JSON array (10-page window)
        window_2.json
        ...
    intel/                       # ARGUS intel feeds
  logs/                          # Application logs
    recon.log                    # Main rotating log
    backup.log                   # Backup operation log
    backup_cron.log              # Cron backup log
  venv/                          # Python virtual environment
```
---
## 6. Database Schema
### SQLite (data/recon.db)
Two tables in WAL mode with thread-local connections.
#### catalogue
| Column | Type | Description |
|--------|------|-------------|
| hash | TEXT PK | MD5 content hash |
| filename | TEXT | Original filename |
| path | TEXT | Full filesystem path |
| size_bytes | INTEGER | File size |
| source | TEXT | Top-level directory (e.g., "Survival-Companion-Library") |
| category | TEXT | Second-level directory (e.g., "Bushcraft") |
| status | TEXT | "catalogued" or "processed" |
| discovered_at | TEXT | ISO timestamp |
#### documents
| Column | Type | Description |
|--------|------|-------------|
| hash | TEXT PK | MD5 content hash |
| filename | TEXT | Original filename |
| path | TEXT | Full path or URL |
| size_bytes | INTEGER | File/content size |
| page_count | INTEGER | Number of text pages |
| book_title | TEXT | Gemini-extracted title |
| book_author | TEXT | Gemini-extracted author |
| status | TEXT | Pipeline status |
| pages_extracted | INTEGER | Pages extracted |
| concepts_extracted | INTEGER | Concepts generated |
| vectors_inserted | INTEGER | Vectors in Qdrant |
| error_message | TEXT | Last error (if failed) |
| retry_count | INTEGER | Failure retry count |
| created_at | TEXT | ISO timestamp |
| updated_at | TEXT | ISO timestamp |
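For orientation, the `documents` table above could be created along these lines. This is a sketch under assumptions: `open_db` and `DOCUMENTS_DDL` are hypothetical names, the column list is copied from the table, and the real schema in `status.py` may add defaults, constraints, or indexes that are not documented here.

```python
import sqlite3

# Column names and types copied from the `documents` table description.
DOCUMENTS_DDL = """
CREATE TABLE IF NOT EXISTS documents (
    hash               TEXT PRIMARY KEY,
    filename           TEXT,
    path               TEXT,
    size_bytes         INTEGER,
    page_count         INTEGER,
    book_title         TEXT,
    book_author        TEXT,
    status             TEXT,
    pages_extracted    INTEGER,
    concepts_extracted INTEGER,
    vectors_inserted   INTEGER,
    error_message      TEXT,
    retry_count        INTEGER,
    created_at         TEXT,
    updated_at         TEXT
)
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database in WAL mode and ensure the table exists."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(DOCUMENTS_DDL)
    conn.commit()
    return conn
```

Thread-local connections would be layered on top, for example by calling `open_db` once per thread and caching the result in a `threading.local()` object.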
### Qdrant (cortex:6333)
Collection: `recon_knowledge`
| Field | Type | Description |
|-------|------|-------------|
| vector | float[1024] | BGE-M3 embedding |
| doc_hash | keyword | Links to SQLite document |
| filename | keyword | Source filename |
| book_title | keyword | Document title |
| book_author | keyword | Author name |
| source_type | keyword | "document", "web", or "intel_feed" |
| download_url | keyword | files.echo6.co URL or source URL |
| content | text | Concept text (searchable) |
| summary | text | Concept summary |
| title | keyword | Concept title |
| domain | keyword | Knowledge domain |
| subdomain | keyword | Knowledge subdomain |
| keywords | keyword[] | Concept keywords |
| skill_level | keyword | beginner/intermediate/advanced/expert |
| key_facts | text[] | Key facts list |
| scenario_applicable | text[] | Applicable scenarios |
| cross_domain_tags | keyword[] | Cross-references |
| chapter | keyword | Source chapter |
| page_ref | keyword | Source page reference |
| notes | text | Additional notes |
| _window | integer | Source window number |
| _start_page | integer | Starting page in document |
| verification_status | keyword | "unverified" (default) |
| credibility_score | float | 0.7 (default) |
| language | keyword | "en" (default) |
---
## 7. CLI Reference
```
recon <command> [options]
```
| Command | Description | Key Options |
|---------|-------------|-------------|
| `scan` | Scan library, catalogue new PDFs | `--path` |
| `queue` | Queue catalogued docs for processing | `--hash`, `--source`, `--category`, `--limit` |
| `extract` | Extract text from queued PDFs | `--workers` |
| `enrich` | Enrich extracted text via Gemini | `--workers`, `--limit` |
| `embed` | Embed concepts into Qdrant | `--workers`, `--limit` |
| `run` | Full pipeline (extract->enrich->embed) | `--workers`, `--enrich-workers`, `--limit` |
| `status` | Show pipeline status counts | |
| `catalogue` | Browse catalogue | `--sources`, `--categories`, `--source`, `--limit` |
| `failures` | Show failed documents | `--retry` |
| `search` | Semantic search | `query`, `--limit` |
| `upload` | Upload PDFs | `--file`, `--dir`, `--category` |
| `ingest-url` | Ingest web content | `url`, `--file`, `--category`, `--process` |
| `crawl` | Crawl a site | `url`, `--category`, `--include`, `--exclude`, `--max-pages`, `--dry-run`, `--process` |
| `validate` | Check pipeline consistency | `--deep` |
| `rebuild` | Rebuild Qdrant from concept JSONs | |
| `serve` | Start web dashboard (port 8420) | |
| `ingest` | Ingest ARGUS intel JSON | `--file`, `--directory` |
### Common Workflows
```bash
# Full library processing
recon scan && recon queue && recon run
# Ingest a single web page with full processing
recon ingest-url "https://example.com/article" --category "Reference" --process
# Dry-run crawl to preview URLs
recon crawl "https://docs.example.com" --include /docs/ --dry-run
# Full crawl with processing
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
# Upload a PDF
recon upload --file /path/to/document.pdf --category "Technical"
# Check what failed and retry
recon failures
recon failures --retry
```
---
## 8. Web Dashboard
### URL
```
http://100.64.0.24:8420
```
### Pages
| Route | Page | Description |
|-------|------|-------------|
| `/` | Dashboard | Knowledge base overview: document/concept/vector counts, source table, domain distribution bars, skill level breakdown, Qdrant health, recent completions, pipeline status |
| `/search` | Search | Semantic search with score bars, Web/PDF badges, download links |
| `/catalogue` | Catalogue | Browse all catalogued PDFs with source/category filters |
| `/upload` | Upload | PDF upload form with category datalist, recent uploads table |
| `/web-ingest` | Web Ingest | Two tabs: Single/Batch URL ingest, Site Crawl with preview |
| `/failures` | Failures | Failed documents with error messages and retry button |
### API Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/search?q=...&limit=N` | Semantic search |
| GET | `/api/catalogue?source=...&limit=N` | Browse catalogue |
| GET | `/api/knowledge-stats` | Dashboard aggregation (totals, sources, domains, skills, Qdrant health) |
| POST | `/api/upload` | Upload PDF (multipart: file + category) |
| GET | `/api/upload/<hash>/status` | Check upload processing status |
| GET | `/api/upload/categories` | List available categories |
| POST | `/api/ingest-url` | Ingest single URL (json: url, category, process) |
| POST | `/api/ingest-urls` | Ingest multiple URLs (json: urls, category, process) |
| POST | `/api/crawl` | Crawl a site (json: url, category, include, exclude, max_pages, dry_run) |
| GET | `/api/crawl/<id>/status` | Poll crawl/pipeline progress |
| POST | `/api/failures/retry` | Re-queue all failed documents |
### Dashboard Features
- **Auto-refresh**: Every 30 seconds via JavaScript fetch
- **Knowledge cards**: Total documents, concepts, vectors, pages
- **Source table**: Per-source breakdown with document/concept/vector counts and PDF/WEB type badges
- **Domain distribution**: Horizontal bars showing top knowledge domains
- **Skill level breakdown**: beginner/intermediate/advanced/expert percentages
- **Qdrant health**: Connection status, points count, segments
- **Pipeline status**: Compact display of documents in each stage
- **Crawl polling**: Real-time stage tracking (ingesting -> enriching -> embedding)
---
## 9. Concept JSON Schema
Each window file (`data/concepts/{hash}/window_N.json`) contains a JSON array of concept objects:
```json
[
  {
    "title": "Water Purification Methods",
    "content": "Detailed text about the concept...",
    "summary": "Brief summary of the concept",
    "domain": "Survival",
    "subdomain": "Water",
    "keywords": ["purification", "filtration", "boiling"],
    "skill_level": "beginner",
    "key_facts": ["Boiling kills 99.9% of pathogens", "..."],
    "scenario_applicable": ["wilderness survival", "disaster preparedness"],
    "cross_domain_tags": ["health", "camping"],
    "chapter": "Chapter 3",
    "page_ref": "pp. 45-48",
    "notes": "Additional context or caveats",
    "_window": 1,
    "_start_page": 1
  }
]
```
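A consumer of these window files may want to reject malformed entries before embedding. The sketch below shows one way to do that; `validate_concepts` is a hypothetical helper, and the choice of `REQUIRED_KEYS` is an assumption, since this document does not say which fields are mandatory.

```python
import json

# Assumed minimum set of fields; the real pipeline may require more or fewer.
REQUIRED_KEYS = {"title", "content", "summary", "domain"}
# Skill levels as listed in the Qdrant payload table.
VALID_SKILL_LEVELS = {"beginner", "intermediate", "advanced", "expert"}


def validate_concepts(raw: str) -> list:
    """Parse a window file's JSON text and return the concepts that pass basic checks."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("window file must contain a JSON array")
    valid = []
    for concept in data:
        if not isinstance(concept, dict):
            continue
        if not REQUIRED_KEYS <= concept.keys():
            continue  # one or more required fields missing
        level = concept.get("skill_level")
        if level is not None and level not in VALID_SKILL_LEVELS:
            continue
        valid.append(concept)
    return valid
```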
---
## 10. Web Ingestion
### Single URL
```bash
recon ingest-url "https://example.com/article" --category "Reference" --process
```
Or via API:
```bash
curl -X POST http://100.64.0.24:8420/api/ingest-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article", "category": "Reference", "process": true}'
```
### Site Crawl
```bash
# Preview what would be crawled
recon crawl "https://docs.example.com" --include /docs/ --dry-run
# Full crawl
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
```
### How It Works
1. **URL discovery** (crawler.py):
- Tries sitemap.xml first (preferred, finds all pages)
- Falls back to BFS link-following if no sitemap
- Filters by include/exclude patterns
2. **Content extraction** (web_scraper.py):
- Uses trafilatura for clean text extraction
- Chunks into ~2,000-word pages
- Same output format as PDF extractor: `data/text/{hash}/page_NNNN.txt`
- Content hash is MD5 of extracted text (deduplication)
3. **Pipeline integration**:
- Web content enters at `extracted` status (no PDF extraction needed)
- Enrichment and embedding proceed identically to PDF content
- Qdrant vectors get `source_type: "web"` and `download_url` pointing to source URL
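The chunking step in item 2 can be sketched as follows. This is a simplification rather than the code in `web_scraper.py`: `chunk_words` and `page_filename` are hypothetical names, and splitting on whitespace discards paragraph boundaries that the real scraper may preserve.

```python
from typing import List


def chunk_words(text: str, words_per_page: int = 2000) -> List[str]:
    """Split text into consecutive pages of at most `words_per_page` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_page])
        for i in range(0, len(words), words_per_page)
    ]


def page_filename(page_number: int) -> str:
    """Build the 4-digit, 1-indexed page file name used under data/text/{hash}/."""
    return f"page_{page_number:04d}.txt"
```

A 4,500-word article would produce three pages under this scheme, written as `page_0001.txt` through `page_0003.txt`.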
---
## 11. Configuration Reference
### config.yaml
```yaml
# Root path for the PDF library (NFS mount from pi-nas)
library_root: /mnt/library

processing:
  extract_workers: 4        # Concurrent PDF extraction threads
  enrich_workers: 16        # Concurrent Gemini enrichment threads (4 keys x 4)
  embed_workers: 4          # Concurrent embedding threads
  enrich_window_size: 5     # Pages per enrichment window (sent to Gemini)
  embed_batch_size: 500     # Vectors per Qdrant upsert batch
  rate_limit_delay: 0.1     # Delay between Gemini API calls (seconds)
  max_retries: 5            # Max retries for failed documents

embedding:
  backend: tei              # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
  tei_host: 100.64.0.14     # TEI server (cortex)
  tei_port: 8090            # TEI HTTP port
  ollama_host: 100.64.0.14  # Ollama server (cortex), fallback only
  ollama_port: 11434        # Ollama HTTP port
  model: bge-m3             # Embedding model name
  dimensions: 1024          # CRITICAL: bge-m3 is 1024-dim, NOT 384
  batch_size: 128           # Embeddings per TEI batch request

vector_db:
  host: 100.64.0.14         # Qdrant server (cortex)
  port: 6333                # Qdrant HTTP port
  collection: recon_knowledge  # Collection name

gemini:
  model: gemini-2.0-flash   # Gemini model for enrichment
  response_mime_type: application/json  # Force JSON output

web:
  port: 8420                # Dashboard HTTP port
  host: 0.0.0.0             # Bind to all interfaces

paths:
  base: /opt/recon                    # Application root
  data: /opt/recon/data               # Data directory
  text: /opt/recon/data/text          # Extracted text output
  concepts: /opt/recon/data/concepts  # Enriched concept JSONs
  intel: /opt/recon/data/intel        # ARGUS intel feeds
  logs: /opt/recon/logs               # Log files
  db: /opt/recon/data/recon.db        # SQLite database

book_server:
  base_url: https://files.echo6.co    # Public URL prefix for PDF downloads
  strip_prefix: /mnt/library          # Path prefix to strip when generating URLs
  upload_paths:                       # Category -> filesystem path mapping for uploads
    Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
    Military Doctrine: /mnt/library/Army_Pubs/Uploads
    Gaming: /mnt/library/Gaming
    Reference: /mnt/library/Reference
    Technical: /mnt/library/Technical
    default: /mnt/library             # Fallback for unknown categories

web_scraper:
  words_per_page: 2000      # Target words per page chunk
  fetch_timeout: 30         # HTTP request timeout (seconds)
  rate_limit_delay: 1.0     # Delay between URL fetches (seconds)
  max_batch_size: 50        # Max URLs per batch ingest
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"

crawler:
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
  fetch_timeout: 30         # HTTP request timeout (seconds)
  rate_limit_delay: 1.0     # Delay between page fetches (seconds)
  max_pages: 500            # Max pages to discover per crawl
  max_depth: 3              # Max link-following depth (BFS only)
  default_exclude:          # URL patterns to always skip
    - /search
    - /404
    - /login
    - /signup
    - /auth/
    - /api/
    - /assets/
    - /static/
```
### .env
```
GEMINI_KEY_1=<key>
GEMINI_KEY_2=<key>
GEMINI_KEY_3=<key>
GEMINI_KEY_4=<key>
```
Four Gemini API keys rotated across 16 enrichment workers via `KeyRotator`.
---
## 12. Aurora RAG Integration
Aurora is the RAG-enabled AI assistant running on OpenWebUI (ai.echo6.co).
### How It Works
1. User asks a question in OpenWebUI
2. Aurora's OpenWebUI function/filter embeds the query via TEI (cortex:8090)
3. Searches Qdrant `recon_knowledge` collection for similar concepts
4. Top results are injected into the prompt as context
5. JOSIEFIED Qwen3 8B generates an answer with citations
6. Citations include `download_url` links (PDF files via files.echo6.co, web content via source URL)
### Key Components
- **Embedding**: Same TEI endpoint + bge-m3 model as RECON pipeline (ensures vector compatibility)
- **Search**: Cosine similarity, top-5 results by default
- **LLM**: `goekdenizguelmez/JOSIEFIED-Qwen3:8b` on Ollama (cortex:11434)
- **Citations**: Each result includes `download_url` — either `https://files.echo6.co/...` for PDFs or the original URL for web content
---
## 13. Backup & Recovery
### Automated Backups
**Script**: `/opt/recon/scripts/backup.sh`
**Destination**: Contabo VPS (`root@100.64.0.1:/opt/backups/recon/`)
**Schedule** (cron):
- Every 6 hours: Full backup (concepts, text, DB, config, intel)
- Every 2 hours (off-hours): SQLite DB snapshot only
### What's Backed Up
| Component | Size | Priority | Notes |
|-----------|------|----------|-------|
| data/concepts/ | ~11M | **CRITICAL** | $130+ of Gemini API work |
| data/text/ | ~203M | High | Hours to regenerate |
| data/recon.db | ~6.5M | **CRITICAL** | All pipeline state |
| config.yaml + .env | ~2K | Important | Configuration |
| data/intel/ | ~4K | Low | Intel feed data |
### What's NOT Backed Up
- **Qdrant vectors**: Rebuilt from concept JSONs in ~10 minutes via `recon rebuild`
- **PDF library**: Lives on pi-nas NFS, backed up separately
- **venv/**: Recreated from requirements.txt
### Recovery Procedures
```bash
# Restore from backup
scp -r root@100.64.0.1:/opt/backups/recon/concepts/ /opt/recon/data/concepts/
scp -r root@100.64.0.1:/opt/backups/recon/text/ /opt/recon/data/text/
scp root@100.64.0.1:/opt/backups/recon/recon_LATEST.db /opt/recon/data/recon.db
# Rebuild Qdrant vectors from concept JSONs
cd /opt/recon && source venv/bin/activate
python3 scripts/rebuild_qdrant.py
# Type REBUILD when prompted
```
---
## 14. Embedding Performance
### TEI (Primary) vs Ollama (Fallback)
| Metric | TEI (cortex:8090) | Ollama (cortex:11434) |
|--------|-------------------|----------------------|
| Speed | ~1,711 emb/sec | ~8 emb/sec |
| Model | bge-m3 | bge-m3 |
| Dimensions | 1024 | 1024 |
| Batch size | 128 | 1 |
| Cosine similarity | 0.999900 | 0.999900 |
TEI is ~214x faster than Ollama for embeddings. Always use TEI unless it's down.
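The ~214x figure is simply the ratio of the two measured rates. A quick back-of-the-envelope helper, using the approximate rates from the table and ignoring network and upsert overhead, might look like this (`embed_seconds` is a hypothetical name):

```python
# Approximate rates taken from the comparison table (embeddings per second).
TEI_RATE = 1711
OLLAMA_RATE = 8


def embed_seconds(n_vectors: int, rate_per_sec: float) -> float:
    """Estimate wall-clock seconds to embed `n_vectors` at a steady rate."""
    return n_vectors / rate_per_sec


# Ratio of the two backends: 1711 / 8 = 213.875, which rounds to 214.
speedup = TEI_RATE / OLLAMA_RATE
```

For the 4,661 points reported under Current State, this estimate gives about 2.7 seconds on TEI versus roughly 583 seconds on Ollama.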
### Qdrant Configuration
- Collection: `recon_knowledge`
- Distance: Cosine
- HNSW indexing threshold: 20,000 (below this, brute-force search is used)
- Current state: Brute-force (under 20K vectors) — this is normal and performant at current scale
---
## 15. Content Hashing
- **PDF content**: `MD5(file_bytes)` — stable across renames, detects exact duplicates
- **Web content**: `MD5(extracted_text)` — deduplicates by content, not URL
- Hash is used as the primary key in both SQLite tables and as the directory name for text/concept storage
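Both hashing rules map directly onto `hashlib`. The sketch below uses hypothetical helper names (`hash_file`, `hash_text`), not the actual functions in `utils.py`; reading the file in chunks and encoding text as UTF-8 are assumptions.

```python
import hashlib


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's raw bytes, read in chunks so large PDFs are not loaded whole."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            md5.update(block)
    return md5.hexdigest()


def hash_text(text: str) -> str:
    """MD5 of extracted text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Because the file hash depends only on bytes, renaming or moving a PDF leaves its hash unchanged, while two URLs serving identical extracted text collapse to a single document.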
---
## 16. Source Type Handling
| Source | Path Format | source_type | download_url | Badge |
|--------|-------------|-------------|--------------|-------|
| PDF | `/mnt/library/...` | document | `https://files.echo6.co/...` | PDF |
| Web | `https://...` | web | Original URL | Web |
| Intel | JSON feed | intel_feed | — | — |
The `generate_download_url()` function in utils.py handles the routing:
- URLs starting with `http://` or `https://` are returned as-is
- File paths are converted to `files.echo6.co` URLs
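The routing rule can be sketched as below. This is not the actual `generate_download_url()`: the function name `download_url_for` is hypothetical, the defaults mirror `book_server.base_url` and `book_server.strip_prefix` from the configuration, and percent-encoding the path is an assumption about how spaces in filenames are handled.

```python
from urllib.parse import quote


def download_url_for(
    path: str,
    base_url: str = "https://files.echo6.co",
    strip_prefix: str = "/mnt/library",
) -> str:
    """Return web URLs unchanged; map library file paths onto the file server."""
    if path.startswith(("http://", "https://")):
        return path
    # Drop the library root so only the path relative to the library remains.
    relative = path[len(strip_prefix):] if path.startswith(strip_prefix) else path
    # quote() keeps "/" intact and escapes characters such as spaces (assumed behaviour).
    return base_url.rstrip("/") + "/" + quote(relative.lstrip("/"))
```

For instance, `/mnt/library/Army_Pubs/FM 21-76.pdf` would become `https://files.echo6.co/Army_Pubs/FM%2021-76.pdf` under these assumptions.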
---
## 17. Lessons Learned
### RECON Rebuild Lessons
1. **Verify infrastructure before writing code.** Check Qdrant, TEI, Ollama connectivity first.
2. **Dimensions are 1024, NOT 384.** BGE-M3 uses 1024-dimensional vectors. This caused silent failures in early builds.
3. **TEI >> Ollama for embeddings.** 1,711 vs 8 embeddings/sec. A 214x speedup that makes batch processing viable.
4. **Dynamic discovery over hardcoded paths.** Let the pipeline discover what's on disk rather than maintaining static file lists.
5. **Web content uses the same pipeline.** After text extraction, web and PDF content follow identical enrichment and embedding paths.
6. **Sitemap > link-following.** Sitemaps discover all pages reliably; BFS link-following misses orphaned pages and is slower.
7. **Save to disk before DB operations.** Concept JSONs are written to disk first, then the database is updated. This means recovery is always possible from the JSON files.
8. **NFS over large file sets is slow.** Scanning 13K PDFs over NFS takes ~30 minutes due to MD5 hashing over the network. Plan accordingly.
### Operational Gotchas
- `recon scan` can appear stuck on large PDFs over NFS — it's hashing, not hung
- Some PDFs have corrupt metadata that crashes PyPDF2 — the extractor catches this and falls back
- Gemini rate limits hit with 16 workers — the `KeyRotator` distributes across 4 keys to mitigate
- `iptables-persistent` hangs on interactive prompts in LXC containers — use manual persistence
- The recon LXC has no tmux/screen — use `nohup` for long-running background tasks
---
## 18. Monitoring
### Pipeline Status
```bash
# Quick status
recon status
# Dashboard
http://100.64.0.24:8420
# Tail logs
tail -f /opt/recon/logs/recon.log
# Pipeline run log (when running full background pipeline)
tail -f /opt/recon/pipeline.log
```
### Health Checks
```bash
# Qdrant
curl -s http://100.64.0.14:6333/collections/recon_knowledge | python3 -m json.tool
# TEI
curl -s http://100.64.0.14:8090/info
# Ollama
curl -s http://100.64.0.14:11434/api/tags | python3 -m json.tool
# NFS mount
df -h /mnt/library
# Backup logs
tail -20 /opt/recon/logs/backup.log
```
### Validation
```bash
# Quick validation
recon validate
# Deep validation (checks all files on disk)
recon validate --deep
```
---
## 19. Current State
*As of 2026-02-16*
### Pipeline Progress
| Status | Count |
|--------|-------|
| Catalogued | 10,162 |
| Queued | 8,982 |
| Extracted | 872 |
| Complete | 302 |
| Failed | 2 |
### Vector Database
- Qdrant points: 4,661 (3,144 PDF + 1,517 web)
- Segments: 8
- Indexing: Brute-force (under 20K threshold)
### Active Processing
Full pipeline running in background via `nohup` — extracting through the 8,982 queued documents. Expected to take ~40 hours for full extract -> enrich -> embed cycle.
### Backups
- Schedule: Every 6 hours (full) + every 2 hours (DB only)
- Destination: Contabo VPS (`/opt/backups/recon/`)
- Last verified: 2026-02-16 (220M total backup size)
---
## 20. Dependencies
### System Packages
- Python 3.11+
- pdftotext (poppler-utils)
- tesseract-ocr
- sqlite3
### Python Packages (key)
| Package | Version | Purpose |
|---------|---------|---------|
| Flask | 3.1.2 | Web dashboard |
| google-generativeai | 0.8.6 | Gemini API for enrichment |
| qdrant-client | 1.16.2 | Vector database client |
| PyPDF2 | 3.0.1 | PDF text extraction |
| trafilatura | 2.0.0 | Web content extraction |
| beautifulsoup4 | 4.14.3 | HTML parsing for crawler |
| lxml | 6.0.2 | XML/HTML parsing |
| pytesseract | 0.3.13 | OCR fallback |
| requests | 2.32.5 | HTTP client |
| PyYAML | 6.0.3 | Config file parsing |
Full list in `requirements.txt`.

89
README.md Normal file

@ -0,0 +1,89 @@
# RECON -- Knowledge Extraction Pipeline
Extracts structured knowledge from PDFs and web content into a Qdrant vector database for RAG retrieval by Aurora.
## Quick Start
```bash
# Activate
cd /opt/recon && source venv/bin/activate
# Scan library for new PDFs
recon scan
# Queue and process
recon queue
recon extract
recon enrich
recon embed
# Or run full pipeline
recon run
# Ingest a web page
recon ingest-url "https://example.com/article" --category "Category" --process
# Crawl an entire docs site
recon crawl "https://docs.example.com" --include /docs/ --category "Category" --process
# Upload a PDF
recon upload --file /path/to/document.pdf --category "Category"
# Search
recon search "water purification methods"
# Check status
recon status
recon failures
```
## Dashboard
http://100.64.0.24:8420
## Services
| Service | Location | Purpose |
|---------|----------|---------|
| RECON Dashboard | recon:8420 | Pipeline management + API |
| Qdrant | cortex:6333 | Vector database |
| TEI | cortex:8090 | Embeddings (1,711/sec) |
| Ollama | cortex:11434 | Chat + fallback embeddings |
| OpenWebUI | cortex:8080 (ai.echo6.co) | Aurora chat with RAG |
| File Server | recon:8888 (files.echo6.co) | PDF downloads |
## Key Paths
| Path | Contents |
|------|----------|
| /opt/recon/ | Application code |
| /opt/recon/data/concepts/ | Gemini extractions (**CRITICAL -- back these up**) |
| /opt/recon/data/text/ | Extracted text |
| /opt/recon/data/recon.db | SQLite status DB |
| /mnt/library/ | PDF library (NFS from pi-nas) |
## Backups
Automated every 6 hours to Contabo VPS via `/opt/recon/scripts/backup.sh`.
Concept JSONs are the most valuable data ($130+ of Gemini API work).
Qdrant is NOT backed up -- rebuilt from JSONs in ~10 minutes via `recon rebuild`.
## Monitoring
```bash
# Pipeline status
recon status
# Tail logs
tail -f /opt/recon/logs/recon.log
# Pipeline run log
tail -f /opt/recon/pipeline.log
# Validate consistency
recon validate --deep
```
## Full Documentation
See [PROJECT-BIBLE.md](PROJECT-BIBLE.md) for complete system documentation.

348
api.py Normal file

@ -0,0 +1,348 @@
import json
import os
import requests as http_requests
from flask import Flask, request, jsonify, redirect
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from .utils import get_config, content_hash, setup_logging
from .status import StatusDB
logger = setup_logging('recon.api')
app = Flask(__name__)
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<title>RECON</title>
<meta charset="utf-8">
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: 'Courier New', monospace; background: #0a0a0a; color: #c0c0c0; }
.header { background: #111; border-bottom: 1px solid #333; padding: 12px 24px; display: flex; justify-content: space-between; align-items: center; }
.header h1 { color: #00ff41; font-size: 18px; letter-spacing: 2px; }
.header .stats { font-size: 12px; color: #666; }
.nav { background: #0d0d0d; border-bottom: 1px solid #222; padding: 8px 24px; }
.nav a { color: #888; text-decoration: none; margin-right: 16px; font-size: 13px; }
.nav a:hover, .nav a.active { color: #00ff41; }
.content { padding: 24px; max-width: 1400px; margin: 0 auto; }
.search-box { width: 100%; padding: 10px 16px; background: #111; border: 1px solid #333; color: #c0c0c0; font-family: inherit; font-size: 14px; margin-bottom: 16px; }
.search-box:focus { outline: none; border-color: #00ff41; }
table { width: 100%; border-collapse: collapse; font-size: 13px; }
th { background: #111; color: #00ff41; text-align: left; padding: 8px 12px; border-bottom: 1px solid #333; }
td { padding: 6px 12px; border-bottom: 1px solid #1a1a1a; }
tr:hover { background: #111; }
.status { padding: 2px 8px; border-radius: 3px; font-size: 11px; }
.status-complete { color: #00ff41; }
.status-enriched { color: #00bfff; }
.status-extracted { color: #ffa500; }
.status-failed { color: #ff4444; }
.status-queued { color: #888; }
.stat-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
.stat-card { background: #111; border: 1px solid #222; padding: 16px; }
.stat-card .label { color: #666; font-size: 11px; text-transform: uppercase; }
.stat-card .value { color: #00ff41; font-size: 28px; margin-top: 4px; }
.result { background: #111; border: 1px solid #222; padding: 16px; margin-bottom: 12px; }
.result .title { color: #00ff41; font-size: 14px; margin-bottom: 4px; }
.result .meta { color: #666; font-size: 11px; margin-bottom: 8px; }
.result .content-text { color: #999; font-size: 12px; line-height: 1.5; }
.result .score { color: #ffa500; font-size: 12px; float: right; }
.btn { background: #1a1a1a; border: 1px solid #333; color: #c0c0c0; padding: 6px 14px; cursor: pointer; font-family: inherit; font-size: 12px; }
.btn:hover { border-color: #00ff41; color: #00ff41; }
.domain-tag { display: inline-block; background: #1a1a1a; border: 1px solid #333; padding: 1px 6px; margin: 1px; font-size: 10px; color: #888; }
</style>
</head>
<body>
<div class="header">
<h1>RECON</h1>
<div class="stats">Knowledge Base Management System</div>
</div>
<div class="nav">
<a href="/" id="nav-dash">Dashboard</a>
<a href="/search" id="nav-search">Search</a>
<a href="/catalogue" id="nav-cat">Catalogue</a>
<a href="/failures" id="nav-fail">Failures</a>
</div>
<div class="content" id="main">
{{CONTENT}}
</div>
</body>
</html>"""
def render(content):
return HTML_TEMPLATE.replace('{{CONTENT}}', content)
@app.route('/')
def dashboard():
db = StatusDB()
counts = db.get_status_counts()
cat = counts.get('catalogue', {})
doc = counts.get('documents', {})
total_cat = sum(cat.values())
total_doc = sum(doc.values())
complete = doc.get('complete', 0)
failed = doc.get('failed', 0)
stats = f"""
<div class="stat-grid">
<div class="stat-card"><div class="label">Catalogued PDFs</div><div class="value">{total_cat}</div></div>
<div class="stat-card"><div class="label">In Pipeline</div><div class="value">{total_doc}</div></div>
<div class="stat-card"><div class="label">Complete</div><div class="value">{complete}</div></div>
<div class="stat-card"><div class="label">Failed</div><div class="value">{failed}</div></div>
</div>
<h3 style="color:#00ff41;margin-bottom:12px;">Pipeline Status</h3>
<table>
<tr><th>Status</th><th>Count</th></tr>
"""
for status in ['queued', 'extracting', 'extracted', 'enriching', 'enriched', 'embedding', 'complete', 'failed']:
count = doc.get(status, 0)
stats += f'<tr><td><span class="status status-{status}">{status}</span></td><td>{count}</td></tr>\n'
stats += "</table>"
sources = db.source_breakdown()
if sources:
stats += '<h3 style="color:#00ff41;margin:24px 0 12px;">Sources</h3><table><tr><th>Source</th><th>Count</th><th>Size</th></tr>'
for s in sources:
size_mb = (s.get('total_bytes', 0) or 0) / (1024 * 1024)
stats += f"<tr><td>{s['source']}</td><td>{s['count']}</td><td>{size_mb:.1f} MB</td></tr>"
stats += "</table>"
return render(stats)
@app.route('/search')
def search_page():
query = request.args.get('q', '')
if not query:
content = """
<h3 style="color:#00ff41;margin-bottom:16px;">Semantic Search</h3>
<form method="get" action="/search">
<input type="text" name="q" class="search-box" placeholder="Search the knowledge base..." autofocus>
</form>
<p style="color:#666;font-size:12px;margin-top:8px;">Enter a query to search across all embedded concepts.</p>
"""
return render(content)
config = get_config()
limit = request.args.get('limit', 20, type=int)  # falls back to 20 when the value is not an integer
source_filter = request.args.get('source_type', None)
try:
url = f"http://{config['embedding']['host']}:{config['embedding']['port']}/api/embed"
resp = http_requests.post(url, json={
"model": config['embedding']['model'],
"input": query
}, timeout=120)
resp.raise_for_status()
query_vector = resp.json()['embeddings'][0]
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=60
)
search_filter = None
if source_filter:
search_filter = Filter(must=[
FieldCondition(key="source_type", match=MatchValue(value=source_filter))
])
results = qdrant.query_points(
collection_name=config['vector_db']['collection'],
query=query_vector,
limit=limit,
query_filter=search_filter
).points
import html  # local import: escape the user-supplied query before echoing it into HTML
safe_query = html.escape(query)
content = f"""
<h3 style="color:#00ff41;margin-bottom:16px;">Results for: {safe_query}</h3>
<form method="get" action="/search">
<input type="text" name="q" class="search-box" value="{safe_query}">
</form>
<p style="color:#666;font-size:12px;margin-bottom:16px;">{len(results)} results</p>
"""
for r in results:
p = r.payload
title = p.get('title', 'Untitled')
summary = p.get('summary', p.get('content', '')[:200])
score = r.score
domains = p.get('domain', [])
book = p.get('book_title', p.get('filename', ''))
source_type = p.get('source_type', 'document')
domain_tags = ''.join(f'<span class="domain-tag">{d}</span>' for d in (domains if isinstance(domains, list) else []))
content += f"""
<div class="result">
<span class="score">{score:.4f}</span>
<div class="title">{title}</div>
<div class="meta">{book} | {source_type} | {p.get('skill_level', 'unknown')}</div>
<div class="content-text">{summary}</div>
<div style="margin-top:6px;">{domain_tags}</div>
</div>
"""
return render(content)
except Exception as e:
return render(f'<p style="color:#ff4444;">Search error: {e}</p>')
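# search_page() places payload fields (title, summary, domain tags) directly into
# the result markup. Those values come from enriched documents and crawled pages,
# so they may contain HTML. A minimal sketch of an escaped result card follows,
# assuming only the standard library; render_result_card is a hypothetical helper
# and not part of this codebase.

```python
from html import escape


def render_result_card(payload: dict, score: float) -> str:
    """Build one search-result card with every payload field HTML-escaped."""
    title = escape(str(payload.get('title', 'Untitled')))
    # Fall back to the first 200 characters of content when no summary exists
    summary = escape(str(payload.get('summary') or (payload.get('content') or '')[:200]))
    book = escape(str(payload.get('book_title') or payload.get('filename', '')))
    source_type = escape(str(payload.get('source_type', 'document')))
    domains = payload.get('domain', [])
    if not isinstance(domains, list):
        domains = []
    tags = ''.join(f'<span class="domain-tag">{escape(str(d))}</span>' for d in domains)
    return (
        '<div class="result">'
        f'<span class="score">{score:.4f}</span>'
        f'<div class="title">{title}</div>'
        f'<div class="meta">{book} | {source_type}</div>'
        f'<div class="content-text">{summary}</div>'
        f'<div style="margin-top:6px;">{tags}</div>'
        '</div>'
    )


card = render_result_card({'title': '<script>x</script>', 'domain': ['A&B']}, 0.5)
```

# Escaping at render time keeps the stored payload unchanged, so the JSON API can
# still return the raw text.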
@app.route('/catalogue')
def catalogue_page():
db = StatusDB()
source = request.args.get('source', None)
category = request.args.get('category', None)
limit = request.args.get('limit', 100, type=int)  # falls back to 100 when the value is not an integer
docs = db.get_all_documents(source=source, category=category, limit=limit)
content = '<h3 style="color:#00ff41;margin-bottom:16px;">Document Catalogue</h3>'
sources = db.get_sources()
if sources:
content += '<div style="margin-bottom:12px;">'
content += '<a href="/catalogue" class="btn" style="margin-right:4px;">All</a>'
for s in sources:
content += f'<a href="/catalogue?source={s}" class="btn" style="margin-right:4px;">{s}</a>'
content += '</div>'
content += """<table>
<tr><th>Filename</th><th>Source</th><th>Status</th><th>Pages</th><th>Concepts</th><th>Vectors</th></tr>"""
for d in docs:
status = d.get('status', 'unknown')
content += f"""<tr>
<td>{d.get('filename', '?')}</td>
<td>{d.get('source', '')}</td>
<td><span class="status status-{status}">{status}</span></td>
<td>{d.get('pages_extracted', 0)}</td>
<td>{d.get('concepts_extracted', 0)}</td>
<td>{d.get('vectors_inserted', 0)}</td>
</tr>"""
content += "</table>"
return render(content)
@app.route('/failures')
def failures_page():
db = StatusDB()
failures = db.get_failures()
content = '<h3 style="color:#ff4444;margin-bottom:16px;">Failed Documents</h3>'
if not failures:
content += '<p style="color:#666;">No failures.</p>'
return render(content)
content += '<table><tr><th>Filename</th><th>Error</th><th>Retries</th><th>Actions</th></tr>'
for f in failures:
content += f"""<tr>
<td>{f.get('filename', '?')}</td>
<td style="color:#ff4444;font-size:11px;">{(f.get('error_message') or 'unknown')[:100]}</td>
<td>{f.get('retry_count', 0)}</td>
<td><form method="post" action="/api/retry/{f['hash']}" style="display:inline;">
<button class="btn" type="submit">Retry</button>
</form></td>
</tr>"""
content += "</table>"
return render(content)
@app.route('/api/search', methods=['POST'])
def api_search():
config = get_config()
data = request.get_json()
if not data or 'query' not in data:
return jsonify({'error': 'Missing query'}), 400
query = data['query']
limit = data.get('limit', 20)
source_type = data.get('source_type', None)
try:
url = f"http://{config['embedding']['host']}:{config['embedding']['port']}/api/embed"
resp = http_requests.post(url, json={
"model": config['embedding']['model'],
"input": query
}, timeout=120)
resp.raise_for_status()
query_vector = resp.json()['embeddings'][0]
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=60
)
search_filter = None
if source_type:
search_filter = Filter(must=[
FieldCondition(key="source_type", match=MatchValue(value=source_type))
])
results = qdrant.query_points(
collection_name=config['vector_db']['collection'],
query=query_vector,
limit=limit,
query_filter=search_filter
).points
return jsonify({
'query': query,
'results': [
{
'score': r.score,
'payload': r.payload
}
for r in results
]
})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/status')
def api_status():
db = StatusDB()
return jsonify(db.get_status_counts())
@app.route('/api/retry/<file_hash>', methods=['POST'])
def api_retry(file_hash):
db = StatusDB()
db.increment_retry(file_hash)
return redirect('/failures')
@app.route('/api/ingest', methods=['POST'])
def api_ingest():
from .ingester import ingest_intel
data = request.get_json()
if not data:
return jsonify({'error': 'No JSON body'}), 400
config = get_config()
result = ingest_intel(data, config)
if result is not None:
return jsonify({'intel_id': result})
return jsonify({'error': 'Ingestion failed'}), 500
def run_server():
config = get_config()
host = config['web']['host']
port = config['web']['port']
logger.info(f"Starting RECON web dashboard on {host}:{port}")
app.run(host=host, port=port, debug=False)

440
config.yaml Normal file

@ -0,0 +1,440 @@
# RECON Configuration
# See PROJECT-BIBLE.md Section 11 for full documentation
# Root path for the PDF library (NFS mount from pi-nas)
library_root: /mnt/library
processing:
max_pdf_size_mb: 2000 # Raised from 200MB default for large scanned books
extract_workers: 4 # Concurrent PDF extraction threads
enrich_workers: 16 # Concurrent Gemini enrichment threads (4 keys x 4)
embed_workers: 4 # Concurrent embedding threads
enrich_window_size: 5 # Pages per enrichment window (sent to Gemini)
embed_batch_size: 500 # Vectors per Qdrant upsert batch
rate_limit_delay: 0.1 # Delay between Gemini API calls (seconds)
max_retries: 5 # Max retries for failed documents
extract_timeout: 1800 # Max seconds per document extraction (30 min, allows vision OCR)
page_timeout: 30 # Max seconds per page extraction
enrich_max_retries: 5 # Max retries per enrichment window
enrich_base_delay: 5.0 # Base backoff delay (seconds) — ~5s, 10s, 20s, 40s, 80s
enrich_max_delay: 120.0 # Maximum backoff delay cap (seconds)
embedding:
backend: tei # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
tei_host: 100.64.0.14 # TEI server (cortex)
tei_port: 8090 # TEI HTTP port
ollama_host: 100.64.0.14 # Ollama server (cortex) — fallback only
ollama_port: 11434 # Ollama HTTP port
model: bge-m3 # Embedding model name
dimensions: 1024 # CRITICAL: bge-m3 is 1024-dim, NOT 384
batch_size: 128 # Embeddings per TEI batch request
sparse_embedding:
enabled: true
host: 100.64.0.14 # Sparse embedding service (cortex)
port: 8091 # Sparse embedding HTTP port
vector_db:
host: 100.64.0.14 # Qdrant server (cortex)
port: 6333 # Qdrant HTTP port
collection: recon_knowledge_hybrid # Collection name
gemini:
model: gemini-2.0-flash # Gemini model for enrichment
response_mime_type: application/json # Force JSON output from Gemini
web:
port: 8420 # Dashboard HTTP port
host: 0.0.0.0 # Bind address (all interfaces)
paths:
base: /opt/recon # Application root
data: /opt/recon/data # Data directory
text: /opt/recon/data/text # Extracted text output (data/text/{hash}/page_NNNN.txt)
concepts: /opt/recon/data/concepts # Enriched concept JSONs (data/concepts/{hash}/window_N.json)
intel: /opt/recon/data/intel # ARGUS intel feeds
logs: /opt/recon/logs # Log files
db: /opt/recon/data/recon.db # SQLite database (WAL mode)
book_server:
base_url: https://files.echo6.co # Public URL prefix for PDF downloads
strip_prefix: /mnt/library # Path prefix stripped when generating download URLs
upload_paths: # Category -> filesystem path mapping for uploads
Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
Military Doctrine: /mnt/library/Army_Pubs/Uploads
Gaming: /mnt/library/Gaming
Reference: /mnt/library/Reference
Technical: /mnt/library/Technical
default: /mnt/library # Fallback for unknown categories
web_scraper:
words_per_page: 2000 # Target words per page chunk for web content
fetch_timeout: 30 # HTTP request timeout (seconds)
rate_limit_delay: 1.0 # Delay between URL fetches (seconds)
max_batch_size: 50 # Max URLs per batch ingest
user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
crawler:
user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
fetch_timeout: 30 # HTTP request timeout (seconds)
rate_limit_delay: 1.0 # Delay between page fetches (seconds)
max_pages: 500 # Max pages to discover per crawl
max_depth: 3 # Max link-following depth (BFS only, not sitemap)
inter_site_cooldown: 30 # Seconds to wait between crawling different sites
recrawl_interval_days: 7 # Skip sites crawled within this many days
default_exclude: # URL patterns always excluded from crawling
- /search
- /404
- /login
- /signup
- /auth/
- /api/
- /assets/
- /static/
- /cart
- /checkout
- /account
- /register
- /subscribe
- /membership
- /shop
- /store
- /product
- /wp-admin
- /feed
- /wp-json
- /xmlrpc
- /.well-known
- /cdn-cgi
# ─── Crawl Targets ─────────────────────────────────────────────
# Sites are crawled by the scheduler loop in tier order (1 first).
# Per-site delay overrides global rate_limit_delay for that site.
# Per-site max_pages/max_depth override global defaults.
# Disabled 2026-04-14 for refactor — see refactored-recon repo for context
sites: []
# sites:
#
# # ═══ TIER 1 — Free, authoritative, high-density ═══
#
# - url: https://hesperian.org/all-hesperian-health-guides
# category: Medical
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Free health guides — WTIND, midwives, community health"
#
# - url: https://swsbm.com
# category: Medical
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Michael Moore's entire free clinical herbal library — PDFs"
#
# - url: https://swsbm.henriettesherbal.com
# category: Medical
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Mirror of Moore's library — grab both"
#
# - url: https://nchfp.uga.edu
# category: Sustainment Systems
# max_depth: 3
# delay: 2.0
# tier: 1
# notes: "USDA canning/preservation safety authority"
#
# - url: https://extension.uidaho.edu
# category: Foundational Skills
# max_depth: 3
# delay: 2.0
# tier: 1
# notes: "Idaho-specific — soil, water, crops, livestock"
#
# - url: https://extension.usu.edu
# category: Foundational Skills
# max_depth: 3
# delay: 2.0
# tier: 1
# notes: "Utah State — Idaho-adjacent climate"
#
# - url: https://attra.ncat.org
# category: Sustainment Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "ATTRA sustainable ag — hundreds of free publications"
#
# - url: https://pfaf.org
# category: Sustainment Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Plants For A Future — 7,000+ edible/medicinal plant profiles"
#
# - url: https://eattheweeds.com
# category: Sustainment Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Green Deane — 1,000+ foraging plant articles"
#
# - url: https://lowtechmagazine.com
# category: Off-Grid Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Exceptional low-tech systems analysis"
#
# - url: https://appropedia.org
# category: Off-Grid Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Appropriate technology wiki"
#
# - url: https://journeytoforever.org
# category: Off-Grid Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "VITA manuals, biodiesel, biogas, hand tools archive"
#
# - url: https://cd3wd.com
# category: Off-Grid Systems
# max_depth: 2
# delay: 3.0
# tier: 1
# notes: "1,050+ appropriate technology eBooks — index pages only"
#
# - url: https://practicalselfreliance.com
# category: Sustainment Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Ashley Adamant — foraging, preservation, homesteading"
#
# - url: https://open.oregonstate.edu/permaculture
# category: Off-Grid Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Millison's free permaculture textbook"
#
# - url: https://open.oregonstate.edu/permaculturedesign
# category: Off-Grid Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Millison's advanced permaculture textbook"
#
# - url: https://mushroomexpert.com
# category: Sustainment Systems
# max_depth: 3
# delay: 3.0
# tier: 1
# notes: "Michael Kuo — mushroom ID, taxonomy, regional coverage"
#
# # ═══ TIER 2 — High value, second pass ═══
#
# - url: https://motherearthnews.com
# category: Foundational Skills
# max_depth: 2
# max_pages: 200
# delay: 8.0
# tier: 2
# notes: "50 years of homesteading archive — large commercial site, be polite"
#
# - url: https://permacultureresearchinstitute.com
# category: Off-Grid Systems
# max_depth: 3
# delay: 5.0
# tier: 2
# notes: "Geoff Lawton — articles, case studies"
#
# - url: https://learnyourland.com
# category: Sustainment Systems
# max_depth: 3
# delay: 5.0
# tier: 2
# notes: "Adam Haritan — foraging articles"
#
# - url: https://herbswithRosalee.com
# category: Medical
# max_depth: 3
# delay: 5.0
# tier: 2
# notes: "Rosalee de la Foret — clinical herbalism articles"
#
# - url: https://commonwealthherbs.com
# category: Medical
# max_depth: 3
# delay: 5.0
# tier: 2
# notes: "Katja and Ryn — clinical herbalism"
#
# - url: https://soilfoodweb.com
# category: Off-Grid Systems
# max_depth: 3
# delay: 5.0
# tier: 2
# notes: "Elaine Ingham soil biology — archive before it goes dark"
#
# - url: https://rocketstoves.com
# category: Off-Grid Systems
# max_depth: 3
# delay: 5.0
# tier: 2
# notes: "Ianto Evans — rocket mass heater designs and PDFs"
#
# - url: https://farmsteadmeatsmith.com
# category: Sustainment Systems
# max_depth: 2
# delay: 5.0
# tier: 2
# notes: "Brandon Sheard — butchering articles (free content only)"
#
# - url: https://deeranddeerhunting.com
# category: Sustainment Systems
# max_depth: 2
# delay: 5.0
# tier: 2
# notes: "Field dressing, processing, hunting technique library"
#
# # ═══ TIER 3 — Government (authoritative) ═══
#
# - url: https://plants.usda.gov
# category: Sustainment Systems
# max_depth: 2
# delay: 2.0
# tier: 3
# notes: "USDA native plant database"
#
# - url: https://ars.usda.gov
# category: Sustainment Systems
# max_depth: 2
# delay: 2.0
# tier: 3
# notes: "USDA Agricultural Research publications"
#
# - url: https://nrcs.usda.gov
# category: Off-Grid Systems
# max_depth: 2
# delay: 2.0
# tier: 3
# notes: "Soil surveys, conservation practice standards"
#
# - url: https://ready.gov
# category: Scenario Playbooks
# max_depth: 3
# delay: 2.0
# tier: 3
# notes: "FEMA emergency preparedness guides"
#
# - url: https://emergency.cdc.gov
# category: Medical
# max_depth: 3
# delay: 2.0
# tier: 3
# notes: "Public health emergency references"
#
# - url: https://agri.idaho.gov
# category: Foundational Skills
# max_depth: 2
# delay: 2.0
# tier: 3
# notes: "Idaho Dept of Agriculture — local relevance"
#
# - url: https://driveonwood.com
# category: Off-Grid Systems
# max_depth: 3
# delay: 3.0
# tier: 3
# notes: "Wood gasification — FEMA manual + modern improvements"
#
# # ═══ TIER 4 — Selective scrape (specific sections only) ═══
#
# - url: https://richsoil.com
# category: Off-Grid Systems
# max_depth: 2
# delay: 5.0
# tier: 4
# notes: "Paul Wheaton — rocket mass heaters, natural building"
#
# - url: https://wildfoodgirl.com
# category: Sustainment Systems
# max_depth: 3
# delay: 5.0
# tier: 4
# notes: "Colorado foraging — Mountain West species"
#
# - url: https://foragersharvest.com
# category: Sustainment Systems
# max_depth: 3
# delay: 5.0
# tier: 4
# notes: "Sam Thayer's site — articles"
#
# - url: https://mountainroseherbs.com/blog
# category: Medical
# max_depth: 2
# delay: 5.0
# tier: 4
# notes: "Herb profiles and preparations — blog section only"
#
# - url: https://herbalprepper.com
# category: Medical
# max_depth: 3
# delay: 5.0
# tier: 4
# notes: "Cat Ellis — grid-down herbalism"
#
# - url: https://prolongedfieldcare.org
# category: Medical
# max_depth: 3
# delay: 5.0
# tier: 4
# notes: "PFC Collective — austere medical protocols"
#
service:
scan_interval: 3600 # Seconds between library scans (1 hour)
stage_poll_interval: 30 # Seconds stages sleep when idle
progress_interval: 60 # Seconds between progress log lines
peertube:
api_base: http://192.168.1.170 # Internal PeerTube API (CT 110 nginx)
public_url: https://stream.echo6.co # Public URL for video links
fetch_timeout: 30 # HTTP timeout for API/VTT requests
rate_limit_delay: 0.5 # Delay between video ingestions (seconds)
# Stream B: New Library Pipeline
new_pipeline:
# Disabled 2026-04-14 for refactor — see refactored-recon repo for context
enabled: false
acquired_dir: /mnt/library/_acquired
ingest_dir: /mnt/library/_ingest
duplicates_dir: /mnt/library/_ingest/_duplicates
failed_dir: /mnt/library/_ingest/_failed
poll_interval: 60
mtime_stability: 10
pilot_domain: "Civil Organization"
spaces_to_underscores: true
# Refactored pipeline configuration (2026-04-14)
# See https://forge.echo6.co/matt/refactored-recon for design
pipeline:
acquired_root: /opt/recon/data/acquired
processing_root: /opt/recon/data/processing
# Subfolder name -> processor module mapping
# Processors do not exist yet; this is scaffolding for Phase 3+
dispatch:
pdf: pdf_processor
stream: transcript_processor
html: html_processor
# mtime stability threshold for picking up files from acquired/
mtime_stability_seconds: 10
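# The enrich_base_delay comment above lists a retry schedule of roughly
# 5s, 10s, 20s, 40s, 80s, capped by enrich_max_delay. A small sketch of that
# arithmetic, assuming the delay doubles on each attempt; backoff_delays is a
# hypothetical name used only for illustration.

```python
def backoff_delays(max_retries: int = 5, base_delay: float = 5.0,
                   max_delay: float = 120.0) -> list[float]:
    """Return the wait before each retry: base * 2^attempt, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]


schedule = backoff_delays()            # values from the processing section
longer = backoff_delays(max_retries=7)  # shows where the cap takes effect
```

# With the configured values the cap is never reached in five attempts; it only
# applies from the sixth retry onward.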

264
enricher.py Normal file

@ -0,0 +1,264 @@
import json
import os
import re
import time
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed
import google.generativeai as genai
from .utils import get_config, setup_logging
from .status import StatusDB
logger = setup_logging('recon.enricher')
def repair_json(text):
"""Attempt to repair common LLM JSON output issues including truncation."""
# Remove control characters except newlines and tabs
text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
# Remove trailing commas before } or ]
text = re.sub(r',\s*([}\]])', r'\1', text)
# Handle truncated JSON: try to find the last complete object in the array
try:
json.loads(text, strict=False)
return text
except json.JSONDecodeError:
pass
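# Find the last complete top-level object, then close the array.
# Walk forward, tracking string state and nesting depth, and remember the
# position of the last closing brace that returns the depth to zero.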
last_complete = -1
depth_brace = 0
depth_bracket = 0
in_string = False
escape = False
for i, ch in enumerate(text):
if escape:
escape = False
continue
if ch == '\\' and in_string:
escape = True
continue
if ch == '"' and not escape:
in_string = not in_string
continue
if in_string:
continue
if ch == '{':
depth_brace += 1
elif ch == '}':
depth_brace -= 1
if depth_brace == 0:
last_complete = i
elif ch == '[':
depth_bracket += 1
elif ch == ']':
depth_bracket -= 1
if last_complete > 0:
truncated = text[:last_complete + 1].rstrip().rstrip(',')
# Close any open arrays
open_brackets = truncated.count('[') - truncated.count(']')
truncated += ']' * open_brackets
return truncated
return text
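# A compact illustration of the truncation repair used by repair_json(): cut the
# text back to the last complete object and close the array. This sketch assumes
# the response is a top-level JSON array of objects; close_truncated_array is a
# hypothetical name, and it omits the control-character and trailing-comma cleanup.

```python
import json


def close_truncated_array(text: str) -> str:
    """Cut a truncated JSON array back to its last complete top-level object."""
    if not text.lstrip().startswith('['):
        return text  # assumption: only arrays are repaired
    depth = 0
    in_string = False
    escaped = False
    last_complete = -1
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string, braces are data; only track escapes and the closing quote
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                last_complete = i
    if last_complete < 0:
        return text  # no complete object to keep
    return text[:last_complete + 1] + ']'


truncated = '[{"title": "Water", "keywords": ["boil"]}, {"title": "Fire }", "keyw'
concepts = json.loads(close_truncated_array(truncated))
```

# The brace inside the string "Fire }" is ignored because the scanner tracks
# string state, so only the first object survives the repair.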
ENRICH_PROMPT = """Extract knowledge concepts from this document text.
A concept is a SELF-CONTAINED piece of knowledge that can stand alone.
For each concept, provide ALL fields:
Required:
- content: Full text of the concept (complete procedure, definition, etc.)
- summary: 1-2 sentence summary
- title: Brief descriptive title
- domain: Array of 1-5 from: Foundational Skills, Sustainment Systems, Defense & Tactics, Off-Grid Systems, Communications, Scenario Playbooks, Reference
- subdomain: Array of specific subcategories (up to 10)
- keywords: Array of 3-30 searchable terms
- skill_level: novice | intermediate | advanced
- key_facts: Array of specific extractable claims, measurements, data points
Optional (include when present):
- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
- chapter: Chapter name if identifiable
- page_ref: Page reference
- notes: Any additional context
Return JSON array. If no extractable concepts, return [].
Document text:
"""
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self.index = 0
def next(self):
if not self.keys:
raise ValueError("No Gemini API keys configured")
key = self.keys[self.index % len(self.keys)]
self.index += 1
return key
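# run_enrichment() shares one KeyRotator across all worker threads, and
# `self.index += 1` is a read-modify-write that may interleave between threads.
# A minimal sketch of a lock-guarded variant, assuming the same next() interface;
# LockedKeyRotator is a hypothetical name, not a class in this codebase.

```python
import itertools
import threading


class LockedKeyRotator:
    """Round-robin key rotation that is safe to call from several threads."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("No API keys configured")
        self._cycle = itertools.cycle(list(keys))
        self._lock = threading.Lock()

    def next(self):
        # The lock makes "read position, advance position" a single step
        with self._lock:
            return next(self._cycle)


rotator = LockedKeyRotator(['key-a', 'key-b', 'key-c'])
first_five = [rotator.next() for _ in range(5)]
```

# Note that enrich_window() also calls genai.configure(api_key=key), which sets
# process-wide state, so a lock on the rotator alone may not pin a key to a request.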
def enrich_window(text, key, config):
genai.configure(api_key=key)
model = genai.GenerativeModel(
config['gemini']['model'],
generation_config={"response_mime_type": config['gemini']['response_mime_type']}
)
response = model.generate_content(ENRICH_PROMPT + text)
raw = response.text
try:
return json.loads(raw, strict=False)
except json.JSONDecodeError:
repaired = repair_json(raw)
return json.loads(repaired, strict=False)
def enrich_single(file_hash, db, config, key_rotator):
doc = db.get_document(file_hash)
if not doc:
return False
text_dir = os.path.join(config['paths']['text'], file_hash)
concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
window_size = config['processing']['enrich_window_size']
delay = config['processing']['rate_limit_delay']
max_retries = config['processing']['max_retries']
if not os.path.exists(text_dir):
db.mark_failed(file_hash, f"Text directory not found: {text_dir}")
return False
db.update_status(file_hash, 'enriching')
try:
os.makedirs(concepts_dir, exist_ok=True)
page_files = sorted([f for f in os.listdir(text_dir) if f.startswith('page_') and f.endswith('.txt')])
if not page_files:
db.mark_failed(file_hash, "No page files found")
return False
pages_text = []
for pf in page_files:
with open(os.path.join(text_dir, pf), encoding='utf-8') as f:
pages_text.append(f.read())
windows = []
for i in range(0, len(pages_text), window_size):
window_pages = pages_text[i:i + window_size]
combined = "\n\n".join(f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages))
windows.append((i, combined))
total_concepts = 0
for w_idx, (start_page, window_text) in enumerate(windows):
window_file = os.path.join(concepts_dir, f"window_{w_idx+1:04d}.json")
if os.path.exists(window_file):
with open(window_file, encoding='utf-8') as f:
existing = json.load(f)
total_concepts += len(existing)
logger.debug(f" Window {w_idx+1} already exists, skipping")
continue
if len(window_text.strip()) < 50:
with open(window_file, 'w') as f:
json.dump([], f)
continue
concepts = None
for attempt in range(max_retries):
try:
key = key_rotator.next()
concepts = enrich_window(window_text, key, config)
break
except Exception as e:
logger.warning(f" Window {w_idx+1} attempt {attempt+1} failed: {e}")
if attempt < max_retries - 1:
time.sleep(delay * (attempt + 1) * 2)
if concepts is None:
db.mark_failed(file_hash, f"All retries failed for window {w_idx+1}")
return False
if not isinstance(concepts, list):
concepts = [concepts] if isinstance(concepts, dict) else []
for concept in concepts:
concept['_window'] = w_idx + 1
concept['_start_page'] = start_page + 1
concept['_doc_hash'] = file_hash
# JSON FIRST: save before anything else
with open(window_file, 'w', encoding='utf-8') as f:
json.dump(concepts, f, indent=2, ensure_ascii=False)
total_concepts += len(concepts)
logger.debug(f" Window {w_idx+1}/{len(windows)}: {len(concepts)} concepts")
time.sleep(delay)
meta = {
'hash': file_hash,
'total_windows': len(windows),
'total_concepts': total_concepts,
'window_size': window_size,
'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
}
with open(os.path.join(concepts_dir, 'meta.json'), 'w') as f:
json.dump(meta, f, indent=2)
db.update_status(file_hash, 'enriched', concepts_extracted=total_concepts)
logger.info(f"Enriched {doc['filename']}: {total_concepts} concepts from {len(windows)} windows")
return True
except Exception as e:
logger.error(f"Enrichment failed for {file_hash}: {e}\n{traceback.format_exc()}")
db.mark_failed(file_hash, str(e))
return False
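# The windowing step in enrich_single() groups pages into fixed-size windows and
# prefixes each page with a "--- Page N ---" marker. A standalone sketch of that
# grouping for illustration; build_windows is a hypothetical name.

```python
def build_windows(pages: list[str], window_size: int = 5) -> list[tuple[int, str]]:
    """Group page texts into windows; each entry is (zero-based start index, combined text)."""
    windows = []
    for start in range(0, len(pages), window_size):
        chunk = pages[start:start + window_size]
        combined = "\n\n".join(
            f"--- Page {start + offset + 1} ---\n{text}"
            for offset, text in enumerate(chunk)
        )
        windows.append((start, combined))
    return windows


pages = [f"text {n}" for n in range(1, 8)]  # seven pages
windows = build_windows(pages, window_size=5)
```

# Seven pages with a window size of 5 yield two windows: pages 1-5 and pages 6-7.
# The last window is simply shorter; no padding is added.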
def run_enrichment(workers=None, limit=None):
config = get_config()
db = StatusDB()
workers = workers or config['processing']['enrich_workers']
keys = config.get('gemini_keys', [])
if not keys:
logger.error("No Gemini API keys configured in .env")
return 0
key_rotator = KeyRotator(keys)
extracted = db.get_by_status('extracted', limit=limit)
if not extracted:
logger.info("No extracted documents to enrich")
return 0
logger.info(f"Enriching {len(extracted)} documents with {workers} workers, {len(keys)} API key(s)")
success = 0
with ThreadPoolExecutor(max_workers=workers) as pool:
futures = {
pool.submit(enrich_single, doc['hash'], StatusDB(), config, key_rotator): doc
for doc in extracted
}
for future in as_completed(futures):
doc = futures[future]
try:
if future.result():
success += 1
except Exception as e:
logger.error(f"Worker error for {doc['hash']}: {e}")
logger.info(f"Enrichment complete: {success}/{len(extracted)} succeeded")
return success

0
lib/__init__.py Normal file

1930
lib/api.py Normal file

File diff suppressed because it is too large

432
lib/crawler.py Normal file

@ -0,0 +1,432 @@
"""
RECON Site Crawler: URL discovery for bulk web ingestion.
Two discovery strategies:
1. Sitemap-based (preferred): parses sitemap.xml for all URLs
2. Link-following (fallback): crawls from root URL following internal links
Discovered URLs are fed into web_scraper.ingest_url() for processing.
"""
import re
import time
from collections import deque
from urllib.parse import urlparse, urljoin, urldefrag
import requests
from lxml import etree
from .utils import get_config, setup_logging
logger = setup_logging('recon.crawler')
def _get_crawler_config(config=None):
"""Load crawler config with defaults."""
if config is None:
config = get_config()
crawler_cfg = config.get('crawler', {})
web_cfg = config.get('web_scraper', {})
return {
'user_agent': (
crawler_cfg.get('user_agent') or
web_cfg.get('user_agent') or
'Mozilla/5.0 (compatible; RECON/1.0)'
),
'fetch_timeout': crawler_cfg.get('fetch_timeout', 30),
'rate_limit_delay': crawler_cfg.get('rate_limit_delay', 1.0),
'max_pages': crawler_cfg.get('max_pages', 500),
'max_depth': crawler_cfg.get('max_depth', 3),
'default_exclude': crawler_cfg.get('default_exclude', [
'/search', '/404', '/login', '/signup', '/auth/', '/api/', '/assets/', '/static/'
]),
}
# ─── Sitemap Discovery ─────────────────────────────────────────────
def discover_sitemap_url(base_url, config=None):
"""
Find the sitemap URL for a site.
Checks: robots.txt Sitemap: directive, /sitemap.xml,
/sitemap_index.xml, /sitemap-0.xml.
Returns sitemap URL or None.
"""
cfg = _get_crawler_config(config)
headers = {'User-Agent': cfg['user_agent']}
parsed = urlparse(base_url)
root = f"{parsed.scheme}://{parsed.netloc}"
# Check robots.txt first
try:
resp = requests.get(
f"{root}/robots.txt",
headers=headers,
timeout=cfg['fetch_timeout']
)
if resp.status_code == 200:
for line in resp.text.splitlines():
if line.strip().lower().startswith('sitemap:'):
# Split on the first colon only, so the colon in "https://" stays in the URL
sitemap_url = line.split(':', 1)[1].strip()
logger.info(f"Found sitemap in robots.txt: {sitemap_url}")
return sitemap_url
except Exception as e:
logger.debug(f"robots.txt fetch failed: {e}")
# Try common sitemap locations
candidates = [
f"{root}/sitemap.xml",
f"{root}/sitemap_index.xml",
f"{root}/sitemap-0.xml",
]
for url in candidates:
try:
resp = requests.head(
url,
headers=headers,
timeout=cfg['fetch_timeout'],
allow_redirects=True
)
if resp.status_code == 200:
logger.info(f"Found sitemap at: {url}")
return url
except Exception:
continue
logger.warning(f"No sitemap found for {base_url}")
return None
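# A standalone sketch of the robots.txt step in discover_sitemap_url(), for
# illustration. Unlike the function above, which returns the first match, this
# variant collects every Sitemap: directive, since a robots.txt file may list
# several. sitemaps_from_robots is a hypothetical name.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return every URL declared by a 'Sitemap:' line, in file order."""
    found = []
    for line in robots_txt.splitlines():
        # partition() splits on the first colon only, so "https://" stays intact
        key, sep, value = line.partition(':')
        if sep and key.strip().lower() == 'sitemap' and value.strip():
            found.append(value.strip())
    return found


robots = (
    "User-agent: *\n"
    "Disallow: /admin\n"
    "Sitemap: https://example.org/sitemap.xml\n"
    "sitemap:https://example.org/news.xml\n"
)
sitemaps = sitemaps_from_robots(robots)
```

# The directive name is matched case-insensitively and whitespace after the colon
# is optional, which mirrors the lower()/strip() handling above.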
def parse_sitemap(sitemap_url, config=None):
"""
Parse a sitemap XML and return all page URLs.
Handles standard sitemaps (<urlset>) and sitemap indexes
(<sitemapindex>) with recursive sub-sitemap fetching.
"""
cfg = _get_crawler_config(config)
headers = {'User-Agent': cfg['user_agent']}
all_urls = []
def _fetch_and_parse(url, depth=0):
if depth > 3:
return
try:
resp = requests.get(url, headers=headers, timeout=cfg['fetch_timeout'])
resp.raise_for_status()
except Exception as e:
logger.error(f"Failed to fetch sitemap {url}: {e}")
return
try:
root = etree.fromstring(resp.content)
except etree.XMLSyntaxError as e:
logger.error(f"Invalid XML in sitemap {url}: {e}")
return
nsmap = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
# Check if this is a sitemap index
sitemap_locs = root.findall('.//ns:sitemap/ns:loc', nsmap)
if sitemap_locs:
logger.info(f"Sitemap index at {url}: {len(sitemap_locs)} sub-sitemaps")
for loc in sitemap_locs:
if loc.text:
_fetch_and_parse(loc.text.strip(), depth + 1)
return
# Standard sitemap — extract URLs
url_locs = root.findall('.//ns:loc', nsmap)
# Fallback: try without namespace
if not url_locs:
url_locs = root.findall('.//loc')
for loc in url_locs:
if loc.text:
all_urls.append(loc.text.strip())
logger.info(f"Parsed {len(url_locs)} URLs from {url}")
_fetch_and_parse(sitemap_url)
# Deduplicate preserving order
seen = set()
unique = []
for url in all_urls:
url_clean = urldefrag(url)[0]
if url_clean not in seen:
seen.add(url_clean)
unique.append(url_clean)
logger.info(f"Total unique URLs from sitemap: {len(unique)}")
return unique
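# parse_sitemap() distinguishes a sitemap index from a plain URL set by looking
# for <sitemap><loc> entries. A minimal sketch of that check, assuming the
# standard-library xml.etree.ElementTree in place of lxml so it runs without
# extra packages; classify_sitemap is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def classify_sitemap(xml_bytes: bytes) -> tuple[str, list[str]]:
    """Return ('index', sub-sitemap URLs) or ('urlset', page URLs)."""
    root = ET.fromstring(xml_bytes)
    subs = [loc.text.strip()
            for loc in root.findall('.//ns:sitemap/ns:loc', SITEMAP_NS) if loc.text]
    if subs:
        return 'index', subs
    pages = [loc.text.strip()
             for loc in root.findall('.//ns:url/ns:loc', SITEMAP_NS) if loc.text]
    return 'urlset', pages


INDEX_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<sitemap><loc>https://example.org/sitemap-posts.xml</loc></sitemap>'
    b'</sitemapindex>'
)
URLSET_XML = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<url><loc>https://example.org/a</loc></url>'
    b'<url><loc> https://example.org/b </loc></url>'
    b'</urlset>'
)
index_result = classify_sitemap(INDEX_XML)
urlset_result = classify_sitemap(URLSET_XML)
```

# An index result would be fetched recursively, as _fetch_and_parse() does with
# its depth limit; a urlset result is the final list of page URLs.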
# ─── Link-Following Discovery (Fallback) ───────────────────────────
def crawl_links(base_url, max_depth=3, max_pages=500, config=None):
"""
Discover URLs by following internal links (BFS).
Fallback when no sitemap is available.
"""
from bs4 import BeautifulSoup
cfg = _get_crawler_config(config)
headers = {'User-Agent': cfg['user_agent']}
parsed_base = urlparse(base_url)
base_domain = parsed_base.netloc
discovered = []
visited = set()
queue = deque([(base_url, 0)])
skip_extensions = (
'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.svg',
'.css', '.js', '.zip', '.tar', '.gz', '.mp4', '.mp3',
'.ico', '.woff', '.woff2', '.ttf', '.eot',
)
skip_paths = (
'/tag/', '/tags/', '/page/', '/feed/', '/rss/',
'/wp-json/', '/wp-admin/', '/wp-includes/',
)
while queue and len(discovered) < max_pages:
url, depth = queue.popleft()
url = urldefrag(url)[0]
if url in visited:
continue
if depth > max_depth:
continue
visited.add(url)
discovered.append(url)
if depth >= max_depth:
continue
try:
resp = requests.get(url, headers=headers, timeout=cfg['fetch_timeout'])
if resp.status_code != 200:
continue
if 'text/html' not in resp.headers.get('content-type', ''):
continue
except Exception:
continue
try:
soup = BeautifulSoup(resp.text, 'lxml')
except Exception:
continue
for a_tag in soup.find_all('a', href=True):
href = a_tag['href']
full_url = urljoin(url, href)
full_url = urldefrag(full_url)[0]
parsed = urlparse(full_url)
if parsed.netloc != base_domain:
continue
if any(parsed.path.lower().endswith(ext) for ext in skip_extensions):
continue
if any(skip in parsed.path.lower() for skip in skip_paths):
continue
if full_url not in visited:
queue.append((full_url, depth + 1))
time.sleep(cfg['rate_limit_delay'])
logger.info(f"Link crawl: {len(discovered)} URLs (visited {len(visited)}, depth {max_depth})")
return discovered
# ─── URL Filtering ──────────────────────────────────────────────────
def filter_urls(urls, include=None, exclude=None):
"""
Filter URLs by path prefix include/exclude rules.
include: URL must match at least one prefix (if provided)
exclude: URL must not match any prefix
"""
filtered = []
for url in urls:
path = urlparse(url).path
if include:
if not any(path.startswith(prefix) for prefix in include):
continue
if exclude:
if any(path.startswith(prefix) for prefix in exclude):
continue
filtered.append(url)
logger.info(f"Filtered {len(urls)} -> {len(filtered)} URLs "
f"(include={include}, exclude={exclude})")
return filtered
# ─── Main Crawl Orchestrator ────────────────────────────────────────
def crawl_site(
base_url,
category='Web',
source=None,
include=None,
exclude=None,
max_pages=None,
max_depth=None,
delay=None,
dry_run=False,
use_sitemap=True,
use_links=True,
config=None,
):
"""
Crawl a site and ingest all discovered pages.
1. Discover URLs via sitemap or link-following
2. Apply include/exclude filters
3. Feed each URL through web_scraper.ingest_url()
Returns summary dict with counts and per-URL results.
"""
if config is None:
config = get_config()
cfg = _get_crawler_config(config)
if max_pages is None:
max_pages = cfg['max_pages']
if max_depth is None:
max_depth = cfg['max_depth']
if delay is None:
delay = cfg['rate_limit_delay']
if source is None:
source = urlparse(base_url).netloc
logger.info(f"Crawling {base_url} (category={category}, max_pages={max_pages})")
# ── Phase 1: Discover URLs ──
urls = []
discovery_method = None
if use_sitemap:
sitemap_url = discover_sitemap_url(base_url, config)
if sitemap_url:
urls = parse_sitemap(sitemap_url, config)
discovery_method = 'sitemap'
if not urls and use_links:
logger.info("No sitemap URLs, falling back to link crawl...")
urls = crawl_links(base_url, max_depth=max_depth, max_pages=max_pages, config=config)
discovery_method = 'link_crawl'
if not urls:
logger.warning(f"No URLs discovered for {base_url}")
return {
'site': base_url,
'discovery_method': None,
'urls_discovered': 0,
'urls_after_filter': 0,
'results': [],
'summary': {'total': 0, 'succeeded': 0, 'duplicates': 0, 'failed': 0},
}
# ── Phase 2: Filter URLs ──
all_exclude = list(cfg['default_exclude'])
if exclude:
all_exclude.extend(exclude)
urls = filter_urls(urls, include=include, exclude=all_exclude)
if len(urls) > max_pages:
logger.info(f"Limiting to {max_pages} pages (discovered {len(urls)})")
urls = urls[:max_pages]
logger.info(f"After filtering: {len(urls)} URLs to process")
# ── Dry run ──
if dry_run:
return {
'site': base_url,
'discovery_method': discovery_method,
'dry_run': True,
'urls_discovered': len(urls),
'urls': urls,
}
# ── Phase 3: Ingest each URL ──
from .web_scraper import ingest_url
results = []
total = len(urls)
for i, url in enumerate(urls, 1):
logger.info(f"[{i}/{total}] Ingesting: {url}")
try:
result = ingest_url(url, category=category, source=source, config=config)
result['url'] = url
results.append(result)
status = result.get('status', 'unknown')
title = result.get('title', '')
if status == 'duplicate':
logger.info(f" DUPLICATE: {title}")
else:
logger.info(f" OK: {title} ({result.get('page_count', 0)} pages)")
except Exception as e:
logger.error(f" FAILED: {url} -- {e}")
results.append({
'url': url,
'status': 'failed',
'error': str(e),
})
if i < total and delay > 0:
time.sleep(delay)
# ── Summary ──
succeeded = sum(1 for r in results if r.get('status') not in ('failed', 'duplicate'))
duplicates = sum(1 for r in results if r.get('status') == 'duplicate')
failed = sum(1 for r in results if r.get('status') == 'failed')
summary = {
'total': len(results),
'succeeded': succeeded,
'duplicates': duplicates,
'failed': failed,
}
logger.info(f"Crawl complete: {succeeded} new, {duplicates} duplicates, {failed} failed out of {total}")
return {
'site': base_url,
'domain': urlparse(base_url).netloc,
'category': category,
'discovery_method': discovery_method,
'urls_discovered': total,
'results': results,
'summary': summary,
}
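
The two URL-selection steps used above (order-preserving de-duplication after fragment stripping, as in `parse_sitemap`, and path-prefix include/exclude rules, as in `filter_urls`) can be tried in isolation. The following is a minimal sketch, not the module's own code: the helper names `dedupe_preserving_order` and `filter_by_prefix` and the `example.org` URLs are hypothetical and exist only for illustration.

```python
from urllib.parse import urldefrag, urlparse


def dedupe_preserving_order(urls):
    """Drop #fragments and repeated URLs, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        clean = urldefrag(url)[0]  # strip the fragment part, if any
        if clean not in seen:
            seen.add(clean)
            unique.append(clean)
    return unique


def filter_by_prefix(urls, include=None, exclude=None):
    """Keep URLs whose path matches an include prefix and no exclude prefix."""
    kept = []
    for url in urls:
        path = urlparse(url).path
        # include is optional: when given, at least one prefix must match
        if include and not any(path.startswith(p) for p in include):
            continue
        # exclude always wins over include
        if exclude and any(path.startswith(p) for p in exclude):
            continue
        kept.append(url)
    return kept


# Hypothetical sample data, for illustration only
sample = [
    "https://example.org/docs/intro",
    "https://example.org/docs/intro#setup",
    "https://example.org/docs/private/keys",
    "https://example.org/blog/post-1",
]
unique = dedupe_preserving_order(sample)
selected = filter_by_prefix(unique, include=["/docs/"], exclude=["/docs/private/"])
print(selected)
```

Note that the rules compare against the URL path only, so a prefix such as `/docs/` behaves the same regardless of host or query string.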

430
lib/embedder.py Normal file

@ -0,0 +1,430 @@
"""
RECON Embedder
Concepts to vectors via TEI (primary, 1024-dim bge-m3, ~1,711 emb/sec)
or Ollama (fallback, ~8 emb/sec). Inserts into Qdrant on cortex:6333.
Supports hybrid dense+sparse vectors when sparse_embedding service is configured.
Dependencies: requests, qdrant-client
Config: embedding, vector_db, processing.embed_workers
"""
import json
import os
import time
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests as http_requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, SparseVector
from .utils import get_config, concept_id, generate_download_url, setup_logging
from .status import StatusDB
logger = setup_logging('recon.embedder')
# ── Classification allowlists ───────────────────────────────────────────────
VALID_DOMAINS = {
'Agriculture & Livestock', 'Civil Organization', 'Communications',
'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
'Vehicles', 'Water Systems', 'Wilderness Skills',
}
VALID_KNOWLEDGE_TYPES = {'foundational', 'procedural', 'operational'}
VALID_COMPLEXITIES = {'basic', 'intermediate', 'advanced'}
DOMAIN_FALLBACK = 'Foundational Skills'
KNOWLEDGE_TYPE_FALLBACK = 'foundational'
COMPLEXITY_FALLBACK = 'basic'
def _validate_classification(payload):
"""Validate domain, knowledge_type, complexity before upsert.
Logs WARNING and applies safe fallback for any invalid values.
Returns the payload (modified in place if needed).
"""
title = payload.get('title', payload.get('filename', '?'))
# ── domain ──────────────────────────────────────────────────────────
domain = payload.get('domain')
if isinstance(domain, list):
valid = [d for d in domain if d in VALID_DOMAINS]
if valid:
payload['domain'] = valid[0]
else:
logger.warning(f"Invalid domain {domain} for '{title}', fallback → {DOMAIN_FALLBACK}")
payload['domain'] = DOMAIN_FALLBACK
elif isinstance(domain, str):
if domain not in VALID_DOMAINS:
logger.warning(f"Invalid domain '{domain}' for '{title}', fallback → {DOMAIN_FALLBACK}")
payload['domain'] = DOMAIN_FALLBACK
else:
payload['domain'] = DOMAIN_FALLBACK
# ── knowledge_type ──────────────────────────────────────────────────
kt = payload.get('knowledge_type', '')
if isinstance(kt, str):
kt = kt.lower().strip()
else:
kt = ''
if kt not in VALID_KNOWLEDGE_TYPES:
logger.warning(f"Invalid knowledge_type '{kt}' for '{title}', fallback → {KNOWLEDGE_TYPE_FALLBACK}")
payload['knowledge_type'] = KNOWLEDGE_TYPE_FALLBACK
else:
payload['knowledge_type'] = kt
# ── complexity ──────────────────────────────────────────────────────
cx = payload.get('complexity', '')
if isinstance(cx, str):
cx = cx.lower().strip()
else:
cx = ''
if cx not in VALID_COMPLEXITIES:
logger.warning(f"Invalid complexity '{cx}' for '{title}', fallback → {COMPLEXITY_FALLBACK}")
payload['complexity'] = COMPLEXITY_FALLBACK
else:
payload['complexity'] = cx
return payload
def get_embedding_single(text, config):
"""Get a single embedding — uses TEI or Ollama depending on config."""
backend = config['embedding'].get('backend', 'ollama')
if backend == 'tei':
url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
resp = http_requests.post(url, json={"inputs": text}, timeout=120)
resp.raise_for_status()
return resp.json()[0]
else:
url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"
resp = http_requests.post(url, json={
"model": config['embedding']['model'],
"input": text
}, timeout=120)
resp.raise_for_status()
return resp.json()['embeddings'][0]
def get_embeddings_batch(texts, config):
"""Get embeddings for a batch of texts via TEI. On error, splits the batch in half and retries each half recursively."""
url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
try:
resp = http_requests.post(url, json={"inputs": texts}, timeout=300)
resp.raise_for_status()
return resp.json()
except Exception as e:
if len(texts) <= 1:
raise
# Split batch in half and retry each half
mid = len(texts) // 2
logger.warning(f" Batch of {len(texts)} failed ({e}), splitting in half")
left = get_embeddings_batch(texts[:mid], config)
right = get_embeddings_batch(texts[mid:], config)
return left + right
def get_sparse_embeddings_batch(texts, config):
"""Get sparse embeddings from the sparse embedding service on cortex.
Returns a list of dicts with 'indices' and 'values' keys, or None on failure.
"""
sparse_cfg = config.get('sparse_embedding')
if not sparse_cfg or not sparse_cfg.get('enabled', False):
return None
url = f"http://{sparse_cfg['host']}:{sparse_cfg['port']}/embed_sparse"
try:
resp = http_requests.post(url, json={"inputs": texts}, timeout=300)
resp.raise_for_status()
return resp.json()
except Exception as e:
logger.warning(f" Sparse embedding failed for batch of {len(texts)}: {e}")
return None
def _validate_content(content):
"""Validate and normalize concept content for embedding. Returns clean string or None."""
if content is None:
return None
if not isinstance(content, str):
content = str(content)
content = content.strip()
if len(content) < 10:
return None
# Truncate to 8192 chars (Ollama/TEI input limit)
if len(content) > 8192:
content = content[:8192]
return content
def _build_payload(doc, concept, idx, source, download_url, source_type, page_timestamps):
"""Build and validate payload for a single concept point."""
start_page = concept.get('_start_page', 0)
payload = {
'doc_hash': doc.get('hash', ''),
'filename': doc['filename'],
'book_title': doc.get('book_title', ''),
'book_author': doc.get('book_author', ''),
'source': source,
'download_url': download_url,
'source_type': source_type,
'verification_status': 'unverified',
'credibility_score': 0.7,
'language': 'en',
}
for field in ['content', 'summary', 'title', 'domain', 'subdomain',
'keywords', 'knowledge_type', 'complexity',
'key_facts', 'scenario_applicable',
'cross_domain_tags', 'chapter', 'page_ref', 'notes',
'_window', '_start_page']:
if field in concept:
payload[field] = concept[field]
# Add video timestamp for transcript sources
if source_type == 'transcript' and page_timestamps:
page_key = f"page_{start_page:04d}"
if page_key in page_timestamps:
payload['video_timestamp'] = page_timestamps[page_key]
# Validate classification fields before returning
payload = _validate_classification(payload)
return payload
def _build_point(point_id, dense_vector, sparse_vec, payload, config):
"""Build a PointStruct with dense vector and optional sparse vector."""
sparse_cfg = config.get('sparse_embedding')
if sparse_cfg and sparse_cfg.get('enabled', False) and sparse_vec:
vector = {
"": dense_vector,
"bge-m3-sparse": SparseVector(
indices=sparse_vec['indices'],
values=sparse_vec['values'],
),
}
else:
vector = {"": dense_vector}
return PointStruct(id=point_id, vector=vector, payload=payload)
def embed_single(file_hash, db, config):
doc = db.get_document(file_hash)
if not doc:
return False
concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
if not os.path.exists(concepts_dir):
db.mark_failed(file_hash, f"Concepts directory not found: {concepts_dir}")
return False
db.update_status(file_hash, 'embedding')
try:
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=60
)
collection = config['vector_db']['collection']
qdrant_batch_size = config['processing']['embed_batch_size']
embed_batch_size = config['embedding'].get('batch_size', 128)
backend = config['embedding'].get('backend', 'ollama')
window_files = sorted([
f for f in os.listdir(concepts_dir)
if f.startswith('window_') and f.endswith('.json')
])
if not window_files:
db.mark_failed(file_hash, "No window files found")
return False
all_concepts = []
for wf in window_files:
with open(os.path.join(concepts_dir, wf), encoding='utf-8') as f:
concepts = json.load(f)
if isinstance(concepts, list):
all_concepts.extend([c for c in concepts if isinstance(c, dict)])
if not all_concepts:
db.update_status(file_hash, 'complete', vectors_inserted=0)
logger.info(f"No concepts to embed for {doc['filename']}")
return True
# Look up source from catalogue once per doc
cat_conn = db._get_conn()
cat_row = cat_conn.execute(
"SELECT source FROM catalogue WHERE hash = ?", (file_hash,)
).fetchone()
source = dict(cat_row)['source'] if cat_row else ''
download_url = ''
is_web = doc.get('path', '').startswith(('http://', 'https://'))
source_type = 'web' if is_web else 'document'
# Check meta.json for explicit source_type (e.g. 'transcript')
text_dir = os.path.join(config['paths']['text'], file_hash)
meta_path = os.path.join(text_dir, 'meta.json')
page_timestamps = {}
if os.path.exists(meta_path):
try:
with open(meta_path) as mf:
meta = json.load(mf)
if meta.get('source_type'):
source_type = meta['source_type']
if not download_url and meta.get('url'):
download_url = meta['url']
if meta.get('page_timestamps'):
page_timestamps = meta['page_timestamps']
except Exception:
pass
if doc.get('path'):
download_url = generate_download_url(
doc['path'], config.get('library_root', '/mnt/library')
)
# Build list of valid concepts with their indices
valid = []
skipped = 0
for idx, concept in enumerate(all_concepts):
content = _validate_content(concept.get('content', ''))
if content is None:
skipped += 1
continue
valid.append((idx, concept, content))
if skipped > 0:
logger.info(f" Skipped {skipped} concepts with invalid/empty content")
if not valid:
db.update_status(file_hash, 'complete', vectors_inserted=0)
logger.info(f"No valid concepts to embed for {doc['filename']}")
return True
points = []
embedded_count = 0
if backend == 'tei':
# TEI: batch embedding
for batch_start in range(0, len(valid), embed_batch_size):
batch = valid[batch_start:batch_start + embed_batch_size]
texts = [content for _, _, content in batch]
try:
vectors = get_embeddings_batch(texts, config)
except Exception as e:
logger.error(f" Batch embedding failed at offset {batch_start}: {e}")
# Skip entire batch on unrecoverable error
continue
# Get sparse embeddings for the same batch
sparse_results = get_sparse_embeddings_batch(texts, config)
for i, ((idx, concept, content), vector) in enumerate(zip(batch, vectors)):
start_page = concept.get('_start_page', 0)
point_id = concept_id(file_hash, start_page, idx)
payload = _build_payload(
doc, concept, idx, source, download_url,
source_type, page_timestamps
)
sparse_vec = sparse_results[i] if sparse_results and i < len(sparse_results) else None
points.append(_build_point(point_id, vector, sparse_vec, payload, config))
embedded_count += 1
if len(points) >= qdrant_batch_size:
qdrant.upsert(collection_name=collection, points=points)
logger.debug(f" Upserted batch of {len(points)} points")
points = []
else:
# Ollama: one-at-a-time with retry
for idx, concept, content in valid:
try:
vector = get_embedding_single(content, config)
except Exception as e:
logger.warning(f" Embedding failed for concept {idx}: {e}")
time.sleep(2)
try:
vector = get_embedding_single(content, config)
except Exception as e2:
logger.error(f" Embedding retry failed for concept {idx}: {e2}")
continue
# Get sparse embedding for single text
sparse_results = get_sparse_embeddings_batch([content], config)
sparse_vec = sparse_results[0] if sparse_results else None
start_page = concept.get('_start_page', 0)
point_id = concept_id(file_hash, start_page, idx)
payload = _build_payload(
doc, concept, idx, source, download_url,
source_type, page_timestamps
)
points.append(_build_point(point_id, vector, sparse_vec, payload, config))
embedded_count += 1
if len(points) >= qdrant_batch_size:
qdrant.upsert(collection_name=collection, points=points)
logger.debug(f" Upserted batch of {len(points)} points")
points = []
if points:
qdrant.upsert(collection_name=collection, points=points)
logger.debug(f" Upserted final batch of {len(points)} points")
db.update_status(file_hash, 'complete', vectors_inserted=embedded_count)
logger.info(f"Embedded {doc['filename']}: {embedded_count} vectors ({skipped} skipped)")
return True
except Exception as e:
logger.error(f"Embedding failed for {file_hash}: {e}\n{traceback.format_exc()}")
db.mark_failed(file_hash, str(e))
return False
def run_embedding(workers=None, limit=None):
config = get_config()
db = StatusDB()
workers = workers or config['processing']['embed_workers']
enriched = db.get_by_status('enriched', limit=limit)
if not enriched:
logger.info("No enriched documents to embed")
return 0
backend = config['embedding'].get('backend', 'ollama')
sparse_cfg = config.get('sparse_embedding')
sparse_status = "enabled" if (sparse_cfg and sparse_cfg.get('enabled')) else "disabled"
logger.info(f"Embedding {len(enriched)} documents with {workers} workers (backend: {backend}, sparse: {sparse_status})")
success = 0
with ThreadPoolExecutor(max_workers=workers) as pool:
futures = {
pool.submit(embed_single, doc['hash'], StatusDB(), config): doc
for doc in enriched
}
for future in as_completed(futures):
doc = futures[future]
try:
if future.result():
success += 1
except Exception as e:
logger.error(f"Worker error for {doc['hash']}: {e}")
logger.info(f"Embedding complete: {success}/{len(enriched)} succeeded")
return success
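
The split-in-half recovery used by `get_embeddings_batch` can be illustrated without a running TEI server. This is a minimal sketch under stated assumptions: `embed_with_split_retry` and `fake_embed` are hypothetical names, and `fake_embed` merely simulates a server that rejects batches above a size limit.

```python
def embed_with_split_retry(texts, embed_fn):
    """Embed texts in one call; on failure, halve the batch and retry each half."""
    try:
        return embed_fn(texts)
    except Exception:
        if len(texts) <= 1:
            raise  # a single text that still fails cannot be split further
        mid = len(texts) // 2
        left = embed_with_split_retry(texts[:mid], embed_fn)
        right = embed_with_split_retry(texts[mid:], embed_fn)
        return left + right  # concatenation preserves input order


def fake_embed(texts, max_batch=2):
    """Hypothetical stand-in for an embedding server that rejects large batches."""
    if len(texts) > max_batch:
        raise RuntimeError(f"batch of {len(texts)} too large")
    # One-dimensional "vector" per text, just so results are easy to check
    return [[float(len(t))] for t in texts]


vectors = embed_with_split_retry(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed)
print(vectors)
```

Because the halves are concatenated in order, the returned vectors still line up with the input texts. A persistent failure on a single text propagates to the caller, which in `embed_single` means the surrounding batch is skipped.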

561
lib/enricher.py Normal file

@ -0,0 +1,561 @@
"""
RECON Enricher
Text to structured concepts via Gemini API. Saves JSON to data/concepts/{hash}/
BEFORE any DB operations. Uses 10-page windows, 4 API keys, 16 workers.
Resilience:
- Exponential backoff with jitter for transient errors (429, 500, 503, timeout)
- Permanent errors (JSON parse, auth) fail immediately without wasting retries
- Window failures skip that window and continue; partial enrichment beats zero
- Document marked enriched if ANY windows succeeded, failed only if ALL failed
Dependencies: google-generativeai
Config: processing.enrich_workers, processing.enrich_window_size, gemini, paths.concepts
"""
import json
import os
import random
import re
import time
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed
import google.generativeai as genai
from .utils import get_config, setup_logging
from .status import StatusDB
logger = setup_logging('recon.enricher')
# Docs stuck in "enriching" longer than this get reset to "extracted" for retry
STALE_ENRICHING_HOURS = 2
# ── Classification allowlists ───────────────────────────────────────────────
VALID_DOMAINS = {
'Agriculture & Livestock', 'Civil Organization', 'Communications',
'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
'Vehicles', 'Water Systems', 'Wilderness Skills',
}
VALID_KNOWLEDGE_TYPES = {'foundational', 'procedural', 'operational'}
VALID_COMPLEXITIES = {'basic', 'intermediate', 'advanced'}
DOMAIN_FALLBACK = 'Foundational Skills'
KNOWLEDGE_TYPE_FALLBACK = 'foundational'
COMPLEXITY_FALLBACK = 'basic'
def repair_json(text):
"""Attempt to repair common LLM JSON output issues including truncation."""
# Remove control characters except newlines and tabs
text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
# Fix invalid JSON escape sequences (e.g. \e, \p, \c from Gemini)
# Valid JSON escapes: \", \\, \/, \b, \f, \n, \r, \t, \uXXXX
text = re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', text)
# Remove trailing commas before } or ]
text = re.sub(r',\s*([}\]])', r'\1', text)
# Handle truncated JSON: try to find the last complete object in the array
try:
json.loads(text, strict=False)
return text
except json.JSONDecodeError:
pass
# Find the last complete }, then close the array
# Scan forward, tracking string state and brace depth, to find where the last complete object closes
last_complete = -1
depth_brace = 0
depth_bracket = 0
in_string = False
escape = False
for i, ch in enumerate(text):
if escape:
escape = False
continue
if ch == '\\' and in_string:
escape = True
continue
if ch == '"' and not escape:
in_string = not in_string
continue
if in_string:
continue
if ch == '{':
depth_brace += 1
elif ch == '}':
depth_brace -= 1
if depth_brace == 0:
last_complete = i
elif ch == '[':
depth_bracket += 1
elif ch == ']':
depth_bracket -= 1
if last_complete > 0:
truncated = text[:last_complete + 1].rstrip().rstrip(',')
# Close any open arrays
open_brackets = truncated.count('[') - truncated.count(']')
truncated += ']' * open_brackets
return truncated
return text
ENRICH_PROMPT = """Extract knowledge concepts from this document text.
A concept is a SELF-CONTAINED piece of knowledge that can stand alone.
For each concept, provide ALL fields:
Required:
- content: Full text of the concept (complete procedure, definition, etc.)
- summary: 1-2 sentence summary
- title: Brief descriptive title
- domain: must be exactly one of: Agriculture & Livestock, Civil Organization, Communications, Food Systems, Foundational Skills, Logistics, Medical, Navigation, Operations, Power Systems, Preservation & Storage, Security, Shelter & Construction, Technology, Tools & Equipment, Vehicles, Water Systems, Wilderness Skills. Return ONLY this exact string: no variations, no new domains, no underscores, no synonyms
CRITICAL: Medical content (first aid, anatomy, pharmacology, herbs, veterinary, austere medicine) -> Medical
CRITICAL: Food growing, farming, animal husbandry, livestock -> Agriculture & Livestock
CRITICAL: Foraging, hunting, fishing, bushcraft, wilderness survival -> Wilderness Skills
CRITICAL: Food preservation, storage, canning, dehydration, processing -> Preservation & Storage
CRITICAL: Solar, wind, hydro, batteries, generators -> Power Systems
CRITICAL: Water sourcing, filtration, sanitation, purification -> Water Systems
CRITICAL: Building, carpentry, structural construction, shelter -> Shelter & Construction
CRITICAL: Tactical operations, mission execution, combat maneuvers, search & rescue -> Operations
CRITICAL: Governance, civil administration, community leadership -> Civil Organization
CRITICAL: Electronics, IT, computing, engineering -> Technology
CRITICAL: Hand tools, power tools, equipment maintenance -> Tools & Equipment
CRITICAL: Motor vehicles, aircraft, watercraft, vehicle maintenance -> Vehicles
CRITICAL: Radio, signals, networking, comms equipment -> Communications
CRITICAL: Supply chain, transport, distribution, inventory -> Logistics
CRITICAL: Physical security, OPSEC, threat assessment -> Security
CRITICAL: Map reading, orienteering, GPS, celestial navigation -> Navigation
CRITICAL: Cooking methods, food production, recipes, nutrition -> Food Systems
- subdomain: Array of specific subcategories (up to 10)
- keywords: Array of 3-30 searchable terms
- knowledge_type: foundational | procedural | operational
foundational: concepts, definitions, theory, background knowledge, explanations of how things work
procedural: step-by-step techniques, instructions, how-to skills, methods you execute
operational: application under real conditions, decision-making, mission execution, judgment calls in context
Valid values are ONLY: foundational, procedural, operational. Do not use any other values.
- complexity: basic | intermediate | advanced
basic: requires little or no prior knowledge, introductory material, simple concepts
intermediate: requires some domain familiarity, assumes foundational knowledge is in place
advanced: requires significant experience or expertise, high-stakes or highly technical material
Valid values are ONLY: basic, intermediate, advanced. Do not use any other values.
- key_facts: Array of specific extractable claims, measurements, data points
Optional (include when present):
- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
- chapter: Chapter name if identifiable
- page_ref: Page reference
- notes: Any additional context
EXAMPLES (knowledge_type + complexity):
- "Needle chest decompression procedure" -> knowledge_type: "procedural", complexity: "advanced"
- "What is soil texture and why does it matter" -> knowledge_type: "foundational", complexity: "basic"
- "Coordinating a fire team withdrawal under contact" -> knowledge_type: "operational", complexity: "advanced"
Return JSON array. If no extractable concepts, return [].
Document text:
"""
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self.index = 0
def next(self):
if not self.keys:
raise ValueError("No Gemini API keys configured")
key = self.keys[self.index % len(self.keys)]
self.index += 1
return key
def enrich_window(text, key, config):
genai.configure(api_key=key)
model = genai.GenerativeModel(
config['gemini']['model'],
generation_config={"response_mime_type": config['gemini']['response_mime_type']}
)
response = model.generate_content(ENRICH_PROMPT + text)
raw = response.text
try:
result = json.loads(raw, strict=False)
except json.JSONDecodeError:
repaired = repair_json(raw)
result = json.loads(repaired, strict=False)
# Filter out non-dict items (nested lists from truncated responses)
if isinstance(result, list):
result = [c for c in result if isinstance(c, dict)]
return result
def _is_transient(error_str):
"""Classify whether an error is transient (worth retrying) or permanent."""
s = error_str.lower()
transient_signals = ['429', 'resource_exhausted', 'quota', 'rate',
'500', '503', 'unavailable', 'timeout',
'connection', 'reset by peer', 'broken pipe']
return any(sig in s for sig in transient_signals)
def _retry_with_backoff(fn, max_retries=5, base_delay=5.0, max_delay=120.0):
"""Retry with exponential backoff + jitter for transient errors.
Backoff between the 5 default attempts: ~5s, ~10s, ~20s, ~40s, each plus up to
base_delay of jitter (roughly 75-95s total before giving up).
Permanent errors (JSON parse, auth) raise immediately without retrying.
"""
last_exc = None
for attempt in range(max_retries):
try:
return fn()
except Exception as e:
last_exc = e
err = str(e)
if not _is_transient(err):
raise # permanent — don't waste retries
if attempt < max_retries - 1:
delay = min(base_delay * (2 ** attempt) + random.uniform(0, base_delay), max_delay)
logger.info(f" Transient error (attempt {attempt+1}/{max_retries}), "
f"retrying in {delay:.0f}s: {err[:120]}")
time.sleep(delay)
else:
logger.warning(f" Transient error, max retries exhausted: {err[:150]}")
raise last_exc
def _reclassify_field(field_name, allowlist, concept, key, config, max_retries=3):
"""Retry Gemini up to max_retries to get a valid value for a specific field."""
content = concept.get('content', concept.get('summary', ''))
if isinstance(content, str):
content = content[:400]
else:
content = str(content)[:400]
title = concept.get('title', '(untitled)')
allowlist_str = ', '.join(sorted(allowlist))
for attempt in range(max_retries):
try:
prompt = (
f"Your previous response for '{field_name}' was invalid. "
f"You must return ONLY one of these exact strings: {allowlist_str}\n\n"
f"Title: {title}\n"
f"Content: {content}\n\n"
f"Return ONLY the exact string, nothing else. No explanation, no punctuation, no quotes."
)
genai.configure(api_key=key)
model = genai.GenerativeModel(
config['gemini']['model'],
generation_config={"response_mime_type": "text/plain"}
)
resp = model.generate_content(prompt)
value = resp.text.strip().strip('"').strip("'").strip()
if value in allowlist:
return value
# Try case-insensitive match for knowledge_type/complexity
for valid in allowlist:
if value.lower() == valid.lower():
return valid
except Exception as e:
err = str(e).lower()
if any(s in err for s in ['429', 'quota', 'rate', '503']):
time.sleep(min(3 * (2 ** attempt) + random.uniform(0, 2), 30))
else:
logger.warning(f" Reclassify retry {attempt+1} for {field_name} failed: {e}")
return None
def validate_and_fix_concepts(concepts, key, config):
"""Validate domain, knowledge_type, complexity on each concept.
For invalid values: retry Gemini up to 3 times, then apply safe fallback.
"""
for concept in concepts:
if not isinstance(concept, dict):
continue
# ── Validate domain ─────────────────────────────────────────────
domain = concept.get('domain')
if isinstance(domain, list):
# Legacy array format — find first valid or reclassify
valid = [d for d in domain if d in VALID_DOMAINS]
if valid:
concept['domain'] = valid[0]
else:
new_val = _reclassify_field('domain', VALID_DOMAINS, concept, key, config)
if new_val:
concept['domain'] = new_val
else:
logger.warning(f"Invalid domain {domain} for '{concept.get('title', '?')}', using fallback")
concept['domain'] = DOMAIN_FALLBACK
elif isinstance(domain, str):
if domain not in VALID_DOMAINS:
new_val = _reclassify_field('domain', VALID_DOMAINS, concept, key, config)
if new_val:
concept['domain'] = new_val
else:
logger.warning(f"Invalid domain '{domain}' for '{concept.get('title', '?')}', using fallback")
concept['domain'] = DOMAIN_FALLBACK
else:
concept['domain'] = DOMAIN_FALLBACK
# ── Validate knowledge_type ─────────────────────────────────────
kt = concept.get('knowledge_type', '')
if isinstance(kt, str):
kt = kt.lower().strip()
else:
kt = ''
if kt not in VALID_KNOWLEDGE_TYPES:
new_val = _reclassify_field('knowledge_type', VALID_KNOWLEDGE_TYPES, concept, key, config)
if new_val:
concept['knowledge_type'] = new_val
else:
logger.warning(f"Invalid knowledge_type '{kt}' for '{concept.get('title', '?')}', using fallback")
concept['knowledge_type'] = KNOWLEDGE_TYPE_FALLBACK
else:
concept['knowledge_type'] = kt
# ── Validate complexity ─────────────────────────────────────────
cx = concept.get('complexity', '')
if isinstance(cx, str):
cx = cx.lower().strip()
else:
cx = ''
if cx not in VALID_COMPLEXITIES:
new_val = _reclassify_field('complexity', VALID_COMPLEXITIES, concept, key, config)
if new_val:
concept['complexity'] = new_val
else:
logger.warning(f"Invalid complexity '{cx}' for '{concept.get('title', '?')}', using fallback")
concept['complexity'] = COMPLEXITY_FALLBACK
else:
concept['complexity'] = cx
return concepts
def enrich_single(file_hash, db, config, key_rotator):
doc = db.get_document(file_hash)
if not doc:
return False
text_dir = os.path.join(config['paths']['text'], file_hash)
concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
window_size = config['processing']['enrich_window_size']
delay = config['processing']['rate_limit_delay']
proc = config.get('processing', {})
max_retries = proc.get('enrich_max_retries', proc.get('max_retries', 5))
base_delay = proc.get('enrich_base_delay', 5.0)
max_delay = proc.get('enrich_max_delay', 120.0)
if not os.path.exists(text_dir):
db.mark_failed(file_hash, f"Text directory not found: {text_dir}")
return False
db.update_status(file_hash, 'enriching')
try:
os.makedirs(concepts_dir, exist_ok=True)
page_files = sorted([f for f in os.listdir(text_dir) if f.startswith('page_') and f.endswith('.txt')])
if not page_files:
db.mark_failed(file_hash, "No page files found")
return False
pages_text = []
for pf in page_files:
with open(os.path.join(text_dir, pf), encoding='utf-8') as f:
pages_text.append(f.read())
windows = []
for i in range(0, len(pages_text), window_size):
window_pages = pages_text[i:i + window_size]
combined = "\n\n".join(f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages))
windows.append((i, combined))
total_concepts = 0
failed_windows = []
for w_idx, (start_page, window_text) in enumerate(windows):
window_file = os.path.join(concepts_dir, f"window_{w_idx+1:04d}.json")
if os.path.exists(window_file):
with open(window_file, encoding='utf-8') as f:
existing = json.load(f)
total_concepts += len(existing)
logger.debug(f" Window {w_idx+1} already exists, skipping")
continue
if len(window_text.strip()) < 50:
with open(window_file, 'w', encoding='utf-8') as f:
json.dump([], f)
continue
# Attempt enrichment with backoff — failures skip the window, not the doc
try:
key = key_rotator.next()
concepts = _retry_with_backoff(
lambda k=key: enrich_window(window_text, k, config),
max_retries=max_retries,
base_delay=base_delay,
max_delay=max_delay,
)
except Exception as e:
failed_windows.append((w_idx + 1, str(e)[:100]))
logger.warning(f" Window {w_idx+1}/{len(windows)} failed: {e}")
continue # skip this window, keep going
if not isinstance(concepts, list):
concepts = [concepts] if isinstance(concepts, dict) else []
concepts = [c for c in concepts if isinstance(c, dict)]
# Validate domain, knowledge_type, complexity — retry then fallback
validation_key = key_rotator.next()
concepts = validate_and_fix_concepts(concepts, validation_key, config)
for c_idx, concept in enumerate(concepts):
concept['_window'] = w_idx + 1
concept['_start_page'] = start_page + 1
concept['_doc_hash'] = file_hash
# JSON FIRST: save before anything else
with open(window_file, 'w', encoding='utf-8') as f:
json.dump(concepts, f, indent=2, ensure_ascii=False)
total_concepts += len(concepts)
logger.debug(f" Window {w_idx+1}/{len(windows)}: {len(concepts)} concepts")
time.sleep(delay)
# Decide document status based on results
meta = {
'hash': file_hash,
'total_windows': len(windows),
'total_concepts': total_concepts,
'failed_windows': len(failed_windows),
'window_size': window_size,
'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
}
with open(os.path.join(concepts_dir, 'meta.json'), 'w') as f:
json.dump(meta, f, indent=2)
if total_concepts > 0 or not failed_windows:
# Some concepts extracted, or all windows were empty — mark enriched
error_msg = None
if total_concepts == 0 and doc.get('page_count', 0) >= 3:
error_msg = (f"0 concepts from {doc.get('page_count', '?')} pages — "
f"likely image-only PDF, may need manual review")
logger.warning(f" {doc['filename']}: {error_msg}")
elif failed_windows:
wins = ', '.join(str(w) for w, _ in failed_windows[:10])
error_msg = (f"Partial: {len(failed_windows)}/{len(windows)} "
f"windows failed (windows {wins})")
logger.warning(f" {doc['filename']}: {error_msg}")
db.update_status(file_hash, 'enriched', concepts_extracted=total_concepts,
error_message=error_msg)
fw_note = f", {len(failed_windows)} windows failed" if failed_windows else ""
logger.info(f"Enriched {doc['filename']}: {total_concepts} concepts "
f"from {len(windows)} windows{fw_note}")
return True
else:
# Every window failed — document truly failed
first_err = failed_windows[0][1] if failed_windows else 'unknown'
db.mark_failed(file_hash,
f"All {len(windows)} windows failed: {first_err}")
logger.error(f" {doc['filename']}: all {len(windows)} windows failed")
return False
except Exception as e:
logger.error(f"Enrichment failed for {file_hash}: {e}\n{traceback.format_exc()}")
db.mark_failed(file_hash, str(e))
return False
def _recover_stale_enriching(db, max_hours=STALE_ENRICHING_HOURS):
    """Reset docs stuck in enriching back to extracted so they get retried.
    This handles the case where a previous enrichment run crashed mid-document.
    The enricher skips already-completed window files, so no work is lost.
    """
    from datetime import datetime, timezone
    conn = db._get_conn()
    rows = conn.execute(
        "SELECT hash, filename FROM documents WHERE status = 'enriching'",
    ).fetchall()
    if not rows:
        return
    # extracted_at is used as a proxy for when enriching started:
    # if it is more than max_hours old, the doc is treated as stale and reset.
    now = datetime.now(timezone.utc)
    reset = []
    for row in rows:
        # Guard against a row that disappeared between the two queries
        doc = db.get_document(row['hash']) or {}
        extracted_at = doc.get('extracted_at', '')
        if not extracted_at:
            reset.append(row)
            continue
        try:
            ts = datetime.fromisoformat(extracted_at)
            if ts.tzinfo is None:
                ts = ts.replace(tzinfo=timezone.utc)
            age_hours = (now - ts).total_seconds() / 3600
            if age_hours > max_hours:
                reset.append(row)
        except Exception:
            # Unparseable timestamp: reset rather than leave the doc stuck
            reset.append(row)
    for row in reset:
        conn.execute(
            "UPDATE documents SET status = 'extracted' WHERE hash = ?",
            (row['hash'],)
        )
        logger.warning(f"Recovered stale enriching doc: {row['filename']} ({row['hash'][:12]}...)")
    if reset:
        conn.commit()
        logger.info(f"Reset {len(reset)} stale enriching docs back to extracted")
def run_enrichment(workers=None, limit=None):
config = get_config()
db = StatusDB()
workers = workers or config['processing']['enrich_workers']
# Recover docs orphaned by previous crashed enrichment runs
_recover_stale_enriching(db)
keys = config.get('gemini_keys', [])
if not keys:
logger.error("No Gemini API keys configured in .env")
return 0
key_rotator = KeyRotator(keys)
extracted = db.get_by_status('extracted', limit=limit)
if not extracted:
logger.info("No extracted documents to enrich")
return 0
logger.info(f"Enriching {len(extracted)} documents with {workers} workers, {len(keys)} API key(s)")
success = 0
with ThreadPoolExecutor(max_workers=workers) as pool:
futures = {
pool.submit(enrich_single, doc['hash'], StatusDB(), config, key_rotator): doc
for doc in extracted
}
for future in as_completed(futures):
doc = futures[future]
try:
if future.result():
success += 1
except Exception as e:
logger.error(f"Worker error for {doc['hash']}: {e}")
logger.info(f"Enrichment complete: {success}/{len(extracted)} succeeded")
return success
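
The page-windowing step inside `enrich_single` is easier to follow in isolation. The sketch below is a hypothetical standalone helper (`build_windows` is an illustrative name, not a function in this module) that mirrors the loop: pages are grouped into fixed-size windows, each page is prefixed with a 1-indexed `--- Page N ---` marker, and every window keeps the 0-indexed offset of its first page.

```python
def build_windows(pages_text, window_size):
    """Group page texts into fixed-size windows with 1-indexed page markers.

    Hypothetical helper for illustration only; it mirrors the windowing
    loop in enrich_single but is not part of the module.
    """
    windows = []
    for start in range(0, len(pages_text), window_size):
        chunk = pages_text[start:start + window_size]
        combined = "\n\n".join(
            f"--- Page {start + offset + 1} ---\n{text}"
            for offset, text in enumerate(chunk)
        )
        # Keep the 0-indexed start page so callers can report page ranges
        windows.append((start, combined))
    return windows


if __name__ == "__main__":
    for start, combined in build_windows(["alpha", "beta", "gamma"], window_size=2):
        print(start, repr(combined))
```

The last window may hold fewer pages than `window_size`; the enricher treats it like any other window and skips it only when its combined text is shorter than 50 characters.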

601
lib/extractor.py Normal file

@ -0,0 +1,601 @@
"""
RECON Text Extractor
PDF to text via PyPDF2 -> pdftotext -> Tesseract -> Gemini Vision fallback chain.
Saves to data/text/{hash}/page_NNNN.txt (4-digit zero-padded, 1-indexed).
Safety guards:
- Layer 1: Pre-flight size check (max_pdf_size_mb, default 200)
- Layer 2: Per-document timeout (extract_timeout, default 300s)
- Layer 3: Per-page timeout (page_timeout, default 30s)
- Partial extractions saved as 'extracted' with error_message noting incompleteness
Fallback chain per page:
1. PyPDF2 (fast, free, text-based PDFs)
2. pdftotext/poppler (handles some PDFs PyPDF2 misses)
3. Tesseract OCR (renders page -> local OCR)
4. Gemini Vision (renders page -> cloud vision API, last resort for scanned docs)
Dependencies: PyPDF2, poppler-utils (pdftotext, pdftoppm, pdfinfo), pytesseract, Pillow, google-generativeai
Config: processing.extract_workers, processing.max_pdf_size_mb,
processing.extract_timeout, processing.page_timeout
"""
import base64
import json
import os
import random
import subprocess
import tempfile
import threading
import time
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeoutError
from pathlib import Path
import google.generativeai as genai
from PyPDF2 import PdfReader
from .utils import get_config, content_hash, clean_filename_to_title, setup_logging
from .status import StatusDB
logger = setup_logging('recon.extractor')
# ── Gemini Vision singleton (lazy, thread-safe) ──
_vision_keys = None
_vision_key_index = 0
_vision_lock = threading.Lock()
def _get_vision_keys():
"""Load Gemini API keys once from .env (same keys the enricher uses)."""
global _vision_keys
if _vision_keys is not None:
return _vision_keys
with _vision_lock:
if _vision_keys is not None:
return _vision_keys
keys = []
env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
if os.path.exists(env_path):
with open(env_path) as f:
for line in f:
line = line.strip()
if not line or line.startswith('#') or '=' not in line:
continue
key_name, val = line.split('=', 1)
val = val.strip().strip('"').strip("'")
if key_name.strip().startswith('GEMINI_KEY_') and val != 'PASTE_KEY_HERE':
keys.append(val)
_vision_keys = keys
if keys:
logger.info(f"Gemini vision OCR: {len(keys)} API key(s) available")
else:
logger.warning("No Gemini API keys found — vision OCR fallback disabled")
return keys
def _next_vision_key():
"""Round-robin through available Gemini keys."""
global _vision_key_index
keys = _get_vision_keys()
if not keys:
return None
with _vision_lock:
key = keys[_vision_key_index % len(keys)]
_vision_key_index += 1
return key
def _is_transient(error_str):
"""Classify whether an error is transient (worth retrying)."""
s = error_str.lower()
transient_signals = ['429', 'resource_exhausted', 'quota', 'rate',
'500', '503', 'unavailable', 'timeout',
'connection', 'reset by peer', 'broken pipe']
return any(sig in s for sig in transient_signals)
def _render_page_to_png(pdf_path, page_num_1indexed, dpi=200, timeout=30):
"""Render a single PDF page to PNG bytes using pdftoppm.
Args:
pdf_path: Path to PDF file
page_num_1indexed: 1-indexed page number
dpi: Resolution (200 = readable text, reasonable file size)
timeout: Subprocess timeout in seconds
Returns:
bytes or None: PNG image data, or None if render fails/blank
"""
with tempfile.TemporaryDirectory() as tmpdir:
prefix = os.path.join(tmpdir, 'page')
try:
subprocess.run(
['pdftoppm', '-f', str(page_num_1indexed), '-l', str(page_num_1indexed),
'-png', '-r', str(dpi), pdf_path, prefix],
capture_output=True, timeout=timeout, check=True
)
png_files = list(Path(tmpdir).glob('*.png'))
if not png_files:
return None
img_data = png_files[0].read_bytes()
# Skip blank pages (tiny image = solid white/blank page)
if len(img_data) < 5000:
return None
return img_data
except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
return None
def _try_gemini_vision(pdf_path, page_num_1indexed, page_timeout=60):
"""Last-resort OCR: render page to image, send to Gemini vision.
Only called when PyPDF2, pdftotext, AND Tesseract all failed.
Args:
pdf_path: Path to PDF file
page_num_1indexed: 1-indexed page number
page_timeout: Max time for the render + API call
Returns:
str: Extracted text, or empty string if vision fails
"""
api_key = _next_vision_key()
if api_key is None:
return ''
# Render page to PNG
img_data = _render_page_to_png(pdf_path, page_num_1indexed, timeout=min(page_timeout, 30))
if img_data is None:
return ''
# Call Gemini vision with retry for transient errors
last_exc = None
for attempt in range(3):
try:
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-2.0-flash')
response = model.generate_content([
{
'mime_type': 'image/png',
'data': base64.b64encode(img_data).decode('utf-8')
},
"Extract ALL text from this scanned document page exactly as written. "
"Preserve headings, lists, numbered items, tables, and paragraph structure. "
"Return ONLY the extracted text, no commentary or markdown formatting."
])
if response and response.text:
text = response.text.strip()
if len(text) > 10:
return text
return ''
except Exception as e:
last_exc = e
if not _is_transient(str(e)):
break # permanent error — don't retry
if attempt < 2:
delay = 5.0 * (2 ** attempt) + random.uniform(0, 3)
time.sleep(delay)
# Rotate to next key on rate limit
api_key = _next_vision_key() or api_key
if last_exc:
logger.debug(f" Vision OCR failed page {page_num_1indexed}: {last_exc}")
return ''
def _get_page_count(pdf_path):
"""Get page count using pdfinfo (poppler) as fallback when PdfReader fails."""
try:
result = subprocess.run(
['pdfinfo', pdf_path],
capture_output=True, text=True, timeout=30
)
if result.returncode == 0:
for line in result.stdout.splitlines():
if line.startswith('Pages:'):
return int(line.split(':', 1)[1].strip())
except Exception:
pass
return 0
def _extract_page_without_reader(pdf_path, page_num_0indexed, page_timeout=30):
"""Extract text from a single page WITHOUT PyPDF2 reader.
Used when PdfReader() fails entirely (corrupt/encrypted PDFs).
Runs the pdftotext -> Tesseract -> Gemini Vision fallback chain.
Returns:
tuple: (text, ocr_method)
"""
text = ''
# Method 1: pdftotext (poppler)
try:
result = subprocess.run(
['pdftotext', '-f', str(page_num_0indexed + 1),
'-l', str(page_num_0indexed + 1), pdf_path, '-'],
capture_output=True, text=True, timeout=page_timeout
)
if result.returncode == 0:
text = result.stdout
except Exception:
pass
if len(text.strip()) >= 50:
return text, 'pdftotext'
# Method 2: pdftoppm + Tesseract OCR
try:
from PIL import Image
import pytesseract
result = subprocess.run(
['pdftoppm', '-f', str(page_num_0indexed + 1),
'-l', str(page_num_0indexed + 1),
'-png', '-singlefile', pdf_path],  # no PPM-root: PNG goes to stdout ('-' would create a stray "-.png" file)
capture_output=True, timeout=page_timeout * 2
)
if result.returncode == 0 and result.stdout:
with tempfile.NamedTemporaryFile(suffix='.png', delete=True) as tmp:
tmp.write(result.stdout)
tmp.flush()
img = Image.open(tmp.name)
ocr_text = pytesseract.image_to_string(img)
if len(ocr_text.strip()) > len(text.strip()):
text = ocr_text
except Exception:
pass
if len(text.strip()) >= 50:
return text, 'tesseract'
# Method 3: Gemini Vision (last resort)
vision_text = _try_gemini_vision(pdf_path, page_num_0indexed + 1,
page_timeout=page_timeout * 2)
if len(vision_text.strip()) > len(text.strip()):
text = vision_text
if len(text.strip()) >= 10:
return text, 'gemini_vision'
return text, 'none'
# ── Core extraction functions ──
def _pypdf2_extract(reader, page_num):
"""Extract text from a PyPDF2 page object. Runs inside a thread for timeout."""
return reader.pages[page_num].extract_text() or ''
def extract_text_from_page(reader, page_num, pdf_path, page_timeout=30):
"""Extract text from a single page with fallback chain.
Returns:
tuple: (text, ocr_method) where ocr_method is one of:
'pypdf2', 'pdftotext', 'tesseract', 'gemini_vision', 'none'
"""
# Method 1: PyPDF2 (wrapped in thread for timeout — extract_text() can hang)
text = ''
try:
ex = ThreadPoolExecutor(1)
future = ex.submit(_pypdf2_extract, reader, page_num)
try:
text = future.result(timeout=page_timeout)
except FuturesTimeoutError:
logger.warning(f" PyPDF2 timeout on page {page_num + 1}")
text = ''
finally:
ex.shutdown(wait=False, cancel_futures=True)
except Exception:
text = ''
if len(text.strip()) >= 50:
return text, 'pypdf2'
# Method 2: pdftotext via subprocess (inherently timeout-safe)
try:
result = subprocess.run(
['pdftotext', '-f', str(page_num + 1), '-l', str(page_num + 1), pdf_path, '-'],
capture_output=True, text=True, timeout=page_timeout
)
if result.returncode == 0 and len(result.stdout.strip()) > len(text.strip()):
text = result.stdout
except Exception:
pass
if len(text.strip()) >= 50:
return text, 'pdftotext'
# Method 3: pdftoppm + Tesseract OCR
try:
from PIL import Image
import pytesseract
result = subprocess.run(
['pdftoppm', '-f', str(page_num + 1), '-l', str(page_num + 1),
'-png', '-singlefile', pdf_path],  # no PPM-root: PNG goes to stdout ('-' would create a stray "-.png" file)
capture_output=True, timeout=page_timeout * 2
)
if result.returncode == 0 and result.stdout:
with tempfile.NamedTemporaryFile(suffix='.png', delete=True) as tmp:
tmp.write(result.stdout)
tmp.flush()
img = Image.open(tmp.name)
ocr_text = pytesseract.image_to_string(img)
if len(ocr_text.strip()) > len(text.strip()):
text = ocr_text
except Exception:
pass
if len(text.strip()) >= 50:
return text, 'tesseract'
# Method 4: Gemini Vision (last resort — costs API calls but handles scanned docs)
vision_text = _try_gemini_vision(pdf_path, page_num + 1, page_timeout=page_timeout * 2)
if len(vision_text.strip()) > len(text.strip()):
text = vision_text
if len(text.strip()) >= 10:
return text, 'gemini_vision'
return text, 'none'
def extract_book_metadata(first_page_text, config):
keys = config.get('gemini_keys', [])
if not keys or len(first_page_text.strip()) < 20:
return None, None
try:
genai.configure(api_key=keys[0])
model = genai.GenerativeModel(
config['gemini']['model'],
generation_config={"response_mime_type": config['gemini']['response_mime_type']}
)
prompt = f"""Extract the book title and author from this first page text.
Return JSON: {{"title": "...", "author": "..."}}
If unknown, use null for that field.
Text:
{first_page_text[:3000]}"""
response = model.generate_content(prompt)
data = json.loads(response.text)
return data.get('title'), data.get('author')
except Exception as e:
logger.warning(f"Metadata extraction failed: {e}")
return None, None
def extract_single(file_hash, db, config):
doc = db.get_document(file_hash)
if not doc:
return False
pdf_path = doc['path']
filename = doc['filename']
text_dir = os.path.join(config['paths']['text'], file_hash)
if not os.path.exists(pdf_path):
db.mark_failed(file_hash, f"File not found: {pdf_path}")
return False
# Layer 1: Pre-flight size check
proc = config.get('processing', {})
max_size_mb = proc.get('max_pdf_size_mb', 200)
try:
file_size_mb = os.path.getsize(pdf_path) / 1048576
except OSError as e:
db.mark_failed(file_hash, f"Cannot stat file: {e}")
return False
if file_size_mb > max_size_mb:
msg = f"Skipped: {file_size_mb:.0f}MB exceeds {max_size_mb}MB limit"
logger.warning(f"SIZE SKIP: {filename}: {msg}")
db.mark_failed(file_hash, msg)
return False
db.update_status(file_hash, 'extracting')
# Layer 2/3 setup
max_doc_seconds = proc.get('extract_timeout', 300)
page_timeout = proc.get('page_timeout', 30)
start_time = time.time()
page_count = 0
pages_extracted = 0
skipped_pages = 0
ocr_pages = []
ocr_methods = {'pypdf2': 0, 'pdftotext': 0, 'tesseract': 0, 'gemini_vision': 0, 'none': 0}
try:
os.makedirs(text_dir, exist_ok=True)
# Try PyPDF2 first; fall back to poppler-only extraction if it fails
reader = None
use_reader = True
try:
reader = PdfReader(pdf_path)
page_count = len(reader.pages)
except Exception as pdf_err:
logger.warning(f"PdfReader failed for {filename}: {pdf_err} — using poppler fallback")
use_reader = False
page_count = _get_page_count(pdf_path)
if page_count == 0:
db.mark_failed(file_hash, f"PdfReader failed and pdfinfo returned 0 pages: {str(pdf_err)[:200]}")
return False
for i in range(page_count):
# Layer 2: Check total document time budget
elapsed = time.time() - start_time
if elapsed > max_doc_seconds:
msg = f"Timed out after {elapsed:.0f}s at page {i}/{page_count}"
logger.warning(f"TIMEOUT: {filename}: {msg}")
if pages_extracted > 0:
_save_partial(file_hash, db, doc, config, text_dir,
page_count, pages_extracted, ocr_pages,
f"Partial: {pages_extracted}/{page_count} pages "
f"(timed out after {elapsed:.0f}s)",
ocr_methods=ocr_methods)
return True
else:
db.mark_failed(file_hash, msg)
return False
# Layer 3: Per-page extraction with fallback chain
try:
if use_reader:
text, method = extract_text_from_page(reader, i, pdf_path, page_timeout)
else:
text, method = _extract_page_without_reader(pdf_path, i, page_timeout)
ocr_methods[method] += 1
if method in ('tesseract', 'gemini_vision'):
ocr_pages.append(i + 1)
except Exception as e:
logger.warning(f" Page {i+1}/{page_count} failed: {e} — skipping")
text = ''
skipped_pages += 1
ocr_methods['none'] += 1
page_file = os.path.join(text_dir, f"page_{i+1:04d}.txt")
with open(page_file, 'w', encoding='utf-8') as f:
f.write(text)
if text.strip():
pages_extracted += 1
# Progress logging every 50 pages (more frequent since vision is slower)
if (i + 1) % 50 == 0:
el = time.time() - start_time
rate = (i + 1) / el if el > 0 else 0
vision_n = ocr_methods['gemini_vision']
vision_note = f", {vision_n} vision" if vision_n else ""
logger.info(f" {filename}: page {i+1}/{page_count} "
f"({rate:.1f} pages/sec, {skipped_pages} skipped{vision_note})")
# Full extraction complete — save metadata
first_page_text = ''
first_page_file = os.path.join(text_dir, 'page_0001.txt')
if os.path.exists(first_page_file):
with open(first_page_file, encoding='utf-8') as f:
first_page_text = f.read()
book_title, book_author = extract_book_metadata(first_page_text, config)
if not book_title:
book_title = clean_filename_to_title(filename)
meta = {
'hash': file_hash,
'filename': filename,
'page_count': page_count,
'ocr_pages': ocr_pages,
'skipped_pages': skipped_pages,
'ocr_methods': ocr_methods,
}
with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
json.dump(meta, f, indent=2)
kwargs = {
'page_count': page_count,
'pages_extracted': pages_extracted,
'book_title': book_title,
}
if book_author:
kwargs['book_author'] = book_author
if skipped_pages > 0:
kwargs['error_message'] = (f"Partial: {pages_extracted}/{page_count} pages "
f"({skipped_pages} pages skipped)")
elapsed = time.time() - start_time
db.update_status(file_hash, 'extracted', **kwargs)
ocr_note = f", {len(ocr_pages)} OCR" if ocr_pages else ""
skip_note = f", {skipped_pages} skipped" if skipped_pages > 0 else ""
vision_note = f", {ocr_methods['gemini_vision']} vision" if ocr_methods['gemini_vision'] else ""
logger.info(f"Extracted {filename}: {pages_extracted}/{page_count} pages "
f"({elapsed:.1f}s{ocr_note}{vision_note}{skip_note})")
return True
except Exception as e:
logger.error(f"Extraction failed for {file_hash}: {e}\n{traceback.format_exc()}")
if pages_extracted > 0:
_save_partial(file_hash, db, doc, config, text_dir,
page_count, pages_extracted, ocr_pages,
f"Partial: {pages_extracted}/{page_count} pages "
f"({str(e)[:150]})",
ocr_methods=ocr_methods)
return True
db.mark_failed(file_hash, str(e)[:500])
return False
def _save_partial(file_hash, db, doc, config, text_dir, page_count,
pages_extracted, ocr_pages, error_msg, ocr_methods=None):
"""Save metadata and mark a partial extraction as 'extracted'."""
book_title = clean_filename_to_title(doc['filename'])
first_page_file = os.path.join(text_dir, 'page_0001.txt')
if os.path.exists(first_page_file):
with open(first_page_file, encoding='utf-8') as f:
first_text = f.read()
if len(first_text.strip()) > 20:
title, _ = extract_book_metadata(first_text, config)
if title:
book_title = title
meta = {
'hash': file_hash,
'filename': doc['filename'],
'page_count': page_count,
'ocr_pages': ocr_pages,
'partial': True,
}
if ocr_methods:
meta['ocr_methods'] = ocr_methods
with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
json.dump(meta, f, indent=2)
db.update_status(file_hash, 'extracted',
page_count=page_count,
pages_extracted=pages_extracted,
book_title=book_title,
error_message=error_msg)
logger.info(f" Saved partial extraction: {pages_extracted}/{page_count} pages")
def run_extraction(workers=None):
config = get_config()
db = StatusDB()
workers = workers or config['processing']['extract_workers']
queued = db.get_by_status('queued')
if not queued:
logger.info("No queued documents to extract")
return 0
logger.info(f"Extracting {len(queued)} documents with {workers} workers")
success = 0
with ThreadPoolExecutor(max_workers=workers) as pool:
futures = {pool.submit(extract_single, doc['hash'], StatusDB(), config): doc for doc in queued}
for future in as_completed(futures):
doc = futures[future]
try:
if future.result():
success += 1
except Exception as e:
logger.error(f"Worker error for {doc['hash']}: {e}")
logger.info(f"Extraction complete: {success}/{len(queued)} succeeded")
return success
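
The per-page fallback chain described in the module docstring follows one pattern: try each method in order, keep the longest text seen so far, and stop as soon as the text is long enough. The sketch below is a simplified, hypothetical illustration (`first_sufficient` and its `(name, callable)` interface are assumptions, not code from this module). It leaves out the lower 10-character threshold that the real chain applies to the Gemini Vision step.

```python
def first_sufficient(extractors, min_chars=50):
    """Run (name, callable) extractors in order and stop at the first good result.

    Returns (text, method). If no extractor reaches min_chars, the longest
    text seen is returned with method 'none'. Hypothetical helper for
    illustration; not part of the extractor module.
    """
    best_text, best_name = '', 'none'
    for name, fn in extractors:
        try:
            text = fn() or ''
        except Exception:
            # A failing method must not abort the chain; try the next one
            continue
        if len(text.strip()) > len(best_text.strip()):
            best_text, best_name = text, name
        if len(best_text.strip()) >= min_chars:
            return best_text, best_name
    return best_text, 'none'


if __name__ == "__main__":
    chain = [
        ("pypdf2", lambda: ""),             # e.g. a scanned page with no text layer
        ("pdftotext", lambda: "too short"),
        ("tesseract", lambda: "x" * 80),    # OCR finally yields enough text
    ]
    text, method = first_sufficient(chain)
    print(method, len(text))
```

Ordering the methods from cheapest to most expensive means the cloud vision call is only paid for on pages where every local method came up short.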

159
lib/ingester.py Normal file

@ -0,0 +1,159 @@
"""
RECON Intel Ingester
ARGUS intelligence feed intake. Embeds intel JSON and inserts into Qdrant
with source_type='intel_feed'.
Dependencies: requests, qdrant-client
Config: embedding, vector_db
"""
import json
import os
import time
import traceback
import requests as http_requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from .utils import get_config, setup_logging
from .status import StatusDB
logger = setup_logging('recon.ingester')
def ingest_intel(intel_data, config=None):
    if config is None:
        config = get_config()
    db = StatusDB()
    required = ['source', 'category', 'content']
    for field in required:
        if field not in intel_data:
            logger.error(f"Missing required field: {field}")
            return None
    # Resolve optional fields once so the SQLite row and the Qdrant payload
    # always carry the same values (notably the generated default timestamp).
    source = intel_data['source']
    timestamp = intel_data.get('timestamp', time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()))
    region = intel_data.get('region', 'unknown')
    summary = intel_data.get('summary', '')
    key_facts = intel_data.get('key_facts', [])
    credibility_score = intel_data.get('credibility_score', 0.5)
    verification_status = intel_data.get('verification_status', 'unverified')
    try:
        conn = db._get_conn()
        cursor = conn.execute(
            """INSERT INTO intel (source, timestamp, region, category, content,
            summary, key_facts, credibility_score, verification_status)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
            (
                source,
                timestamp,
                region,
                intel_data['category'],
                intel_data['content'],
                summary,
                json.dumps(key_facts),
                credibility_score,
                verification_status,
            )
        )
        intel_id = cursor.lastrowid
        conn.commit()
        # Embed the content via the configured embedding server
        url = f"http://{config['embedding']['host']}:{config['embedding']['port']}/api/embed"
        resp = http_requests.post(url, json={
            "model": config['embedding']['model'],
            "input": intel_data['content']
        }, timeout=120)
        resp.raise_for_status()
        vector = resp.json()['embeddings'][0]
        qdrant = QdrantClient(
            host=config['vector_db']['host'],
            port=config['vector_db']['port'],
            timeout=60
        )
        # Offset the SQLite row id to form the Qdrant point id
        point_id = intel_id + 2**60
        payload = {
            'source_type': 'intel_feed',
            'intel_id': intel_id,
            'source': source,
            'region': region,
            'category': intel_data['category'],
            'content': intel_data['content'],
            'summary': summary,
            'key_facts': key_facts,
            'credibility_score': credibility_score,
            'verification_status': verification_status,
            'timestamp': timestamp,
            'language': 'en',
        }
        qdrant.upsert(
            collection_name=config['vector_db']['collection'],
            points=[PointStruct(id=point_id, vector=vector, payload=payload)]
        )
        conn.execute("UPDATE intel SET vector_id = ? WHERE id = ?", (point_id, intel_id))
        conn.commit()
        logger.info(f"Ingested intel #{intel_id} from {source}")
        return intel_id
    except Exception as e:
        logger.error(f"Intel ingestion failed: {e}\n{traceback.format_exc()}")
        return None
def ingest_file(filepath, config=None):
if config is None:
config = get_config()
try:
with open(filepath, encoding='utf-8') as f:
data = json.load(f)
if isinstance(data, list):
results = []
for item in data:
result = ingest_intel(item, config)
results.append(result)
success = sum(1 for r in results if r is not None)
logger.info(f"Ingested {success}/{len(data)} items from {filepath}")
return results
else:
return [ingest_intel(data, config)]
except Exception as e:
logger.error(f"Failed to ingest file {filepath}: {e}")
return []
def run_ingestion(directory=None):
config = get_config()
intel_dir = directory or config['paths']['intel']
if not os.path.exists(intel_dir):
logger.info(f"Intel directory does not exist: {intel_dir}")
return 0
json_files = sorted([
f for f in os.listdir(intel_dir)
if f.endswith('.json') and not f.startswith('.')
])
if not json_files:
logger.info("No intel files to ingest")
return 0
total = 0
for jf in json_files:
filepath = os.path.join(intel_dir, jf)
results = ingest_file(filepath, config)
ingested = sum(1 for r in results if r is not None)
total += ingested
if ingested > 0:
done_dir = os.path.join(intel_dir, 'processed')
os.makedirs(done_dir, exist_ok=True)
os.rename(filepath, os.path.join(done_dir, jf))
logger.info(f"Intel ingestion complete: {total} items ingested")
return total
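
The ingester derives each Qdrant point ID by adding `2**60` to the SQLite row id. The source does not say why; a plausible reading, stated here as an assumption, is that the offset keeps intel points in a numeric range that is unlikely to collide with other points in the same collection. The sketch below uses hypothetical helper names (`intel_point_id`, `intel_id_from_point`) to show the mapping in both directions, with a bounds check because Qdrant integer IDs are unsigned 64-bit values.

```python
INTEL_ID_OFFSET = 2**60   # same constant the ingester adds to the row id
U64_MAX = 2**64 - 1       # Qdrant integer point ids are unsigned 64-bit


def intel_point_id(intel_id):
    """Map a SQLite intel row id to a Qdrant point id (hypothetical helper)."""
    if intel_id < 1:
        raise ValueError("intel_id must be a positive SQLite row id")
    point_id = intel_id + INTEL_ID_OFFSET
    if point_id > U64_MAX:
        raise OverflowError("point id does not fit in an unsigned 64-bit integer")
    return point_id


def intel_id_from_point(point_id):
    """Recover the row id, or None if the point is outside the intel range."""
    if point_id <= INTEL_ID_OFFSET:
        return None
    return point_id - INTEL_ID_OFFSET


if __name__ == "__main__":
    pid = intel_point_id(42)
    print(pid, intel_id_from_point(pid))
```

Because the mapping is reversible, the `intel_id` stored in the payload and the `vector_id` written back to SQLite can each be cross-checked against the point ID.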

270
lib/key_manager.py Normal file

@ -0,0 +1,270 @@
"""
RECON Key Manager - Thread-safe API key management with hot-reload.
Provides a singleton KeyManager that workers (enricher, extractor) read from
instead of loading .env directly. Dashboard can update keys at runtime without
restarting the service.
Dependencies: None beyond stdlib + requests (already in requirements.txt)
Config: Reads/writes /opt/recon/.env
"""
import os
import re
import time
import logging
import threading
import requests
logger = logging.getLogger('recon.key_manager')
class KeyManager:
"""Thread-safe API key store with hot-reload and validation."""
_instance = None
_lock = threading.Lock()
def __new__(cls):
if cls._instance is None:
with cls._lock:
if cls._instance is None:
cls._instance = super().__new__(cls)
cls._instance._initialized = False
return cls._instance
def __init__(self):
if self._initialized:
return
self._keys_lock = threading.RLock()
self._gemini_keys = []
self._env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
self._last_loaded = None
self._key_stats = {} # key_index -> {calls, errors, last_used}
self._load_from_env()
self._initialized = True
logger.info(f"KeyManager initialized with {len(self._gemini_keys)} Gemini key(s)")
# ── Read Operations ──
def get_gemini_keys(self):
"""Return a copy of current Gemini keys. Thread-safe."""
with self._keys_lock:
return list(self._gemini_keys)
def get_gemini_key(self, index=0):
"""Get a single Gemini key by index. Returns None if out of range."""
with self._keys_lock:
if 0 <= index < len(self._gemini_keys):
return self._gemini_keys[index]
return None
def get_gemini_key_count(self):
"""Return number of loaded Gemini keys."""
with self._keys_lock:
return len(self._gemini_keys)
def get_masked_keys(self):
"""Return keys masked for display: first 8 + ... + last 4 chars."""
with self._keys_lock:
result = []
for i, key in enumerate(self._gemini_keys):
if len(key) > 16:
masked = key[:8] + '...' + key[-4:]
elif len(key) > 8:
masked = key[:4] + '...' + key[-2:]
else:
masked = '****'
stats = self._key_stats.get(i, {})
result.append({
'index': i,
'masked': masked,
'length': len(key),
'calls': stats.get('calls', 0),
'errors': stats.get('errors', 0),
'last_used': stats.get('last_used', None),
'valid': stats.get('valid', None),
'last_validated': stats.get('last_validated', None),
})
return result
# ── Write Operations (all persist to .env) ──
def set_gemini_keys(self, keys):
"""Replace all Gemini keys. Persists to .env. Returns success bool."""
# Filter empty strings
keys = [k.strip() for k in keys if k.strip()]
with self._keys_lock:
self._gemini_keys = keys
self._key_stats = {} # Reset stats on full replace
self._persist_to_env()
logger.info(f"Gemini keys replaced: {len(keys)} key(s) loaded")
return True
def add_gemini_key(self, key):
"""Add a single Gemini key. Persists to .env. Returns new index."""
key = key.strip()
if not key:
raise ValueError("Key cannot be empty")
with self._keys_lock:
# Check for duplicates
if key in self._gemini_keys:
raise ValueError("Key already exists")
self._gemini_keys.append(key)
idx = len(self._gemini_keys) - 1
self._persist_to_env()
logger.info(f"Gemini key added at index {idx}")
return idx
def remove_gemini_key(self, index):
"""Remove a Gemini key by index. Persists to .env. Returns removed key (masked)."""
with self._keys_lock:
if index < 0 or index >= len(self._gemini_keys):
raise IndexError(f"Key index {index} out of range (have {len(self._gemini_keys)} keys)")
if len(self._gemini_keys) <= 1:
raise ValueError("Cannot remove last key — pipeline needs at least 1 Gemini key")
key = self._gemini_keys.pop(index)
# Rebuild stats with shifted indices
new_stats = {}
for i, stats in self._key_stats.items():
if i < index:
new_stats[i] = stats
elif i > index:
new_stats[i - 1] = stats
self._key_stats = new_stats
self._persist_to_env()
masked = key[:8] + '...' + key[-4:] if len(key) > 16 else '****'
logger.info(f"Gemini key removed at index {index}: {masked}")
return masked
def replace_gemini_key(self, index, new_key):
"""Replace a single Gemini key at index. Persists to .env."""
new_key = new_key.strip()
if not new_key:
raise ValueError("Key cannot be empty")
with self._keys_lock:
if index < 0 or index >= len(self._gemini_keys):
raise IndexError(f"Key index {index} out of range")
# Check duplicate (but allow replacing with same key)
if new_key in self._gemini_keys and self._gemini_keys[index] != new_key:
raise ValueError("Key already exists at another index")
self._gemini_keys[index] = new_key
if index in self._key_stats:
self._key_stats[index] = {} # Reset stats for replaced key
self._persist_to_env()
logger.info(f"Gemini key replaced at index {index}")
    # ── Validation ──
    def validate_key(self, key):
        """
        Test a Gemini API key by listing models.
        Returns (valid: bool, message: str).
        """
        try:
            # Pass the key via params instead of formatting it into the URL string
            resp = requests.get(
                "https://generativelanguage.googleapis.com/v1beta/models",
                params={"key": key},
                timeout=10
            )
            if resp.status_code == 200 and 'models' in resp.text:
                return True, "Valid — API responded"
            elif resp.status_code == 400:
                return False, f"Invalid key (HTTP {resp.status_code})"
            elif resp.status_code == 403:
                return False, "Key disabled or quota exhausted"
            elif resp.status_code == 429:
                return True, "Valid — but currently rate-limited"
            else:
                return False, f"Unexpected response (HTTP {resp.status_code})"
        except requests.Timeout:
            return False, "Timeout — could not reach Gemini API"
        except requests.ConnectionError:
            return False, "Connection error — check network"
        except Exception as e:
            # Exception text from requests can contain the full request URL,
            # including the key, so report only the exception type
            return False, f"Error: {type(e).__name__}"
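The status-code branches above can be exercised without any network access by separating the mapping from the HTTP call. This is a sketch under that assumption; `classify_models_response` is a hypothetical name that does not exist in the module, and its message strings are illustrative.

```python
def classify_models_response(status_code, body_text):
    """Map an HTTP status from the models endpoint to (valid, message).

    Hypothetical helper; the branches mirror those in validate_key.
    """
    if status_code == 200 and 'models' in body_text:
        return True, "Valid: API responded"
    if status_code == 400:
        return False, f"Invalid key (HTTP {status_code})"
    if status_code == 403:
        return False, "Key disabled or quota exhausted"
    if status_code == 429:
        # The key was accepted; only the request rate was rejected
        return True, "Valid, but currently rate-limited"
    return False, f"Unexpected response (HTTP {status_code})"


print(classify_models_response(429, ''))
```

Note that a 200 response whose body lacks `models` falls through to the "unexpected" branch, as it does in the original chain.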
    def validate_all(self):
        """Validate all loaded Gemini keys. Returns list of results."""
        results = []
        with self._keys_lock:
            keys_copy = list(enumerate(self._gemini_keys))
        for i, key in keys_copy:
            valid, message = self.validate_key(key)
            with self._keys_lock:
                if i not in self._key_stats:
                    self._key_stats[i] = {}
                self._key_stats[i]['valid'] = valid
                self._key_stats[i]['last_validated'] = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
            results.append({'index': i, 'valid': valid, 'message': message})
            time.sleep(0.2)  # Don't hammer the API
        return results
    # ── Stats tracking (called by enricher/extractor) ──
    def record_usage(self, key_index, success=True):
        """Record a key usage event. Called by workers after each Gemini call."""
        with self._keys_lock:
            if key_index not in self._key_stats:
                self._key_stats[key_index] = {'calls': 0, 'errors': 0}
            self._key_stats[key_index]['calls'] = self._key_stats[key_index].get('calls', 0) + 1
            if not success:
                self._key_stats[key_index]['errors'] = self._key_stats[key_index].get('errors', 0) + 1
            self._key_stats[key_index]['last_used'] = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
    # ── Internal ──
    def _load_from_env(self):
        """Load Gemini keys from .env file."""
        keys = []
        if os.path.exists(self._env_path):
            with open(self._env_path, 'r') as f:
                for line in f:
                    line = line.strip()
                    if line and not line.startswith('#'):
                        match = re.match(r'^GEMINI_KEY(?:_\d+)?=(.+)$', line)
                        if match:
                            val = match.group(1).strip().strip('"').strip("'")
                            if val:
                                keys.append(val)
        self._gemini_keys = keys
        self._last_loaded = time.time()
    def _persist_to_env(self):
        """Write current keys back to .env file, preserving non-Gemini lines."""
        other_lines = []
        if os.path.exists(self._env_path):
            with open(self._env_path, 'r') as f:
                for line in f:
                    stripped = line.strip()
                    # Use the same pattern as _load_from_env so that only lines the
                    # loader recognizes as keys are rewritten; other variables that
                    # merely start with GEMINI_KEY are preserved
                    if stripped and not re.match(r'^GEMINI_KEY(?:_\d+)?=', stripped):
                        other_lines.append(line.rstrip('\n'))
        with open(self._env_path, 'w') as f:
            # Write non-Gemini lines first
            for line in other_lines:
                f.write(line + '\n')
            # Write Gemini keys
            for i, key in enumerate(self._gemini_keys, 1):
                f.write(f'GEMINI_KEY_{i}={key}\n')
        self._last_loaded = time.time()
        logger.info(f"Persisted {len(self._gemini_keys)} Gemini key(s) to {self._env_path}")
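The rewrite step can be sketched as a pure function over lines, which makes it easy to check without touching a real `.env`. `rewrite_env_lines` is a hypothetical name, `OTHER_SETTING` is a placeholder variable, and the sketch filters with the same pattern that `_load_from_env` uses to recognize keys.

```python
import re

# Same shape as the pattern _load_from_env matches: GEMINI_KEY= or GEMINI_KEY_<n>=
GEMINI_LINE = re.compile(r'^GEMINI_KEY(?:_\d+)?=')


def rewrite_env_lines(existing_lines, keys):
    """Return the new file content as a list of lines (without newlines).

    Non-Gemini, non-blank lines are kept in order; keys are appended and
    renumbered from 1.
    """
    kept = [
        line.rstrip('\n')
        for line in existing_lines
        if line.strip() and not GEMINI_LINE.match(line.strip())
    ]
    numbered = [f'GEMINI_KEY_{i}={key}' for i, key in enumerate(keys, 1)]
    return kept + numbered


before = ['# secrets\n', 'GEMINI_KEY=old\n', 'OTHER_SETTING=value\n', '\n', 'GEMINI_KEY_2=older\n']
print(rewrite_env_lines(before, ['k1', 'k2']))
```

As in the method above, blank lines are not carried over and the keys always end up at the bottom of the file.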
    def reload_from_env(self):
        """Force reload from .env (e.g., if edited externally)."""
        with self._keys_lock:
            self._load_from_env()
        logger.info(f"Reloaded {len(self._gemini_keys)} Gemini key(s) from .env")
        return len(self._gemini_keys)

# Module-level convenience: import and use anywhere
_manager = None
def get_key_manager():
    """Get the singleton KeyManager instance."""
    global _manager
    if _manager is None:
        _manager = KeyManager()
    return _manager

1637
lib/new_pipeline.py Normal file

File diff suppressed because it is too large

374
lib/organizer.py Normal file

@ -0,0 +1,374 @@
"""
RECON Library Organizer
After a document completes the pipeline (extract -> enrich -> embed),
this module classifies it by dominant domain and moves it into the
correct Domain/Subdomain/ folder with a sanitized filename.
Two modes:
1. Per-document: determine_dominant_domain() from on-disk concept JSONs
2. Bulk manifest: organize_from_manifest() using pre-built manifest JSON
Path updates trigger the existing catalogue.path_updated_at mechanism,
which sync_qdrant_paths() propagates to Qdrant payloads.
"""
import json
import logging
import os
import shutil
from collections import Counter
from .utils import sanitize_filename
logger = logging.getLogger('recon.organizer')
# ── Domain folder mapping (canonical) ───────────────────────────────────
# Keys = exact domain strings from Gemini enrichment
# Values = filesystem-safe folder names
DOMAIN_FOLDERS = {
'Agriculture & Livestock': 'Agriculture-and-Livestock',
'Civil Organization': 'Civil-Organization',
'Communications': 'Communications',
'Food Systems': 'Food-Systems',
'Foundational Skills': 'Foundational-Skills',
'Logistics': 'Logistics',
'Medical': 'Medical',
'Navigation': 'Navigation',
'Operations': 'Operations',
'Power Systems': 'Power-Systems',
'Preservation & Storage': 'Preservation-and-Storage',
'Security': 'Security',
'Shelter & Construction': 'Shelter-and-Construction',
'Technology': 'Technology',
'Tools & Equipment': 'Tools-and-Equipment',
'Vehicles': 'Vehicles',
'Water Systems': 'Water-Systems',
'Wilderness Skills': 'Wilderness-Skills',
}
def normalize_folder_name(name):
    """Normalize a domain/subdomain name to a folder-safe string.

    Examples:
        'Edible Plants & Foraging' -> 'Edible-Plants-and-Foraging'
        'emergency medicine' -> 'Emergency-Medicine'
    """
    if not name:
        return 'Uncategorized'
    name = name.strip()
    name = name.replace('&', 'and')
    words = name.split()
    titled = []
    for w in words:
        if w.lower() in ('and', 'of', 'the', 'to', 'for', 'in', 'on', 'at'):
            titled.append(w.lower())
        else:
            titled.append(w.capitalize())
    return '-'.join(titled)
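A few inputs show how the normalization behaves at its edges. The function is restated here only so the example runs on its own; the inputs are illustrative.

```python
def normalize_folder_name(name):
    """Standalone copy of the normalizer above, for demonstration."""
    if not name:
        return 'Uncategorized'
    name = name.strip().replace('&', 'and')
    titled = []
    for w in name.split():
        if w.lower() in ('and', 'of', 'the', 'to', 'for', 'in', 'on', 'at'):
            titled.append(w.lower())
        else:
            # str.capitalize() also lowercases the rest of the word
            titled.append(w.capitalize())
    return '-'.join(titled)


for raw in ('Edible Plants & Foraging', 'emergency medicine', 'EMS & Triage', None):
    print(repr(raw), '->', normalize_folder_name(raw))
```

Because `str.capitalize()` lowercases everything after the first letter, acronyms such as `EMS` come out as `Ems`. Characters like `/` are not handled either, so a name containing one would still need separate sanitizing before being used as a single folder name.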
def determine_dominant_domain(doc_hash, data_dir):
    """Determine a document's dominant domain from on-disk concept JSONs.

    Reads all /data/concepts/{hash}/window_*.json files, counts domain
    occurrences across all concepts, returns the top domain.

    Args:
        doc_hash: Document hash
        data_dir: Path to /opt/recon/data

    Returns:
        (domain, subdomain, confidence) tuple.
        domain/subdomain are strings or None.
        confidence is float 0-1 (top domain count / total concepts).
    """
    concepts_dir = os.path.join(data_dir, 'concepts', doc_hash)
    if not os.path.isdir(concepts_dir):
        return (None, None, 0.0)
    domain_counter = Counter()
    subdomain_counter = Counter()
    total_concepts = 0
    for fname in os.listdir(concepts_dir):
        if not fname.startswith('window_') or not fname.endswith('.json'):
            continue
        fpath = os.path.join(concepts_dir, fname)
        try:
            with open(fpath, 'r') as f:
                concepts = json.load(f)
        except (json.JSONDecodeError, OSError):
            continue
        if not isinstance(concepts, list):
            continue
        for concept in concepts:
            total_concepts += 1
            # domain is usually a list with one element
            dom = concept.get('domain')
            if isinstance(dom, list):
                for d in dom:
                    if isinstance(d, str):
                        domain_counter[d] += 1
            elif isinstance(dom, str):
                domain_counter[dom] += 1
            sub = concept.get('subdomain')
            if isinstance(sub, list):
                for s in sub:
                    if isinstance(s, str):
                        subdomain_counter[s] += 1
            elif isinstance(sub, str):
                subdomain_counter[sub] += 1
    if total_concepts == 0 or not domain_counter:
        return (None, None, 0.0)
    top_domains = domain_counter.most_common(2)
    dom_name = top_domains[0][0]
    dom_count = top_domains[0][1]
    confidence = dom_count / total_concepts
    # Check ambiguity: a close runner-up or a low overall share
    is_ambiguous = False
    if len(top_domains) >= 2:
        dom2_count = top_domains[1][1]
        if dom2_count >= dom_count * 0.8:
            is_ambiguous = True
    if confidence < 0.4:
        is_ambiguous = True
    if is_ambiguous:
        return (None, None, confidence)
    top_sub = subdomain_counter.most_common(1)
    sub_name = top_sub[0][0] if top_sub else None
    return (dom_name, sub_name, confidence)
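The ambiguity rule is easier to reason about when separated from the file reading. The sketch below is a hypothetical helper (`pick_dominant` is not in the module) that assumes the two thresholds seen above, a 0.8 runner-up ratio and a 0.4 confidence floor, and applies the floor whether or not a second domain exists.

```python
from collections import Counter


def pick_dominant(domain_counts, total_concepts, runner_up_ratio=0.8, min_confidence=0.4):
    """Return (domain or None, confidence) from a Counter of domain occurrences."""
    if total_concepts == 0 or not domain_counts:
        return (None, 0.0)
    top = domain_counts.most_common(2)
    name, count = top[0]
    confidence = count / total_concepts
    # Ambiguous when the top domain covers too few concepts...
    ambiguous = confidence < min_confidence
    # ...or when the runner-up is within the given ratio of the leader
    if len(top) >= 2 and top[1][1] >= count * runner_up_ratio:
        ambiguous = True
    return (None if ambiguous else name, confidence)


print(pick_dominant(Counter({'Medical': 8, 'Security': 2}), 10))
print(pick_dominant(Counter({'Medical': 10, 'Security': 9}), 19))
```

A document that is mostly one domain is classified, while a near-tie returns `None` for the domain so the caller can leave the file where it is.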
def _build_target_path(library_root, domain, subdomain, filename, doc_hash):
    """Build the target path for a document, handling domain mapping and collisions.

    Returns:
        (target_path, sanitized_filename) tuple
    """
    san_name = sanitize_filename(filename, doc_hash=doc_hash)
    if domain is None:
        # Unclassified: leave in place (don't move to Review folder for pipeline)
        return (None, san_name)
    domain_folder = DOMAIN_FOLDERS.get(domain)
    if not domain_folder:
        domain_folder = normalize_folder_name(domain)
    if subdomain:
        sub_folder = normalize_folder_name(subdomain)
    else:
        sub_folder = 'General'
    target_dir = os.path.join(library_root, domain_folder, sub_folder)
    target_path = os.path.join(target_dir, san_name)
    # Handle collision at target
    if os.path.exists(target_path):
        stem, ext = os.path.splitext(san_name)
        h6 = doc_hash[:6]
        new_name = '{} [{}]{}'.format(stem, h6, ext)
        if len(new_name) > 120:
            # 9 = length of the ' [xxxxxx]' suffix
            max_stem = 120 - len(ext) - 9
            stem = stem[:max_stem].rstrip('. -,')
            new_name = '{} [{}]{}'.format(stem, h6, ext)
        san_name = new_name
        target_path = os.path.join(target_dir, san_name)
    return (target_path, san_name)
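The collision suffix and the length cap can be checked on their own. `collision_name` is a hypothetical helper that mirrors the renaming branch above; the file names and hash are made up for the example.

```python
import os


def collision_name(san_name, doc_hash, max_len=120):
    """Append ' [<first 6 hash chars>]' before the extension, capped at max_len."""
    stem, ext = os.path.splitext(san_name)
    suffix = ' [{}]'.format(doc_hash[:6])  # 9 characters
    new_name = stem + suffix + ext
    if len(new_name) > max_len:
        max_stem = max_len - len(ext) - len(suffix)
        # Trim the stem and drop trailing punctuation left by the cut
        stem = stem[:max_stem].rstrip('. -,')
        new_name = stem + suffix + ext
    return new_name


print(collision_name('Field Manual.pdf', 'abcdef123456'))
print(len(collision_name('x' * 130 + '.pdf', 'abcdef123456')))
```

Only the stem is shortened, so the hash tag and the extension always survive truncation.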
def organize_document(doc_hash, db, config, dry_run=False):
    """Organize a single document: classify, rename, and move.

    Args:
        doc_hash: Document hash
        db: StatusDB instance
        config: RECON config dict
        dry_run: If True, don't actually move files

    Returns:
        dict with keys: hash, action, before_path, after_path, domain, subdomain, error
    """
    library_root = config['library_root']
    data_dir = config['paths']['data']
    result = {
        'hash': doc_hash,
        'action': 'skip',
        'before_path': None,
        'after_path': None,
        'domain': None,
        'subdomain': None,
        'error': None,
    }
    # Look up current path from catalogue
    conn = db._get_conn()
    row = conn.execute(
        "SELECT path, filename FROM catalogue WHERE hash = ?", (doc_hash,)
    ).fetchone()
    if not row:
        result['error'] = 'Not in catalogue'
        return result
    current_path = row['path']
    current_filename = row['filename']
    result['before_path'] = current_path
    # Verify file exists on disk
    if not dry_run and not os.path.exists(current_path):
        result['error'] = 'File not found on disk'
        return result
    # Determine domain from concept JSONs
    domain, subdomain, confidence = determine_dominant_domain(doc_hash, data_dir)
    result['domain'] = domain
    result['subdomain'] = subdomain
    if domain is None:
        result['action'] = 'skip_unclassified'
        return result
    # Build target path
    target_path, san_name = _build_target_path(
        library_root, domain, subdomain, current_filename, doc_hash
    )
    if target_path is None:
        result['action'] = 'skip_unclassified'
        return result
    result['after_path'] = target_path
    # Already at target?
    if os.path.abspath(current_path) == os.path.abspath(target_path):
        result['action'] = 'already_organized'
        # Still mark as organized
        if not dry_run:
            db.mark_organized(doc_hash)
        return result
    if dry_run:
        result['action'] = 'would_move'
        return result
    # Move the file
    try:
        target_dir = os.path.dirname(target_path)
        os.makedirs(target_dir, exist_ok=True)
        shutil.move(current_path, target_path)
        # Update catalogue (triggers path_updated_at for Qdrant sync)
        db.update_catalogue_path(doc_hash, target_path, san_name)
        db.mark_organized(doc_hash)
        result['action'] = 'moved'
        logger.info("Organized %s -> %s [%s/%s]",
                    doc_hash[:8], target_path, domain, subdomain)
    except Exception as e:
        result['action'] = 'error'
        result['error'] = str(e)
        logger.error("Failed to organize %s: %s", doc_hash[:8], e)
    return result
def organize_from_manifest(manifest_path, db, config, dry_run=False):
    """Bulk migration using a pre-built manifest JSON.

    The manifest is produced by recon_manifest_builder.py and contains
    entries with current_path, sanitized_path, sanitized_filename, hash, etc.

    Args:
        manifest_path: Path to manifest JSON file
        db: StatusDB instance
        config: RECON config dict
        dry_run: If True, don't actually move files (would-be moves are
            still counted under 'moved')

    Returns:
        dict with summary stats: total, moved, skipped, already_organized,
        errors, not_found
    """
    with open(manifest_path, 'r') as f:
        entries = json.load(f)
    stats = {
        'total': len(entries),
        'moved': 0,
        'skipped': 0,
        'already_organized': 0,
        'errors': 0,
        'not_found': 0,
    }
    for i, entry in enumerate(entries):
        doc_hash = entry['hash']
        current_path = entry['current_path']
        target_path = entry.get('sanitized_path', entry.get('proposed_path'))
        san_name = entry.get('sanitized_filename', entry.get('filename'))
        if not target_path or not san_name:
            stats['skipped'] += 1
            continue
        # Skip ambiguous entries
        if entry.get('ambiguous'):
            stats['skipped'] += 1
            continue
        # Already at target?
        if os.path.abspath(current_path) == os.path.abspath(target_path):
            stats['already_organized'] += 1
            if not dry_run:
                db.mark_organized(doc_hash)
            continue
        if dry_run:
            stats['moved'] += 1
            continue
        # Verify source exists
        if not os.path.exists(current_path):
            stats['not_found'] += 1
            logger.warning("Manifest: file not found: %s [%s]", current_path, doc_hash[:8])
            continue
        try:
            target_dir = os.path.dirname(target_path)
            os.makedirs(target_dir, exist_ok=True)
            # Check for collision at target (different file already there)
            if os.path.exists(target_path):
                stem, ext = os.path.splitext(san_name)
                h6 = doc_hash[:6]
                san_name = '{} [{}]{}'.format(stem, h6, ext)
                target_path = os.path.join(target_dir, san_name)
            shutil.move(current_path, target_path)
            # Update catalogue + mark organized
            db.update_catalogue_path(doc_hash, target_path, san_name)
            db.mark_organized(doc_hash)
            stats['moved'] += 1
        except Exception as e:
            stats['errors'] += 1
            logger.error("Manifest: failed to move %s: %s", doc_hash[:8], e)
        # Progress reporting
        if (i + 1) % 1000 == 0:
            logger.info("Manifest progress: %d / %d (moved=%d, errors=%d)",
                        i + 1, stats['total'], stats['moved'], stats['errors'])
    return stats

137
lib/peertube_collector.py Normal file

@ -0,0 +1,137 @@
"""
RECON Metrics Collector
Background daemon thread that snapshots pipeline metrics every 2 minutes
to the metrics_snapshots SQLite table. Used for time-series charts.
"""
import json
import time
import threading
import logging
logger = logging.getLogger('recon.collector')
def start_collector(stop_event=None):
    """Start the metrics collector in a daemon thread."""
    def _run():
        from .status import StatusDB
        from .utils import get_config
        import requests as req
        interval = 120  # 2 minutes
        logger.info(f"Metrics collector started (interval: {interval}s)")
        while True:
            if stop_event and stop_event.is_set():
                break
            try:
                _snapshot(StatusDB(), get_config(), req)
            except Exception as e:
                logger.error(f"Metrics snapshot failed: {e}")
            # Wait with stop check
            if stop_event:
                stop_event.wait(interval)
                if stop_event.is_set():
                    break
            else:
                time.sleep(interval)
        logger.info("Metrics collector stopped")
    t = threading.Thread(target=_run, daemon=True, name='metrics-collector')
    t.start()
    return t
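The stop-aware wait used by the collector loop is a general pattern: `threading.Event.wait(timeout)` returns as soon as the event is set, so shutdown does not have to sit out a full interval. A minimal sketch, with `run_periodically` and the toy `task` as illustrative names only:

```python
import threading


def run_periodically(task, interval, stop_event):
    """Call task() every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        try:
            task()
        except Exception as exc:
            # A failing task should not kill the loop
            print(f"task failed: {exc}")
        # Event.wait returns True if the event was set during the wait
        if stop_event.wait(interval):
            break


stop = threading.Event()
calls = []


def task():
    calls.append(1)
    if len(calls) >= 3:
        stop.set()  # request shutdown from inside the task, for the demo


worker = threading.Thread(target=run_periodically, args=(task, 0.01, stop), daemon=True)
worker.start()
worker.join(timeout=2)
print(len(calls))
```

Compared with `time.sleep`, this keeps the thread responsive to a stop request even when the interval is minutes long.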
def _snapshot(db, config, req):
    """Take a single metrics snapshot."""
    from datetime import datetime, timezone, timedelta
    conn = db._get_conn()
    ts = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:00Z')  # Truncate to the minute
    # Knowledge pipeline stats
    try:
        totals = conn.execute("""
            SELECT
                COUNT(*) as total,
                SUM(CASE WHEN status = 'complete' THEN 1 ELSE 0 END) as complete,
                SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed,
                SUM(CASE WHEN status NOT IN ('complete', 'failed') THEN 1 ELSE 0 END) as in_pipeline,
                SUM(COALESCE(concepts_extracted, 0)) as concepts,
                SUM(COALESCE(vectors_inserted, 0)) as vectors
            FROM documents
        """).fetchone()
        knowledge_data = {
            'total': totals['total'],
            'complete': totals['complete'],
            'failed': totals['failed'],
            'in_pipeline': totals['in_pipeline'],
            'concepts': totals['concepts'],
            'vectors': totals['vectors'],
        }
        conn.execute(
            "INSERT OR REPLACE INTO metrics_snapshots (timestamp, metric_type, data) VALUES (?, ?, ?)",
            (ts, 'knowledge', json.dumps(knowledge_data))
        )
        conn.commit()
    except Exception as e:
        logger.debug(f"Knowledge snapshot failed: {e}")
    # PeerTube pipeline stats (via SSH)
    try:
        import subprocess
        result = subprocess.run(
            ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=5',
             'zvx@192.168.1.170',
             'sudo -u peertube psql peertube_prod -t -A -c "SELECT state, COUNT(*) FROM video GROUP BY state;" 2>/dev/null; '
             'echo "---"; '
             'for d in staging completed transcoded failed; do '
             '  dir="/opt/bulk-import/$d"; '
             '  files=$(find -L "$dir" -type f 2>/dev/null | wc -l); '
             '  echo "$d|$files"; '
             'done'],
            capture_output=True, text=True, timeout=20
        )
        if result.returncode == 0 or result.stdout.strip():
            sections = result.stdout.split('---')
            # First section: "state|count" rows from psql
            video_states = {}
            if len(sections) > 0:
                for line in sections[0].strip().split('\n'):
                    if '|' in line:
                        parts = line.split('|')
                        if len(parts) == 2 and parts[1].isdigit():
                            video_states[parts[0]] = int(parts[1])
            # Second section: "directory|file count" rows from the shell loop
            pipeline_files = {}
            if len(sections) > 1:
                for line in sections[1].strip().split('\n'):
                    if '|' in line:
                        parts = line.split('|')
                        if len(parts) == 2:
                            pipeline_files[parts[0]] = int(parts[1]) if parts[1].isdigit() else 0
            pt_data = {
                'video_states': video_states,
                'pipeline_files': pipeline_files,
                'published': video_states.get('1', 0),
                'backlog': sum(pipeline_files.values()),
            }
            conn.execute(
                "INSERT OR REPLACE INTO metrics_snapshots (timestamp, metric_type, data) VALUES (?, ?, ?)",
                (ts, 'peertube', json.dumps(pt_data))
            )
            conn.commit()
    except Exception as e:
        logger.debug(f"PeerTube snapshot failed: {e}")
    # Prune old snapshots (> 7 days)
    try:
        # Use the same string format as stored timestamps so the text comparison is like-for-like
        cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime('%Y-%m-%dT%H:%M:00Z')
        conn.execute("DELETE FROM metrics_snapshots WHERE timestamp < ?", (cutoff,))
        conn.commit()
    except Exception:
        pass

580
lib/peertube_scraper.py Normal file

@ -0,0 +1,580 @@
"""
RECON PeerTube Scraper: video transcript ingestion.
Fetches WebVTT captions from a PeerTube instance, converts to plain text,
chunks into pages, and feeds into the standard RECON enrichment pipeline.
Output format matches lib/web_scraper.py so the enricher and embedder
process transcript content identically to web content.
"""
import hashlib
import io
import json
import os
import bisect
import re
import time
from datetime import datetime, timezone
from urllib.parse import quote
import requests
import webvtt
from .utils import get_config, setup_logging
from .status import StatusDB
from .web_scraper import chunk_text
logger = setup_logging('recon.peertube_scraper')
# Module-level stop flag — set by service thread for graceful shutdown
_stop_check = None
def set_stop_check(fn):
    """Register a callable that returns True when shutdown is requested."""
    global _stop_check
    _stop_check = fn
# Defaults (overridden by config.yaml peertube section)
DEFAULT_API_BASE = 'http://192.168.1.170'
DEFAULT_PUBLIC_URL = 'https://stream.echo6.co'
DEFAULT_FETCH_TIMEOUT = 30
DEFAULT_RATE_LIMIT_DELAY = 0.5
def _get_pt_config(config=None):
    """Get PeerTube settings from config, with defaults."""
    if config is None:
        config = get_config()
    pt = config.get('peertube', {})
    return {
        'api_base': pt.get('api_base', DEFAULT_API_BASE),
        'public_url': pt.get('public_url', DEFAULT_PUBLIC_URL),
        'fetch_timeout': pt.get('fetch_timeout', DEFAULT_FETCH_TIMEOUT),
        'rate_limit_delay': pt.get('rate_limit_delay', DEFAULT_RATE_LIMIT_DELAY),
    }
def _api_get(path, config=None, params=None):
    """Make a GET request to the PeerTube API."""
    ptc = _get_pt_config(config)
    url = f"{ptc['api_base']}{path}"
    resp = requests.get(url, params=params, timeout=ptc['fetch_timeout'])
    resp.raise_for_status()
    return resp.json()
def get_videos(channel=None, since=None, config=None):
    """
    Paginate through all published videos on the PeerTube instance.

    Args:
        channel: Filter to this channel actor_name (e.g., 'mental-outlaw')
        since: ISO date string; only return videos published after this date
        config: RECON config dict

    Returns list of video dicts with keys: uuid, name, duration,
    channel_name, channel_display, publishedAt, description.
    """
    ptc = _get_pt_config(config)
    videos = []
    start = 0
    count = 100  # PeerTube supports up to 100 per page
    while True:
        if channel:
            path = f"/api/v1/video-channels/{channel}/videos"
        else:
            path = "/api/v1/videos"
        data = _api_get(path, config, params={
            'count': count,
            'start': start,
            'sort': '-publishedAt',
        })
        total = data.get('total', 0)
        batch = data.get('data', [])
        if not batch:
            break
        for v in batch:
            published = v.get('publishedAt', '')
            # Filter by since date
            if since and published < since:
                # Videos are sorted by publishedAt desc, so once we pass
                # the since threshold, all remaining are older: stop
                return videos
            videos.append({
                'uuid': v['uuid'],
                'name': v['name'],
                'duration': v.get('duration', 0),
                'channel_name': v.get('channel', {}).get('name', ''),
                'channel_display': v.get('channel', {}).get('displayName', ''),
                'publishedAt': published,
                'description': (v.get('description') or '')[:500],
            })
        start += count
        if start >= total:
            break
        # Check for shutdown during pagination
        if _stop_check and _stop_check():
            logger.info(f"Shutdown requested during video listing — returning {len(videos)} collected so far")
            return videos
        # Rate limit pagination requests
        time.sleep(ptc['rate_limit_delay'])
    return videos
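The `start`/`count` pagination above can be sketched against a fake page source, which makes the two stop conditions (empty batch, `start >= total`) easy to verify offline. `paginate` and `fake_fetch` are hypothetical names; the response shape (`total`, `data`) is taken from how `get_videos` reads it.

```python
def paginate(fetch_page, count=100):
    """Collect all items from an offset-paginated source.

    fetch_page(start, count) is assumed to return {'total': int, 'data': list}.
    """
    items = []
    start = 0
    while True:
        page = fetch_page(start, count)
        batch = page.get('data', [])
        if not batch:
            break  # nothing more to read
        items.extend(batch)
        start += count
        if start >= page.get('total', 0):
            break  # the last page has been consumed
    return items


def fake_fetch(start, count):
    """Stand-in for an API call: serves slices of 250 integers."""
    data = list(range(250))
    return {'total': len(data), 'data': data[start:start + count]}


print(len(paginate(fake_fetch)))
```

Checking `start >= total` after each page saves one extra request that would otherwise return an empty batch.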
def get_captions(uuid, config=None):
    """Get caption list for a video. Returns list of caption dicts."""
    data = _api_get(f"/api/v1/videos/{uuid}/captions", config)
    return data.get('data', [])
def fetch_vtt(caption_path, config=None):
    """Fetch raw VTT file content from PeerTube."""
    ptc = _get_pt_config(config)
    url = f"{ptc['api_base']}{caption_path}"
    resp = requests.get(url, timeout=ptc['fetch_timeout'])
    resp.raise_for_status()
    return resp.text
def _parse_vtt_time(time_str):
    """Parse VTT timestamp string (HH:MM:SS.mmm or MM:SS.mmm) to seconds."""
    parts = time_str.split(':')
    if len(parts) == 3:
        h, m, s = parts
        return int(h) * 3600 + int(m) * 60 + float(s)
    elif len(parts) == 2:
        m, s = parts
        return int(m) * 60 + float(s)
    return 0.0
def vtt_to_text(vtt_content):
    """
    Convert WebVTT content to clean plain text with timestamp tracking.

    Strips timestamps, de-duplicates consecutive identical cues (common with
    Whisper output), removes HTML tags, and joins cues with spaces rather
    than newlines, because Whisper cues break mid-sentence.

    Returns (text, cue_timestamps) where:
    - text: clean prose string
    - cue_timestamps: list of (start_seconds, char_offset) tuples tracking
      where each VTT cue begins in the output text
    """
    buf = io.StringIO(vtt_content)
    try:
        captions = webvtt.read_buffer(buf)
    except Exception:
        # Fallback: manual regex parse if webvtt-py fails
        return _vtt_to_text_fallback(vtt_content)
    prev_text = None
    segments = []
    raw_timestamps = []  # (start_seconds, segment_index)
    for caption in captions:
        # Strip HTML tags, then collapse whitespace per segment so segment
        # lengths match the joined text and the char offsets below stay exact
        text = re.sub(r'<[^>]+>', '', caption.text)
        text = re.sub(r'\s+', ' ', text).strip()
        if not text:
            continue
        # De-duplicate consecutive identical cues
        if text == prev_text:
            continue
        prev_text = text
        start_seconds = _parse_vtt_time(caption.start)
        raw_timestamps.append((start_seconds, len(segments)))
        segments.append(text)
    # Join with spaces: VTT cues break mid-sentence
    raw = ' '.join(segments)
    # Compute char offsets for each tracked segment
    seg_offsets = []
    pos = 0
    for i, seg in enumerate(segments):
        seg_offsets.append(pos)
        pos += len(seg) + 1  # +1 for space separator
    cue_timestamps = []
    for start_secs, seg_idx in raw_timestamps:
        if seg_idx < len(seg_offsets):
            cue_timestamps.append((start_secs, seg_offsets[seg_idx]))
    return raw, cue_timestamps
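The offset bookkeeping relies on a simple invariant: when segments are joined with single spaces, each segment starts at the running sum of the previous lengths plus one separator each. A small sketch, where `segment_offsets` is a hypothetical helper and the sample cues are made up:

```python
def segment_offsets(segments):
    """Return the start offset of each segment within ' '.join(segments)."""
    offsets = []
    pos = 0
    for seg in segments:
        offsets.append(pos)
        pos += len(seg) + 1  # +1 for the joining space
    return offsets


segments = ['hello there', 'general', 'kenobi']
text = ' '.join(segments)
offsets = segment_offsets(segments)
for seg, off in zip(segments, offsets):
    # Each offset points at the first character of its segment
    print(off, repr(text[off:off + len(seg)]))
```

The invariant only holds if the joined text is not altered afterwards. Collapsing whitespace runs after joining would shift every later offset, so any normalization has to happen per segment before the offsets are computed.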
def _vtt_to_text_fallback(vtt_content):
    """Regex-based VTT parser as fallback. Returns (text, cue_timestamps)."""
    lines = vtt_content.split('\n')
    prev_text = None
    segments = []
    raw_timestamps = []
    last_time = 0.0
    for line in lines:
        line = line.strip()
        if not line or line == 'WEBVTT':
            continue
        if '-->' in line:
            # Parse start time from "00:01:23.456 --> 00:01:25.789"
            time_part = line.split('-->')[0].strip()
            last_time = _parse_vtt_time(time_part)
            continue
        if line.isdigit():
            continue
        text = re.sub(r'<[^>]+>', '', line)
        if text == prev_text:
            continue
        prev_text = text
        raw_timestamps.append((last_time, len(segments)))
        segments.append(text)
    raw = ' '.join(segments)
    raw = re.sub(r'\s+', ' ', raw).strip()
    # Compute char offsets
    seg_offsets = []
    pos = 0
    for seg in segments:
        seg_offsets.append(pos)
        pos += len(seg) + 1
    cue_timestamps = []
    for start_secs, seg_idx in raw_timestamps:
        if seg_idx < len(seg_offsets):
            cue_timestamps.append((start_secs, seg_offsets[seg_idx]))
    return raw, cue_timestamps
def _map_page_timestamps(pages, full_text, cue_timestamps):
    """
    Map page numbers to video timestamps.

    For each page, finds its approximate start position in the full text,
    then looks up the nearest VTT cue timestamp via binary search.

    Returns dict: {"page_0001": 0.0, "page_0002": 312.5, ...}
    """
    if not cue_timestamps:
        return {}
    offsets = [ct[1] for ct in cue_timestamps]
    times = [ct[0] for ct in cue_timestamps]
    page_ts = {}
    search_start = 0
    for i, page_text in enumerate(pages):
        page_name = f"page_{i+1:04d}"
        # Find where this page starts in the full text
        snippet = page_text[:200].strip()
        pos = full_text.find(snippet, search_start)
        if pos < 0:
            pos = search_start  # fallback
        # Binary search for nearest cue at or before this position
        idx = bisect.bisect_right(offsets, pos) - 1
        if idx < 0:
            idx = 0
        page_ts[page_name] = round(times[idx], 1)
        search_start = pos + len(snippet)
    return page_ts
def _content_hash(text):
    """MD5 hash of text content (same as web_scraper)."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()
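The "nearest cue at or before this position" lookup is a standard use of `bisect.bisect_right` on a sorted list of offsets. A minimal sketch, with `timestamp_at` as a hypothetical helper and invented cue data:

```python
import bisect


def timestamp_at(char_pos, cue_timestamps):
    """Return the start time of the last cue beginning at or before char_pos.

    cue_timestamps is a list of (start_seconds, char_offset) sorted by offset.
    """
    if not cue_timestamps:
        return None
    offsets = [off for _, off in cue_timestamps]
    # bisect_right gives the insertion point after any equal offsets, so
    # subtracting 1 lands on the cue at or before char_pos
    idx = max(bisect.bisect_right(offsets, char_pos) - 1, 0)
    return cue_timestamps[idx][0]


cues = [(0.0, 0), (4.2, 12), (9.8, 20)]
for pos in (0, 15, 20, 500):
    print(pos, timestamp_at(pos, cues))
```

Using `bisect_right` instead of `bisect_left` means a position that coincides exactly with a cue's offset maps to that cue rather than the one before it.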
def ingest_video(uuid, video_meta, config=None):
    """
    Ingest a single PeerTube video transcript.

    Fetches captions, converts VTT to text, chunks into pages,
    saves to data/text/{hash}/, and sets status to 'extracted'.

    Args:
        uuid: Video UUID
        video_meta: Dict with name, duration, channel_name, channel_display,
            publishedAt, description
        config: RECON config dict

    Returns dict with hash, status, title, page_count or None if no captions.
    """
    if config is None:
        config = get_config()
    ptc = _get_pt_config(config)
    db = StatusDB()
    # Get captions
    captions = get_captions(uuid, config)
    if not captions:
        return None
    # Prefer English caption
    caption = None
    for c in captions:
        if c.get('language', {}).get('id') == 'en':
            caption = c
            break
    if caption is None:
        caption = captions[0]
    # Fetch VTT
    vtt_content = fetch_vtt(caption['captionPath'], config)
    # Convert to plain text with timestamp tracking
    text, cue_timestamps = vtt_to_text(vtt_content)
    if not text or len(text) < 50:
        logger.warning(f"Transcript too short for {video_meta['name']} ({uuid}): {len(text)} chars")
        return None
    # Hash the text content
    doc_hash = _content_hash(text)
    # Check for duplicate
    conn = db._get_conn()
    existing = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (doc_hash,)).fetchone()
    if existing:
        doc = db.get_document(doc_hash)
        existing_status = doc['status'] if doc else existing['status']
        logger.debug(f"Duplicate transcript (hash {doc_hash[:12]}...) — {video_meta['name']}")
        return {
            'hash': doc_hash,
            'status': 'duplicate',
            'title': video_meta['name'],
            'existing_status': existing_status,
        }
    # Chunk into pages
    words_per_page = config.get('web_scraper', {}).get('words_per_page', 2000)
    pages = chunk_text(text, words_per_page)
    # Compute page-to-timestamp mapping
    page_timestamps = _map_page_timestamps(pages, text, cue_timestamps)
    # Save text files
    text_dir = os.path.join(config['paths']['text'], doc_hash)
    os.makedirs(text_dir, exist_ok=True)
    for i, page_text in enumerate(pages, 1):
        page_file = os.path.join(text_dir, f"page_{i:04d}.txt")
        with open(page_file, 'w', encoding='utf-8') as f:
            f.write(page_text)
    # Save meta.json
    video_url = f"{ptc['public_url']}/w/{uuid}"
    meta = {
        'hash': doc_hash,
        'source_type': 'transcript',
        'url': video_url,
        'title': video_meta['name'],
        'author': video_meta.get('channel_display', ''),
        'channel': video_meta.get('channel_name', ''),
        'duration': video_meta.get('duration', 0),
        'date': video_meta.get('publishedAt', ''),
        'description': video_meta.get('description', ''),
        'sitename': 'stream.echo6.co',
        'page_count': len(pages),
        'text_length': len(text),
        'page_timestamps': page_timestamps,
        'fetched_at': datetime.now(timezone.utc).isoformat(),
    }
    with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
        json.dump(meta, f, indent=2)
    # Display filename for catalogue
    display_name = re.sub(r'[^\w\s._-]', '', video_meta['name'])[:200].strip()
    if not display_name:
        display_name = uuid
    # Add to catalogue
    db.add_to_catalogue(
        doc_hash, display_name, video_url,
        len(text), 'stream.echo6.co', video_meta.get('channel_name', 'unknown')
    )
    # Queue + advance to extracted
    db.queue_document(doc_hash)
    db.update_status(doc_hash, 'extracted',
                     page_count=len(pages),
                     pages_extracted=len(pages),
                     book_title=video_meta['name'],
                     book_author=video_meta.get('channel_display', ''))
    logger.info(
        f"Ingested transcript: {video_meta['name']} ({uuid[:8]}...) "
        f"-> {doc_hash[:12]}... ({len(pages)} pages, {len(text)} chars)"
    )
    return {
        'hash': doc_hash,
        'status': 'extracted',
        'title': video_meta['name'],
        'page_count': len(pages),
        'text_length': len(text),
        'page_timestamps': page_timestamps,
        'channel': video_meta.get('channel_name', ''),
        'duration': video_meta.get('duration', 0),
        'url': video_url,
    }
def ingest_channel(channel_name, config=None, since=None):
    """
    Ingest all captioned videos from a specific channel.
    Returns summary dict.
    """
    if config is None:
        config = get_config()
    ptc = _get_pt_config(config)
    logger.info(f"Ingesting channel: {channel_name}")
    videos = get_videos(channel=channel_name, since=since, config=config)
    return _ingest_video_list(videos, config, ptc)
def ingest_all(config=None, since=None):
    """
    Ingest all captioned videos from the entire PeerTube instance.
    Returns summary dict.
    """
    if config is None:
        config = get_config()
    ptc = _get_pt_config(config)
    logger.info("Ingesting all PeerTube videos with captions")
    videos = get_videos(since=since, config=config)
    return _ingest_video_list(videos, config, ptc)
def _ingest_video_list(videos, config, ptc):
    """Process a list of videos. Shared logic for ingest_channel and ingest_all."""
    results = []
    skipped_no_captions = 0
    skipped_duplicate = 0
    failed = 0
    ingested = 0
    total_pages = 0
    total = len(videos)
    logger.info(f"Found {total} videos to check for captions")
    for i, video in enumerate(videos, 1):
        if _stop_check and _stop_check():
            logger.info(f"Shutdown requested — stopping after {i-1}/{total} videos")
            break
        uuid = video['uuid']
        try:
            result = ingest_video(uuid, video, config)
            if result is None:
                skipped_no_captions += 1
            elif result['status'] == 'duplicate':
                skipped_duplicate += 1
            else:
                ingested += 1
                total_pages += result.get('page_count', 0)
                results.append(result)
        except Exception as e:
            logger.error(f"[{i}/{total}] Failed: {video['name']} ({uuid}) — {e}")
            failed += 1
        # Check for shutdown
        if _stop_check and _stop_check():
            logger.info(f"Shutdown requested — stopping after {i}/{total} videos")
            break
        # Rate limit
        if i < total:
            time.sleep(ptc['rate_limit_delay'])
        # Progress logging every 50 videos
        if i % 50 == 0:
            logger.info(
                f"Progress: {i}/{total} checked — "
                f"{ingested} ingested, {skipped_no_captions} no captions, "
                f"{skipped_duplicate} dupes, {failed} failed"
            )
    logger.info(
        f"PeerTube ingestion complete: {ingested} ingested ({total_pages} pages), "
        f"{skipped_no_captions} no captions, {skipped_duplicate} duplicates, "
        f"{failed} failed out of {total} videos"
    )
    return {
        'results': results,
        'summary': {
            'total_checked': total,
            'ingested': ingested,
            'skipped_no_captions': skipped_no_captions,
            'skipped_duplicate': skipped_duplicate,
            'failed': failed,
            'total_pages': total_pages,
        }
    }
def get_instance_stats(config=None):
    """Get PeerTube instance statistics for the dashboard."""
    if config is None:
        config = get_config()
    db = StatusDB()
    # Total videos on instance
    try:
        data = _api_get("/api/v1/videos", config, params={'count': 1})
        total_videos = data.get('total', 0)
    except Exception:
        total_videos = 0
    # Videos ingested into RECON (from catalogue)
    conn = db._get_conn()
    ingested = conn.execute(
        "SELECT count(*) FROM catalogue WHERE source = 'stream.echo6.co'"
    ).fetchone()[0]
    # Status breakdown
    status_rows = conn.execute(
        "SELECT d.status, count(*) as cnt FROM documents d "
        "JOIN catalogue c ON d.hash = c.hash "
        "WHERE c.source = 'stream.echo6.co' "
        "GROUP BY d.status"
    ).fetchall()
    status_breakdown = {row['status']: row['cnt'] for row in status_rows}
    return {
        'total_videos': total_videos,
        'ingested': ingested,
        'status_breakdown': status_breakdown,
    }

508
lib/status.py Normal file

@ -0,0 +1,508 @@
"""
RECON Status Tracker
SQLite operations for catalogue and documents tables. WAL mode, thread-local connections.
Status flow: catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete.
Config: paths.db
"""
import os
import sqlite3
import threading
from datetime import datetime, timezone
from .utils import get_config
_local = threading.local()
class StatusDB:
    def __init__(self, db_path=None):
        if db_path is None:
            db_path = get_config()['paths']['db']
        self.db_path = db_path
        # A bare filename has no directory component; os.makedirs('') would raise
        db_dir = os.path.dirname(db_path)
        if db_dir:
            os.makedirs(db_dir, exist_ok=True)
        self._init_db()
    def _get_conn(self):
        # One connection per thread, cached on the module-level thread-local
        if not hasattr(_local, 'conn') or _local.conn is None:
            _local.conn = sqlite3.connect(self.db_path, timeout=30)
            _local.conn.row_factory = sqlite3.Row
            _local.conn.execute("PRAGMA journal_mode=WAL")
            _local.conn.execute("PRAGMA busy_timeout=5000")
        return _local.conn
def _init_db(self):
conn = self._get_conn()
conn.executescript("""
CREATE TABLE IF NOT EXISTS catalogue (
hash TEXT PRIMARY KEY,
filename TEXT NOT NULL,
path TEXT NOT NULL,
size_bytes INTEGER,
source TEXT,
category TEXT,
status TEXT DEFAULT 'catalogued',
discovered_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS documents (
hash TEXT PRIMARY KEY,
filename TEXT NOT NULL,
path TEXT,
size_bytes INTEGER,
page_count INTEGER,
book_title TEXT,
book_author TEXT,
collection TEXT DEFAULT 'survival',
status TEXT DEFAULT 'pending',
pages_extracted INTEGER DEFAULT 0,
concepts_extracted INTEGER DEFAULT 0,
vectors_inserted INTEGER DEFAULT 0,
discovered_at TEXT DEFAULT CURRENT_TIMESTAMP,
extracted_at TEXT,
enriched_at TEXT,
embedded_at TEXT,
error_message TEXT,
retry_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS intel (
id INTEGER PRIMARY KEY AUTOINCREMENT,
source TEXT,
timestamp TEXT,
region TEXT,
category TEXT,
content TEXT,
summary TEXT,
key_facts TEXT,
credibility_score REAL,
verification_status TEXT,
vector_id INTEGER,
ingested_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS metrics_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
metric_type TEXT NOT NULL,
data TEXT NOT NULL,
UNIQUE(timestamp, metric_type)
);
CREATE INDEX IF NOT EXISTS idx_catalogue_status ON catalogue(status);
CREATE INDEX IF NOT EXISTS idx_catalogue_source ON catalogue(source);
CREATE INDEX IF NOT EXISTS idx_documents_status ON documents(status);
""")
# Migration: add path_updated_at column if missing
try:
conn.execute("ALTER TABLE catalogue ADD COLUMN path_updated_at TEXT")
except sqlite3.OperationalError:
pass # "duplicate column name": column already exists
# Migration: add organized_at column to documents if missing
try:
conn.execute("ALTER TABLE documents ADD COLUMN organized_at TEXT")
except sqlite3.OperationalError:
pass # "duplicate column name": column already exists
# Stream B: file_operations + duplicate_review tables
conn.executescript("""
CREATE TABLE IF NOT EXISTS file_operations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
doc_hash TEXT NOT NULL,
operation TEXT NOT NULL,
source_path TEXT NOT NULL,
target_path TEXT NOT NULL,
source_filename TEXT NOT NULL,
target_filename TEXT NOT NULL,
original_filename TEXT,
collision_step INTEGER,
qdrant_points_updated INTEGER DEFAULT 0,
performed_at TEXT DEFAULT CURRENT_TIMESTAMP,
reversed_at TEXT,
notes TEXT
);
CREATE INDEX IF NOT EXISTS idx_fileops_hash ON file_operations(doc_hash);
CREATE TABLE IF NOT EXISTS duplicate_review (
id INTEGER PRIMARY KEY AUTOINCREMENT,
doc_hash TEXT NOT NULL,
original_filename TEXT NOT NULL,
sanitized_filename TEXT NOT NULL,
collision_with_hash TEXT,
collision_path TEXT,
duplicate_path TEXT NOT NULL,
domain TEXT,
subdomain TEXT,
book_author TEXT,
book_title TEXT,
status TEXT DEFAULT 'pending',
resolution TEXT,
discovered_at TEXT DEFAULT CURRENT_TIMESTAMP,
resolved_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_dupreview_status ON duplicate_review(status);
""")
conn.commit()
def add_to_catalogue(self, file_hash, filename, path, size_bytes, source, category):
conn = self._get_conn()
conn.execute(
"""INSERT INTO catalogue (hash, filename, path, size_bytes, source, category)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(hash) DO UPDATE SET
path = excluded.path,
filename = excluded.filename,
source = excluded.source,
category = excluded.category,
path_updated_at = CASE
WHEN catalogue.path != excluded.path THEN CURRENT_TIMESTAMP
ELSE catalogue.path_updated_at
END""",
(file_hash, filename, path, size_bytes, source, category)
)
conn.commit()
def queue_document(self, file_hash):
conn = self._get_conn()
row = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (file_hash,)).fetchone()
if not row:
return False
conn.execute("UPDATE catalogue SET status = 'queued' WHERE hash = ?", (file_hash,))
conn.execute(
"""INSERT INTO documents (hash, filename, path, size_bytes, status)
VALUES (?, ?, ?, ?, 'queued')
ON CONFLICT(hash) DO UPDATE SET
path = excluded.path,
filename = excluded.filename""",
(row['hash'], row['filename'], row['path'], row['size_bytes'])
)
conn.commit()
return True
def update_status(self, file_hash, status, **kwargs):
conn = self._get_conn()
sets = ["status = ?"]
vals = [status]
ts_field = {
'extracted': 'extracted_at',
'enriched': 'enriched_at',
'complete': 'embedded_at',
}.get(status)
if ts_field:
sets.append(f"{ts_field} = ?")
vals.append(datetime.now(timezone.utc).isoformat())
# kwargs keys are interpolated as column names below, so pass only trusted keys
for k, v in kwargs.items():
sets.append(f"{k} = ?")
vals.append(v)
vals.append(file_hash)
conn.execute(f"UPDATE documents SET {', '.join(sets)} WHERE hash = ?", vals)
conn.commit()
def get_by_status(self, status, limit=None):
conn = self._get_conn()
q = "SELECT * FROM documents WHERE status = ? ORDER BY discovered_at"
if limit:
q += f" LIMIT {int(limit)}"
return [dict(r) for r in conn.execute(q, (status,)).fetchall()]
def get_catalogued(self, source=None, category=None, limit=None):
conn = self._get_conn()
q = "SELECT * FROM catalogue WHERE status = 'catalogued'"
params = []
if source:
q += " AND source = ?"
params.append(source)
if category:
q += " AND category = ?"
params.append(category)
q += " ORDER BY discovered_at"
if limit:
q += f" LIMIT {int(limit)}"
return [dict(r) for r in conn.execute(q, params).fetchall()]
def get_document(self, file_hash):
conn = self._get_conn()
row = conn.execute("SELECT * FROM documents WHERE hash = ?", (file_hash,)).fetchone()
return dict(row) if row else None
def get_status_counts(self):
conn = self._get_conn()
cat_counts = {}
for row in conn.execute("SELECT status, COUNT(*) as cnt FROM catalogue GROUP BY status"):
cat_counts[row['status']] = row['cnt']
doc_counts = {}
for row in conn.execute("SELECT status, COUNT(*) as cnt FROM documents GROUP BY status"):
doc_counts[row['status']] = row['cnt']
return {'catalogue': cat_counts, 'documents': doc_counts}
def get_failures(self):
conn = self._get_conn()
return [dict(r) for r in conn.execute(
"SELECT * FROM documents WHERE status = 'failed' ORDER BY discovered_at"
).fetchall()]
def mark_failed(self, file_hash, error_msg):
conn = self._get_conn()
conn.execute(
"UPDATE documents SET status = 'failed', error_message = ? WHERE hash = ?",
(str(error_msg)[:1000], file_hash)
)
conn.commit()
def increment_retry(self, file_hash):
conn = self._get_conn()
conn.execute(
"UPDATE documents SET retry_count = retry_count + 1, status = 'queued', error_message = NULL WHERE hash = ?",
(file_hash,)
)
conn.commit()
def get_sources(self):
conn = self._get_conn()
return [r[0] for r in conn.execute(
"SELECT DISTINCT source FROM catalogue ORDER BY source"
).fetchall()]
def get_categories(self, source=None):
conn = self._get_conn()
if source:
return [r[0] for r in conn.execute(
"SELECT DISTINCT category FROM catalogue WHERE source = ? ORDER BY category", (source,)
).fetchall()]
return [r[0] for r in conn.execute(
"SELECT DISTINCT category FROM catalogue ORDER BY category"
).fetchall()]
def get_all_documents(self, status=None, source=None, category=None, limit=None, offset=None):
conn = self._get_conn()
q = """SELECT d.*, c.source, c.category FROM documents d
LEFT JOIN catalogue c ON d.hash = c.hash WHERE 1=1"""
params = []
if status:
q += " AND d.status = ?"
params.append(status)
if source:
q += " AND c.source = ?"
params.append(source)
if category:
q += " AND c.category = ?"
params.append(category)
q += " ORDER BY d.discovered_at DESC"
if limit:
q += f" LIMIT {int(limit)}"
if offset:
q += f" OFFSET {int(offset)}"
return [dict(r) for r in conn.execute(q, params).fetchall()]
def count_documents(self, source=None, category=None):
"""Count documents matching optional source/category filters."""
conn = self._get_conn()
q = """SELECT COUNT(*) FROM documents d
LEFT JOIN catalogue c ON d.hash = c.hash WHERE 1=1"""
params = []
if source:
q += " AND c.source = ?"
params.append(source)
if category:
q += " AND c.category = ?"
params.append(category)
return conn.execute(q, params).fetchone()[0]
def catalogue_count(self):
conn = self._get_conn()
return conn.execute("SELECT COUNT(*) FROM catalogue").fetchone()[0]
def source_breakdown(self):
conn = self._get_conn()
return [dict(r) for r in conn.execute(
"SELECT source, COUNT(*) as count, SUM(size_bytes) as total_bytes FROM catalogue GROUP BY source ORDER BY count DESC"
).fetchall()]
def category_breakdown(self, source=None):
conn = self._get_conn()
if source:
return [dict(r) for r in conn.execute(
"SELECT category, COUNT(*) as count FROM catalogue WHERE source = ? GROUP BY category ORDER BY count DESC",
(source,)
).fetchall()]
return [dict(r) for r in conn.execute(
"SELECT source, category, COUNT(*) as count FROM catalogue GROUP BY source, category ORDER BY source, count DESC"
).fetchall()]
def get_path_updates(self):
"""Get catalogue entries where path was updated since last sync."""
conn = self._get_conn()
return [dict(r) for r in conn.execute(
"SELECT hash, filename, path, source, category FROM catalogue "
"WHERE path_updated_at IS NOT NULL"
).fetchall()]
def clear_path_update(self, file_hash):
"""Clear path_updated_at flag after Qdrant sync."""
conn = self._get_conn()
conn.execute(
"UPDATE catalogue SET path_updated_at = NULL WHERE hash = ?",
(file_hash,)
)
conn.commit()
def sync_document_path(self, file_hash, path, filename):
"""Update path and filename in documents table."""
conn = self._get_conn()
conn.execute(
"UPDATE documents SET path = ?, filename = ? WHERE hash = ?",
(path, filename, file_hash)
)
conn.commit()
def status_breakdown(self):
conn = self._get_conn()
rows = conn.execute(
"SELECT status, COUNT(*) as count FROM catalogue GROUP BY status ORDER BY count DESC"
).fetchall()
return [dict(r) for r in rows]
def get_unorganized(self, limit=None):
"""Get completed documents that haven't been organized yet."""
conn = self._get_conn()
q = "SELECT hash, filename, path FROM documents WHERE status = 'complete' AND organized_at IS NULL ORDER BY embedded_at"
if limit:
q += " LIMIT {}".format(int(limit))
return [dict(r) for r in conn.execute(q).fetchall()]
def get_ingest_pending(self, ingest_dir, limit=50):
"""Get completed docs in _ingest/ that haven't been organized."""
conn = self._get_conn()
pattern = ingest_dir + '%'
return [dict(r) for r in conn.execute(
"SELECT hash, filename, path FROM documents "
"WHERE status = 'complete' AND organized_at IS NULL AND path LIKE ? "
"ORDER BY embedded_at LIMIT ?",
(pattern, limit)
).fetchall()]
def mark_organized(self, file_hash):
"""Mark a document as organized (sets organized_at timestamp)."""
conn = self._get_conn()
conn.execute(
"UPDATE documents SET organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
(file_hash,)
)
conn.commit()
def update_catalogue_path(self, file_hash, new_path, new_filename):
"""Update catalogue path/filename and flag for Qdrant sync."""
conn = self._get_conn()
conn.execute(
"UPDATE catalogue SET path = ?, filename = ?, path_updated_at = CURRENT_TIMESTAMP WHERE hash = ?",
(new_path, new_filename, file_hash)
)
conn.commit()
# ── Stream B: File Operations ───────────────────────────────────
def log_file_operation(self, doc_hash, operation, source_path, target_path,
source_filename, target_filename, original_filename=None,
collision_step=None, qdrant_points_updated=0, notes=None):
"""Log a file move/rename operation for audit trail and rollback."""
conn = self._get_conn()
conn.execute(
"""INSERT INTO file_operations
(doc_hash, operation, source_path, target_path,
source_filename, target_filename, original_filename,
collision_step, qdrant_points_updated, notes)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(doc_hash, operation, source_path, target_path,
source_filename, target_filename, original_filename,
collision_step, qdrant_points_updated, notes)
)
conn.commit()
return conn.execute("SELECT last_insert_rowid()").fetchone()[0]
def get_file_operations(self, doc_hash=None, limit=50):
"""Get file operations, optionally filtered by doc_hash."""
conn = self._get_conn()
if doc_hash:
return [dict(r) for r in conn.execute(
"SELECT * FROM file_operations WHERE doc_hash = ? ORDER BY performed_at DESC LIMIT ?",
(doc_hash, limit)
).fetchall()]
return [dict(r) for r in conn.execute(
"SELECT * FROM file_operations WHERE reversed_at IS NULL ORDER BY performed_at DESC LIMIT ?",
(limit,)
).fetchall()]
def get_file_operation(self, op_id):
"""Get a single file operation by ID."""
conn = self._get_conn()
row = conn.execute("SELECT * FROM file_operations WHERE id = ?", (op_id,)).fetchone()
return dict(row) if row else None
def mark_operation_reversed(self, op_id):
"""Mark a file operation as reversed."""
conn = self._get_conn()
conn.execute(
"UPDATE file_operations SET reversed_at = CURRENT_TIMESTAMP WHERE id = ?",
(op_id,)
)
conn.commit()
def queue_duplicate_review(self, doc_hash, original_filename, sanitized_filename,
collision_with_hash=None, collision_path=None,
duplicate_path='', domain=None, subdomain=None,
book_author=None, book_title=None):
"""Queue a file for human duplicate review."""
conn = self._get_conn()
conn.execute(
"""INSERT INTO duplicate_review
(doc_hash, original_filename, sanitized_filename,
collision_with_hash, collision_path, duplicate_path,
domain, subdomain, book_author, book_title)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(doc_hash, original_filename, sanitized_filename,
collision_with_hash, collision_path, duplicate_path,
domain, subdomain, book_author, book_title)
)
conn.commit()
def get_duplicate_reviews(self, status='pending', limit=50):
"""Get duplicate review queue."""
conn = self._get_conn()
return [dict(r) for r in conn.execute(
"SELECT * FROM duplicate_review WHERE status = ? ORDER BY discovered_at DESC LIMIT ?",
(status, limit)
).fetchall()]
def get_pipeline_stats(self):
"""Get Stream B pipeline statistics."""
conn = self._get_conn()
ops = conn.execute(
"SELECT operation, COUNT(*) as cnt FROM file_operations WHERE reversed_at IS NULL GROUP BY operation"
).fetchall()
dupes = conn.execute(
"SELECT status, COUNT(*) as cnt FROM duplicate_review GROUP BY status"
).fetchall()
acquired = 0
ingest = 0
try:
acquired_dir = get_config().get('new_pipeline', {}).get('acquired_dir', '')
ingest_dir = get_config().get('new_pipeline', {}).get('ingest_dir', '')
if acquired_dir and os.path.isdir(acquired_dir):
acquired = len([f for f in os.listdir(acquired_dir) if f.lower().endswith('.pdf')])
if ingest_dir and os.path.isdir(ingest_dir):
ingest = len([f for f in os.listdir(ingest_dir) if f.lower().endswith('.pdf')])
except Exception:
pass
return {
'operations': {dict(r)['operation']: dict(r)['cnt'] for r in ops},
'duplicates': {dict(r)['status']: dict(r)['cnt'] for r in dupes},
'acquired_pending': acquired,
'ingest_pending': ingest,
}
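The `add_to_catalogue` upsert above only stamps `path_updated_at` when a re-catalogued file has moved. The following is a minimal sketch of that behaviour against an in-memory database. The reduced schema and the helper names `upsert` and `path_flag` are assumptions made for illustration, and the `ON CONFLICT ... DO UPDATE` syntax needs SQLite 3.24 or newer.

```python
import sqlite3

# Reduced, hypothetical schema: only the columns this upsert touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalogue ("
    " hash TEXT PRIMARY KEY,"
    " filename TEXT NOT NULL,"
    " path TEXT NOT NULL,"
    " path_updated_at TEXT)"
)

UPSERT = """
INSERT INTO catalogue (hash, filename, path)
VALUES (?, ?, ?)
ON CONFLICT(hash) DO UPDATE SET
    path = excluded.path,
    filename = excluded.filename,
    path_updated_at = CASE
        WHEN catalogue.path != excluded.path THEN CURRENT_TIMESTAMP
        ELSE catalogue.path_updated_at
    END
"""


def upsert(file_hash, filename, path):
    """Insert a row, or update it in place when the hash already exists."""
    conn.execute(UPSERT, (file_hash, filename, path))
    conn.commit()


def path_flag(file_hash):
    """Return path_updated_at for a hash (None while no move was recorded)."""
    row = conn.execute(
        "SELECT path_updated_at FROM catalogue WHERE hash = ?", (file_hash,)
    ).fetchone()
    return row[0]


upsert("abc123", "manual.pdf", "/mnt/library/_ingest/manual.pdf")
first = path_flag("abc123")       # fresh insert: flag is NULL
upsert("abc123", "manual.pdf", "/mnt/library/_ingest/manual.pdf")
unchanged = path_flag("abc123")   # same path again: flag is still NULL
upsert("abc123", "manual.pdf", "/mnt/library/Army/manual.pdf")
moved = path_flag("abc123")       # different path: flag now holds a timestamp
```

The `CASE` expression compares against the stored `catalogue.path`, because SQLite evaluates the right-hand sides of an `UPDATE` against the row as it was before the assignment.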

390
lib/utils.py Normal file

@ -0,0 +1,390 @@
"""
RECON Utilities
Content hashing (MD5), config loading (YAML), download URL generation,
source/category derivation, logging setup, filename sanitization.
Config: Loads and caches config.yaml
"""
import hashlib
import logging
import os
import re
import unicodedata
from urllib.parse import quote
import yaml
from logging.handlers import RotatingFileHandler
_config = None
def get_config():
global _config
if _config is not None:
return _config
config_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'config.yaml')
with open(config_path) as f:
_config = yaml.safe_load(f)
# Load Gemini keys from .env
env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
_config['gemini_keys'] = []
if os.path.exists(env_path):
with open(env_path) as f:
for line in f:
line = line.strip()
if line and not line.startswith('#') and '=' in line:
key, val = line.split('=', 1)
if key.startswith('GEMINI_KEY_') and val != 'PASTE_KEY_HERE':
_config['gemini_keys'].append(val)
return _config
def content_hash(filepath):
h = hashlib.md5()
with open(filepath, 'rb') as f:
for chunk in iter(lambda: f.read(8192), b''):
h.update(chunk)
return h.hexdigest()
def concept_id(doc_hash, page_num, concept_index):
raw = f"{doc_hash}:{page_num}:{concept_index}"
h = hashlib.md5(raw.encode()).hexdigest()[:15]
return int(h, 16)
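As a quick illustration of `concept_id`: the ID is deterministic for a given (document, page, index) triple, and because only the first 15 hex digits of the digest are kept, it always fits in 60 bits. The sample hash below is an arbitrary placeholder, not a real document hash from the pipeline.

```python
import hashlib


def concept_id(doc_hash, page_num, concept_index):
    # Same derivation as above: MD5 of "hash:page:index", first 15 hex digits.
    raw = f"{doc_hash}:{page_num}:{concept_index}"
    return int(hashlib.md5(raw.encode()).hexdigest()[:15], 16)


sample_hash = "d41d8cd98f00b204e9800998ecf8427e"  # placeholder value
a = concept_id(sample_hash, 3, 0)
b = concept_id(sample_hash, 3, 0)   # same inputs, same ID
c = concept_id(sample_hash, 3, 1)   # next concept on the same page
```

Re-running enrichment on the same page therefore regenerates the same IDs, which is what lets a later insert overwrite an earlier point rather than duplicate it, assuming the vector store keys points by this integer.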
def setup_logging(name='recon'):
config = get_config()
log_dir = config['paths']['logs']
os.makedirs(log_dir, exist_ok=True)
os.makedirs(os.path.join(log_dir, 'errors'), exist_ok=True)
logger = logging.getLogger(name)
if logger.handlers:
return logger
logger.setLevel(logging.DEBUG)
fmt = logging.Formatter('%(asctime)s [%(levelname)s] %(name)s: %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
fh = RotatingFileHandler(os.path.join(log_dir, 'recon.log'), maxBytes=10*1024*1024, backupCount=5)
fh.setLevel(logging.DEBUG)
fh.setFormatter(fmt)
logger.addHandler(fh)
eh = RotatingFileHandler(os.path.join(log_dir, 'errors', 'errors.log'), maxBytes=5*1024*1024, backupCount=3)
eh.setLevel(logging.ERROR)
eh.setFormatter(fmt)
logger.addHandler(eh)
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(fmt)
logger.addHandler(ch)
return logger
def derive_source_and_category(filepath, library_root):
rel = os.path.relpath(filepath, library_root)
parts = rel.split(os.sep)
source = parts[0] if parts else 'unknown'
category = parts[1] if len(parts) > 2 else source
return source, category
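A short sketch of how `derive_source_and_category` reads the library layout. The directory and file names below are made up for illustration. A file directly inside a source folder has no category level, so the source name is reused.

```python
import os


def derive_source_and_category(filepath, library_root):
    rel = os.path.relpath(filepath, library_root)
    parts = rel.split(os.sep)
    source = parts[0] if parts else 'unknown'
    # A category exists only when there is a directory between source and file.
    category = parts[1] if len(parts) > 2 else source
    return source, category


root = os.path.join(os.sep, "mnt", "library")
nested = derive_source_and_category(
    os.path.join(root, "Army", "Field Manuals", "FM 23-10.pdf"), root
)
flat = derive_source_and_category(os.path.join(root, "Army", "FM 23-10.pdf"), root)
```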
def clean_filename_to_title(filename):
"""Convert a PDF filename into a human-readable title."""
# Strip extension
name = os.path.splitext(filename)[0]
# Remove common PDF download suffixes (with or without parens)
name = re.sub(r'[\s_]*\(?\s*PDFDrive\s*\)?\s*_?', '', name, flags=re.IGNORECASE)
name = re.sub(r'[\s_]*\(?\s*z-lib\.org\s*\)?\s*_?', '', name, flags=re.IGNORECASE)
# Handle military manual prefixes: FM_23_10 -> FM 23-10, ATP_3_21 -> ATP 3-21
name = re.sub(
r'\b(FM|ATP|TC|TM|AR|STP|GTA|ATTP|FMFRP|ADP|ADRP)[-_](\d+)[-_](\d+)',
lambda m: f"{m.group(1)} {m.group(2)}-{m.group(3)}",
name
)
# Fix common abbreviations: U_S -> U.S., etc.
name = re.sub(r'(?<![A-Za-z])U[_\s]S(?=[_\s]|$)', 'U.S.', name)
# Replace underscores and hyphens with spaces (but not in manual numbers like FM 23-10)
name = re.sub(r'(?<!\d)[-_](?!\d)', ' ', name)
name = name.replace('_', ' ')
# Move a bracketed year like [1990] to a trailing (1990) suffix
year_match = re.search(r'\[(\d{4})\]', name)
year_suffix = f" ({year_match.group(1)})" if year_match else ''
name = re.sub(r'\s*\[\d{4}\]\s*', ' ', name)
# Collapse multiple spaces
name = re.sub(r'\s+', ' ', name).strip()
# Title-case, but preserve uppercase military abbreviations
words = name.split()
titled = []
for w in words:
if w.isupper() and len(w) >= 2:
titled.append(w)
elif re.match(r'^\d', w):
titled.append(w)
else:
titled.append(w.capitalize() if w.islower() else w)
name = ' '.join(titled) + year_suffix
name = name.strip()
if len(name) < 3:
return os.path.splitext(filename)[0]
return name
# ── Mojibake fix table ──────────────────────────────────────────────
_MOJIBAKE = {
'\u00e2\u0080\u0099': "'", # U+2019 mis-decoded → ' (right single quote)
'\u00e2\u0080\u0098': "'", # U+2018 mis-decoded → ' (left single quote)
'\u00e2\u0080\u009c': '"', # U+201C mis-decoded → " (left double quote)
'\u00e2\u0080\u009d': '"', # U+201D mis-decoded → " (right double quote)
'\u00e2\u0080\u0093': '-', # U+2013 mis-decoded → - (en dash)
'\u00e2\u0080\u0094': '-', # U+2014 mis-decoded → - (em dash)
'\u00e2\u0080\u00a6': '...', # U+2026 mis-decoded → ... (ellipsis)
'\u00c3\u00a9': 'e', # é → e (e-acute)
'\u00c3\u00a8': 'e', # è → e (e-grave)
'\u00c3\u00b6': 'o', # ö → o (o-umlaut)
'\u00c3\u00bc': 'u', # ü → u (u-umlaut)
'\u00c3\u00a4': 'a', # ä → a (a-umlaut)
'\u00c3\u00b1': 'n', # ñ → n (n-tilde)
'\u00c3\u00ad': 'i', # í → i (i-acute)
'\u00c3\u00a1': 'a', # á → a (a-acute)
'\u00c3\u00ba': 'u', # ú → u (u-acute)
'\u00c3\u00b3': 'o', # ó → o (o-acute)
'\u00c2\u00ae': '', # ® → (registered)
'\u00c2\u00a9': '', # © → (copyright)
'\u00c2\u00ab': '"', # « → " (guillemet left)
'\u00c2\u00bb': '"', # » → " (guillemet right)
}
# Pre-compile: replace longer sequences first to avoid partial matches
_MOJIBAKE_PATTERN = re.compile(
'|'.join(re.escape(k) for k in sorted(_MOJIBAKE.keys(), key=len, reverse=True))
)
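The keys in `_MOJIBAKE` are what UTF-8 bytes look like after being decoded as Latin-1. The sketch below reproduces one entry and shows both the table-driven repair used here and the general round-trip repair. The sample string and the one-entry `table` are illustrative only.

```python
import re

original = "Soldier\u2019s Handbook"   # contains U+2019, a right single quote
# Simulate the damage: UTF-8 bytes read back with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")

# Table-driven repair, restricted to a single entry for illustration.
table = {"\u00e2\u0080\u0099": "'"}
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(table, key=len, reverse=True))
)
repaired = pattern.sub(lambda m: table[m.group()], garbled)

# Round-trip repair: only valid when the entire string was mis-decoded.
round_trip = garbled.encode("latin-1").decode("utf-8")
```

The table approach is lossy (it maps to ASCII) but tolerates filenames that mix damaged and intact characters, where the round trip would raise an encoding error.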
def sanitize_filename(filename, doc_hash=None):
"""Sanitize a PDF filename for cross-platform filesystem safety.
Six-phase pipeline:
1. Strip source-site metadata (Anna's Archive, PDFDrive, z-lib, torrent tags)
2. Strip embedded identifiers (ISBN, MD5 hash, z-lib hex suffix)
3. Fix character encoding (mojibake, NFKD normalization)
4. Normalize structure (military prefixes, period-separated words, underscores)
5. Clean characters (Windows-illegal, control chars, collapse whitespace)
6. Validate and truncate (120 char max, word-boundary break)
Args:
filename: Original filename (with extension)
doc_hash: Optional doc_hash to verify z-lib suffix matches
Returns:
Sanitized filename (with extension preserved)
"""
stem, ext = os.path.splitext(filename)
ext = ext.lower()
if not ext:
ext = '.pdf'
# ── Phase 1: Strip source-site metadata ─────────────────────────
# Anna's Archive pattern: Title -- Authors -- Edition -- ISBN -- Hash -- Source
segments = stem.split(' -- ')
if len(segments) >= 3:
stem = segments[0]
elif len(segments) == 2:
second = segments[1]
if re.search(r'97[89]\d{10}|[0-9a-f]{32}|(?:19|20)\d{2}|[Aa]nna', second):
stem = segments[0]
# PDFDrive tags
stem = re.sub(r'\s*\(\s*PDFDrive\s*\)\s*', ' ', stem, flags=re.IGNORECASE)
stem = re.sub(r'\s*_PDFDrive_\s*', ' ', stem, flags=re.IGNORECASE)
# z-lib tags
stem = re.sub(r'\s*\(\s*z-lib\.org\s*\)\s*', ' ', stem, flags=re.IGNORECASE)
stem = re.sub(r'\s*_z-lib\.org_\s*', ' ', stem, flags=re.IGNORECASE)
# Torrent tags in curly braces
stem = re.sub(r'\s*\{[A-Za-z0-9]+\}\s*', ' ', stem)
# ── Phase 2: Strip embedded identifiers ─────────────────────────
# ISBN-13 (with optional dashes/spaces)
stem = re.sub(r'\s*97[89][\s-]?\d[\s-]?\d{2}[\s-]?\d{5,6}[\s-]?\d\s*', ' ', stem)
# ISBN-10 with dashes
stem = re.sub(r'\s*\d[\s-]\d{2}[\s-]\d{5,6}[\s-][\dXx]\s*', ' ', stem)
# MD5 hashes (32 hex chars, standalone)
stem = re.sub(r'\s*\b[0-9a-f]{32}\b\s*', ' ', stem)
# z-lib 8-char hex suffix like _4d969c3c
if doc_hash:
# Only strip if it matches the doc_hash prefix
match = re.search(r'_([0-9a-f]{8})$', stem)
if match and doc_hash.startswith(match.group(1)):
stem = stem[:match.start()]
else:
# Strip any trailing 8-char hex suffix after underscore
stem = re.sub(r'_[0-9a-f]{8}$', '', stem)
# ── Phase 3: Fix character encoding ─────────────────────────────
# Fix known mojibake sequences
stem = _MOJIBAKE_PATTERN.sub(lambda m: _MOJIBAKE[m.group()], stem)
# Common single-char mojibake that slip through
stem = stem.replace('\u00e2\u0080', '-') # partial em/en dash mojibake
stem = stem.replace('H_', 'H. ') # Anna's Archive initial abbreviation pattern
# NFKD normalize: decompose accented chars, strip combining marks
nfkd = unicodedata.normalize('NFKD', stem)
cleaned = []
for ch in nfkd:
cat = unicodedata.category(ch)
if cat.startswith('M'): # combining mark — skip
continue
if cat.startswith('C') and ch not in (' ', '\t'): # control char — skip
continue
# Keep ASCII + common punctuation; drop CJK/Cyrillic/etc if not transliteratable
cp = ord(ch)
if cp < 128:
cleaned.append(ch)
elif cat.startswith('L') or cat.startswith('N'):
# Letter or number outside ASCII — try to keep if Latin-ish
if cp < 0x0250: # Latin Extended range
cleaned.append(ch)
# else: drop CJK, Cyrillic, etc.
elif cat.startswith('P') or cat.startswith('S'):
# Punctuation/symbol — map to ASCII equivalent
if ch in ('\u2018', '\u2019', '\u201a', '\u0060'):
cleaned.append("'")
elif ch in ('\u201c', '\u201d', '\u201e'):
cleaned.append('"')
elif ch in ('\u2013', '\u2014', '\u2012'):
cleaned.append('-')
elif ch == '\u2026':
cleaned.append('...')
elif ch in ('\u00ab', '\u00bb'):
cleaned.append('"')
else:
cleaned.append(' ')
elif cat.startswith('Z'):
cleaned.append(' ')
stem = ''.join(cleaned)
# ── Phase 4: Normalize structure ────────────────────────────────
# Detect URL-derived filenames — skip aggressive normalization
is_url_derived = bool(re.match(r'[a-z0-9-]+\.[a-z]{2,}[_/]', stem))
if not is_url_derived:
# Military manual prefixes: FM_23_10 -> FM 23-10
stem = re.sub(
r'\b(FM|ATP|TC|TM|AR|STP|GTA|ATTP|FMFRP|ADP|ADRP)[-_](\d+)[-_](\d+)',
lambda m: '{} {}-{}'.format(m.group(1), m.group(2), m.group(3)),
stem
)
# Period-separated words (4+ periods = likely word separators, not abbreviations like U.S.)
if stem.count('.') >= 4:
stem = re.sub(r'\.(?=[A-Za-z])', ' ', stem)
# Underscores to spaces (always)
stem = stem.replace('_', ' ')
# ── Phase 5: Clean characters ───────────────────────────────────
# Remove Windows-illegal chars and control chars
stem = re.sub(r'[<>:"|?*\\\/]', '', stem)
stem = re.sub(r'[\x00-\x1f\x7f]', '', stem)
# Collapse repeated spaces and hyphens (underscores were converted in Phase 4)
stem = re.sub(r' {2,}', ' ', stem)
stem = re.sub(r'-{2,}', '-', stem)
# Strip leading/trailing dots, spaces, dashes
stem = stem.strip('. -')
# ── Phase 6: Validate and truncate ──────────────────────────────
stem = stem.strip()
if not stem or len(stem) < 2:
stem = 'untitled'
max_stem = 120 - len(ext)
if len(stem) > max_stem:
# Break at word boundary
truncated = stem[:max_stem]
last_space = truncated.rfind(' ')
if last_space > max_stem * 0.6:
truncated = truncated[:last_space]
stem = truncated.rstrip('. -,')
return stem + ext
def filename_needs_sanitization(filename, doc_hash=None):
"""Return True if sanitize_filename() would change the filename."""
return sanitize_filename(filename, doc_hash) != filename
def resolve_collisions(entries):
"""Resolve filename collisions after sanitization.
Args:
entries: list of dicts, each with 'sanitized_filename', 'proposed_dir', 'hash'
Returns:
Tuple of (entries, collision_count). Each entry gains a 'collision' key
(True/False); colliding entries after the first also get a hash suffix
appended to 'sanitized_filename'.
"""
from collections import defaultdict
# Group by (dir, lowercase filename) to find collisions
groups = defaultdict(list)
for i, e in enumerate(entries):
key = (e['proposed_dir'], e['sanitized_filename'].lower())
groups[key].append(i)
collision_count = 0
for key, indices in groups.items():
if len(indices) <= 1:
for i in indices:
entries[i]['collision'] = False
continue
# Collision — add hash suffix to all but the first
collision_count += len(indices) - 1
entries[indices[0]]['collision'] = False
for i in indices[1:]:
e = entries[i]
h6 = e['hash'][:6]
stem, ext = os.path.splitext(e['sanitized_filename'])
new_name = '{} [{}]{}'.format(stem, h6, ext)
# Re-check length
if len(new_name) > 120:
max_stem = 120 - len(ext) - 9 # 9 = len(' [XXXXXX]')
stem = stem[:max_stem].rstrip('. -,')
new_name = '{} [{}]{}'.format(stem, h6, ext)
e['sanitized_filename'] = new_name
e['collision'] = True
return entries, collision_count
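The suffixing rule in `resolve_collisions` can be seen in isolation with a condensed sketch. `suffix_collisions` is a hypothetical stand-in that omits the `collision` flag, the counter, and the 120-character re-check, and the entries are invented sample data.

```python
import os
from collections import defaultdict


def suffix_collisions(entries):
    """Append ' [hash6]' to every colliding filename except the first in its group."""
    groups = defaultdict(list)
    for i, e in enumerate(entries):
        # Case-insensitive match within the same target directory.
        groups[(e['proposed_dir'], e['sanitized_filename'].lower())].append(i)
    for indices in groups.values():
        for i in indices[1:]:
            stem, ext = os.path.splitext(entries[i]['sanitized_filename'])
            entries[i]['sanitized_filename'] = '{} [{}]{}'.format(
                stem, entries[i]['hash'][:6], ext
            )
    return entries


entries = suffix_collisions([
    {'proposed_dir': 'Army', 'sanitized_filename': 'Survival.pdf', 'hash': 'aaaaaa111111'},
    {'proposed_dir': 'Army', 'sanitized_filename': 'survival.pdf', 'hash': 'bbbbbb222222'},
    {'proposed_dir': 'Navy', 'sanitized_filename': 'Survival.pdf', 'hash': 'cccccc333333'},
])
names = [e['sanitized_filename'] for e in entries]
```

Comparing lowercase names matters on case-insensitive filesystems, where `Survival.pdf` and `survival.pdf` would otherwise overwrite each other.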
def generate_download_url(filepath, library_root='/mnt/library', base_url='https://files.echo6.co'):
"""Generate a download/source URL from a document path.
For web URLs (http/https): returns the URL directly -- it's already a link.
For file paths: converts to files.echo6.co URL.
"""
if not filepath:
return ''
# Web content -- path IS the source URL
if filepath.startswith(('http://', 'https://')):
return filepath
# File content -- convert to files.echo6.co URL
rel = os.path.relpath(filepath, library_root)
parts = rel.split(os.sep)
encoded = '/'.join(quote(p) for p in parts)
return f"{base_url}/{encoded}"
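A usage sketch for `generate_download_url`, assuming POSIX-style paths under the default `/mnt/library` root. The sample file name and the `example.org` URL are placeholders. Each path segment is percent-encoded separately, so the `/` separators survive while spaces and brackets are escaped.

```python
import os
from urllib.parse import quote


def generate_download_url(filepath, library_root='/mnt/library',
                          base_url='https://files.echo6.co'):
    if not filepath:
        return ''
    # Web content: the stored path is already the source URL.
    if filepath.startswith(('http://', 'https://')):
        return filepath
    rel = os.path.relpath(filepath, library_root)
    encoded = '/'.join(quote(p) for p in rel.split(os.sep))
    return f"{base_url}/{encoded}"


file_url = generate_download_url('/mnt/library/Army/FM 23-10 [1990].pdf')
web_url = generate_download_url('https://example.org/article')
empty = generate_download_url('')
```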

324
lib/web_scraper.py Normal file

@ -0,0 +1,324 @@
"""
RECON Web Scraper: URL-based content ingestion.
Fetches web pages, extracts clean text, chunks into pages,
and feeds into the standard RECON enrichment pipeline.
Output format matches lib/extractor.py so the enricher
processes web content identically to PDF content.
"""
import hashlib
import json
import os
import re
import time
from datetime import datetime, timezone
from urllib.parse import urlparse, unquote
import requests
import trafilatura
from .utils import get_config, setup_logging
from .status import StatusDB
logger = setup_logging('recon.web_scraper')
# Defaults (overridden by config.yaml web_scraper section)
DEFAULT_WORDS_PER_PAGE = 2000
DEFAULT_FETCH_TIMEOUT = 30
DEFAULT_USER_AGENT = 'RECON/1.0 (Knowledge Extraction Pipeline)'
DEFAULT_RATE_LIMIT_DELAY = 1.0
def _get_scraper_config(config=None):
"""Get web scraper settings from config, with defaults."""
if config is None:
config = get_config()
ws = config.get('web_scraper', {})
return {
'words_per_page': ws.get('words_per_page', DEFAULT_WORDS_PER_PAGE),
'fetch_timeout': ws.get('fetch_timeout', DEFAULT_FETCH_TIMEOUT),
'user_agent': ws.get('user_agent', DEFAULT_USER_AGENT),
'rate_limit_delay': ws.get('rate_limit_delay', DEFAULT_RATE_LIMIT_DELAY),
'max_batch_size': ws.get('max_batch_size', 50),
}
def fetch_url(url, config=None):
"""
Fetch a URL and extract clean text + metadata using trafilatura.
Returns dict with: text, title, author, date, description, url,
sitename, raw_length, text_length.
Raises ValueError if fetch or extraction fails.
"""
sc = _get_scraper_config(config)
logger.info(f"Fetching URL: {url}")
try:
response = requests.get(
url,
headers={'User-Agent': sc['user_agent']},
timeout=sc['fetch_timeout'],
allow_redirects=True
)
response.raise_for_status()
except requests.RequestException as e:
raise ValueError(f"Failed to fetch {url}: {e}")
raw_html = response.text
if not raw_html or len(raw_html) < 100:
raise ValueError(f"Empty or too-short response from {url}")
text = trafilatura.extract(
raw_html,
include_comments=False,
include_tables=True,
include_links=False,
include_images=False,
favor_precision=False,
deduplicate=True
)
if not text or len(text.strip()) < 50:
raise ValueError(f"No meaningful text extracted from {url}")
metadata = trafilatura.extract_metadata(raw_html)
result = {
'text': text.strip(),
'title': '',
'author': '',
'date': '',
'description': '',
'url': url,
'sitename': '',
'raw_length': len(raw_html),
'text_length': len(text),
}
if metadata:
result['title'] = metadata.title or ''
result['author'] = metadata.author or ''
result['date'] = metadata.date or ''
result['description'] = metadata.description or ''
result['sitename'] = metadata.sitename or ''
if not result['title']:
result['title'] = _title_from_url(url)
logger.info(f"Extracted {result['text_length']} chars from {url}: \"{result['title']}\"")
return result
def _title_from_url(url):
"""Generate a readable title from a URL as fallback."""
parsed = urlparse(url)
path = unquote(parsed.path).strip('/')
if path:
segment = path.split('/')[-1]
segment = re.sub(r'[-_]', ' ', segment)
segment = re.sub(r'\.\w+$', '', segment)
return segment.title() if segment else parsed.netloc
return parsed.netloc
def chunk_text(text, words_per_page=DEFAULT_WORDS_PER_PAGE):
"""
Split text into page-sized chunks for enrichment windows.
Breaks at paragraph boundaries. Each chunk is ~words_per_page words.
Returns list of strings (each is one "page").
"""
paragraphs = text.split('\n\n')
pages = []
current_page = []
current_words = 0
for para in paragraphs:
para = para.strip()
if not para:
continue
para_words = len(para.split())
if para_words > words_per_page * 1.5:
if current_page:
pages.append('\n\n'.join(current_page))
current_page = []
current_words = 0
sentences = re.split(r'(?<=[.!?])\s+', para)
for sentence in sentences:
sentence_words = len(sentence.split())
if current_words + sentence_words > words_per_page and current_page:
pages.append('\n\n'.join(current_page))
current_page = [sentence]
current_words = sentence_words
else:
current_page.append(sentence)
current_words += sentence_words
elif current_words + para_words > words_per_page and current_page:
pages.append('\n\n'.join(current_page))
current_page = [para]
current_words = para_words
else:
current_page.append(para)
current_words += para_words
if current_page:
pages.append('\n\n'.join(current_page))
if not pages:
pages = [text]
return pages
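The core of `chunk_text` is a greedy packing of paragraphs into pages. The sketch below shows only that part. `pack_paragraphs` is a hypothetical simplification that leaves out the sentence-level splitting of oversized paragraphs, and the sample text uses a tiny page size so the page break is easy to see.

```python
def pack_paragraphs(text, words_per_page):
    """Greedy packing: start a new page when the next paragraph would overflow."""
    pages, current, count = [], [], 0
    for para in (p.strip() for p in text.split('\n\n')):
        if not para:
            continue
        n = len(para.split())
        # Flush the current page only if it already holds something.
        if current and count + n > words_per_page:
            pages.append('\n\n'.join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        pages.append('\n\n'.join(current))
    return pages or [text]


sample = "one two three\n\nfour five six\n\nseven eight nine"
pages = pack_paragraphs(sample, words_per_page=6)
```

Because the comparison is strict, a page may reach exactly `words_per_page` words, and a single paragraph longer than the limit still becomes its own page rather than being dropped.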
def _content_hash(text):
    """MD5 hash of text content; same hash type as the PDF pipeline."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()
def _display_filename(url):
    """Create a display filename from a URL."""
    parsed = urlparse(url)
    name = f"{parsed.netloc}_{parsed.path.strip('/').replace('/', '_')}"
    name = re.sub(r'[^\w._-]', '_', name)[:200]
    if not name.endswith('.html'):
        name += '.html'
    return name
def ingest_url(url, category='Web', source='web', config=None):
"""
Full URL ingestion: fetch -> extract -> chunk -> save -> catalogue -> queue as extracted.
Returns dict with hash, title, page_count, status.
Raises ValueError on failure.
"""
if config is None:
config = get_config()
sc = _get_scraper_config(config)
db = StatusDB()
# Fetch and extract
extracted = fetch_url(url, config)
# Hash the extracted text content
doc_hash = _content_hash(extracted['text'])
# Check for duplicate in catalogue
conn = db._get_conn()
existing = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (doc_hash,)).fetchone()
if existing:
# Also check documents table for status
doc = db.get_document(doc_hash)
existing_status = doc['status'] if doc else existing['status']
logger.info(f"Duplicate content (hash {doc_hash[:12]}...) — already exists as '{existing['filename']}'")
return {
'hash': doc_hash,
'status': 'duplicate',
'title': doc.get('book_title', '') if doc else existing['filename'],
'existing_status': existing_status,
}
# Chunk into pages
pages = chunk_text(extracted['text'], sc['words_per_page'])
# Save text files in extractor-compatible format:
# data/text/{hash}/page_0001.txt, page_0002.txt, ... + meta.json
text_dir = os.path.join(config['paths']['text'], doc_hash)
os.makedirs(text_dir, exist_ok=True)
for i, page_text in enumerate(pages, 1):
page_file = os.path.join(text_dir, f"page_{i:04d}.txt")
with open(page_file, 'w', encoding='utf-8') as f:
f.write(page_text)
meta = {
'hash': doc_hash,
'source_type': 'web',
'url': url,
'title': extracted['title'],
'author': extracted['author'],
'date': extracted['date'],
'description': extracted['description'],
'sitename': extracted['sitename'],
'page_count': len(pages),
'text_length': extracted['text_length'],
'fetched_at': datetime.now(timezone.utc).isoformat(),
}
with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
json.dump(meta, f, indent=2)
display_name = _display_filename(url)
# Add to catalogue
db.add_to_catalogue(doc_hash, display_name, url, extracted['text_length'], source, category)
# Queue (creates documents entry as 'queued')
db.queue_document(doc_hash)
# Advance directly to 'extracted' — text is already saved, skip PDF extraction
db.update_status(doc_hash, 'extracted',
page_count=len(pages),
pages_extracted=len(pages),
book_title=extracted['title'],
book_author=extracted['author'] or None)
logger.info(f"Ingested URL: {url} -> {doc_hash[:12]}... ({len(pages)} pages, \"{extracted['title']}\")")
return {
'hash': doc_hash,
'status': 'extracted',
'title': extracted['title'],
'author': extracted['author'],
'page_count': len(pages),
'url': url,
}
def ingest_urls(urls, category='Web', source='web', delay=None, config=None):
    """
    Batch URL ingestion with rate limiting.
    Blank lines and lines starting with '#' are skipped.
    Returns list of result dicts (one per processed URL).
    """
    if config is None:
        config = get_config()
    if delay is None:
        delay = _get_scraper_config(config)['rate_limit_delay']
    results = []
    total = len(urls)
    for i, url in enumerate(urls, 1):
        url = url.strip()
        if not url or url.startswith('#'):
            continue
        logger.info(f"[{i}/{total}] Processing: {url}")
        try:
            result = ingest_url(url, category=category, source=source, config=config)
            result['url'] = url
            results.append(result)
        except Exception as e:
            logger.error(f"[{i}/{total}] Failed: {url}: {e}")
            results.append({
                'url': url,
                'status': 'failed',
                'error': str(e),
            })
        if i < total and delay > 0:
            time.sleep(delay)
    succeeded = sum(1 for r in results if r.get('status') not in ('failed', 'duplicate'))
    failed = sum(1 for r in results if r.get('status') == 'failed')
    dupes = sum(1 for r in results if r.get('status') == 'duplicate')
    logger.info(f"Batch complete: {succeeded} new, {dupes} duplicates, {failed} failed out of {total}")
    return results

72
migrate_paths.py Normal file

@ -0,0 +1,72 @@
#!/usr/bin/env python3
"""One-time migration: rescan library to detect moved files and sync paths to Qdrant.
This rescans all PDFs in the library. The upsert in add_to_catalogue() will
detect any files whose paths changed since they were originally catalogued,
and flag them with path_updated_at. Then sync_qdrant_paths() propagates
those path changes to Qdrant download_url payloads.
Usage: cd /opt/recon && source venv/bin/activate && python3 migrate_paths.py [--dry-run]
"""
import sys
import os
sys.path.insert(0, '/opt/recon')
from recon import scan_library, sync_qdrant_paths
from lib.status import StatusDB
from lib.utils import setup_logging
logger = setup_logging('recon.migrate')
def main():
dry_run = '--dry-run' in sys.argv
db = StatusDB()
conn = db._get_conn()
total_cat = conn.execute("SELECT COUNT(*) FROM catalogue").fetchone()[0]
total_docs = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(f"Before: {total_cat} catalogue entries, {total_docs} documents")
# Rescan library — upsert will detect and flag path changes
print("\nScanning library (this will re-hash all files)...")
count = scan_library()
print(f"Scanned {count} PDFs")
# Check how many paths changed
updates = db.get_path_updates()
print(f"\nDetected {len(updates)} path changes")
if not updates:
print("No paths need syncing — all up to date")
return 0
# Show what changed
for row in updates[:20]:
print(f" {row['hash'][:8]} {row['filename']}")
if len(updates) > 20:
print(f" ... and {len(updates) - 20} more")
if dry_run:
print(f"\n[DRY RUN] Would sync {len(updates)} paths to Qdrant. Re-run without --dry-run to apply.")
return 0
# Sync to Qdrant
print(f"\nSyncing {len(updates)} paths to Qdrant...")
synced = sync_qdrant_paths()
print(f"Synced {synced} document paths to Qdrant")
# Verify
remaining = db.get_path_updates()
if remaining:
print(f"\nWARNING: {len(remaining)} paths still pending (Qdrant sync may have partially failed)")
else:
print("\nAll paths synced successfully")
return 0
if __name__ == '__main__':
sys.exit(main())

1502
recon.py Executable file

File diff suppressed because it is too large

69
requirements.txt Normal file

@ -0,0 +1,69 @@
annotated-types==0.7.0
anyio==4.12.1
babel==2.18.0
beautifulsoup4==4.14.3
blinker==1.9.0
certifi==2026.1.4
cffi==2.0.0
charset-normalizer==3.4.4
click==8.3.1
courlan==1.3.2
cryptography==46.0.5
dateparser==1.3.0
Flask==3.1.2
google-ai-generativelanguage==0.6.15
google-api-core==2.29.0
google-api-python-client==2.190.0
google-auth==2.48.0
google-auth-httplib2==0.3.0
google-generativeai==0.8.6
googleapis-common-protos==1.72.0
grpcio==1.78.0
grpcio-status==1.71.2
h11==0.16.0
h2==4.3.0
hpack==4.1.0
htmldate==1.9.4
httpcore==1.0.9
httplib2==0.31.2
httpx==0.28.1
hyperframe==6.1.0
idna==3.11
itsdangerous==2.2.0
Jinja2==3.1.6
jusText==3.0.2
lxml==6.0.2
lxml_html_clean==0.4.3
MarkupSafe==3.0.3
numpy==2.4.2
packaging==26.0
pillow==12.1.1
portalocker==3.2.0
proto-plus==1.27.1
protobuf==5.29.6
pyasn1==0.6.2
pyasn1_modules==0.4.2
pycparser==3.0
pydantic==2.12.5
pydantic_core==2.41.5
pyparsing==3.3.2
PyPDF2==3.0.1
pytesseract==0.3.13
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.3
qdrant-client==1.16.2
regex==2026.1.15
requests==2.32.5
rsa==4.9.1
six==1.17.0
soupsieve==2.8.3
tld==0.13.1
tqdm==4.67.3
trafilatura==2.0.0
typing-inspection==0.4.2
typing_extensions==4.15.0
tzlocal==5.3.1
uritemplate==4.2.0
urllib3==2.6.3
Werkzeug==3.1.5

67
run-pipeline-now.sh Executable file

@ -0,0 +1,67 @@
#!/bin/bash
# RECON Pipeline — Skip scan, run extract + enrich in parallel, then embed
# Scan already completed (10,162 catalogued). 6,211 extracted, 3,603 queued.
set -euo pipefail
cd /opt/recon
source venv/bin/activate
LOGDIR="logs"
mkdir -p "$LOGDIR"
TS=$(date +%Y%m%d_%H%M%S)
MAIN_LOG="$LOGDIR/pipeline_${TS}.log"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$MAIN_LOG"
}
log "=== RECON Pipeline (parallel extract+enrich) ==="
log "Skipping scan (already done). Starting extract + enrich concurrently."
# Reset any stuck docs from previous kill
sqlite3 data/recon.db "UPDATE documents SET status='queued' WHERE status='extracting';"
sqlite3 data/recon.db "UPDATE documents SET status='extracted' WHERE status='enriching';"
sqlite3 data/recon.db "UPDATE documents SET status='enriched' WHERE status='embedding';"
# Status before
log "Before:"
sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents GROUP BY status;" | while read -r line; do log " $line"; done
# Start extract and enrich in parallel
log "--- Starting Extract (4 workers) + Enrich (16 workers) ---"
python3 recon.py extract --workers 4 >> "$LOGDIR/extract_${TS}.log" 2>&1 &
EXTRACT_PID=$!
log " Extract PID: $EXTRACT_PID"
sleep 3
python3 recon.py enrich --workers 16 >> "$LOGDIR/enrich_${TS}.log" 2>&1 &
ENRICH_PID=$!
log " Enrich PID: $ENRICH_PID"
# Monitor loop — report progress every 5 minutes
while kill -0 $EXTRACT_PID 2>/dev/null || kill -0 $ENRICH_PID 2>/dev/null; do
sleep 300
STATS=$(sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents GROUP BY status;" | tr '\n' ' ')
log " Progress: $STATS"
done
log " Extract + Enrich finished"
# Second enrich pass (catch docs extracted during first enrich)
REMAINING=$(sqlite3 data/recon.db "SELECT COUNT(*) FROM documents WHERE status='extracted';")
if [ "$REMAINING" -gt 0 ]; then
log "--- Enrich pass 2: $REMAINING remaining ---"
python3 recon.py enrich --workers 16 >> "$LOGDIR/enrich_${TS}.log" 2>&1
log " Pass 2 complete"
fi
# Embed
log "--- Embed ---"
python3 recon.py embed --workers 4 >> "$LOGDIR/embed_${TS}.log" 2>&1
log " Embed complete"
log "=== Pipeline Complete ==="
python3 recon.py status 2>&1 | tee -a "$MAIN_LOG"
log "Finished: $(date)"
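The three `sqlite3` one-liners near the top of this script roll interrupted documents back to the stage before the one that was running. A minimal Python sketch of the same idea, done in a single transaction, might look like the following. The two-column `documents` schema, the `complete` status, and the `reset_stuck` helper are assumptions for the demo, not the project's actual schema or API.

```python
import sqlite3

# Map each in-progress status back to the stage that precedes it.
ROLLBACK = {
    "extracting": "queued",
    "enriching": "extracted",
    "embedding": "enriched",
}


def reset_stuck(conn):
    """Roll back in-progress rows in one transaction; return rows changed per status."""
    changed = {}
    with conn:  # commits on success, rolls back on error
        for stuck, previous in ROLLBACK.items():
            cur = conn.execute(
                "UPDATE documents SET status = ? WHERE status = ?",
                (previous, stuck),
            )
            changed[stuck] = cur.rowcount
    return changed


# Demo against an in-memory table with a minimal, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("a", "extracting"), ("b", "enriching"), ("c", "embedding"), ("d", "complete")],
)
changed = reset_stuck(conn)
statuses = dict(conn.execute("SELECT hash, status FROM documents"))
print(changed, statuses)
```

Doing all three updates in one transaction means a crash partway through leaves the table untouched rather than half reset. None of the target statuses is itself an in-progress status, so the order of the updates does not matter.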

0
scripts/__init__.py Normal file

373
scripts/aa_download.py Executable file

@ -0,0 +1,373 @@
#!/usr/bin/env python3
"""
aa_download.py: Anna's Archive bulk downloader for RECON library acquisition.
For each target book:
1. Searches Anna's Archive (BASE_AA, currently annas-archive.gl) for the title + author
2. Extracts the best PDF match (verified by author/page count)
3. Gets the MD5 from the book page
4. Attempts download from Libgen mirrors in order
5. Verifies downloaded file is a valid PDF
6. Writes full acquisition report
Usage:
python3 /opt/recon/scripts/aa_download.py [--dry-run] [--limit N]
Report output: ~/projects/recon/aa_acquisition_report.md
"""
import json
import time
import random
import hashlib
import logging
import argparse
from pathlib import Path
from datetime import datetime
import requests
from bs4 import BeautifulSoup
REPORT_PATH = Path.home() / "projects/recon/aa_acquisition_report.md"
LOG_FILE = Path("/opt/recon/logs/aa_download.log")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)
log = logging.getLogger("aa_download")
SESSION = requests.Session()
SESSION.headers.update({
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
"Accept-Language": "en-US,en;q=0.9",
})
BASE_AA = "https://annas-archive.gl"
# Download attempt order — try fastest mirrors first
LIBGEN_MIRRORS = [
"https://libgen.is/get.php?md5={md5}",
"https://libgen.rs/get.php?md5={md5}",
"https://libgen.st/get.php?md5={md5}",
"https://libgen.li/ads.php?md5={md5}",
]
# ── Target book list ──────────────────────────────────────────────────────────
TARGETS = [
# (title, author, dest_dir)
# Medical — Herbalism
("Medical Herbalism", "David Hoffmann", "Medical/Herbalism"),
("Making Plant Medicine", "Richo Cech", "Medical/Herbalism"),
("The Earthwise Herbal Volume 1", "Matthew Wood", "Medical/Herbalism"),
("The Earthwise Herbal Volume 2", "Matthew Wood", "Medical/Herbalism"),
("Herbal Antibiotics", "Stephen Buhner", "Medical/Herbalism"),
("Herbal Antivirals", "Stephen Buhner", "Medical/Herbalism"),
("The Herbal Medicine-Maker's Handbook", "James Green", "Medical/Herbalism"),
("Rosemary Gladstar's Medicinal Herbs", "Rosemary Gladstar", "Medical/Herbalism"),
# Medical — Austere
("Wilderness Medicine", "Paul Auerbach", "Medical/Austere"),
("Medicine for Mountaineering", "James Wilkerson", "Medical/Austere"),
# Medical — Veterinary
("The Chicken Health Handbook", "Gail Damerow", "Medical/Veterinary"),
("Goat Husbandry", "David Mackenzie", "Medical/Veterinary"),
# Power Systems
("The Renewable Energy Handbook", "William Kemp", "Power"),
("Homebrew Wind Power", "Dan Bartmann", "Power"),
("Wind Energy Basics", "Paul Gipe", "Power"),
("12-Volt Bible", "Brotherton", "Power"),
("Wiring a House", "Rex Cauldwell", "Power"),
# Navigation
("Wilderness Navigation", "Bob Burns", "Navigation"),
("Be Expert with Map and Compass", "Bjorn Kjellstrom", "Navigation"),
("Emergency Navigation", "David Burch", "Navigation"),
("The Natural Navigator", "Tristan Gooley", "Navigation"),
("The Essential Wilderness Navigator", "David Seidman", "Navigation"),
# Water Systems
("Rainwater Harvesting for Drylands Volume 1", "Brad Lancaster", "Water"),
("Rainwater Harvesting for Drylands Volume 2", "Brad Lancaster", "Water"),
("Rainwater Harvesting for Drylands Volume 3", "Brad Lancaster", "Water"),
("Water Storage", "Art Ludwig", "Water"),
("The Home Water Supply", "Stu Campbell", "Water"),
# Food Systems
("The Art of Fermentation", "Sandor Katz", "Food"),
("Fermented Vegetables", "Kirsten Shockey", "Food"),
("Mastering Artisan Cheesemaking", "Gianaclis Caldwell", "Food"),
("Home Cheese Making", "Ricki Carroll", "Food"),
("The Art of Natural Cheesemaking", "David Asher", "Food"),
# Permaculture
("Edible Forest Gardens Volume 1", "Dave Jacke", "Permaculture"),
("Edible Forest Gardens Volume 2", "Dave Jacke", "Permaculture"),
("Creating a Forest Garden", "Martin Crawford", "Permaculture"),
("Sepp Holzer's Permaculture", "Sepp Holzer", "Permaculture"),
("The Permaculture Handbook", "Peter Bane", "Permaculture"),
("The Market Gardener", "Jean-Martin Fortier", "Permaculture"),
# Scenario / Emergency
("SAS Survival Handbook", "John Wiseman", "Scenario"),
("Pocket Ref", "Thomas Glover", "Scenario"),
("Deep Survival", "Laurence Gonzales", "Scenario"),
# Foundational Skills
("Back to Basics", "Reader's Digest", "Skills"),
("A Pattern Language", "Christopher Alexander", "Skills"),
]
BASE_LIB = Path("/mnt/library/Acquired")
def search_aa(title, author):
"""Search Anna's Archive and return list of candidate result dicts."""
query = f"{title} {author}"
url = f"{BASE_AA}/search"
params = {"q": query, "ext": "pdf", "lang": "en"}
try:
r = SESSION.get(url, params=params, timeout=20)
r.raise_for_status()
except Exception as e:
log.warning(f"Search failed for '{title}': {e}")
return []
soup = BeautifulSoup(r.text, "html.parser")
results = []
seen_md5 = set()
for item in soup.select("a[href^='/md5/']"):
href = item.get("href", "")
md5 = href.split("/md5/")[-1].split("/")[0].split("?")[0].strip()
if not md5 or len(md5) != 32:
continue
text = item.get_text(" ", strip=True)
if not text or md5 in seen_md5:
continue
seen_md5.add(md5)
results.append({"md5": md5, "text": text, "href": href})
if len(results) >= 5:
break
return results
def get_book_details(md5):
"""Fetch the book detail page and extract useful metadata."""
url = f"{BASE_AA}/md5/{md5}"
try:
r = SESSION.get(url, timeout=20)
r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")
text = soup.get_text(" ", strip=True)
# Extract page count if visible
pages = None
for word in text.split():
if word.isdigit() and 50 < int(word) < 5000:
pages = int(word)
break
return {"pages": pages, "text": text[:500]}
except Exception as e:
log.warning(f"Detail fetch failed for md5={md5}: {e}")
return {}
def try_download(md5, dest_path):
    """Try each libgen mirror until one works. Returns True on success."""
    from urllib.parse import urljoin  # local import; resolves relative links
    for mirror_tpl in LIBGEN_MIRRORS:
        url = mirror_tpl.format(md5=md5)
        try:
            r = SESSION.get(url, timeout=60, stream=True, allow_redirects=True)
            content_type = r.headers.get("content-type", "")
            if r.status_code != 200:
                continue
            # Some mirrors return an HTML ads page before the real file
            if "text/html" in content_type:
                # Parse redirect link from ads page
                soup = BeautifulSoup(r.text, "html.parser")
                dl_link = soup.select_one("a[href*='.pdf']")
                if not dl_link:
                    dl_link = soup.select_one("a[href*='get.php']")
                if not dl_link:
                    continue
                # Resolve relative links against the mirror that served the page
                # (previously hardcoded to libgen.is for every mirror)
                actual_url = urljoin(r.url, dl_link["href"])
                r = SESSION.get(actual_url, timeout=120, stream=True)
                if r.status_code != 200:
                    continue
            # Stream to disk
            dest_path.parent.mkdir(parents=True, exist_ok=True)
            with open(dest_path, "wb") as f:
                for chunk in r.iter_content(8192):
                    f.write(chunk)
            # Verify it's a real PDF
            with open(dest_path, "rb") as f:
                header = f.read(4)
            if header == b"%PDF":
                size_mb = dest_path.stat().st_size / 1024 / 1024
                log.info(f" [OK] {dest_path.name} ({size_mb:.1f}MB) via {url}")
                return True
            else:
                log.warning(f" [BAD] Not a PDF from {url}")
                dest_path.unlink(missing_ok=True)
        except Exception as e:
            log.warning(f" Mirror failed {url}: {e}")
            # Remove any partial file so a later run does not report ALREADY EXISTS
            dest_path.unlink(missing_ok=True)
            continue
    return False
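The stream-then-verify step in `try_download` writes straight to the destination, so an interrupted or invalid download briefly exists under the final filename. One possible alternative, sketched here under stated assumptions, is to write to a temporary file in the same directory and move it into place only after the `%PDF` check passes. The helper name `save_if_pdf` is hypothetical, and the byte chunks stand in for what `r.iter_content(8192)` would yield.

```python
import os
import tempfile
from pathlib import Path


def save_if_pdf(chunks, dest_path):
    """Write chunks to a temp file; move into place only if it starts with %PDF."""
    dest_path = Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=dest_path.parent, suffix=".part")
    tmp = Path(tmp_name)
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        with open(tmp, "rb") as f:
            is_pdf = f.read(4) == b"%PDF"
        if is_pdf:
            tmp.replace(dest_path)  # rename within the same directory
        return is_pdf
    finally:
        tmp.unlink(missing_ok=True)  # no-op if the file was already moved


with tempfile.TemporaryDirectory() as d:
    good = save_if_pdf([b"%PDF-1.7\n", b"..."], Path(d) / "good.pdf")
    bad = save_if_pdf([b"<html>", b"ads page</html>"], Path(d) / "bad.pdf")
    leftovers = sorted(p.name for p in Path(d).iterdir())
print(good, bad, leftovers)
```

With this shape, a `dest.exists()` check in the caller only ever sees files that passed the magic-byte test. Note that the four-byte check confirms the header only; it does not prove the PDF is complete or uncorrupted.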
def process_book(title, author, subdir, dry_run):
"""Full search + download pipeline for one book."""
log.info(f"[SEARCH] '{title}' by {author}")
result = {
"title": title,
"author": author,
"status": "NOT FOUND",
"md5": "",
"pages": "",
"file": "",
"notes": "",
}
candidates = search_aa(title, author)
if not candidates:
result["notes"] = "No results from AA search"
return result
# Pick best candidate — prefer one whose text contains author name
best = None
for c in candidates:
if author.split()[-1].lower() in c["text"].lower():
best = c
break
if not best:
best = candidates[0] # take first result if no author match
md5 = best["md5"]
result["md5"] = md5
details = get_book_details(md5)
result["pages"] = details.get("pages", "")
if dry_run:
result["status"] = "DRY RUN — found"
result["notes"] = f"MD5: {md5}"
return result
# Build destination path
safe_title = "".join(c if c.isalnum() or c in " ._-" else "_" for c in title)[:60]
safe_author = author.split()[-1]
filename = f"{safe_title}_{safe_author}.pdf"
dest = BASE_LIB / subdir / filename
if dest.exists():
result["status"] = "ALREADY EXISTS"
result["file"] = str(dest)
return result
log.info(f" MD5: {md5} — attempting download...")
ok = try_download(md5, dest)
if ok:
result["status"] = "DOWNLOADED"
result["file"] = str(dest)
else:
result["status"] = "MD5 ONLY"
result["notes"] = f"All mirrors failed. MD5: {md5}"
return result
def write_report(results):
REPORT_PATH.parent.mkdir(parents=True, exist_ok=True)
downloaded = [r for r in results if r["status"] == "DOWNLOADED"]
md5_only = [r for r in results if r["status"] == "MD5 ONLY"]
not_found = [r for r in results if r["status"] == "NOT FOUND"]
already_have = [r for r in results if r["status"] == "ALREADY EXISTS"]
lines = [
f"# Anna's Archive Acquisition Report",
f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M')}",
f"**Total searched:** {len(results)}",
f"",
f"| Status | Count |",
f"|--------|-------|",
f"| Downloaded | {len(downloaded)} |",
f"| MD5 only (mirrors failed) | {len(md5_only)} |",
f"| Not found on AA | {len(not_found)} |",
f"| Already in library | {len(already_have)} |",
f"",
]
if downloaded:
lines += ["## Downloaded", ""]
lines += ["| Title | Author | Pages | File |", "|-------|--------|-------|------|"]
for r in downloaded:
lines.append(f"| {r['title']} | {r['author']} | {r['pages']} | `{Path(r['file']).name}` |")
lines.append("")
if md5_only:
lines += ["## Found on AA — Download Failed (use MD5 for manual retrieval)", ""]
lines += ["| Title | Author | MD5 | Notes |", "|-------|--------|-----|-------|"]
for r in md5_only:
lines.append(f"| {r['title']} | {r['author']} | `{r['md5']}` | {r['notes']} |")
lines.append("")
if not_found:
lines += ["## Not Found on Anna's Archive", ""]
lines += ["| Title | Author | Notes |", "|-------|--------|-------|"]
for r in not_found:
lines.append(f"| {r['title']} | {r['author']} | {r['notes']} |")
lines.append("")
if already_have:
lines += ["## Already in Library", ""]
lines += ["| Title | Author |", "|-------|--------|"]
for r in already_have:
lines.append(f"| {r['title']} | {r['author']} |")
lines.append("")
REPORT_PATH.write_text("\n".join(lines))
log.info(f"Report written to {REPORT_PATH}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--limit", type=int, default=None)
args = parser.parse_args()
targets = TARGETS[:args.limit] if args.limit else TARGETS
log.info(f"Starting AA acquisition: {len(targets)} books | dry_run={args.dry_run}")
results = []
for i, (title, author, subdir) in enumerate(targets, 1):
log.info(f"[{i}/{len(targets)}]")
result = process_book(title, author, subdir, args.dry_run)
results.append(result)
log.info(f" -> {result['status']}")
# Polite delay between requests
time.sleep(random.uniform(8, 15))
write_report(results)
print(f"\n-- Summary -----------------------------------------------")
for status in ["DOWNLOADED", "MD5 ONLY", "NOT FOUND", "ALREADY EXISTS", "DRY RUN — found"]:
count = sum(1 for r in results if r["status"] == status)
if count:
print(f" {status:<35} {count:>3}")
print(f" Report: {REPORT_PATH}")
if __name__ == "__main__":
main()

478
scripts/aa_download_pass2.py Executable file

@ -0,0 +1,478 @@
#!/usr/bin/env python3
"""
aa_download_pass2.py: Second-pass downloader for books that failed in pass 1.
Reads the MD5 list from pass 1 report and tries:
1. Z-Library search by title/author (separate catalog from Libgen)
2. IPFS gateways using AA's IPFS CID (different from MD5 but findable)
3. Alternative Libgen mirrors not tried in pass 1
4. Direct AA slow download with longer timeout + retry
Checkpoint: saves progress to /opt/recon/data/aa_pass2_checkpoint.json
so interrupted runs resume where they left off.
Usage:
python3 /opt/recon/scripts/aa_download_pass2.py [--dry-run]
"""
import json
import time
import random
import logging
import hashlib
import argparse
from pathlib import Path
from datetime import datetime
import requests
from bs4 import BeautifulSoup
LOG_FILE = Path("/opt/recon/logs/aa_download_pass2.log")
REPORT_IN = Path.home() / "projects/recon/aa_acquisition_report.md"
REPORT_OUT = Path.home() / "projects/recon/aa_acquisition_report_pass2.md"
CHECKPOINT = Path("/opt/recon/data/aa_pass2_checkpoint.json")
BASE_LIB = Path("/mnt/library/Acquired")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)
log = logging.getLogger("aa_pass2")
SESSION = requests.Session()
SESSION.headers.update({
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
"Accept-Language": "en-US,en;q=0.9",
})
# ── Mirrors to try in order ───────────────────────────────────────────────────
MIRRORS = [
# Libgen alternatives
"https://libgen.li/ads.php?md5={md5}",
"https://library.lol/main/{md5}",
"https://libgen.rocks/get.php?md5={md5}",
# Z-Library direct MD5 endpoint (sometimes works)
"https://z-library.se/md5/{md5}",
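# IPFS public gateways (AA uses IPFS for storage).
# NOTE: gateways address content by CID, not MD5, so these MD5-based URLs are
# unlikely to resolve; real CIDs are looked up separately in get_ipfs_cids().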
"https://cloudflare-ipfs.com/ipfs/{md5}",
"https://ipfs.io/ipfs/{md5}",
"https://gateway.pinata.cloud/ipfs/{md5}",
]
# ── Books that failed in pass 1 — title, author, md5, subdir ─────────────────
PASS1_FAILURES = [
# Medical/Herbalism
("The Earthwise Herbal Volume 1", "Matthew Wood", "fc8dc19f5a17f38849a3979830dc95c1", "Medical/Herbalism"),
("The Earthwise Herbal Volume 2", "Matthew Wood", "fc8dc19f5a17f38849a3979830dc95c1", "Medical/Herbalism"),  # NOTE: same MD5 as Volume 1 above; verify before relying on it
("Herbal Antibiotics", "Stephen Buhner", "5839dab78edfdff0d7986fac62b814da", "Medical/Herbalism"),
("The Herbal Medicine-Maker's Handbook", "James Green", "27e8e8a3585705ed194029b69c7d61b1", "Medical/Herbalism"),
("Rosemary Gladstar's Medicinal Herbs", "Rosemary Gladstar", "9b1966f20a32ab4331bfece167be1dd0", "Medical/Herbalism"),
# Medical/Austere
("Wilderness Medicine", "Paul Auerbach", "957818eaa4ec40527bb05902f9ef7c51", "Medical/Austere"),
("Medicine for Mountaineering", "James Wilkerson", "39cb07998f2034206f0c9472e44cb0b4", "Medical/Austere"),
# Medical/Veterinary
("The Chicken Health Handbook", "Gail Damerow", "0ba42fbea034b9a08ec8e2f8d7606efe", "Medical/Veterinary"),
# Power
("The Renewable Energy Handbook", "William Kemp", "475d89fa80aea6c45aa4b1b4b9c5e274", "Power"),
("Homebrew Wind Power", "Dan Bartmann", "0578696d5b1b6bceb3e5e3302c1a31aa", "Power"),
("Wind Energy Basics", "Paul Gipe", "ccbe9d22e0a5e32d61921d20d66a8e05", "Power"),
("12-Volt Bible", "Brotherton", "3f964fa6d730fdf2c3d3e231e87cf692", "Power"),
("Wiring a House", "Rex Cauldwell", "5efcb53450e9eb560210eee40678adcf", "Power"),
# Navigation
("Emergency Navigation", "David Burch", "25e4def9e777b3fa9ca935134732ff9d", "Navigation"),
# Water
("Water Storage", "Art Ludwig", "17c965ec15c6cf4f09b5377b599a5266", "Water"),
("The Home Water Supply", "Stu Campbell", "9b22677d2f8e8b39f7a6bf032187295b", "Water"),
# Food
("Fermented Vegetables", "Kirsten Shockey", "74d3bde876b4c17be66c21fdfa85213e", "Food"),
("The Art of Natural Cheesemaking", "David Asher", "bc0e0829d701fea9beca912d39f8cc74", "Food"),
# Permaculture
("Edible Forest Gardens Volume 1", "Dave Jacke", "6b069c3bb077fdd89d487a363c070fbb", "Permaculture"),
("Edible Forest Gardens Volume 2", "Dave Jacke", "699255bfde7f69285c132a94ec291bf4", "Permaculture"),
("Creating a Forest Garden", "Martin Crawford", "96d71d70dba31ae86e14845f913e557e", "Permaculture"),
("Sepp Holzer's Permaculture", "Sepp Holzer", "32be55a9fce3e31cacd6912069abb410", "Permaculture"),
("The Permaculture Handbook", "Peter Bane", "08cb4492739fda4d01b5a868a408e4a0", "Permaculture"),
("The Market Gardener", "Jean-Martin Fortier", "ac69f6c8c22305b42b539482dc761c19", "Permaculture"),
# Scenario
("SAS Survival Handbook", "John Wiseman", "fa967fd5fcbeb3c9887e22f73e590c64", "Scenario"),
("Pocket Ref", "Thomas Glover", "8e4988ce513a4aa75e7e6c00ee36692b", "Scenario"),
("Deep Survival", "Laurence Gonzales", "9a907ab13b81ea597407fffdb8ea1b04", "Scenario"),
# Skills
("A Pattern Language", "Christopher Alexander","7f5cc06b5399b65a278c4005ccd8d476", "Skills"),
]
def load_checkpoint():
    """Load checkpoint: dict of {title: result_dict} for completed books."""
    if CHECKPOINT.exists():
        try:
            return json.loads(CHECKPOINT.read_text())
        except Exception:
            # Unreadable or corrupt checkpoint: start from scratch
            pass
    return {}
def save_checkpoint(completed):
    """Save checkpoint after each book."""
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first, then rename, so a crash never leaves a half-written checkpoint
    tmp = str(CHECKPOINT) + ".tmp"
    with open(tmp, "w") as f:
        json.dump(completed, f, indent=2)
    Path(tmp).replace(CHECKPOINT)
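The two helpers above are meant to let an interrupted run resume where it left off. A minimal sketch of that resume loop follows, with the checkpoint path passed in explicitly so it can run against a temporary directory. The `run` and `fake_work` names are illustrative assumptions, and `fake_work` stands in for the real download attempt.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path):
    """Return the saved {title: result} map, or an empty dict if missing/corrupt."""
    try:
        return json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return {}


def save_checkpoint(path, completed):
    """Write to a sibling temp file, then rename over the real checkpoint."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(completed, indent=2))
    tmp.replace(path)


def run(books, path, work):
    """Process each title once, skipping any already recorded in the checkpoint."""
    completed = load_checkpoint(path)
    for title in books:
        if title in completed:
            continue
        completed[title] = work(title)
        save_checkpoint(path, completed)
    return completed


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "checkpoint.json"
    calls = []

    def fake_work(title):  # stand-in for the real download attempt
        calls.append(title)
        return {"status": "DOWNLOADED"}

    run(["A", "B"], ckpt, fake_work)
    final = run(["A", "B", "C"], ckpt, fake_work)  # only "C" is new
print(calls, sorted(final))
```

Saving after every book costs one small file write per iteration, which is negligible next to the network delays in this script, and it bounds the work lost on interruption to a single book.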
def load_md5s_from_report():
    """Parse MD5 hashes from the pass 1 report. Returns {lower-cased title: md5}."""
    if not REPORT_IN.exists():
        return {}
    md5_map = {}
    for line in REPORT_IN.read_text().splitlines():
        if "`" in line and len(line) > 30:
            parts = line.split("|")
            if len(parts) >= 4:
                title = parts[1].strip()
                md5_cell = parts[3].strip().strip("`").lower()
                # Require 32 hex digits; isalnum() alone would accept non-hex strings
                if len(md5_cell) == 32 and set(md5_cell) <= set("0123456789abcdef"):
                    md5_map[title.lower()] = md5_cell
    return md5_map
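As an illustration of the row format this parser expects, the sketch below runs the same column logic over a small in-memory table shaped like the "Download Failed" section of the pass 1 report. The titles, authors, and the MD5 value are made up for the example, and `md5s_from_table` is a hypothetical name.

```python
import re

MD5_RE = re.compile(r"[0-9a-f]{32}")


def md5s_from_table(markdown):
    """Map lower-cased title -> MD5 for rows shaped like | Title | Author | `md5` | ... |"""
    found = {}
    for line in markdown.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 4:
            continue
        # parts[0] is the empty string before the leading pipe
        cell = parts[3].strip("`").lower()
        if MD5_RE.fullmatch(cell):
            found[parts[1].lower()] = cell
    return found


report = "\n".join([
    "| Title | Author | MD5 | Notes |",
    "|-------|--------|-----|-------|",
    "| Example Book | A. Author | `0123456789abcdef0123456789abcdef` | All mirrors failed |",
    "| Other Book | B. Author | `not-a-real-hash` | malformed row |",
])
md5_map = md5s_from_table(report)
print(md5_map)
```

The header and separator rows fall out naturally because their third cell is not a 32-digit hex string, so no special-casing is needed.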
def search_zlib(title, author):
"""Try Z-Library search endpoint."""
try:
url = "https://z-library.se/s/"
params = {"q": f"{title} {author}", "extension[]": "pdf"}
r = SESSION.get(url, params=params, timeout=15)
if r.status_code != 200:
return None
soup = BeautifulSoup(r.text, "html.parser")
# Z-lib book links contain /book/
for a in soup.select("a[href*='/book/']")[:3]:
href = a.get("href", "")
if href:
book_url = f"https://z-library.se{href}" if href.startswith("/") else href
return book_url
except Exception as e:
log.debug(f"Zlib search failed: {e}")
return None
def try_zlib_download(book_url, dest_path):
"""Download from Z-Library book page."""
try:
r = SESSION.get(book_url, timeout=15)
soup = BeautifulSoup(r.text, "html.parser")
dl = soup.select_one("a.addDownloadedBook, a[href*='/dl/'], a.btn-primary[href*='download']")
if not dl:
return False
dl_url = dl["href"]
if not dl_url.startswith("http"):
dl_url = f"https://z-library.se{dl_url}"
r2 = SESSION.get(dl_url, timeout=120, stream=True)
if r2.status_code != 200:
return False
dest_path.parent.mkdir(parents=True, exist_ok=True)
with open(dest_path, "wb") as f:
for chunk in r2.iter_content(8192):
f.write(chunk)
with open(dest_path, "rb") as f:
if f.read(4) == b"%PDF":
return True
dest_path.unlink(missing_ok=True)
except Exception as e:
log.debug(f"Zlib download failed: {e}")
return False
def try_mirrors(md5, dest_path):
"""Try all mirrors with the MD5."""
import re as _re
for tpl in MIRRORS:
url = tpl.format(md5=md5)
try:
r = SESSION.get(url, timeout=20, stream=True, allow_redirects=True)
if r.status_code != 200:
continue
ctype = r.headers.get("content-type", "")
if "html" in ctype:
soup = BeautifulSoup(r.text, "html.parser")
# For libgen.li ads page, look for get.php with key
dl = None
match = _re.search(r'href="(get\.php\?md5=[^"]+)"', r.text)
if match:
actual = f"https://libgen.li/{match.group(1)}"
else:
dl = (soup.select_one("a[href*='.pdf']") or
soup.select_one("a[href*='get.php']") or
soup.select_one("a[href*='/get/']"))
if not dl:
continue
actual = dl["href"]
if not actual.startswith("http"):
base = url.split("/")[0] + "//" + url.split("/")[2]
actual = base + ("/" if not actual.startswith("/") else "") + actual
r = SESSION.get(actual, timeout=60, stream=True)
if r.status_code != 200:
continue
dest_path.parent.mkdir(parents=True, exist_ok=True)
with open(dest_path, "wb") as f:
for chunk in r.iter_content(8192):
f.write(chunk)
with open(dest_path, "rb") as f:
if f.read(4) == b"%PDF":
size_mb = dest_path.stat().st_size / 1024 / 1024
log.info(f" [OK] {size_mb:.1f}MB via {url}")
return True
dest_path.unlink(missing_ok=True)
except Exception as e:
log.debug(f"Mirror {url} failed: {e}")
time.sleep(2)
return False
def get_ipfs_cids(md5):
"""Fetch IPFS CIDs from AA book detail page."""
import re as _re
cids = []
try:
r = SESSION.get(f"https://annas-archive.gl/md5/{md5}", timeout=20)
if r.status_code == 200:
for m in _re.finditer(r'ipfs_cid[:\s]+([A-Za-z0-9]{46,})', r.text):
cids.append(m.group(1))
# Also check for CIDs in href attributes
for m in _re.finditer(r'ipfs://([A-Za-z0-9]{46,})', r.text):
if m.group(1) not in cids:
cids.append(m.group(1))
except Exception as e:
log.debug(f"IPFS CID fetch failed: {e}")
return cids
def try_ipfs_download(cids, dest_path):
"""Try downloading via IPFS public gateways."""
gateways = [
"https://cloudflare-ipfs.com/ipfs/{}",
"https://dweb.link/ipfs/{}",
]
for cid in cids[:3]: # limit to first 3 CIDs
for gw_tpl in gateways:
url = gw_tpl.format(cid)
try:
r = SESSION.get(url, timeout=15, stream=True)
if r.status_code != 200:
continue
dest_path.parent.mkdir(parents=True, exist_ok=True)
with open(dest_path, "wb") as f:
for chunk in r.iter_content(8192):
f.write(chunk)
with open(dest_path, "rb") as f:
if f.read(4) == b"%PDF":
size_mb = dest_path.stat().st_size / 1024 / 1024
log.info(f" [OK] {size_mb:.1f}MB via IPFS {url[:60]}...")
return True
dest_path.unlink(missing_ok=True)
except Exception as e:
log.debug(f"IPFS {url} failed: {e}")
time.sleep(1)
return False
def search_aa_fresh(title, author):
"""Fresh AA search, trying each known AA domain in turn, for books that weren't found before."""
for domain in ["annas-archive.gl", "annas-archive.se", "annas-archive.org"]:
try:
url = f"https://{domain}/search"
params = {"q": f"{title} {author}", "ext": "pdf", "lang": "en"}
r = SESSION.get(url, params=params, timeout=15)
if r.status_code != 200:
continue
soup = BeautifulSoup(r.text, "html.parser")
for a in soup.select("a[href^='/md5/']"):
text = a.get_text(" ", strip=True)
if not text:
continue
md5 = a["href"].split("/md5/")[-1].split("/")[0].strip()
if len(md5) == 32:
if author.split()[-1].lower() in text.lower() or title.split()[0].lower() in text.lower():
return md5
except Exception:
continue
return None
def process_book(title, author, md5_hint, subdir, dry_run):
result = {
"title": title, "author": author,
"status": "NOT FOUND", "md5": md5_hint,
"file": "", "notes": "",
}
safe_title = "".join(c if c.isalnum() or c in " ._-" else "_" for c in title)[:60]
safe_author = author.split()[-1]
dest = BASE_LIB / subdir / f"{safe_title}_{safe_author}.pdf"
if dest.exists():
result["status"] = "ALREADY EXISTS"
result["file"] = str(dest)
return result
if dry_run:
result["status"] = "DRY RUN"
return result
# 1. Try Z-Library first (different catalog)
log.info(f" Trying Z-Library...")
zlib_url = search_zlib(title, author)
if zlib_url:
if try_zlib_download(zlib_url, dest):
result["status"] = "DOWNLOADED (Z-Library)"
result["file"] = str(dest)
return result
# 2. If no MD5 from pass 1, do a fresh AA search
md5 = md5_hint
if not md5:
log.info(f" Searching AA for fresh MD5...")
md5 = search_aa_fresh(title, author)
if md5:
result["md5"] = md5
log.info(f" Found MD5: {md5}")
# 3. Try IPFS with real CIDs from AA detail page
if md5:
log.info(f" Fetching IPFS CIDs from AA...")
cids = get_ipfs_cids(md5)
if cids:
log.info(f" Found {len(cids)} IPFS CID(s), trying gateways...")
if try_ipfs_download(cids, dest):
result["status"] = "DOWNLOADED (IPFS)"
result["file"] = str(dest)
return result
# 4. Try all mirrors with MD5
if md5:
log.info(f" Trying mirrors with MD5 {md5}...")
if try_mirrors(md5, dest):
result["status"] = "DOWNLOADED (mirror)"
result["file"] = str(dest)
return result
result["status"] = "MD5 ONLY"
result["notes"] = f"MD5 confirmed, all mirrors failed: {md5}"
else:
result["notes"] = "Not found on AA or Z-Library"
return result
def write_report(results):
downloaded = [r for r in results if "DOWNLOADED" in r["status"]]
md5_only = [r for r in results if r["status"] == "MD5 ONLY"]
not_found = [r for r in results if r["status"] == "NOT FOUND"]
existing = [r for r in results if r["status"] == "ALREADY EXISTS"]
lines = [
"# AA Acquisition Report -- Pass 2",
f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M')}",
f"**Searched:** {len(results)} | **Downloaded:** {len(downloaded)} | "
f"**MD5 only:** {len(md5_only)} | **Not found:** {len(not_found)}",
"",
]
if downloaded:
lines += ["## Downloaded", "",
"| Title | Author | Via | File |",
"|-------|--------|-----|------|"]
for r in downloaded:
lines.append(f"| {r['title']} | {r['author']} | {r['status']} | `{Path(r['file']).name}` |")
lines.append("")
if existing:
lines += ["## Already in Library", "",
"| Title | Author |",
"|-------|--------|"]
for r in existing:
lines.append(f"| {r['title']} | {r['author']} |")
lines.append("")
if md5_only:
lines += ["## MD5 Known -- All Mirrors Failed", "",
"| Title | Author | MD5 |",
"|-------|--------|-----|"]
for r in md5_only:
lines.append(f"| {r['title']} | {r['author']} | `{r['md5']}` |")
lines.append("")
if not_found:
lines += ["## Not Found Anywhere", "",
"| Title | Author | Notes |",
"|-------|--------|-------|"]
for r in not_found:
lines.append(f"| {r['title']} | {r['author']} | {r['notes']} |")
lines.append("")
REPORT_OUT.parent.mkdir(parents=True, exist_ok=True)
REPORT_OUT.write_text("\n".join(lines))
log.info(f"Report written to {REPORT_OUT}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
# Load any MD5s captured in pass 1
md5_map = load_md5s_from_report()
targets = []
for title, author, md5_hint, subdir in PASS1_FAILURES:
md5 = md5_hint or md5_map.get(title.lower(), "")
targets.append((title, author, md5, subdir))
# Load checkpoint
completed = load_checkpoint()
if completed:
log.info(f"Resuming: {len(completed)} books already processed in previous run")
log.info(f"Pass 2: {len(targets)} books | dry_run={args.dry_run}")
results = []
for i, (title, author, md5, subdir) in enumerate(targets, 1):
# Check checkpoint — skip already-processed books
if title in completed and not args.dry_run:
result = completed[title]
results.append(result)
log.info(f"[{i}/{len(targets)}] {title} — SKIPPED (checkpoint: {result['status']})")
continue
log.info(f"[{i}/{len(targets)}] {title} -- {author}")
result = process_book(title, author, md5, subdir, args.dry_run)
results.append(result)
log.info(f" -> {result['status']}")
# Save checkpoint after each book (not in dry-run)
if not args.dry_run:
completed[title] = result
save_checkpoint(completed)
time.sleep(random.uniform(6, 12))
write_report(results)
print(f"\n-- Pass 2 Summary ----------------------------------------")
for status in ["DOWNLOADED (Z-Library)", "DOWNLOADED (IPFS)", "DOWNLOADED (mirror)", "MD5 ONLY", "NOT FOUND", "ALREADY EXISTS", "DRY RUN"]:
count = sum(1 for r in results if r["status"] == status)
if count:
print(f" {status:<35} {count:>3}")
print(f" Report: {REPORT_OUT}")
if __name__ == "__main__":
main()

64
scripts/backup.sh Executable file

@ -0,0 +1,64 @@
#!/bin/bash
# RECON Backup Script
# Backs up the precious data: concept JSONs, text extracts, SQLite DB
# Qdrant is NOT backed up — rebuilt from JSONs via `recon rebuild`
# Destination: Contabo VPS (100.64.0.1) via rsync+SSH
set -euo pipefail
RECON_DIR="/opt/recon"
DATA_DIR="$RECON_DIR/data"
LOG_FILE="$RECON_DIR/logs/backup.log"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_HOST="root@100.64.0.1"
BACKUP_BASE="/opt/backups/recon"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
mkdir -p "$RECON_DIR/logs"
log "=== RECON Backup Starting ==="
# ── 1. SQLite DB (small, fast, critical) ──
log "Backing up recon.db..."
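LOCAL_DB_BACKUP="/tmp/recon_${DATE}.db"
# Remove the temporary copy even if sqlite3 or rsync fails under `set -e`
trap 'rm -f "$LOCAL_DB_BACKUP"' EXIT
sqlite3 "$DATA_DIR/recon.db" ".backup '$LOCAL_DB_BACKUP'"
rsync -az "$LOCAL_DB_BACKUP" "$BACKUP_HOST:$BACKUP_BASE/recon_${DATE}.db"
rm -f "$LOCAL_DB_BACKUP"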
# Keep last 7 daily DB backups on remote
ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/recon_*.db 2>/dev/null | tail -n +8 | xargs rm -f 2>/dev/null || true"
log " recon.db backed up"
# ── 2. Concept JSONs (THE PRECIOUS DATA — $130+ of Gemini work) ──
log "Syncing concept JSONs..."
rsync -az --delete "$DATA_DIR/concepts/" "$BACKUP_HOST:$BACKUP_BASE/concepts/"
CONCEPT_COUNT=$(find "$DATA_DIR/concepts/" -name "*.json" 2>/dev/null | wc -l)
log " concepts synced ($CONCEPT_COUNT JSON files)"
# ── 3. Text extracts (regenerable but expensive in time) ──
log "Syncing text extracts..."
rsync -az --delete "$DATA_DIR/text/" "$BACKUP_HOST:$BACKUP_BASE/text/"
TEXT_COUNT=$(find "$DATA_DIR/text/" -maxdepth 1 -type d 2>/dev/null | wc -l)
log " text synced ($((TEXT_COUNT - 1)) document dirs)"
# ── 4. Intel feeds ──
if [ -d "$DATA_DIR/intel" ]; then
log "Syncing intel feeds..."
rsync -az --delete "$DATA_DIR/intel/" "$BACKUP_HOST:$BACKUP_BASE/intel/"
log " intel synced"
fi
# ── 5. Config files ──
log "Backing up config..."
rsync -az "$RECON_DIR/config.yaml" "$BACKUP_HOST:$BACKUP_BASE/config_${DATE}.yaml"
rsync -az "$RECON_DIR/.env" "$BACKUP_HOST:$BACKUP_BASE/env_${DATE}" 2>/dev/null || true
ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/config_*.yaml 2>/dev/null | tail -n +4 | xargs rm -f 2>/dev/null || true"
ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/env_* 2>/dev/null | tail -n +4 | xargs rm -f 2>/dev/null || true"
log " config backed up"
# ── Summary ──
BACKUP_SIZE=$(ssh "$BACKUP_HOST" "du -sh $BACKUP_BASE" | cut -f1)
log "=== Backup Complete: $BACKUP_SIZE on Contabo ==="
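
The remote pruning steps in the script (`ls -t ... | tail -n +8 | xargs rm -f`) keep the seven newest DB backups and delete the rest. A minimal Python sketch of the same retention rule follows, assuming the `recon_YYYYMMDD_HHMMSS.db` naming used by the script. `files_to_prune` is a hypothetical helper, not part of RECON, and it orders files by the timestamp embedded in the name rather than by mtime as `ls -t` does.

```python
def files_to_prune(names, keep=7):
    """Return the backup names to delete, keeping the `keep` newest.

    Assumes names like recon_20260414_145723.db, whose embedded timestamp
    sorts lexicographically in chronological order.
    """
    newest_first = sorted(names, reverse=True)
    return newest_first[keep:]


if __name__ == "__main__":
    # Ten illustrative nightly backups; only the three oldest are selected.
    backups = [f"recon_202604{day:02d}_020000.db" for day in range(1, 11)]
    for name in files_to_prune(backups):
        print("would delete", name)
```

Sorting by name makes the result independent of file mtimes, which rsync does not preserve without `-t` or `-a`.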

449
scripts/cleanup_outliers.py Executable file

@ -0,0 +1,449 @@
#!/usr/bin/env python3
"""
cleanup_outliers.py: Three-pass cleanup of RECON concept data.
Pass 1: Remap ~160 non-canonical domain strings in concept JSONs + Qdrant payloads
Pass 2: Re-enrich 434 concepts with empty domain arrays via Gemini
Pass 3: Purge junk/noise URLs from Qdrant + SQLite DB
Usage:
python3 /opt/recon/scripts/cleanup_outliers.py [--dry-run] [--skip-pass N]
"""
import json
import time
import random
import logging
import argparse
import threading
import sqlite3
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
import google.generativeai as genai
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, MatchAny, Filter
import sys, os
sys.path.insert(0, '/opt/recon')
from lib.utils import get_config, setup_logging
LOG_FILE = Path("/opt/recon/logs/cleanup_outliers.log")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)
log = logging.getLogger("cleanup_outliers")
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
DB_PATH = Path("/opt/recon/data/recon.db")
CANONICAL_DOMAINS = {
"Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
"Foundational Skills", "Communications", "Medical", "Food Systems",
"Navigation", "Logistics", "Power Systems", "Leadership",
"Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
}
# Non-canonical → canonical remap
OUTLIER_MAP = {
"Zoology": "Sustainment Systems",
"Botany": "Sustainment Systems",
"Nature Lore": "Sustainment Systems",
"Ecology": "Sustainment Systems",
"Navigational Astronomy": "Navigation",
"Troubleshooting": "Foundational Skills",
"Chemistry": "Foundational Skills",
"Metallurgy": "Foundational Skills",
"Weird Science": "Foundational Skills",
"Philosophy of physics": "Foundational Skills",
"Physics": "Foundational Skills",
"Cell biology": "Foundational Skills",
"Economics": "Leadership",
"Business": "Leadership",
"Safety": "Security",
"Law Enforcement": "Security",
"Security & Intelligence": "Security",
"Fire Weather": "Scenario Playbooks",
"Legal": "Leadership",
# Discard — replace with closest real domain
"Site News": "Foundational Skills",
"Paleogeography": "Foundational Skills",
"Chemical Manipulation": "Foundational Skills",
}
# Junk URL patterns — pages with no knowledge value
JUNK_URL_PATTERNS = [
# rocketstoves.com nav/template garbage
"rocketstoves.com/favicon",
"rocketstoves.com/cropped-favicon",
"rocketstoves.com/layouts/",
"rocketstoves.com/sample",
"rocketstoves.com/templates/",
"rocketstoves.com/hello-world",
"rocketstoves.com/blog-forthcoming",
"rocketstoves.com/contact",
"rocketstoves.com/acknowledgements",
"rocketstoves.com/ja3",
"rocketstoves.com/juxtapositions",
"rocketstoves.com/no-name-soi",
"rocketstoves.com/big4",
"rocketstoves.com/roof",
"rocketstoves.com/rmh_dloadcover",
"rocketstoves.com/pedcover",
"rocketstoves.com/laundry-to-landscape",
"rocketstoves.com/barreloven",
# NRCS calendar/event noise
"nrcs.usda.gov/events/",
"nrcs.usda.gov/state-offices/massachusetts",
"nrcs.usda.gov/state-offices/nebraska",
"nrcs.usda.gov/state-offices/oklahoma",
"nrcs.usda.gov/state-offices/utah",
"nrcs.usda.gov/conservation-basics/natural-resource-concerns/soil/western-call-for-abstracts",
# deeranddeerhunting trophy hunt videos (no knowledge value)
"deeranddeerhunting.com/trophy-whitetails-exclusive-videos/",
# eattheweeds non-content pages
"eattheweeds.com/media-interviews-with-green-deane",
"eattheweeds.com/motorcycles-and-mushrooms",
"eattheweeds.com/sunny-savage",
# foragersharvest nav pages
"foragersharvest.com/contact",
"foragersharvest.com/podcasts",
# motherearthnews classifieds/nav
"motherearthnews.com/classifieds/",
"motherearthnews.com/biographies/",
]
CLASSIFY_PROMPT = """\
Classify this knowledge concept into one or more domains.
VALID DOMAINS (use ONLY these exact strings):
Defense & Tactics, Sustainment Systems, Off-Grid Systems, Foundational Skills,
Communications, Medical, Food Systems, Navigation, Logistics, Power Systems,
Leadership, Scenario Playbooks, Water Systems, Security, Community Coordination
Concept title: {title}
Concept tags: {subdomain}
Concept preview: {content}
Return ONLY valid JSON, no markdown:
{{"domain": ["Domain Name"]}}
Rules:
- Never return empty domain list
- Medical content, herbs, first aid, veterinary -> Medical
- Food growing, foraging, hunting, livestock -> Sustainment Systems
- Food preservation, canning, storage -> Food Systems
- Solar, wind, batteries, generators -> Power Systems
- Water sourcing, filtration, sanitation -> Water Systems
"""
def load_gemini_keys():
keys = []
for line in Path("/opt/recon/.env").read_text().splitlines():
if line.startswith("GEMINI_KEY_"):
keys.append(line.split("=", 1)[1].strip())
return keys
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self._i = 0
self._lock = threading.Lock()
def next(self):
with self._lock:
key = self.keys[self._i % len(self.keys)]
self._i += 1
return key
def classify_concept(title, subdomains, content, key):
prompt = CLASSIFY_PROMPT.format(
title=title or "(untitled)",
subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
content=str(content)[:300] if content else "(none)",
)
genai.configure(api_key=key)
model = genai.GenerativeModel(
"gemini-2.0-flash",
generation_config={"response_mime_type": "application/json"}
)
for attempt in range(4):
try:
resp = model.generate_content(prompt)
data = json.loads(resp.text)
domains = [d for d in data.get("domain", []) if d in CANONICAL_DOMAINS]
if domains:
return domains
except Exception as e:
err = str(e).lower()
if any(s in err for s in ["429", "quota", "rate", "503"]):
time.sleep(min(5 * (2 ** attempt) + random.uniform(0, 3), 60))
else:
break
return ["Foundational Skills"]
# ── PASS 1: Remap outlier domains ────────────────────────────────────────────
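def remap_concept_domains(domains):
    """Remap outlier domain names; return (new_domains, changed).

    Unknown names are dropped. Order of first appearance is preserved so
    repeated runs produce identical output.
    """
    result = []
    changed = False
    for d in domains:
        if d in CANONICAL_DOMAINS:
            mapped = d
        elif d in OUTLIER_MAP:
            mapped = OUTLIER_MAP[d]
            changed = True
        else:
            changed = True  # drop unknown
            continue
        if mapped not in result:
            result.append(mapped)
    return result, changed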
def pass1_remap_outliers(qdrant, collection, dry_run):
log.info("=== PASS 1: Remapping non-canonical outlier domains ===")
outlier_names = list(OUTLIER_MAP.keys())
stats = defaultdict(int)
# Scroll through Qdrant finding affected vectors
offset = None
affected_points = []
while True:
results, offset = qdrant.scroll(
collection_name=collection,
scroll_filter=Filter(
must=[FieldCondition(
key="domain",
match=MatchAny(any=outlier_names)
)]
),
limit=500,
with_payload=True,
with_vectors=False,
offset=offset,
)
affected_points.extend(results)
if offset is None:
break
log.info(f"Found {len(affected_points)} Qdrant points with outlier domains")
for point in affected_points:
payload = point.payload
old_domains = payload.get("domain", [])
if isinstance(old_domains, str):
old_domains = [old_domains]
new_domains, changed = remap_concept_domains(old_domains)
if not new_domains:
new_domains = ["Foundational Skills"]
if changed:
stats["qdrant_updated"] += 1
if not dry_run:
qdrant.set_payload(
collection_name=collection,
payload={"domain": new_domains},
points=[point.id],
)
# Also fix concept JSON files on disk
json_fixed = 0
for window_file in CONCEPTS_DIR.rglob("window_*.json"):
try:
with open(window_file, "r", encoding="utf-8") as f:
concepts = json.load(f)
except Exception:
continue
if not isinstance(concepts, list):
continue
file_changed = False
for concept in concepts:
if not isinstance(concept, dict):
continue
raw = concept.get("domain", [])
if isinstance(raw, str):
raw = [raw]
new, changed = remap_concept_domains(raw)
if changed:
concept["domain"] = new if new else ["Foundational Skills"]
file_changed = True
if file_changed:
json_fixed += 1
if not dry_run:
with open(window_file, "w", encoding="utf-8") as f:
json.dump(concepts, f, indent=2, ensure_ascii=False)
log.info(f"Pass 1 complete: {stats['qdrant_updated']} Qdrant points updated, {json_fixed} JSON files updated")
return stats
# ── PASS 2: Re-enrich empty domain concepts ──────────────────────────────────
def pass2_empty_domains(qdrant, collection, key_rotator, dry_run):
log.info("=== PASS 2: Re-enriching empty domain concepts ===")
stats = defaultdict(int)
# Find empty domain points in Qdrant
offset = None
empty_points = []
while True:
results, offset = qdrant.scroll(
collection_name=collection,
limit=500,
with_payload=True,
with_vectors=False,
offset=offset,
)
for r in results:
d = r.payload.get("domain", [])
if not d or d == [] or d == [""]:
empty_points.append(r)
if offset is None:
break
log.info(f"Found {len(empty_points)} points with empty domains")
for point in empty_points:
payload = point.payload
title = payload.get("title", "")
subdomains = payload.get("subdomain", [])
content = payload.get("content", payload.get("summary", ""))
key = key_rotator.next()
new_domains = classify_concept(title, subdomains, content, key)
stats["classified"] += 1
if not dry_run:
qdrant.set_payload(
collection_name=collection,
payload={"domain": new_domains},
points=[point.id],
)
# Also update the concept JSON on disk
doc_hash = payload.get("doc_hash", "")
if doc_hash:
doc_concepts_dir = CONCEPTS_DIR / doc_hash
if doc_concepts_dir.exists():
for wf in doc_concepts_dir.glob("window_*.json"):
try:
with open(wf, "r", encoding="utf-8") as f:
concepts = json.load(f)
changed = False
for c in concepts:
if isinstance(c, dict) and c.get("title") == title:
d = c.get("domain", [])
if not d or d == []:
c["domain"] = new_domains
changed = True
if changed and not dry_run:
with open(wf, "w", encoding="utf-8") as f:
json.dump(concepts, f, indent=2, ensure_ascii=False)
except Exception:
pass
time.sleep(0.05)
log.info(f"Pass 2 complete: {stats['classified']} concepts re-classified")
return stats
# ── PASS 3: Purge junk URLs ──────────────────────────────────────────────────
def is_junk_url(url):
url_lower = url.lower()
return any(pattern.lower() in url_lower for pattern in JUNK_URL_PATTERNS)
def pass3_purge_junk(qdrant, collection, dry_run):
log.info("=== PASS 3: Purging junk URLs ===")
stats = defaultdict(int)
# Scroll all web-source points and find junk
offset = None
junk_point_ids = []
junk_doc_hashes = set()
while True:
results, offset = qdrant.scroll(
collection_name=collection,
scroll_filter=Filter(
must=[FieldCondition(key="source_type", match=MatchAny(any=["web"]))]
),
limit=500,
with_payload=True,
with_vectors=False,
offset=offset,
)
for r in results:
filename = r.payload.get("filename", "")
doc_hash = r.payload.get("doc_hash", "")
if is_junk_url(filename):
junk_point_ids.append(r.id)
if doc_hash:
junk_doc_hashes.add(doc_hash)
if offset is None:
break
log.info(f"Found {len(junk_point_ids)} junk vectors across {len(junk_doc_hashes)} documents")
if not dry_run and junk_point_ids:
# Delete in batches
batch_size = 500
for i in range(0, len(junk_point_ids), batch_size):
batch = junk_point_ids[i:i + batch_size]
qdrant.delete(collection_name=collection, points_selector=batch)
log.info(f"Deleted {len(junk_point_ids)} junk vectors from Qdrant")
# Mark junk docs as skipped in SQLite
conn = sqlite3.connect(str(DB_PATH))
for doc_hash in junk_doc_hashes:
conn.execute(
"UPDATE documents SET status = 'skipped', error_message = 'junk content purged' WHERE hash = ?",
(doc_hash,)
)
conn.commit()
conn.close()
log.info(f"Marked {len(junk_doc_hashes)} documents as skipped in DB")
stats["junk_vectors"] = len(junk_point_ids)
stats["junk_docs"] = len(junk_doc_hashes)
log.info(f"Pass 3 complete: {stats['junk_vectors']} vectors, {stats['junk_docs']} docs purged")
return stats
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--skip-pass", type=int, action="append", default=[])
    args = parser.parse_args()
    config = get_config()
    keys = load_gemini_keys()
    # KeyRotator.next() divides by len(keys), so pass 2 cannot run without keys
    if 2 not in args.skip_pass and not keys:
        log.error("No Gemini keys found in .env; add keys or rerun with --skip-pass 2")
        return
    rotator = KeyRotator(keys)
    qdrant = QdrantClient(
        host=config['vector_db']['host'],
        port=config['vector_db']['port'],
        timeout=60
    )
    collection = config['vector_db']['collection']
    log.info(f"Starting cleanup | dry_run={args.dry_run} | skipping passes: {args.skip_pass}")
    if 1 not in args.skip_pass:
        pass1_remap_outliers(qdrant, collection, args.dry_run)
    if 2 not in args.skip_pass:
        pass2_empty_domains(qdrant, collection, rotator, args.dry_run)
    if 3 not in args.skip_pass:
        pass3_purge_junk(qdrant, collection, args.dry_run)
    log.info("All passes complete.")


if __name__ == "__main__":
    main()
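
The retry loop in `classify_concept` sleeps for `min(5 * 2**attempt + jitter, 60)` seconds after a rate-limit style error. The sketch below isolates that formula to show the resulting schedule. `backoff_delay` and its parameter names are illustrative and do not exist in the script; the default values are taken from the expression in the code.

```python
import random


def backoff_delay(attempt, base=5.0, jitter=3.0, cap=60.0, rng=random):
    """Exponential backoff with additive uniform jitter, capped at `cap` seconds."""
    return min(base * (2 ** attempt) + rng.uniform(0, jitter), cap)


if __name__ == "__main__":
    # The script retries up to four times (attempts 0 through 3).
    for attempt in range(4):
        low = min(5.0 * 2 ** attempt, 60.0)
        high = min(5.0 * 2 ** attempt + 3.0, 60.0)
        print(f"attempt {attempt}: sleep between {low:.0f}s and {high:.0f}s")
```

With four attempts the cap is never reached, since the largest uncapped delay is 43 seconds; it would only matter if the retry count were raised.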

215
scripts/domain_reenrich.py Executable file

@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""
domain_reenrich.py: Re-enriches solo-Reference concepts that domain_remap.py
couldn't fix via subdomain lookup. Reads remap_unknowns.jsonl, calls Gemini
with a lightweight classification-only prompt, updates domain in-place.
Usage:
python3 /opt/recon/scripts/domain_reenrich.py [--workers 16] [--limit N]
Reads: /opt/recon/data/remap_unknowns.jsonl
Writes: domain field in-place in window JSON files
Log: /opt/recon/logs/domain_reenrich.log
"""
import json
import time
import random
import logging
import argparse
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
import google.generativeai as genai
UNKNOWNS_FILE = Path("/opt/recon/data/remap_unknowns.jsonl")
LOG_FILE = Path("/opt/recon/logs/domain_reenrich.log")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[
logging.FileHandler(LOG_FILE),
logging.StreamHandler(),
]
)
log = logging.getLogger("domain_reenrich")
CANONICAL_DOMAINS = [
"Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
"Foundational Skills", "Communications", "Medical", "Food Systems",
"Navigation", "Logistics", "Power Systems", "Leadership",
"Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
]
DOMAIN_SET = set(CANONICAL_DOMAINS)
CLASSIFY_PROMPT = """\
Classify this knowledge concept into one or more domains.
VALID DOMAINS (use ONLY these exact strings, no others):
{domains}
Concept title: {title}
Concept tags: {subdomain}
Concept preview: {content}
Return ONLY valid JSON, no markdown, no explanation:
{{"domain": ["Domain Name"]}}
Rules:
- Use only the domain strings listed above, spelled exactly
- If genuinely multi-domain, assign all that apply
- Never return an empty domain list; pick the closest match
- Medical content, herbs, first aid, veterinary -> Medical
- Food growing, foraging, hunting, livestock -> Sustainment Systems
- Food preservation, canning, storage -> Food Systems
- Solar, wind, batteries, generators -> Power Systems
- Water sourcing, filtration, sanitation -> Water Systems
"""
def load_gemini_keys():
env = Path("/opt/recon/.env")
keys = []
for line in env.read_text().splitlines():
if line.startswith("GEMINI_KEY_"):
keys.append(line.split("=", 1)[1].strip())
return keys
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self._i = 0
self._lock = threading.Lock()
def next(self):
with self._lock:
key = self.keys[self._i % len(self.keys)]
self._i += 1
return key
def classify_concept(title, subdomains, content, key):
prompt = CLASSIFY_PROMPT.format(
domains="\n".join(f" {d}" for d in CANONICAL_DOMAINS),
title=title or "(untitled)",
subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
content=content[:300] if content else "(none)",
)
genai.configure(api_key=key)
model = genai.GenerativeModel(
"gemini-2.0-flash",
generation_config={"response_mime_type": "application/json"}
)
for attempt in range(4):
try:
resp = model.generate_content(prompt)
data = json.loads(resp.text)
domains = [d for d in data.get("domain", []) if d in DOMAIN_SET]
if domains:
return domains
except Exception as e:
err = str(e).lower()
if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
delay = min(5 * (2 ** attempt) + random.uniform(0, 3), 60)
time.sleep(delay)
else:
break
return ["Foundational Skills"] # last-resort fallback
# Several unknowns can point at the same window file. Serialize the
# read-modify-write per file so concurrent workers do not overwrite each
# other's updates.
_file_locks = {}
_file_locks_guard = threading.Lock()


def _lock_for(filepath):
    with _file_locks_guard:
        return _file_locks.setdefault(str(filepath), threading.Lock())


def process_unknown(item, key_rotator):
    filepath = Path(item["filepath"])
    title = item.get("title", "")
    subdomains = item.get("subdomain", [])
    content = item.get("content_preview", "")
    if not filepath.exists():
        return "file_missing"
    # The lock is held across the Gemini call, so items that share a file
    # are processed one at a time; items in different files run in parallel.
    with _lock_for(filepath):
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                concepts = json.load(f)
        except Exception:
            return "read_error"
        if not isinstance(concepts, list):
            return "not_list"
        # Find this concept by title and update its domain
        matched = False
        for concept in concepts:
            if not isinstance(concept, dict):
                continue
            if concept.get("title", "") == title:
                raw = concept.get("domain", [])
                if isinstance(raw, str):
                    raw = [raw]
                # Only re-enrich if still stuck on Reference
                if raw == ["Reference"] or raw == []:
                    key = key_rotator.next()
                    new_domains = classify_concept(title, subdomains, content, key)
                    concept["domain"] = new_domains
                    concept["_reenriched"] = True
                    matched = True
                    break
        if not matched:
            return "already_fixed"
        try:
            with open(filepath, "w", encoding="utf-8") as f:
                json.dump(concepts, f, indent=2, ensure_ascii=False)
        except Exception:
            return "write_error"
    return "ok"
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--workers", type=int, default=16)
parser.add_argument("--limit", type=int, default=None)
args = parser.parse_args()
keys = load_gemini_keys()
if not keys:
log.error("No Gemini keys found in .env")
return
rotator = KeyRotator(keys)
unknowns = []
with open(UNKNOWNS_FILE, "r", encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
unknowns.append(json.loads(line))
if args.limit:
unknowns = unknowns[:args.limit]
total = len(unknowns)
log.info(f"Re-enriching {total:,} concepts | {args.workers} workers | {len(keys)} API keys")
log.info(f"Estimated Gemini Flash cost: ~${total * 0.0004:.2f} (conservative)")
results = defaultdict(int)
lock = threading.Lock()
done = 0
with ThreadPoolExecutor(max_workers=args.workers) as ex:
futures = {ex.submit(process_unknown, item, rotator): item for item in unknowns}
for future in as_completed(futures):
status = future.result()
with lock:
results[status] += 1
done += 1
if done % 5000 == 0:
pct = done / total * 100
log.info(f" Progress: {done:,}/{total:,} ({pct:.1f}%) | {dict(results)}")
time.sleep(0.05)
log.info("── Final Results ─────────────────────────────────────────────")
for status, count in sorted(results.items(), key=lambda x: -x[1]):
log.info(f" {status:<25} {count:>10,}")
log.info(f" Total: {total:,}")
if __name__ == "__main__":
main()
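
`KeyRotator` hands out API keys round-robin behind a lock so that the worker threads spread calls evenly. The sketch below repeats the class in self-contained form and tallies how often each key is picked from a thread pool. The key names, the `count_key_usage` helper and the empty-list guard are assumptions added for illustration and are not part of the script.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class KeyRotator:
    """Thread-safe round-robin over a fixed list of keys."""

    def __init__(self, keys):
        if not keys:
            # Added guard (not in the script): avoids a modulo-by-zero in next()
            raise ValueError("at least one key is required")
        self.keys = list(keys)
        self._i = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            key = self.keys[self._i % len(self.keys)]
            self._i += 1
            return key


def count_key_usage(keys, calls, workers=8):
    """Call next() `calls` times from a thread pool and tally each key."""
    rotator = KeyRotator(keys)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        picked = list(ex.map(lambda _: rotator.next(), range(calls)))
    return Counter(picked)


if __name__ == "__main__":
    # Placeholder key names; real keys come from GEMINI_KEY_* lines in .env
    print(count_key_usage(["key-a", "key-b", "key-c"], calls=300))
```

Because the index is read and incremented under the lock, each key is selected the same number of times whenever the call count is a multiple of the key count, regardless of thread scheduling.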

428
scripts/domain_remap.py Executable file

@ -0,0 +1,428 @@
#!/usr/bin/env python3
"""
domain_remap.py: Fix RECON concept domain classifications without API calls.
What this does:
1. Strips "Reference" from concepts that have other real domains
2. Remaps variant domain spellings to canonical names
3. Reclassifies solo-Reference concepts using their subdomain tags
4. Writes a JSONL file of true unknowns for API re-enrichment
Each window file is a JSON array of concept dicts.
Field names: "domain" (list), "subdomain" (list)
Usage:
python3 /opt/recon/scripts/domain_remap.py --dry-run # report only
python3 /opt/recon/scripts/domain_remap.py # apply fixes
python3 /opt/recon/scripts/domain_remap.py --workers 16
"""
import json
import argparse
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
UNKNOWNS_OUTPUT = Path("/opt/recon/data/remap_unknowns.jsonl")
CANONICAL_DOMAINS = {
"Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
"Foundational Skills", "Communications", "Medical", "Food Systems",
"Navigation", "Logistics", "Power Systems", "Leadership",
"Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
}
# Variant → Canonical mapping
VARIANT_MAP = {
# Defense & Tactics
"Tactical Ops": "Defense & Tactics",
"Tactical_Ops": "Defense & Tactics",
"Tactical Operations": "Defense & Tactics",
"Tactical": "Defense & Tactics",
"Tactical Skills": "Defense & Tactics",
"Tactics": "Defense & Tactics",
"Tactics & Defense": "Defense & Tactics",
"Reconnaissance": "Defense & Tactics",
"Fire Support": "Defense & Tactics",
"Improvised Munitions": "Defense & Tactics",
"Military Intelligence": "Defense & Tactics",
"Military History": "Defense & Tactics",
"Military Engineering": "Defense & Tactics",
# Medical
"Medical Care": "Medical",
"Medical Alternatives": "Medical",
"Medical/Dental": "Medical",
"Medical & Dental": "Medical",
"medical": "Medical",
"Medical Awareness": "Medical",
"Medical Disasters": "Medical",
"Medical Emergency Survival": "Medical",
"Medical Procedures": "Medical",
"Medical Treatment": "Medical",
"Medical Science": "Medical",
"Medical History": "Medical",
"Medical Diagnosis": "Medical",
"Medical Skills": "Medical",
"Medical Supply": "Medical",
"Medical Gear": "Medical",
"Medical Kits": "Medical",
"Medical Logistics": "Logistics",
"Medical First Aid": "Medical",
"Medical Ethics": "Medical",
"Medical Reference Ranges": "Medical",
"Medical andSurgical Hints": "Medical",
"Medical Aspects of Radiation Injury": "Medical",
"Medical Uses": "Medical",
"Medical Care in Developing Countries": "Medical",
"Survival Medicine": "Medical",
"Emergency War Surgery": "Medical",
"First Aid": "Medical",
"First Aid and Life Saving": "Medical",
"Veterinary Medicine": "Medical",
"Veterinary Hygiene": "Medical",
"Veterinary": "Medical",
"Pharmacology": "Medical",
"Public Health": "Medical",
"Health": "Medical",
# Food Systems
"Food_Systems": "Food Systems",
"Food_systems": "Food Systems",
"food_systems": "Food Systems",
"Food Preservation": "Food Systems",
"Food Safety": "Food Systems",
"Food Security": "Food Systems",
"Food & Nutrition": "Food Systems",
"Diet & Nutrition": "Food Systems",
"Culinary Arts": "Food Systems",
"Foodprocessing": "Food Systems",
"Food": "Food Systems",
# Sustainment Systems
"Sustainment_Systems": "Sustainment Systems",
"Agriculture": "Sustainment Systems",
"Agriculture & Natural Resources": "Sustainment Systems",
"Agriculture and Natural Resources": "Sustainment Systems",
"Horticulture": "Sustainment Systems",
"Gardening": "Sustainment Systems",
"Hydroponics": "Sustainment Systems",
"Survival Skills": "Sustainment Systems",
# Foundational Skills
"Foundational_Skills": "Foundational Skills",
"Primitive Living Skills": "Foundational Skills",
"Woodcraft": "Foundational Skills",
"Home Workshop": "Foundational Skills",
"Science": "Foundational Skills",
"Engineering": "Foundational Skills",
"Construction": "Foundational Skills",
"Industrial Processes": "Foundational Skills",
"Machine Technology": "Foundational Skills",
"Training": "Foundational Skills",
"Education": "Foundational Skills",
# Off-Grid Systems
"Off-Grid_Systems": "Off-Grid Systems",
"Appropriate Technology": "Off-Grid Systems",
# Power Systems
"Homebrewed Electricity": "Power Systems",
"Renewable Energy": "Power Systems",
"Renewable Energy FAQs": "Power Systems",
"Alternative Fuels": "Power Systems",
"Power_Systems": "Power Systems",
# Water Systems
"Water_Systems": "Water Systems",
# Community Coordination
"Community_Coordination": "Community Coordination",
"Community_coordination": "Community Coordination",
"Community": "Community Coordination",
# Leadership
"Leadership & Planning": "Leadership",
"Planning": "Leadership",
"Administration": "Leadership",
"Governance": "Leadership",
"Government": "Leadership",
# Communications
"Emergency Communications": "Communications",
# Security
"Security Systems": "Security",
# Logistics
"Transportation": "Logistics",
# Scenario Playbooks
"General Preparedness": "Scenario Playbooks",
"Emergency Preparedness": "Scenario Playbooks",
"Emergency Management": "Scenario Playbooks",
"Wilderness Preparedness": "Scenario Playbooks",
"Urban Preparedness": "Scenario Playbooks",
"Winter Preparedness": "Scenario Playbooks",
# Discard (noise domains)
"Humor": None,
"Recreation": None,
"Business": None,
"Finance": None,
"Economics": None,
"Economics/Finances": None,
"Weird Science": None,
}
# Subdomain keyword → canonical domain (for solo-Reference reclassification)
SUBDOMAIN_MAP = {
"first aid": "Medical",
"emergency care": "Medical",
"emergency medicine": "Medical",
"trauma": "Medical",
"anatomy": "Medical",
"oral rehydration": "Medical",
"ors": "Medical",
"pharmacology": "Medical",
"toxicology": "Medical",
"antidote": "Medical",
"nerve agent": "Defense & Tactics",
"chemical warfare": "Defense & Tactics",
"biological warfare": "Defense & Tactics",
"nbc": "Defense & Tactics",
"infectious disease": "Medical",
"microbiology": "Medical",
"virology": "Medical",
"bacteriology": "Medical",
"pediatric": "Medical",
"surgery": "Medical",
"wound care": "Medical",
"veterinary": "Medical",
"dental": "Medical",
"dentistry": "Medical",
"herbal": "Medical",
"medicinal plant": "Medical",
"medicinal herb": "Medical",
"herbalism": "Medical",
"food preservation": "Food Systems",
"canning": "Food Systems",
"fermentation": "Food Systems",
"food storage": "Food Systems",
"food safety": "Food Systems",
"cooking": "Food Systems",
"food processing": "Food Systems",
"agriculture": "Sustainment Systems",
"soil": "Sustainment Systems",
"permaculture": "Sustainment Systems",
"agroforestry": "Sustainment Systems",
"livestock": "Sustainment Systems",
"animal husbandry": "Sustainment Systems",
"beekeeping": "Sustainment Systems",
"foraging": "Sustainment Systems",
"hunting": "Sustainment Systems",
"fishing": "Sustainment Systems",
"gardening": "Sustainment Systems",
"mycology": "Sustainment Systems",
"mushroom": "Sustainment Systems",
"water purification": "Water Systems",
"water filtration": "Water Systems",
"water sanitation": "Water Systems",
"water disinfection": "Water Systems",
"water storage": "Water Systems",
"well construction": "Water Systems",
"rainwater": "Water Systems",
"solar": "Power Systems",
"wind turbine": "Power Systems",
"battery": "Power Systems",
"batteries": "Power Systems",
"generator": "Power Systems",
"photovoltaic": "Power Systems",
"charge controller": "Power Systems",
"inverter": "Power Systems",
"biogas": "Off-Grid Systems",
"biomass": "Off-Grid Systems",
"wood gasification": "Off-Grid Systems",
"rocket stove": "Off-Grid Systems",
"mechanical system": "Off-Grid Systems",
"power transmission": "Off-Grid Systems",
"radio": "Communications",
"ham radio": "Communications",
"amateur radio": "Communications",
"antenna": "Communications",
"meshtastic": "Communications",
"encryption": "Communications",
"navigation": "Navigation",
"celestial navigation": "Navigation",
"land navigation": "Navigation",
"map reading": "Navigation",
"compass": "Navigation",
"pottery": "Foundational Skills",
"ceramics": "Foundational Skills",
"blacksmithing": "Foundational Skills",
"woodworking": "Foundational Skills",
"leatherwork": "Foundational Skills",
"textile": "Foundational Skills",
"masonry": "Foundational Skills",
"metalworking": "Foundational Skills",
"historical technology": "Foundational Skills",
"weapons": "Defense & Tactics",
"firearms": "Defense & Tactics",
"ballistics": "Defense & Tactics",
"tactics": "Defense & Tactics",
"perimeter": "Security",
"surveillance": "Security",
"supply chain": "Logistics",
"logistics": "Logistics",
"leadership": "Leadership",
"governance": "Leadership",
"community": "Community Coordination",
"emergency preparedness": "Scenario Playbooks",
"disaster": "Scenario Playbooks",
"evacuation": "Scenario Playbooks",
}
def remap_domains(domains):
"""Remap a list of domain strings — variants to canonical, strip Reference."""
result = set()
for d in domains:
if d == "Reference":
continue
if d in CANONICAL_DOMAINS:
result.add(d)
elif d in VARIANT_MAP:
mapped = VARIANT_MAP[d]
if mapped: # None means discard
result.add(mapped)
# Unknown non-canonical domains: drop them
return list(result)
def classify_by_subdomain(subdomains):
"""Try to infer canonical domain(s) from subdomain keyword matching."""
found = set()
for sd in subdomains:
sd_lower = sd.lower().strip()
for key, domain in SUBDOMAIN_MAP.items():
if key in sd_lower:
found.add(domain)
return list(found) if found else None
def process_window_file(filepath, dry_run):
"""Process one window JSON file (array of concepts). Returns per-file stats."""
stats = defaultdict(int)
unknowns = []
try:
with open(filepath, "r", encoding="utf-8") as f:
concepts = json.load(f)
except Exception:
# Unreadable or malformed file: count it and move on
return {"parse_error": 1}, []
if not isinstance(concepts, list):
return {"skip_not_list": 1}, []
modified = False
for concept in concepts:
if not isinstance(concept, dict):
continue
raw_domains = concept.get("domain", [])
if isinstance(raw_domains, str):
raw_domains = [raw_domains]
subdomains = concept.get("subdomain", [])
if isinstance(subdomains, str):
subdomains = [subdomains]
has_reference = "Reference" in raw_domains
non_reference = [d for d in raw_domains if d != "Reference"]
if not has_reference:
# No Reference — just fix any variant names
remapped = remap_domains(raw_domains)
if set(remapped) != set(raw_domains):
concept["domain"] = remapped
modified = True
stats["variant_remapped"] += 1
else:
stats["no_change"] += 1
continue
# Has Reference — what else does it have?
remapped_others = remap_domains(non_reference)
if remapped_others:
# Reference + real domains: drop Reference, keep the rest
concept["domain"] = remapped_others
modified = True
stats["reference_stripped"] += 1
continue
# Solo Reference (or Reference + only-noise): try subdomain lookup
inferred = classify_by_subdomain(subdomains)
if inferred:
concept["domain"] = inferred
concept["_reclassified_from_reference"] = True
modified = True
stats["subdomain_reclassified"] += 1
continue
# True unknown — needs API re-enrichment
unknowns.append({
"filepath": str(filepath),
"title": concept.get("title", ""),
"subdomain": subdomains,
"content_preview": str(concept.get("content", concept.get("summary", "")))[:300],
})
stats["needs_enrichment"] += 1
if modified and not dry_run:
with open(filepath, "w", encoding="utf-8") as f:
json.dump(concepts, f, indent=2, ensure_ascii=False)
return dict(stats), unknowns
def main():
parser = argparse.ArgumentParser(description="Remap RECON concept domains")
parser.add_argument("--dry-run", action="store_true", help="Report without writing")
parser.add_argument("--workers", type=int, default=16)
args = parser.parse_args()
print(f"[REMAP] Scanning {CONCEPTS_DIR}")
print(f"[REMAP] Dry run: {args.dry_run} | Workers: {args.workers}")
window_files = list(CONCEPTS_DIR.rglob("window_*.json"))
print(f"[REMAP] Found {len(window_files):,} window files")
total_stats = defaultdict(int)
all_unknowns = []
lock = threading.Lock()
done = 0
with ThreadPoolExecutor(max_workers=args.workers) as ex:
futures = {ex.submit(process_window_file, f, args.dry_run): f for f in window_files}
for future in as_completed(futures):
file_stats, unknowns = future.result()
with lock:
for k, v in file_stats.items():
total_stats[k] += v
all_unknowns.extend(unknowns)
done += 1
if done % 5000 == 0:
print(f" {done:,}/{len(window_files):,} files processed...")
print("\n── Results ─────────────────────────────────────────────────")
for status, count in sorted(total_stats.items(), key=lambda x: -x[1]):
print(f" {status:<35} {count:>10,}")
total_concepts = sum(total_stats.values())
print(f"\n Total concepts processed: {total_concepts:>10,}")
print(f" True unknowns for re-enrichment:{len(all_unknowns):>10,}")
if not args.dry_run and all_unknowns:
with open(UNKNOWNS_OUTPUT, "w", encoding="utf-8") as f:
for item in all_unknowns:
f.write(json.dumps(item) + "\n")
print(f"\n Unknowns written to: {UNKNOWNS_OUTPUT}")
if args.dry_run:
print("\n [DRY RUN] No files were modified.")
if __name__ == "__main__":
main()
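
A note on `classify_by_subdomain` above: it matches keywords by substring, so a short key such as "ors" also fires inside unrelated words ("motors", "sensors"). The sketch below contrasts that behavior with a whole-word variant. `SAMPLE_MAP` is a hypothetical three-entry excerpt of `SUBDOMAIN_MAP`, and `classify_word_boundary` is an assumption for illustration, not part of the script.

```python
import re

# Hypothetical excerpt of SUBDOMAIN_MAP, for illustration only.
SAMPLE_MAP = {
    "ors": "Medical",
    "radio": "Communications",
    "solar": "Power Systems",
}


def classify_substring(subdomains, keyword_map):
    """Substring matching, mirroring classify_by_subdomain."""
    found = set()
    for sd in subdomains:
        sd_lower = sd.lower().strip()
        for key, domain in keyword_map.items():
            if key in sd_lower:
                found.add(domain)
    return sorted(found) if found else None


def classify_word_boundary(subdomains, keyword_map):
    """Illustrative variant: a keyword only counts as a whole word or phrase."""
    patterns = [
        (re.compile(r"\b" + re.escape(key) + r"\b"), domain)
        for key, domain in keyword_map.items()
    ]
    found = set()
    for sd in subdomains:
        sd_lower = sd.lower().strip()
        for pattern, domain in patterns:
            if pattern.search(sd_lower):
                found.add(domain)
    return sorted(found) if found else None


if __name__ == "__main__":
    for label in ["Electric motors", "ORS preparation", "Radiology"]:
        print(label,
              classify_substring([label], SAMPLE_MAP),
              classify_word_boundary([label], SAMPLE_MAP))
```

The trade-off is that whole-word matching no longer catches plurals or compounds (for example "mushrooms" against the key "mushroom"), so such keys would need explicit plural entries, as the map already does for "battery" and "batteries".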

469
scripts/migrate_domains.py Normal file
View file

@ -0,0 +1,469 @@
#!/usr/bin/env python3
"""
migrate_domains.py: Reclassify 5 legacy domains via Gemini Flash.
Targets: Sustainment Systems, Off-Grid Systems, Defense & Tactics,
Community Coordination, Leadership
Maps each to one of the 18 approved domains. 16 parallel workers,
checkpoint file, crash-safe, incremental saves, progress every 5,000.
Usage:
python3 /opt/recon/scripts/migrate_domains.py [--dry-run] [--workers 16] [--limit N]
"""
import json
import time
import random
import logging
import argparse
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
import google.generativeai as genai
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, MatchValue, Filter
# Suppress noisy HTTP logs
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("qdrant_client").setLevel(logging.WARNING)
LOG_FILE = Path("/opt/recon/logs/migrate_domains.log")
CHECKPOINT_FILE = Path("/opt/recon/data/migrate_domains_checkpoint.json")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)
log = logging.getLogger("migrate_domains")
# ── Constants ───────────────────────────────────────────────────────────────
VALID_DOMAINS = {
'Agriculture & Livestock', 'Civil Organization', 'Communications',
'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
'Vehicles', 'Water Systems', 'Wilderness Skills',
}
SOURCE_DOMAINS = {
'Sustainment Systems', 'Off-Grid Systems', 'Defense & Tactics',
'Community Coordination', 'Leadership',
}
DOMAIN_LIST_STR = ', '.join(sorted(VALID_DOMAINS))
CLASSIFY_PROMPT = """\
Classify this knowledge concept into exactly one domain from this list:
Agriculture & Livestock, Civil Organization, Communications, Food Systems, Foundational Skills, Logistics, Medical, Navigation, Operations, Power Systems, Preservation & Storage, Security, Shelter & Construction, Technology, Tools & Equipment, Vehicles, Water Systems, Wilderness Skills
Return ONLY the exact domain string, nothing else. No explanation, no punctuation, no quotes.
Content: {content}
Summary: {summary}
Subdomain: {subdomain}
"""
DOMAIN_FALLBACK = 'Foundational Skills'
# ── Key management ──────────────────────────────────────────────────────────
def load_gemini_keys():
keys = []
env_path = Path("/opt/recon/.env")
if not env_path.exists():
raise FileNotFoundError(f"{env_path} not found")
for line in env_path.read_text().splitlines():
if line.startswith("GEMINI_KEY_"):
keys.append(line.split("=", 1)[1].strip())
if not keys:
raise ValueError("No GEMINI_KEY_* found in .env")
return keys
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self._i = 0
self._lock = threading.Lock()
def next(self):
with self._lock:
key = self.keys[self._i % len(self.keys)]
self._i += 1
return key
# ── Classification ──────────────────────────────────────────────────────────
def classify_domain(content, summary, subdomains, key):
"""Call Gemini Flash to classify into one of 18 domains."""
prompt = CLASSIFY_PROMPT.format(
content=str(content)[:400] if content else "(none)",
summary=str(summary)[:200] if summary else "(none)",
subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
)
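# NOTE: genai.configure() sets a process-wide default key, so with several
# worker threads the key actually used may differ from the one passed in.
genai.configure(api_key=key)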
model = genai.GenerativeModel(
"gemini-2.0-flash",
generation_config={"response_mime_type": "text/plain"}
)
for retry in range(4):
try:
resp = model.generate_content(prompt)
value = resp.text.strip().strip('"').strip("'").strip()
if value in VALID_DOMAINS:
return value
# Try case-insensitive match
for valid in VALID_DOMAINS:
if value.lower() == valid.lower():
return valid
# Partial match — Gemini sometimes returns with trailing period
clean = value.rstrip('.')
if clean in VALID_DOMAINS:
return clean
# Invalid — retry with stricter prompt
if retry < 3:
prompt = (
f"Your previous response '{value}' was invalid. "
f"You must return ONLY one of these exact strings: {DOMAIN_LIST_STR}\n\n"
f"Content: {str(content)[:300]}\n"
f"Return ONLY the exact domain string."
)
continue
except Exception as e:
err = str(e).lower()
if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
else:
log.warning(f"Gemini error (attempt {retry+1}): {e}")
if retry >= 2:
break
return heuristic_fallback(content, summary, subdomains)
def heuristic_fallback(content, summary, subdomains):
"""Last-resort heuristic when Gemini fails or returns invalid."""
text = f"{summary or ''} {' '.join(subdomains or [])} {str(content or '')[:200]}".lower()
mapping = [
(["farming", "agriculture", "livestock", "animal husbandry", "poultry",
"cattle", "crop", "soil fertility", "irrigation for crops"], "Agriculture & Livestock"),
(["foraging", "hunting", "fishing", "bushcraft", "wilderness", "survival skill",
"fire starting", "shelter building", "trapping", "tracking"], "Wilderness Skills"),
(["food preservation", "canning", "dehydration", "smoking", "pickling",
"fermentation", "food storage", "freeze dry"], "Preservation & Storage"),
(["cooking", "recipe", "nutrition", "food preparation", "baking",
"food production", "meal"], "Food Systems"),
(["first aid", "medical", "trauma", "surgery", "anatomy", "pharmacology",
"wound", "triage", "diagnosis", "disease", "infection", "veterinary",
"herbal medicine", "medicinal plant"], "Medical"),
(["radio", "antenna", "ham radio", "communication", "signal",
"networking", "meshtastic", "comms"], "Communications"),
(["solar", "battery", "generator", "wind turbine", "hydroelectric",
"power grid", "inverter", "photovoltaic", "electricity"], "Power Systems"),
(["water purification", "water filter", "well", "rainwater",
"sanitation", "water treatment", "desalination"], "Water Systems"),
(["navigation", "compass", "map reading", "gps", "celestial",
"orienteering", "land nav"], "Navigation"),
(["security", "opsec", "perimeter", "surveillance", "threat",
"intrusion detection", "physical security"], "Security"),
(["vehicle", "engine", "motor", "aircraft", "boat", "motorcycle",
"truck", "maintenance", "diesel", "transmission"], "Vehicles"),
(["tool", "equipment", "wrench", "saw", "drill", "hammer",
"hand tool", "power tool", "blade", "sharpening"], "Tools & Equipment"),
(["construction", "building", "shelter", "carpentry", "masonry",
"roofing", "concrete", "framing", "plumbing"], "Shelter & Construction"),
(["electronics", "computer", "software", "circuit", "programming",
"technology", "digital", "engineering"], "Technology"),
(["supply chain", "logistics", "transport", "distribution",
"inventory", "supply", "stockpile"], "Logistics"),
(["governance", "civil", "community", "administration", "organization",
"council", "democratic", "municipal"], "Civil Organization"),
(["tactics", "combat", "military", "mission", "patrol", "ambush",
"defensive position", "fire team", "maneuver", "engagement",
"search and rescue", "sar", "reconnaissance"], "Operations"),
]
for keywords, domain in mapping:
if any(kw in text for kw in keywords):
return domain
return DOMAIN_FALLBACK
# ── Checkpoint ──────────────────────────────────────────────────────────────
class Checkpoint:
"""Thread-safe checkpoint tracker for crash recovery."""
def __init__(self, path):
self.path = path
self._lock = threading.Lock()
self._completed = set()
self._dirty = 0
self._load()
def _load(self):
if self.path.exists():
try:
data = json.loads(self.path.read_text())
self._completed = set(data.get("completed", []))
log.info(f"Loaded checkpoint: {len(self._completed):,} completed points")
except Exception:
self._completed = set()
def is_done(self, point_id):
return point_id in self._completed
def mark_done(self, point_id):
with self._lock:
self._completed.add(point_id)
self._dirty += 1
if self._dirty >= 1000:
self._flush()
def _flush(self):
tmp = self.path.with_suffix('.tmp')
tmp.write_text(json.dumps({"completed": list(self._completed)}))
tmp.rename(self.path)
self._dirty = 0
def flush(self):
with self._lock:
self._flush()
def count(self):
return len(self._completed)
# ── Per-point processing ───────────────────────────────────────────────────
def process_point(point, qdrant, collection, key_rotator, checkpoint, dry_run, stats):
point_id = point.id
if checkpoint.is_done(point_id):
return "skipped"
payload = point.payload
content = payload.get("content", payload.get("summary", ""))
summary = payload.get("summary", "")
subdomains = payload.get("subdomain", [])
if isinstance(subdomains, str):
subdomains = [subdomains]
old_domain = payload.get("domain", [])
if isinstance(old_domain, list):
old_domain_str = old_domain[0] if old_domain else "(empty)"
else:
old_domain_str = str(old_domain)
key = key_rotator.next()
new_domain = classify_domain(content, summary, subdomains, key)
# Track the mapping
stats_key = f"{old_domain_str} -> {new_domain}"
stats[stats_key] = stats.get(stats_key, 0) + 1
if dry_run:
return f"would: {old_domain_str} -> {new_domain}"
# Write new domain as single string
qdrant.set_payload(
collection_name=collection,
payload={"domain": new_domain},
points=[point_id],
)
checkpoint.mark_done(point_id)
return "ok"
# ── Main loop ───────────────────────────────────────────────────────────────
SCROLL_BATCH = 5000
def count_source_domains(qdrant, collection):
"""Count vectors with source domains."""
counts = {}
for domain in SOURCE_DOMAINS:
result = qdrant.count(
collection_name=collection,
count_filter=Filter(
must=[FieldCondition(key="domain", match=MatchValue(value=domain))]
),
exact=True,
)
counts[domain] = result.count
return counts
def stream_and_process(qdrant, collection, rotator, checkpoint, workers, limit=None, dry_run=False):
"""Scroll source domains in batches, process with thread pool."""
lock = threading.Lock()
done = 0
skipped_checkpoint = 0
start = time.time()
stats = {} # shared mapping stats
for source_domain in sorted(SOURCE_DOMAINS):
log.info(f"\n--- Processing domain: {source_domain} ---")
offset = None
domain_done = 0
while True:
scroll_results, offset = qdrant.scroll(
collection_name=collection,
limit=SCROLL_BATCH,
with_payload=True,
with_vectors=False,
offset=offset,
scroll_filter=Filter(
must=[FieldCondition(key="domain", match=MatchValue(value=source_domain))]
),
)
if not scroll_results:
if offset is None:
break
continue
# Filter already checkpointed
pending = [p for p in scroll_results if not checkpoint.is_done(p.id)]
skipped_checkpoint += len(scroll_results) - len(pending)
if pending:
with ThreadPoolExecutor(max_workers=workers) as ex:
futures = {
ex.submit(process_point, p, qdrant, collection, rotator,
checkpoint, dry_run, stats): p
for p in pending
}
for future in as_completed(futures):
try:
future.result()
except Exception as e:
log.error(f"Worker error: {e}")
with lock:
done += 1
domain_done += 1
if done % 5000 == 0:
elapsed = time.time() - start
rate = done / elapsed * 60
log.info(f" {done:,} done | {rate:.0f}/min | "
f"elapsed {elapsed/60:.1f}min")
checkpoint.flush()
time.sleep(0.02)
if limit and done >= limit:
break
if offset is None:
break
log.info(f" {source_domain}: {domain_done:,} vectors processed")
if limit and done >= limit:
break
checkpoint.flush()
return done, skipped_checkpoint, stats, start
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true",
help="Classify 20 samples without writing")
parser.add_argument("--workers", type=int, default=16)
parser.add_argument("--limit", type=int, default=None)
args = parser.parse_args()
keys = load_gemini_keys()
rotator = KeyRotator(keys)
qdrant = QdrantClient(host="localhost", port=6333, timeout=120)
collection = "recon_knowledge"
checkpoint = Checkpoint(CHECKPOINT_FILE)
# Count source domains
counts = count_source_domains(qdrant, collection)
total_source = sum(counts.values())
pre_checkpoint = checkpoint.count()
log.info(f"Source domain counts:")
for domain, count in sorted(counts.items(), key=lambda x: -x[1]):
log.info(f" {domain:30s} {count:>10,}")
log.info(f" {'TOTAL':30s} {total_source:>10,}")
log.info(f"Checkpoint: {pre_checkpoint:,} already completed")
log.info(f"Workers: {args.workers} | Keys: {len(keys)}")
# Cost estimate. Points migrated in earlier runs no longer carry a source
# domain, so total_source already excludes them.
remaining = total_source
input_tokens = remaining * 200
output_tokens = remaining * 5
input_cost = input_tokens / 1_000_000 * 0.10
output_cost = output_tokens / 1_000_000 * 0.40
total_cost = input_cost + output_cost
log.info(f"\nEstimated Gemini 2.0 Flash cost:")
log.info(f" Vectors to process: {remaining:,}")
log.info(f" Input: ~{input_tokens/1_000_000:.1f}M tokens = ${input_cost:.2f}")
log.info(f" Output: ~{output_tokens/1_000_000:.1f}M tokens = ${output_cost:.2f}")
log.info(f" TOTAL: ~${total_cost:.2f}")
if args.dry_run:
log.info(f"\nDRY RUN: classifying 20 samples...\n")
for source_domain in sorted(SOURCE_DOMAINS):
scroll_results, _ = qdrant.scroll(
collection_name=collection,
limit=5,
with_payload=True,
with_vectors=False,
scroll_filter=Filter(
must=[FieldCondition(key="domain", match=MatchValue(value=source_domain))]
),
)
for p in scroll_results[:4]:
pay = p.payload
title = pay.get("title", "(no title)")
content = pay.get("content", pay.get("summary", ""))
summary = pay.get("summary", "")
subdomains = pay.get("subdomain", [])
if isinstance(subdomains, str):
subdomains = [subdomains]
key = rotator.next()
new_domain = classify_domain(content, summary, subdomains, key)
old = pay.get("domain", [])
if isinstance(old, list):
old = old[0] if old else "?"
print(f" [{old:25s}] -> [{new_domain:25s}] {title[:60]}")
print(f"\nDRY RUN complete. ~{remaining:,} vectors would be migrated.")
print(f"Estimated cost: ~${total_cost:.2f}")
return
# ── Full migration ──────────────────────────────────────────────────
log.info(f"\nStarting full migration...")
done, skipped_ckpt, stats, start = stream_and_process(
qdrant, collection, rotator, checkpoint, args.workers, args.limit
)
elapsed = time.time() - start
log.info(f"\n{'='*70}")
log.info(f"MIGRATION COMPLETE in {elapsed/60:.1f}min:")
log.info(f" Processed: {done:,}")
log.info(f" Skipped (checkpoint): {skipped_ckpt:,}")
log.info(f" Rate: {done/elapsed*60:.0f}/min")
log.info(f"\nMapping distribution:")
for mapping, count in sorted(stats.items(), key=lambda x: -x[1])[:30]:
log.info(f" {mapping:<55s} {count:>8,}")
if __name__ == "__main__":
main()
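
A note on the response handling in `classify_domain` above: the exact match, the case-insensitive match, and the trailing-period check run as separate steps, so a reply such as `"medical."` (lowercase and with a period) passes none of them and triggers a retry. A minimal sketch of a single normalization step follows. `normalize_domain` and `SAMPLE_DOMAINS` are hypothetical names introduced for illustration, not part of the script.

```python
# Hypothetical subset of VALID_DOMAINS, for illustration only.
SAMPLE_DOMAINS = {"Medical", "Power Systems", "Tools & Equipment"}


def normalize_domain(raw, valid_domains):
    """Return the canonical domain matching raw, or None if nothing matches."""
    # Strip whitespace, quotes, and periods from both ends in one pass.
    cleaned = raw.strip(" \t\r\n\"'.")
    # Compare case-insensitively but return the canonical spelling.
    lookup = {d.lower(): d for d in valid_domains}
    return lookup.get(cleaned.lower())


if __name__ == "__main__":
    for reply in ['"medical."', " Power Systems\n", "Weapons"]:
        print(repr(reply), "->", normalize_domain(reply, SAMPLE_DOMAINS))
```

A `None` result would still go through the stricter retry prompt and, eventually, `heuristic_fallback`, so this only reduces avoidable retries rather than changing the fallback path.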

469
scripts/migrate_skill_level.py Executable file
View file

@ -0,0 +1,469 @@
#!/usr/bin/env python3
"""
migrate_skill_level.py: Replaces skill_level with knowledge_type + complexity
on all vectors in Qdrant and on-disk concept JSONs.
Scrolls entire collection, classifies each concept via Gemini Flash,
writes knowledge_type + complexity, deletes skill_level.
Crash-safe: completed point IDs tracked in checkpoint file.
Usage:
python3 /opt/recon/scripts/migrate_skill_level.py [--dry-run] [--workers 16] [--limit N]
"""
import json
import time
import random
import logging
import argparse
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
import google.generativeai as genai
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, MatchValue, Filter
import sys
sys.path.insert(0, '/opt/recon')
from lib.utils import get_config, setup_logging
# Suppress noisy HTTP request logging from qdrant_client/httpx
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("qdrant_client").setLevel(logging.WARNING)
LOG_FILE = Path("/opt/recon/logs/migrate_skill_level.log")
CHECKPOINT_FILE = Path("/opt/recon/data/migrate_skill_level_checkpoint.json")
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)
log = logging.getLogger("migrate_skill_level")
# ── Prompt ──────────────────────────────────────────────────────────────────
CLASSIFY_PROMPT = """\
You are a knowledge classification engine. Given a concept, assign two fields:
knowledge_type - what KIND of knowledge this is:
foundational - concepts, definitions, theory, background knowledge, explanations of how things work
procedural - step-by-step techniques, instructions, how-to skills, methods you execute
operational - application under real conditions, decision-making, mission execution, judgment calls in context
complexity - how much prior knowledge is needed:
basic - requires little or no prior knowledge, introductory material, simple concepts
intermediate - requires some domain familiarity, assumes foundational knowledge is in place
advanced - requires significant experience or expertise, high-stakes or highly technical material
EXAMPLES:
- "Needle chest decompression procedure" -> procedural, advanced
- "What is soil texture and why does it matter" -> foundational, basic
- "Coordinating a fire team withdrawal under contact" -> operational, advanced
- "How to start a campfire with a ferro rod" -> procedural, basic
- "Antenna gain and radiation patterns explained" -> foundational, intermediate
- "Triage decision-making in a mass casualty event" -> operational, advanced
- "Step-by-step: building a Dakota fire hole" -> procedural, intermediate
- "Understanding the water cycle" -> foundational, basic
Concept title: {title}
Concept domain: {domain}
Concept subdomain: {subdomain}
Concept content: {content}
Return ONLY valid JSON, no markdown, no explanation:
{{"knowledge_type": "foundational|procedural|operational", "complexity": "basic|intermediate|advanced"}}
"""
VALID_KNOWLEDGE_TYPES = {"foundational", "procedural", "operational"}
VALID_COMPLEXITIES = {"basic", "intermediate", "advanced"}
# ── Key management ──────────────────────────────────────────────────────────
def load_gemini_keys():
keys = []
for line in Path("/opt/recon/.env").read_text().splitlines():
if line.startswith("GEMINI_KEY_"):
keys.append(line.split("=", 1)[1].strip())
if not keys:
raise ValueError("No GEMINI_KEY_* found in .env")
return keys
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self._i = 0
self._lock = threading.Lock()
def next(self):
with self._lock:
key = self.keys[self._i % len(self.keys)]
self._i += 1
return key
# ── Classification ──────────────────────────────────────────────────────────
def classify(title, domains, subdomains, content, key):
"""Call Gemini Flash to classify knowledge_type + complexity."""
prompt = CLASSIFY_PROMPT.format(
title=title or "(untitled)",
domain=", ".join(domains[:5]) if domains else "(none)",
subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
content=str(content)[:400] if content else "(none)",
)
genai.configure(api_key=key)
model = genai.GenerativeModel(
"gemini-2.0-flash",
generation_config={"response_mime_type": "application/json"}
)
for retry in range(4):
try:
resp = model.generate_content(prompt)
data = json.loads(resp.text)
kt = data.get("knowledge_type", "").lower().strip()
cx = data.get("complexity", "").lower().strip()
if kt in VALID_KNOWLEDGE_TYPES and cx in VALID_COMPLEXITIES:
return kt, cx
# Invalid values — retry once
if retry == 0:
continue
except Exception as e:
err = str(e).lower()
if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
else:
break
# Fallback heuristic based on keywords in the title, subdomains, and content
return heuristic_fallback(title, subdomains, content)
def heuristic_fallback(title, subdomains, content):
"""Last-resort heuristic when Gemini fails."""
text = f"{title} {' '.join(subdomains)} {str(content)[:200]}".lower()
# Knowledge type heuristic
procedural_signals = ["how to", "step-by-step", "procedure", "instructions",
"method", "technique", "build", "make", "construct",
"install", "assemble", "recipe", "prepare"]
operational_signals = ["decision", "coordinate", "execute", "deploy",
"mission", "triage", "under fire", "in the field",
"real-world", "scenario", "assessment", "plan"]
if any(s in text for s in operational_signals):
kt = "operational"
elif any(s in text for s in procedural_signals):
kt = "procedural"
else:
kt = "foundational"
# Complexity heuristic — default intermediate (safest middle ground)
cx = "intermediate"
basic_signals = ["introduction", "what is", "basic", "beginner", "overview",
"definition", "simple", "fundamentals"]
advanced_signals = ["advanced", "expert", "complex", "critical", "high-stakes",
"surgery", "trauma", "tactical", "classified"]
if any(s in text for s in basic_signals):
cx = "basic"
elif any(s in text for s in advanced_signals):
cx = "advanced"
return kt, cx
# ── Checkpoint management ───────────────────────────────────────────────────
class Checkpoint:
"""Thread-safe checkpoint tracker for crash recovery."""
def __init__(self, path):
self.path = path
self._lock = threading.Lock()
self._completed = set()
self._dirty = 0
self._load()
def _load(self):
if self.path.exists():
try:
data = json.loads(self.path.read_text())
self._completed = set(data.get("completed", []))
log.info(f"Loaded checkpoint: {len(self._completed):,} completed points")
except Exception:
self._completed = set()
def is_done(self, point_id):
return point_id in self._completed
def mark_done(self, point_id):
with self._lock:
self._completed.add(point_id)
self._dirty += 1
if self._dirty >= 1000:
self._flush()
def _flush(self):
tmp = self.path.with_suffix('.tmp')
tmp.write_text(json.dumps({"completed": list(self._completed)}))
tmp.rename(self.path)
self._dirty = 0
def flush(self):
with self._lock:
self._flush()
def count(self):
return len(self._completed)
# ── Concept JSON update ────────────────────────────────────────────────────
def update_concept_json(doc_hash, title, knowledge_type, complexity):
"""Update on-disk concept JSON: add knowledge_type + complexity, remove skill_level."""
doc_dir = CONCEPTS_DIR / doc_hash
if not doc_dir.exists():
return False
for wf in doc_dir.glob("window_*.json"):
try:
with open(wf, "r", encoding="utf-8") as f:
concepts = json.load(f)
changed = False
for c in concepts:
if not isinstance(c, dict):
continue
if c.get("title") == title:
c["knowledge_type"] = knowledge_type
c["complexity"] = complexity
c.pop("skill_level", None)
changed = True
if changed:
with open(wf, "w", encoding="utf-8") as f:
json.dump(concepts, f, indent=2, ensure_ascii=False)
return True
except Exception as e:
log.warning(f"Could not update {wf}: {e}")
return False
# ── Per-point processing ───────────────────────────────────────────────────
def process_point(point, qdrant, collection, key_rotator, checkpoint, dry_run):
point_id = point.id
if checkpoint.is_done(point_id):
return "skipped"
payload = point.payload
title = payload.get("title", "")
domains = payload.get("domain", [])
if isinstance(domains, str):
domains = [domains]
subdomains = payload.get("subdomain", [])
if isinstance(subdomains, str):
subdomains = [subdomains]
content = payload.get("content", payload.get("summary", ""))
doc_hash = payload.get("doc_hash", "")
key = key_rotator.next()
knowledge_type, complexity = classify(title, domains, subdomains, content, key)
if dry_run:
return f"kt={knowledge_type}, cx={complexity}"
# Write new fields
qdrant.set_payload(
collection_name=collection,
payload={"knowledge_type": knowledge_type, "complexity": complexity},
points=[point_id],
)
# Delete old field
qdrant.delete_payload(
collection_name=collection,
keys=["skill_level"],
points=[point_id],
)
# Update JSON on disk
if doc_hash:
update_concept_json(doc_hash, title, knowledge_type, complexity)
checkpoint.mark_done(point_id)
return "ok"
# ── Streaming batch processor ───────────────────────────────────────────────
SCROLL_BATCH = 5000 # vectors per scroll batch — keeps memory bounded (~50MB)
def count_collection(qdrant, collection):
"""Quick count of total vectors via collection info."""
info = qdrant.get_collection(collection)
return info.points_count
def stream_and_process(qdrant, collection, rotator, checkpoint, workers, limit=None):
"""Scroll in batches, process each batch with thread pool, then discard.
Memory-bounded: only holds SCROLL_BATCH payloads at any time (~50MB).
"""
results_agg = defaultdict(int)
lock = threading.Lock()
done = 0
skipped_checkpoint = 0
skipped_no_skill = 0
total_estimate = count_collection(qdrant, collection)
start = time.time()
offset = None
batch_num = 0
while True:
batch_num += 1
scroll_results, offset = qdrant.scroll(
collection_name=collection,
limit=SCROLL_BATCH,
with_payload=True,
with_vectors=False,
offset=offset,
)
# Filter to points needing migration
pending = []
for p in scroll_results:
if "skill_level" not in p.payload:
skipped_no_skill += 1
continue
if checkpoint.is_done(p.id):
skipped_checkpoint += 1
continue
pending.append(p)
if pending:
with ThreadPoolExecutor(max_workers=workers) as ex:
futures = {
ex.submit(process_point, p, qdrant, collection, rotator, checkpoint, False): p
for p in pending
}
for future in as_completed(futures):
try:
status = future.result()
except Exception as e:
status = f"error: {str(e)[:80]}"
log.error(f"Worker error: {e}")
with lock:
results_agg[status] += 1
done += 1
if done % 5000 == 0:
elapsed = time.time() - start
rate = done / elapsed * 60
remaining = total_estimate - done - skipped_checkpoint - skipped_no_skill
eta = remaining / (done / elapsed) / 60 if done > 0 else 0
log.info(f" {done:,} done | {rate:.0f}/min | "
f"ETA ~{eta:.0f}min | {dict(results_agg)}")
checkpoint.flush()
time.sleep(0.02)
if limit and done >= limit:
break
if offset is None:
break
checkpoint.flush()
return done, skipped_checkpoint, skipped_no_skill, results_agg, start
# ── Main ────────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true",
help="Classify 20 samples without writing anything")
parser.add_argument("--workers", type=int, default=16)
parser.add_argument("--limit", type=int, default=None)
args = parser.parse_args()
config = get_config()
keys = load_gemini_keys()
rotator = KeyRotator(keys)
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=120
)
collection = config['vector_db']['collection']
checkpoint = Checkpoint(CHECKPOINT_FILE)
total_vectors = count_collection(qdrant, collection)
pre_checkpoint = checkpoint.count()
log.info(f"Collection has {total_vectors:,} vectors")
log.info(f"Checkpoint: {pre_checkpoint:,} already completed")
log.info(f"Workers: {args.workers} | Keys: {len(keys)} | Dry run: {args.dry_run}")
log.info(f"Estimated Gemini Flash cost: ~${(total_vectors - pre_checkpoint) * 0.0004:.2f}")
log.info(f"Streaming in batches of {SCROLL_BATCH:,} (memory-bounded)")
if args.dry_run:
# Scroll one batch, classify 20 diverse samples
log.info(f"\nDRY RUN: classifying 20 samples...\n")
scroll_results, _ = qdrant.scroll(
collection_name=collection,
limit=200,
with_payload=True,
with_vectors=False,
)
samples = []
seen_domains = set()
for p in scroll_results:
if "skill_level" not in p.payload:
continue
domains = p.payload.get("domain", [])
if isinstance(domains, str):
domains = [domains]
d_key = tuple(sorted(domains[:2]))
if d_key not in seen_domains:
samples.append(p)
seen_domains.add(d_key)
if len(samples) >= 20:
break
for i, p in enumerate(samples, 1):
pay = p.payload
title = pay.get("title", "(no title)")
domains = pay.get("domain", [])
old_skill = pay.get("skill_level", "?")
subdomains = pay.get("subdomain", [])
if isinstance(subdomains, str):
subdomains = [subdomains]
content = pay.get("content", pay.get("summary", ""))
key = rotator.next()
kt, cx = classify(title, domains, subdomains, content, key)
print(f"\n--- Sample {i}/{len(samples)} ---")
print(f" Title: {title}")
print(f" Domain: {domains}")
print(f" Old skill: {old_skill}")
print(f" → knowledge_type: {kt}")
print(f" → complexity: {cx}")
est = total_vectors - pre_checkpoint
print(f"\nDRY RUN complete. ~{est:,} vectors would be migrated.")
print(f"Estimated Gemini Flash cost: ~${est * 0.0004:.2f}")
return
# ── Full migration run (streaming) ──────────────────────────────────────
done, skipped_ckpt, skipped_no_skill, results, start = stream_and_process(
qdrant, collection, rotator, checkpoint, args.workers, args.limit
)
elapsed = time.time() - start
log.info(f"\nComplete in {elapsed/60:.1f}min:")
log.info(f" Processed: {done:,}")
log.info(f" Skipped (checkpoint): {skipped_ckpt:,}")
log.info(f" Skipped (no skill): {skipped_no_skill:,}")
for status, count in sorted(results.items(), key=lambda x: -x[1]):
log.info(f" {status:<30} {count:>10,}")
if __name__ == "__main__":
main()
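The batch loop in `stream_and_process` relies on the scroll contract: each call returns one page of points plus the offset of the next page, and a `None` offset marks the end. A minimal sketch of that pattern as a generator follows. The names `iter_batches`, `make_fake_scroll`, and `scroll_page` are hypothetical and not part of the script, and the fake data source only stands in for `qdrant.scroll`.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page function: takes an offset, returns (items, next_offset).
ScrollFn = Callable[[Optional[int]], Tuple[List[int], Optional[int]]]


def iter_batches(scroll_page: ScrollFn, limit: Optional[int] = None) -> Iterator[List[int]]:
    """Yield one batch at a time so only a single batch is held in memory."""
    offset = None
    seen = 0
    while True:
        items, offset = scroll_page(offset)
        if limit is not None:
            # Trim the batch so the overall limit is honored exactly
            items = items[: max(limit - seen, 0)]
        if items:
            seen += len(items)
            yield items
        if offset is None or (limit is not None and seen >= limit):
            break


def make_fake_scroll(total: int, page_size: int) -> ScrollFn:
    """Stand-in for a vector store: pages through the integers 0..total-1."""
    def scroll_page(offset: Optional[int]):
        start = offset or 0
        end = min(start + page_size, total)
        next_offset = end if end < total else None
        return list(range(start, end)), next_offset
    return scroll_page


batches = list(iter_batches(make_fake_scroll(total=12, page_size=5)))
```

Unlike the loop in the script, which checks `limit` only after a whole batch has been processed, this sketch trims the final batch, so a small `--limit` would not overshoot by up to one batch.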

227
scripts/rebuild_qdrant.py Executable file

@ -0,0 +1,227 @@
"""
RECON Qdrant Rebuilder: patched for headless parallel execution
Deletes and recreates the Qdrant collection, then re-embeds ALL concept JSONs
from disk using parallel workers. Pass --confirm to skip interactive prompt.
Usage:
python3 scripts/rebuild_qdrant.py --confirm [--workers 8]
"""
import json
import os
import sys
import time
import argparse
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import requests as http_requests
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
from lib.utils import get_config, concept_id, setup_logging
from lib.status import StatusDB
logger = setup_logging('recon.rebuild')
def embed_content(config, content):
try:
tei_url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
resp = http_requests.post(tei_url, json={"inputs": content}, timeout=120)
resp.raise_for_status()
return resp.json()[0]
except Exception as tei_err:
logger.debug(f"TEI failed, trying Ollama: {tei_err}")
ollama_url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"
resp = http_requests.post(ollama_url, json={
"model": config['embedding']['model'],
"input": content
}, timeout=120)
resp.raise_for_status()
return resp.json()['embeddings'][0]
def process_doc(doc_hash, config, db, qdrant, collection):
    """Embed and upsert all concepts for a single document. Returns (inserted, failed)."""
    doc_dir = os.path.join(config['paths']['concepts'], doc_hash)
    doc = db.get_document(doc_hash)
    filename = doc['filename'] if doc else doc_hash[:8]
    window_files = sorted([
        f for f in os.listdir(doc_dir)
        if f.startswith('window_') and f.endswith('.json')
    ])
    all_concepts = []
    for wf in window_files:
        path = os.path.join(doc_dir, wf)
        try:
            with open(path, encoding='utf-8') as f:
                concepts = json.load(f)
            if isinstance(concepts, list):
                all_concepts.extend(concepts)
        except Exception as e:
            logger.warning(f"Skipping corrupted window {wf} in {doc_hash}: {e}")
    if not all_concepts:
        return 0, 0
    is_web = doc.get('path', '').startswith(('http://', 'https://')) if doc else False
    # Default source_type comes from the path; meta.json may override it (e.g. 'transcript')
    source_type = 'web' if is_web else 'document'
    text_dir = os.path.join(config['paths']['text'], doc_hash)
    meta_path = os.path.join(text_dir, 'meta.json')
    if os.path.exists(meta_path):
        try:
            with open(meta_path, encoding='utf-8') as mf:
                meta = json.load(mf)
            if meta.get('source_type'):
                source_type = meta['source_type']
        except Exception as e:
            # Unreadable meta.json: keep the path-derived source_type
            logger.debug(f"Ignoring unreadable meta.json for {doc_hash}: {e}")
    points = []
    inserted = 0
    failed = 0
    batch_size = config['processing']['embed_batch_size']
    for idx, concept in enumerate(all_concepts):
        content = concept.get('content', '')
        # Empty or near-empty concepts are skipped: neither inserted nor failed
        if not content or len(content.strip()) < 10:
            continue
        try:
            vector = embed_content(config, content)
        except Exception as e:
            logger.warning(f"Embedding failed {doc_hash}:{idx}: {e}")
            failed += 1
            continue
        start_page = concept.get('_start_page', 0)
        point_id = concept_id(doc_hash, start_page, idx)
        payload = {
            'doc_hash': doc_hash,
            'filename': filename,
            'book_title': doc.get('book_title', '') if doc else '',
            'book_author': doc.get('book_author', '') if doc else '',
            'source_type': source_type,
            'verification_status': 'unverified',
            'credibility_score': 0.7,
            'language': 'en',
        }
        for field in ['content', 'summary', 'title', 'domain', 'subdomain',
                      'keywords', 'skill_level', 'key_facts', 'scenario_applicable',
                      'cross_domain_tags', 'chapter', 'page_ref', 'notes',
                      '_window', '_start_page']:
            if field in concept:
                payload[field] = concept[field]
        points.append(PointStruct(id=point_id, vector=vector, payload=payload))
        if len(points) >= batch_size:
            qdrant.upsert(collection_name=collection, points=points)
            inserted += len(points)
            points = []
    if points:
        qdrant.upsert(collection_name=collection, points=points)
        inserted += len(points)
    # `inserted` counts only points actually upserted; the previous
    # len(all_concepts) - failed also counted the skipped short concepts.
    if doc:
        db.update_status(doc_hash, 'complete', vectors_inserted=inserted)
    return inserted, failed
def run_rebuild(workers=8):
config = get_config()
db = StatusDB()
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=60
)
collection = config['vector_db']['collection']
# Delete and recreate
try:
qdrant.delete_collection(collection)
logger.info(f"Deleted collection: {collection}")
except Exception:
pass
qdrant.create_collection(
collection_name=collection,
vectors_config=VectorParams(
size=config['embedding']['dimensions'],
distance=Distance.COSINE
)
)
logger.info(f"Created collection: {collection} ({config['embedding']['dimensions']}d, Cosine)")
concepts_root = config['paths']['concepts']
doc_dirs = sorted([
d for d in os.listdir(concepts_root)
if os.path.isdir(os.path.join(concepts_root, d))
])
logger.info(f"Found {len(doc_dirs)} document concept directories | {workers} workers")
total_inserted = 0
total_failed = 0
done = 0
lock = threading.Lock()
start = time.time()
with ThreadPoolExecutor(max_workers=workers) as ex:
futures = {
ex.submit(process_doc, h, config, StatusDB(), qdrant, collection): h
for h in doc_dirs
}
for future in as_completed(futures):
doc_hash = futures[future]
try:
inserted, failed = future.result()
except Exception as e:
logger.error(f"Worker error {doc_hash}: {e}")
inserted, failed = 0, 0
with lock:
total_inserted += inserted
total_failed += failed
done += 1
if done % 500 == 0:
elapsed = time.time() - start
rate = total_inserted / elapsed if elapsed > 0 else 0
remaining = (len(doc_dirs) - done) / (done / elapsed) if elapsed > 0 else 0
logger.info(
f" [{done}/{len(doc_dirs)}] "
f"{total_inserted:,} vectors | "
f"{rate:.0f}/sec | "
f"ETA {remaining/60:.0f}min"
)
elapsed = time.time() - start
logger.info(f"\nRebuild complete in {elapsed/60:.1f} min: "
f"{total_inserted:,} inserted, {total_failed:,} failed")
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--confirm', action='store_true', help='Skip interactive prompt')
parser.add_argument('--workers', type=int, default=8)
args = parser.parse_args()
if not args.confirm:
print("WARNING: This will DELETE and RECREATE the Qdrant collection.")
confirm = input("Type 'REBUILD' to proceed: ")
if confirm != 'REBUILD':
print("Aborted.")
sys.exit(0)
run_rebuild(workers=args.workers)
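`embed_content` above tries TEI first and falls back to Ollama when the first request fails. The same idea can be sketched generically as an ordered list of backends. Everything in this block (`embed_with_fallback`, `flaky_primary`, `steady_fallback`) is a hypothetical stand-in for illustration, not the script's API, and no HTTP calls are made.

```python
from typing import Callable, List, Sequence

Embedder = Callable[[str], List[float]]


def embed_with_fallback(text: str, backends: Sequence[Embedder]) -> List[float]:
    """Try each backend in order; return the first successful embedding."""
    errors = []
    for backend in backends:
        try:
            return backend(text)
        except Exception as err:  # any backend failure moves on to the next one
            errors.append(f"{getattr(backend, '__name__', 'backend')}: {err}")
    # Every backend failed: surface all collected errors at once
    raise RuntimeError("all embedding backends failed: " + "; ".join(errors))


# Hypothetical stand-ins for the TEI and Ollama HTTP calls.
def flaky_primary(text: str) -> List[float]:
    raise ConnectionError("primary unreachable")


def steady_fallback(text: str) -> List[float]:
    return [float(len(text)), 0.0]


vector = embed_with_fallback("hello", [flaky_primary, steady_fallback])
```

Collecting the per-backend errors keeps the primary failure visible when the fallback also fails, which the nested try/except in the script would otherwise only record at debug level.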

314
scripts/reenrich_reference.py Executable file

@ -0,0 +1,314 @@
#!/usr/bin/env python3
"""
reenrich_reference.py: Re-classifies all remaining Reference-tagged concepts.
Scrolls Qdrant for vectors with domain == ["Reference"] or containing "Reference",
calls Gemini with a hardened prompt that rejects Reference as a valid response,
updates both Qdrant payload and concept JSON on disk.
Usage:
python3 /opt/recon/scripts/reenrich_reference.py [--dry-run] [--workers 16] [--limit N]
"""
import json
import time
import random
import logging
import argparse
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
import google.generativeai as genai
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, MatchAny, Filter
import sys
sys.path.insert(0, '/opt/recon')
from lib.utils import get_config, setup_logging
LOG_FILE = Path("/opt/recon/logs/reenrich_reference.log")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)
log = logging.getLogger("reenrich_reference")
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
CANONICAL_DOMAINS = {
"Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
"Foundational Skills", "Communications", "Medical", "Food Systems",
"Navigation", "Logistics", "Power Systems", "Leadership",
"Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
}
# Hardened prompt — Reference explicitly forbidden, classification rules detailed
CLASSIFY_PROMPT = """\
You are a knowledge classification engine. Classify this concept into its correct domain.
VALID DOMAINS (use ONLY these exact strings):
Defense & Tactics
Sustainment Systems
Off-Grid Systems
Foundational Skills
Communications
Medical
Food Systems
Navigation
Logistics
Power Systems
Leadership
Scenario Playbooks
Water Systems
Security
Community Coordination
FORBIDDEN: Do NOT output "Reference" under any circumstances. It is not a valid domain.
FORBIDDEN: Do NOT output an empty domain list.
CLASSIFICATION RULES:
- First aid, anatomy, pharmacology, herbs, veterinary, austere medicine, wound care -> Medical
- Food growing, foraging, hunting, fishing, animal husbandry, livestock -> Sustainment Systems
- Food preservation, canning, fermentation, food storage, dehydrating -> Food Systems
- Solar, wind, hydro, batteries, generators, inverters, charge controllers -> Power Systems
- Water sourcing, filtration, purification, sanitation, wells, rainwater -> Water Systems
- Radio, antennas, mesh networking, SIGINT, amateur radio -> Communications
- Weapons, tactics, NBC, security operations, field craft -> Defense & Tactics
- Permaculture, soil science, agroforestry, composting -> Sustainment Systems
- Shelter, construction, masonry, blacksmithing, woodworking, crafts -> Foundational Skills
- Navigation, land nav, celestial nav, map reading, compass -> Navigation
- Emergency planning, disaster prep, scenario planning -> Scenario Playbooks
- Leadership, governance, community organization -> Leadership
- Supply chain, transportation, inventory -> Logistics
- Physical security, perimeter, surveillance -> Security
- Community building, cooperation, mutual aid -> Community Coordination
- Biogas, wood gasification, rocket stoves, appropriate technology -> Off-Grid Systems
If uncertain between two domains, pick the most actionable one for a self-reliant household.
Concept title: {title}
Concept subdomain tags: {subdomain}
Concept content: {content}
Return ONLY valid JSON, no markdown, no explanation:
{{"domain": ["Domain Name"]}}
"""
def load_gemini_keys():
keys = []
for line in Path("/opt/recon/.env").read_text().splitlines():
if line.startswith("GEMINI_KEY_"):
keys.append(line.split("=", 1)[1].strip())
return keys
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self._i = 0
self._lock = threading.Lock()
def next(self):
with self._lock:
key = self.keys[self._i % len(self.keys)]
self._i += 1
return key
def classify(title, subdomains, content, key, attempt=0):
    """Call Gemini. Rejects Reference. Falls back to subdomain heuristic if needed."""
    prompt = CLASSIFY_PROMPT.format(
        title=title or "(untitled)",
        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
        content=str(content)[:400] if content else "(none)",
    )
    # NOTE: genai.configure() sets a process-wide default, so with many worker
    # threads the key in effect may not be the one passed in here.
    genai.configure(api_key=key)
    model = genai.GenerativeModel(
        "gemini-2.0-flash",
        generation_config={"response_mime_type": "application/json"}
    )
    for retry in range(4):
        try:
            resp = model.generate_content(prompt)
            data = json.loads(resp.text)
            domains = [
                d for d in data.get("domain", [])
                if d in CANONICAL_DOMAINS  # strips Reference automatically
            ]
            if domains:
                return domains
            # Gemini returned Reference or an empty list: retry once with the
            # same prompt, then give up and use the heuristic below.
            if retry >= 1:
                break
        except Exception as e:
            err = str(e).lower()
            # Transient errors (rate limit, service unavailable): back off and retry
            if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
                time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
            else:
                break
    # Last resort: subdomain keyword heuristic
    return subdomain_fallback(subdomains)
SUBDOMAIN_FALLBACK_MAP = [
(["first aid", "trauma", "wound", "anatomy", "pharmacol", "herbal", "medicin", "veterinar", "dental", "surgery"], "Medical"),
(["foraging", "hunting", "fishing", "livestock", "permaculture", "soil", "agroforestry", "mycolog", "mushroom"], "Sustainment Systems"),
(["canning", "preservation", "fermentation", "food storage", "dehydrat"], "Food Systems"),
(["solar", "battery", "generator", "inverter", "wind turbine", "photovoltaic"], "Power Systems"),
(["water purif", "filtration", "sanitation", "well", "rainwater"], "Water Systems"),
(["radio", "antenna", "mesh", "sigint", "amateur radio", "meshtastic"], "Communications"),
(["weapon", "firearm", "tactic", "nbc", "chemical warfare", "ballistic"], "Defense & Tactics"),
(["navigation", "compass", "land nav", "celestial"], "Navigation"),
(["blacksmith", "woodwork", "masonry", "construct", "craft", "pottery"], "Foundational Skills"),
(["biogas", "gasif", "rocket stove", "appropriate tech"], "Off-Grid Systems"),
(["disaster", "emergency prep", "evacuation", "scenario"], "Scenario Playbooks"),
(["leadership", "governance", "community"], "Leadership"),
(["logistics", "supply chain", "transport"], "Logistics"),
(["security", "perimeter", "surveillance"], "Security"),
]
def subdomain_fallback(subdomains):
combined = " ".join(s.lower() for s in subdomains)
for keywords, domain in SUBDOMAIN_FALLBACK_MAP:
if any(kw in combined for kw in keywords):
return [domain]
return ["Foundational Skills"] # absolute last resort
def update_concept_json(doc_hash, title, new_domains):
"""Update domain in concept JSON files on disk."""
doc_dir = CONCEPTS_DIR / doc_hash
if not doc_dir.exists():
return False
for wf in doc_dir.glob("window_*.json"):
try:
with open(wf, "r", encoding="utf-8") as f:
concepts = json.load(f)
changed = False
for c in concepts:
if not isinstance(c, dict):
continue
if c.get("title") == title:
raw = c.get("domain", [])
if isinstance(raw, str):
raw = [raw]
if "Reference" in raw or not [d for d in raw if d in CANONICAL_DOMAINS]:
c["domain"] = new_domains
changed = True
if changed:
with open(wf, "w", encoding="utf-8") as f:
json.dump(concepts, f, indent=2, ensure_ascii=False)
return True
except Exception:
pass
return False
def process_point(point, qdrant, collection, key_rotator, dry_run):
payload = point.payload
title = payload.get("title", "")
subdomains = payload.get("subdomain", [])
if isinstance(subdomains, str):
subdomains = [subdomains]
content = payload.get("content", payload.get("summary", ""))
doc_hash = payload.get("doc_hash", "")
key = key_rotator.next()
new_domains = classify(title, subdomains, content, key)
if dry_run:
return "would_classify"
# Update Qdrant payload
qdrant.set_payload(
collection_name=collection,
payload={"domain": new_domains},
points=[point.id],
)
# Update JSON on disk
if doc_hash:
update_concept_json(doc_hash, title, new_domains)
return "ok"
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--workers", type=int, default=16)
parser.add_argument("--limit", type=int, default=None)
args = parser.parse_args()
config = get_config()
keys = load_gemini_keys()
rotator = KeyRotator(keys)
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=60
)
collection = config['vector_db']['collection']
log.info("Scrolling Qdrant for Reference-tagged concepts...")
# Scroll all points containing Reference in domain
offset = None
reference_points = []
while True:
results, offset = qdrant.scroll(
collection_name=collection,
scroll_filter=Filter(
must=[FieldCondition(
key="domain",
match=MatchAny(any=["Reference"])
)]
),
limit=1000,
with_payload=True,
with_vectors=False,
offset=offset,
)
reference_points.extend(results)
if offset is None:
break
if args.limit and len(reference_points) >= args.limit:
reference_points = reference_points[:args.limit]
break
total = len(reference_points)
log.info(f"Found {total:,} Reference-tagged vectors")
log.info(f"Workers: {args.workers} | Keys: {len(keys)} | Dry run: {args.dry_run}")
log.info(f"Estimated Gemini Flash cost: ~${total * 0.0004:.2f}")
if args.dry_run:
log.info(f"DRY RUN: would re-classify {total:,} concepts. Exiting.")
return
results = defaultdict(int)
lock = threading.Lock()
done = 0
start = time.time()
    with ThreadPoolExecutor(max_workers=args.workers) as ex:
        futures = {
            ex.submit(process_point, p, qdrant, collection, rotator, False): p
            for p in reference_points
        }
        for future in as_completed(futures):
            # A failing worker is counted as an error instead of aborting the run
            try:
                status = future.result()
            except Exception as e:
                status = "error"
                log.error(f"Worker error: {e}")
            with lock:
                results[status] += 1
                done += 1
                if done % 5000 == 0:
                    elapsed = time.time() - start
                    rate = done / elapsed * 60
                    eta = (total - done) / (done / elapsed) / 60
                    log.info(f" {done:,}/{total:,} | {rate:.0f}/min | ETA {eta:.0f}min | {dict(results)}")
            time.sleep(0.02)
elapsed = time.time() - start
log.info(f"\nComplete in {elapsed/60:.1f}min:")
for status, count in sorted(results.items(), key=lambda x: -x[1]):
log.info(f" {status:<20} {count:>10,}")
if __name__ == "__main__":
main()
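The retry delay used in `classify` is `min(5 * 2**retry + uniform(0, 3), 60)`: exponential growth from a 5-second base, up to 3 seconds of random jitter, capped at 60 seconds. A small sketch of that formula as a standalone function is below. The function name and its parameters are assumptions made for illustration, with defaults mirroring the constants in the script.

```python
import random
from typing import Optional


def backoff_delay(retry: int, base: float = 5.0, jitter: float = 3.0,
                  cap: float = 60.0, rng: Optional[random.Random] = None) -> float:
    """Exponential backoff with additive jitter, capped at `cap` seconds."""
    rng = rng or random.Random()
    # base * 2^retry grows 5, 10, 20, 40, ...; jitter spreads out concurrent workers
    return min(base * (2 ** retry) + rng.uniform(0, jitter), cap)


# Delay ranges for the four attempts the script makes (retry = 0..3)
ranges = [(5 * 2 ** r, 5 * 2 ** r + 3) for r in range(4)]
delays = [backoff_delay(r, rng=random.Random(0)) for r in range(4)]
```

With these constants the cap only takes effect from the fifth retry onward (80 seconds before capping), so within the script's four attempts the longest possible wait is 43 seconds.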

315
scripts/repair_corrupted.py Executable file

@ -0,0 +1,315 @@
#!/usr/bin/env python3
"""
repair_corrupted.py: Repairs window files corrupted by concurrent writes.
Strategy:
1. Read corrupted_windows.txt to get the list of bad files
2. For each bad file, identify the parent doc hash from the path
3. Check if the text directory still exists for that doc
4. If yes: re-run Gemini enrichment on just that window
5. If no text: mark as unrecoverable
6. Report summary
Usage:
python3 /opt/recon/scripts/repair_corrupted.py [--dry-run] [--workers 8]
"""
import json
import time
import random
import logging
import argparse
import re
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict
import google.generativeai as genai
CORRUPTED_LIST = Path("/opt/recon/data/corrupted_windows.txt")
TEXT_DIR = Path("/opt/recon/data/text")
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
LOG_FILE = Path("/opt/recon/logs/repair_corrupted.log")
UNRECOVERABLE_LOG = Path("/opt/recon/data/unrecoverable_windows.txt")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[
logging.FileHandler(LOG_FILE),
logging.StreamHandler(),
]
)
log = logging.getLogger("repair_corrupted")
CANONICAL_DOMAINS = [
"Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
"Foundational Skills", "Communications", "Medical", "Food Systems",
"Navigation", "Logistics", "Power Systems", "Leadership",
"Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
]
ENRICH_PROMPT = """Extract knowledge concepts from this document text.
A concept is a SELF-CONTAINED piece of knowledge that can stand alone.
For each concept, provide ALL fields:
Required:
- content: Full text of the concept (complete procedure, definition, etc.)
- summary: 1-2 sentence summary
- title: Brief descriptive title
- domain: Array of 1-5 from ONLY these exact strings (no others):
Defense & Tactics, Sustainment Systems, Off-Grid Systems, Foundational Skills,
Communications, Medical, Food Systems, Navigation, Logistics, Power Systems,
Leadership, Scenario Playbooks, Water Systems, Security, Community Coordination
CRITICAL: Do NOT use "Reference". Every concept belongs somewhere specific.
- subdomain: Array of specific subcategories (up to 10)
- keywords: Array of 3-30 searchable terms
- skill_level: novice | intermediate | advanced
- key_facts: Array of specific extractable claims, measurements, data points
Optional (include when present):
- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
- chapter: Chapter name if identifiable
- page_ref: Page reference
Return JSON array. If no extractable concepts, return [].
Document text:
"""
def load_gemini_keys():
env = Path("/opt/recon/.env")
keys = []
for line in env.read_text().splitlines():
if line.startswith("GEMINI_KEY_"):
keys.append(line.split("=", 1)[1].strip())
return keys
class KeyRotator:
def __init__(self, keys):
self.keys = keys
self._i = 0
self._lock = threading.Lock()
def next(self):
with self._lock:
key = self.keys[self._i % len(self.keys)]
self._i += 1
return key
def repair_json_truncated(text):
"""Last-ditch attempt to salvage a truncated JSON array."""
text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
text = re.sub(r',\s*([}\]])', r'\1', text)
try:
return json.loads(text)
except Exception:
pass
# Find last complete object
last_close = -1
depth = 0
in_str = False
esc = False
for i, ch in enumerate(text):
if esc:
esc = False; continue
if ch == '\\' and in_str:
esc = True; continue
if ch == '"' and not esc:
in_str = not in_str; continue
if in_str:
continue
if ch == '{': depth += 1
elif ch == '}':
depth -= 1
if depth == 0:
last_close = i
if last_close > 0:
trimmed = text[:last_close + 1].rstrip().rstrip(',')
open_brackets = trimmed.count('[') - trimmed.count(']')
try:
return json.loads(trimmed + ']' * open_brackets)
except Exception:
pass
return None
def enrich_window_text(text, key):
"""Call Gemini on raw window text, return concepts list."""
genai.configure(api_key=key)
model = genai.GenerativeModel(
"gemini-2.0-flash",
generation_config={"response_mime_type": "application/json"}
)
for attempt in range(4):
try:
resp = model.generate_content(ENRICH_PROMPT + text)
raw = resp.text
try:
result = json.loads(raw)
except Exception:
result = repair_json_truncated(raw)
if isinstance(result, list):
return [c for c in result if isinstance(c, dict)]
elif isinstance(result, dict):
return [result]
return []
except Exception as e:
err = str(e).lower()
if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
delay = min(5 * (2 ** attempt) + random.uniform(0, 3), 60)
time.sleep(delay)
else:
log.warning(f" Non-transient error: {e}")
break
return None # failed
def get_window_text(doc_hash, window_filename):
"""Reconstruct window text from page files."""
# Window filename: window_NNNN.json -> window index is NNNN
try:
w_idx = int(Path(window_filename).stem.split('_')[1]) - 1
except (IndexError, ValueError):
return None
text_path = TEXT_DIR / doc_hash
if not text_path.exists():
return None
page_files = sorted([
f for f in text_path.iterdir()
if f.name.startswith('page_') and f.name.endswith('.txt')
])
if not page_files:
return None
# Re-derive which pages this window covered. NOTE: window_size is hardcoded to 5
# here and must match the value used in the config at enrichment time.
window_size = 5
start = w_idx * window_size
window_pages = page_files[start:start + window_size]
if not window_pages:
return None
parts = []
for j, pf in enumerate(window_pages):
try:
text = pf.read_text(encoding='utf-8')
parts.append(f"--- Page {start + j + 1} ---\n{text}")
except Exception:
pass
return "\n\n".join(parts) if parts else None
def repair_file(corrupted_path, key_rotator, dry_run):
"""Attempt to repair a single corrupted window file."""
path = Path(corrupted_path)
    # Sanity check: the file may already be valid (e.g. repaired by an earlier run)
    try:
        with open(path, encoding='utf-8') as f:
            json.load(f)
        return "already_valid"
    except Exception:
        pass
# Extract doc hash and window name from path structure
# Expected: /opt/recon/data/concepts/{hash}/window_NNNN.json
doc_hash = path.parent.name
window_filename = path.name
# Get source text for this window
window_text = get_window_text(doc_hash, window_filename)
if not window_text:
return "no_source_text"
if dry_run:
return "would_repair"
# Re-enrich from source text
key = key_rotator.next()
concepts = enrich_window_text(window_text, key)
if concepts is None:
return "enrichment_failed"
# Tag concepts with metadata
try:
w_idx = int(Path(window_filename).stem.split('_')[1]) - 1
window_size = 5
start_page = w_idx * window_size + 1
except Exception:
w_idx = 0
start_page = 0
for c in concepts:
c['_window'] = w_idx + 1
c['_start_page'] = start_page
c['_doc_hash'] = doc_hash
c['_repaired'] = True
    # Write repaired file
    try:
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(concepts, f, indent=2, ensure_ascii=False)
        return "repaired"
    except Exception as e:
        log.warning(f" Write failed for {path}: {e}")
        return "write_error"
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--workers", type=int, default=8)
args = parser.parse_args()
if not CORRUPTED_LIST.exists():
log.error(f"Corrupted list not found: {CORRUPTED_LIST}")
log.error("Run Task 1 first to generate it.")
return
keys = load_gemini_keys()
rotator = KeyRotator(keys)
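    corrupted = []
    with open(CORRUPTED_LIST, encoding='utf-8') as f:
        for line in f:
            # First tab-separated column is the file path. str.split() never
            # returns an empty list, so test the path itself to skip blank lines.
            path = line.strip().split('\t')[0]
            if path:
                corrupted.append(path)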
log.info(f"Repairing {len(corrupted):,} corrupted window files")
log.info(f"Dry run: {args.dry_run} | Workers: {args.workers} | Keys: {len(keys)}")
results = defaultdict(int)
unrecoverable = []
lock = threading.Lock()
    with ThreadPoolExecutor(max_workers=args.workers) as ex:
        futures = {ex.submit(repair_file, p, rotator, args.dry_run): p for p in corrupted}
        done = 0
        for future in as_completed(futures):
            path = futures[future]
            # An unexpected worker exception is recorded rather than ending the run
            try:
                status = future.result()
            except Exception as e:
                status = "worker_error"
                log.error(f"Worker error for {path}: {e}")
            with lock:
                results[status] += 1
                if status in ("no_source_text", "enrichment_failed", "write_error", "worker_error"):
                    unrecoverable.append((path, status))
                done += 1
                if done % 100 == 0:
                    log.info(f" {done:,}/{len(corrupted):,} | {dict(results)}")
            time.sleep(0.05)
log.info("── Results ─────────────────────────────────────────────────")
for status, count in sorted(results.items(), key=lambda x: -x[1]):
log.info(f" {status:<25} {count:>8,}")
if unrecoverable:
with open(UNRECOVERABLE_LOG, 'w') as f:
for path, reason in unrecoverable:
f.write(f"{path}\t{reason}\n")
log.info(f"\n Unrecoverable: {len(unrecoverable)} — logged to {UNRECOVERABLE_LOG}")
else:
log.info("\n All files repaired successfully.")
if __name__ == "__main__":
main()
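Since this script exists to repair files damaged by concurrent writes, one common way to keep such damage from recurring is to write to a temporary file and rename it over the target, so readers never observe a half-written file. The sketch below illustrates that technique. `write_json_atomic` is a hypothetical helper and not something the script currently uses.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data) -> None:
    """Write JSON to a temp file in the target's directory, then rename over the target."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        # os.replace is atomic on POSIX when both paths are on the same filesystem
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


# Demonstration in a throwaway directory
demo_dir = Path(tempfile.mkdtemp())
target = demo_dir / "window_0001.json"
write_json_atomic(target, [{"title": "example"}])
```

Creating the temp file in the same directory as the target matters here: a rename across filesystems would not be atomic.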

178
scripts/validate.py Executable file

@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
RECON Pipeline Validator
Checks pipeline consistency: paths, DB state, file integrity, and service connectivity.
Validates TEI, Ollama, and Qdrant are reachable. Deep mode checks every document on disk.
Usage: python3 scripts/validate.py [--deep]
"""
import json
import os
import sys
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from lib.utils import get_config, setup_logging
from lib.status import StatusDB
logger = setup_logging('recon.validate')
def run_validation(deep=False):
config = get_config()
db = StatusDB()
issues = []
warnings = []
print("=== RECON Validation ===\n")
# Check paths
for name, path in config['paths'].items():
if name == 'db':
if not os.path.exists(path):
issues.append(f"Database not found: {path}")
else:
if not os.path.exists(path):
warnings.append(f"Directory missing: {name} = {path}")
# Check library
if not os.path.exists(config['library_root']):
issues.append(f"Library root not found: {config['library_root']}")
# Check Gemini keys
keys = config.get('gemini_keys', [])
if not keys:
warnings.append("No Gemini API keys configured in .env")
else:
print(f" Gemini keys: {len(keys)} configured")
# DB status counts
counts = db.get_status_counts()
cat = counts.get('catalogue', {})
doc = counts.get('documents', {})
print(f" Catalogue: {sum(cat.values())} entries")
print(f" Documents: {sum(doc.values())} entries")
print(f" Complete: {doc.get('complete', 0)}")
print(f" Failed: {doc.get('failed', 0)}")
if deep:
print("\n--- Deep Validation ---\n")
# Check every document in pipeline has corresponding files
all_docs = db.get_all_documents()
text_dir = config['paths']['text']
concepts_dir = config['paths']['concepts']
for d in all_docs:
h = d['hash']
status = d['status']
if status in ('extracted', 'enriched', 'complete'):
doc_text_dir = os.path.join(text_dir, h)
if not os.path.exists(doc_text_dir):
issues.append(f"[{h[:8]}] {d['filename']}: text dir missing but status={status}")
elif deep:
pages = [f for f in os.listdir(doc_text_dir) if f.startswith('page_')]
if not pages:
issues.append(f"[{h[:8]}] {d['filename']}: no page files in text dir")
if status in ('enriched', 'complete'):
doc_concepts_dir = os.path.join(concepts_dir, h)
if not os.path.exists(doc_concepts_dir):
issues.append(f"[{h[:8]}] {d['filename']}: concepts dir missing but status={status}")
elif deep:
windows = [f for f in os.listdir(doc_concepts_dir) if f.startswith('window_')]
if not windows:
issues.append(f"[{h[:8]}] {d['filename']}: no window files in concepts dir")
else:
for wf in windows:
try:
with open(os.path.join(doc_concepts_dir, wf)) as f:
data = json.load(f)
if not isinstance(data, list):
issues.append(f"[{h[:8]}] {wf}: not a JSON array")
except json.JSONDecodeError:
issues.append(f"[{h[:8]}] {wf}: invalid JSON")
# Check orphaned directories
if os.path.exists(text_dir):
doc_hashes = {d['hash'] for d in all_docs}
for dirname in os.listdir(text_dir):
if dirname not in doc_hashes:
warnings.append(f"Orphaned text dir: {dirname}")
if os.path.exists(concepts_dir):
for dirname in os.listdir(concepts_dir):
if dirname not in doc_hashes:
warnings.append(f"Orphaned concepts dir: {dirname}")
print(f" Checked {len(all_docs)} documents")
# Connectivity checks
print("\n--- Connectivity ---\n")
import requests as http_requests
# Check TEI (primary embedding backend)
try:
tei_url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/info"
resp = http_requests.get(tei_url, timeout=10)
if resp.status_code == 200:
print(f" TEI: OK (bge-m3 at {config['embedding']['tei_host']}:{config['embedding']['tei_port']})")
else:
issues.append(f"TEI: HTTP {resp.status_code}")
except Exception as e:
issues.append(f"TEI: unreachable ({e})")
# Check Ollama (fallback)
try:
ollama_url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/tags"
resp = http_requests.get(ollama_url, timeout=10)
if resp.status_code == 200:
print(f" Ollama: OK (fallback at {config['embedding']['ollama_host']}:{config['embedding']['ollama_port']})")
else:
warnings.append(f"Ollama: HTTP {resp.status_code}")
except Exception as e:
warnings.append(f"Ollama: unreachable ({e}) — fallback only, not critical")
# Check Qdrant (vector database)
try:
from qdrant_client import QdrantClient
qdrant = QdrantClient(
host=config['vector_db']['host'],
port=config['vector_db']['port'],
timeout=10
)
collections = [c.name for c in qdrant.get_collections().collections]
target = config['vector_db']['collection']
if target in collections:
info = qdrant.get_collection(target)
print(f" Qdrant: OK ({target}: {info.points_count} points)")
else:
issues.append(f"Qdrant: collection {target} not found")
except Exception as e:
issues.append(f"Qdrant: unreachable ({e})")
# Summary
print("\n--- Summary ---\n")
if warnings:
print(f"Warnings ({len(warnings)}):")
for w in warnings:
print(f"{w}")
if issues:
print(f"\nIssues ({len(issues)}):")
for i in issues:
print(f"{i}")
print(f"\nValidation FAILED: {len(issues)} issue(s)")
else:
print("Validation PASSED")
if __name__ == '__main__':
deep = '--deep' in sys.argv
run_validation(deep=deep)

316
static/css/recon.css Normal file

@ -0,0 +1,316 @@
/* RECON Design System
* Knowledge Extraction Pipeline Dashboard CSS
*/
:root {
--bg-primary: #0a0a0a;
--bg-secondary: #111;
--bg-tertiary: #1a1a1a;
--border: #222;
--border-light: #333;
--text-primary: #c0c0c0;
--text-muted: #888;
--text-dim: #666;
--text-faint: #555;
--green: #00ff41;
--green-dim: #16a34a;
--red: #ff4444;
--red-dim: #dc2626;
--orange: #ffa500;
--blue: #00bfff;
--blue-sky: #0ea5e9;
--blue-dark: #0284c7;
--purple: #7c3aed;
--yellow: #fbbf24;
/* Pipeline colors */
--pipe-queued: #555;
--pipe-extracting: #b45309;
--pipe-extracted: #d97706;
--pipe-enriching: #0284c7;
--pipe-enriched: #0ea5e9;
--pipe-embedding: #7c3aed;
--pipe-complete: #16a34a;
--pipe-failed: #dc2626;
--font-mono: 'Courier New', monospace;
--radius: 3px;
--radius-md: 4px;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: var(--font-mono); background: var(--bg-primary); color: var(--text-primary); }
/* ── Header ── */
.header {
background: var(--bg-secondary);
border-bottom: 1px solid var(--border-light);
padding: 10px 24px;
flex-shrink: 0;
display: flex;
justify-content: space-between;
align-items: center;
}
.header-left {
display: flex;
align-items: baseline;
gap: 12px;
}
.header-subtitle {
font-size: 11px;
color: var(--text-dim);
letter-spacing: 1px;
text-transform: uppercase;
}
.header h1 { color: var(--green); font-size: 18px; font-weight: 700; letter-spacing: 3px; }
.header .stats { font-size: 12px; color: var(--text-dim); }
.header .quick-stats { font-size: 11px; color: var(--text-muted); display: flex; gap: 16px; }
.header .quick-stats span { white-space: nowrap; }
/* Heartbeat indicator */
.heartbeat {
display: inline-block;
width: 8px;
height: 8px;
border-radius: 50%;
background: var(--green);
margin-right: 6px;
vertical-align: middle;
animation: pulse 2s ease-in-out infinite;
}
.heartbeat.dead {
background: var(--red);
animation: none;
}
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.4; }
}
/* ── Navigation ── */
.nav-domain {
background: #0d0d0d;
border-bottom: 1px solid var(--border);
padding: 0 24px;
display: flex;
gap: 0;
flex-shrink: 0;
}
.nav-domain a {
color: var(--text-muted);
text-decoration: none;
font-size: 13px;
text-transform: uppercase;
letter-spacing: 1px;
padding: 10px 16px;
border-bottom: 2px solid transparent;
transition: color 0.15s, border-color 0.15s;
}
.nav-domain a:hover { color: var(--text-primary); }
.nav-domain a.active {
color: var(--green);
border-bottom-color: var(--green);
}
.nav-sub {
background: var(--bg-primary);
border-bottom: 1px solid var(--border);
padding: 6px 24px;
}
.nav-sub a {
color: var(--text-dim);
text-decoration: none;
margin-right: 16px;
font-size: 12px;
transition: color 0.15s;
}
.nav-sub a:hover { color: var(--text-primary); }
.nav-sub a.active { color: var(--green); }
/* ── Content ── */
.content { padding: 24px; max-width: 1400px; margin: 0 auto; }
/* ── Panels ── */
.panel {
background: var(--bg-secondary);
border: 1px solid var(--border);
padding: 24px;
margin-bottom: 24px;
}
/* ── Forms ── */
.search-box {
width: 100%;
padding: 10px 16px;
background: var(--bg-secondary);
border: 1px solid var(--border-light);
color: var(--text-primary);
font-family: inherit;
font-size: 14px;
margin-bottom: 16px;
}
.search-box:focus { outline: none; border-color: var(--green); }
/* ── Tables ── */
table { width: 100%; border-collapse: collapse; font-size: 13px; }
th { background: var(--bg-secondary); color: var(--green); text-align: left; padding: 8px 12px; border-bottom: 1px solid var(--border-light); }
td { padding: 6px 12px; border-bottom: 1px solid var(--bg-tertiary); }
tr:hover { background: var(--bg-secondary); }
/* ── Status badges ── */
.status { padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
.status-complete { color: var(--green); }
.status-enriched { color: var(--blue); }
.status-extracted { color: var(--orange); }
.status-failed { color: var(--red); }
.status-queued { color: var(--text-muted); }
.status-duplicate { color: var(--text-muted); }
/* ── Stat cards ── */
.stat-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
.stat-card { background: var(--bg-secondary); border: 1px solid var(--border); padding: 16px; }
.stat-card .label { color: var(--text-dim); font-size: 11px; text-transform: uppercase; }
.stat-card .value { color: var(--green); font-size: 28px; margin-top: 4px; }
.stat-card .sublabel { color: var(--text-faint); font-size: 10px; margin-top: 2px; }
/* ── Search results ── */
.result { background: var(--bg-secondary); border: 1px solid var(--border); padding: 16px; margin-bottom: 12px; }
.result .title { color: var(--green); font-size: 14px; margin-bottom: 4px; }
.result .meta { color: var(--text-dim); font-size: 11px; margin-bottom: 8px; }
.result .content-text { color: #999; font-size: 12px; line-height: 1.5; }
.result .score { color: var(--orange); font-size: 12px; float: right; }
/* ── Buttons ── */
.btn {
background: var(--bg-tertiary);
border: 1px solid var(--border-light);
color: var(--text-primary);
padding: 6px 14px;
cursor: pointer;
font-family: inherit;
font-size: 12px;
}
.btn:hover { border-color: var(--green); color: var(--green); }
.btn:disabled { opacity: 0.5; cursor: not-allowed; }
.btn.active { border-color: var(--green); color: var(--green); }
.btn-danger { color: var(--red); }
.btn-danger:hover { border-color: var(--red); }
.btn-warn { color: #ff8800; }
.btn-warn:hover { border-color: #ff8800; }
/* ── Tags ── */
.domain-tag {
display: inline-block;
background: var(--bg-tertiary);
border: 1px solid var(--border-light);
padding: 1px 6px;
margin: 1px;
font-size: 10px;
color: var(--text-muted);
}
.badge-web { background: #1e3a5f; color: #60a5fa; padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
.badge-pdf { background: #2d5a2d; color: #4ade80; padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
/* ── Trend indicators ── */
.trend { font-size: 11px; margin-left: 6px; }
.trend-up { color: var(--green); }
.trend-down { color: var(--red); }
.trend-flat { color: var(--text-faint); }
/* ── Pipeline bar ── */
.pipeline-bar {
height: 24px;
background: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: var(--radius-md);
overflow: hidden;
display: flex;
}
.pipeline-bar .segment { height: 100%; transition: width 0.3s ease; }
.pipeline-legend { display: flex; gap: 14px; margin-top: 6px; font-size: 10px; color: var(--text-muted); flex-wrap: wrap; }
.legend-dot {
display: inline-block;
width: 10px; height: 10px;
border-radius: 2px;
margin-right: 4px;
vertical-align: middle;
}
/* ── Service status dots ── */
.svc-dot {
display: inline-block;
width: 10px;
height: 10px;
border-radius: 50%;
margin-right: 6px;
vertical-align: middle;
}
.svc-dot.active { background: var(--green); }
.svc-dot.inactive { background: var(--red); }
.svc-dot.unknown { background: var(--text-faint); }
/* ── Service status row ── */
.svc-row {
display: flex;
gap: 24px;
background: var(--bg-secondary);
border: 1px solid var(--border);
padding: 12px 16px;
margin-bottom: 24px;
font-size: 12px;
}
.svc-row .svc-item { display: flex; align-items: center; }
/* ── Pagination ── */
.pagination {
display: flex;
gap: 4px;
margin-top: 16px;
justify-content: center;
}
.pagination a, .pagination span {
padding: 4px 10px;
border: 1px solid var(--border-light);
color: var(--text-muted);
text-decoration: none;
font-size: 12px;
}
.pagination a:hover { border-color: var(--green); color: var(--green); }
.pagination .current {
border-color: var(--green);
color: var(--green);
background: var(--bg-tertiary);
}
/* ── Misc helpers ── */
.section-title { color: var(--green); margin-bottom: 12px; }
.mt-24 { margin-top: 24px; }
.mb-16 { margin-bottom: 16px; }
.mb-24 { margin-bottom: 24px; }
.text-muted { color: var(--text-muted); }
.text-dim { color: var(--text-dim); }
.text-faint { color: var(--text-faint); }
.text-green { color: var(--green); }
.text-red { color: var(--red); }
.text-orange { color: var(--orange); }
.text-blue { color: var(--blue); }
.text-small { font-size: 12px; }
.text-xs { font-size: 11px; }
.text-xxs { font-size: 10px; }
.mono { font-family: monospace; }
.flex { display: flex; }
.flex-between { display: flex; justify-content: space-between; }
.flex-center { display: flex; align-items: center; }
.gap-8 { gap: 8px; }
.gap-16 { gap: 16px; }
.grid-2 { display: grid; grid-template-columns: 1fr 1fr; gap: 24px; }
.grid-3 { display: grid; grid-template-columns: repeat(3, 1fr); gap: 16px; }
/* ── Collapsible errors panel ── */
.errors-panel { display: none; }
.errors-panel.has-errors { display: block; }
.errors-panel summary { color: var(--red); cursor: pointer; font-size: 13px; margin-bottom: 8px; }
.errors-panel .error-line { color: var(--text-muted); font-size: 11px; padding: 2px 0; border-bottom: 1px solid var(--border); }

120
static/js/channels.js Normal file

@ -0,0 +1,120 @@
/* RECON PeerTube Channels page JS */
(function() {
'use strict';
async function loadChannelStats() {
try {
var resp = await fetch('/api/peertube/channels/stats');
var data = await resp.json();
if (resp.ok) {
document.getElementById('pt-total-ch').textContent = data.total_channels;
document.getElementById('pt-total-vid').textContent = data.total_videos;
var dlEl = document.getElementById('pt-dl-status');
dlEl.textContent = data.downloader_active ? 'Active' : 'Stopped';
dlEl.style.color = data.downloader_active ? '#00ff41' : '#ff4444';
}
} catch(e) {
console.error('Stats error:', e);
}
}
async function loadChannels() {
try {
var resp = await fetch('/api/peertube/channels');
var data = await resp.json();
if (!resp.ok) throw new Error(data.error || 'Failed');
var tbody = document.getElementById('pt-channel-tbody');
if (!data.length) {
tbody.innerHTML = '<tr><td colspan="6" style="text-align:center;padding:20px;color:#555;">No channels configured</td></tr>';
return;
}
var cats = [];
var catSet = {};
data.forEach(function(c) { if (c.category && !catSet[c.category]) { catSet[c.category] = true; cats.push(c.category); } });
document.getElementById('pt-cat-list').innerHTML = cats.map(function(c) { return '<option value="' + c + '">'; }).join('');
var html = '';
data.forEach(function(ch) {
var vids = ch.videos_in_peertube || 0;
var statusColor = vids > 0 ? '#00ff41' : '#ffa500';
var statusText = vids > 0 ? 'syncing' : 'new';
var ytLink = ch.youtube_url ? '<a href="' + ch.youtube_url + '" target="_blank" style="color:#00a0d0;text-decoration:none;">' + ch.channel_name + '</a>' : ch.channel_name;
html += '<tr style="border-bottom:1px solid #1a1a1a;">' +
'<td style="padding:8px 10px;">' + ytLink + '</td>' +
'<td style="padding:8px 10px;text-align:center;">' + vids + '</td>' +
'<td style="padding:8px 10px;color:#888;">' + (ch.category || '') + '</td>' +
'<td style="padding:8px 10px;text-align:center;">' + (ch.priority || 'M') + '</td>' +
'<td style="padding:8px 10px;text-align:center;"><span style="color:' + statusColor + ';">' + statusText + '</span></td>' +
'<td style="padding:8px 10px;text-align:center;"><button onclick="removeChannel(\'' + ch.actor_name + '\')" style="background:none;border:1px solid #333;color:#ff4444;cursor:pointer;padding:2px 8px;font-size:11px;font-family:inherit;">x</button></td>' +
'</tr>';
});
tbody.innerHTML = html;
} catch(e) {
document.getElementById('pt-channel-tbody').innerHTML = '<tr><td colspan="6" style="text-align:center;padding:20px;color:#ff4444;">Error: ' + e.message + '</td></tr>';
}
}
window.addChannel = async function() {
var fb = document.getElementById('pt-feedback');
var url = document.getElementById('pt-yt-url').value.trim();
if (!url) {
fb.style.color = '#ff4444';
fb.textContent = 'Enter a YouTube channel URL';
return;
}
var category = document.getElementById('pt-category').value.trim();
var priority = document.getElementById('pt-priority').value;
var btn = document.getElementById('pt-add-btn');
btn.disabled = true;
fb.style.color = '#ffa500';
fb.textContent = 'Resolving channel...';
try {
var resp = await fetch('/api/peertube/channels/add', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({youtube_url: url, category: category, priority: priority})
});
var data = await resp.json();
if (resp.ok) {
fb.style.color = '#00ff41';
fb.textContent = 'Added: ' + (data.channel_name || 'channel');
document.getElementById('pt-yt-url').value = '';
loadChannels();
loadChannelStats();
} else {
fb.style.color = '#ff4444';
fb.textContent = data.error || 'Failed to add channel';
}
} catch(e) {
fb.style.color = '#ff4444';
fb.textContent = 'Error: ' + e.message;
}
btn.disabled = false;
};
window.removeChannel = async function(actorName) {
if (!confirm('Remove channel ' + actorName + '?')) return;
var fb = document.getElementById('pt-feedback');
fb.style.color = '#ffa500';
fb.textContent = 'Removing...';
try {
var resp = await fetch('/api/peertube/channels/' + encodeURIComponent(actorName), {method: 'DELETE'});
var data = await resp.json();
if (resp.ok) {
fb.style.color = '#00ff41';
fb.textContent = data.message || 'Removed';
loadChannels();
loadChannelStats();
} else {
fb.style.color = '#ff4444';
fb.textContent = data.error || 'Failed';
}
} catch(e) {
fb.style.color = '#ff4444';
fb.textContent = 'Error: ' + e.message;
}
};
loadChannelStats();
loadChannels();
})();
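The rows in `loadChannels` are built by string concatenation and assigned to `innerHTML`, so values such as `channel_name` or `category` are parsed as markup. Below is a minimal sketch of escaping such values first. The helper names `escapeHtml` and `channelCell` are hypothetical and are not part of channels.js.

```javascript
// Hypothetical helper (not in channels.js): escape the five characters that
// are significant in HTML text and in quoted attribute values.
function escapeHtml(value) {
  var map = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'};
  return String(value == null ? '' : value).replace(/[&<>"']/g, function (ch) {
    return map[ch];
  });
}

// Build one table cell with the same inline style loadChannels uses,
// but with the text escaped before it is concatenated into markup.
function channelCell(name) {
  return '<td style="padding:8px 10px;">' + escapeHtml(name) + '</td>';
}
```

HTML escaping alone may not be enough for the inline `onclick` argument, because the attribute value is HTML-decoded before it is parsed as JavaScript. Attaching the handler with `addEventListener` is one way to avoid that quoting problem.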

186
static/js/charts.js Normal file

@ -0,0 +1,186 @@
/* RECON Lightweight Canvas Line Chart
* No dependencies. drawLineChart(canvasId, datasets, opts)
* DPI-aware rendering for sharp lines on all displays.
*/
var ReconChart = (function() {
'use strict';
var COLORS = ['#00ff41', '#0ea5e9', '#ffa500', '#ff4444', '#7c3aed', '#fbbf24'];
function drawLineChart(canvasId, datasets, opts) {
opts = opts || {};
var canvas = document.getElementById(canvasId);
if (!canvas) return;
// DPI-aware sizing — match canvas bitmap to actual CSS pixels
var dpr = window.devicePixelRatio || 1;
var rect = canvas.getBoundingClientRect();
var cssW = rect.width || 800;
var cssH = rect.height || 200;
canvas.width = cssW * dpr;
canvas.height = cssH * dpr;
var ctx = canvas.getContext('2d');
ctx.scale(dpr, dpr);
var W = cssW;
var H = cssH;
var pad = {top: 20, right: 20, bottom: 30, left: 60};
var plotW = W - pad.left - pad.right;
var plotH = H - pad.top - pad.bottom;
// Clear
ctx.fillStyle = '#111';
ctx.fillRect(0, 0, W, H);
if (!datasets || datasets.length === 0) {
ctx.fillStyle = '#666';
ctx.font = '12px Courier New';
ctx.textAlign = 'center';
ctx.fillText('No data', W/2, H/2);
return;
}
// Find global min/max Y
var allY = [];
var allX = [];
datasets.forEach(function(ds) {
ds.points.forEach(function(p) {
allY.push(p.y);
allX.push(p.x);
});
});
if (allY.length === 0) return;
var minY = Math.min.apply(null, allY);
var maxY = Math.max.apply(null, allY);
var minX = Math.min.apply(null, allX);
var maxX = Math.max.apply(null, allX);
// Pad the Y range: 5% below (clamped at 0) and 10% above
var yRange = maxY - minY || 1;
minY = Math.max(0, minY - yRange * 0.05);
maxY = maxY + yRange * 0.1;
var xRange = maxX - minX || 1;
function xToCanvas(x) { return pad.left + ((x - minX) / xRange) * plotW; }
function yToCanvas(y) { return pad.top + plotH - ((y - minY) / (maxY - minY)) * plotH; }
// Grid lines
ctx.strokeStyle = '#222';
ctx.lineWidth = 1;
var ySteps = 5;
for (var i = 0; i <= ySteps; i++) {
var yVal = minY + (maxY - minY) * (i / ySteps);
var cy = yToCanvas(yVal);
ctx.beginPath();
ctx.moveTo(pad.left, cy);
ctx.lineTo(W - pad.right, cy);
ctx.stroke();
// Y labels
ctx.fillStyle = '#666';
ctx.font = '10px Courier New';
ctx.textAlign = 'right';
ctx.fillText(Math.round(yVal).toLocaleString(), pad.left - 6, cy + 3);
}
// X labels (time)
ctx.textAlign = 'center';
ctx.fillStyle = '#666';
var xSteps = Math.min(6, allX.length);
for (var j = 0; j < xSteps; j++) {
var xVal = minX + xRange * (j / (xSteps - 1 || 1));
var cx = xToCanvas(xVal);
var d = new Date(xVal);
var label = d.getHours().toString().padStart(2, '0') + ':' + d.getMinutes().toString().padStart(2, '0');
ctx.fillText(label, cx, H - 8);
}
// Draw lines + dots at each data point
datasets.forEach(function(ds, idx) {
var color = ds.color || COLORS[idx % COLORS.length];
ctx.strokeStyle = color;
ctx.lineWidth = 2;
ctx.beginPath();
// Sort a copy so the caller's points array is not reordered in place
var pts = ds.points.slice().sort(function(a, b) { return a.x - b.x; });
pts.forEach(function(p, i) {
var x = xToCanvas(p.x);
var y = yToCanvas(p.y);
if (i === 0) ctx.moveTo(x, y);
else ctx.lineTo(x, y);
});
ctx.stroke();
// Draw dots at each point for visibility with sparse data
ctx.fillStyle = color;
pts.forEach(function(p) {
var x = xToCanvas(p.x);
var y = yToCanvas(p.y);
ctx.beginPath();
ctx.arc(x, y, 3, 0, Math.PI * 2);
ctx.fill();
});
// Legend label
if (ds.label) {
ctx.fillStyle = color;
ctx.font = '10px Courier New';
ctx.textAlign = 'left';
ctx.fillText(ds.label, pad.left + idx * 100, 12);
}
});
}
function loadAndDraw(canvasId, metricType, keys, labels, hours) {
hours = hours || 24;
RECON.fetchJSON('/api/metrics/history?type=' + encodeURIComponent(metricType) + '&hours=' + encodeURIComponent(hours)).then(function(data) {
if (!data.points || data.points.length < 2) {
// Show "collecting data" message instead of hiding
var canvas = document.getElementById(canvasId);
if (!canvas) return;
var container = canvas.parentElement;
if (container) container.style.display = 'block';
var dpr = window.devicePixelRatio || 1;
var rect = canvas.getBoundingClientRect();
// Use the same fallback size for the bitmap, the background fill and the text position
var cssW = rect.width || 800;
var cssH = rect.height || 200;
canvas.width = cssW * dpr;
canvas.height = cssH * dpr;
var ctx = canvas.getContext('2d');
ctx.scale(dpr, dpr);
ctx.fillStyle = '#111';
ctx.fillRect(0, 0, cssW, cssH);
ctx.fillStyle = '#555';
ctx.font = '12px Courier New';
ctx.textAlign = 'center';
var msg = data.points && data.points.length === 1
? 'Collecting data... (1 snapshot, need 2+)'
: 'Collecting data... (snapshots every 2 min)';
ctx.fillText(msg, cssW / 2, cssH / 2);
return;
}
var container = document.getElementById(canvasId).parentElement;
if (container) container.style.display = 'block';
var datasets = keys.map(function(key, i) {
return {
label: labels[i] || key,
color: COLORS[i % COLORS.length],
points: data.points.map(function(p) {
return {
x: new Date(p.timestamp).getTime(),
y: p.data[key] || 0
};
})
};
});
drawLineChart(canvasId, datasets);
}).catch(function() {});
}
return {
drawLineChart: drawLineChart,
loadAndDraw: loadAndDraw
};
})();
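The coordinate mapping inside `drawLineChart` can be read separately from the canvas calls. The sketch below restates it as a pure function. The name `makeScales` and the plain `{width, height}` size argument are assumptions for illustration; the padding constants are copied from the chart code.

```javascript
// Hypothetical, DOM-free restatement of the scaling used by drawLineChart.
function makeScales(points, size) {
  var pad = {top: 20, right: 20, bottom: 30, left: 60};
  var plotW = size.width - pad.left - pad.right;
  var plotH = size.height - pad.top - pad.bottom;
  var xs = points.map(function (p) { return p.x; });
  var ys = points.map(function (p) { return p.y; });
  var minX = Math.min.apply(null, xs);
  var maxX = Math.max.apply(null, xs);
  var minY = Math.min.apply(null, ys);
  var maxY = Math.max.apply(null, ys);
  // Same padding as the chart: 5% below (clamped at 0), 10% above.
  var yRange = maxY - minY || 1;
  minY = Math.max(0, minY - yRange * 0.05);
  maxY = maxY + yRange * 0.1;
  var xRange = maxX - minX || 1;
  return {
    // Data X grows left to right across the plot area.
    x: function (x) { return pad.left + ((x - minX) / xRange) * plotW; },
    // Canvas Y grows downward, so larger data values map to smaller pixel Y.
    y: function (y) { return pad.top + plotH - ((y - minY) / (maxY - minY)) * plotH; }
  };
}

var scales = makeScales([{x: 0, y: 0}, {x: 10, y: 100}], {width: 800, height: 200});
```

With these inputs the smallest X lands on the left padding edge and the padded Y maximum lands on the top padding edge, which matches where the chart draws its grid lines.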

163
static/js/common.js Normal file

@ -0,0 +1,163 @@
/* RECON Common Utilities
* Shared fetch helpers, formatters, auto-refresh
*/
var RECON = (function() {
'use strict';
// Pipeline color/label maps
var pipeColors = {
queued: '#555', extracting: '#b45309', extracted: '#d97706',
enriching: '#0284c7', enriched: '#0ea5e9', embedding: '#7c3aed',
complete: '#16a34a', failed: '#dc2626'
};
var pipeLabels = {
queued: 'Queued', extracting: 'Extracting', extracted: 'Extracted',
enriching: 'Enriching', enriched: 'Enriched', embedding: 'Embedding',
complete: 'Complete', failed: 'Failed'
};
var _refreshTimers = [];
var _heartbeatEl = null;
function fetchJSON(url) {
return fetch(url).then(function(r) {
if (!r.ok) throw new Error('HTTP ' + r.status);
return r.json();
});
}
function postJSON(url, body) {
return fetch(url, {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(body || {})
}).then(function(r) { return r.json(); });
}
function set(id, text) {
var el = document.getElementById(id);
if (el) el.textContent = text;
}
function setHTML(id, html) {
var el = document.getElementById(id);
if (el) el.innerHTML = html;
}
function fmt(n) {
if (typeof n !== 'number' || isNaN(n)) return '—';
return n.toLocaleString();
}
function fmtBytes(bytes) {
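if (!bytes || bytes < 0) return '0 B';
var units = ['B', 'KB', 'MB', 'GB', 'TB'];
// Clamp the unit index so fractional and very large values stay inside the units table
var i = Math.min(units.length - 1, Math.max(0, Math.floor(Math.log(bytes) / Math.log(1024))));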
return (bytes / Math.pow(1024, i)).toFixed(1) + ' ' + units[i];
}
function pct(n, total) {
if (!total || total === 0) return '0';
return (n / total * 100).toFixed(1);
}
// Trend indicator: compare current to previous
function trend(current, previous) {
if (previous === undefined || previous === null) return '';
var diff = current - previous;
if (diff > 0) return '<span class="trend trend-up">+' + fmt(diff) + ' &#9650;</span>';
if (diff < 0) return '<span class="trend trend-down">' + fmt(diff) + ' &#9660;</span>';
return '<span class="trend trend-flat">&mdash; &#9654;</span>';
}
// Build a segmented pipeline progress bar
function progressBar(segments, total) {
var html = '';
segments.forEach(function(seg) {
var w = total > 0 ? (seg.count / total * 100) : 0;
if (w > 0) {
html += '<div class="segment" style="width:' + w + '%;background:' +
(seg.color || pipeColors[seg.status] || '#555') + ';" title="' +
(seg.label || pipeLabels[seg.status] || seg.status) + ': ' + fmt(seg.count) + '"></div>';
}
});
return html;
}
// Build legend for pipeline bar
function progressLegend(segments) {
var html = '';
segments.forEach(function(seg) {
if (seg.count > 0) {
html += '<span><span class="legend-dot" style="background:' +
(seg.color || pipeColors[seg.status] || '#555') + ';"></span>' +
(seg.label || pipeLabels[seg.status] || seg.status) + ': ' + fmt(seg.count) + '</span>';
}
});
return html;
}
// Auto-refresh with heartbeat
function startRefresh(callback, intervalMs) {
_heartbeatEl = document.getElementById('heartbeat');
function tick() {
try {
var result = callback();
if (result && typeof result.then === 'function') {
result.then(function() {
if (_heartbeatEl) {
_heartbeatEl.classList.remove('dead');
}
}).catch(function() {
if (_heartbeatEl) {
_heartbeatEl.classList.add('dead');
}
});
} else {
if (_heartbeatEl) _heartbeatEl.classList.remove('dead');
}
} catch(e) {
if (_heartbeatEl) _heartbeatEl.classList.add('dead');
}
}
// Initial load
tick();
var timer = setInterval(tick, intervalMs || 30000);
_refreshTimers.push(timer);
return timer;
}
function stopRefresh(timer) {
if (timer) clearInterval(timer);
}
// Quick-stats loader for header
function loadQuickStats() {
fetchJSON('/api/quick-stats').then(function(data) {
setHTML('qs-docs', fmt(data.catalogued));
setHTML('qs-vectors', fmt(data.vectors));
setHTML('qs-pipeline', fmt(data.in_pipeline));
}).catch(function() {});
}
return {
fetchJSON: fetchJSON,
postJSON: postJSON,
set: set,
setHTML: setHTML,
fmt: fmt,
fmtBytes: fmtBytes,
pct: pct,
trend: trend,
progressBar: progressBar,
progressLegend: progressLegend,
startRefresh: startRefresh,
stopRefresh: stopRefresh,
loadQuickStats: loadQuickStats,
pipeColors: pipeColors,
pipeLabels: pipeLabels
};
})();
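`progressBar` turns per-status counts into percentage widths and skips empty segments. A minimal sketch of that arithmetic without the HTML string building, assuming a hypothetical function named `segmentWidths` (not part of common.js):

```javascript
// Hypothetical helper: compute the percentage width of each non-empty segment,
// using the same rule as RECON.progressBar (count / total * 100, 0 when total is 0).
function segmentWidths(segments, total) {
  var out = [];
  segments.forEach(function (seg) {
    var w = total > 0 ? (seg.count / total * 100) : 0;
    if (w > 0) {
      out.push({status: seg.status, width: w});
    }
  });
  return out;
}

var widths = segmentWidths([
  {status: 'complete', count: 3},
  {status: 'failed', count: 1},
  {status: 'queued', count: 0}
], 4);
```

Segments with a zero count produce no entry, which is why the rendered bar shows only statuses that currently hold documents.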

232
static/js/dashboard.js Normal file

@ -0,0 +1,232 @@
/* RECON Knowledge Dashboard */
(function() {
'use strict';
var pipeColors = RECON.pipeColors;
var pipeLabels = RECON.pipeLabels;
function loadDashboard() {
return RECON.fetchJSON('/api/knowledge-stats').then(function(data) {
var t = data.totals;
// Top cards
RECON.set('kv-catalogued', RECON.fmt(t.catalogued || 0));
RECON.set('kv-pipeline', RECON.fmt(t.in_pipeline || 0));
var pipeSub = document.getElementById('kv-pipeline-sub');
if (t.in_pipeline > 0) {
var active = data.pipeline.filter(function(p) { return ['extracting','enriching','embedding'].indexOf(p.status) >= 0; });
var activeText = active.map(function(p) { return p.count + ' ' + p.status; }).join(', ');
pipeSub.textContent = activeText || 'processing';
} else { pipeSub.textContent = 'idle'; }
RECON.set('kv-complete', RECON.fmt(t.complete || 0));
var failEl = document.getElementById('kv-failed');
failEl.textContent = RECON.fmt(t.failed || 0);
failEl.style.color = t.failed > 0 ? '#ff4444' : '#00ff41';
RECON.set('kv-concepts', RECON.fmt(t.concepts || 0));
RECON.set('kv-vectors', RECON.fmt(t.vectors || 0));
RECON.set('kv-pages', RECON.fmt(t.pages_processed || 0));
// Progress bar
var total = t.catalogued || 1;
var notYetQueued = total - (t.documents || 0);
var segments = [];
if (notYetQueued > 0) {
segments.push({status: 'unqueued', count: notYetQueued, color: '#1a1a1a', label: 'Not queued'});
}
data.pipeline.forEach(function(p) {
if (p.count > 0) segments.push(p);
});
RECON.setHTML('progress-bar', RECON.progressBar(segments, total));
var completePct = total > 0 ? ((t.complete || 0) / total * 100).toFixed(1) : 0;
RECON.set('progress-pct', completePct + '% complete (' + RECON.fmt(t.complete || 0) + ' / ' + RECON.fmt(total) + ')');
// Legend
var legendSegments = [];
if (notYetQueued > 0) legendSegments.push({status: 'unqueued', count: notYetQueued, color: '#1a1a1a', label: 'Not queued'});
data.pipeline.forEach(function(p) { if (p.count > 0) legendSegments.push(p); });
RECON.setHTML('progress-legend', RECON.progressLegend(legendSegments));
// Pipeline activity
var activeStatuses = data.pipeline.filter(function(p) { return ['extracting','enriching','embedding'].indexOf(p.status) >= 0 && p.count > 0; });
var actDiv = document.getElementById('pipeline-activity');
if (activeStatuses.length > 0) {
actDiv.style.display = 'block';
var actHtml = '';
activeStatuses.forEach(function(p) {
actHtml += '<div style="margin:4px 0;"><span style="color:' + (pipeColors[p.status]||'#ffa500') + ';">&#9679; ' + (pipeLabels[p.status]||p.status) + ':</span> ' + p.count + ' documents</div>';
});
if (data.active_titles) {
Object.keys(data.active_titles).forEach(function(st) {
var titles = data.active_titles[st];
if (titles.length > 0) actHtml += '<div style="color:#666;font-size:11px;margin-left:16px;">' + titles.slice(0,3).join(', ') + (titles.length > 3 ? ', ...' : '') + '</div>';
});
}
RECON.setHTML('activity-content', actHtml);
} else { actDiv.style.display = 'none'; }
// Qdrant health
var q = data.qdrant;
var qEl = document.getElementById('qdrant-status');
if (q.error) {
qEl.innerHTML = '<span style="color:#ff4444;">&#9679; Offline</span> &mdash; ' + q.error;
} else {
var idxType = q.index_type || (q.vectors >= 20000 ? 'HNSW' : 'brute-force');
var idxColor = idxType === 'HNSW' ? '#00ff41' : '#ffa500';
qEl.innerHTML = '<span style="color:#00ff41;">&#9679; Online</span> | ' +
RECON.fmt(q.vectors) + ' vectors | ' +
'<span style="color:' + idxColor + ';">' + idxType + '</span>' +
(idxType === 'HNSW' ? ' (' + RECON.fmt(q.indexed||0) + ' indexed)' : ' (HNSW auto-builds at 20K)') +
' | <span style="color:#555;">recon_knowledge</span>';
}
// Sources table
var tbody = document.getElementById('sources-tbody');
var totalCat = 0, totalComp = 0, totalPipe = 0, totalConcepts = 0, totalVectors = 0;
tbody.innerHTML = data.sources.map(function(s) {
var catCount = s.catalogued || 0;
var compCount = s.complete || 0;
var pipeCount = s.in_pipeline || 0;
totalCat += catCount; totalComp += compCount; totalPipe += pipeCount;
totalConcepts += s.concepts || 0; totalVectors += s.vectors || 0;
var badge = s.type === 'web' ? '<span class="badge-web">WEB</span>' : '<span class="badge-pdf">PDF</span>';
var compPct = catCount > 0 ? (compCount / catCount * 100) : 0;
var pipePct = catCount > 0 ? (pipeCount / catCount * 100) : 0;
var compColor = compPct >= 100 ? '#00ff41' : compPct > 0 ? '#ffa500' : '#666';
var pipeColor = pipeCount > 0 ? '#0ea5e9' : '#555';
var barW = 80;
var compW = (compPct / 100 * barW).toFixed(1);
var pipeW = (pipePct / 100 * barW).toFixed(1);
var miniBar = '<div style="display:flex;align-items:center;gap:6px;">' +
'<div style="width:' + barW + 'px;height:10px;background:#1a1a1a;border-radius:3px;overflow:hidden;display:flex;">' +
'<div style="width:' + compW + 'px;background:#16a34a;height:100%;"></div>' +
'<div style="width:' + pipeW + 'px;background:#0284c7;height:100%;"></div>' +
'</div><span style="color:#888;font-size:10px;">' + compPct.toFixed(0) + '%</span></div>';
return '<tr><td>' + s.name + '</td><td>' + badge + '</td><td>' +
RECON.fmt(catCount) + '</td><td><span style="color:' + compColor + ';">' +
RECON.fmt(compCount) + '</span></td><td><span style="color:' + pipeColor + ';">' +
RECON.fmt(pipeCount) + '</span></td><td>' + miniBar + '</td><td>' +
RECON.fmt(s.concepts) + '</td><td>' + RECON.fmt(s.vectors) + '</td></tr>';
}).join('');
RECON.setHTML('sources-tfoot',
'<tr style="border-top:1px solid #333;font-weight:bold;"><td>TOTAL</td><td></td><td>' +
RECON.fmt(totalCat) + '</td><td>' + RECON.fmt(totalComp) + '</td><td>' +
RECON.fmt(totalPipe) + '</td><td></td><td>' +
RECON.fmt(totalConcepts) + '</td><td>' + RECON.fmt(totalVectors) + '</td></tr>');
// Domain bars
var dc = document.getElementById('domain-bars');
var domEntries = Object.entries(data.domains);
if (domEntries.length === 0) {
dc.innerHTML = '<span class="text-dim">No domain data</span>';
} else {
var maxD = Math.max.apply(null, domEntries.map(function(e) { return e[1]; }));
dc.innerHTML = domEntries.map(function(entry) {
var name = entry[0], count = entry[1];
var pct = (count / maxD * 100).toFixed(1);
return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
'<span style="width:160px;text-align:right;color:#aaa;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;">' + name + '</span>' +
'<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
'<div style="height:100%;background:#00cc66;border-radius:3px;width:' + pct + '%;"></div></div>' +
'<span style="width:50px;color:#ccc;text-align:right;">' + RECON.fmt(count) + '</span></div>';
}).join('');
}
// Knowledge Type bars
var ktEl = document.getElementById('knowledge-type-bars');
var ktEntries = Object.entries(data.knowledge_types || {});
var totalKt = ktEntries.reduce(function(a, e) { return a + e[1]; }, 0);
if (ktEntries.length === 0) {
ktEl.innerHTML = '<span class="text-dim">No data yet (migration in progress)</span>';
} else {
var ktColors = {foundational: '#60a5fa', procedural: '#4ade80', operational: '#fbbf24'};
var maxKt = Math.max.apply(null, ktEntries.map(function(e) { return e[1]; }));
ktEl.innerHTML = ktEntries.map(function(entry) {
var name = entry[0], count = entry[1];
var pctVal = totalKt > 0 ? (count / totalKt * 100).toFixed(0) : 0;
var barPct = (count / maxKt * 100).toFixed(1);
var color = ktColors[name] || '#888';
return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
'<span style="width:100px;text-align:right;color:' + color + ';">' + name + '</span>' +
'<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
'<div style="height:100%;background:' + color + ';opacity:0.6;border-radius:3px;width:' + barPct + '%;"></div></div>' +
'<span style="width:80px;color:#ccc;text-align:right;">' + RECON.fmt(count) + ' (' + pctVal + '%)</span></div>';
}).join('');
}
var ktMig = document.getElementById('knowledge-type-migration');
ktMig.textContent = RECON.fmt(totalKt) + ' / ' + RECON.fmt(data.sample_size) + ' migrated';
// Complexity bars
var cxEl = document.getElementById('complexity-bars');
var cxEntries = Object.entries(data.complexities || {});
var totalCx = cxEntries.reduce(function(a, e) { return a + e[1]; }, 0);
if (cxEntries.length === 0) {
cxEl.innerHTML = '<span class="text-dim">No data yet (migration in progress)</span>';
} else {
var cxColors = {basic: '#4ade80', intermediate: '#fbbf24', advanced: '#f87171'};
var maxCx = Math.max.apply(null, cxEntries.map(function(e) { return e[1]; }));
cxEl.innerHTML = cxEntries.map(function(entry) {
var name = entry[0], count = entry[1];
var pctVal = totalCx > 0 ? (count / totalCx * 100).toFixed(0) : 0;
var barPct = (count / maxCx * 100).toFixed(1);
var color = cxColors[name] || '#888';
return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
'<span style="width:100px;text-align:right;color:' + color + ';">' + name + '</span>' +
'<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
'<div style="height:100%;background:' + color + ';opacity:0.6;border-radius:3px;width:' + barPct + '%;"></div></div>' +
'<span style="width:80px;color:#ccc;text-align:right;">' + RECON.fmt(count) + ' (' + pctVal + '%)</span></div>';
}).join('');
}
var cxMig = document.getElementById('complexity-migration');
cxMig.textContent = RECON.fmt(totalCx) + ' / ' + RECON.fmt(data.sample_size) + ' migrated';
// Recent completions
var rtb = document.getElementById('recent-tbody');
if (data.recent_complete.length === 0) {
rtb.innerHTML = '<tr><td colspan="4" class="text-dim">None yet</td></tr>';
} else {
rtb.innerHTML = data.recent_complete.map(function(r) {
var badge = r.type === 'web' ? '<span class="badge-web">WEB</span>' : '<span class="badge-pdf">PDF</span>';
return '<tr><td>' + r.title + '</td><td>' + badge + '</td><td>' +
r.concepts + '</td><td>' + r.vectors + '</td></tr>';
}).join('');
}
});
}
function loadCharts() {
if (typeof ReconChart !== 'undefined') {
ReconChart.loadAndDraw('kb-chart', 'knowledge',
['complete', 'concepts'], ['Completed', 'Concepts'], 24);
}
}
function initSourcesToggle() {
var toggle = document.getElementById('sources-toggle');
var arrow = document.getElementById('sources-arrow');
var thead = document.getElementById('sources-thead');
var tbody = document.getElementById('sources-tbody');
var expanded = localStorage.getItem('recon-sources-expanded') === 'true';
function apply() {
var show = expanded ? '' : 'none';
thead.style.display = show;
tbody.style.display = show;
arrow.innerHTML = expanded ? '&#9660;' : '&#9654;';
}
toggle.addEventListener('click', function() {
expanded = !expanded;
localStorage.setItem('recon-sources-expanded', expanded);
apply();
});
apply();
}
document.addEventListener('DOMContentLoaded', function() {
initSourcesToggle();
RECON.startRefresh(loadDashboard, 30000);
loadCharts();
setInterval(loadCharts, 300000); // refresh charts every 5 min
});
})();

106
static/js/peertube.js Normal file

@ -0,0 +1,106 @@
/* RECON PeerTube Dashboard JS */
(function() {
'use strict';
function loadPTDashboard() {
return RECON.fetchJSON('/api/peertube/dashboard').then(function(data) {
// Video states
var vs = data.video_states || {};
// PeerTube state codes: 1=published, 2=to_transcode, 3=to_import, 4=waiting_for_live, 5=live_ended, 6=to_move_to_external_storage, 7=transcoding_failed, 8=to_edit, 9=waiting_for_live_to_end
var published = vs['1'] || 0;
var inPipeline = (vs['2'] || 0) + (vs['3'] || 0) + (vs['6'] || 0) + (vs['8'] || 0);
var failed = vs['7'] || 0;
RECON.set('pt-published', RECON.fmt(published));
RECON.set('pt-in-pipeline', RECON.fmt(inPipeline));
var failEl = document.getElementById('pt-failed');
failEl.textContent = RECON.fmt(failed);
failEl.style.color = failed > 0 ? '#ff4444' : '#00ff41';
// Import rate from downloader state
var ds = data.downloader_state || {};
var rate = ds.imports_last_hour || 0;
RECON.set('pt-import-rate', RECON.fmt(rate));
// GPU
var gpu = data.gpu || {};
if (gpu.name) {
RECON.set('pt-gpu-util', gpu.utilization_gpu || '—');
RECON.set('pt-gpu-temp', gpu.temperature_gpu || '—');
var gpuPanel = document.getElementById('pt-gpu-panel');
gpuPanel.style.display = 'block';
document.getElementById('pt-gpu-detail').innerHTML =
'<strong>' + gpu.name + '</strong> | VRAM: ' +
RECON.fmt(parseInt(gpu.memory_used || 0, 10)) + ' / ' + RECON.fmt(parseInt(gpu.memory_total || 0, 10)) + ' MiB | ' +
'Util: ' + (gpu.utilization_gpu || '?') + '% | ' +
'Temp: ' + (gpu.temperature_gpu || '?') + '&deg;C';
} else {
RECON.set('pt-gpu-util', '—');
RECON.set('pt-gpu-temp', '—');
document.getElementById('pt-gpu-panel').style.display = 'none';
}
// Services
var svcs = data.services || {};
['downloader', 'importer', 'transcoder', 'runner'].forEach(function(s) {
var el = document.getElementById('svc-' + s);
el.className = 'svc-dot ' + (svcs[s] === 'active' ? 'active' : svcs[s] === 'inactive' ? 'inactive' : 'unknown');
});
// Pipeline dirs
var dirs = data.pipeline_dirs || {};
var storageHtml = '';
var dirOrder = ['staging', 'completed', 'transcoded', 'failed'];
var dirLabels = {staging: 'Downloaded', completed: 'Awaiting Transcode', transcoded: 'Ready to Import', failed: 'Failed'};
var dirColors = {staging: '#b45309', completed: '#0284c7', transcoded: '#7c3aed', failed: '#dc2626'};
var totalVideos = 0;
dirOrder.forEach(function(d) {
var info = dirs[d] || {};
var videos = info.videos || 0;
var bytes = info.bytes || 0;
totalVideos += videos;
storageHtml += '<div class="flex-between" style="margin:4px 0;">' +
'<span><span class="legend-dot" style="background:' + (dirColors[d] || '#555') + ';"></span>' + (dirLabels[d] || d) + '</span>' +
'<span>' + videos + ' videos / ' + RECON.fmtBytes(bytes) + '</span></div>';
});
RECON.setHTML('pt-storage-content', storageHtml);
// Pipeline bar (using video counts)
var segments = dirOrder.map(function(d) {
return {status: d, count: (dirs[d] || {}).videos || 0, color: dirColors[d], label: dirLabels[d] || d};
});
RECON.setHTML('pt-pipeline-bar', RECON.progressBar(segments, totalVideos || 1));
RECON.setHTML('pt-pipeline-legend', RECON.progressLegend(segments));
RECON.set('pt-pipeline-summary', totalVideos + ' videos in pipeline');
// Errors
var errors = data.recent_errors || [];
var errPanel = document.getElementById('pt-errors-panel');
RECON.set('pt-error-count', errors.length);
if (errors.length > 0) {
errPanel.classList.add('has-errors');
var errHtml = '';
errors.forEach(function(e) {
errHtml += '<div class="error-line">' + e + '</div>';
});
RECON.setHTML('pt-errors-content', errHtml);
} else {
errPanel.classList.remove('has-errors');
}
}).catch(function(err) {
console.error('PT dashboard error:', err);
});
}
function loadCharts() {
if (typeof ReconChart !== 'undefined') {
ReconChart.loadAndDraw('pt-chart', 'peertube',
['published', 'backlog'], ['Published', 'Backlog'], 24);
}
}
document.addEventListener('DOMContentLoaded', function() {
RECON.startRefresh(loadPTDashboard, 30000);
loadCharts();
setInterval(loadCharts, 300000);
});
})();

193
static/js/web-ingest.js Normal file

@ -0,0 +1,193 @@
/* RECON Web Ingest page JS */
(function() {
'use strict';
window.showSection = function(name) {
document.getElementById('section-single').style.display = name === 'single' ? '' : 'none';
document.getElementById('section-crawl').style.display = name === 'crawl' ? '' : 'none';
document.getElementById('tab-single').className = 'btn' + (name === 'single' ? ' active' : '');
document.getElementById('tab-crawl').className = 'btn' + (name === 'crawl' ? ' active' : '');
};
window.doWebIngest = async function() {
var btn = document.getElementById('wi-btn');
var status = document.getElementById('wi-status');
var resultsDiv = document.getElementById('wi-results');
var urlText = document.getElementById('wi-urls').value.trim();
var category = document.getElementById('wi-category').value.trim() || 'Web';
if (!urlText) {
status.style.color = '#ff4444';
status.textContent = 'Enter at least one URL';
return;
}
var urls = urlText.split('\n').map(function(u) { return u.trim(); }).filter(function(u) { return u && !u.startsWith('#'); });
if (urls.length === 0) {
status.style.color = '#ff4444';
status.textContent = 'No valid URLs';
return;
}
btn.disabled = true;
status.style.color = '#ffa500';
resultsDiv.style.display = 'none';
if (urls.length === 1) {
status.textContent = 'Fetching and extracting...';
try {
var resp = await fetch('/api/ingest-url', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({ url: urls[0], category: category, process: true })
});
var data = await resp.json();
if (resp.ok || resp.status === 409) {
var color = data.status === 'duplicate' ? '#888' : '#00ff41';
status.style.color = color;
status.textContent = data.status.toUpperCase() + ': ' + (data.title || urls[0]);
resultsDiv.style.display = 'block';
resultsDiv.innerHTML = '<span style="color:' + color + ';">' + data.status.toUpperCase() + '</span><br>' +
'<span class="text-dim">Hash: ' + data.hash + '</span><br>' +
(data.page_count ? '<span class="text-dim">Pages: ' + data.page_count + '</span><br>' : '') +
(data.title ? '<span class="text-dim">Title: ' + data.title + '</span><br>' : '') +
(data.pipeline ? '<span style="color:#00ff41;">Pipeline: enriched ' + (data.pipeline.enriched || 0) + ', embedded ' + (data.pipeline.embedded || 0) + '</span>' : '');
} else {
status.style.color = '#ff4444';
status.textContent = data.error || 'Ingestion failed';
}
} catch (err) {
status.style.color = '#ff4444';
status.textContent = 'Network error: ' + err.message;
}
} else {
status.textContent = 'Processing ' + urls.length + ' URLs...';
try {
var resp = await fetch('/api/ingest-urls', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({ urls: urls, category: category, process: true })
});
var data = await resp.json();
if (resp.ok) {
var s = data.summary;
status.style.color = '#00ff41';
var batchPipe = data.pipeline && data.pipeline.enriched ? ' | enriched: ' + data.pipeline.enriched + ', embedded: ' + data.pipeline.embedded : '';
status.textContent = s.succeeded + ' new, ' + s.duplicates + ' dupes, ' + s.failed + ' failed' + batchPipe;
resultsDiv.style.display = 'block';
var html = '';
for (var i = 0; i < data.results.length; i++) {
var r = data.results[i];
var c = r.status === 'failed' ? '#ff4444' : r.status === 'duplicate' ? '#888' : '#00ff41';
html += '<div style="margin-bottom:4px;"><span style="color:' + c + ';">' +
r.status.toUpperCase() + '</span> ' + (r.title || r.url) + '</div>';
}
resultsDiv.innerHTML = html;
} else {
status.style.color = '#ff4444';
status.textContent = data.error || 'Batch ingestion failed';
}
} catch (err) {
status.style.color = '#ff4444';
status.textContent = 'Network error: ' + err.message;
}
}
btn.disabled = false;
};
window.doCrawl = async function(dryRun) {
var status = document.getElementById('crawl-status');
var resultsDiv = document.getElementById('crawl-results');
var url = document.getElementById('crawl-url').value.trim();
var category = document.getElementById('crawl-category').value.trim() || 'Web';
var maxPages = parseInt(document.getElementById('crawl-max-pages').value, 10) || 500;
var includeRaw = document.getElementById('crawl-include').value.trim();
var excludeRaw = document.getElementById('crawl-exclude').value.trim();
if (!url) {
status.style.color = '#ff4444';
status.textContent = 'Enter a site URL';
return;
}
var include = includeRaw ? includeRaw.split(',').map(function(s) { return s.trim(); }).filter(Boolean) : null;
var exclude = excludeRaw ? excludeRaw.split(',').map(function(s) { return s.trim(); }).filter(Boolean) : null;
var btnP = document.getElementById('crawl-preview-btn');
var btnC = document.getElementById('crawl-btn');
btnP.disabled = true;
btnC.disabled = true;
status.style.color = '#ffa500';
status.textContent = dryRun ? 'Discovering URLs...' : 'Starting crawl...';
resultsDiv.style.display = 'none';
try {
var body = { url: url, category: category, max_pages: maxPages, dry_run: dryRun };
if (include) body.include = include;
if (exclude) body.exclude = exclude;
var resp = await fetch('/api/crawl', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(body)
});
var data = await resp.json();
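// Only render the preview on a successful response; a failed dry run falls through to the error branch below
if (dryRun && resp.ok) {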
var urls = data.urls || [];
status.style.color = '#00ff41';
status.textContent = urls.length + ' URLs found (' + (data.discovery_method || 'unknown') + ')';
resultsDiv.style.display = 'block';
var html = '<div style="color:#00ff41;margin-bottom:8px;">Discovery: ' + (data.discovery_method || 'unknown') + ' — ' + urls.length + ' URLs</div>';
urls.forEach(function(u, i) {
html += '<div class="text-muted">' + (i+1) + '. ' + u + '</div>';
});
resultsDiv.innerHTML = html;
} else if (data.crawl_id) {
status.style.color = '#00ff41';
status.textContent = 'Crawl started — ID: ' + data.crawl_id;
resultsDiv.style.display = 'block';
resultsDiv.innerHTML = '<div style="color:#ffa500;">Crawl running in background...</div>' +
'<div class="text-dim" style="margin-top:4px;">ID: ' + data.crawl_id + '</div>';
pollCrawl(data.crawl_id, resultsDiv);
} else {
status.style.color = '#ff4444';
status.textContent = data.error || 'Crawl failed';
}
} catch (err) {
status.style.color = '#ff4444';
status.textContent = 'Network error: ' + err.message;
}
btnP.disabled = false;
btnC.disabled = false;
};
function pollCrawl(crawlId, resultsDiv) {
var check = async function() {
try {
var resp = await fetch('/api/crawl/' + crawlId + '/status');
var data = await resp.json();
if (data.status === 'running') {
var stageText = data.stage ? ' (' + data.stage + ')' : '';
resultsDiv.innerHTML = '<div style="color:#ffa500;">Pipeline running' + stageText + '...</div>' +
'<div class="text-dim">Site: ' + (data.site || '') + '</div>';
setTimeout(check, 5000);
} else if (data.summary) {
var s = data.summary;
var pipeInfo = data.pipeline ? ' | Enriched: ' + (data.pipeline.enriched || 0) + ' | Embedded: ' + (data.pipeline.embedded || 0) : '';
resultsDiv.innerHTML = '<div style="color:#00ff41;">Pipeline complete!</div>' +
'<div class="text-dim" style="margin-top:4px;">New: ' + s.succeeded + ' | Duplicates: ' + s.duplicates + ' | Failed: ' + s.failed + ' | Total: ' + s.total + pipeInfo + '</div>';
document.getElementById('crawl-status').style.color = '#00ff41';
document.getElementById('crawl-status').textContent = 'Complete: ' + s.succeeded + ' new' + pipeInfo;
} else if (data.error) {
resultsDiv.innerHTML = '<div style="color:#ff4444;">Crawl failed: ' + data.error + '</div>';
}
} catch (err) {
resultsDiv.innerHTML += '<div style="color:#ff4444;">Poll error: ' + err.message + '</div>';
}
};
setTimeout(check, 5000);
}
showSection('single');
})();

115
sweep_gated.sh Executable file

@ -0,0 +1,115 @@
#!/usr/bin/env bash
# sweep_gated.sh — Qdrant-gated sweep wrapper for Stream B.2 Phase 4
# Runs recon.py pipeline sweep in bounded chunks with Qdrant health checks
# between each invocation. Aborts cleanly if Qdrant becomes unreachable.
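#
# Usage sketch (illustrative values only; the localhost URL below is a
# hypothetical override, not a known deployment address):
#   ./sweep_gated.sh
#   MAX_ENTRIES=100 BATCH_SIZE=100 ./sweep_gated.sh
#   QDRANT_URL=http://localhost:6333/collections/recon_knowledge_hybrid ./sweep_gated.sh
#   PLAN_FILE=/opt/recon/data/sweep/sweep_plan.json ./sweep_gated.sh
#
# Exit codes, as set further down in this script:
#   0  sweep complete (a chunk reported zero processed entries)
#   1  Qdrant unreachable before a chunk
#   2  recon.py exited non-zero
#   3  Qdrant unreachable after a chunk (possible filesystem/Qdrant drift)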
set -euo pipefail
QDRANT_URL="${QDRANT_URL:-http://192.168.1.150:6333/collections/recon_knowledge_hybrid}"
BATCH_SIZE="${BATCH_SIZE:-500}"
MAX_ENTRIES="${MAX_ENTRIES:-500}"
PLAN_FILE="${PLAN_FILE:-/opt/recon/data/sweep/sweep_plan.json}"
RECON_DIR="/opt/recon"
# Checkpoint co-locates with plan file: plan.json -> plan_checkpoint.json
CHECKPOINT_FILE="${PLAN_FILE%.json}_checkpoint.json"
log() { echo "[$(date +%Y-%m-%dT%H:%M:%S)] $*"; }
probe_qdrant() {
local resp
resp=$(curl -sf -o /dev/null -w '%{http_code}' --connect-timeout 5 --max-time 10 "$QDRANT_URL" 2>/dev/null) || true
if [ "$resp" = "200" ]; then
return 0
else
return 1
fi
}
report_progress() {
if [ -f "$CHECKPOINT_FILE" ]; then
python3 -c "
import json
cp = json.load(open('$CHECKPOINT_FILE'))
s = cp['stats']
idx = cp['last_completed_index']
print(f' last_completed_index={idx}')
print(f' relocated={s[\"relocated\"]} rescued={s[\"rescued\"]} unclassified={s[\"unclassified_moved\"]}')
print(f' noop={s[\"no_op_marked\"]} dup={s[\"duplicates\"]} skip={s[\"skipped\"]} fail={s[\"failed\"]}')
print(f' qdrant_updated={s[\"qdrant_updated\"]}')
" 2>/dev/null || log " (could not read checkpoint)"
else
log " no checkpoint file at $CHECKPOINT_FILE"
fi
}
parse_processed() {
# Parse the sweep output to count total entries processed this iteration
python3 -c "
import sys, re
lines = sys.stdin.read()
total = 0
for key in ['Relocated', 'Rescued', 'Unclassified moved', 'No-op .marked.', 'Duplicates', 'Skipped', 'Failed']:
m = re.search(key + r':\s+(\d+)', lines)
if m:
total += int(m.group(1))
print(total)
" 2>/dev/null || echo "-1"
}
log "Plan file: $PLAN_FILE"
log "Batch size: $BATCH_SIZE, Max entries per chunk: $MAX_ENTRIES"
iteration=0
while true; do
iteration=$((iteration + 1))
log "=== Iteration $iteration ==="
# Pre-flight Qdrant probe
log "Probing Qdrant at $QDRANT_URL ..."
if ! probe_qdrant; then
log "ABORT: Qdrant unreachable before iteration $iteration"
report_progress
exit 1
fi
log "Qdrant OK"
# Run sweep chunk
log "Running: recon.py pipeline sweep --execute --resume --batch-size $BATCH_SIZE --max-entries $MAX_ENTRIES --plan-file $PLAN_FILE"
set +e
output=$(cd "$RECON_DIR" && python3 recon.py pipeline sweep --execute --resume \
--batch-size "$BATCH_SIZE" --max-entries "$MAX_ENTRIES" --plan-file "$PLAN_FILE" 2>&1)
rc=$?
set -e
echo "$output"
if [ $rc -ne 0 ]; then
log "ABORT: recon.py exited with code $rc"
report_progress
exit 2
fi
# Check if sweep is done (all counters zero = nothing left to process)
processed=$(echo "$output" | parse_processed)
if [ "$processed" = "0" ]; then
log "Sweep complete — nothing left to process"
report_progress
exit 0
fi
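if [ "$processed" = "-1" ]; then
log "WARNING: could not parse sweep output; processed entry count unknown"
else
log "Chunk processed $processed entries"
fi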
# Post-flight Qdrant probe
log "Post-flight Qdrant probe..."
if ! probe_qdrant; then
log "ABORT: Qdrant unreachable after iteration $iteration"
log "Last chunk may have filesystem/Qdrant drift — verify with: recon.py pipeline sweep --verify"
report_progress
exit 3
fi
log "Qdrant still healthy, continuing..."
report_progress
echo
done

39
templates/base.html Normal file

@ -0,0 +1,39 @@
<!DOCTYPE html>
<html>
<head>
<title>RECON // Aurora Intelligence Pipeline{% if page_title %} — {{ page_title }}{% endif %}</title>
<meta charset="utf-8">
<link rel="stylesheet" href="/static/css/recon.css">
</head>
<body>
<div class="header">
<div class="header-left"><h1><span id="heartbeat" class="heartbeat"></span>RECON</h1><span class="header-subtitle">AURORA INTELLIGENCE PIPELINE</span></div>
<div class="flex gap-16">
<div class="quick-stats">
<span>Docs: <span id="qs-docs"></span></span>
<span>Vectors: <span id="qs-vectors"></span></span>
<span>Pipeline: <span id="qs-pipeline"></span></span>
</div>
</div>
</div>
<div class="nav-domain">
<a href="/"{% if domain == 'knowledge' %} class="active"{% endif %}>Knowledge</a>
<a href="/peertube"{% if domain == 'peertube' %} class="active"{% endif %}>PeerTube</a>
<a href="/search"{% if domain == 'search' %} class="active"{% endif %}>Search</a>
<a href="/settings/keys"{% if domain == 'settings' %} class="active"{% endif %}>Settings</a>
</div>
{% if subnav %}
<div class="nav-sub">
{% for item in subnav %}
<a href="{{ item.href }}"{% if item.href == active_page %} class="active"{% endif %}>{{ item.label }}</a>
{% endfor %}
</div>
{% endif %}
<div class="content" id="main">
{% block content %}{% endblock %}
</div>
<script src="/static/js/common.js"></script>
<script>document.addEventListener('DOMContentLoaded', function() { RECON.loadQuickStats(); });</script>
{% block scripts %}{% endblock %}
</body>
</html>


@ -0,0 +1,53 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">Document Catalogue</h3>
{% if sources %}
<div class="mb-16">
<a href="/catalogue" class="btn{% if not current_source %} active{% endif %}" style="margin-right:4px;">All</a>
{% for s in sources %}
<a href="/catalogue?source={{ s }}" class="btn{% if current_source == s %} active{% endif %}" style="margin-right:4px;">{{ s }}</a>
{% endfor %}
</div>
{% endif %}
<div class="text-dim text-xs mb-16">
Showing {{ docs|length }}{% if total_count %} of {{ total_count }}{% endif %} documents
{% if current_source %} in <strong>{{ current_source }}</strong>{% endif %}
(page {{ page }} of {{ total_pages }})
</div>
<table>
<tr><th>Filename</th><th>Source</th><th>Status</th><th>Pages</th><th>Concepts</th><th>Vectors</th></tr>
{% for d in docs %}
<tr>
<td>{{ d.filename or '?' }}</td>
<td>{{ d.source or '' }}</td>
<td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
<td>{{ d.pages_extracted or 0 }}</td>
<td>{{ d.concepts_extracted or 0 }}</td>
<td>{{ d.vectors_inserted or 0 }}</td>
</tr>
{% endfor %}
</table>
{% if total_pages > 1 %}
<div class="pagination">
{% if page > 1 %}
<a href="/catalogue?page={{ page - 1 }}{% if current_source %}&source={{ current_source }}{% endif %}&per_page={{ per_page }}">&laquo;</a>
{% endif %}
{% for p in range(1, total_pages + 1) %}
{% if p == page %}
<span class="current">{{ p }}</span>
{% elif p <= 3 or p > total_pages - 3 or (p >= page - 2 and p <= page + 2) %}
<a href="/catalogue?page={{ p }}{% if current_source %}&source={{ current_source }}{% endif %}&per_page={{ per_page }}">{{ p }}</a>
{% elif p == 4 or p == total_pages - 3 %}
<span class="text-dim">...</span>
{% endif %}
{% endfor %}
{% if page < total_pages %}
<a href="/catalogue?page={{ page + 1 }}{% if current_source %}&source={{ current_source }}{% endif %}&per_page={{ per_page }}">&raquo;</a>
{% endif %}
</div>
{% endif %}
{% endblock %}


@ -0,0 +1,72 @@
{% extends "base.html" %}
{% block content %}
<div id="kb-dashboard">
<div class="stat-grid">
<div class="stat-card"><div class="label">Catalogued</div><div class="value" id="kv-catalogued"></div><div class="sublabel">total known documents</div></div>
<div class="stat-card"><div class="label">In Pipeline</div><div class="value" id="kv-pipeline"></div><div class="sublabel" id="kv-pipeline-sub">processing</div></div>
<div class="stat-card"><div class="label">Complete</div><div class="value" id="kv-complete"></div><div class="sublabel">in Qdrant</div></div>
<div class="stat-card"><div class="label">Failed</div><div class="value" id="kv-failed"></div><div class="sublabel">&nbsp;</div></div>
</div>
<div class="mb-24">
<div class="flex-between mb-16" style="margin-bottom:4px;font-size:11px;color:#888;">
<span id="progress-label">Pipeline Progress</span>
<span id="progress-pct"></span>
</div>
<div id="progress-bar" class="pipeline-bar"></div>
<div id="progress-legend" class="pipeline-legend"></div>
</div>
<div class="stat-grid grid-3">
<div class="stat-card"><div class="label">Concepts</div><div class="value" id="kv-concepts"></div><div class="sublabel">extracted</div></div>
<div class="stat-card"><div class="label">Vectors</div><div class="value" id="kv-vectors"></div><div class="sublabel">in Qdrant</div></div>
<div class="stat-card"><div class="label">Pages</div><div class="value" id="kv-pages"></div><div class="sublabel">processed</div></div>
</div>
<div id="pipeline-activity" class="panel" style="display:none;">
<h3 style="color:#ffa500;font-size:13px;margin-bottom:8px;">Pipeline Activity</h3>
<div id="activity-content" style="font-size:12px;color:#ccc;"></div>
</div>
<div id="qdrant-health" class="panel" style="padding:10px 16px;font-size:12px;color:#888;">
Qdrant: <span id="qdrant-status">checking...</span>
</div>
<div id="kb-chart-container" class="panel" style="display:none;">
<h3 class="section-title" style="margin-bottom:8px;">Pipeline Activity (24h)</h3>
<canvas id="kb-chart" width="800" height="200" style="width:100%;height:200px;"></canvas>
</div>
<h3 class="section-title" id="sources-toggle" style="cursor:pointer;user-select:none;"><span id="sources-arrow">&#9654;</span> Sources</h3>
<table>
<thead id="sources-thead" style="display:none;"><tr><th>Source</th><th>Type</th><th>Catalogued</th><th>Complete</th><th>In Pipeline</th><th>Progress</th><th>Concepts</th><th>Vectors</th></tr></thead>
<tbody id="sources-tbody" style="display:none;"><tr><td colspan="8" class="text-dim">Loading...</td></tr></tbody>
<tfoot id="sources-tfoot"></tfoot>
</table>
<div class="grid-2 mt-24">
<div>
<h3 class="section-title">Domain Distribution</h3>
<div id="domain-bars" class="text-small">Loading...</div>
</div>
<div>
<h3 class="section-title">Knowledge Type</h3>
<div id="knowledge-type-bars" class="text-small">Loading...</div>
<div id="knowledge-type-migration" class="text-small" style="margin-top:6px;color:#666;font-size:11px;"></div>
<h3 class="section-title" style="margin-top:16px;">Complexity</h3>
<div id="complexity-bars" class="text-small">Loading...</div>
<div id="complexity-migration" class="text-small" style="margin-top:6px;color:#666;font-size:11px;"></div>
</div>
</div>
<h3 class="section-title mt-24">Recently Completed</h3>
<table>
<thead><tr><th>Title</th><th>Type</th><th>Concepts</th><th>Vectors</th></tr></thead>
<tbody id="recent-tbody"><tr><td colspan="4" class="text-dim">Loading...</td></tr></tbody>
</table>
</div>
{% endblock %}
{% block scripts %}
<script src="/static/js/charts.js"></script>
<script src="/static/js/dashboard.js"></script>
{% endblock %}


@ -0,0 +1,56 @@
{% extends "base.html" %}
{% block content %}
<h3 style="color:#ff4444;margin-bottom:16px;">Failed Documents</h3>
{% if not failures %}
<p class="text-dim">No failures.</p>
{% else %}
<div style="margin-bottom:16px;">
<button class="btn" id="retry-all-btn" onclick="retryAll()">Retry All ({{ failures|length }})</button>
<span id="retry-all-status" style="margin-left:12px;font-size:12px;"></span>
</div>
<table>
<tr><th>Filename</th><th>Error</th><th>Age</th><th>Retries</th><th>Actions</th></tr>
{% for f in failures %}
<tr>
<td>{{ f.filename or '?' }}</td>
<td style="color:#ff4444;font-size:11px;">{{ (f.error_message or 'unknown')[:100] }}</td>
<td class="text-dim text-xs">{{ f.discovered_at or '' }}</td>
<td>{{ f.retry_count or 0 }}</td>
<td>
<form method="post" action="/api/retry/{{ f.hash }}" style="display:inline;">
<button class="btn" type="submit">Retry</button>
</form>
</td>
</tr>
{% endfor %}
</table>
{% endif %}
{% endblock %}
{% block scripts %}
<script>
async function retryAll() {
var btn = document.getElementById('retry-all-btn');
var status = document.getElementById('retry-all-status');
if (!confirm('Retry all {{ failures|length }} failed documents?')) return;
btn.disabled = true;
status.style.color = '#ffa500';
status.textContent = 'Retrying...';
try {
var resp = await fetch('/api/retry-all', {method: 'POST'});
var data = await resp.json();
if (resp.ok) {
status.style.color = '#00ff41';
status.textContent = 'Retried ' + data.count + ' documents';
setTimeout(function() { location.reload(); }, 2000);
} else {
status.style.color = '#ff4444';
status.textContent = data.error || 'Failed';
}
} catch(e) {
status.style.color = '#ff4444';
status.textContent = 'Error: ' + e.message;
}
btn.disabled = false;
}
</script>
{% endblock %}


@ -0,0 +1,83 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">Upload PDF</h3>
<div class="panel">
<form id="upload-form" enctype="multipart/form-data">
<div class="mb-16">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">PDF File</label>
<input type="file" name="file" accept=".pdf" id="upload-file"
style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:8px;width:100%;font-family:inherit;">
</div>
<div class="mb-16">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
<input type="text" name="category" id="upload-category" list="cat-list" class="search-box"
placeholder="Select or type a category..." style="margin-bottom:0;">
<datalist id="cat-list">{{ options_html|safe }}</datalist>
</div>
<button type="submit" class="btn" id="upload-btn">Upload</button>
<span id="upload-status" style="margin-left:12px;font-size:12px;"></span>
</form>
</div>
<div id="upload-result" style="display:none;" class="panel"></div>
<h3 class="section-title">Recent Documents</h3>
<table>
<tr><th>Filename</th><th>Source</th><th>Status</th></tr>
{% for d in recent %}
<tr>
<td>{{ d.filename or '?' }}</td>
<td>{{ d.source or '' }}</td>
<td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
</tr>
{% endfor %}
</table>
{% endblock %}
{% block scripts %}
<script>
document.getElementById('upload-form').addEventListener('submit', async function(e) {
e.preventDefault();
var btn = document.getElementById('upload-btn');
var status = document.getElementById('upload-status');
var result = document.getElementById('upload-result');
var fileInput = document.getElementById('upload-file');
var category = document.getElementById('upload-category').value;
if (!fileInput.files.length) {
status.style.color = '#ff4444';
status.textContent = 'No file selected';
return;
}
btn.disabled = true;
status.style.color = '#ffa500';
status.textContent = 'Uploading...';
result.style.display = 'none';
var formData = new FormData();
formData.append('file', fileInput.files[0]);
formData.append('category', category);
try {
var resp = await fetch('/api/upload', { method: 'POST', body: formData });
var data = await resp.json();
if (resp.ok) {
status.style.color = '#00ff41';
status.textContent = 'Upload successful';
result.style.display = 'block';
result.innerHTML = '<span style="color:#00ff41;">Queued for processing</span><br>' +
'<span class="text-dim">Hash: ' + data.hash + '</span><br>' +
'<span class="text-dim">File: ' + data.filename + '</span><br>' +
'<span class="text-dim">Category: ' + data.source + '/' + data.category + '</span>';
fileInput.value = '';
} else {
status.style.color = '#ff4444';
status.textContent = data.error || 'Upload failed';
}
} catch (err) {
status.style.color = '#ff4444';
status.textContent = 'Network error: ' + err.message;
}
btn.disabled = false;
});
</script>
{% endblock %}


@ -0,0 +1,76 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">Web Ingest</h3>
<div style="margin-bottom:8px;">
<a href="#single" class="btn active" onclick="showSection('single')" id="tab-single">Single/Batch URL</a>
<a href="#crawl" class="btn" onclick="showSection('crawl')" id="tab-crawl">Site Crawl</a>
</div>
<div id="section-single">
<div class="panel">
<div class="mb-16">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">URL(s) — one per line for batch</label>
<textarea id="wi-urls" class="search-box" rows="4" placeholder="https://example.com/article" style="resize:vertical;margin-bottom:0;"></textarea>
</div>
<div class="mb-16">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
<input type="text" id="wi-category" list="wi-cat-list" class="search-box" value="Web"
placeholder="Category..." style="margin-bottom:0;">
<datalist id="wi-cat-list">{{ options_html|safe }}</datalist>
</div>
<button class="btn" id="wi-btn" onclick="doWebIngest()">Ingest</button>
<span id="wi-status" style="margin-left:12px;font-size:12px;"></span>
</div>
<div id="wi-results" class="panel" style="display:none;max-height:300px;overflow-y:auto;"></div>
</div>
<div id="section-crawl" style="display:none;">
<div class="panel">
<div class="mb-16">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Site URL</label>
<input type="text" id="crawl-url" class="search-box" placeholder="https://example.com" style="margin-bottom:0;">
</div>
<div class="grid-2 mb-16">
<div>
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
<input type="text" id="crawl-category" list="wi-cat-list" class="search-box" value="Web" style="margin-bottom:0;">
</div>
<div>
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Max Pages</label>
<input type="number" id="crawl-max-pages" class="search-box" value="500" min="1" max="5000" style="margin-bottom:0;">
</div>
</div>
<div class="grid-2 mb-16">
<div>
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Include Paths (comma-separated)</label>
<input type="text" id="crawl-include" class="search-box" placeholder="/docs/, /blog/" style="margin-bottom:0;">
</div>
<div>
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Exclude Paths (comma-separated)</label>
<input type="text" id="crawl-exclude" class="search-box" placeholder="/search, /login" style="margin-bottom:0;">
</div>
</div>
<button class="btn" id="crawl-preview-btn" onclick="doCrawl(true)">Preview</button>
<button class="btn" id="crawl-btn" onclick="doCrawl(false)" style="margin-left:8px;">Crawl &amp; Ingest</button>
<span id="crawl-status" style="margin-left:12px;font-size:12px;"></span>
</div>
<div id="crawl-results" class="panel" style="display:none;max-height:400px;overflow-y:auto;font-size:12px;"></div>
</div>
<h3 class="section-title mt-24">Recent Web Ingestions</h3>
<table>
<tr><th>Title</th><th>Source/Category</th><th>Status</th><th>Pages</th><th>Concepts</th></tr>
{% for d in web_docs %}
<tr>
<td title="{{ d.path or '' }}" style="max-width:400px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;">{{ d.book_title or d.filename or '?' }}</td>
<td>{{ d.source or '' }}/{{ d.category or '' }}</td>
<td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
<td>{{ d.pages_extracted or 0 }}</td>
<td>{{ d.concepts_extracted or 0 }}</td>
</tr>
{% endfor %}
</table>
{% endblock %}
{% block scripts %}
<script src="/static/js/web-ingest.js"></script>
{% endblock %}
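
Note on the form above: the URL textarea takes one URL per line and the include/exclude fields take comma-separated paths. The handling lives in `/static/js/web-ingest.js`, which is not part of this template, so the following is only a minimal sketch of how those inputs could be normalized before being sent to the server. The names `parseUrlList` and `parsePathList` are hypothetical and are not taken from the actual script.

```javascript
// Hypothetical helpers; the real web-ingest.js may parse input differently.

// Split textarea content into trimmed, de-duplicated http(s) URLs.
function parseUrlList(text) {
  var seen = Object.create(null);
  return String(text || '')
    .split(/\r?\n/)
    .map(function (line) { return line.trim(); })
    .filter(function (line) {
      // Skip blank lines, non-http(s) entries, and repeats.
      if (!/^https?:\/\//i.test(line) || seen[line]) return false;
      seen[line] = true;
      return true;
    });
}

// Split a comma-separated path field into trimmed, non-empty entries.
function parsePathList(text) {
  return String(text || '')
    .split(',')
    .map(function (p) { return p.trim(); })
    .filter(function (p) { return p.length > 0; });
}
```

Dropping blank and duplicate lines on the client keeps a batch request small, but the server should still validate every URL it receives.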

@ -0,0 +1,53 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">PeerTube Channels</h3>
<div class="stat-grid" id="pt-stats" style="margin-bottom:24px;">
<div class="stat-card"><div class="value" id="pt-total-ch"></div><div class="label">Channels</div></div>
<div class="stat-card"><div class="value" id="pt-total-vid"></div><div class="label">Videos</div></div>
<div class="stat-card"><div class="value" id="pt-dl-status"></div><div class="label">Downloader</div></div>
</div>
<div class="panel">
<div class="flex gap-8" style="flex-wrap:wrap;align-items:flex-end;margin-bottom:12px;">
<div style="flex:1;min-width:250px;">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">YouTube URL</label>
<input type="text" id="pt-yt-url" class="search-box" placeholder="https://www.youtube.com/@ChannelName" style="margin-bottom:0;width:100%;">
</div>
<div style="min-width:150px;">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
<input type="text" id="pt-category" list="pt-cat-list" class="search-box" placeholder="e.g. OPSEC/Privacy" style="margin-bottom:0;width:100%;">
<datalist id="pt-cat-list"></datalist>
</div>
<div style="min-width:60px;">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Priority</label>
<select id="pt-priority" style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px 10px;font-family:inherit;font-size:12px;width:100%;">
<option value="M">M</option>
<option value="H">H</option>
<option value="L">L</option>
</select>
</div>
<button class="btn" id="pt-add-btn" onclick="addChannel()">Add Channel</button>
</div>
<div id="pt-feedback" style="font-size:12px;min-height:18px;"></div>
</div>
<div style="background:#111;border:1px solid #222;overflow-x:auto;">
<table style="width:100%;border-collapse:collapse;font-size:12px;" id="pt-channel-table">
<thead>
<tr style="border-bottom:1px solid #222;">
<th style="text-align:left;padding:10px;">Channel</th>
<th style="text-align:center;padding:10px;">Videos</th>
<th style="text-align:left;padding:10px;">Category</th>
<th style="text-align:center;padding:10px;">Pri</th>
<th style="text-align:center;padding:10px;">Status</th>
<th style="text-align:center;padding:10px;width:60px;"></th>
</tr>
</thead>
<tbody id="pt-channel-tbody"><tr><td colspan="6" style="text-align:center;padding:20px;color:#555;">Loading...</td></tr></tbody>
</table>
</div>
{% endblock %}
{% block scripts %}
<script src="/static/js/channels.js"></script>
{% endblock %}

@ -0,0 +1,53 @@
{% extends "base.html" %}
{% block content %}
<div id="pt-dashboard">
<div class="stat-grid" style="grid-template-columns:repeat(6, 1fr);">
<div class="stat-card"><div class="label">Published</div><div class="value" id="pt-published"></div></div>
<div class="stat-card"><div class="label">In Pipeline</div><div class="value" id="pt-in-pipeline"></div></div>
<div class="stat-card"><div class="label">Failed</div><div class="value" id="pt-failed"></div></div>
<div class="stat-card"><div class="label">Import Rate</div><div class="value" id="pt-import-rate"></div><div class="sublabel">/hour</div></div>
<div class="stat-card"><div class="label">GPU Util</div><div class="value" id="pt-gpu-util"></div><div class="sublabel">%</div></div>
<div class="stat-card"><div class="label">GPU Temp</div><div class="value" id="pt-gpu-temp"></div><div class="sublabel">&deg;C</div></div>
</div>
<div class="mb-24">
<div class="flex-between" style="margin-bottom:4px;font-size:11px;color:#888;">
<span>Pipeline Flow</span>
<span id="pt-pipeline-summary"></span>
</div>
<div id="pt-pipeline-bar" class="pipeline-bar"></div>
<div id="pt-pipeline-legend" class="pipeline-legend"></div>
</div>
<div class="svc-row">
<div class="svc-item"><span class="svc-dot unknown" id="svc-downloader"></span>Downloader</div>
<div class="svc-item"><span class="svc-dot unknown" id="svc-importer"></span>Importer</div>
<div class="svc-item"><span class="svc-dot unknown" id="svc-transcoder"></span>Transcoder</div>
<div class="svc-item"><span class="svc-dot unknown" id="svc-runner"></span>Runner</div>
</div>
<div id="pt-gpu-panel" class="panel" style="display:none;">
<h3 class="section-title" style="margin-bottom:8px;">GPU Status</h3>
<div id="pt-gpu-detail" class="text-small text-muted"></div>
</div>
<div id="pt-chart-container" class="panel" style="display:none;">
<h3 class="section-title" style="margin-bottom:8px;">Pipeline Activity (24h)</h3>
<canvas id="pt-chart" width="800" height="200" style="width:100%;height:200px;"></canvas>
</div>
<div id="pt-storage" class="panel">
<h3 class="section-title" style="margin-bottom:12px;">Pipeline Storage</h3>
<div id="pt-storage-content" class="text-small text-muted">Loading...</div>
</div>
<details id="pt-errors-panel" class="errors-panel panel">
<summary>Recent Errors (<span id="pt-error-count">0</span>)</summary>
<div id="pt-errors-content" style="margin-top:8px;"></div>
</details>
</div>
{% endblock %}
{% block scripts %}
<script src="/static/js/charts.js"></script>
<script src="/static/js/peertube.js"></script>
{% endblock %}

41
templates/search.html Normal file

@ -0,0 +1,41 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">Semantic Search</h3>
<form method="get" action="/search">
<input type="text" name="q" class="search-box" placeholder="Search the knowledge base..." value="{{ query or '' }}" autofocus>
</form>
{% if not query %}
<p class="text-dim text-small" style="margin-top:8px;">Enter a query to search across all embedded concepts.</p>
{% elif results is defined %}
<p class="text-dim text-small mb-16">{{ results|length }} results for: <strong class="text-green">{{ query }}</strong></p>
{% for r in results %}
<div class="result">
<span class="score">{{ '%.4f'|format(r.score) }}</span>
<div class="title">{{ r.title }}</div>
<div class="meta">
{{ r.citation }}
{% if r.download_url %}
{% if r.source_type == 'web' or (r.download_url.startswith('http') and 'files.echo6.co' not in r.download_url) %}
| <a href="{{ r.download_url }}" target="_blank" style="color:#00bfff;text-decoration:none;">Web</a>
{% else %}
| <a href="{{ r.download_url }}" style="color:#00bfff;text-decoration:none;">PDF</a>
{% endif %}
{% endif %}
{% if r.knowledge_type %}| {{ r.knowledge_type }}{% endif %}
{% if r.complexity %}/ {{ r.complexity }}{% endif %}
</div>
<div class="content-text">{{ r.summary }}</div>
<div style="margin-top:6px;">
{% for d in r.domains %}
<span class="domain-tag">{{ d }}</span>
{% endfor %}
</div>
</div>
{% endfor %}
{% elif error %}
<p style="color:#ff4444;">Search error: {{ error }}</p>
{% endif %}
{% endblock %}

@ -0,0 +1,94 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">YouTube Cookies</h3>
<div class="panel">
<div id="cookie-status" style="margin-bottom:16px;font-size:12px;color:#666;">Loading cookie status...</div>
<div class="mb-16">
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Cookies.txt File (Netscape format)</label>
<input type="file" id="cookie-file" accept=".txt"
style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:8px;width:100%;font-family:inherit;">
</div>
<button class="btn" id="cookie-btn" onclick="uploadCookies()">Upload Cookies</button>
<span id="cookie-upload-status" style="margin-left:12px;font-size:12px;"></span>
<div id="cookie-result" style="display:none;background:#0a0a0a;border:1px solid #222;padding:12px;margin-top:16px;font-size:11px;white-space:pre-wrap;color:#888;max-height:200px;overflow-y:auto;"></div>
</div>
{% endblock %}
{% block scripts %}
<script>
async function loadCookieStatus() {
try {
var resp = await fetch('/api/cookies/status');
var data = await resp.json();
if (resp.ok) {
var age = data.age_hours;
var ageStr, ageColor;
if (age < 24) {
ageStr = Math.round(age) + ' hours ago';
ageColor = '#00ff41';
} else {
var days = Math.round(age / 24);
ageStr = days + ' days ago';
ageColor = days > 14 ? '#ff4444' : days > 7 ? '#ffa500' : '#00ff41';
}
var html = '<span style="color:' + ageColor + ';">Last updated: ' + ageStr + '</span>';
if (data.is_stale) {
html += ' <span style="color:#ff4444;font-weight:bold;">[STALE - cookies likely expired]</span>';
}
if (data.recent_rate_limits > 0) {
html += '<br><span style="color:#ffa500;">YouTube rate limits in last 30min: ' + data.recent_rate_limits + '</span>';
}
html += '<br><span class="text-faint">Downloader: ' + (data.downloader_active ? 'active' : 'stopped') + '</span>';
document.getElementById('cookie-status').innerHTML = html;
} else {
document.getElementById('cookie-status').innerHTML = '<span class="text-red">Could not check cookie status</span>';
}
} catch(e) {
document.getElementById('cookie-status').innerHTML = '<span class="text-red">Error: ' + e.message + '</span>';
}
}
async function uploadCookies() {
var fileInput = document.getElementById('cookie-file');
var btn = document.getElementById('cookie-btn');
var status = document.getElementById('cookie-upload-status');
var result = document.getElementById('cookie-result');
if (!fileInput.files.length) {
status.style.color = '#ff4444';
status.textContent = 'No file selected';
return;
}
btn.disabled = true;
status.style.color = '#ffa500';
status.textContent = 'Uploading and testing cookies...';
result.style.display = 'none';
var formData = new FormData();
formData.append('file', fileInput.files[0]);
try {
var resp = await fetch('/api/cookies/upload', { method: 'POST', body: formData });
var data = await resp.json();
if (data.ok) {
status.style.color = '#00ff41';
status.textContent = 'Cookies updated and verified';
result.style.display = 'block';
result.style.borderColor = '#00ff41';
result.innerHTML = '<span style="color:#00ff41;">SUCCESS</span><br>' + (data.test_output || '') + '<br>Data lines: ' + data.data_lines;
loadCookieStatus();
} else {
status.style.color = data.error ? '#ff4444' : '#ffa500';
status.textContent = data.error || data.message || 'Upload issue';
if (data.test_output) {
result.style.display = 'block';
result.style.borderColor = '#ff4444';
result.textContent = data.test_output;
}
}
} catch(e) {
status.style.color = '#ff4444';
status.textContent = 'Network error: ' + e.message;
}
btn.disabled = false;
}
loadCookieStatus();
</script>
{% endblock %}
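
Note on `loadCookieStatus()` above: the age label and color are computed inline, which makes the thresholds (under 24 hours, over 7 days, over 14 days) hard to test. A minimal sketch of the same mapping as a pure function follows. The name `formatCookieAge` and the `'unknown'` fallback for a missing `age_hours` are assumptions added for illustration; the singular forms ("1 hour", "1 day") are also an addition that the template does not have.

```javascript
// Hypothetical helper mirroring the thresholds used in loadCookieStatus().
function formatCookieAge(ageHours) {
  // Assumed fallback: the template does not handle a missing age_hours.
  if (typeof ageHours !== 'number' || !isFinite(ageHours) || ageHours < 0) {
    return { text: 'unknown', color: '#666' };
  }
  if (ageHours < 24) {
    var hours = Math.round(ageHours);
    return { text: hours + (hours === 1 ? ' hour ago' : ' hours ago'), color: '#00ff41' };
  }
  var days = Math.round(ageHours / 24);
  // Same color bands as the template: red past 14 days, orange past 7.
  var color = days > 14 ? '#ff4444' : days > 7 ? '#ffa500' : '#00ff41';
  return { text: days + (days === 1 ? ' day ago' : ' days ago'), color: color };
}
```

With a helper like this, the template code would reduce to building the `<span>` from the returned `text` and `color`.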

@ -0,0 +1,68 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">Service Health</h3>
<div id="health-grid" class="stat-grid" style="grid-template-columns:repeat(auto-fit, minmax(250px, 1fr));">
<div class="stat-card">
<div class="label">Qdrant</div>
<div class="value text-small" id="h-qdrant"><span class="svc-dot unknown"></span>Checking...</div>
</div>
<div class="stat-card">
<div class="label">TEI Embeddings</div>
<div class="value text-small" id="h-tei"><span class="svc-dot unknown"></span>Checking...</div>
</div>
<div class="stat-card">
<div class="label">NFS Mount</div>
<div class="value text-small" id="h-nfs"><span class="svc-dot unknown"></span>Checking...</div>
</div>
<div class="stat-card">
<div class="label">Gemini API</div>
<div class="value text-small" id="h-gemini"><span class="svc-dot unknown"></span>Checking...</div>
</div>
</div>
<h3 class="section-title mt-24">Pipeline Status</h3>
<div id="h-pipeline" class="panel text-small text-dim">Loading...</div>
{% endblock %}
{% block scripts %}
<script>
async function loadHealth() {
try {
var resp = await fetch('/api/health');
var data = await resp.json();
var c = data.components || {};
function dot(status) {
var cls = (status === 'up' || status === 'configured') ? 'active' : 'inactive';
return '<span class="svc-dot ' + cls + '"></span>';
}
var q = c.qdrant || {};
document.getElementById('h-qdrant').innerHTML = dot(q.status) + (q.status === 'up' ? 'Online — ' + RECON.fmt(q.vectors) + ' vectors' : 'Offline' + (q.error ? ' — ' + q.error : ''));
var t = c.tei || {};
document.getElementById('h-tei').innerHTML = dot(t.status) + (t.status === 'up' ? 'Online' : 'Offline' + (t.error ? ' — ' + t.error : ''));
var n = c.nfs || {};
document.getElementById('h-nfs').innerHTML = dot(n.status) + (n.status === 'up' ? 'Mounted' : 'Not mounted');
var g = c.gemini || {};
document.getElementById('h-gemini').innerHTML = dot(g.status === 'configured' ? 'up' : 'down') + (g.status === 'configured' ? g.keys + ' keys configured' : 'No keys');
// Pipeline
var p = data.pipeline || {};
var html = '';
Object.keys(p).forEach(function(k) {
html += '<div style="margin:4px 0;"><span class="status status-' + k + '">' + k + '</span>: ' + p[k] + '</div>';
});
document.getElementById('h-pipeline').innerHTML = html || '<span class="text-dim">No pipeline data</span>';
} catch(e) {
['h-qdrant', 'h-tei', 'h-nfs', 'h-gemini'].forEach(function(id) {
document.getElementById(id).innerHTML = '<span class="svc-dot inactive"></span>Error: ' + e.message;
});
}
}
document.addEventListener('DOMContentLoaded', function() {
RECON.startRefresh(loadHealth, 30000);
});
</script>
{% endblock %}
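
Note on the inline scripts in these templates: several of them concatenate server-provided strings (`q.error`, `data.error`, `data.test_output`, `e.message`) directly into `innerHTML`. If any of those values can contain markup, it will be interpreted as HTML. A minimal sketch of an escaping helper is shown below. The name `escapeHtml` is hypothetical; nothing in the templates shown here defines it, and the shared `RECON` object may already offer something equivalent.

```javascript
// Hypothetical helper: escape the five characters that matter in HTML text
// and quoted attribute values before inserting a string via innerHTML.
function escapeHtml(value) {
  if (value === null || value === undefined) return '';
  return String(value)
    .replace(/&/g, '&amp;')   // must run first so later entities stay intact
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```

A call site would then look like `'Offline' + (q.error ? ' - ' + escapeHtml(q.error) : '')`. Where no markup is needed at all, assigning to `textContent` avoids the problem entirely.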

@ -0,0 +1,137 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">API Keys</h3>
<div style="margin-bottom:20px;">
<button class="btn" onclick="validateAll()" id="btn-validate">Validate All</button>
<button class="btn" onclick="reloadKeys()" style="margin-left:8px;">Reload from .env</button>
<button class="btn btn-warn" onclick="restartService()" style="margin-left:8px;">Restart Service</button>
<span id="validate-status" style="margin-left:12px;color:#666;font-size:12px;"></span>
</div>
<table id="keys-table">
<tr><th>#</th><th>Key</th><th>Status</th><th>Calls</th><th>Errors</th><th>Last Used</th><th>Actions</th></tr>
{% for k in keys_data %}
<tr id="key-row-{{ k.index }}">
<td>{{ k.index + 1 }}</td>
<td class="mono text-small">{{ k.masked }}</td>
<td>
{% if k.valid is true %}
<span class="text-green">Valid</span>
{% elif k.valid is false %}
<span class="text-red">Invalid</span>
{% else %}
<span class="text-dim">&mdash;</span>
{% endif %}
</td>
<td>{{ k.calls }}</td>
<td class="{% if k.errors %}text-red{% else %}text-muted{% endif %}">{{ k.errors }}</td>
<td class="text-dim text-xs">{% if k.last_used %}{{ k.last_used }}{% else %}&mdash;{% endif %}</td>
<td>
<button class="btn text-xs" onclick="validateKey({{ k.index }})">Test</button>
<button class="btn btn-danger text-xs" onclick="removeKey({{ k.index }})">Remove</button>
</td>
</tr>
{% endfor %}
</table>
<div style="margin-top:24px;border-top:1px solid #222;padding-top:16px;">
<h4 class="text-muted" style="margin-bottom:12px;">Add Key</h4>
<div class="flex gap-8" style="align-items:center;">
<input type="text" id="new-key" placeholder="Paste Gemini API key..."
style="flex:1;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px 12px;border-radius:4px;font-family:monospace;font-size:13px;">
<button class="btn" onclick="addKey()">Add</button>
</div>
<div id="add-result" style="margin-top:8px;font-size:12px;"></div>
</div>
<div style="margin-top:24px;border-top:1px solid #222;padding-top:16px;">
<h4 class="text-muted" style="margin-bottom:12px;">Replace Key</h4>
<div class="flex gap-8" style="align-items:center;">
<input type="number" id="replace-index" placeholder="#" min="1" max="9"
style="width:50px;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px;border-radius:4px;text-align:center;">
<input type="text" id="replace-key" placeholder="New Gemini API key..."
style="flex:1;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px 12px;border-radius:4px;font-family:monospace;font-size:13px;">
<button class="btn" onclick="replaceKey()">Replace</button>
</div>
<div id="replace-result" style="margin-top:8px;font-size:12px;"></div>
</div>
{% endblock %}
{% block scripts %}
<script>
async function validateAll() {
document.getElementById('btn-validate').disabled = true;
document.getElementById('validate-status').textContent = 'Validating...';
try {
var r = await fetch('/api/keys/validate', {method:'POST'});
var data = await r.json();
document.getElementById('validate-status').textContent = 'Done — ' + data.results.filter(function(res){return res.valid;}).length + '/' + data.results.length + ' valid';
setTimeout(function() { location.reload(); }, 1000);
} catch(e) {
document.getElementById('validate-status').textContent = 'Error: ' + e;
}
document.getElementById('btn-validate').disabled = false;
}
async function validateKey(idx) {
try {
var r = await fetch('/api/keys/' + idx + '/validate', {method:'POST'});
var data = await r.json();
alert('Key ' + (idx+1) + ': ' + data.message);
location.reload();
} catch(e) { alert('Error: ' + e); }
}
async function removeKey(idx) {
if (!confirm('Remove key ' + (idx+1) + '? Pipeline needs at least 1 key.')) return;
try {
var r = await fetch('/api/keys/' + idx, {method:'DELETE'});
var data = await r.json();
if (data.error) { alert(data.error); return; }
location.reload();
} catch(e) { alert('Error: ' + e); }
}
async function addKey() {
var key = document.getElementById('new-key').value.trim();
if (!key) return;
try {
var r = await fetch('/api/keys', {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify({key:key})});
var data = await r.json();
if (data.error) { document.getElementById('add-result').innerHTML = '<span class="text-red">' + data.error + '</span>'; return; }
document.getElementById('add-result').innerHTML = '<span class="text-green">Added at position ' + (data.index+1) + '</span>';
setTimeout(function() { location.reload(); }, 1000);
} catch(e) { document.getElementById('add-result').innerHTML = '<span class="text-red">' + e + '</span>'; }
}
async function replaceKey() {
var idx = parseInt(document.getElementById('replace-index').value, 10) - 1;
var key = document.getElementById('replace-key').value.trim();
if (isNaN(idx) || idx < 0 || !key) return;
try {
var r = await fetch('/api/keys/' + idx, {method:'PUT', headers:{'Content-Type':'application/json'}, body:JSON.stringify({key:key})});
var data = await r.json();
if (data.error) { document.getElementById('replace-result').innerHTML = '<span class="text-red">' + data.error + '</span>'; return; }
document.getElementById('replace-result').innerHTML = '<span class="text-green">Replaced key ' + (idx+1) + '</span>';
setTimeout(function() { location.reload(); }, 1000);
} catch(e) { document.getElementById('replace-result').innerHTML = '<span class="text-red">' + e + '</span>'; }
}
async function restartService() {
if (!confirm('Restart RECON service? Pipeline will pause for ~10 seconds.')) return;
document.getElementById('validate-status').textContent = 'Restarting...';
try {
await fetch('/api/service/restart', {method:'POST'});
} catch(e) {}
document.getElementById('validate-status').innerHTML = '<span style="color:#ff8800;">Restarting... page will reload in 30s</span>';
setTimeout(function() { location.reload(); }, 30000);
}
async function reloadKeys() {
try {
var r = await fetch('/api/keys/reload', {method:'POST'});
var data = await r.json();
alert('Reloaded ' + data.count + ' key(s) from .env');
location.reload();
} catch(e) { alert('Error: ' + e); }
}
</script>
{% endblock %}
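
Note on `replaceKey()` above: the table shows keys as 1-based row numbers (`k.index + 1`), while the API routes take the 0-based `k.index`. The conversion is easy to get wrong at the edges, so here is a minimal sketch of a validating converter. The name `parseKeyIndex` and the `keyCount` argument are assumptions; the template does not currently expose the number of keys to the script.

```javascript
// Hypothetical helper: turn the 1-based row number typed by the user into
// the 0-based index used by /api/keys/<idx>. Returns null when invalid.
function parseKeyIndex(raw, keyCount) {
  var text = String(raw).trim();
  // Accept plain non-negative integers only (no signs, decimals, or blanks).
  if (!/^\d+$/.test(text)) return null;
  var n = parseInt(text, 10);
  if (n < 1 || n > keyCount) return null;
  return n - 1;
}
```

Returning `null` rather than `NaN` lets the caller show a specific message instead of silently doing nothing.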

@ -0,0 +1,97 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">NordVPN</h3>
<div class="panel">
<div id="vpn-status" style="margin-bottom:16px;font-size:12px;color:#666;">Loading VPN status...</div>
<div class="flex gap-8" style="flex-wrap:wrap;margin-bottom:12px;">
<button class="btn" onclick="vpnRotate()" id="vpn-rotate-btn">Rotate</button>
<button class="btn" onclick="vpnDisconnect()" id="vpn-disconnect-btn">Disconnect</button>
<select id="vpn-country" style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px;font-family:inherit;font-size:12px;">
<option value="United_States">United States</option>
<option value="Canada">Canada</option>
<option value="United_Kingdom">United Kingdom</option>
<option value="Germany">Germany</option>
<option value="Netherlands">Netherlands</option>
<option value="Sweden">Sweden</option>
</select>
<button class="btn" onclick="vpnConnect()" id="vpn-connect-btn">Connect</button>
</div>
<span id="vpn-action-status" style="font-size:12px;"></span>
<details style="margin-top:16px;">
<summary class="text-faint" style="cursor:pointer;font-size:11px;">Setup (one-time)</summary>
<div style="margin-top:8px;">
<input type="password" id="vpn-token" placeholder="NordVPN token"
style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px;width:300px;font-family:inherit;font-size:12px;">
<button class="btn" onclick="vpnLogin()">Login</button>
<span id="vpn-login-status" style="font-size:11px;margin-left:8px;"></span>
</div>
</details>
</div>
{% endblock %}
{% block scripts %}
<script>
async function loadVpnStatus() {
try {
var resp = await fetch('/api/vpn/status');
var data = await resp.json();
if (resp.ok) {
var dot = data.connected ? '<span style="color:#00ff41;">&#9679;</span>' : '<span style="color:#ff4444;">&#9679;</span>';
var html = dot + ' ' + (data.connected ? 'Connected' : 'Disconnected');
if (data.connected) {
html += ' &mdash; <span style="color:#00ff41;">' + data.country + '</span>';
html += ' <span class="text-faint">(' + data.ip + ')</span>';
}
if (data.rotations_today > 0) {
html += '<br><span class="text-faint">Rotations today: ' + data.rotations_today + '</span>';
}
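document.getElementById('vpn-status').innerHTML = html;
} else {
document.getElementById('vpn-status').innerHTML = '<span class="text-red">Could not check VPN status</span>';
}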
} catch(e) {
document.getElementById('vpn-status').innerHTML = '<span class="text-red">Error: ' + e.message + '</span>';
}
}
async function vpnAction(url, opts, statusEl) {
var el = document.getElementById(statusEl || 'vpn-action-status');
el.style.color = '#ffa500';
el.textContent = 'Working...';
try {
var resp = await fetch(url, opts);
var data = await resp.json();
if (data.ok) {
el.style.color = '#00ff41';
el.textContent = data.country ? (data.country + ' (' + data.ip + ')') : (data.message || 'Done');
} else {
el.style.color = '#ff4444';
el.textContent = data.error || data.message || 'Failed';
}
loadVpnStatus();
} catch(e) {
el.style.color = '#ff4444';
el.textContent = 'Error: ' + e.message;
}
}
function vpnRotate() { vpnAction('/api/vpn/rotate', {method:'POST'}); }
function vpnDisconnect() { vpnAction('/api/vpn/disconnect', {method:'POST'}); }
function vpnConnect() {
var country = document.getElementById('vpn-country').value;
vpnAction('/api/vpn/connect', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({country: country})
});
}
function vpnLogin() {
var token = document.getElementById('vpn-token').value;
if (!token) return;
vpnAction('/api/vpn/login', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({token: token})
}, 'vpn-login-status');
}
loadVpnStatus();
</script>
{% endblock %}
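
Note on `vpnAction()` above: the function mixes the network call, the response interpretation, and the DOM update. A minimal sketch of the interpretation step as a pure function is shown below, which makes the success and failure branches testable without a browser. The name `describeVpnResult` is hypothetical; the response fields (`ok`, `country`, `ip`, `message`, `error`) are the ones the template already reads.

```javascript
// Hypothetical helper: map a /api/vpn/* JSON response to the color and text
// that vpnAction() writes into the status element.
function describeVpnResult(data) {
  if (data && data.ok) {
    var text = data.country
      ? data.country + ' (' + data.ip + ')'
      : (data.message || 'Done');
    return { color: '#00ff41', text: text };
  }
  // Fall back to a generic label when the server gives no reason.
  return { color: '#ff4444', text: (data && (data.error || data.message)) || 'Failed' };
}
```

Inside `vpnAction()`, the two branches would then collapse to assigning `el.style.color` and `el.textContent` from the returned object.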