mirror of
https://github.com/zvx-echo6/recon.git
synced 2026-06-10 00:44:37 +02:00
Initial commit: RECON codebase baseline
Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete). Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
commit
563c16bb71
59 changed files with 18327 additions and 0 deletions
26
.gitignore
vendored
Normal file
@@ -0,0 +1,26 @@
# Python
venv/
__pycache__/
*.pyc
*.pyo

# Secrets
.env

# Runtime data
data/
logs/
pipeline.log
recon.db

# Backups
*.bak
*.bak-*
*.bak.*
*.bak2.*

# Junk
-.png

# OS
.DS_Store
785
PROJECT-BIBLE.md
Normal file
@@ -0,0 +1,785 @@
# RECON Project Bible v2.0

*Last updated: 2026-02-16*

---

## 1. Mission Statement

RECON (Reconnaissance, Extraction, Conceptualization, and Operationalization of kNowledge) is a knowledge extraction pipeline that processes PDFs and web content into structured concepts stored in a Qdrant vector database. These concepts power Aurora, the RAG-enabled AI assistant running on OpenWebUI.

**The core loop:** Content in (PDF/web) -> Text extracted -> Concepts enriched (Gemini) -> Vectors embedded (TEI/BGE-M3) -> Searchable knowledge (Qdrant) -> Aurora answers questions with citations.

---

## 2. Infrastructure

### Hosts

| Host | IP (Tailscale) | Role |
|------|---------------|------|
| recon LXC | 100.64.0.24 (CT 130 on toc) | RECON application, dashboard, pipeline |
| cortex VM | 100.64.0.14 (VM 150 on toc) | Qdrant, TEI, Ollama, OpenWebUI |
| pi-nas | 100.64.0.21 (192.168.1.245) | NFS file server for PDF library |
| Contabo VPS | 100.64.0.1 (5.189.158.149) | Backup destination |

### Services on cortex (100.64.0.14)

| Service | Port | Purpose |
|---------|------|---------|
| Qdrant | 6333 | Vector database (recon_knowledge collection) |
| TEI (text-embeddings-inference) | 8090 | Embedding server (bge-m3, 1024-dim, ~1,711 emb/sec) |
| Ollama | 11434 | LLM server + fallback embeddings (~8 emb/sec) |
| OpenWebUI | 8080 | Aurora chat interface (ai.echo6.co) |

### Services on recon LXC (100.64.0.24)

| Service | Port | Purpose |
|---------|------|---------|
| RECON Dashboard | 8420 | Web UI + API for pipeline management |
| File Server | 8888 | PDF downloads (files.echo6.co) |

### NFS Mount

```
pi-nas:/export/library -> /mnt/library (22TB, rw, NFSv3)
```

Contains ~13,000+ PDFs across:
- `Survival-Companion-Library/` (~12,900 PDFs in ~220 subdirectories)
- `Army_Pubs/` (~160 military field manuals)
- Other: `Gaming/`, `Reference/`, `Technical/`

---

## 3. Architecture Overview

```
                  /mnt/library/ (NFS)
                          |
                    [recon scan]
                          |
                  catalogue (SQLite)
                          |
                    [recon queue]
                          |
+-----------+     [recon extract]      +-----------+
| PyPDF2    |-->  data/text/           | Gemini    |
| pdftotext |     {hash}/page_N.txt    | Flash     |
| tesseract |             |            | 4 keys    |
+-----------+     [recon enrich]       +-----------+
                          |
                   data/concepts/
                 {hash}/window_N.json
                          |
                    [recon embed]
                          |
              +----------+-----------+
              | TEI (primary)        |
              | bge-m3, 1024-dim     |
              | 1,711 emb/sec        |
              +----------+-----------+
                          |
                 Qdrant (cortex:6333)
              recon_knowledge collection
                          |
                  Aurora (OpenWebUI)
                RAG search + citations
```

### Web Content Path

```
URL(s) ──> [recon ingest-url / crawl]
                      |
           trafilatura extraction
           chunk into ~2000-word pages
                      |
           data/text/{hash}/page_N.txt
           (enters at "extracted" status)
                      |
              [enrich] -> [embed]
              (same as PDF path)
```

---

## 4. Pipeline Stages

### Status Flow

```
catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete
                                  \-> failed
```

Web content enters at `extracted` status (text already extracted by trafilatura).
|
||||
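The status flow can be read as a small state machine. The sketch below is illustrative only: `TRANSITIONS` and `advance` are hypothetical names, not taken from `lib/status.py`, and it assumes that any in-progress stage may end in `failed` and that a retry puts a failed document back in `queued`.

```python
# Hypothetical sketch of the status flow; names are illustrative, not from lib/status.py.
TRANSITIONS = {
    "catalogued": {"queued"},
    "queued": {"extracting"},
    "extracting": {"extracted", "failed"},   # assumption: in-progress stages can fail
    "extracted": {"enriching"},
    "enriching": {"enriched", "failed"},
    "enriched": {"embedding"},
    "embedding": {"complete", "failed"},
    "complete": set(),
    "failed": {"queued"},                    # assumption: a retry re-queues the document
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not in the table."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


# Web content skips the PDF stages and starts at "extracted".
status = "extracted"
for step in ("enriching", "enriched", "embedding", "complete"):
    status = advance(status, step)
print(status)  # complete
```

Keeping the legal moves in one table makes it easy to reject a status update that skips a stage.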
|
||||
### Stage Details

| Stage | Tool | Input | Output | Speed |
|-------|------|-------|--------|-------|
| Scan | `recon scan` | /mnt/library/*.pdf | catalogue table | ~13K PDFs in ~30 min |
| Queue | `recon queue` | catalogue entries | documents table (status=queued) | Instant |
| Extract | `recon extract` | PDF files | data/text/{hash}/page_NNNN.txt | 4 workers, ~200/hr |
| Enrich | `recon enrich` | Text pages (10-page windows) | data/concepts/{hash}/window_N.json | 16 workers, 4 Gemini keys |
| Embed | `recon embed` | Concept JSONs | Qdrant vectors | TEI: 1,711 emb/sec |

### Extraction Fallback Chain

Extractors are tried in this order, each acting as the fallback for the one before it:

1. **PyPDF2** (fast, clean text)
2. **pdftotext** (handles complex layouts)
3. **Tesseract OCR** (scanned documents)
|
||||
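A minimal sketch of how such a fallback chain might be wired, assuming each extractor is a function that takes a path and returns text. `extract_with_fallback` and the three stand-in extractors are hypothetical; the real logic lives in `lib/extractor.py` and may differ.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, extractors: Iterable[tuple[str, Extractor]]) -> tuple[str, str]:
    """Try each extractor in order; return (tool_name, text) from the first that yields text."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # e.g. corrupt metadata crashing a parser
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: no text")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stand-in extractors so the sketch runs without PyPDF2, pdftotext, or Tesseract installed.
def fake_pypdf2(path: str) -> str:
    raise ValueError("corrupt metadata")


def fake_pdftotext(path: str) -> str:
    return ""  # scanned PDF with no text layer


def fake_tesseract(path: str) -> str:
    return "OCR text"


chain = [("PyPDF2", fake_pypdf2), ("pdftotext", fake_pdftotext), ("tesseract", fake_tesseract)]
print(extract_with_fallback("scan.pdf", chain))  # ('tesseract', 'OCR text')
```

Treating an empty result the same as an exception is a design choice in this sketch: a scanned PDF often parses without error but yields no text.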
|
||||
### Enrichment Details

- Model: `gemini-2.0-flash`
- Window size: 10 pages per API call (configurable via `processing.enrich_window_size`; note that the sample `config.yaml` in section 11 sets this to 5)
- Workers: 16 concurrent (4 API keys x 4 workers each)
- Output format: JSON array of concept objects
- **CRITICAL**: Concept JSONs are saved to disk BEFORE any database operations
- Key rotation via `KeyRotator` class distributing across 4 Gemini API keys
|
||||
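To make the windowing concrete, here is a small sketch of how pages could be grouped into enrichment windows. `make_windows` is a hypothetical helper; it assumes windows do not overlap and that window and page numbers start at 1, which matches the `_window` and `_start_page` values in the concept example but is not confirmed by the source.

```python
def make_windows(page_count: int, window_size: int = 10) -> list[tuple[int, int, int]]:
    """Return (window_number, start_page, end_page) tuples, 1-indexed and inclusive."""
    if page_count < 0 or window_size < 1:
        raise ValueError("page_count must be >= 0 and window_size >= 1")
    windows = []
    for number, start in enumerate(range(1, page_count + 1, window_size), start=1):
        end = min(start + window_size - 1, page_count)  # last window may be short
        windows.append((number, start, end))
    return windows


print(make_windows(23))  # [(1, 1, 10), (2, 11, 20), (3, 21, 23)]
```

A 23-page document would therefore need three Gemini calls at the 10-page window size.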
|
||||
### Embedding Details

- **Primary**: TEI at cortex:8090 (bge-m3 model, 1024 dimensions, ~1,711 embeddings/sec)
- **Fallback**: Ollama at cortex:11434 (bge-m3 model, ~8 embeddings/sec)
- Batch size: 128 embeddings per TEI request
- Distance metric: Cosine similarity
- **CRITICAL**: Dimensions are 1024 (bge-m3), NOT 384. Getting this wrong creates silent failures.
|
||||
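The batch size and the dimension warning can both be enforced in a few lines. This is a sketch with hypothetical helper names (`batched`, `check_dimensions`), not code from `lib/embedder.py`; it shows one way to split texts into 128-item requests and to fail loudly instead of silently when a vector has the wrong length.

```python
from typing import Iterator, Sequence

EXPECTED_DIM = 1024  # bge-m3
BATCH_SIZE = 128     # embeddings per TEI request


def batched(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def check_dimensions(vectors: Sequence[Sequence[float]], expected: int = EXPECTED_DIM) -> None:
    """Raise as soon as any vector has an unexpected length."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(f"vector {index} has {len(vector)} dimensions, expected {expected}")


texts = [f"concept {i}" for i in range(300)]
print([len(batch) for batch in batched(texts)])  # [128, 128, 44]
check_dimensions([[0.0] * 1024])                 # passes silently
```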
|
||||
---
|
||||
|
||||
## 5. Directory Structure

```
/opt/recon/                     # Application root
  recon.py                      # CLI entry point
  config.yaml                   # Central configuration
  .env                          # Gemini API keys (4 keys)
  requirements.txt              # Python dependencies
  PROJECT-BIBLE.md              # This file
  README.md                     # Quick-start reference
  run-full-pipeline.sh          # Background pipeline runner

  lib/                          # Core modules
    __init__.py
    api.py                      # Flask web dashboard + API (port 8420)
    crawler.py                  # Site crawler (sitemap + BFS link-following)
    embedder.py                 # Concept -> vector embedding (TEI/Ollama -> Qdrant)
    enricher.py                 # Text -> concept extraction (Gemini)
    extractor.py                # PDF -> text extraction (PyPDF2/pdftotext/OCR)
    ingester.py                 # ARGUS intel feed intake
    status.py                   # SQLite DB operations (catalogue + documents)
    utils.py                    # Config, hashing, URL generation, logging
    web_scraper.py              # URL -> text extraction (trafilatura)

  scripts/                      # Operational scripts
    backup.sh                   # Automated backup to Contabo (cron every 6h)
    rebuild_qdrant.py           # Nuclear recovery: re-embed all concepts
    validate.py                 # Pipeline consistency validation

  data/                         # Pipeline data (on local disk)
    recon.db                    # SQLite status database
    text/                       # Extracted text
      {content_hash}/
        meta.json               # Document metadata
        page_0001.txt           # Page text (4-digit, 1-indexed)
        page_0002.txt
        ...
    concepts/                   # Enriched concepts (**BACK THESE UP**)
      {content_hash}/
        window_1.json           # Concept JSON array (10-page window)
        window_2.json
        ...
    intel/                      # ARGUS intel feeds

  logs/                         # Application logs
    recon.log                   # Main rotating log
    backup.log                  # Backup operation log
    backup_cron.log             # Cron backup log

  venv/                         # Python virtual environment
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Database Schema
|
||||
|
||||
### SQLite (data/recon.db)
|
||||
|
||||
Two tables in WAL mode with thread-local connections.
|
||||
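For readers unfamiliar with the pattern, the sketch below shows one common way to combine WAL mode with thread-local connections using only the standard library. `get_connection` is a hypothetical helper and the temporary database path is for demonstration; the actual code in `lib/status.py` may be organised differently.

```python
import os
import sqlite3
import tempfile
import threading

_local = threading.local()  # each thread sees its own attributes on this object


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return this thread's connection, opening it in WAL mode on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA journal_mode=WAL")  # readers no longer block the writer
        _local.conn = conn
    return conn


# Demo against a throwaway file (WAL is not available for in-memory databases).
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = get_connection(db_path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(journal_mode)
```

SQLite connections should not be shared across threads by default, which is why each worker gets its own.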
|
||||
#### catalogue
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| hash | TEXT PK | MD5 content hash |
|
||||
| filename | TEXT | Original filename |
|
||||
| path | TEXT | Full filesystem path |
|
||||
| size_bytes | INTEGER | File size |
|
||||
| source | TEXT | Top-level directory (e.g., "Survival-Companion-Library") |
|
||||
| category | TEXT | Second-level directory (e.g., "Bushcraft") |
|
||||
| status | TEXT | "catalogued" or "processed" |
|
||||
| discovered_at | TEXT | ISO timestamp |
|
||||
|
||||
#### documents
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| hash | TEXT PK | MD5 content hash |
|
||||
| filename | TEXT | Original filename |
|
||||
| path | TEXT | Full path or URL |
|
||||
| size_bytes | INTEGER | File/content size |
|
||||
| page_count | INTEGER | Number of text pages |
|
||||
| book_title | TEXT | Gemini-extracted title |
|
||||
| book_author | TEXT | Gemini-extracted author |
|
||||
| status | TEXT | Pipeline status |
|
||||
| pages_extracted | INTEGER | Pages extracted |
|
||||
| concepts_extracted | INTEGER | Concepts generated |
|
||||
| vectors_inserted | INTEGER | Vectors in Qdrant |
|
||||
| error_message | TEXT | Last error (if failed) |
|
||||
| retry_count | INTEGER | Failure retry count |
|
||||
| created_at | TEXT | ISO timestamp |
|
||||
| updated_at | TEXT | ISO timestamp |
|
||||
|
||||
### Qdrant (cortex:6333)
|
||||
|
||||
Collection: `recon_knowledge`
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| vector | float[1024] | BGE-M3 embedding |
|
||||
| doc_hash | keyword | Links to SQLite document |
|
||||
| filename | keyword | Source filename |
|
||||
| book_title | keyword | Document title |
|
||||
| book_author | keyword | Author name |
|
||||
| source_type | keyword | "document", "web", or "intel_feed" |
|
||||
| download_url | keyword | files.echo6.co URL or source URL |
|
||||
| content | text | Concept text (searchable) |
|
||||
| summary | text | Concept summary |
|
||||
| title | keyword | Concept title |
|
||||
| domain | keyword | Knowledge domain |
|
||||
| subdomain | keyword | Knowledge subdomain |
|
||||
| keywords | keyword[] | Concept keywords |
|
||||
| skill_level | keyword | beginner/intermediate/advanced/expert |
|
||||
| key_facts | text[] | Key facts list |
|
||||
| scenario_applicable | text[] | Applicable scenarios |
|
||||
| cross_domain_tags | keyword[] | Cross-references |
|
||||
| chapter | keyword | Source chapter |
|
||||
| page_ref | keyword | Source page reference |
|
||||
| notes | text | Additional notes |
|
||||
| _window | integer | Source window number |
|
||||
| _start_page | integer | Starting page in document |
|
||||
| verification_status | keyword | "unverified" (default) |
|
||||
| credibility_score | float | 0.7 (default) |
|
||||
| language | keyword | "en" (default) |
|
||||
|
||||
---
|
||||
|
||||
## 7. CLI Reference
|
||||
|
||||
```
|
||||
recon <command> [options]
|
||||
```
|
||||
|
||||
| Command | Description | Key Options |
|
||||
|---------|-------------|-------------|
|
||||
| `scan` | Scan library, catalogue new PDFs | `--path` |
|
||||
| `queue` | Queue catalogued docs for processing | `--hash`, `--source`, `--category`, `--limit` |
|
||||
| `extract` | Extract text from queued PDFs | `--workers` |
|
||||
| `enrich` | Enrich extracted text via Gemini | `--workers`, `--limit` |
|
||||
| `embed` | Embed concepts into Qdrant | `--workers`, `--limit` |
|
||||
| `run` | Full pipeline (extract->enrich->embed) | `--workers`, `--enrich-workers`, `--limit` |
|
||||
| `status` | Show pipeline status counts | |
|
||||
| `catalogue` | Browse catalogue | `--sources`, `--categories`, `--source`, `--limit` |
|
||||
| `failures` | Show failed documents | `--retry` |
|
||||
| `search` | Semantic search | `query`, `--limit` |
|
||||
| `upload` | Upload PDFs | `--file`, `--dir`, `--category` |
|
||||
| `ingest-url` | Ingest web content | `url`, `--file`, `--category`, `--process` |
|
||||
| `crawl` | Crawl a site | `url`, `--category`, `--include`, `--exclude`, `--max-pages`, `--dry-run`, `--process` |
|
||||
| `validate` | Check pipeline consistency | `--deep` |
|
||||
| `rebuild` | Rebuild Qdrant from concept JSONs | |
|
||||
| `serve` | Start web dashboard (port 8420) | |
|
||||
| `ingest` | Ingest ARGUS intel JSON | `--file`, `--directory` |
|
||||
|
||||
### Common Workflows
|
||||
|
||||
```bash
|
||||
# Full library processing
|
||||
recon scan && recon queue && recon run
|
||||
|
||||
# Ingest a single web page with full processing
|
||||
recon ingest-url "https://example.com/article" --category "Reference" --process
|
||||
|
||||
# Dry-run crawl to preview URLs
|
||||
recon crawl "https://docs.example.com" --include /docs/ --dry-run
|
||||
|
||||
# Full crawl with processing
|
||||
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
|
||||
|
||||
# Upload a PDF
|
||||
recon upload --file /path/to/document.pdf --category "Technical"
|
||||
|
||||
# Check what failed and retry
|
||||
recon failures
|
||||
recon failures --retry
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Web Dashboard
|
||||
|
||||
### URL
|
||||
|
||||
```
|
||||
http://100.64.0.24:8420
|
||||
```
|
||||
|
||||
### Pages
|
||||
|
||||
| Route | Page | Description |
|
||||
|-------|------|-------------|
|
||||
| `/` | Dashboard | Knowledge base overview: document/concept/vector counts, source table, domain distribution bars, skill level breakdown, Qdrant health, recent completions, pipeline status |
|
||||
| `/search` | Search | Semantic search with score bars, Web/PDF badges, download links |
|
||||
| `/catalogue` | Catalogue | Browse all catalogued PDFs with source/category filters |
|
||||
| `/upload` | Upload | PDF upload form with category datalist, recent uploads table |
|
||||
| `/web-ingest` | Web Ingest | Two tabs: Single/Batch URL ingest, Site Crawl with preview |
|
||||
| `/failures` | Failures | Failed documents with error messages and retry button |
|
||||
|
||||
### API Endpoints
|
||||
|
||||
| Method | Endpoint | Description |
|
||||
|--------|----------|-------------|
|
||||
| GET | `/api/search?q=...&limit=N` | Semantic search |
|
||||
| GET | `/api/catalogue?source=...&limit=N` | Browse catalogue |
|
||||
| GET | `/api/knowledge-stats` | Dashboard aggregation (totals, sources, domains, skills, Qdrant health) |
|
||||
| POST | `/api/upload` | Upload PDF (multipart: file + category) |
|
||||
| GET | `/api/upload/<hash>/status` | Check upload processing status |
|
||||
| GET | `/api/upload/categories` | List available categories |
|
||||
| POST | `/api/ingest-url` | Ingest single URL (json: url, category, process) |
|
||||
| POST | `/api/ingest-urls` | Ingest multiple URLs (json: urls, category, process) |
|
||||
| POST | `/api/crawl` | Crawl a site (json: url, category, include, exclude, max_pages, dry_run) |
|
||||
| GET | `/api/crawl/<id>/status` | Poll crawl/pipeline progress |
|
||||
| POST | `/api/failures/retry` | Re-queue all failed documents |
|
||||
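As an illustration of calling the API from a script, the sketch below builds a `/api/search` URL and wraps the request in a helper. The query parameters `q` and `limit` come from the table above; the shape of the JSON response is not documented here, so `search` simply returns whatever the endpoint sends back. Both function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://100.64.0.24:8420"  # dashboard address from this document


def search_url(query: str, limit: int = 5, base_url: str = BASE_URL) -> str:
    """Build the GET URL for a semantic search."""
    return f"{base_url}/api/search?{urlencode({'q': query, 'limit': limit})}"


def search(query: str, limit: int = 5, timeout: float = 10.0):
    """Call the endpoint and return the decoded JSON body, whatever its shape."""
    with urlopen(search_url(query, limit), timeout=timeout) as response:
        return json.load(response)


print(search_url("water purification methods", limit=3))
# http://100.64.0.24:8420/api/search?q=water+purification+methods&limit=3
```

`search` is defined but not called here, since it needs network access to the recon host.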
|
||||
### Dashboard Features
|
||||
|
||||
- **Auto-refresh**: Every 30 seconds via JavaScript fetch
|
||||
- **Knowledge cards**: Total documents, concepts, vectors, pages
|
||||
- **Source table**: Per-source breakdown with document/concept/vector counts and PDF/WEB type badges
|
||||
- **Domain distribution**: Horizontal bars showing top knowledge domains
|
||||
- **Skill level breakdown**: beginner/intermediate/advanced/expert percentages
|
||||
- **Qdrant health**: Connection status, points count, segments
|
||||
- **Pipeline status**: Compact display of documents in each stage
|
||||
- **Crawl polling**: Real-time stage tracking (ingesting -> enriching -> embedding)
|
||||
|
||||
---
|
||||
|
||||
## 9. Concept JSON Schema
|
||||
|
||||
Each window file (`data/concepts/{hash}/window_N.json`) contains a JSON array of concept objects:

```json
[
  {
    "title": "Water Purification Methods",
    "content": "Detailed text about the concept...",
    "summary": "Brief summary of the concept",
    "domain": "Survival",
    "subdomain": "Water",
    "keywords": ["purification", "filtration", "boiling"],
    "skill_level": "beginner",
    "key_facts": ["Boiling kills 99.9% of pathogens", "..."],
    "scenario_applicable": ["wilderness survival", "disaster preparedness"],
    "cross_domain_tags": ["health", "camping"],
    "chapter": "Chapter 3",
    "page_ref": "pp. 45-48",
    "notes": "Additional context or caveats",
    "_window": 1,
    "_start_page": 1
  }
]
```
|
||||
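A quick structural check on a window file can catch malformed Gemini output before embedding. The sketch below is an assumption-heavy illustration: the source does not say which fields are mandatory, so `REQUIRED_FIELDS` is a hypothetical subset, and `validate_concept` is not a function from the codebase. The skill levels are the four listed in the Qdrant schema.

```python
import json

# Assumption: which fields are mandatory is not stated in the source; this set is illustrative.
REQUIRED_FIELDS = {
    "title": str, "content": str, "summary": str,
    "domain": str, "keywords": list, "skill_level": str,
}
SKILL_LEVELS = {"beginner", "intermediate", "advanced", "expert"}


def validate_concept(concept: dict) -> list[str]:
    """Return a list of problems; an empty list means the concept looks well-formed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in concept:
            problems.append(f"missing field: {field}")
        elif not isinstance(concept[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    level = concept.get("skill_level")
    if isinstance(level, str) and level not in SKILL_LEVELS:
        problems.append(f"unknown skill_level: {level}")
    return problems


window = json.loads("""
[
  {"title": "Water Purification Methods", "content": "Detailed text...",
   "summary": "Brief summary", "domain": "Survival", "keywords": ["boiling"],
   "skill_level": "beginner"},
  {"title": "Untitled", "skill_level": "guru"}
]
""")
for concept in window:
    print(concept["title"], validate_concept(concept))
```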
|
||||
---
|
||||
|
||||
## 10. Web Ingestion
|
||||
|
||||
### Single URL
|
||||
|
||||
```bash
|
||||
recon ingest-url "https://example.com/article" --category "Reference" --process
|
||||
```
|
||||
|
||||
Or via API:
```bash
curl -X POST http://100.64.0.24:8420/api/ingest-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article", "category": "Reference", "process": true}'
```
|
||||
|
||||
### Site Crawl
|
||||
|
||||
```bash
|
||||
# Preview what would be crawled
|
||||
recon crawl "https://docs.example.com" --include /docs/ --dry-run
|
||||
|
||||
# Full crawl
|
||||
recon crawl "https://docs.example.com" --include /docs/ --category "Reference" --process
|
||||
```
|
||||
|
||||
### How It Works

1. **URL discovery** (crawler.py):
   - Tries sitemap.xml first (preferred, finds all pages)
   - Falls back to BFS link-following if no sitemap
   - Filters by include/exclude patterns

2. **Content extraction** (web_scraper.py):
   - Uses trafilatura for clean text extraction
   - Chunks into ~2,000-word pages
   - Same output format as PDF extractor: `data/text/{hash}/page_NNNN.txt`
   - Content hash is MD5 of extracted text (deduplication)

3. **Pipeline integration**:
   - Web content enters at `extracted` status (no PDF extraction needed)
   - Enrichment and embedding proceed identically to PDF content
   - Qdrant vectors get `source_type: "web"` and `download_url` pointing to source URL
|
||||
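The chunking step in the list above can be sketched in a few lines. This version splits purely on whitespace-delimited words; the real `web_scraper.py` may respect paragraph boundaries, so treat `chunk_into_pages` and `page_filename` as hypothetical helpers that only illustrate the ~2,000-word target and the `page_NNNN.txt` naming.

```python
def chunk_into_pages(text: str, words_per_page: int = 2000) -> list[str]:
    """Split text into pseudo-pages of at most `words_per_page` words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_page])
        for start in range(0, len(words), words_per_page)
    ]


def page_filename(page_number: int) -> str:
    """Format a 1-indexed page number as the 4-digit file name used under data/text/."""
    return f"page_{page_number:04d}.txt"


pages = chunk_into_pages("word " * 4500)
print([len(page.split()) for page in pages])                  # [2000, 2000, 500]
print([page_filename(n) for n in range(1, len(pages) + 1)])   # ['page_0001.txt', 'page_0002.txt', 'page_0003.txt']
```

Because the output matches the PDF extractor's layout, the enrichment stage does not need to know where the text came from.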
|
||||
---
|
||||
|
||||
## 11. Configuration Reference
|
||||
|
||||
### config.yaml

```yaml
# Root path for the PDF library (NFS mount from pi-nas)
library_root: /mnt/library

processing:
  extract_workers: 4          # Concurrent PDF extraction threads
  enrich_workers: 16          # Concurrent Gemini enrichment threads (4 keys x 4)
  embed_workers: 4            # Concurrent embedding threads
  enrich_window_size: 5       # Pages per enrichment window (sent to Gemini)
  embed_batch_size: 500       # Vectors per Qdrant upsert batch
  rate_limit_delay: 0.1       # Delay between Gemini API calls (seconds)
  max_retries: 5              # Max retries for failed documents

embedding:
  backend: tei                # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
  tei_host: 100.64.0.14       # TEI server (cortex)
  tei_port: 8090              # TEI HTTP port
  ollama_host: 100.64.0.14    # Ollama server (cortex), fallback only
  ollama_port: 11434          # Ollama HTTP port
  model: bge-m3               # Embedding model name
  dimensions: 1024            # CRITICAL: bge-m3 is 1024-dim, NOT 384
  batch_size: 128             # Embeddings per TEI batch request

vector_db:
  host: 100.64.0.14           # Qdrant server (cortex)
  port: 6333                  # Qdrant HTTP port
  collection: recon_knowledge # Collection name

gemini:
  model: gemini-2.0-flash     # Gemini model for enrichment
  response_mime_type: application/json  # Force JSON output

web:
  port: 8420                  # Dashboard HTTP port
  host: 0.0.0.0               # Bind to all interfaces

paths:
  base: /opt/recon            # Application root
  data: /opt/recon/data       # Data directory
  text: /opt/recon/data/text  # Extracted text output
  concepts: /opt/recon/data/concepts  # Enriched concept JSONs
  intel: /opt/recon/data/intel        # ARGUS intel feeds
  logs: /opt/recon/logs       # Log files
  db: /opt/recon/data/recon.db        # SQLite database

book_server:
  base_url: https://files.echo6.co    # Public URL prefix for PDF downloads
  strip_prefix: /mnt/library          # Path prefix to strip when generating URLs

upload_paths:                 # Category -> filesystem path mapping for uploads
  Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
  Military Doctrine: /mnt/library/Army_Pubs/Uploads
  Gaming: /mnt/library/Gaming
  Reference: /mnt/library/Reference
  Technical: /mnt/library/Technical
  default: /mnt/library       # Fallback for unknown categories

web_scraper:
  words_per_page: 2000        # Target words per page chunk
  fetch_timeout: 30           # HTTP request timeout (seconds)
  rate_limit_delay: 1.0       # Delay between URL fetches (seconds)
  max_batch_size: 50          # Max URLs per batch ingest
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"

crawler:
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
  fetch_timeout: 30           # HTTP request timeout (seconds)
  rate_limit_delay: 1.0       # Delay between page fetches (seconds)
  max_pages: 500              # Max pages to discover per crawl
  max_depth: 3                # Max link-following depth (BFS only)
  default_exclude:            # URL patterns to always skip
    - /search
    - /404
    - /login
    - /signup
    - /auth/
    - /api/
    - /assets/
    - /static/
```
|
||||
|
||||
### .env
|
||||
|
||||
```
|
||||
GEMINI_KEY_1=<key>
|
||||
GEMINI_KEY_2=<key>
|
||||
GEMINI_KEY_3=<key>
|
||||
GEMINI_KEY_4=<key>
|
||||
```
|
||||
|
||||
Four Gemini API keys rotated across 16 enrichment workers via `KeyRotator`.
|
||||
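The source names a `KeyRotator` class but does not show it. The sketch below is one plausible shape, a thread-safe round-robin over the four `GEMINI_KEY_n` variables; it is an assumption about behaviour, not the project's actual implementation, and `load_keys` and `next_key` are hypothetical names.

```python
import itertools
import os
import threading


class KeyRotator:
    """Hand out API keys round-robin; safe to call from many worker threads."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        with self._lock:  # itertools.cycle is not guaranteed thread-safe on its own
            return next(self._cycle)


def load_keys(env=os.environ, count: int = 4) -> list[str]:
    """Collect GEMINI_KEY_1..GEMINI_KEY_<count> from the given mapping, skipping missing ones."""
    names = [f"GEMINI_KEY_{i}" for i in range(1, count + 1)]
    return [env[name] for name in names if name in env]


demo_env = {f"GEMINI_KEY_{i}": f"key-{i}" for i in range(1, 5)}  # placeholder values
rotator = KeyRotator(load_keys(demo_env))
first_six = [rotator.next_key() for _ in range(6)]
print(first_six)  # ['key-1', 'key-2', 'key-3', 'key-4', 'key-1', 'key-2']
```

With 16 workers sharing one rotator, each key ends up serving roughly a quarter of the requests.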
|
||||
---
|
||||
|
||||
## 12. Aurora RAG Integration
|
||||
|
||||
Aurora is the RAG-enabled AI assistant running on OpenWebUI (ai.echo6.co).
|
||||
|
||||
### How It Works
|
||||
|
||||
1. User asks a question in OpenWebUI
|
||||
2. Aurora's OpenWebUI function/filter embeds the query via TEI (cortex:8090)
|
||||
3. Searches Qdrant `recon_knowledge` collection for similar concepts
|
||||
4. Top results are injected into the prompt as context
|
||||
5. JOSIEFIED Qwen3 8B generates an answer with citations
|
||||
6. Citations include `download_url` links (PDF files via files.echo6.co, web content via source URL)
|
||||
|
||||
### Key Components
|
||||
|
||||
- **Embedding**: Same TEI endpoint + bge-m3 model as RECON pipeline (ensures vector compatibility)
|
||||
- **Search**: Cosine similarity, top-5 results by default
|
||||
- **LLM**: `goekdenizguelmez/JOSIEFIED-Qwen3:8b` on Ollama (cortex:11434)
|
||||
- **Citations**: Each result includes `download_url` — either `https://files.echo6.co/...` for PDFs or the original URL for web content
|
||||
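Qdrant performs the similarity search server-side, but the metric itself is simple enough to show directly. The sketch below computes cosine similarity and a top-k ranking over toy 2-dimensional vectors (the real vectors are 1024-dimensional); `cosine_similarity`, `top_k`, and the sample payloads are illustrative only.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, points, k: int = 5):
    """Return the k (score, payload) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, vector), payload) for vector, payload in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: 2-dim vectors standing in for 1024-dim bge-m3 embeddings.
points = [
    ([1.0, 0.0], "boiling water"),
    ([0.0, 1.0], "knot tying"),
    ([0.8, 0.6], "filtration"),
]
best = [payload for _, payload in top_k([1.0, 0.0], points, k=2)]
print(best)  # ['boiling water', 'filtration']
```

This is also why the query must be embedded with the same model as the stored concepts: scores are only meaningful when both vectors live in the same space.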
|
||||
---
|
||||
|
||||
## 13. Backup & Recovery
|
||||
|
||||
### Automated Backups
|
||||
|
||||
**Script**: `/opt/recon/scripts/backup.sh`
|
||||
**Destination**: Contabo VPS (`root@100.64.0.1:/opt/backups/recon/`)
|
||||
**Schedule** (cron):
|
||||
- Every 6 hours: Full backup (concepts, text, DB, config, intel)
|
||||
- Every 2 hours (off-hours): SQLite DB snapshot only
|
||||
|
||||
### What's Backed Up
|
||||
|
||||
| Component | Size | Priority | Notes |
|
||||
|-----------|------|----------|-------|
|
||||
| data/concepts/ | ~11M | **CRITICAL** | $130+ of Gemini API work |
|
||||
| data/text/ | ~203M | High | Hours to regenerate |
|
||||
| data/recon.db | ~6.5M | **CRITICAL** | All pipeline state |
|
||||
| config.yaml + .env | ~2K | Important | Configuration |
|
||||
| data/intel/ | ~4K | Low | Intel feed data |
|
||||
|
||||
### What's NOT Backed Up
|
||||
|
||||
- **Qdrant vectors**: Rebuilt from concept JSONs in ~10 minutes via `recon rebuild`
|
||||
- **PDF library**: Lives on pi-nas NFS, backed up separately
|
||||
- **venv/**: Recreated from requirements.txt
|
||||
|
||||
### Recovery Procedures
|
||||
|
||||
```bash
|
||||
# Restore from backup
|
||||
scp -r root@100.64.0.1:/opt/backups/recon/concepts/ /opt/recon/data/concepts/
|
||||
scp -r root@100.64.0.1:/opt/backups/recon/text/ /opt/recon/data/text/
|
||||
scp root@100.64.0.1:/opt/backups/recon/recon_LATEST.db /opt/recon/data/recon.db
|
||||
|
||||
# Rebuild Qdrant vectors from concept JSONs
|
||||
cd /opt/recon && source venv/bin/activate
|
||||
python3 scripts/rebuild_qdrant.py
|
||||
# Type REBUILD when prompted
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Embedding Performance
|
||||
|
||||
### TEI (Primary) vs Ollama (Fallback)

| Metric | TEI (cortex:8090) | Ollama (cortex:11434) |
|--------|-------------------|----------------------|
| Speed | ~1,711 emb/sec | ~8 emb/sec |
| Model | bge-m3 | bge-m3 |
| Dimensions | 1024 | 1024 |
| Batch size | 128 | 1 |
| Cosine similarity | 0.999900 | 0.999900 |

TEI is ~214x faster than Ollama for embeddings. Always use TEI unless it's down.
|
||||
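The speedup figure follows directly from the two throughput numbers. As a rough worked example, the snippet below also estimates pure embedding time for the 4,661 points reported under "Current State"; it ignores network latency and Qdrant upserts, so real runs will take longer.

```python
TEI_RATE = 1711     # embeddings per second (TEI)
OLLAMA_RATE = 8     # embeddings per second (Ollama)
POINTS = 4661       # Qdrant points listed in "Current State"

speedup = TEI_RATE / OLLAMA_RATE
tei_seconds = POINTS / TEI_RATE
ollama_minutes = POINTS / OLLAMA_RATE / 60

print(round(speedup))            # 214
print(round(tei_seconds, 1))     # 2.7  (seconds on TEI)
print(round(ollama_minutes, 1))  # 9.7  (minutes on Ollama)
```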
|
||||
### Qdrant Configuration
|
||||
|
||||
- Collection: `recon_knowledge`
|
||||
- Distance: Cosine
|
||||
- HNSW indexing threshold: 20,000 (below this, brute-force search is used)
|
||||
- Current state: Brute-force (under 20K vectors) — this is normal and performant at current scale
|
||||
|
||||
---
|
||||
|
||||
## 15. Content Hashing
|
||||
|
||||
- **PDF content**: `MD5(file_bytes)` — stable across renames, detects exact duplicates
|
||||
- **Web content**: `MD5(extracted_text)` — deduplicates by content, not URL
|
||||
- Hash is used as the primary key in both SQLite tables and as the directory name for text/concept storage
|
||||
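Both hashing rules map onto `hashlib` directly. In this sketch, `hash_file` and `hash_text` are hypothetical helper names, and encoding the extracted text as UTF-8 before hashing is an assumption; reading the file in chunks keeps memory flat for large PDFs on NFS.

```python
import hashlib
import os
import tempfile


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's bytes, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_text(text: str) -> str:
    """MD5 of extracted text (assumes UTF-8 encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


print(hash_text("hello"))  # 5d41402abc4b2a76b9719d911017c592

# The same bytes hash identically whether they come from a file or a string.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
file_hash = hash_file(tmp.name)
os.remove(tmp.name)
print(file_hash == hash_text("hello"))  # True
```

MD5 is used here for deduplication, not security, so its known collision weaknesses are not a practical concern at this scale.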
|
||||
---
|
||||
|
||||
## 16. Source Type Handling

| Source | Path Format | source_type | download_url | Badge |
|--------|-------------|-------------|--------------|-------|
| PDF | `/mnt/library/...` | document | `https://files.echo6.co/...` | PDF |
| Web | `https://...` | web | Original URL | Web |
| Intel | JSON feed | intel_feed | n/a | n/a |

The `generate_download_url()` function in utils.py handles the routing:
- URLs starting with `http://` or `https://` are returned as-is
- File paths are converted to `files.echo6.co` URLs
|
||||
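A sketch of what that routing might look like, using the `book_server.base_url` and `book_server.strip_prefix` values from `config.yaml` as defaults. The function name comes from the source, but this body is an assumption: in particular, percent-encoding the path is a guess, and the example file name is made up.

```python
from urllib.parse import quote


def generate_download_url(
    path: str,
    base_url: str = "https://files.echo6.co",
    strip_prefix: str = "/mnt/library",
) -> str:
    """Return web URLs unchanged; map library file paths onto the file server."""
    if path.startswith(("http://", "https://")):
        return path
    relative = path[len(strip_prefix):] if path.startswith(strip_prefix) else path
    # Assumption: spaces and other unsafe characters are percent-encoded.
    return base_url.rstrip("/") + "/" + quote(relative.lstrip("/"))


print(generate_download_url("https://example.com/article"))
# https://example.com/article
print(generate_download_url("/mnt/library/Army_Pubs/FM 21-76.pdf"))  # hypothetical file
# https://files.echo6.co/Army_Pubs/FM%2021-76.pdf
```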
|
||||
---
|
||||
|
||||
## 17. Lessons Learned
|
||||
|
||||
### RECON Rebuild Lessons
|
||||
|
||||
1. **Verify infrastructure before writing code.** Check Qdrant, TEI, Ollama connectivity first.
|
||||
2. **Dimensions are 1024, NOT 384.** BGE-M3 uses 1024-dimensional vectors. This caused silent failures in early builds.
|
||||
3. **TEI >> Ollama for embeddings.** 1,711 vs 8 embeddings/sec. A 214x speedup that makes batch processing viable.
|
||||
4. **Dynamic discovery over hardcoded paths.** Let the pipeline discover what's on disk rather than maintaining static file lists.
|
||||
5. **Web content uses the same pipeline.** After text extraction, web and PDF content follow identical enrichment and embedding paths.
|
||||
6. **Sitemap > link-following.** Sitemaps discover all pages reliably; BFS link-following misses orphaned pages and is slower.
|
||||
7. **Save to disk before DB operations.** Concept JSONs are written to disk first, then the database is updated. This means recovery is always possible from the JSON files.
|
||||
8. **NFS over large file sets is slow.** Scanning 13K PDFs over NFS takes ~30 minutes due to MD5 hashing over the network. Plan accordingly.
|
||||
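Lesson 7 is worth a concrete illustration. The sketch below writes a window's concepts to disk first and only then updates SQLite. The temp-file-plus-rename step is an extra safeguard assumed here, not something the source describes, and `save_window` is a hypothetical helper; the `documents.hash` and `documents.concepts_extracted` columns are from the schema in section 6.

```python
import json
import os
import sqlite3
import tempfile


def save_window(concepts_dir: str, doc_hash: str, window: int, concepts: list, conn) -> str:
    """Persist concepts as JSON, then record the count in the database."""
    target_dir = os.path.join(concepts_dir, doc_hash)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, f"window_{window}.json")
    tmp_path = target + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(concepts, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, target)  # rename so a partial file never has the final name
    # Only now touch the database; if this fails, the JSON is already safe on disk.
    with conn:
        conn.execute(
            "UPDATE documents SET concepts_extracted = concepts_extracted + ? WHERE hash = ?",
            (len(concepts), doc_hash),
        )
    return target


# Demo with an in-memory database and a throwaway directory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (hash TEXT PRIMARY KEY, concepts_extracted INTEGER DEFAULT 0)")
conn.execute("INSERT INTO documents (hash) VALUES ('abc123')")
path = save_window(tempfile.mkdtemp(), "abc123", 1, [{"title": "Water Purification Methods"}], conn)
count = conn.execute("SELECT concepts_extracted FROM documents WHERE hash = 'abc123'").fetchone()[0]
print(os.path.basename(path), count)  # window_1.json 1
```

The ordering matters because the JSON files represent paid API work, while the database row can always be rebuilt from them.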
|
||||
### Operational Gotchas
|
||||
|
||||
- `recon scan` can appear stuck on large PDFs over NFS — it's hashing, not hung
|
||||
- Some PDFs have corrupt metadata that crashes PyPDF2 — the extractor catches this and falls back
|
||||
- Gemini rate limits hit with 16 workers — the `KeyRotator` distributes across 4 keys to mitigate
|
||||
- `iptables-persistent` hangs on interactive prompts in LXC containers — use manual persistence
|
||||
- The recon LXC has no tmux/screen — use `nohup` for long-running background tasks
|
||||
|
||||
---
|
||||
|
||||
## 18. Monitoring
|
||||
|
||||
### Pipeline Status
|
||||
|
||||
```bash
|
||||
# Quick status
|
||||
recon status
|
||||
|
||||
# Dashboard
|
||||
http://100.64.0.24:8420
|
||||
|
||||
# Tail logs
|
||||
tail -f /opt/recon/logs/recon.log
|
||||
|
||||
# Pipeline run log (when running full background pipeline)
|
||||
tail -f /opt/recon/pipeline.log
|
||||
```
|
||||
|
||||
### Health Checks
|
||||
|
||||
```bash
|
||||
# Qdrant
|
||||
curl -s http://100.64.0.14:6333/collections/recon_knowledge | python3 -m json.tool
|
||||
|
||||
# TEI
|
||||
curl -s http://100.64.0.14:8090/info
|
||||
|
||||
# Ollama
|
||||
curl -s http://100.64.0.14:11434/api/tags | python3 -m json.tool
|
||||
|
||||
# NFS mount
|
||||
df -h /mnt/library
|
||||
|
||||
# Backup logs
|
||||
tail -20 /opt/recon/logs/backup.log
|
||||
```
|
||||
|
||||
### Validation
|
||||
|
||||
```bash
|
||||
# Quick validation
|
||||
recon validate
|
||||
|
||||
# Deep validation (checks all files on disk)
|
||||
recon validate --deep
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 19. Current State
|
||||
|
||||
*As of 2026-02-16*
|
||||
|
||||
### Pipeline Progress
|
||||
|
||||
| Status | Count |
|
||||
|--------|-------|
|
||||
| Catalogued | 10,162 |
|
||||
| Queued | 8,982 |
|
||||
| Extracted | 872 |
|
||||
| Complete | 302 |
|
||||
| Failed | 2 |
|
||||
|
||||
### Vector Database
|
||||
|
||||
- Qdrant points: 4,661 (3,144 PDF + 1,517 web)
|
||||
- Segments: 8
|
||||
- Indexing: Brute-force (under 20K threshold)
|
||||
|
||||
### Active Processing
|
||||
|
||||
Full pipeline running in background via `nohup` — extracting through the 8,982 queued documents. Expected to take ~40 hours for full extract -> enrich -> embed cycle.
|
||||
|
||||
### Backups
|
||||
|
||||
- Schedule: Every 6 hours (full) + every 2 hours (DB only)
|
||||
- Destination: Contabo VPS (`/opt/backups/recon/`)
|
||||
- Last verified: 2026-02-16 (220M total backup size)
|
||||
|
||||
---
|
||||
|
||||
## 20. Dependencies
|
||||
|
||||
### System Packages
|
||||
|
||||
- Python 3.11+
|
||||
- pdftotext (poppler-utils)
|
||||
- tesseract-ocr
|
||||
- sqlite3
|
||||
|
||||
### Python Packages (key)
|
||||
|
||||
| Package | Version | Purpose |
|
||||
|---------|---------|---------|
|
||||
| Flask | 3.1.2 | Web dashboard |
|
||||
| google-generativeai | 0.8.6 | Gemini API for enrichment |
|
||||
| qdrant-client | 1.16.2 | Vector database client |
|
||||
| PyPDF2 | 3.0.1 | PDF text extraction |
|
||||
| trafilatura | 2.0.0 | Web content extraction |
|
||||
| beautifulsoup4 | 4.14.3 | HTML parsing for crawler |
|
||||
| lxml | 6.0.2 | XML/HTML parsing |
|
||||
| pytesseract | 0.3.13 | OCR fallback |
|
||||
| requests | 2.32.5 | HTTP client |
|
||||
| PyYAML | 6.0.3 | Config file parsing |
|
||||
|
||||
Full list in `requirements.txt`.
|
||||
89
README.md
Normal file
|
|
@ -0,0 +1,89 @@
# RECON -- Knowledge Extraction Pipeline

Extracts structured knowledge from PDFs and web content into a Qdrant vector database for RAG retrieval by Aurora.

## Quick Start

```bash
# Activate
cd /opt/recon && source venv/bin/activate

# Scan library for new PDFs
recon scan

# Queue and process
recon queue
recon extract
recon enrich
recon embed

# Or run full pipeline
recon run

# Ingest a web page
recon ingest-url "https://example.com/article" --category "Category" --process

# Crawl an entire docs site
recon crawl "https://docs.example.com" --include /docs/ --category "Category" --process

# Upload a PDF
recon upload --file /path/to/document.pdf --category "Category"

# Search
recon search "water purification methods"

# Check status
recon status
recon failures
```

## Dashboard

http://100.64.0.24:8420

## Services

| Service | Location | Purpose |
|---------|----------|---------|
| RECON Dashboard | recon:8420 | Pipeline management + API |
| Qdrant | cortex:6333 | Vector database |
| TEI | cortex:8090 | Embeddings (1,711/sec) |
| Ollama | cortex:11434 | Chat + fallback embeddings |
| OpenWebUI | cortex:8080 (ai.echo6.co) | Aurora chat with RAG |
| File Server | recon:8888 (files.echo6.co) | PDF downloads |
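
The dashboard's API can also be called from a script. The sketch below builds a request for the `POST /api/search` route defined in `api.py` (JSON body with `query`, optional `limit` and `source_type`; response with a `results` list of `score`/`payload` pairs). The helper names `build_search_request` and `summarize` are hypothetical, and the network call is left commented out so the block runs without access to the Tailscale network.

```python
import json
import urllib.request


def build_search_request(base_url, query, limit=20, source_type=None):
    """Build (but do not send) a POST request for the /api/search route."""
    body = {"query": query, "limit": limit}
    if source_type:
        body["source_type"] = source_type
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response_body):
    """Reduce a search response to (score, title) pairs."""
    data = json.loads(response_body)
    return [
        (hit["score"], hit["payload"].get("title", "Untitled"))
        for hit in data["results"]
    ]


req = build_search_request(
    "http://100.64.0.24:8420", "water purification methods", limit=5
)
# To actually send it from a host that can reach the dashboard:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(summarize(resp.read()))
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at the live service.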

## Key Paths

| Path | Contents |
|------|----------|
| /opt/recon/ | Application code |
| /opt/recon/data/concepts/ | Gemini extractions (**CRITICAL -- back these up**) |
| /opt/recon/data/text/ | Extracted text |
| /opt/recon/data/recon.db | SQLite status DB |
| /mnt/library/ | PDF library (NFS from pi-nas) |

## Backups

Automated every 6 hours to Contabo VPS via `/opt/recon/scripts/backup.sh`.
Concept JSONs are the most valuable data ($130+ of Gemini API work).
Qdrant is NOT backed up -- rebuilt from JSONs in ~10 minutes via `recon rebuild`.
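
To illustrate why the concept JSONs are enough to rebuild the vector store, here is a minimal sketch of the first half of such a rebuild: walking `data/concepts/{hash}/window_N.json` and grouping concepts into upsert-sized batches. It is not the actual `recon rebuild` implementation. It assumes each window file holds a JSON array of concept objects, and the helper names `iter_concepts` and `batched` are hypothetical.

```python
import json
from pathlib import Path


def iter_concepts(concepts_root):
    """Yield (doc_hash, concept) pairs from {hash}/window_N.json files."""
    # sorted() gives a stable, lexicographic order (window_10 before window_2)
    for window_file in sorted(Path(concepts_root).glob("*/window_*.json")):
        doc_hash = window_file.parent.name
        with window_file.open(encoding="utf-8") as fh:
            concepts = json.load(fh)
        # Assumption: every window file is a JSON array of concept objects
        for concept in concepts:
            yield doc_hash, concept


def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

A rebuild loop would then iterate `batched(iter_concepts("/opt/recon/data/concepts"), 500)`, embed each batch, and upsert it into Qdrant; the batch size of 500 mirrors `embed_batch_size` in `config.yaml`.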

## Monitoring

```bash
# Pipeline status
recon status

# Tail logs
tail -f /opt/recon/logs/recon.log

# Pipeline run log
tail -f /opt/recon/pipeline.log

# Validate consistency
recon validate --deep
```

## Full Documentation

See [PROJECT-BIBLE.md](PROJECT-BIBLE.md) for complete system documentation.
348
api.py
Normal file
@ -0,0 +1,348 @@
import json
import os

import requests as http_requests
from flask import Flask, request, jsonify, redirect
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

from .utils import get_config, content_hash, setup_logging
from .status import StatusDB

logger = setup_logging('recon.api')

app = Flask(__name__)

HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<title>RECON</title>
<meta charset="utf-8">
<style>
  * { margin: 0; padding: 0; box-sizing: border-box; }
  body { font-family: 'Courier New', monospace; background: #0a0a0a; color: #c0c0c0; }
  .header { background: #111; border-bottom: 1px solid #333; padding: 12px 24px; display: flex; justify-content: space-between; align-items: center; }
  .header h1 { color: #00ff41; font-size: 18px; letter-spacing: 2px; }
  .header .stats { font-size: 12px; color: #666; }
  .nav { background: #0d0d0d; border-bottom: 1px solid #222; padding: 8px 24px; }
  .nav a { color: #888; text-decoration: none; margin-right: 16px; font-size: 13px; }
  .nav a:hover, .nav a.active { color: #00ff41; }
  .content { padding: 24px; max-width: 1400px; margin: 0 auto; }
  .search-box { width: 100%; padding: 10px 16px; background: #111; border: 1px solid #333; color: #c0c0c0; font-family: inherit; font-size: 14px; margin-bottom: 16px; }
  .search-box:focus { outline: none; border-color: #00ff41; }
  table { width: 100%; border-collapse: collapse; font-size: 13px; }
  th { background: #111; color: #00ff41; text-align: left; padding: 8px 12px; border-bottom: 1px solid #333; }
  td { padding: 6px 12px; border-bottom: 1px solid #1a1a1a; }
  tr:hover { background: #111; }
  .status { padding: 2px 8px; border-radius: 3px; font-size: 11px; }
  .status-complete { color: #00ff41; }
  .status-enriched { color: #00bfff; }
  .status-extracted { color: #ffa500; }
  .status-failed { color: #ff4444; }
  .status-queued { color: #888; }
  .stat-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
  .stat-card { background: #111; border: 1px solid #222; padding: 16px; }
  .stat-card .label { color: #666; font-size: 11px; text-transform: uppercase; }
  .stat-card .value { color: #00ff41; font-size: 28px; margin-top: 4px; }
  .result { background: #111; border: 1px solid #222; padding: 16px; margin-bottom: 12px; }
  .result .title { color: #00ff41; font-size: 14px; margin-bottom: 4px; }
  .result .meta { color: #666; font-size: 11px; margin-bottom: 8px; }
  .result .content-text { color: #999; font-size: 12px; line-height: 1.5; }
  .result .score { color: #ffa500; font-size: 12px; float: right; }
  .btn { background: #1a1a1a; border: 1px solid #333; color: #c0c0c0; padding: 6px 14px; cursor: pointer; font-family: inherit; font-size: 12px; }
  .btn:hover { border-color: #00ff41; color: #00ff41; }
  .domain-tag { display: inline-block; background: #1a1a1a; border: 1px solid #333; padding: 1px 6px; margin: 1px; font-size: 10px; color: #888; }
</style>
</head>
<body>
<div class="header">
  <h1>RECON</h1>
  <div class="stats">Knowledge Base Management System</div>
</div>
<div class="nav">
  <a href="/" id="nav-dash">Dashboard</a>
  <a href="/search" id="nav-search">Search</a>
  <a href="/catalogue" id="nav-cat">Catalogue</a>
  <a href="/failures" id="nav-fail">Failures</a>
</div>
<div class="content" id="main">
{{CONTENT}}
</div>
</body>
</html>"""


def render(content):
    return HTML_TEMPLATE.replace('{{CONTENT}}', content)


@app.route('/')
def dashboard():
    db = StatusDB()
    counts = db.get_status_counts()
    cat = counts.get('catalogue', {})
    doc = counts.get('documents', {})

    total_cat = sum(cat.values())
    total_doc = sum(doc.values())
    complete = doc.get('complete', 0)
    failed = doc.get('failed', 0)

    stats = f"""
    <div class="stat-grid">
        <div class="stat-card"><div class="label">Catalogued PDFs</div><div class="value">{total_cat}</div></div>
        <div class="stat-card"><div class="label">In Pipeline</div><div class="value">{total_doc}</div></div>
        <div class="stat-card"><div class="label">Complete</div><div class="value">{complete}</div></div>
        <div class="stat-card"><div class="label">Failed</div><div class="value">{failed}</div></div>
    </div>
    <h3 style="color:#00ff41;margin-bottom:12px;">Pipeline Status</h3>
    <table>
    <tr><th>Status</th><th>Count</th></tr>
    """
    for status in ['queued', 'extracting', 'extracted', 'enriching', 'enriched', 'embedding', 'complete', 'failed']:
        count = doc.get(status, 0)
        stats += f'<tr><td><span class="status status-{status}">{status}</span></td><td>{count}</td></tr>\n'

    stats += "</table>"

    sources = db.source_breakdown()
    if sources:
        stats += '<h3 style="color:#00ff41;margin:24px 0 12px;">Sources</h3><table><tr><th>Source</th><th>Count</th><th>Size</th></tr>'
        for s in sources:
            size_mb = (s.get('total_bytes', 0) or 0) / (1024 * 1024)
            stats += f"<tr><td>{s['source']}</td><td>{s['count']}</td><td>{size_mb:.1f} MB</td></tr>"
        stats += "</table>"

    return render(stats)


@app.route('/search')
def search_page():
    # Local import: escape user- and payload-supplied text before it is rendered as HTML
    from html import escape

    query = request.args.get('q', '')
    if not query:
        content = """
        <h3 style="color:#00ff41;margin-bottom:16px;">Semantic Search</h3>
        <form method="get" action="/search">
            <input type="text" name="q" class="search-box" placeholder="Search the knowledge base..." autofocus>
        </form>
        <p style="color:#666;font-size:12px;margin-top:8px;">Enter a query to search across all embedded concepts.</p>
        """
        return render(content)

    config = get_config()
    # type=int falls back to the default instead of raising on a non-numeric value
    limit = request.args.get('limit', 20, type=int)
    source_filter = request.args.get('source_type', None)

    try:
        # Embed the query text (Ollama-style /api/embed endpoint)
        url = f"http://{config['embedding']['host']}:{config['embedding']['port']}/api/embed"
        resp = http_requests.post(url, json={
            "model": config['embedding']['model'],
            "input": query
        }, timeout=120)
        resp.raise_for_status()
        query_vector = resp.json()['embeddings'][0]

        qdrant = QdrantClient(
            host=config['vector_db']['host'],
            port=config['vector_db']['port'],
            timeout=60
        )

        # Optionally restrict results to one source type
        search_filter = None
        if source_filter:
            search_filter = Filter(must=[
                FieldCondition(key="source_type", match=MatchValue(value=source_filter))
            ])

        results = qdrant.query_points(
            collection_name=config['vector_db']['collection'],
            query=query_vector,
            limit=limit,
            query_filter=search_filter
        ).points

        content = f"""
        <h3 style="color:#00ff41;margin-bottom:16px;">Results for: {escape(query)}</h3>
        <form method="get" action="/search">
            <input type="text" name="q" class="search-box" value="{escape(query)}">
        </form>
        <p style="color:#666;font-size:12px;margin-bottom:16px;">{len(results)} results</p>
        """

        for r in results:
            p = r.payload
            title = escape(str(p.get('title', 'Untitled')))
            summary = escape(str(p.get('summary', p.get('content', '')[:200])))
            score = r.score
            domains = p.get('domain', [])
            book = escape(str(p.get('book_title', p.get('filename', ''))))
            source_type = escape(str(p.get('source_type', 'document')))
            skill_level = escape(str(p.get('skill_level', 'unknown')))

            domain_tags = ''.join(
                f'<span class="domain-tag">{escape(str(d))}</span>'
                for d in (domains if isinstance(domains, list) else [])
            )

            content += f"""
            <div class="result">
                <span class="score">{score:.4f}</span>
                <div class="title">{title}</div>
                <div class="meta">{book} | {source_type} | {skill_level}</div>
                <div class="content-text">{summary}</div>
                <div style="margin-top:6px;">{domain_tags}</div>
            </div>
            """

        return render(content)

    except Exception as e:
        return render(f'<p style="color:#ff4444;">Search error: {escape(str(e))}</p>')


@app.route('/catalogue')
def catalogue_page():
    db = StatusDB()
    source = request.args.get('source', None)
    category = request.args.get('category', None)
    # type=int falls back to the default instead of raising on a non-numeric value
    limit = request.args.get('limit', 100, type=int)

    docs = db.get_all_documents(source=source, category=category, limit=limit)

    content = '<h3 style="color:#00ff41;margin-bottom:16px;">Document Catalogue</h3>'

    # Source filter buttons
    sources = db.get_sources()
    if sources:
        content += '<div style="margin-bottom:12px;">'
        content += '<a href="/catalogue" class="btn" style="margin-right:4px;">All</a>'
        for s in sources:
            content += f'<a href="/catalogue?source={s}" class="btn" style="margin-right:4px;">{s}</a>'
        content += '</div>'

    content += """<table>
    <tr><th>Filename</th><th>Source</th><th>Status</th><th>Pages</th><th>Concepts</th><th>Vectors</th></tr>"""

    for d in docs:
        status = d.get('status', 'unknown')
        content += f"""<tr>
        <td>{d.get('filename', '?')}</td>
        <td>{d.get('source', '')}</td>
        <td><span class="status status-{status}">{status}</span></td>
        <td>{d.get('pages_extracted', 0)}</td>
        <td>{d.get('concepts_extracted', 0)}</td>
        <td>{d.get('vectors_inserted', 0)}</td>
        </tr>"""

    content += "</table>"
    return render(content)


@app.route('/failures')
def failures_page():
    db = StatusDB()
    failures = db.get_failures()

    content = '<h3 style="color:#ff4444;margin-bottom:16px;">Failed Documents</h3>'

    if not failures:
        content += '<p style="color:#666;">No failures.</p>'
        return render(content)

    content += '<table><tr><th>Filename</th><th>Error</th><th>Retries</th><th>Actions</th></tr>'
    for f in failures:
        # "or 'unknown'" also covers rows where error_message is present but NULL
        content += f"""<tr>
        <td>{f.get('filename', '?')}</td>
        <td style="color:#ff4444;font-size:11px;">{(f.get('error_message') or 'unknown')[:100]}</td>
        <td>{f.get('retry_count', 0)}</td>
        <td><form method="post" action="/api/retry/{f['hash']}" style="display:inline;">
            <button class="btn" type="submit">Retry</button>
        </form></td>
        </tr>"""

    content += "</table>"
    return render(content)


@app.route('/api/search', methods=['POST'])
def api_search():
    config = get_config()
    # silent=True returns None for a missing or non-JSON body,
    # so the check below can answer with a clean 400
    data = request.get_json(silent=True)
    if not data or 'query' not in data:
        return jsonify({'error': 'Missing query'}), 400

    query = data['query']
    limit = data.get('limit', 20)
    source_type = data.get('source_type', None)

    try:
        # Embed the query text (Ollama-style /api/embed endpoint)
        url = f"http://{config['embedding']['host']}:{config['embedding']['port']}/api/embed"
        resp = http_requests.post(url, json={
            "model": config['embedding']['model'],
            "input": query
        }, timeout=120)
        resp.raise_for_status()
        query_vector = resp.json()['embeddings'][0]

        qdrant = QdrantClient(
            host=config['vector_db']['host'],
            port=config['vector_db']['port'],
            timeout=60
        )

        # Optionally restrict results to one source type
        search_filter = None
        if source_type:
            search_filter = Filter(must=[
                FieldCondition(key="source_type", match=MatchValue(value=source_type))
            ])

        results = qdrant.query_points(
            collection_name=config['vector_db']['collection'],
            query=query_vector,
            limit=limit,
            query_filter=search_filter
        ).points

        return jsonify({
            'query': query,
            'results': [
                {
                    'score': r.score,
                    'payload': r.payload
                }
                for r in results
            ]
        })

    except Exception as e:
        return jsonify({'error': str(e)}), 500


@app.route('/api/status')
def api_status():
    db = StatusDB()
    return jsonify(db.get_status_counts())


@app.route('/api/retry/<file_hash>', methods=['POST'])
def api_retry(file_hash):
    db = StatusDB()
    db.increment_retry(file_hash)
    return redirect('/failures')


@app.route('/api/ingest', methods=['POST'])
def api_ingest():
    from .ingester import ingest_intel
    # silent=True returns None for a missing or non-JSON body instead of raising
    data = request.get_json(silent=True)
    if not data:
        return jsonify({'error': 'No JSON body'}), 400

    config = get_config()
    result = ingest_intel(data, config)
    if result is not None:
        return jsonify({'intel_id': result})
    return jsonify({'error': 'Ingestion failed'}), 500


def run_server():
    config = get_config()
    host = config['web']['host']
    port = config['web']['port']
    logger.info(f"Starting RECON web dashboard on {host}:{port}")
    app.run(host=host, port=port, debug=False)
440
config.yaml
Normal file
@ -0,0 +1,440 @@
# RECON Configuration
# See PROJECT-BIBLE.md Section 11 for full documentation

# Root path for the PDF library (NFS mount from pi-nas)
library_root: /mnt/library

processing:
  max_pdf_size_mb: 2000  # Raised from 200MB default for large scanned books
  extract_workers: 4  # Concurrent PDF extraction threads
  enrich_workers: 16  # Concurrent Gemini enrichment threads (4 keys x 4)
  embed_workers: 4  # Concurrent embedding threads
  enrich_window_size: 5  # Pages per enrichment window (sent to Gemini)
  embed_batch_size: 500  # Vectors per Qdrant upsert batch
  rate_limit_delay: 0.1  # Delay between Gemini API calls (seconds)
  max_retries: 5  # Max retries for failed documents
  extract_timeout: 1800  # Max seconds per document extraction (30 min, allows vision OCR)
  page_timeout: 30  # Max seconds per page extraction
  enrich_max_retries: 5  # Max retries per enrichment window
  enrich_base_delay: 5.0  # Base backoff delay (seconds): ~5s, 10s, 20s, 40s, 80s
  enrich_max_delay: 120.0  # Maximum backoff delay cap (seconds)

embedding:
  backend: tei  # "tei" (primary, ~1,711 emb/sec) or "ollama" (fallback, ~8 emb/sec)
  tei_host: 100.64.0.14  # TEI server (cortex)
  tei_port: 8090  # TEI HTTP port
  ollama_host: 100.64.0.14  # Ollama server (cortex), fallback only
  ollama_port: 11434  # Ollama HTTP port
  model: bge-m3  # Embedding model name
  dimensions: 1024  # CRITICAL: bge-m3 is 1024-dim, NOT 384
  batch_size: 128  # Embeddings per TEI batch request

sparse_embedding:
  enabled: true
  host: 100.64.0.14  # Sparse embedding service (cortex)
  port: 8091  # Sparse embedding HTTP port

vector_db:
  host: 100.64.0.14  # Qdrant server (cortex)
  port: 6333  # Qdrant HTTP port
  collection: recon_knowledge_hybrid  # Collection name

gemini:
  model: gemini-2.0-flash  # Gemini model for enrichment
  response_mime_type: application/json  # Force JSON output from Gemini

web:
  port: 8420  # Dashboard HTTP port
  host: 0.0.0.0  # Bind address (all interfaces)

paths:
  base: /opt/recon  # Application root
  data: /opt/recon/data  # Data directory
  text: /opt/recon/data/text  # Extracted text output (data/text/{hash}/page_NNNN.txt)
  concepts: /opt/recon/data/concepts  # Enriched concept JSONs (data/concepts/{hash}/window_N.json)
  intel: /opt/recon/data/intel  # ARGUS intel feeds
  logs: /opt/recon/logs  # Log files
  db: /opt/recon/data/recon.db  # SQLite database (WAL mode)

book_server:
  base_url: https://files.echo6.co  # Public URL prefix for PDF downloads
  strip_prefix: /mnt/library  # Path prefix stripped when generating download URLs

upload_paths:  # Category -> filesystem path mapping for uploads
  Survival Reference: /mnt/library/Survival-Companion-Library/Uploads
  Military Doctrine: /mnt/library/Army_Pubs/Uploads
  Gaming: /mnt/library/Gaming
  Reference: /mnt/library/Reference
  Technical: /mnt/library/Technical
  default: /mnt/library  # Fallback for unknown categories

web_scraper:
  words_per_page: 2000  # Target words per page chunk for web content
  fetch_timeout: 30  # HTTP request timeout (seconds)
  rate_limit_delay: 1.0  # Delay between URL fetches (seconds)
  max_batch_size: 50  # Max URLs per batch ingest
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"

crawler:
  user_agent: "Mozilla/5.0 (compatible; RECON/1.0)"
  fetch_timeout: 30  # HTTP request timeout (seconds)
  rate_limit_delay: 1.0  # Delay between page fetches (seconds)
  max_pages: 500  # Max pages to discover per crawl
  max_depth: 3  # Max link-following depth (BFS only, not sitemap)
  inter_site_cooldown: 30  # Seconds to wait between crawling different sites
  recrawl_interval_days: 7  # Skip sites crawled within this many days

  default_exclude:  # URL patterns always excluded from crawling
    - /search
    - /404
    - /login
    - /signup
    - /auth/
    - /api/
    - /assets/
    - /static/
    - /cart
    - /checkout
    - /account
    - /register
    - /subscribe
    - /membership
    - /shop
    - /store
    - /product
    - /wp-admin
    - /feed
    - /wp-json
    - /xmlrpc
    - /.well-known
    - /cdn-cgi

  # ─── Crawl Targets ─────────────────────────────────────────────
  # Sites are crawled by the scheduler loop in tier order (1 first).
  # Per-site delay overrides global rate_limit_delay for that site.
  # Per-site max_pages/max_depth override global defaults.

  # Disabled 2026-04-14 for refactor; see refactored-recon repo for context
  sites: []

# sites:
#
#   # ═══ TIER 1: Free, authoritative, high-density ═══
#
#   - url: https://hesperian.org/all-hesperian-health-guides
#     category: Medical
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Free health guides -- WTIND, midwives, community health"
#
#   - url: https://swsbm.com
#     category: Medical
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Michael Moore's entire free clinical herbal library -- PDFs"
#
#   - url: https://swsbm.henriettesherbal.com
#     category: Medical
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Mirror of Moore's library -- grab both"
#
#   - url: https://nchfp.uga.edu
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 2.0
#     tier: 1
#     notes: "USDA canning/preservation safety authority"
#
#   - url: https://extension.uidaho.edu
#     category: Foundational Skills
#     max_depth: 3
#     delay: 2.0
#     tier: 1
#     notes: "Idaho-specific -- soil, water, crops, livestock"
#
#   - url: https://extension.usu.edu
#     category: Foundational Skills
#     max_depth: 3
#     delay: 2.0
#     tier: 1
#     notes: "Utah State -- Idaho-adjacent climate"
#
#   - url: https://attra.ncat.org
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "ATTRA sustainable ag -- hundreds of free publications"
#
#   - url: https://pfaf.org
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Plants For A Future -- 7,000+ edible/medicinal plant profiles"
#
#   - url: https://eattheweeds.com
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Green Deane -- 1,000+ foraging plant articles"
#
#   - url: https://lowtechmagazine.com
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Exceptional low-tech systems analysis"
#
#   - url: https://appropedia.org
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Appropriate technology wiki"
#
#   - url: https://journeytoforever.org
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "VITA manuals, biodiesel, biogas, hand tools archive"
#
#   - url: https://cd3wd.com
#     category: Off-Grid Systems
#     max_depth: 2
#     delay: 3.0
#     tier: 1
#     notes: "1,050+ appropriate technology eBooks -- index pages only"
#
#   - url: https://practicalselfreliance.com
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Ashley Adamant -- foraging, preservation, homesteading"
#
#   - url: https://open.oregonstate.edu/permaculture
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Millison's free permaculture textbook"
#
#   - url: https://open.oregonstate.edu/permaculturedesign
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Millison's advanced permaculture textbook"
#
#   - url: https://mushroomexpert.com
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 1
#     notes: "Michael Kuo -- mushroom ID, taxonomy, regional coverage"
#
#   # ═══ TIER 2: High value, second pass ═══
#
#   - url: https://motherearthnews.com
#     category: Foundational Skills
#     max_depth: 2
#     max_pages: 200
#     delay: 8.0
#     tier: 2
#     notes: "50 years of homesteading archive -- large commercial site, be polite"
#
#   - url: https://permacultureresearchinstitute.com
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 5.0
#     tier: 2
#     notes: "Geoff Lawton -- articles, case studies"
#
#   - url: https://learnyourland.com
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 5.0
#     tier: 2
#     notes: "Adam Haritan -- foraging articles"
#
#   - url: https://herbswithRosalee.com
#     category: Medical
#     max_depth: 3
#     delay: 5.0
#     tier: 2
#     notes: "Rosalee de la Foret -- clinical herbalism articles"
#
#   - url: https://commonwealthherbs.com
#     category: Medical
#     max_depth: 3
#     delay: 5.0
#     tier: 2
#     notes: "Katja and Ryn -- clinical herbalism"
#
#   - url: https://soilfoodweb.com
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 5.0
#     tier: 2
#     notes: "Elaine Ingham soil biology -- archive before it goes dark"
#
#   - url: https://rocketstoves.com
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 5.0
#     tier: 2
#     notes: "Ianto Evans -- rocket mass heater designs and PDFs"
#
#   - url: https://farmsteadmeatsmith.com
#     category: Sustainment Systems
#     max_depth: 2
#     delay: 5.0
#     tier: 2
#     notes: "Brandon Sheard -- butchering articles (free content only)"
#
#   - url: https://deeranddeerhunting.com
#     category: Sustainment Systems
#     max_depth: 2
#     delay: 5.0
#     tier: 2
#     notes: "Field dressing, processing, hunting technique library"
#
#   # ═══ TIER 3: Government (authoritative) ═══
#
#   - url: https://plants.usda.gov
#     category: Sustainment Systems
#     max_depth: 2
#     delay: 2.0
#     tier: 3
#     notes: "USDA native plant database"
#
#   - url: https://ars.usda.gov
#     category: Sustainment Systems
#     max_depth: 2
#     delay: 2.0
#     tier: 3
#     notes: "USDA Agricultural Research publications"
#
#   - url: https://nrcs.usda.gov
#     category: Off-Grid Systems
#     max_depth: 2
#     delay: 2.0
#     tier: 3
#     notes: "Soil surveys, conservation practice standards"
#
#   - url: https://ready.gov
#     category: Scenario Playbooks
#     max_depth: 3
#     delay: 2.0
#     tier: 3
#     notes: "FEMA emergency preparedness guides"
#
#   - url: https://emergency.cdc.gov
#     category: Medical
#     max_depth: 3
#     delay: 2.0
#     tier: 3
#     notes: "Public health emergency references"
#
#   - url: https://agri.idaho.gov
#     category: Foundational Skills
#     max_depth: 2
#     delay: 2.0
#     tier: 3
#     notes: "Idaho Dept of Agriculture -- local relevance"
#
#   - url: https://driveonwood.com
#     category: Off-Grid Systems
#     max_depth: 3
#     delay: 3.0
#     tier: 3
#     notes: "Wood gasification -- FEMA manual + modern improvements"
#
#   # ═══ TIER 4: Selective scrape (specific sections only) ═══
#
#   - url: https://richsoil.com
#     category: Off-Grid Systems
#     max_depth: 2
#     delay: 5.0
#     tier: 4
#     notes: "Paul Wheaton -- rocket mass heaters, natural building"
#
#   - url: https://wildfoodgirl.com
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 5.0
#     tier: 4
#     notes: "Colorado foraging -- Mountain West species"
#
#   - url: https://foragersharvest.com
#     category: Sustainment Systems
#     max_depth: 3
#     delay: 5.0
#     tier: 4
#     notes: "Sam Thayer's site -- articles"
#
#   - url: https://mountainroseherbs.com/blog
#     category: Medical
#     max_depth: 2
#     delay: 5.0
#     tier: 4
#     notes: "Herb profiles and preparations -- blog section only"
#
#   - url: https://herbalprepper.com
#     category: Medical
#     max_depth: 3
#     delay: 5.0
#     tier: 4
#     notes: "Cat Ellis -- grid-down herbalism"
#
#   - url: https://prolongedfieldcare.org
#     category: Medical
#     max_depth: 3
#     delay: 5.0
#     tier: 4
#     notes: "PFC Collective -- austere medical protocols"
#
service:
  scan_interval: 3600  # Seconds between library scans (1 hour)
  stage_poll_interval: 30  # Seconds stages sleep when idle
  progress_interval: 60  # Seconds between progress log lines

peertube:
  api_base: http://192.168.1.170  # Internal PeerTube API (CT 110 nginx)
  public_url: https://stream.echo6.co  # Public URL for video links
  fetch_timeout: 30  # HTTP timeout for API/VTT requests
  rate_limit_delay: 0.5  # Delay between video ingestions (seconds)

# Stream B: New Library Pipeline
new_pipeline:
  # Disabled 2026-04-14 for refactor; see refactored-recon repo for context
  enabled: false
  acquired_dir: /mnt/library/_acquired
  ingest_dir: /mnt/library/_ingest
  duplicates_dir: /mnt/library/_ingest/_duplicates
  failed_dir: /mnt/library/_ingest/_failed
  poll_interval: 60
  mtime_stability: 10
  pilot_domain: "Civil Organization"
  spaces_to_underscores: true

# Refactored pipeline configuration (2026-04-14)
# See https://forge.echo6.co/matt/refactored-recon for design
pipeline:
  acquired_root: /opt/recon/data/acquired
  processing_root: /opt/recon/data/processing
  # Subfolder name -> processor module mapping
  # Processors do not exist yet; this is scaffolding for Phase 3+
  dispatch:
    pdf: pdf_processor
    stream: transcript_processor
    html: html_processor
  # mtime stability threshold for picking up files from acquired/
  mtime_stability_seconds: 10
264
enricher.py
Normal file
@ -0,0 +1,264 @@
import json
import os
import re
import time
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed

import google.generativeai as genai

from .utils import get_config, setup_logging
from .status import StatusDB

logger = setup_logging('recon.enricher')


def repair_json(text):
    """Attempt to repair common LLM JSON output issues including truncation."""
    # Remove control characters except newlines and tabs
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
    # Remove trailing commas before } or ]
    text = re.sub(r',\s*([}\]])', r'\1', text)

    # If the cleaned text already parses, return it unchanged
    try:
        json.loads(text, strict=False)
        return text
    except json.JSONDecodeError:
        pass

    # Handle truncated JSON: scan forward, tracking string state and nesting
    # depth, and remember where the last complete top-level object ends.
    last_complete = -1
    open_brackets_at_last = 0
    depth_brace = 0
    depth_bracket = 0
    in_string = False
    escape = False

    for i, ch in enumerate(text):
        if escape:
            escape = False
            continue
        if ch == '\\' and in_string:
            escape = True
            continue
        if ch == '"':
            in_string = not in_string
            continue
        if in_string:
            continue
        if ch == '{':
            depth_brace += 1
        elif ch == '}':
            depth_brace -= 1
            if depth_brace == 0:
                last_complete = i
                # Arrays still open here; brackets inside strings are not counted
                open_brackets_at_last = depth_bracket
        elif ch == '[':
            depth_bracket += 1
        elif ch == ']':
            depth_bracket -= 1

    if last_complete > 0:
        truncated = text[:last_complete + 1].rstrip().rstrip(',')
        # Close any arrays that were still open at that point
        truncated += ']' * max(open_brackets_at_last, 0)
        return truncated

    return text
|
||||
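The truncation handling in `repair_json` above can be illustrated in isolation. This is a simplified sketch, assuming the model output is a single top-level JSON array of objects that was cut off mid-object; `close_truncated_array` is a hypothetical name, not a function in the repository.

```python
import json


def close_truncated_array(text):
    """Cut a truncated JSON array back to its last complete top-level object."""
    depth = 0            # current {...} nesting depth
    in_string = False    # inside a JSON string literal
    escape = False       # previous character was a backslash inside a string
    last_complete = -1   # index of the '}' closing the last complete object
    for i, ch in enumerate(text):
        if escape:
            escape = False
        elif in_string:
            if ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                last_complete = i
    if last_complete < 0:
        return text  # nothing recoverable, hand the text back unchanged
    return text[: last_complete + 1] + "]"


# The second object is cut off; the "}" inside a string must not confuse the scan.
raw = '[{"title": "Water", "keywords": ["a}", "b"]}, {"title": "Fi'
concepts = json.loads(close_truncated_array(raw))
```

Tracking string state matters because braces and brackets inside string values would otherwise be counted as structure.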
|
||||
ENRICH_PROMPT = """Extract knowledge concepts from this document text.
|
||||
|
||||
A concept is a SELF-CONTAINED piece of knowledge that can stand alone.
|
||||
|
||||
For each concept, provide ALL fields:
|
||||
|
||||
Required:
|
||||
- content: Full text of the concept (complete procedure, definition, etc.)
|
||||
- summary: 1-2 sentence summary
|
||||
- title: Brief descriptive title
|
||||
- domain: Array of 1-5 from: Foundational Skills, Sustainment Systems, Defense & Tactics, Off-Grid Systems, Communications, Scenario Playbooks, Reference
|
||||
- subdomain: Array of specific subcategories (up to 10)
|
||||
- keywords: Array of 3-30 searchable terms
|
||||
- skill_level: novice | intermediate | advanced
|
||||
- key_facts: Array of specific extractable claims, measurements, data points
|
||||
|
||||
Optional (include when present):
|
||||
- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
|
||||
- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
|
||||
- chapter: Chapter name if identifiable
|
||||
- page_ref: Page reference
|
||||
- notes: Any additional context
|
||||
|
||||
Return JSON array. If no extractable concepts, return [].
|
||||
|
||||
Document text:
|
||||
"""
|
||||
|
||||
|
||||
class KeyRotator:
|
||||
def __init__(self, keys):
|
||||
self.keys = keys
|
||||
self.index = 0
|
||||
|
||||
def next(self):
|
||||
if not self.keys:
|
||||
raise ValueError("No Gemini API keys configured")
|
||||
key = self.keys[self.index % len(self.keys)]
|
||||
self.index += 1
|
||||
return key
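`run_enrichment` shares one `KeyRotator` across all worker threads, and `self.index += 1` is a read-modify-write that is not guaranteed to be atomic. A hedged sketch of a lock-protected variant follows; `LockedKeyRotator` is a hypothetical name, and unlike the class above it rejects an empty key list at construction time instead of on the first call.

```python
import itertools
import threading


class LockedKeyRotator:
    """Round-robin key rotation that is safe to share across worker threads."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("No API keys configured")
        self._cycle = itertools.cycle(list(keys))
        self._lock = threading.Lock()

    def next(self):
        # The lock makes each hand-out atomic, so concurrent callers
        # advance the rotation one step at a time.
        with self._lock:
            return next(self._cycle)


rotator = LockedKeyRotator(["key-a", "key-b"])
handed_out = [rotator.next() for _ in range(5)]
```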
|
||||
|
||||
|
||||
def enrich_window(text, key, config):
|
||||
genai.configure(api_key=key)
|
||||
model = genai.GenerativeModel(
|
||||
config['gemini']['model'],
|
||||
generation_config={"response_mime_type": config['gemini']['response_mime_type']}
|
||||
)
|
||||
response = model.generate_content(ENRICH_PROMPT + text)
|
||||
raw = response.text
|
||||
try:
|
||||
return json.loads(raw, strict=False)
|
||||
except json.JSONDecodeError:
|
||||
repaired = repair_json(raw)
|
||||
return json.loads(repaired, strict=False)
|
||||
|
||||
|
||||
def enrich_single(file_hash, db, config, key_rotator):
|
||||
doc = db.get_document(file_hash)
|
||||
if not doc:
|
||||
return False
|
||||
|
||||
text_dir = os.path.join(config['paths']['text'], file_hash)
|
||||
concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
|
||||
window_size = config['processing']['enrich_window_size']
|
||||
delay = config['processing']['rate_limit_delay']
|
||||
max_retries = config['processing']['max_retries']
|
||||
|
||||
if not os.path.exists(text_dir):
|
||||
db.mark_failed(file_hash, f"Text directory not found: {text_dir}")
|
||||
return False
|
||||
|
||||
db.update_status(file_hash, 'enriching')
|
||||
|
||||
try:
|
||||
os.makedirs(concepts_dir, exist_ok=True)
|
||||
|
||||
page_files = sorted([f for f in os.listdir(text_dir) if f.startswith('page_') and f.endswith('.txt')])
|
||||
if not page_files:
|
||||
db.mark_failed(file_hash, "No page files found")
|
||||
return False
|
||||
|
||||
pages_text = []
|
||||
for pf in page_files:
|
||||
with open(os.path.join(text_dir, pf), encoding='utf-8') as f:
|
||||
pages_text.append(f.read())
|
||||
|
||||
windows = []
|
||||
for i in range(0, len(pages_text), window_size):
|
||||
window_pages = pages_text[i:i + window_size]
|
||||
combined = "\n\n".join(f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages))
|
||||
windows.append((i, combined))
|
||||
|
||||
total_concepts = 0
|
||||
for w_idx, (start_page, window_text) in enumerate(windows):
|
||||
window_file = os.path.join(concepts_dir, f"window_{w_idx+1:04d}.json")
|
||||
|
||||
if os.path.exists(window_file):
|
||||
with open(window_file, encoding='utf-8') as f:
|
||||
existing = json.load(f)
|
||||
total_concepts += len(existing)
|
||||
logger.debug(f" Window {w_idx+1} already exists, skipping")
|
||||
continue
|
||||
|
||||
if len(window_text.strip()) < 50:
|
||||
with open(window_file, 'w') as f:
|
||||
json.dump([], f)
|
||||
continue
|
||||
|
||||
concepts = None
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
key = key_rotator.next()
|
||||
concepts = enrich_window(window_text, key, config)
|
||||
break
|
||||
except Exception as e:
|
||||
logger.warning(f" Window {w_idx+1} attempt {attempt+1} failed: {e}")
|
||||
if attempt < max_retries - 1:
|
||||
time.sleep(delay * (attempt + 1) * 2)
|
||||
|
||||
if concepts is None:
|
||||
db.mark_failed(file_hash, f"All retries failed for window {w_idx+1}")
|
||||
return False
|
||||
|
||||
            if not isinstance(concepts, list):
                concepts = [concepts] if isinstance(concepts, dict) else []
            # Keep only dict entries so the metadata assignments below cannot fail
            concepts = [c for c in concepts if isinstance(c, dict)]

            for concept in concepts:
                concept['_window'] = w_idx + 1
                concept['_start_page'] = start_page + 1
                concept['_doc_hash'] = file_hash
|
||||
|
||||
# JSON FIRST: save before anything else
|
||||
with open(window_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(concepts, f, indent=2, ensure_ascii=False)
|
||||
|
||||
total_concepts += len(concepts)
|
||||
logger.debug(f" Window {w_idx+1}/{len(windows)}: {len(concepts)} concepts")
|
||||
time.sleep(delay)
|
||||
|
||||
meta = {
|
||||
'hash': file_hash,
|
||||
'total_windows': len(windows),
|
||||
'total_concepts': total_concepts,
|
||||
'window_size': window_size,
|
||||
'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
|
||||
}
|
||||
with open(os.path.join(concepts_dir, 'meta.json'), 'w') as f:
|
||||
json.dump(meta, f, indent=2)
|
||||
|
||||
db.update_status(file_hash, 'enriched', concepts_extracted=total_concepts)
|
||||
logger.info(f"Enriched {doc['filename']}: {total_concepts} concepts from {len(windows)} windows")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Enrichment failed for {file_hash}: {e}\n{traceback.format_exc()}")
|
||||
db.mark_failed(file_hash, str(e))
|
||||
return False
|
||||
|
||||
|
||||
def run_enrichment(workers=None, limit=None):
|
||||
config = get_config()
|
||||
db = StatusDB()
|
||||
workers = workers or config['processing']['enrich_workers']
|
||||
|
||||
keys = config.get('gemini_keys', [])
|
||||
if not keys:
|
||||
logger.error("No Gemini API keys configured in .env")
|
||||
return 0
|
||||
|
||||
key_rotator = KeyRotator(keys)
|
||||
|
||||
extracted = db.get_by_status('extracted', limit=limit)
|
||||
if not extracted:
|
||||
logger.info("No extracted documents to enrich")
|
||||
return 0
|
||||
|
||||
logger.info(f"Enriching {len(extracted)} documents with {workers} workers, {len(keys)} API key(s)")
|
||||
success = 0
|
||||
|
||||
with ThreadPoolExecutor(max_workers=workers) as pool:
|
||||
futures = {
|
||||
pool.submit(enrich_single, doc['hash'], StatusDB(), config, key_rotator): doc
|
||||
for doc in extracted
|
||||
}
|
||||
for future in as_completed(futures):
|
||||
doc = futures[future]
|
||||
try:
|
||||
if future.result():
|
||||
success += 1
|
||||
except Exception as e:
|
||||
logger.error(f"Worker error for {doc['hash']}: {e}")
|
||||
|
||||
logger.info(f"Enrichment complete: {success}/{len(extracted)} succeeded")
|
||||
return success
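The page windowing that `enrich_single` performs before calling Gemini can be sketched on its own. The function name `build_windows` is hypothetical; the logic mirrors the loop in the file above, where each window carries its 0-based start index and labels pages from 1.

```python
def build_windows(pages_text, window_size):
    """Group page texts into fixed-size windows, labelling each page 1-based."""
    windows = []
    for start in range(0, len(pages_text), window_size):
        chunk = pages_text[start:start + window_size]
        combined = "\n\n".join(
            f"--- Page {start + offset + 1} ---\n{text}"
            for offset, text in enumerate(chunk)
        )
        windows.append((start, combined))
    return windows


windows = build_windows(["alpha", "beta", "gamma"], window_size=2)
```

With three pages and a window size of two, the last window holds only the third page.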
|
||||
0
lib/__init__.py
Normal file
1930
lib/api.py
Normal file
File diff suppressed because it is too large
432
lib/crawler.py
Normal file
|
|
@ -0,0 +1,432 @@
|
|||
"""
|
||||
RECON Site Crawler — URL discovery for bulk web ingestion.
|
||||
|
||||
Two discovery strategies:
|
||||
1. Sitemap-based (preferred) — parses sitemap.xml for all URLs
|
||||
2. Link-following (fallback) — crawls from root URL following internal links
|
||||
|
||||
Discovered URLs are fed into web_scraper.ingest_url() for processing.
|
||||
"""
|
||||
|
||||
import re
|
||||
import time
|
||||
from collections import deque
|
||||
from urllib.parse import urlparse, urljoin, urldefrag
|
||||
|
||||
import requests
|
||||
from lxml import etree
|
||||
|
||||
from .utils import get_config, setup_logging
|
||||
|
||||
logger = setup_logging('recon.crawler')
|
||||
|
||||
|
||||
def _get_crawler_config(config=None):
|
||||
"""Load crawler config with defaults."""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
crawler_cfg = config.get('crawler', {})
|
||||
web_cfg = config.get('web_scraper', {})
|
||||
return {
|
||||
'user_agent': (
|
||||
crawler_cfg.get('user_agent') or
|
||||
web_cfg.get('user_agent') or
|
||||
'Mozilla/5.0 (compatible; RECON/1.0)'
|
||||
),
|
||||
'fetch_timeout': crawler_cfg.get('fetch_timeout', 30),
|
||||
'rate_limit_delay': crawler_cfg.get('rate_limit_delay', 1.0),
|
||||
'max_pages': crawler_cfg.get('max_pages', 500),
|
||||
'max_depth': crawler_cfg.get('max_depth', 3),
|
||||
'default_exclude': crawler_cfg.get('default_exclude', [
|
||||
'/search', '/404', '/login', '/signup', '/auth/', '/api/', '/assets/', '/static/'
|
||||
]),
|
||||
}
|
||||
|
||||
|
||||
# ─── Sitemap Discovery ─────────────────────────────────────────────
|
||||
|
||||
def discover_sitemap_url(base_url, config=None):
|
||||
"""
|
||||
Find the sitemap URL for a site.
|
||||
|
||||
Checks: robots.txt Sitemap: directive, /sitemap.xml,
|
||||
/sitemap_index.xml, /sitemap-0.xml.
|
||||
|
||||
Returns sitemap URL or None.
|
||||
"""
|
||||
cfg = _get_crawler_config(config)
|
||||
headers = {'User-Agent': cfg['user_agent']}
|
||||
parsed = urlparse(base_url)
|
||||
root = f"{parsed.scheme}://{parsed.netloc}"
|
||||
|
||||
# Check robots.txt first
|
||||
try:
|
||||
resp = requests.get(
|
||||
f"{root}/robots.txt",
|
||||
headers=headers,
|
||||
timeout=cfg['fetch_timeout']
|
||||
)
|
||||
if resp.status_code == 200:
|
||||
for line in resp.text.splitlines():
|
||||
if line.strip().lower().startswith('sitemap:'):
|
||||
                    # Split on the first colon only, so the URL's own "https:" stays intact
                    sitemap_url = line.split(':', 1)[1].strip()
                    if not sitemap_url:
                        continue
|
||||
logger.info(f"Found sitemap in robots.txt: {sitemap_url}")
|
||||
return sitemap_url
|
||||
except Exception as e:
|
||||
logger.debug(f"robots.txt fetch failed: {e}")
|
||||
|
||||
# Try common sitemap locations
|
||||
candidates = [
|
||||
f"{root}/sitemap.xml",
|
||||
f"{root}/sitemap_index.xml",
|
||||
f"{root}/sitemap-0.xml",
|
||||
]
|
||||
|
||||
for url in candidates:
|
||||
try:
|
||||
resp = requests.head(
|
||||
url,
|
||||
headers=headers,
|
||||
timeout=cfg['fetch_timeout'],
|
||||
allow_redirects=True
|
||||
)
|
||||
if resp.status_code == 200:
|
||||
logger.info(f"Found sitemap at: {url}")
|
||||
return url
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
logger.warning(f"No sitemap found for {base_url}")
|
||||
return None
|
||||
|
||||
|
||||
def parse_sitemap(sitemap_url, config=None):
|
||||
"""
|
||||
Parse a sitemap XML and return all page URLs.
|
||||
|
||||
Handles standard sitemaps (<urlset>) and sitemap indexes
|
||||
(<sitemapindex>) with recursive sub-sitemap fetching.
|
||||
"""
|
||||
cfg = _get_crawler_config(config)
|
||||
headers = {'User-Agent': cfg['user_agent']}
|
||||
all_urls = []
|
||||
|
||||
def _fetch_and_parse(url, depth=0):
|
||||
if depth > 3:
|
||||
return
|
||||
|
||||
try:
|
||||
resp = requests.get(url, headers=headers, timeout=cfg['fetch_timeout'])
|
||||
resp.raise_for_status()
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to fetch sitemap {url}: {e}")
|
||||
return
|
||||
|
||||
try:
|
||||
root = etree.fromstring(resp.content)
|
||||
except etree.XMLSyntaxError as e:
|
||||
logger.error(f"Invalid XML in sitemap {url}: {e}")
|
||||
return
|
||||
|
||||
nsmap = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
|
||||
|
||||
# Check if this is a sitemap index
|
||||
sitemap_locs = root.findall('.//ns:sitemap/ns:loc', nsmap)
|
||||
if sitemap_locs:
|
||||
logger.info(f"Sitemap index at {url} — {len(sitemap_locs)} sub-sitemaps")
|
||||
for loc in sitemap_locs:
|
||||
if loc.text:
|
||||
_fetch_and_parse(loc.text.strip(), depth + 1)
|
||||
return
|
||||
|
||||
# Standard sitemap — extract URLs
|
||||
url_locs = root.findall('.//ns:loc', nsmap)
|
||||
|
||||
# Fallback: try without namespace
|
||||
if not url_locs:
|
||||
url_locs = root.findall('.//loc')
|
||||
|
||||
for loc in url_locs:
|
||||
if loc.text:
|
||||
all_urls.append(loc.text.strip())
|
||||
|
||||
logger.info(f"Parsed {len(url_locs)} URLs from {url}")
|
||||
|
||||
_fetch_and_parse(sitemap_url)
|
||||
|
||||
# Deduplicate preserving order
|
||||
seen = set()
|
||||
unique = []
|
||||
for url in all_urls:
|
||||
url_clean = urldefrag(url)[0]
|
||||
if url_clean not in seen:
|
||||
seen.add(url_clean)
|
||||
unique.append(url_clean)
|
||||
|
||||
logger.info(f"Total unique URLs from sitemap: {len(unique)}")
|
||||
return unique
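The order-preserving deduplication at the end of `parse_sitemap` can also be written with `dict.fromkeys`, since dictionaries keep insertion order. This is an equivalent sketch, not the repository's code; `dedupe_urls` is a hypothetical name and the example URLs are placeholders.

```python
from urllib.parse import urldefrag


def dedupe_urls(urls):
    """Drop fragments, then remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(urldefrag(u)[0] for u in urls))


unique = dedupe_urls([
    "https://example.org/a#intro",
    "https://example.org/b",
    "https://example.org/a",
])
```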
|
||||
|
||||
|
||||
# ─── Link-Following Discovery (Fallback) ───────────────────────────
|
||||
|
||||
def crawl_links(base_url, max_depth=3, max_pages=500, config=None):
|
||||
"""
|
||||
Discover URLs by following internal links (BFS).
|
||||
Fallback when no sitemap is available.
|
||||
"""
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
cfg = _get_crawler_config(config)
|
||||
headers = {'User-Agent': cfg['user_agent']}
|
||||
|
||||
parsed_base = urlparse(base_url)
|
||||
base_domain = parsed_base.netloc
|
||||
|
||||
discovered = []
|
||||
visited = set()
|
||||
queue = deque([(base_url, 0)])
|
||||
|
||||
skip_extensions = (
|
||||
'.pdf', '.png', '.jpg', '.jpeg', '.gif', '.svg',
|
||||
'.css', '.js', '.zip', '.tar', '.gz', '.mp4', '.mp3',
|
||||
'.ico', '.woff', '.woff2', '.ttf', '.eot',
|
||||
)
|
||||
skip_paths = (
|
||||
'/tag/', '/tags/', '/page/', '/feed/', '/rss/',
|
||||
'/wp-json/', '/wp-admin/', '/wp-includes/',
|
||||
)
|
||||
|
||||
while queue and len(discovered) < max_pages:
|
||||
url, depth = queue.popleft()
|
||||
url = urldefrag(url)[0]
|
||||
|
||||
if url in visited:
|
||||
continue
|
||||
if depth > max_depth:
|
||||
continue
|
||||
|
||||
visited.add(url)
|
||||
discovered.append(url)
|
||||
|
||||
if depth >= max_depth:
|
||||
continue
|
||||
|
||||
try:
|
||||
resp = requests.get(url, headers=headers, timeout=cfg['fetch_timeout'])
|
||||
if resp.status_code != 200:
|
||||
continue
|
||||
if 'text/html' not in resp.headers.get('content-type', ''):
|
||||
continue
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
try:
|
||||
soup = BeautifulSoup(resp.text, 'lxml')
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
for a_tag in soup.find_all('a', href=True):
|
||||
href = a_tag['href']
|
||||
full_url = urljoin(url, href)
|
||||
full_url = urldefrag(full_url)[0]
|
||||
|
||||
parsed = urlparse(full_url)
|
||||
if parsed.netloc != base_domain:
|
||||
continue
|
||||
if any(parsed.path.lower().endswith(ext) for ext in skip_extensions):
|
||||
continue
|
||||
if any(skip in parsed.path.lower() for skip in skip_paths):
|
||||
continue
|
||||
|
||||
if full_url not in visited:
|
||||
queue.append((full_url, depth + 1))
|
||||
|
||||
time.sleep(cfg['rate_limit_delay'])
|
||||
|
||||
logger.info(f"Link crawl: {len(discovered)} URLs (visited {len(visited)}, depth {max_depth})")
|
||||
return discovered
|
||||
|
||||
|
||||
# ─── URL Filtering ──────────────────────────────────────────────────
|
||||
|
||||
def filter_urls(urls, include=None, exclude=None):
    """
    Filter URLs by path prefix include/exclude rules.

    include: URL must match at least one prefix (if provided)
    exclude: URL must not match any prefix
    """
    filtered = []

    for url in urls:
        path = urlparse(url).path

        if include:
            if not any(path.startswith(prefix) for prefix in include):
                continue

        if exclude:
            if any(path.startswith(prefix) for prefix in exclude):
                continue

        filtered.append(url)

    logger.info(f"Filtered {len(urls)} -> {len(filtered)} URLs "
                f"(include={include}, exclude={exclude})")
    return filtered
|
||||
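Because `filter_urls` uses plain `str.startswith`, a default exclude such as `/search` also matches unrelated paths like `/searchable-maps`. If that is not wanted, one possible refinement is to match on whole path segments. The helper `path_has_prefix` below is hypothetical and is not part of the crawler.

```python
from urllib.parse import urlparse


def path_has_prefix(path, prefix):
    """Match `prefix` only on whole path segments."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


path = urlparse("https://example.org/searchable-maps").path
plain_match = path.startswith("/search")            # what filter_urls does today
segment_match = path_has_prefix(path, "/search")    # segment-aware alternative
```

Trailing slashes on the prefix are ignored, so `/auth/` and `/auth` behave the same way in this sketch.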
|
||||
|
||||
# ─── Main Crawl Orchestrator ────────────────────────────────────────
|
||||
|
||||
def crawl_site(
|
||||
base_url,
|
||||
category='Web',
|
||||
source=None,
|
||||
include=None,
|
||||
exclude=None,
|
||||
max_pages=None,
|
||||
max_depth=None,
|
||||
delay=None,
|
||||
dry_run=False,
|
||||
use_sitemap=True,
|
||||
use_links=True,
|
||||
config=None,
|
||||
):
|
||||
"""
|
||||
Crawl a site and ingest all discovered pages.
|
||||
|
||||
1. Discover URLs via sitemap or link-following
|
||||
2. Apply include/exclude filters
|
||||
3. Feed each URL through web_scraper.ingest_url()
|
||||
|
||||
Returns summary dict with counts and per-URL results.
|
||||
"""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
cfg = _get_crawler_config(config)
|
||||
|
||||
if max_pages is None:
|
||||
max_pages = cfg['max_pages']
|
||||
if max_depth is None:
|
||||
max_depth = cfg['max_depth']
|
||||
if delay is None:
|
||||
delay = cfg['rate_limit_delay']
|
||||
if source is None:
|
||||
source = urlparse(base_url).netloc
|
||||
|
||||
logger.info(f"Crawling {base_url} (category={category}, max_pages={max_pages})")
|
||||
|
||||
# ── Phase 1: Discover URLs ──
|
||||
|
||||
urls = []
|
||||
discovery_method = None
|
||||
|
||||
if use_sitemap:
|
||||
sitemap_url = discover_sitemap_url(base_url, config)
|
||||
if sitemap_url:
|
||||
urls = parse_sitemap(sitemap_url, config)
|
||||
discovery_method = 'sitemap'
|
||||
|
||||
if not urls and use_links:
|
||||
logger.info("No sitemap URLs, falling back to link crawl...")
|
||||
urls = crawl_links(base_url, max_depth=max_depth, max_pages=max_pages, config=config)
|
||||
discovery_method = 'link_crawl'
|
||||
|
||||
if not urls:
|
||||
logger.warning(f"No URLs discovered for {base_url}")
|
||||
return {
|
||||
'site': base_url,
|
||||
'discovery_method': None,
|
||||
'urls_discovered': 0,
|
||||
'urls_after_filter': 0,
|
||||
'results': [],
|
||||
'summary': {'total': 0, 'succeeded': 0, 'duplicates': 0, 'failed': 0},
|
||||
}
|
||||
|
||||
# ── Phase 2: Filter URLs ──
|
||||
|
||||
all_exclude = list(cfg['default_exclude'])
|
||||
if exclude:
|
||||
all_exclude.extend(exclude)
|
||||
|
||||
urls = filter_urls(urls, include=include, exclude=all_exclude)
|
||||
|
||||
if len(urls) > max_pages:
|
||||
logger.info(f"Limiting to {max_pages} pages (discovered {len(urls)})")
|
||||
urls = urls[:max_pages]
|
||||
|
||||
logger.info(f"After filtering: {len(urls)} URLs to process")
|
||||
|
||||
# ── Dry run ──
|
||||
|
||||
if dry_run:
|
||||
return {
|
||||
'site': base_url,
|
||||
'discovery_method': discovery_method,
|
||||
'dry_run': True,
|
||||
'urls_discovered': len(urls),
|
||||
'urls': urls,
|
||||
}
|
||||
|
||||
# ── Phase 3: Ingest each URL ──
|
||||
|
||||
from .web_scraper import ingest_url
|
||||
|
||||
results = []
|
||||
total = len(urls)
|
||||
|
||||
for i, url in enumerate(urls, 1):
|
||||
logger.info(f"[{i}/{total}] Ingesting: {url}")
|
||||
|
||||
try:
|
||||
result = ingest_url(url, category=category, source=source, config=config)
|
||||
result['url'] = url
|
||||
results.append(result)
|
||||
|
||||
status = result.get('status', 'unknown')
|
||||
title = result.get('title', '')
|
||||
if status == 'duplicate':
|
||||
logger.info(f" DUPLICATE: {title}")
|
||||
else:
|
||||
logger.info(f" OK: {title} ({result.get('page_count', 0)} pages)")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f" FAILED: {url} -- {e}")
|
||||
results.append({
|
||||
'url': url,
|
||||
'status': 'failed',
|
||||
'error': str(e),
|
||||
})
|
||||
|
||||
if i < total and delay > 0:
|
||||
time.sleep(delay)
|
||||
|
||||
# ── Summary ──
|
||||
|
||||
succeeded = sum(1 for r in results if r.get('status') not in ('failed', 'duplicate'))
|
||||
duplicates = sum(1 for r in results if r.get('status') == 'duplicate')
|
||||
failed = sum(1 for r in results if r.get('status') == 'failed')
|
||||
|
||||
summary = {
|
||||
'total': len(results),
|
||||
'succeeded': succeeded,
|
||||
'duplicates': duplicates,
|
||||
'failed': failed,
|
||||
}
|
||||
|
||||
logger.info(f"Crawl complete: {succeeded} new, {duplicates} duplicates, {failed} failed out of {total}")
|
||||
|
||||
return {
|
||||
'site': base_url,
|
||||
'domain': urlparse(base_url).netloc,
|
||||
'category': category,
|
||||
'discovery_method': discovery_method,
|
||||
'urls_discovered': total,
|
||||
'results': results,
|
||||
'summary': summary,
|
||||
}
|
||||
430
lib/embedder.py
Normal file
|
|
@ -0,0 +1,430 @@
|
|||
"""
|
||||
RECON Embedder
|
||||
|
||||
Concepts to vectors via TEI (primary, 1024-dim bge-m3, ~1,711 emb/sec)
|
||||
or Ollama (fallback, ~8 emb/sec). Inserts into Qdrant on cortex:6333.
|
||||
|
||||
Supports hybrid dense+sparse vectors when sparse_embedding service is configured.
|
||||
|
||||
Dependencies: requests, qdrant-client
|
||||
Config: embedding, vector_db, processing.embed_workers
|
||||
"""
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
import traceback
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
|
||||
import requests as http_requests
|
||||
from qdrant_client import QdrantClient
|
||||
from qdrant_client.models import PointStruct, SparseVector
|
||||
|
||||
from .utils import get_config, concept_id, generate_download_url, setup_logging
|
||||
from .status import StatusDB
|
||||
|
||||
logger = setup_logging('recon.embedder')
|
||||
|
||||
# ── Classification allowlists ───────────────────────────────────────────────
|
||||
VALID_DOMAINS = {
|
||||
'Agriculture & Livestock', 'Civil Organization', 'Communications',
|
||||
'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
|
||||
'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
|
||||
'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
|
||||
'Vehicles', 'Water Systems', 'Wilderness Skills',
|
||||
}
|
||||
VALID_KNOWLEDGE_TYPES = {'foundational', 'procedural', 'operational'}
|
||||
VALID_COMPLEXITIES = {'basic', 'intermediate', 'advanced'}
|
||||
|
||||
DOMAIN_FALLBACK = 'Foundational Skills'
|
||||
KNOWLEDGE_TYPE_FALLBACK = 'foundational'
|
||||
COMPLEXITY_FALLBACK = 'basic'
|
||||
|
||||
|
||||
def _validate_classification(payload):
|
||||
"""Validate domain, knowledge_type, complexity before upsert.
|
||||
|
||||
Logs WARNING and applies safe fallback for any invalid values.
|
||||
Returns the payload (modified in place if needed).
|
||||
"""
|
||||
title = payload.get('title', payload.get('filename', '?'))
|
||||
|
||||
# ── domain ──────────────────────────────────────────────────────────
|
||||
domain = payload.get('domain')
|
||||
if isinstance(domain, list):
|
||||
valid = [d for d in domain if d in VALID_DOMAINS]
|
||||
if valid:
|
||||
payload['domain'] = valid[0]
|
||||
else:
|
||||
logger.warning(f"Invalid domain {domain} for '{title}', fallback → {DOMAIN_FALLBACK}")
|
||||
payload['domain'] = DOMAIN_FALLBACK
|
||||
elif isinstance(domain, str):
|
||||
if domain not in VALID_DOMAINS:
|
||||
logger.warning(f"Invalid domain '{domain}' for '{title}', fallback → {DOMAIN_FALLBACK}")
|
||||
payload['domain'] = DOMAIN_FALLBACK
|
||||
else:
|
||||
payload['domain'] = DOMAIN_FALLBACK
|
||||
|
||||
# ── knowledge_type ──────────────────────────────────────────────────
|
||||
kt = payload.get('knowledge_type', '')
|
||||
if isinstance(kt, str):
|
||||
kt = kt.lower().strip()
|
||||
else:
|
||||
kt = ''
|
||||
if kt not in VALID_KNOWLEDGE_TYPES:
|
||||
logger.warning(f"Invalid knowledge_type '{kt}' for '{title}', fallback → {KNOWLEDGE_TYPE_FALLBACK}")
|
||||
payload['knowledge_type'] = KNOWLEDGE_TYPE_FALLBACK
|
||||
else:
|
||||
payload['knowledge_type'] = kt
|
||||
|
||||
# ── complexity ──────────────────────────────────────────────────────
|
||||
cx = payload.get('complexity', '')
|
||||
if isinstance(cx, str):
|
||||
cx = cx.lower().strip()
|
||||
else:
|
||||
cx = ''
|
||||
if cx not in VALID_COMPLEXITIES:
|
||||
logger.warning(f"Invalid complexity '{cx}' for '{title}', fallback → {COMPLEXITY_FALLBACK}")
|
||||
payload['complexity'] = COMPLEXITY_FALLBACK
|
||||
else:
|
||||
payload['complexity'] = cx
|
||||
|
||||
return payload
|
||||
|
||||
|
||||
def get_embedding_single(text, config):
|
||||
"""Get a single embedding — uses TEI or Ollama depending on config."""
|
||||
backend = config['embedding'].get('backend', 'ollama')
|
||||
|
||||
if backend == 'tei':
|
||||
url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
|
||||
resp = http_requests.post(url, json={"inputs": text}, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()[0]
|
||||
else:
|
||||
url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"
|
||||
resp = http_requests.post(url, json={
|
||||
"model": config['embedding']['model'],
|
||||
"input": text
|
||||
}, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()['embeddings'][0]
|
||||
|
||||
|
||||
def get_embeddings_batch(texts, config):
    """Get embeddings for a batch of texts via TEI.

    On failure, splits the batch in half and retries each half recursively;
    a single text that still fails re-raises the error.
    """
    url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"

    try:
        resp = http_requests.post(url, json={"inputs": texts}, timeout=300)
        resp.raise_for_status()
        return resp.json()
    except Exception as e:
        if len(texts) <= 1:
            raise
        # Split batch in half and retry each half
        mid = len(texts) // 2
        logger.warning(f" Batch of {len(texts)} failed ({e}), splitting in half")
        left = get_embeddings_batch(texts[:mid], config)
        right = get_embeddings_batch(texts[mid:], config)
        return left + right
|
||||
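The split-and-retry behaviour of `get_embeddings_batch` above can be exercised without a TEI server. In this sketch `embed_with_halving` and `fake_embed` are hypothetical stand-ins: the fake backend simply rejects batches of more than two inputs, which forces the recursion.

```python
def embed_with_halving(texts, embed_fn):
    """Call embed_fn on the batch; on failure, split in half and retry each half."""
    try:
        return embed_fn(texts)
    except Exception:
        if len(texts) <= 1:
            raise  # a single input that still fails is a real error
        mid = len(texts) // 2
        return (embed_with_halving(texts[:mid], embed_fn)
                + embed_with_halving(texts[mid:], embed_fn))


def fake_embed(texts):
    """Hypothetical backend that rejects batches larger than two inputs."""
    if len(texts) > 2:
        raise RuntimeError("batch too large")
    return [[float(len(t))] for t in texts]


vectors = embed_with_halving(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed)
```

Concatenating the left half before the right half keeps the output aligned with the input order, which the caller relies on when pairing vectors with concepts.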
|
||||
|
||||
def get_sparse_embeddings_batch(texts, config):
|
||||
"""Get sparse embeddings from the sparse embedding service on cortex.
|
||||
|
||||
Returns a list of dicts with 'indices' and 'values' keys, or None on failure.
|
||||
"""
|
||||
sparse_cfg = config.get('sparse_embedding')
|
||||
if not sparse_cfg or not sparse_cfg.get('enabled', False):
|
||||
return None
|
||||
|
||||
url = f"http://{sparse_cfg['host']}:{sparse_cfg['port']}/embed_sparse"
|
||||
|
||||
try:
|
||||
resp = http_requests.post(url, json={"inputs": texts}, timeout=300)
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
except Exception as e:
|
||||
logger.warning(f" Sparse embedding failed for batch of {len(texts)}: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def _validate_content(content):
    """Validate and normalize concept content for embedding. Returns clean string or None."""
    if content is None:
        return None
    if not isinstance(content, str):
        content = str(content)
    content = content.strip()
    if len(content) < 10:
        return None
    # Truncate to 8192 chars (Ollama/TEI input limit)
    if len(content) > 8192:
        content = content[:8192]
    return content
|
||||
|
||||
|
||||
def _build_payload(doc, concept, idx, source, download_url, source_type, page_timestamps):
|
||||
"""Build and validate payload for a single concept point."""
|
||||
start_page = concept.get('_start_page', 0)
|
||||
|
||||
payload = {
|
||||
'doc_hash': doc.get('hash', ''),
|
||||
'filename': doc['filename'],
|
||||
'book_title': doc.get('book_title', ''),
|
||||
'book_author': doc.get('book_author', ''),
|
||||
'source': source,
|
||||
'download_url': download_url,
|
||||
'source_type': source_type,
|
||||
'verification_status': 'unverified',
|
||||
'credibility_score': 0.7,
|
||||
'language': 'en',
|
||||
}
|
||||
|
||||
for field in ['content', 'summary', 'title', 'domain', 'subdomain',
|
||||
'keywords', 'knowledge_type', 'complexity',
|
||||
'key_facts', 'scenario_applicable',
|
||||
'cross_domain_tags', 'chapter', 'page_ref', 'notes',
|
||||
'_window', '_start_page']:
|
||||
if field in concept:
|
||||
payload[field] = concept[field]
|
||||
|
||||
# Add video timestamp for transcript sources
|
||||
if source_type == 'transcript' and page_timestamps:
|
||||
page_key = f"page_{start_page:04d}"
|
||||
if page_key in page_timestamps:
|
||||
payload['video_timestamp'] = page_timestamps[page_key]
|
||||
|
||||
# Validate classification fields before returning
|
||||
payload = _validate_classification(payload)
|
||||
|
||||
return payload
|
||||
|
||||
|
||||
def _build_point(point_id, dense_vector, sparse_vec, payload, config):
|
||||
"""Build a PointStruct with dense vector and optional sparse vector."""
|
||||
sparse_cfg = config.get('sparse_embedding')
|
||||
if sparse_cfg and sparse_cfg.get('enabled', False) and sparse_vec:
|
||||
vector = {
|
||||
"": dense_vector,
|
||||
"bge-m3-sparse": SparseVector(
|
||||
indices=sparse_vec['indices'],
|
||||
values=sparse_vec['values'],
|
||||
),
|
||||
}
|
||||
else:
|
||||
vector = {"": dense_vector}
|
||||
|
||||
return PointStruct(id=point_id, vector=vector, payload=payload)
|
||||
|
||||
|
||||
def embed_single(file_hash, db, config):
    doc = db.get_document(file_hash)
    if not doc:
        return False

    concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
    if not os.path.exists(concepts_dir):
        db.mark_failed(file_hash, f"Concepts directory not found: {concepts_dir}")
        return False

    db.update_status(file_hash, 'embedding')

    try:
        qdrant = QdrantClient(
            host=config['vector_db']['host'],
            port=config['vector_db']['port'],
            timeout=60
        )
        collection = config['vector_db']['collection']
        qdrant_batch_size = config['processing']['embed_batch_size']
        embed_batch_size = config['embedding'].get('batch_size', 128)
        backend = config['embedding'].get('backend', 'ollama')

        window_files = sorted([
            f for f in os.listdir(concepts_dir)
            if f.startswith('window_') and f.endswith('.json')
        ])

        if not window_files:
            db.mark_failed(file_hash, "No window files found")
            return False

        all_concepts = []
        for wf in window_files:
            with open(os.path.join(concepts_dir, wf), encoding='utf-8') as f:
                concepts = json.load(f)
            if isinstance(concepts, list):
                all_concepts.extend([c for c in concepts if isinstance(c, dict)])

        if not all_concepts:
            db.update_status(file_hash, 'complete', vectors_inserted=0)
            logger.info(f"No concepts to embed for {doc['filename']}")
            return True

        # Look up source from catalogue once per doc
        cat_conn = db._get_conn()
        cat_row = cat_conn.execute(
            "SELECT source FROM catalogue WHERE hash = ?", (file_hash,)
        ).fetchone()
        source = dict(cat_row)['source'] if cat_row else ''

        download_url = ''
        is_web = doc.get('path', '').startswith(('http://', 'https://'))
        source_type = 'web' if is_web else 'document'

        # Check meta.json for explicit source_type (e.g. 'transcript')
        text_dir = os.path.join(config['paths']['text'], file_hash)
        meta_path = os.path.join(text_dir, 'meta.json')
        page_timestamps = {}
        if os.path.exists(meta_path):
            try:
                with open(meta_path) as mf:
                    meta = json.load(mf)
                if meta.get('source_type'):
                    source_type = meta['source_type']
                if not download_url and meta.get('url'):
                    download_url = meta['url']
                if meta.get('page_timestamps'):
                    page_timestamps = meta['page_timestamps']
            except Exception:
                pass
        if doc.get('path'):
            download_url = generate_download_url(
                doc['path'], config.get('library_root', '/mnt/library')
            )

        # Build list of valid concepts with their indices
        valid = []
        skipped = 0
        for idx, concept in enumerate(all_concepts):
            content = _validate_content(concept.get('content', ''))
            if content is None:
                skipped += 1
                continue
            valid.append((idx, concept, content))

        if skipped > 0:
            logger.info(f" Skipped {skipped} concepts with invalid/empty content")

        if not valid:
            db.update_status(file_hash, 'complete', vectors_inserted=0)
            logger.info(f"No valid concepts to embed for {doc['filename']}")
            return True

        points = []
        embedded_count = 0

        if backend == 'tei':
            # TEI: batch embedding
            for batch_start in range(0, len(valid), embed_batch_size):
                batch = valid[batch_start:batch_start + embed_batch_size]
                texts = [content for _, _, content in batch]

                try:
                    vectors = get_embeddings_batch(texts, config)
                except Exception as e:
                    logger.error(f" Batch embedding failed at offset {batch_start}: {e}")
                    # Skip entire batch on unrecoverable error
                    continue

                # Get sparse embeddings for the same batch
                sparse_results = get_sparse_embeddings_batch(texts, config)

                for i, ((idx, concept, content), vector) in enumerate(zip(batch, vectors)):
                    start_page = concept.get('_start_page', 0)
                    point_id = concept_id(file_hash, start_page, idx)

                    payload = _build_payload(
                        doc, concept, idx, source, download_url,
                        source_type, page_timestamps
                    )

                    sparse_vec = sparse_results[i] if sparse_results and i < len(sparse_results) else None
                    points.append(_build_point(point_id, vector, sparse_vec, payload, config))
                    embedded_count += 1

                    if len(points) >= qdrant_batch_size:
                        qdrant.upsert(collection_name=collection, points=points)
                        logger.debug(f" Upserted batch of {len(points)} points")
                        points = []

        else:
            # Ollama: one-at-a-time with retry
            for idx, concept, content in valid:
                try:
                    vector = get_embedding_single(content, config)
                except Exception as e:
                    logger.warning(f" Embedding failed for concept {idx}: {e}")
                    time.sleep(2)
                    try:
                        vector = get_embedding_single(content, config)
                    except Exception as e2:
                        logger.error(f" Embedding retry failed for concept {idx}: {e2}")
                        continue

                # Get sparse embedding for single text
                sparse_results = get_sparse_embeddings_batch([content], config)
                sparse_vec = sparse_results[0] if sparse_results else None

                start_page = concept.get('_start_page', 0)
                point_id = concept_id(file_hash, start_page, idx)

                payload = _build_payload(
                    doc, concept, idx, source, download_url,
                    source_type, page_timestamps
                )

                points.append(_build_point(point_id, vector, sparse_vec, payload, config))
                embedded_count += 1

                if len(points) >= qdrant_batch_size:
                    qdrant.upsert(collection_name=collection, points=points)
                    logger.debug(f" Upserted batch of {len(points)} points")
                    points = []

        if points:
            qdrant.upsert(collection_name=collection, points=points)
            logger.debug(f" Upserted final batch of {len(points)} points")

        db.update_status(file_hash, 'complete', vectors_inserted=embedded_count)
        logger.info(f"Embedded {doc['filename']}: {embedded_count} vectors ({skipped} skipped)")
        return True

    except Exception as e:
        logger.error(f"Embedding failed for {file_hash}: {e}\n{traceback.format_exc()}")
        db.mark_failed(file_hash, str(e))
        return False


def run_embedding(workers=None, limit=None):
    config = get_config()
    db = StatusDB()
    workers = workers or config['processing']['embed_workers']

    enriched = db.get_by_status('enriched', limit=limit)
    if not enriched:
        logger.info("No enriched documents to embed")
        return 0

    backend = config['embedding'].get('backend', 'ollama')
    sparse_cfg = config.get('sparse_embedding')
    sparse_status = "enabled" if (sparse_cfg and sparse_cfg.get('enabled')) else "disabled"
    logger.info(f"Embedding {len(enriched)} documents with {workers} workers (backend: {backend}, sparse: {sparse_status})")
    success = 0

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(embed_single, doc['hash'], StatusDB(), config): doc
            for doc in enriched
        }
        for future in as_completed(futures):
            doc = futures[future]
            try:
                if future.result():
                    success += 1
            except Exception as e:
                logger.error(f"Worker error for {doc['hash']}: {e}")

    logger.info(f"Embedding complete: {success}/{len(enriched)} succeeded")
    return success
561
lib/enricher.py
Normal file
@ -0,0 +1,561 @@
"""
RECON Enricher

Text to structured concepts via Gemini API. Saves JSON to data/concepts/{hash}/
BEFORE any DB operations. Uses 10-page windows, 4 API keys, 16 workers.

Resilience:
- Exponential backoff with jitter for transient errors (429, 500, 503, timeout)
- Permanent errors (JSON parse, auth) fail immediately without wasting retries
- Window failures skip that window and continue — partial enrichment beats zero
- Document marked enriched if any concepts were extracted or no window failed; marked failed otherwise

Dependencies: google-generativeai
Config: processing.enrich_workers, processing.enrich_window_size, gemini, paths.concepts
"""
import json
import os
import random
import re
import time
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed

import google.generativeai as genai

from .utils import get_config, setup_logging
from .status import StatusDB

logger = setup_logging('recon.enricher')

# Docs stuck in "enriching" longer than this get reset to "extracted" for retry
STALE_ENRICHING_HOURS = 2

# ── Classification allowlists ───────────────────────────────────────────────
VALID_DOMAINS = {
    'Agriculture & Livestock', 'Civil Organization', 'Communications',
    'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
    'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
    'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
    'Vehicles', 'Water Systems', 'Wilderness Skills',
}
VALID_KNOWLEDGE_TYPES = {'foundational', 'procedural', 'operational'}
VALID_COMPLEXITIES = {'basic', 'intermediate', 'advanced'}

DOMAIN_FALLBACK = 'Foundational Skills'
KNOWLEDGE_TYPE_FALLBACK = 'foundational'
COMPLEXITY_FALLBACK = 'basic'


def repair_json(text):
    """Attempt to repair common LLM JSON output issues including truncation."""
    # Remove control characters except tab, newline, and carriage return
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
    # Fix invalid JSON escape sequences (e.g. \e, \p, \c from Gemini)
    # Valid JSON escapes: \", \\, \/, \b, \f, \n, \r, \t, \uXXXX
    text = re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', text)
    # Remove trailing commas before } or ]
    text = re.sub(r',\s*([}\]])', r'\1', text)

    # If the text already parses, return it unchanged
    try:
        json.loads(text, strict=False)
        return text
    except json.JSONDecodeError:
        pass

    # Handle truncated JSON: find the last complete top-level object, then close the array.
    # Scan forward, tracking string and brace state, so braces inside strings are ignored.
    last_complete = -1
    depth_brace = 0
    depth_bracket = 0
    in_string = False
    escape = False

    for i, ch in enumerate(text):
        if escape:
            escape = False
            continue
        if ch == '\\' and in_string:
            escape = True
            continue
        if ch == '"' and not escape:
            in_string = not in_string
            continue
        if in_string:
            continue
        if ch == '{':
            depth_brace += 1
        elif ch == '}':
            depth_brace -= 1
            if depth_brace == 0:
                last_complete = i
        elif ch == '[':
            depth_bracket += 1
        elif ch == ']':
            depth_bracket -= 1

    if last_complete > 0:
        truncated = text[:last_complete + 1].rstrip().rstrip(',')
        # Close any open arrays
        open_brackets = truncated.count('[') - truncated.count(']')
        truncated += ']' * open_brackets
        return truncated

    return text

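The truncation handling in `repair_json` can be illustrated in isolation. The following is a minimal sketch, not the module's code: `close_truncated_array` is a hypothetical helper that keeps only the scan-and-close step and assumes the input is a JSON array of objects that was cut off mid-object.

```python
import json


def close_truncated_array(text):
    """Cut a truncated JSON array back to its last complete top-level object.

    Hypothetical helper for illustration; assumes an array of objects.
    """
    last_complete = -1
    open_brackets = 0      # '[' depth recorded at the last complete object
    brace_depth = 0
    bracket_depth = 0
    in_string = False
    escape = False

    for i, ch in enumerate(text):
        if escape:                 # character after a backslash inside a string
            escape = False
            continue
        if in_string:
            if ch == '\\':
                escape = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == '{':
            brace_depth += 1
        elif ch == '}':
            brace_depth -= 1
            if brace_depth == 0:
                last_complete = i
                open_brackets = bracket_depth
        elif ch == '[':
            bracket_depth += 1
        elif ch == ']':
            bracket_depth -= 1

    if last_complete < 0:
        return text                # nothing recoverable, leave input alone
    return text[:last_complete + 1] + ']' * open_brackets


if __name__ == "__main__":
    cut_off = '[{"title": "A"}, {"title": "B", "content": "unfinis'
    print(json.loads(close_truncated_array(cut_off)))
```

The incomplete second object is dropped and the array is closed, which matches the "partial enrichment beats zero" approach described in the module docstring.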
ENRICH_PROMPT = """Extract knowledge concepts from this document text.

A concept is a SELF-CONTAINED piece of knowledge that can stand alone.

For each concept, provide ALL fields:

Required:
- content: Full text of the concept (complete procedure, definition, etc.)
- summary: 1-2 sentence summary
- title: Brief descriptive title
- domain: must be exactly one of: Agriculture & Livestock, Civil Organization, Communications, Food Systems, Foundational Skills, Logistics, Medical, Navigation, Operations, Power Systems, Preservation & Storage, Security, Shelter & Construction, Technology, Tools & Equipment, Vehicles, Water Systems, Wilderness Skills — return ONLY this exact string, no variations, no new domains, no underscores, no synonyms
CRITICAL: Medical content (first aid, anatomy, pharmacology, herbs, veterinary, austere medicine) → Medical
CRITICAL: Food growing, farming, animal husbandry, livestock → Agriculture & Livestock
CRITICAL: Foraging, hunting, fishing, bushcraft, wilderness survival → Wilderness Skills
CRITICAL: Food preservation, storage, canning, dehydration, processing → Preservation & Storage
CRITICAL: Solar, wind, hydro, batteries, generators → Power Systems
CRITICAL: Water sourcing, filtration, sanitation, purification → Water Systems
CRITICAL: Building, carpentry, structural construction, shelter → Shelter & Construction
CRITICAL: Tactical operations, mission execution, combat maneuvers, search & rescue → Operations
CRITICAL: Governance, civil administration, community leadership → Civil Organization
CRITICAL: Electronics, IT, computing, engineering → Technology
CRITICAL: Hand tools, power tools, equipment maintenance → Tools & Equipment
CRITICAL: Motor vehicles, aircraft, watercraft, vehicle maintenance → Vehicles
CRITICAL: Radio, signals, networking, comms equipment → Communications
CRITICAL: Supply chain, transport, distribution, inventory → Logistics
CRITICAL: Physical security, OPSEC, threat assessment → Security
CRITICAL: Map reading, orienteering, GPS, celestial navigation → Navigation
CRITICAL: Cooking methods, food production, recipes, nutrition → Food Systems
- subdomain: Array of specific subcategories (up to 10)
- keywords: Array of 3-30 searchable terms
- knowledge_type: foundational | procedural | operational
foundational — concepts, definitions, theory, background knowledge, explanations of how things work
procedural — step-by-step techniques, instructions, how-to skills, methods you execute
operational — application under real conditions, decision-making, mission execution, judgment calls in context
Valid values are ONLY: foundational, procedural, operational — do not use any other values
- complexity: basic | intermediate | advanced
basic — requires little or no prior knowledge, introductory material, simple concepts
intermediate — requires some domain familiarity, assumes foundational knowledge is in place
advanced — requires significant experience or expertise, high-stakes or highly technical material
Valid values are ONLY: basic, intermediate, advanced — do not use any other values
- key_facts: Array of specific extractable claims, measurements, data points

Optional (include when present):
- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
- chapter: Chapter name if identifiable
- page_ref: Page reference
- notes: Any additional context

EXAMPLES (knowledge_type + complexity):
- "Needle chest decompression procedure" → knowledge_type: "procedural", complexity: "advanced"
- "What is soil texture and why does it matter" → knowledge_type: "foundational", complexity: "basic"
- "Coordinating a fire team withdrawal under contact" → knowledge_type: "operational", complexity: "advanced"

Return JSON array. If no extractable concepts, return [].

Document text:
"""


class KeyRotator:
    def __init__(self, keys):
        self.keys = keys
        self.index = 0

    def next(self):
        if not self.keys:
            raise ValueError("No Gemini API keys configured")
        key = self.keys[self.index % len(self.keys)]
        self.index += 1
        return key


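`run_enrichment` passes one `KeyRotator` instance to every worker thread, and `next()` reads and increments `self.index` without a lock. The extractor's `_next_vision_key` guards its counter with a `threading.Lock`. A sketch of the same idea applied to the rotator is shown here; `LockedKeyRotator` is a hypothetical name, not part of the codebase.

```python
import threading


class LockedKeyRotator:
    """Round-robin key rotation that is safe to share across threads.

    Hypothetical variant for illustration only.
    """

    def __init__(self, keys):
        self._keys = list(keys)
        self._index = 0
        self._lock = threading.Lock()

    def next(self):
        if not self._keys:
            raise ValueError("No API keys configured")
        # Read and advance the counter as one step so two threads
        # cannot observe the same index.
        with self._lock:
            key = self._keys[self._index % len(self._keys)]
            self._index += 1
        return key


if __name__ == "__main__":
    rotator = LockedKeyRotator(["key-a", "key-b"])
    print([rotator.next() for _ in range(5)])
```

Without the lock the worst case is likely just uneven key usage rather than a crash, so this is a tidiness improvement more than a correctness fix.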
def enrich_window(text, key, config):
    genai.configure(api_key=key)
    model = genai.GenerativeModel(
        config['gemini']['model'],
        generation_config={"response_mime_type": config['gemini']['response_mime_type']}
    )
    response = model.generate_content(ENRICH_PROMPT + text)
    raw = response.text
    try:
        result = json.loads(raw, strict=False)
    except json.JSONDecodeError:
        repaired = repair_json(raw)
        result = json.loads(repaired, strict=False)
    # Filter out non-dict items (nested lists from truncated responses)
    if isinstance(result, list):
        result = [c for c in result if isinstance(c, dict)]
    return result


def _is_transient(error_str):
    """Classify whether an error is transient (worth retrying) or permanent."""
    s = error_str.lower()
    transient_signals = ['429', 'resource_exhausted', 'quota', 'rate',
                         '500', '503', 'unavailable', 'timeout',
                         'connection', 'reset by peer', 'broken pipe']
    return any(sig in s for sig in transient_signals)


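Because `_is_transient` uses plain substring matching, short signals such as `'rate'` also match unrelated words. The standalone copy below (renamed `is_transient`, with made-up error messages) shows how a few messages would be classified; it is an illustration of the matching rule, not a claim about what Gemini actually returns.

```python
TRANSIENT_SIGNALS = ['429', 'resource_exhausted', 'quota', 'rate',
                     '500', '503', 'unavailable', 'timeout',
                     'connection', 'reset by peer', 'broken pipe']


def is_transient(error_str):
    """Same substring rule as _is_transient, copied here so it runs alone."""
    s = error_str.lower()
    return any(sig in s for sig in TRANSIENT_SIGNALS)


if __name__ == "__main__":
    samples = [
        "429 RESOURCE_EXHAUSTED",                       # rate limit
        "Expecting value: line 1 column 1 (char 0)",    # JSON parse error
        "Failed to generate content: invalid API key",  # 'rate' inside 'generate'
    ]
    for msg in samples:
        print(f"{is_transient(msg)!s:5}  {msg}")
```

The third message is an auth-style failure, yet it is classified as transient because "generate" contains "rate". If that matters in practice, matching on exception types or status codes would be stricter.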
def _retry_with_backoff(fn, max_retries=5, base_delay=5.0, max_delay=120.0):
    """Retry with exponential backoff + jitter for transient errors.

    With the defaults there are 5 attempts and 4 waits between them:
    ~5s, ~10s, ~20s, ~40s plus up to 5s of jitter each (about 75-95s total
    before giving up). No sleep follows the final attempt.
    Permanent errors (JSON parse, auth) raise immediately without retrying.
    """
    last_exc = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            last_exc = e
            err = str(e)
            if not _is_transient(err):
                raise  # permanent — don't waste retries
            if attempt < max_retries - 1:
                delay = min(base_delay * (2 ** attempt) + random.uniform(0, base_delay), max_delay)
                logger.info(f" Transient error (attempt {attempt+1}/{max_retries}), "
                            f"retrying in {delay:.0f}s: {err[:120]}")
                time.sleep(delay)
            else:
                logger.warning(f" Transient error, max retries exhausted: {err[:150]}")
    raise last_exc


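The wait times produced by the delay formula in `_retry_with_backoff` can be worked out without sleeping. This sketch computes the lower and upper bound of each wait; `backoff_bounds` is a hypothetical helper written for this illustration.

```python
def backoff_bounds(max_retries=5, base_delay=5.0, max_delay=120.0):
    """Return (min, max) seconds for each wait between attempts.

    Mirrors delay = min(base * 2**attempt + uniform(0, base), max_delay).
    Hypothetical helper for illustration only.
    """
    bounds = []
    # The loop sleeps only when another attempt follows, so there are
    # max_retries - 1 waits in total.
    for attempt in range(max_retries - 1):
        core = base_delay * (2 ** attempt)
        low = min(core, max_delay)                  # jitter = 0
        high = min(core + base_delay, max_delay)    # jitter = base_delay
        bounds.append((low, high))
    return bounds


if __name__ == "__main__":
    waits = backoff_bounds()
    for n, (low, high) in enumerate(waits, start=1):
        print(f"wait {n}: {low:.0f}-{high:.0f}s")
    print("total:", sum(w[0] for w in waits), "to", sum(w[1] for w in waits), "seconds")
```

With the default arguments the waits sum to between 75 and 95 seconds, and the 120-second cap is never reached.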
def _reclassify_field(field_name, allowlist, concept, key, config, max_retries=3):
    """Retry Gemini up to max_retries to get a valid value for a specific field."""
    content = concept.get('content', concept.get('summary', ''))
    if isinstance(content, str):
        content = content[:400]
    else:
        content = str(content)[:400]
    title = concept.get('title', '(untitled)')
    allowlist_str = ', '.join(sorted(allowlist))

    for attempt in range(max_retries):
        try:
            prompt = (
                f"Your previous response for '{field_name}' was invalid. "
                f"You must return ONLY one of these exact strings: {allowlist_str}\n\n"
                f"Title: {title}\n"
                f"Content: {content}\n\n"
                f"Return ONLY the exact string, nothing else. No explanation, no punctuation, no quotes."
            )
            genai.configure(api_key=key)
            model = genai.GenerativeModel(
                config['gemini']['model'],
                generation_config={"response_mime_type": "text/plain"}
            )
            resp = model.generate_content(prompt)
            value = resp.text.strip().strip('"').strip("'").strip()
            if value in allowlist:
                return value
            # Try case-insensitive match for knowledge_type/complexity
            for valid in allowlist:
                if value.lower() == valid.lower():
                    return valid
        except Exception as e:
            err = str(e).lower()
            if any(s in err for s in ['429', 'quota', 'rate', '503']):
                time.sleep(min(3 * (2 ** attempt) + random.uniform(0, 2), 30))
            else:
                logger.warning(f" Reclassify retry {attempt+1} for {field_name} failed: {e}")
    return None


def validate_and_fix_concepts(concepts, key, config):
    """Validate domain, knowledge_type, complexity on each concept.

    For invalid values: retry Gemini up to 3 times, then apply safe fallback.
    """
    for concept in concepts:
        if not isinstance(concept, dict):
            continue

        # ── Validate domain ─────────────────────────────────────────────
        domain = concept.get('domain')
        if isinstance(domain, list):
            # Legacy array format — find first valid or reclassify
            valid = [d for d in domain if d in VALID_DOMAINS]
            if valid:
                concept['domain'] = valid[0]
            else:
                new_val = _reclassify_field('domain', VALID_DOMAINS, concept, key, config)
                if new_val:
                    concept['domain'] = new_val
                else:
                    logger.warning(f"Invalid domain {domain} for '{concept.get('title', '?')}', using fallback")
                    concept['domain'] = DOMAIN_FALLBACK
        elif isinstance(domain, str):
            if domain not in VALID_DOMAINS:
                new_val = _reclassify_field('domain', VALID_DOMAINS, concept, key, config)
                if new_val:
                    concept['domain'] = new_val
                else:
                    logger.warning(f"Invalid domain '{domain}' for '{concept.get('title', '?')}', using fallback")
                    concept['domain'] = DOMAIN_FALLBACK
        else:
            concept['domain'] = DOMAIN_FALLBACK

        # ── Validate knowledge_type ─────────────────────────────────────
        kt = concept.get('knowledge_type', '')
        if isinstance(kt, str):
            kt = kt.lower().strip()
        else:
            kt = ''
        if kt not in VALID_KNOWLEDGE_TYPES:
            new_val = _reclassify_field('knowledge_type', VALID_KNOWLEDGE_TYPES, concept, key, config)
            if new_val:
                concept['knowledge_type'] = new_val
            else:
                logger.warning(f"Invalid knowledge_type '{kt}' for '{concept.get('title', '?')}', using fallback")
                concept['knowledge_type'] = KNOWLEDGE_TYPE_FALLBACK
        else:
            concept['knowledge_type'] = kt

        # ── Validate complexity ─────────────────────────────────────────
        cx = concept.get('complexity', '')
        if isinstance(cx, str):
            cx = cx.lower().strip()
        else:
            cx = ''
        if cx not in VALID_COMPLEXITIES:
            new_val = _reclassify_field('complexity', VALID_COMPLEXITIES, concept, key, config)
            if new_val:
                concept['complexity'] = new_val
            else:
                logger.warning(f"Invalid complexity '{cx}' for '{concept.get('title', '?')}', using fallback")
                concept['complexity'] = COMPLEXITY_FALLBACK
        else:
            concept['complexity'] = cx

    return concepts


def enrich_single(file_hash, db, config, key_rotator):
    doc = db.get_document(file_hash)
    if not doc:
        return False

    text_dir = os.path.join(config['paths']['text'], file_hash)
    concepts_dir = os.path.join(config['paths']['concepts'], file_hash)
    window_size = config['processing']['enrich_window_size']
    delay = config['processing']['rate_limit_delay']
    proc = config.get('processing', {})
    max_retries = proc.get('enrich_max_retries', proc.get('max_retries', 5))
    base_delay = proc.get('enrich_base_delay', 5.0)
    max_delay = proc.get('enrich_max_delay', 120.0)

    if not os.path.exists(text_dir):
        db.mark_failed(file_hash, f"Text directory not found: {text_dir}")
        return False

    db.update_status(file_hash, 'enriching')

    try:
        os.makedirs(concepts_dir, exist_ok=True)

        page_files = sorted([f for f in os.listdir(text_dir) if f.startswith('page_') and f.endswith('.txt')])
        if not page_files:
            db.mark_failed(file_hash, "No page files found")
            return False

        pages_text = []
        for pf in page_files:
            with open(os.path.join(text_dir, pf), encoding='utf-8') as f:
                pages_text.append(f.read())

        windows = []
        for i in range(0, len(pages_text), window_size):
            window_pages = pages_text[i:i + window_size]
            combined = "\n\n".join(f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages))
            windows.append((i, combined))

        total_concepts = 0
        failed_windows = []

        for w_idx, (start_page, window_text) in enumerate(windows):
            window_file = os.path.join(concepts_dir, f"window_{w_idx+1:04d}.json")

            if os.path.exists(window_file):
                with open(window_file, encoding='utf-8') as f:
                    existing = json.load(f)
                total_concepts += len(existing)
                logger.debug(f" Window {w_idx+1} already exists, skipping")
                continue

            if len(window_text.strip()) < 50:
                with open(window_file, 'w') as f:
                    json.dump([], f)
                continue

            # Attempt enrichment with backoff — failures skip the window, not the doc
            try:
                key = key_rotator.next()
                concepts = _retry_with_backoff(
                    lambda k=key: enrich_window(window_text, k, config),
                    max_retries=max_retries,
                    base_delay=base_delay,
                    max_delay=max_delay,
                )
            except Exception as e:
                failed_windows.append((w_idx + 1, str(e)[:100]))
                logger.warning(f" Window {w_idx+1}/{len(windows)} failed: {e}")
                continue  # skip this window, keep going

            if not isinstance(concepts, list):
                concepts = [concepts] if isinstance(concepts, dict) else []
            concepts = [c for c in concepts if isinstance(c, dict)]

            # Validate domain, knowledge_type, complexity — retry then fallback
            validation_key = key_rotator.next()
            concepts = validate_and_fix_concepts(concepts, validation_key, config)

            for c_idx, concept in enumerate(concepts):
                concept['_window'] = w_idx + 1
                concept['_start_page'] = start_page + 1
                concept['_doc_hash'] = file_hash

            # JSON FIRST: save before anything else
            with open(window_file, 'w', encoding='utf-8') as f:
                json.dump(concepts, f, indent=2, ensure_ascii=False)

            total_concepts += len(concepts)
            logger.debug(f" Window {w_idx+1}/{len(windows)}: {len(concepts)} concepts")
            time.sleep(delay)

        # Decide document status based on results
        meta = {
            'hash': file_hash,
            'total_windows': len(windows),
            'total_concepts': total_concepts,
            'failed_windows': len(failed_windows),
            'window_size': window_size,
            'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        }
        with open(os.path.join(concepts_dir, 'meta.json'), 'w') as f:
            json.dump(meta, f, indent=2)

        if total_concepts > 0 or not failed_windows:
            # Some concepts extracted, or all windows were empty — mark enriched
            error_msg = None
            if total_concepts == 0 and doc.get('page_count', 0) >= 3:
                error_msg = (f"0 concepts from {doc.get('page_count', '?')} pages — "
                             f"likely image-only PDF, may need manual review")
                logger.warning(f" {doc['filename']}: {error_msg}")
            elif failed_windows:
                wins = ', '.join(str(w) for w, _ in failed_windows[:10])
                error_msg = (f"Partial: {len(failed_windows)}/{len(windows)} "
                             f"windows failed (windows {wins})")
                logger.warning(f" {doc['filename']}: {error_msg}")
            db.update_status(file_hash, 'enriched', concepts_extracted=total_concepts,
                             error_message=error_msg)
            fw_note = f", {len(failed_windows)} windows failed" if failed_windows else ""
            logger.info(f"Enriched {doc['filename']}: {total_concepts} concepts "
                        f"from {len(windows)} windows{fw_note}")
            return True
        else:
            # Every window failed — document truly failed
            first_err = failed_windows[0][1] if failed_windows else 'unknown'
            db.mark_failed(file_hash,
                           f"All {len(windows)} windows failed: {first_err}")
            logger.error(f" {doc['filename']}: all {len(windows)} windows failed")
            return False

    except Exception as e:
        logger.error(f"Enrichment failed for {file_hash}: {e}\n{traceback.format_exc()}")
        db.mark_failed(file_hash, str(e))
        return False


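The page-windowing step inside `enrich_single` is easy to check on its own. The sketch below lifts that loop into a hypothetical function, `build_windows`, and runs it on three toy pages with a window size of 2.

```python
def build_windows(pages_text, window_size):
    """Group page texts into windows of window_size pages.

    Returns (zero_based_start_index, combined_text) tuples, using the same
    "--- Page N ---" headers as enrich_single. Hypothetical helper.
    """
    windows = []
    for i in range(0, len(pages_text), window_size):
        window_pages = pages_text[i:i + window_size]
        combined = "\n\n".join(
            f"--- Page {i + j + 1} ---\n{t}" for j, t in enumerate(window_pages)
        )
        windows.append((i, combined))
    return windows


if __name__ == "__main__":
    for start, text in build_windows(["alpha", "beta", "gamma"], window_size=2):
        print(f"start index {start}:")
        print(text)
        print()
```

The start index is zero-based here; `enrich_single` stores `start_page + 1` in each concept's `_start_page`, so the saved value is the 1-indexed first page of the window.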
def _recover_stale_enriching(db, max_hours=STALE_ENRICHING_HOURS):
    """Reset docs stuck in enriching back to extracted so they get retried.

    This handles the case where a previous enrichment run crashed mid-document.
    The enricher skips already-completed window files, so no work is lost.
    """
    from datetime import datetime, timezone

    conn = db._get_conn()
    rows = conn.execute(
        "SELECT hash, filename FROM documents WHERE status = 'enriching'",
    ).fetchall()
    if not rows:
        return

    # extracted_at stands in for the enriching start time; reset if older than max_hours
    now = datetime.now(timezone.utc)
    reset = []
    for row in rows:
        doc = db.get_document(row['hash'])
        extracted_at = doc.get('extracted_at', '')
        if not extracted_at:
            reset.append(row)
            continue
        try:
            ts = datetime.fromisoformat(extracted_at)
            if ts.tzinfo is None:
                ts = ts.replace(tzinfo=timezone.utc)
            age_hours = (now - ts).total_seconds() / 3600
            if age_hours > max_hours:
                reset.append(row)
        except Exception:
            reset.append(row)

    for row in reset:
        conn.execute(
            "UPDATE documents SET status = 'extracted' WHERE hash = ?",
            (row['hash'],)
        )
        logger.warning(f"Recovered stale enriching doc: {row['filename']} ({row['hash'][:12]}...)")
    if reset:
        conn.commit()
        logger.info(f"Reset {len(reset)} stale enriching docs back to extracted")


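The staleness rule used by `_recover_stale_enriching` can be expressed as a small pure function, which makes it testable without a database. `is_stale` is a hypothetical helper; it assumes `extracted_at` is an ISO 8601 string and treats naive timestamps as UTC, as the function above does.

```python
from datetime import datetime, timezone


def is_stale(extracted_at, now, max_hours=2):
    """Return True if a doc stuck in 'enriching' should be reset.

    Missing or unparseable timestamps count as stale, matching the
    reset-on-error behaviour above. Hypothetical helper for illustration.
    """
    if not extracted_at:
        return True
    try:
        ts = datetime.fromisoformat(extracted_at)
    except ValueError:
        return True
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)   # assume UTC for naive values
    age_hours = (now - ts).total_seconds() / 3600
    return age_hours > max_hours


if __name__ == "__main__":
    now = datetime(2026, 4, 14, 12, 0, tzinfo=timezone.utc)
    for stamp in ["2026-04-14T09:00:00", "2026-04-14T11:00:00+00:00", "", "not-a-date"]:
        print(repr(stamp), is_stale(stamp, now))
```

Passing `now` in as an argument, rather than calling `datetime.now` inside, keeps the check deterministic in tests.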
def run_enrichment(workers=None, limit=None):
    config = get_config()
    db = StatusDB()
    workers = workers or config['processing']['enrich_workers']

    # Recover docs orphaned by previous crashed enrichment runs
    _recover_stale_enriching(db)

    keys = config.get('gemini_keys', [])
    if not keys:
        logger.error("No Gemini API keys configured in .env")
        return 0

    key_rotator = KeyRotator(keys)

    extracted = db.get_by_status('extracted', limit=limit)
    if not extracted:
        logger.info("No extracted documents to enrich")
        return 0

    logger.info(f"Enriching {len(extracted)} documents with {workers} workers, {len(keys)} API key(s)")
    success = 0

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(enrich_single, doc['hash'], StatusDB(), config, key_rotator): doc
            for doc in extracted
        }
        for future in as_completed(futures):
            doc = futures[future]
            try:
                if future.result():
                    success += 1
            except Exception as e:
                logger.error(f"Worker error for {doc['hash']}: {e}")

    logger.info(f"Enrichment complete: {success}/{len(extracted)} succeeded")
    return success
601
lib/extractor.py
Normal file
@ -0,0 +1,601 @@
"""
RECON Text Extractor

PDF to text via PyPDF2 -> pdftotext -> Tesseract -> Gemini Vision fallback chain.
Saves to data/text/{hash}/page_NNNN.txt (4-digit zero-padded, 1-indexed).

Safety guards:
- Layer 1: Pre-flight size check (max_pdf_size_mb, default 200)
- Layer 2: Per-document timeout (extract_timeout, default 300s)
- Layer 3: Per-page timeout (page_timeout, default 30s)
- Partial extractions saved as 'extracted' with error_message noting incompleteness

Fallback chain per page:
1. PyPDF2 (fast, free, text-based PDFs)
2. pdftotext/poppler (handles some PDFs PyPDF2 misses)
3. Tesseract OCR (renders page → local OCR)
4. Gemini Vision (renders page → cloud vision API, last resort for scanned docs)

Dependencies: PyPDF2, pdftotext (poppler-utils), pytesseract, google-generativeai
Config: processing.extract_workers, processing.max_pdf_size_mb,
        processing.extract_timeout, processing.page_timeout
"""
import base64
import json
import os
import random
import subprocess
import tempfile
import threading
import time
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError as FuturesTimeoutError
from pathlib import Path

import google.generativeai as genai
from PyPDF2 import PdfReader

from .utils import get_config, content_hash, clean_filename_to_title, setup_logging
from .status import StatusDB

logger = setup_logging('recon.extractor')

# ── Gemini Vision singleton (lazy, thread-safe) ──

_vision_keys = None
_vision_key_index = 0
_vision_lock = threading.Lock()


def _get_vision_keys():
    """Load Gemini API keys once from .env (same keys the enricher uses)."""
    global _vision_keys
    if _vision_keys is not None:
        return _vision_keys

    with _vision_lock:
        if _vision_keys is not None:
            return _vision_keys

        keys = []
        env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
        if os.path.exists(env_path):
            with open(env_path) as f:
                for line in f:
                    line = line.strip()
                    if not line or line.startswith('#') or '=' not in line:
                        continue
                    key_name, val = line.split('=', 1)
                    val = val.strip().strip('"').strip("'")
                    if key_name.strip().startswith('GEMINI_KEY_') and val != 'PASTE_KEY_HERE':
                        keys.append(val)

        _vision_keys = keys
        if keys:
            logger.info(f"Gemini vision OCR: {len(keys)} API key(s) available")
        else:
            logger.warning("No Gemini API keys found — vision OCR fallback disabled")
        return keys


def _next_vision_key():
    """Round-robin through available Gemini keys."""
    global _vision_key_index
    keys = _get_vision_keys()
    if not keys:
        return None
    with _vision_lock:
        key = keys[_vision_key_index % len(keys)]
        _vision_key_index += 1
    return key
|
||||
|
||||
|
||||
def _is_transient(error_str):
    """Classify whether an error is transient (worth retrying)."""
    s = error_str.lower()
    transient_signals = ['429', 'resource_exhausted', 'quota', 'rate',
                         '500', '503', 'unavailable', 'timeout',
                         'connection', 'reset by peer', 'broken pipe']
    return any(sig in s for sig in transient_signals)
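The retry policy used by the vision fallback (classify the error text, then back off exponentially with jitter) can be sketched in isolation. This is a minimal sketch rather than the module's code: `is_transient` and `backoff_delay` are hypothetical names, the signal list is copied from `_is_transient`, and the delay formula follows the `5.0 * (2 ** attempt) + random.uniform(0, 3)` expression in `_try_gemini_vision`. One thing the sketch makes visible is that plain substring matching is loose: a bare signal such as `'rate'` also matches unrelated words like "generate".

```python
import random

# Signal list copied from _is_transient; matching is by substring.
TRANSIENT_SIGNALS = ['429', 'resource_exhausted', 'quota', 'rate',
                     '500', '503', 'unavailable', 'timeout',
                     'connection', 'reset by peer', 'broken pipe']


def is_transient(error_str):
    """Return True if the error text contains any retry-worthy signal."""
    s = error_str.lower()
    return any(sig in s for sig in TRANSIENT_SIGNALS)


def backoff_delay(attempt, base=5.0, jitter=3.0):
    """Delay before retrying after 0-based `attempt`: base * 2**attempt plus jitter."""
    return base * (2 ** attempt) + random.uniform(0, jitter)


if __name__ == '__main__':
    for msg in ('429 Resource has been exhausted',
                'API key not valid',
                'failed to generate content'):  # matches 'rate' inside 'generate'
        print(f"{msg!r}: transient={is_transient(msg)}")
    print([backoff_delay(a, jitter=0.0) for a in range(3)])
```

With jitter disabled the schedule is 5, 10, 20 seconds; the jitter term spreads concurrent workers out so they do not all retry at the same instant.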
|
||||
|
||||
|
||||
def _render_page_to_png(pdf_path, page_num_1indexed, dpi=200, timeout=30):
    """Render a single PDF page to PNG bytes using pdftoppm.

    Args:
        pdf_path: Path to PDF file
        page_num_1indexed: 1-indexed page number
        dpi: Resolution (200 = readable text, reasonable file size)
        timeout: Subprocess timeout in seconds

    Returns:
        bytes or None: PNG image data, or None if render fails/blank
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        prefix = os.path.join(tmpdir, 'page')
        try:
            subprocess.run(
                ['pdftoppm', '-f', str(page_num_1indexed), '-l', str(page_num_1indexed),
                 '-png', '-r', str(dpi), pdf_path, prefix],
                capture_output=True, timeout=timeout, check=True
            )
            png_files = list(Path(tmpdir).glob('*.png'))
            if not png_files:
                return None

            img_data = png_files[0].read_bytes()

            # Skip blank pages (tiny image = solid white/blank page)
            if len(img_data) < 5000:
                return None

            return img_data

        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
            return None
|
||||
|
||||
|
||||
def _try_gemini_vision(pdf_path, page_num_1indexed, page_timeout=60):
    """Last-resort OCR: render page to image, send to Gemini vision.

    Only called when PyPDF2, pdftotext, AND Tesseract all failed.

    Args:
        pdf_path: Path to PDF file
        page_num_1indexed: 1-indexed page number
        page_timeout: Caps the page render (at most 30s); the API call
            itself is not bounded by this value

    Returns:
        str: Extracted text, or empty string if vision fails
    """
    api_key = _next_vision_key()
    if api_key is None:
        return ''

    # Render page to PNG
    img_data = _render_page_to_png(pdf_path, page_num_1indexed, timeout=min(page_timeout, 30))
    if img_data is None:
        return ''

    # Call Gemini vision with retry for transient errors
    last_exc = None
    for attempt in range(3):
        try:
            genai.configure(api_key=api_key)
            model = genai.GenerativeModel('gemini-2.0-flash')
            response = model.generate_content([
                {
                    'mime_type': 'image/png',
                    'data': base64.b64encode(img_data).decode('utf-8')
                },
                "Extract ALL text from this scanned document page exactly as written. "
                "Preserve headings, lists, numbered items, tables, and paragraph structure. "
                "Return ONLY the extracted text, no commentary or markdown formatting."
            ])
            if response and response.text:
                text = response.text.strip()
                if len(text) > 10:
                    return text
            return ''

        except Exception as e:
            last_exc = e
            if not _is_transient(str(e)):
                break  # permanent error: don't retry
            if attempt < 2:
                delay = 5.0 * (2 ** attempt) + random.uniform(0, 3)
                time.sleep(delay)
                # Rotate to next key on rate limit
                api_key = _next_vision_key() or api_key

    if last_exc:
        logger.debug(f"  Vision OCR failed page {page_num_1indexed}: {last_exc}")
    return ''
|
||||
|
||||
|
||||
|
||||
def _get_page_count(pdf_path):
    """Get page count using pdfinfo (poppler) as fallback when PdfReader fails."""
    try:
        result = subprocess.run(
            ['pdfinfo', pdf_path],
            capture_output=True, text=True, timeout=30
        )
        if result.returncode == 0:
            for line in result.stdout.splitlines():
                if line.startswith('Pages:'):
                    return int(line.split(':', 1)[1].strip())
    except Exception:
        pass
    return 0
|
||||
|
||||
|
||||
def _extract_page_without_reader(pdf_path, page_num_0indexed, page_timeout=30):
    """Extract text from a single page WITHOUT PyPDF2 reader.

    Used when PdfReader() fails entirely (corrupt/encrypted PDFs).
    Runs the pdftotext -> Tesseract -> Gemini Vision fallback chain.

    Returns:
        tuple: (text, ocr_method)
    """
    text = ''

    # Method 1: pdftotext (poppler)
    try:
        result = subprocess.run(
            ['pdftotext', '-f', str(page_num_0indexed + 1),
             '-l', str(page_num_0indexed + 1), pdf_path, '-'],
            capture_output=True, text=True, timeout=page_timeout
        )
        if result.returncode == 0:
            text = result.stdout
    except Exception:
        pass

    if len(text.strip()) >= 50:
        return text, 'pdftotext'

    # Method 2: pdftoppm + Tesseract OCR
    try:
        from PIL import Image
        import pytesseract

        result = subprocess.run(
            ['pdftoppm', '-f', str(page_num_0indexed + 1),
             '-l', str(page_num_0indexed + 1),
             '-png', '-singlefile', pdf_path, '-'],
            capture_output=True, timeout=page_timeout * 2
        )
        if result.returncode == 0 and result.stdout:
            with tempfile.NamedTemporaryFile(suffix='.png', delete=True) as tmp:
                tmp.write(result.stdout)
                tmp.flush()
                img = Image.open(tmp.name)
                ocr_text = pytesseract.image_to_string(img)
                if len(ocr_text.strip()) > len(text.strip()):
                    text = ocr_text
    except Exception:
        pass

    if len(text.strip()) >= 50:
        return text, 'tesseract'

    # Method 3: Gemini Vision (last resort)
    vision_text = _try_gemini_vision(pdf_path, page_num_0indexed + 1,
                                     page_timeout=page_timeout * 2)
    if len(vision_text.strip()) > len(text.strip()):
        text = vision_text

    if len(text.strip()) >= 10:
        return text, 'gemini_vision'

    return text, 'none'
|
||||
|
||||
|
||||
# ── Core extraction functions ──

def _pypdf2_extract(reader, page_num):
    """Extract text from a PyPDF2 page object. Runs inside a thread for timeout."""
    return reader.pages[page_num].extract_text() or ''
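Because `extract_text()` can hang, the module runs it in a single-worker pool and stops waiting once the page timeout expires. The pattern can be sketched on its own as follows. This is an illustrative sketch under stated assumptions: `call_with_timeout` and `slow_page` are hypothetical names, and `cancel_futures` requires Python 3.9 or later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError


def call_with_timeout(fn, timeout, default=''):
    """Run fn() in a worker thread; return `default` if it exceeds `timeout`.

    The worker thread is not killed on timeout (Python cannot do that);
    the caller simply stops waiting for it.
    """
    ex = ThreadPoolExecutor(max_workers=1)
    future = ex.submit(fn)
    try:
        return future.result(timeout=timeout)
    except FuturesTimeoutError:
        return default
    finally:
        # Do not block on the (possibly still running) worker.
        ex.shutdown(wait=False, cancel_futures=True)


def slow_page():
    """Stand-in for a page whose extraction takes too long."""
    time.sleep(0.3)
    return 'late text'


if __name__ == '__main__':
    print(repr(call_with_timeout(lambda: 'page text', timeout=1.0)))
    print(repr(call_with_timeout(slow_page, timeout=0.05)))
```

The trade-off is that a truly stuck call keeps its thread alive in the background, which is why the subprocess-based methods later in the chain are the more robust timeout boundary.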
|
||||
|
||||
|
||||
def extract_text_from_page(reader, page_num, pdf_path, page_timeout=30):
    """Extract text from a single page with fallback chain.

    Returns:
        tuple: (text, ocr_method) where ocr_method is one of:
            'pypdf2', 'pdftotext', 'tesseract', 'gemini_vision', 'none'
    """
    # Method 1: PyPDF2 (wrapped in thread for timeout; extract_text() can hang)
    text = ''
    try:
        ex = ThreadPoolExecutor(1)
        future = ex.submit(_pypdf2_extract, reader, page_num)
        try:
            text = future.result(timeout=page_timeout)
        except FuturesTimeoutError:
            logger.warning(f"  PyPDF2 timeout on page {page_num + 1}")
            text = ''
        finally:
            ex.shutdown(wait=False, cancel_futures=True)
    except Exception:
        text = ''

    if len(text.strip()) >= 50:
        return text, 'pypdf2'

    # Method 2: pdftotext via subprocess (inherently timeout-safe)
    try:
        result = subprocess.run(
            ['pdftotext', '-f', str(page_num + 1), '-l', str(page_num + 1), pdf_path, '-'],
            capture_output=True, text=True, timeout=page_timeout
        )
        if result.returncode == 0 and len(result.stdout.strip()) > len(text.strip()):
            text = result.stdout
    except Exception:
        pass

    if len(text.strip()) >= 50:
        return text, 'pdftotext'

    # Method 3: pdftoppm + Tesseract OCR
    try:
        from PIL import Image
        import pytesseract

        result = subprocess.run(
            ['pdftoppm', '-f', str(page_num + 1), '-l', str(page_num + 1),
             '-png', '-singlefile', pdf_path, '-'],
            capture_output=True, timeout=page_timeout * 2
        )
        if result.returncode == 0 and result.stdout:
            with tempfile.NamedTemporaryFile(suffix='.png', delete=True) as tmp:
                tmp.write(result.stdout)
                tmp.flush()
                img = Image.open(tmp.name)
                ocr_text = pytesseract.image_to_string(img)
                if len(ocr_text.strip()) > len(text.strip()):
                    text = ocr_text
    except Exception:
        pass

    if len(text.strip()) >= 50:
        return text, 'tesseract'

    # Method 4: Gemini Vision (last resort, costs API calls but handles scanned docs)
    vision_text = _try_gemini_vision(pdf_path, page_num + 1, page_timeout=page_timeout * 2)
    if len(vision_text.strip()) > len(text.strip()):
        text = vision_text

    if len(text.strip()) >= 10:
        return text, 'gemini_vision'

    return text, 'none'
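The four-step chain in `extract_text_from_page` follows one rule throughout: keep the longest text seen so far, and stop as soon as it reaches a length threshold (50 characters for the first three methods, 10 for the last resort). A generic version of that rule might look like the sketch below. `run_fallback_chain` and the lambda extractors are hypothetical and stand in for the real PyPDF2, pdftotext, Tesseract and Gemini calls.

```python
def run_fallback_chain(extractors, min_chars=50, last_resort_min_chars=10):
    """Try (name, callable) extractors in order, keeping the longest text.

    Returns (text, method). A method is reported as soon as the best text so
    far reaches `min_chars`; the final extractor only needs
    `last_resort_min_chars`. If nothing qualifies, the method is 'none'.
    """
    text = ''
    for position, (name, extract) in enumerate(extractors):
        try:
            candidate = extract() or ''
        except Exception:
            candidate = ''  # a failing extractor just falls through
        if len(candidate.strip()) > len(text.strip()):
            text = candidate
        is_last = position == len(extractors) - 1
        threshold = last_resort_min_chars if is_last else min_chars
        if len(text.strip()) >= threshold:
            return text, name
    return text, 'none'


if __name__ == '__main__':
    chain = [
        ('pypdf2', lambda: ''),            # e.g. a scanned page with no text layer
        ('pdftotext', lambda: 'x' * 80),   # long enough, chain stops here
        ('tesseract', lambda: 'never reached'),
    ]
    text, method = run_fallback_chain(chain)
    print(method, len(text))
```

As in the original, the reported method is the stage at which the threshold was first met, which is not necessarily the stage that produced the winning text.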
|
||||
|
||||
|
||||
def extract_book_metadata(first_page_text, config):
    """Ask Gemini for (title, author) from first-page text.

    Returns (None, None) when no keys are configured, the text is too
    short, or the call fails.
    """
    keys = config.get('gemini_keys', [])
    if not keys or len(first_page_text.strip()) < 20:
        return None, None

    try:
        genai.configure(api_key=keys[0])
        model = genai.GenerativeModel(
            config['gemini']['model'],
            generation_config={"response_mime_type": config['gemini']['response_mime_type']}
        )
        prompt = f"""Extract the book title and author from this first page text.
Return JSON: {{"title": "...", "author": "..."}}
If unknown, use null for that field.

Text:
{first_page_text[:3000]}"""

        response = model.generate_content(prompt)
        data = json.loads(response.text)
        return data.get('title'), data.get('author')
    except Exception as e:
        logger.warning(f"Metadata extraction failed: {e}")
        return None, None
|
||||
|
||||
|
||||
def extract_single(file_hash, db, config):
    """Extract one document page by page, applying the three safety layers.

    Returns True if the document ends in the 'extracted' state (fully or
    partially), False if it was not found or was marked failed.
    """
    doc = db.get_document(file_hash)
    if not doc:
        return False

    pdf_path = doc['path']
    filename = doc['filename']
    text_dir = os.path.join(config['paths']['text'], file_hash)

    if not os.path.exists(pdf_path):
        db.mark_failed(file_hash, f"File not found: {pdf_path}")
        return False

    # Layer 1: Pre-flight size check
    proc = config.get('processing', {})
    max_size_mb = proc.get('max_pdf_size_mb', 200)
    try:
        file_size_mb = os.path.getsize(pdf_path) / 1048576
    except OSError as e:
        db.mark_failed(file_hash, f"Cannot stat file: {e}")
        return False

    if file_size_mb > max_size_mb:
        msg = f"Skipped: {file_size_mb:.0f}MB exceeds {max_size_mb}MB limit"
        logger.warning(f"SIZE SKIP: {filename} — {msg}")
        db.mark_failed(file_hash, msg)
        return False

    db.update_status(file_hash, 'extracting')

    # Layer 2/3 setup
    max_doc_seconds = proc.get('extract_timeout', 300)
    page_timeout = proc.get('page_timeout', 30)
    start_time = time.time()
    page_count = 0
    pages_extracted = 0
    skipped_pages = 0
    ocr_pages = []
    ocr_methods = {'pypdf2': 0, 'pdftotext': 0, 'tesseract': 0, 'gemini_vision': 0, 'none': 0}

    try:
        os.makedirs(text_dir, exist_ok=True)
        # Try PyPDF2 first; fall back to poppler-only extraction if it fails
        reader = None
        use_reader = True
        try:
            reader = PdfReader(pdf_path)
            page_count = len(reader.pages)
        except Exception as pdf_err:
            logger.warning(f"PdfReader failed for {filename}: {pdf_err} — using poppler fallback")
            use_reader = False
            page_count = _get_page_count(pdf_path)
            if page_count == 0:
                db.mark_failed(file_hash, f"PdfReader failed and pdfinfo returned 0 pages: {str(pdf_err)[:200]}")
                return False

        for i in range(page_count):
            # Layer 2: Check total document time budget
            elapsed = time.time() - start_time
            if elapsed > max_doc_seconds:
                msg = f"Timed out after {elapsed:.0f}s at page {i}/{page_count}"
                logger.warning(f"TIMEOUT: {filename} — {msg}")
                if pages_extracted > 0:
                    _save_partial(file_hash, db, doc, config, text_dir,
                                  page_count, pages_extracted, ocr_pages,
                                  f"Partial: {pages_extracted}/{page_count} pages "
                                  f"(timed out after {elapsed:.0f}s)",
                                  ocr_methods=ocr_methods)
                    return True
                else:
                    db.mark_failed(file_hash, msg)
                    return False

            # Layer 3: Per-page extraction with fallback chain
            try:
                if use_reader:
                    text, method = extract_text_from_page(reader, i, pdf_path, page_timeout)
                else:
                    text, method = _extract_page_without_reader(pdf_path, i, page_timeout)
                ocr_methods[method] += 1
                if method in ('tesseract', 'gemini_vision'):
                    ocr_pages.append(i + 1)
            except Exception as e:
                logger.warning(f"  Page {i+1}/{page_count} failed: {e} — skipping")
                text = ''
                skipped_pages += 1
                ocr_methods['none'] += 1

            page_file = os.path.join(text_dir, f"page_{i+1:04d}.txt")
            with open(page_file, 'w', encoding='utf-8') as f:
                f.write(text)

            if text.strip():
                pages_extracted += 1

            # Progress logging every 50 pages (more frequent since vision is slower)
            if (i + 1) % 50 == 0:
                el = time.time() - start_time
                rate = (i + 1) / el if el > 0 else 0
                vision_n = ocr_methods['gemini_vision']
                vision_note = f", {vision_n} vision" if vision_n else ""
                logger.info(f"  {filename}: page {i+1}/{page_count} "
                            f"({rate:.1f} pages/sec, {skipped_pages} skipped{vision_note})")

        # Full extraction complete: save metadata
        first_page_text = ''
        first_page_file = os.path.join(text_dir, 'page_0001.txt')
        if os.path.exists(first_page_file):
            with open(first_page_file, encoding='utf-8') as f:
                first_page_text = f.read()

        book_title, book_author = extract_book_metadata(first_page_text, config)

        if not book_title:
            book_title = clean_filename_to_title(filename)

        meta = {
            'hash': file_hash,
            'filename': filename,
            'page_count': page_count,
            'ocr_pages': ocr_pages,
            'skipped_pages': skipped_pages,
            'ocr_methods': ocr_methods,
        }
        with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
            json.dump(meta, f, indent=2)

        kwargs = {
            'page_count': page_count,
            'pages_extracted': pages_extracted,
            'book_title': book_title,
        }
        if book_author:
            kwargs['book_author'] = book_author
        if skipped_pages > 0:
            kwargs['error_message'] = (f"Partial: {pages_extracted}/{page_count} pages "
                                       f"({skipped_pages} pages timed out)")

        elapsed = time.time() - start_time
        db.update_status(file_hash, 'extracted', **kwargs)
        ocr_note = f", {len(ocr_pages)} OCR" if ocr_pages else ""
        skip_note = f", {skipped_pages} skipped" if skipped_pages > 0 else ""
        vision_note = f", {ocr_methods['gemini_vision']} vision" if ocr_methods['gemini_vision'] else ""
        logger.info(f"Extracted {filename}: {pages_extracted}/{page_count} pages "
                    f"({elapsed:.1f}s{ocr_note}{vision_note}{skip_note})")
        return True

    except Exception as e:
        logger.error(f"Extraction failed for {file_hash}: {e}\n{traceback.format_exc()}")
        if pages_extracted > 0:
            _save_partial(file_hash, db, doc, config, text_dir,
                          page_count, pages_extracted, ocr_pages,
                          f"Partial: {pages_extracted}/{page_count} pages "
                          f"({str(e)[:150]})",
                          ocr_methods=ocr_methods)
            return True
        db.mark_failed(file_hash, str(e)[:500])
        return False
|
||||
|
||||
|
||||
def _save_partial(file_hash, db, doc, config, text_dir, page_count,
                  pages_extracted, ocr_pages, error_msg, ocr_methods=None):
    """Save metadata and mark a partial extraction as 'extracted'."""
    book_title = clean_filename_to_title(doc['filename'])

    first_page_file = os.path.join(text_dir, 'page_0001.txt')
    if os.path.exists(first_page_file):
        with open(first_page_file, encoding='utf-8') as f:
            first_text = f.read()
        if len(first_text.strip()) > 20:
            title, _ = extract_book_metadata(first_text, config)
            if title:
                book_title = title

    meta = {
        'hash': file_hash,
        'filename': doc['filename'],
        'page_count': page_count,
        'ocr_pages': ocr_pages,
        'partial': True,
    }
    if ocr_methods:
        meta['ocr_methods'] = ocr_methods
    with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
        json.dump(meta, f, indent=2)

    db.update_status(file_hash, 'extracted',
                     page_count=page_count,
                     pages_extracted=pages_extracted,
                     book_title=book_title,
                     error_message=error_msg)
    logger.info(f"  Saved partial extraction: {pages_extracted}/{page_count} pages")
|
||||
|
||||
|
||||
def run_extraction(workers=None):
    """Extract every 'queued' document in a thread pool.

    Each worker gets its own StatusDB instance. Returns the number of
    documents for which extract_single returned True.
    """
    config = get_config()
    db = StatusDB()
    workers = workers or config['processing']['extract_workers']

    queued = db.get_by_status('queued')
    if not queued:
        logger.info("No queued documents to extract")
        return 0

    logger.info(f"Extracting {len(queued)} documents with {workers} workers")
    success = 0

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_single, doc['hash'], StatusDB(), config): doc for doc in queued}
        for future in as_completed(futures):
            doc = futures[future]
            try:
                if future.result():
                    success += 1
            except Exception as e:
                logger.error(f"Worker error for {doc['hash']}: {e}")

    logger.info(f"Extraction complete: {success}/{len(queued)} succeeded")
    return success
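The fan-out in `run_extraction` is the standard submit-then-`as_completed` pattern, with each failure contained to its own document. A stripped-down sketch, assuming a hypothetical `run_all` helper and a `fake_extract` stand-in for `extract_single`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(items, work, workers=4):
    """Run work(item) for every item in a pool; return the count of truthy results.

    An exception in one worker is reported and counted as a failure
    rather than aborting the whole batch.
    """
    success = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its item so errors can be attributed.
        futures = {pool.submit(work, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                if future.result():
                    success += 1
            except Exception as e:
                print(f"Worker error for {item}: {e}")
    return success


def fake_extract(doc_hash):
    """Stand-in for extract_single: False for one hash, raises for another."""
    if doc_hash == 'bad':
        raise ValueError('corrupt PDF')
    return doc_hash != 'empty'


if __name__ == '__main__':
    print(run_all(['a1', 'b2', 'empty', 'bad'], fake_extract, workers=2))
```

Keeping the future-to-item mapping in a dict is what lets the error message name the document that failed, since `as_completed` yields futures in completion order rather than submission order.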
|
||||
159
lib/ingester.py
Normal file
|
|
@ -0,0 +1,159 @@
|
|||
"""
|
||||
RECON Intel Ingester

ARGUS intelligence feed intake. Embeds intel JSON and inserts into Qdrant
with source_type='intel_feed'.

Dependencies: requests, qdrant-client
Config: embedding, vector_db
"""
import json
import os
import time
import traceback

import requests as http_requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

from .utils import get_config, setup_logging
from .status import StatusDB

logger = setup_logging('recon.ingester')
|
||||
|
||||
|
||||
def ingest_intel(intel_data, config=None):
    """Store one intel item in SQLite, embed its content, and upsert it into Qdrant.

    Requires the 'source', 'category' and 'content' fields.
    Returns the intel row id, or None on validation or processing failure.
    """
    if config is None:
        config = get_config()

    db = StatusDB()

    required = ['source', 'category', 'content']
    for field in required:
        if field not in intel_data:
            logger.error(f"Missing required field: {field}")
            return None

    try:
        conn = db._get_conn()
        cursor = conn.execute(
            """INSERT INTO intel (source, timestamp, region, category, content,
                summary, key_facts, credibility_score, verification_status)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
            (
                intel_data.get('source', 'unknown'),
                intel_data.get('timestamp', time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())),
                intel_data.get('region', 'unknown'),
                intel_data['category'],
                intel_data['content'],
                intel_data.get('summary', ''),
                json.dumps(intel_data.get('key_facts', [])),
                intel_data.get('credibility_score', 0.5),
                intel_data.get('verification_status', 'unverified'),
            )
        )
        intel_id = cursor.lastrowid
        conn.commit()

        url = f"http://{config['embedding']['host']}:{config['embedding']['port']}/api/embed"
        resp = http_requests.post(url, json={
            "model": config['embedding']['model'],
            "input": intel_data['content']
        }, timeout=120)
        resp.raise_for_status()
        vector = resp.json()['embeddings'][0]

        qdrant = QdrantClient(
            host=config['vector_db']['host'],
            port=config['vector_db']['port'],
            timeout=60
        )

        # Offset keeps intel point ids apart from the raw row ids
        point_id = intel_id + 2**60

        payload = {
            'source_type': 'intel_feed',
            'intel_id': intel_id,
            'source': intel_data.get('source', 'unknown'),
            'region': intel_data.get('region', 'unknown'),
            'category': intel_data['category'],
            'content': intel_data['content'],
            'summary': intel_data.get('summary', ''),
            'key_facts': intel_data.get('key_facts', []),
            'credibility_score': intel_data.get('credibility_score', 0.5),
            'verification_status': intel_data.get('verification_status', 'unverified'),
            'timestamp': intel_data.get('timestamp', ''),
            'language': 'en',
        }

        qdrant.upsert(
            collection_name=config['vector_db']['collection'],
            points=[PointStruct(id=point_id, vector=vector, payload=payload)]
        )

        conn.execute("UPDATE intel SET vector_id = ? WHERE id = ?", (point_id, intel_id))
        conn.commit()

        logger.info(f"Ingested intel #{intel_id} from {intel_data.get('source', 'unknown')}")
        return intel_id

    except Exception as e:
        logger.error(f"Intel ingestion failed: {e}\n{traceback.format_exc()}")
        return None
|
||||
|
||||
|
||||
def ingest_file(filepath, config=None):
    """Ingest a JSON file holding one intel item or a list of items.

    Returns a list with one entry per item (intel id or None); returns an
    empty list if the file cannot be read or parsed.
    """
    if config is None:
        config = get_config()

    try:
        with open(filepath, encoding='utf-8') as f:
            data = json.load(f)

        if isinstance(data, list):
            results = []
            for item in data:
                result = ingest_intel(item, config)
                results.append(result)
            success = sum(1 for r in results if r is not None)
            logger.info(f"Ingested {success}/{len(data)} items from {filepath}")
            return results
        else:
            return [ingest_intel(data, config)]

    except Exception as e:
        logger.error(f"Failed to ingest file {filepath}: {e}")
        return []
|
||||
|
||||
|
||||
def run_ingestion(directory=None):
    """Ingest every *.json file in the intel directory.

    A file is moved to the 'processed' subdirectory once at least one of
    its items was ingested. Returns the total number of items ingested.
    """
    config = get_config()
    intel_dir = directory or config['paths']['intel']

    if not os.path.exists(intel_dir):
        logger.info(f"Intel directory does not exist: {intel_dir}")
        return 0

    json_files = sorted([
        f for f in os.listdir(intel_dir)
        if f.endswith('.json') and not f.startswith('.')
    ])

    if not json_files:
        logger.info("No intel files to ingest")
        return 0

    total = 0
    for jf in json_files:
        filepath = os.path.join(intel_dir, jf)
        results = ingest_file(filepath, config)
        ingested = sum(1 for r in results if r is not None)
        total += ingested

        if ingested > 0:
            done_dir = os.path.join(intel_dir, 'processed')
            os.makedirs(done_dir, exist_ok=True)
            os.rename(filepath, os.path.join(done_dir, jf))

    logger.info(f"Intel ingestion complete: {total} items ingested")
    return total
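`ingest_intel` derives the Qdrant point id by adding `2**60` to the SQLite row id. The source does not say why; a plausible reading (an assumption here) is that it keeps intel points in a range that cannot collide with other integer ids in the same collection. The mapping and its inverse can be sketched as below, with hypothetical helper names and a bounds check based on Qdrant accepting unsigned 64-bit integers as point ids.

```python
INTEL_ID_OFFSET = 2 ** 60   # same constant as in ingest_intel
MAX_U64 = 2 ** 64 - 1       # largest unsigned 64-bit integer


def intel_point_id(intel_id):
    """Map an intel row id to its vector point id."""
    if intel_id < 0:
        raise ValueError("intel_id must be non-negative")
    point_id = intel_id + INTEL_ID_OFFSET
    if point_id > MAX_U64:
        raise OverflowError("point id does not fit in 64 bits")
    return point_id


def intel_id_from_point(point_id):
    """Inverse mapping; returns None for ids below the intel range."""
    if point_id < INTEL_ID_OFFSET:
        return None
    return point_id - INTEL_ID_OFFSET


if __name__ == '__main__':
    pid = intel_point_id(42)
    print(pid, intel_id_from_point(pid))
```

Since `2**60` leaves fifteen times that range free below the 64-bit limit, overflow is not a practical concern for autoincrement row ids; the check is there only to make the assumption explicit.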
|
||||
270
lib/key_manager.py
Normal file
|
|
@ -0,0 +1,270 @@
|
|||
"""
|
||||
RECON Key Manager - Thread-safe API key management with hot-reload.

Provides a singleton KeyManager that workers (enricher, extractor) read from
instead of loading .env directly. Dashboard can update keys at runtime without
restarting the service.

Dependencies: None beyond stdlib + requests (already in requirements.txt)
Config: Reads/writes /opt/recon/.env
"""

import os
import re
import time
import logging
import threading
import requests

logger = logging.getLogger('recon.key_manager')
|
||||
|
||||
class KeyManager:
    """Thread-safe API key store with hot-reload and validation."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
                    cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        if self._initialized:
            return
        self._keys_lock = threading.RLock()
        self._gemini_keys = []
        self._env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
        self._last_loaded = None
        self._key_stats = {}  # key_index -> {calls, errors, last_used}
        self._load_from_env()
        self._initialized = True
        logger.info(f"KeyManager initialized with {len(self._gemini_keys)} Gemini key(s)")
|
||||
|
||||
    # ── Read Operations ──

    def get_gemini_keys(self):
        """Return a copy of current Gemini keys. Thread-safe."""
        with self._keys_lock:
            return list(self._gemini_keys)

    def get_gemini_key(self, index=0):
        """Get a single Gemini key by index. Returns None if out of range."""
        with self._keys_lock:
            if 0 <= index < len(self._gemini_keys):
                return self._gemini_keys[index]
            return None

    def get_gemini_key_count(self):
        """Return number of loaded Gemini keys."""
        with self._keys_lock:
            return len(self._gemini_keys)

    def get_masked_keys(self):
        """Return keys masked for display: first 8 + ... + last 4 chars."""
        with self._keys_lock:
            result = []
            for i, key in enumerate(self._gemini_keys):
                if len(key) > 16:
                    masked = key[:8] + '...' + key[-4:]
                elif len(key) > 8:
                    masked = key[:4] + '...' + key[-2:]
                else:
                    masked = '****'
                stats = self._key_stats.get(i, {})
                result.append({
                    'index': i,
                    'masked': masked,
                    'length': len(key),
                    'calls': stats.get('calls', 0),
                    'errors': stats.get('errors', 0),
                    'last_used': stats.get('last_used', None),
                    'valid': stats.get('valid', None),
                    'last_validated': stats.get('last_validated', None),
                })
            return result
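The masking tiers in `get_masked_keys` are easier to see as a standalone function. The sketch below mirrors the three length tiers; `mask_key` is a hypothetical helper, and the sample values are obviously fake strings, not real keys.

```python
def mask_key(key):
    """Mask an API key for display, mirroring the length tiers in get_masked_keys."""
    if len(key) > 16:
        # Long keys: first 8 and last 4 characters stay visible.
        return key[:8] + '...' + key[-4:]
    if len(key) > 8:
        # Medium keys: reveal less, first 4 and last 2.
        return key[:4] + '...' + key[-2:]
    # Short keys: reveal nothing.
    return '****'


if __name__ == '__main__':
    for sample in ('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'ABCDEFGHIJ', 'short'):
        print(mask_key(sample))
```

Shrinking the visible portion for shorter keys keeps the fraction of revealed characters low, so a short key is never mostly exposed by its own mask.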
|
||||
|
||||
    # ── Write Operations (all persist to .env) ──

    def set_gemini_keys(self, keys):
        """Replace all Gemini keys. Persists to .env. Returns success bool."""
        # Filter empty strings
        keys = [k.strip() for k in keys if k.strip()]
        with self._keys_lock:
            self._gemini_keys = keys
            self._key_stats = {}  # Reset stats on full replace
            self._persist_to_env()
        logger.info(f"Gemini keys replaced: {len(keys)} key(s) loaded")
        return True

    def add_gemini_key(self, key):
        """Add a single Gemini key. Persists to .env. Returns new index."""
        key = key.strip()
        if not key:
            raise ValueError("Key cannot be empty")
        with self._keys_lock:
            # Check for duplicates
            if key in self._gemini_keys:
                raise ValueError("Key already exists")
            self._gemini_keys.append(key)
            idx = len(self._gemini_keys) - 1
            self._persist_to_env()
        logger.info(f"Gemini key added at index {idx}")
        return idx

    def remove_gemini_key(self, index):
        """Remove a Gemini key by index. Persists to .env. Returns removed key (masked)."""
        with self._keys_lock:
            if index < 0 or index >= len(self._gemini_keys):
                raise IndexError(f"Key index {index} out of range (have {len(self._gemini_keys)} keys)")
            if len(self._gemini_keys) <= 1:
                raise ValueError("Cannot remove last key — pipeline needs at least 1 Gemini key")
            key = self._gemini_keys.pop(index)
            # Rebuild stats with shifted indices
            new_stats = {}
            for i, stats in self._key_stats.items():
                if i < index:
                    new_stats[i] = stats
                elif i > index:
                    new_stats[i - 1] = stats
            self._key_stats = new_stats
            self._persist_to_env()
        masked = key[:8] + '...' + key[-4:] if len(key) > 16 else '****'
        logger.info(f"Gemini key removed at index {index}: {masked}")
        return masked

    def replace_gemini_key(self, index, new_key):
        """Replace a single Gemini key at index. Persists to .env."""
        new_key = new_key.strip()
        if not new_key:
            raise ValueError("Key cannot be empty")
        with self._keys_lock:
            if index < 0 or index >= len(self._gemini_keys):
                raise IndexError(f"Key index {index} out of range")
            # Check duplicate (but allow replacing with same key)
            if new_key in self._gemini_keys and self._gemini_keys[index] != new_key:
                raise ValueError("Key already exists at another index")
            self._gemini_keys[index] = new_key
            if index in self._key_stats:
                self._key_stats[index] = {}  # Reset stats for replaced key
            self._persist_to_env()
        logger.info(f"Gemini key replaced at index {index}")
|
||||
|
||||
    # ── Validation ──

    def validate_key(self, key):
        """
        Test a Gemini API key by listing models.
        Returns (valid: bool, message: str).
        """
        try:
            resp = requests.get(
                f"https://generativelanguage.googleapis.com/v1beta/models?key={key}",
                timeout=10
            )
            if resp.status_code == 200 and 'models' in resp.text:
                return True, "Valid — API responded"
            elif resp.status_code == 400:
                return False, f"Invalid key (HTTP {resp.status_code})"
            elif resp.status_code == 403:
                return False, "Key disabled or quota exhausted"
            elif resp.status_code == 429:
                return True, "Valid — but currently rate-limited"
            else:
                return False, f"Unexpected response (HTTP {resp.status_code})"
        except requests.Timeout:
            return False, "Timeout — could not reach Gemini API"
        except requests.ConnectionError:
            return False, "Connection error — check network"
        except Exception as e:
            return False, f"Error: {str(e)}"

    def validate_all(self):
        """Validate all loaded Gemini keys. Returns list of results."""
        results = []
        with self._keys_lock:
            keys_copy = list(enumerate(self._gemini_keys))

        for i, key in keys_copy:
            valid, message = self.validate_key(key)
            with self._keys_lock:
                if i not in self._key_stats:
                    self._key_stats[i] = {}
                self._key_stats[i]['valid'] = valid
                self._key_stats[i]['last_validated'] = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
            results.append({'index': i, 'valid': valid, 'message': message})
            time.sleep(0.2)  # Don't hammer the API

        return results
|
||||
|
||||
    # ── Stats tracking (called by enricher/extractor) ──

    def record_usage(self, key_index, success=True):
        """Record a key usage event. Called by workers after each Gemini call."""
        with self._keys_lock:
            if key_index not in self._key_stats:
                self._key_stats[key_index] = {'calls': 0, 'errors': 0}
            self._key_stats[key_index]['calls'] = self._key_stats[key_index].get('calls', 0) + 1
            if not success:
                self._key_stats[key_index]['errors'] = self._key_stats[key_index].get('errors', 0) + 1
            self._key_stats[key_index]['last_used'] = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
|
||||
|
||||
# ── Internal ──
|
||||
|
||||
def _load_from_env(self):
|
||||
"""Load Gemini keys from .env file."""
|
||||
keys = []
|
||||
if os.path.exists(self._env_path):
|
||||
with open(self._env_path, 'r') as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line and not line.startswith('#'):
|
||||
match = re.match(r'^GEMINI_KEY(?:_\d+)?=(.+)$', line)
|
||||
if match:
|
||||
val = match.group(1).strip().strip('"').strip("'")
|
||||
if val:
|
||||
keys.append(val)
|
||||
self._gemini_keys = keys
|
||||
self._last_loaded = time.time()
|
||||
|
||||
def _persist_to_env(self):
|
||||
"""Write current keys back to .env file, preserving non-Gemini lines."""
|
||||
other_lines = []
|
||||
if os.path.exists(self._env_path):
|
||||
with open(self._env_path, 'r') as f:
|
||||
for line in f:
|
||||
stripped = line.strip()
|
||||
if stripped and not re.match(r'^GEMINI_KEY', stripped):
|
||||
other_lines.append(line.rstrip('\n'))
|
||||
|
||||
with open(self._env_path, 'w') as f:
|
||||
# Write non-Gemini lines first
|
||||
for line in other_lines:
|
||||
f.write(line + '\n')
|
||||
# Write Gemini keys
|
||||
for i, key in enumerate(self._gemini_keys, 1):
|
||||
f.write(f'GEMINI_KEY_{i}={key}\n')
|
||||
|
||||
self._last_loaded = time.time()
|
||||
logger.info(f"Persisted {len(self._gemini_keys)} Gemini key(s) to {self._env_path}")
|
||||
|
||||
def reload_from_env(self):
|
||||
"""Force reload from .env (e.g., if edited externally)."""
|
||||
with self._keys_lock:
|
||||
self._load_from_env()
|
||||
logger.info(f"Reloaded {len(self._gemini_keys)} Gemini key(s) from .env")
|
||||
return len(self._gemini_keys)
|
||||
|
||||
|
||||
# Module-level convenience — import and use anywhere
|
||||
_manager = None
|
||||
|
||||
def get_key_manager():
|
||||
"""Get the singleton KeyManager instance."""
|
||||
global _manager
|
||||
if _manager is None:
|
||||
_manager = KeyManager()
|
||||
return _manager
|
||||
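The `.env` parsing done by `_load_from_env()` above can be tried in isolation. The sketch below is a minimal, standalone illustration that reuses the same regular expression; `parse_gemini_keys` and the sample lines are hypothetical names introduced here, not part of the module.

```python
import re

# Same pattern the loader uses: GEMINI_KEY or GEMINI_KEY_<n>, then '=' and a value.
KEY_LINE = re.compile(r'^GEMINI_KEY(?:_\d+)?=(.+)$')


def parse_gemini_keys(lines):
    """Return key values found in an iterable of .env lines (hypothetical helper)."""
    keys = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = KEY_LINE.match(line)
        if match:
            # Drop surrounding whitespace and any surrounding quote characters
            value = match.group(1).strip().strip('"').strip("'")
            if value:
                keys.append(value)
    return keys


sample = [
    '# secrets',
    'GEMINI_KEY=abc',
    'GEMINI_KEY_2="def"',
    'OTHER_VAR=ignored',
    "GEMINI_KEY_3=''",
]
print(parse_gemini_keys(sample))
```

Lines that are not Gemini keys are ignored here; in the module they are the lines `_persist_to_env()` writes back ahead of the renumbered `GEMINI_KEY_<n>` entries.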
1637
lib/new_pipeline.py
Normal file
File diff suppressed because it is too large
374
lib/organizer.py
Normal file
|
|
@ -0,0 +1,374 @@
|
|||
"""
|
||||
RECON Library Organizer
|
||||
|
||||
After a document completes the pipeline (extract -> enrich -> embed),
|
||||
this module classifies it by dominant domain and moves it into the
|
||||
correct Domain/Subdomain/ folder with a sanitized filename.
|
||||
|
||||
Two modes:
|
||||
1. Per-document: determine_dominant_domain() from on-disk concept JSONs
|
||||
2. Bulk manifest: organize_from_manifest() using pre-built manifest JSON
|
||||
|
||||
Path updates trigger the existing catalogue.path_updated_at mechanism,
|
||||
which sync_qdrant_paths() propagates to Qdrant payloads.
|
||||
"""
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import shutil
|
||||
from collections import Counter
|
||||
|
||||
from .utils import sanitize_filename
|
||||
|
||||
logger = logging.getLogger('recon.organizer')
|
||||
|
||||
# ── Domain folder mapping (canonical) ───────────────────────────────────
|
||||
# Keys = exact domain strings from Gemini enrichment
|
||||
# Values = filesystem-safe folder names
|
||||
|
||||
DOMAIN_FOLDERS = {
|
||||
'Agriculture & Livestock': 'Agriculture-and-Livestock',
|
||||
'Civil Organization': 'Civil-Organization',
|
||||
'Communications': 'Communications',
|
||||
'Food Systems': 'Food-Systems',
|
||||
'Foundational Skills': 'Foundational-Skills',
|
||||
'Logistics': 'Logistics',
|
||||
'Medical': 'Medical',
|
||||
'Navigation': 'Navigation',
|
||||
'Operations': 'Operations',
|
||||
'Power Systems': 'Power-Systems',
|
||||
'Preservation & Storage': 'Preservation-and-Storage',
|
||||
'Security': 'Security',
|
||||
'Shelter & Construction': 'Shelter-and-Construction',
|
||||
'Technology': 'Technology',
|
||||
'Tools & Equipment': 'Tools-and-Equipment',
|
||||
'Vehicles': 'Vehicles',
|
||||
'Water Systems': 'Water-Systems',
|
||||
'Wilderness Skills': 'Wilderness-Skills',
|
||||
}
|
||||
|
||||
|
||||
def normalize_folder_name(name):
|
||||
"""Normalize a domain/subdomain name to a folder-safe string.
|
||||
|
||||
Examples:
|
||||
'Edible Plants & Foraging' -> 'Edible-Plants-and-Foraging'
|
||||
'emergency medicine' -> 'Emergency-Medicine'
|
||||
"""
|
||||
if not name:
|
||||
return 'Uncategorized'
|
||||
name = name.strip()
|
||||
name = name.replace('&', 'and')
|
||||
words = name.split()
|
||||
titled = []
|
||||
for w in words:
|
||||
if w.lower() in ('and', 'of', 'the', 'to', 'for', 'in', 'on', 'at'):
|
||||
titled.append(w.lower())
|
||||
else:
|
||||
titled.append(w.capitalize())
|
||||
return '-'.join(titled)
|
||||
|
||||
|
||||
def determine_dominant_domain(doc_hash, data_dir):
|
||||
"""Determine a document's dominant domain from on-disk concept JSONs.
|
||||
|
||||
Reads all /data/concepts/{hash}/window_*.json files, counts domain
|
||||
occurrences across all concepts, returns the top domain.
|
||||
|
||||
Args:
|
||||
doc_hash: Document hash
|
||||
data_dir: Path to /opt/recon/data
|
||||
|
||||
Returns:
|
||||
(domain, subdomain, confidence) tuple.
|
||||
domain/subdomain are strings, or None when no concepts are found or the result is ambiguous.
|
||||
confidence is float 0-1 (top domain count / total concepts).
|
||||
"""
|
||||
concepts_dir = os.path.join(data_dir, 'concepts', doc_hash)
|
||||
if not os.path.isdir(concepts_dir):
|
||||
return (None, None, 0.0)
|
||||
|
||||
domain_counter = Counter()
|
||||
subdomain_counter = Counter()
|
||||
total_concepts = 0
|
||||
|
||||
for fname in os.listdir(concepts_dir):
|
||||
if not fname.startswith('window_') or not fname.endswith('.json'):
|
||||
continue
|
||||
fpath = os.path.join(concepts_dir, fname)
|
||||
try:
|
||||
with open(fpath, 'r') as f:
|
||||
concepts = json.load(f)
|
||||
except (json.JSONDecodeError, OSError):
|
||||
continue
|
||||
|
||||
if not isinstance(concepts, list):
|
||||
continue
|
||||
|
||||
for concept in concepts:
|
||||
total_concepts += 1
|
||||
# domain is usually a list with one element
|
||||
dom = concept.get('domain')
|
||||
if isinstance(dom, list):
|
||||
for d in dom:
|
||||
if isinstance(d, str):
|
||||
domain_counter[d] += 1
|
||||
elif isinstance(dom, str):
|
||||
domain_counter[dom] += 1
|
||||
|
||||
sub = concept.get('subdomain')
|
||||
if isinstance(sub, list):
|
||||
for s in sub:
|
||||
if isinstance(s, str):
|
||||
subdomain_counter[s] += 1
|
||||
elif isinstance(sub, str):
|
||||
subdomain_counter[sub] += 1
|
||||
|
||||
if total_concepts == 0 or not domain_counter:
|
||||
return (None, None, 0.0)
|
||||
|
||||
top_domains = domain_counter.most_common(2)
|
||||
dom_name = top_domains[0][0]
|
||||
dom_count = top_domains[0][1]
|
||||
confidence = dom_count / total_concepts
|
||||
|
||||
# Check ambiguity
|
||||
is_ambiguous = False
|
||||
if len(top_domains) >= 2:
|
||||
dom2_count = top_domains[1][1]
|
||||
if dom2_count >= dom_count * 0.8:
|
||||
is_ambiguous = True
|
||||
if confidence < 0.4:
|
||||
is_ambiguous = True
|
||||
|
||||
if is_ambiguous:
|
||||
return (None, None, confidence)
|
||||
|
||||
top_sub = subdomain_counter.most_common(1)
|
||||
sub_name = top_sub[0][0] if top_sub else None
|
||||
|
||||
return (dom_name, sub_name, confidence)
|
||||
|
||||
|
||||
def _build_target_path(library_root, domain, subdomain, filename, doc_hash):
|
||||
"""Build the target path for a document, handling domain mapping and collisions.
|
||||
|
||||
Returns:
|
||||
(target_path, sanitized_filename) tuple
|
||||
"""
|
||||
san_name = sanitize_filename(filename, doc_hash=doc_hash)
|
||||
|
||||
if domain is None:
|
||||
# Unclassified — leave in place (don't move to Review folder for pipeline)
|
||||
return (None, san_name)
|
||||
|
||||
domain_folder = DOMAIN_FOLDERS.get(domain)
|
||||
if not domain_folder:
|
||||
domain_folder = normalize_folder_name(domain)
|
||||
|
||||
if subdomain:
|
||||
sub_folder = normalize_folder_name(subdomain)
|
||||
else:
|
||||
sub_folder = 'General'
|
||||
|
||||
target_dir = os.path.join(library_root, domain_folder, sub_folder)
|
||||
target_path = os.path.join(target_dir, san_name)
|
||||
|
||||
# Handle collision at target
|
||||
if os.path.exists(target_path):
|
||||
stem, ext = os.path.splitext(san_name)
|
||||
h6 = doc_hash[:6]
|
||||
new_name = '{} [{}]{}'.format(stem, h6, ext)
|
||||
if len(new_name) > 120:
|
||||
max_stem = 120 - len(ext) - 9
|
||||
stem = stem[:max_stem].rstrip('. -,')
|
||||
new_name = '{} [{}]{}'.format(stem, h6, ext)
|
||||
san_name = new_name
|
||||
target_path = os.path.join(target_dir, san_name)
|
||||
|
||||
return (target_path, san_name)
|
||||
|
||||
|
||||
def organize_document(doc_hash, db, config, dry_run=False):
|
||||
"""Organize a single document: classify, rename, and move.
|
||||
|
||||
Args:
|
||||
doc_hash: Document hash
|
||||
db: StatusDB instance
|
||||
config: RECON config dict
|
||||
dry_run: If True, don't actually move files
|
||||
|
||||
Returns:
|
||||
dict with keys: hash, action, before_path, after_path, domain, subdomain, error
|
||||
"""
|
||||
library_root = config['library_root']
|
||||
data_dir = config['paths']['data']
|
||||
|
||||
result = {
|
||||
'hash': doc_hash,
|
||||
'action': 'skip',
|
||||
'before_path': None,
|
||||
'after_path': None,
|
||||
'domain': None,
|
||||
'subdomain': None,
|
||||
'error': None,
|
||||
}
|
||||
|
||||
# Look up current path from catalogue
|
||||
conn = db._get_conn()
|
||||
row = conn.execute(
|
||||
"SELECT path, filename FROM catalogue WHERE hash = ?", (doc_hash,)
|
||||
).fetchone()
|
||||
if not row:
|
||||
result['error'] = 'Not in catalogue'
|
||||
return result
|
||||
|
||||
current_path = row['path']
|
||||
current_filename = row['filename']
|
||||
result['before_path'] = current_path
|
||||
|
||||
# Verify file exists on disk
|
||||
if not dry_run and not os.path.exists(current_path):
|
||||
result['error'] = 'File not found on disk'
|
||||
return result
|
||||
|
||||
# Determine domain from concept JSONs
|
||||
domain, subdomain, confidence = determine_dominant_domain(doc_hash, data_dir)
|
||||
result['domain'] = domain
|
||||
result['subdomain'] = subdomain
|
||||
|
||||
if domain is None:
|
||||
result['action'] = 'skip_unclassified'
|
||||
return result
|
||||
|
||||
# Build target path
|
||||
target_path, san_name = _build_target_path(
|
||||
library_root, domain, subdomain, current_filename, doc_hash
|
||||
)
|
||||
|
||||
if target_path is None:
|
||||
result['action'] = 'skip_unclassified'
|
||||
return result
|
||||
|
||||
result['after_path'] = target_path
|
||||
|
||||
# Already at target?
|
||||
if os.path.abspath(current_path) == os.path.abspath(target_path):
|
||||
result['action'] = 'already_organized'
|
||||
# Still mark as organized
|
||||
if not dry_run:
|
||||
db.mark_organized(doc_hash)
|
||||
return result
|
||||
|
||||
if dry_run:
|
||||
result['action'] = 'would_move'
|
||||
return result
|
||||
|
||||
# Move the file
|
||||
try:
|
||||
target_dir = os.path.dirname(target_path)
|
||||
os.makedirs(target_dir, exist_ok=True)
|
||||
shutil.move(current_path, target_path)
|
||||
|
||||
# Update catalogue (triggers path_updated_at for Qdrant sync)
|
||||
db.update_catalogue_path(doc_hash, target_path, san_name)
|
||||
db.mark_organized(doc_hash)
|
||||
|
||||
result['action'] = 'moved'
|
||||
logger.info("Organized %s -> %s [%s/%s]",
|
||||
doc_hash[:8], target_path, domain, subdomain)
|
||||
except Exception as e:
|
||||
result['action'] = 'error'
|
||||
result['error'] = str(e)
|
||||
logger.error("Failed to organize %s: %s", doc_hash[:8], e)
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def organize_from_manifest(manifest_path, db, config, dry_run=False):
|
||||
"""Bulk migration using a pre-built manifest JSON.
|
||||
|
||||
The manifest is produced by recon_manifest_builder.py and contains
|
||||
entries with current_path, sanitized_path, sanitized_filename, hash, etc.
|
||||
|
||||
Args:
|
||||
manifest_path: Path to manifest JSON file
|
||||
db: StatusDB instance
|
||||
config: RECON config dict
|
||||
dry_run: If True, don't actually move files
|
||||
|
||||
Returns:
|
||||
dict with summary stats: moved, skipped, errors, already_organized, total
|
||||
"""
|
||||
with open(manifest_path, 'r') as f:
|
||||
entries = json.load(f)
|
||||
|
||||
stats = {
|
||||
'total': len(entries),
|
||||
'moved': 0,
|
||||
'skipped': 0,
|
||||
'already_organized': 0,
|
||||
'errors': 0,
|
||||
'not_found': 0,
|
||||
}
|
||||
|
||||
for i, entry in enumerate(entries):
|
||||
doc_hash = entry['hash']
|
||||
current_path = entry['current_path']
|
||||
target_path = entry.get('sanitized_path', entry.get('proposed_path'))
|
||||
san_name = entry.get('sanitized_filename', entry.get('filename'))
|
||||
|
||||
if not target_path or not san_name:
|
||||
stats['skipped'] += 1
|
||||
continue
|
||||
|
||||
# Skip ambiguous entries
|
||||
if entry.get('ambiguous'):
|
||||
stats['skipped'] += 1
|
||||
continue
|
||||
|
||||
# Already at target?
|
||||
if os.path.abspath(current_path) == os.path.abspath(target_path):
|
||||
stats['already_organized'] += 1
|
||||
if not dry_run:
|
||||
db.mark_organized(doc_hash)
|
||||
continue
|
||||
|
||||
if dry_run:
|
||||
stats['moved'] += 1
|
||||
continue
|
||||
|
||||
# Verify source exists
|
||||
if not os.path.exists(current_path):
|
||||
stats['not_found'] += 1
|
||||
logger.warning("Manifest: file not found: %s [%s]", current_path, doc_hash[:8])
|
||||
continue
|
||||
|
||||
try:
|
||||
target_dir = os.path.dirname(target_path)
|
||||
os.makedirs(target_dir, exist_ok=True)
|
||||
|
||||
# Check for collision at target (different file already there)
|
||||
if os.path.exists(target_path):
|
||||
stem, ext = os.path.splitext(san_name)
|
||||
h6 = doc_hash[:6]
|
||||
san_name = '{} [{}]{}'.format(stem, h6, ext)
|
||||
target_path = os.path.join(target_dir, san_name)
|
||||
|
||||
shutil.move(current_path, target_path)
|
||||
|
||||
# Update catalogue + mark organized
|
||||
db.update_catalogue_path(doc_hash, target_path, san_name)
|
||||
db.mark_organized(doc_hash)
|
||||
stats['moved'] += 1
|
||||
|
||||
except Exception as e:
|
||||
stats['errors'] += 1
|
||||
logger.error("Manifest: failed to move %s: %s", doc_hash[:8], e)
|
||||
|
||||
# Progress reporting
|
||||
if (i + 1) % 1000 == 0:
|
||||
logger.info("Manifest progress: %d / %d (moved=%d, errors=%d)",
|
||||
i + 1, stats['total'], stats['moved'], stats['errors'])
|
||||
|
||||
return stats
|
||||
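The ambiguity rule in `determine_dominant_domain()` is easier to see on a flat list of labels. The sketch below is a simplified illustration, assuming one domain label per concept (the module also accepts list-valued domains); `pick_dominant` and its parameter names are hypothetical, and the defaults mirror the 0.8 and 0.4 constants used above.

```python
from collections import Counter


def pick_dominant(domains, ratio=0.8, min_confidence=0.4):
    """Return (domain, confidence); domain is None when the vote is ambiguous.

    `domains` holds one label per concept (a simplifying assumption).
    """
    if not domains:
        return (None, 0.0)
    counts = Counter(domains)
    top = counts.most_common(2)
    top_name, top_count = top[0]
    confidence = top_count / len(domains)
    # Ambiguous if the runner-up is within `ratio` of the leader...
    close_second = len(top) > 1 and top[1][1] >= top_count * ratio
    # ...or if the leader covers too small a share of all concepts.
    if close_second or confidence < min_confidence:
        return (None, confidence)
    return (top_name, confidence)


print(pick_dominant(['Medical'] * 7 + ['Security'] * 3))
print(pick_dominant(['Medical'] * 5 + ['Security'] * 5))
```

The first call has a clear winner; the second is a tie, so the document would be left in place rather than moved.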
137
lib/peertube_collector.py
Normal file
|
|
@ -0,0 +1,137 @@
|
|||
"""
|
||||
RECON Metrics Collector
|
||||
|
||||
Background daemon thread that snapshots pipeline metrics every 2 minutes
|
||||
to the metrics_snapshots SQLite table. Used for time-series charts.
|
||||
"""
|
||||
import json
|
||||
import time
|
||||
import threading
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger('recon.collector')
|
||||
|
||||
|
||||
def start_collector(stop_event=None):
|
||||
"""Start the metrics collector in a daemon thread."""
|
||||
def _run():
|
||||
from .status import StatusDB
|
||||
from .utils import get_config
|
||||
import requests as req
|
||||
|
||||
interval = 120 # 2 minutes
|
||||
logger.info(f"Metrics collector started (interval: {interval}s)")
|
||||
|
||||
while True:
|
||||
if stop_event and stop_event.is_set():
|
||||
break
|
||||
try:
|
||||
_snapshot(StatusDB(), get_config(), req)
|
||||
except Exception as e:
|
||||
logger.error(f"Metrics snapshot failed: {e}")
|
||||
|
||||
# Wait with stop check
|
||||
if stop_event:
|
||||
stop_event.wait(interval)
|
||||
if stop_event.is_set():
|
||||
break
|
||||
else:
|
||||
time.sleep(interval)
|
||||
|
||||
logger.info("Metrics collector stopped")
|
||||
|
||||
t = threading.Thread(target=_run, daemon=True, name='metrics-collector')
|
||||
t.start()
|
||||
return t
|
||||
|
||||
|
||||
def _snapshot(db, config, req):
|
||||
"""Take a single metrics snapshot."""
|
||||
from datetime import datetime, timezone, timedelta
|
||||
|
||||
conn = db._get_conn()
|
||||
ts = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:00Z') # Round to minute
|
||||
|
||||
# Knowledge pipeline stats
|
||||
try:
|
||||
totals = conn.execute("""
|
||||
SELECT
|
||||
COUNT(*) as total,
|
||||
SUM(CASE WHEN status = 'complete' THEN 1 ELSE 0 END) as complete,
|
||||
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed,
|
||||
SUM(CASE WHEN status NOT IN ('complete', 'failed') THEN 1 ELSE 0 END) as in_pipeline,
|
||||
SUM(COALESCE(concepts_extracted, 0)) as concepts,
|
||||
SUM(COALESCE(vectors_inserted, 0)) as vectors
|
||||
FROM documents
|
||||
""").fetchone()
|
||||
|
||||
knowledge_data = {
|
||||
'total': totals['total'],
|
||||
'complete': totals['complete'],
|
||||
'failed': totals['failed'],
|
||||
'in_pipeline': totals['in_pipeline'],
|
||||
'concepts': totals['concepts'],
|
||||
'vectors': totals['vectors'],
|
||||
}
|
||||
|
||||
conn.execute(
|
||||
"INSERT OR REPLACE INTO metrics_snapshots (timestamp, metric_type, data) VALUES (?, ?, ?)",
|
||||
(ts, 'knowledge', json.dumps(knowledge_data))
|
||||
)
|
||||
conn.commit()
|
||||
except Exception as e:
|
||||
logger.debug(f"Knowledge snapshot failed: {e}")
|
||||
|
||||
# PeerTube pipeline stats (via SSH)
|
||||
try:
|
||||
import subprocess
|
||||
result = subprocess.run(
|
||||
['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=5',
|
||||
'zvx@192.168.1.170',
|
||||
'sudo -u peertube psql peertube_prod -t -A -c "SELECT state, COUNT(*) FROM video GROUP BY state;" 2>/dev/null; '
|
||||
'echo "---"; '
|
||||
'for d in staging completed transcoded failed; do '
|
||||
' dir="/opt/bulk-import/$d"; '
|
||||
' files=$(find -L "$dir" -type f 2>/dev/null | wc -l); '
|
||||
' echo "$d|$files"; '
|
||||
'done'],
|
||||
capture_output=True, text=True, timeout=20
|
||||
)
|
||||
if result.returncode == 0 or result.stdout.strip():
|
||||
sections = result.stdout.split('---')
|
||||
video_states = {}
|
||||
if len(sections) > 0:
|
||||
for line in sections[0].strip().split('\n'):
|
||||
if '|' in line:
|
||||
parts = line.split('|')
|
||||
if len(parts) == 2 and parts[1].isdigit():
|
||||
video_states[parts[0]] = int(parts[1])
|
||||
pipeline_files = {}
|
||||
if len(sections) > 1:
|
||||
for line in sections[1].strip().split('\n'):
|
||||
if '|' in line:
|
||||
parts = line.split('|')
|
||||
if len(parts) == 2:
|
||||
pipeline_files[parts[0]] = int(parts[1]) if parts[1].isdigit() else 0
|
||||
|
||||
pt_data = {
|
||||
'video_states': video_states,
|
||||
'pipeline_files': pipeline_files,
|
||||
'published': video_states.get('1', 0),
|
||||
'backlog': sum(pipeline_files.values()),
|
||||
}
|
||||
conn.execute(
|
||||
"INSERT OR REPLACE INTO metrics_snapshots (timestamp, metric_type, data) VALUES (?, ?, ?)",
|
||||
(ts, 'peertube', json.dumps(pt_data))
|
||||
)
|
||||
conn.commit()
|
||||
except Exception as e:
|
||||
logger.debug(f"PeerTube snapshot failed: {e}")
|
||||
|
||||
# Prune old snapshots (> 7 days)
|
||||
try:
|
||||
cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime('%Y-%m-%dT%H:%M:%SZ')  # same format as stored timestamps
|
||||
conn.execute("DELETE FROM metrics_snapshots WHERE timestamp < ?", (cutoff,))
|
||||
conn.commit()
|
||||
except Exception:
|
||||
pass
|
||||
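The PeerTube half of `_snapshot()` parses two blocks of `name|count` lines separated by a `---` marker. The sketch below shows that parsing on its own, assuming the output shape implied by the SSH command above; `parse_pipe_counts`, `parse_snapshot_output`, and the sample string are hypothetical. Unlike the module, which records 0 for a non-numeric file count, this version skips malformed rows.

```python
def parse_pipe_counts(block):
    """Parse 'name|count' lines into a dict, skipping malformed rows."""
    counts = {}
    for line in block.strip().split('\n'):
        parts = line.split('|')
        if len(parts) == 2 and parts[1].strip().isdigit():
            counts[parts[0].strip()] = int(parts[1])
    return counts


def parse_snapshot_output(stdout):
    """Split the combined output on the '---' marker into two count dicts."""
    sections = stdout.split('---')
    video_states = parse_pipe_counts(sections[0])
    pipeline_files = parse_pipe_counts(sections[1]) if len(sections) > 1 else {}
    return video_states, pipeline_files


# Made-up sample: two video states, then four bulk-import directories.
sample = "1|120\n2|3\n---\nstaging|5\ncompleted|40\ntranscoded|0\nfailed|1\n"
states, files = parse_snapshot_output(sample)
published = states.get('1', 0)   # state '1' is what the collector reports as published
backlog = sum(files.values())    # total files across the import directories
print(published, backlog)
```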
580
lib/peertube_scraper.py
Normal file
|
|
@ -0,0 +1,580 @@
|
|||
"""
|
||||
RECON PeerTube Scraper — Video transcript ingestion.
|
||||
|
||||
Fetches WebVTT captions from a PeerTube instance, converts to plain text,
|
||||
chunks into pages, and feeds into the standard RECON enrichment pipeline.
|
||||
|
||||
Output format matches lib/web_scraper.py so the enricher and embedder
|
||||
process transcript content identically to web content.
|
||||
"""
|
||||
|
||||
import hashlib
|
||||
import io
|
||||
import json
|
||||
import os
|
||||
import bisect
|
||||
import re
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from urllib.parse import quote
|
||||
|
||||
import requests
|
||||
import webvtt
|
||||
|
||||
from .utils import get_config, setup_logging
|
||||
from .status import StatusDB
|
||||
from .web_scraper import chunk_text
|
||||
|
||||
logger = setup_logging('recon.peertube_scraper')
|
||||
|
||||
# Module-level stop flag — set by service thread for graceful shutdown
|
||||
_stop_check = None
|
||||
|
||||
def set_stop_check(fn):
|
||||
"""Register a callable that returns True when shutdown is requested."""
|
||||
global _stop_check
|
||||
_stop_check = fn
|
||||
|
||||
# Defaults (overridden by config.yaml peertube section)
|
||||
DEFAULT_API_BASE = 'http://192.168.1.170'
|
||||
DEFAULT_PUBLIC_URL = 'https://stream.echo6.co'
|
||||
DEFAULT_FETCH_TIMEOUT = 30
|
||||
DEFAULT_RATE_LIMIT_DELAY = 0.5
|
||||
|
||||
|
||||
def _get_pt_config(config=None):
|
||||
"""Get PeerTube settings from config, with defaults."""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
pt = config.get('peertube', {})
|
||||
return {
|
||||
'api_base': pt.get('api_base', DEFAULT_API_BASE),
|
||||
'public_url': pt.get('public_url', DEFAULT_PUBLIC_URL),
|
||||
'fetch_timeout': pt.get('fetch_timeout', DEFAULT_FETCH_TIMEOUT),
|
||||
'rate_limit_delay': pt.get('rate_limit_delay', DEFAULT_RATE_LIMIT_DELAY),
|
||||
}
|
||||
|
||||
|
||||
def _api_get(path, config=None, params=None):
|
||||
"""Make a GET request to the PeerTube API."""
|
||||
ptc = _get_pt_config(config)
|
||||
url = f"{ptc['api_base']}{path}"
|
||||
resp = requests.get(url, params=params, timeout=ptc['fetch_timeout'])
|
||||
resp.raise_for_status()
|
||||
return resp.json()
|
||||
|
||||
|
||||
def get_videos(channel=None, since=None, config=None):
|
||||
"""
|
||||
Paginate through all published videos on the PeerTube instance.
|
||||
|
||||
Args:
|
||||
channel: Filter to this channel actor_name (e.g., 'mental-outlaw')
|
||||
since: ISO date string — only return videos published after this date
|
||||
config: RECON config dict
|
||||
|
||||
Returns list of video dicts with: uuid, name, duration,
|
||||
channel_name, channel_display, publishedAt, description (truncated to 500 characters).
|
||||
"""
|
||||
ptc = _get_pt_config(config)
|
||||
videos = []
|
||||
start = 0
|
||||
count = 100 # PeerTube supports up to 100 per page
|
||||
|
||||
while True:
|
||||
if channel:
|
||||
path = f"/api/v1/video-channels/{channel}/videos"
|
||||
else:
|
||||
path = "/api/v1/videos"
|
||||
|
||||
data = _api_get(path, config, params={
|
||||
'count': count,
|
||||
'start': start,
|
||||
'sort': '-publishedAt',
|
||||
})
|
||||
|
||||
total = data.get('total', 0)
|
||||
batch = data.get('data', [])
|
||||
|
||||
if not batch:
|
||||
break
|
||||
|
||||
for v in batch:
|
||||
published = v.get('publishedAt', '')
|
||||
|
||||
# Filter by since date
|
||||
if since and published < since:
|
||||
# Videos are sorted by publishedAt desc, so once we pass
|
||||
# the since threshold, all remaining are older — stop
|
||||
return videos
|
||||
|
||||
videos.append({
|
||||
'uuid': v['uuid'],
|
||||
'name': v['name'],
|
||||
'duration': v.get('duration', 0),
|
||||
'channel_name': v.get('channel', {}).get('name', ''),
|
||||
'channel_display': v.get('channel', {}).get('displayName', ''),
|
||||
'publishedAt': published,
|
||||
'description': (v.get('description') or '')[:500],
|
||||
})
|
||||
|
||||
start += count
|
||||
if start >= total:
|
||||
break
|
||||
|
||||
# Check for shutdown during pagination
|
||||
if _stop_check and _stop_check():
|
||||
logger.info(f"Shutdown requested during video listing — returning {len(videos)} collected so far")
|
||||
return videos
|
||||
|
||||
# Rate limit pagination requests
|
||||
time.sleep(ptc['rate_limit_delay'])
|
||||
|
||||
return videos
|
||||
|
||||
|
||||
def get_captions(uuid, config=None):
|
||||
"""Get caption list for a video. Returns list of caption dicts."""
|
||||
data = _api_get(f"/api/v1/videos/{uuid}/captions", config)
|
||||
return data.get('data', [])
|
||||
|
||||
|
||||
def fetch_vtt(caption_path, config=None):
|
||||
"""Fetch raw VTT file content from PeerTube."""
|
||||
ptc = _get_pt_config(config)
|
||||
url = f"{ptc['api_base']}{caption_path}"
|
||||
resp = requests.get(url, timeout=ptc['fetch_timeout'])
|
||||
resp.raise_for_status()
|
||||
return resp.text
|
||||
|
||||
|
||||
|
||||
def _parse_vtt_time(time_str):
|
||||
"""Parse VTT timestamp string (HH:MM:SS.mmm or MM:SS.mmm) to seconds."""
|
||||
parts = time_str.split(':')
|
||||
if len(parts) == 3:
|
||||
h, m, s = parts
|
||||
return int(h) * 3600 + int(m) * 60 + float(s)
|
||||
elif len(parts) == 2:
|
||||
m, s = parts
|
||||
return int(m) * 60 + float(s)
|
||||
return 0.0
|
||||
|
||||
|
||||
def vtt_to_text(vtt_content):
|
||||
"""
|
||||
Convert WebVTT content to clean plain text with timestamp tracking.
|
||||
|
||||
Strips timestamps, de-duplicates consecutive identical cues (common with
|
||||
Whisper output), removes HTML tags, and joins cues with spaces (not
|
||||
newlines — Whisper cues break mid-sentence).
|
||||
|
||||
Returns (text, cue_timestamps) where:
|
||||
- text: clean prose string
|
||||
- cue_timestamps: list of (start_seconds, char_offset) tuples tracking
|
||||
where each VTT cue begins in the output text
|
||||
"""
|
||||
buf = io.StringIO(vtt_content)
|
||||
try:
|
||||
captions = webvtt.read_buffer(buf)
|
||||
except Exception:
|
||||
# Fallback: manual regex parse if webvtt-py fails
|
||||
return _vtt_to_text_fallback(vtt_content)
|
||||
|
||||
prev_text = None
|
||||
segments = []
|
||||
raw_timestamps = [] # (start_seconds, segment_index)
|
||||
|
||||
for caption in captions:
|
||||
text = caption.text.strip()
|
||||
if not text:
|
||||
continue
|
||||
|
||||
# Strip HTML tags
|
||||
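text = re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', '', text)).strip()  # also collapse whitespace so the char offsets computed below match the joined text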
|
||||
|
||||
# De-duplicate consecutive identical cues
|
||||
if text == prev_text:
|
||||
continue
|
||||
prev_text = text
|
||||
|
||||
start_seconds = _parse_vtt_time(caption.start)
|
||||
raw_timestamps.append((start_seconds, len(segments)))
|
||||
segments.append(text)
|
||||
|
||||
# Join with spaces — VTT cues break mid-sentence
|
||||
raw = ' '.join(segments)
|
||||
|
||||
# Clean up double spaces and whitespace
|
||||
raw = re.sub(r'\s+', ' ', raw).strip()
|
||||
|
||||
# Compute char offsets for each tracked segment
|
||||
seg_offsets = []
|
||||
pos = 0
|
||||
for i, seg in enumerate(segments):
|
||||
seg_offsets.append(pos)
|
||||
pos += len(seg) + 1 # +1 for space separator
|
||||
|
||||
cue_timestamps = []
|
||||
for start_secs, seg_idx in raw_timestamps:
|
||||
if seg_idx < len(seg_offsets):
|
||||
cue_timestamps.append((start_secs, seg_offsets[seg_idx]))
|
||||
|
||||
return raw, cue_timestamps
|
||||
|
||||
|
||||
def _vtt_to_text_fallback(vtt_content):
|
||||
"""Regex-based VTT parser as fallback. Returns (text, cue_timestamps)."""
|
||||
lines = vtt_content.split('\n')
|
||||
prev_text = None
|
||||
segments = []
|
||||
raw_timestamps = []
|
||||
last_time = 0.0
|
||||
|
||||
for line in lines:
|
||||
line = line.strip()
|
||||
if not line or line == 'WEBVTT':
|
||||
continue
|
||||
if '-->' in line:
|
||||
# Parse start time from "00:01:23.456 --> 00:01:25.789"
|
||||
time_part = line.split('-->')[0].strip()
|
||||
last_time = _parse_vtt_time(time_part)
|
||||
continue
|
||||
if line.isdigit():
|
||||
continue
|
||||
|
||||
text = re.sub(r'<[^>]+>', '', line)
|
||||
if text == prev_text:
|
||||
continue
|
||||
prev_text = text
|
||||
raw_timestamps.append((last_time, len(segments)))
|
||||
segments.append(text)
|
||||
|
||||
raw = ' '.join(segments)
|
||||
raw = re.sub(r'\s+', ' ', raw).strip()
|
||||
|
||||
# Compute char offsets
|
||||
seg_offsets = []
|
||||
pos = 0
|
||||
for seg in segments:
|
||||
seg_offsets.append(pos)
|
||||
pos += len(seg) + 1
|
||||
|
||||
cue_timestamps = []
|
||||
for start_secs, seg_idx in raw_timestamps:
|
||||
if seg_idx < len(seg_offsets):
|
||||
cue_timestamps.append((start_secs, seg_offsets[seg_idx]))
|
||||
|
||||
return raw, cue_timestamps
|
||||
|
||||
|
||||
|
||||
def _map_page_timestamps(pages, full_text, cue_timestamps):
|
||||
"""
|
||||
Map page numbers to video timestamps.
|
||||
|
||||
For each page, finds its approximate start position in the full text,
|
||||
then looks up the nearest VTT cue timestamp via binary search.
|
||||
|
||||
Returns dict: {"page_0001": 0.0, "page_0002": 312.5, ...}
|
||||
"""
|
||||
if not cue_timestamps:
|
||||
return {}
|
||||
|
||||
offsets = [ct[1] for ct in cue_timestamps]
|
||||
times = [ct[0] for ct in cue_timestamps]
|
||||
|
||||
page_ts = {}
|
||||
search_start = 0
|
||||
|
||||
for i, page_text in enumerate(pages):
|
||||
page_name = f"page_{i+1:04d}"
|
||||
|
||||
# Find where this page starts in the full text
|
||||
snippet = page_text[:200].strip()
|
||||
pos = full_text.find(snippet, search_start)
|
||||
if pos < 0:
|
||||
pos = search_start # fallback
|
||||
|
||||
# Binary search for nearest cue at or before this position
|
||||
idx = bisect.bisect_right(offsets, pos) - 1
|
||||
if idx < 0:
|
||||
idx = 0
|
||||
|
||||
page_ts[page_name] = round(times[idx], 1)
|
||||
search_start = pos + len(snippet)
|
||||
|
||||
return page_ts
|
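The lookup at the heart of `_map_page_timestamps()` is a "last cue starting at or before this offset" search. A minimal sketch, assuming `cue_timestamps` is sorted by character offset as `vtt_to_text()` produces it; `timestamp_for_offset` and the sample cues are hypothetical.

```python
import bisect


def timestamp_for_offset(cue_timestamps, pos):
    """Return the start time of the last cue beginning at or before `pos`.

    `cue_timestamps` is a list of (start_seconds, char_offset) tuples,
    sorted by char_offset.
    """
    offsets = [offset for _, offset in cue_timestamps]
    # bisect_right returns the insertion point after any equal offsets,
    # so subtracting 1 lands on the cue that contains `pos`.
    idx = bisect.bisect_right(offsets, pos) - 1
    return cue_timestamps[max(idx, 0)][0]


cues = [(0.0, 0), (4.2, 35), (9.8, 80)]
print(timestamp_for_offset(cues, 40))
```

Offsets past the last cue resolve to the final cue's start time, which matches how the function above treats the tail of a transcript.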
||||
|
||||
def _content_hash(text):
|
||||
"""MD5 hash of text content — same as web_scraper."""
|
||||
return hashlib.md5(text.encode('utf-8')).hexdigest()
|
||||
|
||||
|
||||
def ingest_video(uuid, video_meta, config=None):
|
||||
"""
|
||||
Ingest a single PeerTube video transcript.
|
||||
|
||||
Fetches captions, converts VTT to text, chunks into pages,
|
||||
saves to data/text/{hash}/, and sets status to 'extracted'.
|
||||
|
||||
Args:
|
||||
uuid: Video UUID
|
||||
video_meta: Dict with name, duration, channel_name, channel_display,
|
||||
publishedAt, description
|
||||
config: RECON config dict
|
||||
|
||||
Returns dict with hash, status, title, page_count — or None if no captions.
|
||||
"""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
ptc = _get_pt_config(config)
|
||||
db = StatusDB()
|
||||
|
||||
# Get captions
|
||||
captions = get_captions(uuid, config)
|
||||
if not captions:
|
||||
return None
|
||||
|
||||
# Prefer English caption
|
||||
caption = None
|
||||
for c in captions:
|
||||
if c.get('language', {}).get('id') == 'en':
|
||||
caption = c
|
||||
break
|
||||
if caption is None:
|
||||
caption = captions[0]
|
||||
|
||||
# Fetch VTT
|
||||
vtt_content = fetch_vtt(caption['captionPath'], config)
|
||||
|
||||
# Convert to plain text with timestamp tracking
|
||||
text, cue_timestamps = vtt_to_text(vtt_content)
|
||||
if not text or len(text) < 50:
|
||||
logger.warning(f"Transcript too short for {video_meta['name']} ({uuid}): {len(text)} chars")
|
||||
return None
|
||||
|
||||
# Hash the text content
|
||||
doc_hash = _content_hash(text)
|
||||
|
||||
# Check for duplicate
|
||||
conn = db._get_conn()
|
||||
existing = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (doc_hash,)).fetchone()
|
||||
if existing:
|
||||
doc = db.get_document(doc_hash)
|
||||
existing_status = doc['status'] if doc else existing['status']
|
||||
logger.debug(f"Duplicate transcript (hash {doc_hash[:12]}...) — {video_meta['name']}")
|
||||
return {
|
||||
'hash': doc_hash,
|
||||
'status': 'duplicate',
|
||||
'title': video_meta['name'],
|
||||
'existing_status': existing_status,
|
||||
}
|
||||
|
||||
# Chunk into pages
|
||||
words_per_page = config.get('web_scraper', {}).get('words_per_page', 2000)
|
||||
pages = chunk_text(text, words_per_page)
|
||||
|
||||
# Compute page-to-timestamp mapping
|
||||
page_timestamps = _map_page_timestamps(pages, text, cue_timestamps)
|
||||
|
||||
# Save text files
|
||||
text_dir = os.path.join(config['paths']['text'], doc_hash)
|
||||
os.makedirs(text_dir, exist_ok=True)
|
||||
|
||||
for i, page_text in enumerate(pages, 1):
|
||||
page_file = os.path.join(text_dir, f"page_{i:04d}.txt")
|
||||
with open(page_file, 'w', encoding='utf-8') as f:
|
||||
f.write(page_text)
|
||||
|
||||
# Save meta.json
|
||||
video_url = f"{ptc['public_url']}/w/{uuid}"
|
||||
meta = {
|
||||
'hash': doc_hash,
|
||||
'source_type': 'transcript',
|
||||
'url': video_url,
|
||||
'title': video_meta['name'],
|
||||
'author': video_meta.get('channel_display', ''),
|
||||
'channel': video_meta.get('channel_name', ''),
|
||||
'duration': video_meta.get('duration', 0),
|
||||
'date': video_meta.get('publishedAt', ''),
|
||||
'description': video_meta.get('description', ''),
|
||||
'sitename': 'stream.echo6.co',
|
||||
'page_count': len(pages),
|
||||
'text_length': len(text),
|
||||
'page_timestamps': page_timestamps,
|
||||
'fetched_at': datetime.now(timezone.utc).isoformat(),
|
||||
}
|
||||
with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
|
||||
json.dump(meta, f, indent=2)
|
||||
|
||||
# Display filename for catalogue
|
||||
display_name = re.sub(r'[^\w\s._-]', '', video_meta['name'])[:200].strip()
|
||||
if not display_name:
|
||||
display_name = uuid
|
||||
|
||||
# Add to catalogue
|
||||
db.add_to_catalogue(
|
||||
doc_hash, display_name, video_url,
|
||||
len(text), 'stream.echo6.co', video_meta.get('channel_name', 'unknown')
|
||||
)
|
||||
|
||||
# Queue + advance to extracted
|
||||
db.queue_document(doc_hash)
|
||||
db.update_status(doc_hash, 'extracted',
|
||||
page_count=len(pages),
|
||||
pages_extracted=len(pages),
|
||||
book_title=video_meta['name'],
|
||||
book_author=video_meta.get('channel_display', ''))
|
||||
|
||||
logger.info(
|
||||
f"Ingested transcript: {video_meta['name']} ({uuid[:8]}...) "
|
||||
f"-> {doc_hash[:12]}... ({len(pages)} pages, {len(text)} chars)"
|
||||
)
|
||||
|
||||
return {
|
||||
'hash': doc_hash,
|
||||
'status': 'extracted',
|
||||
'title': video_meta['name'],
|
||||
'page_count': len(pages),
|
||||
'text_length': len(text),
|
||||
'page_timestamps': page_timestamps,
|
||||
'channel': video_meta.get('channel_name', ''),
|
||||
'duration': video_meta.get('duration', 0),
|
||||
'url': video_url,
|
||||
}
|
||||
|
||||
|
||||
def ingest_channel(channel_name, config=None, since=None):
|
||||
"""
|
||||
Ingest all captioned videos from a specific channel.
|
||||
|
||||
Returns summary dict.
|
||||
"""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
ptc = _get_pt_config(config)
|
||||
|
||||
logger.info(f"Ingesting channel: {channel_name}")
|
||||
videos = get_videos(channel=channel_name, since=since, config=config)
|
||||
return _ingest_video_list(videos, config, ptc)
|
||||
|
||||
|
||||
def ingest_all(config=None, since=None):
|
||||
"""
|
||||
Ingest all captioned videos from the entire PeerTube instance.
|
||||
|
||||
Returns summary dict.
|
||||
"""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
ptc = _get_pt_config(config)
|
||||
|
||||
logger.info("Ingesting all PeerTube videos with captions")
|
||||
videos = get_videos(since=since, config=config)
|
||||
return _ingest_video_list(videos, config, ptc)
|
||||
|
||||
|
||||
def _ingest_video_list(videos, config, ptc):
|
||||
"""Process a list of videos — shared logic for ingest_channel and ingest_all."""
|
||||
results = []
|
||||
skipped_no_captions = 0
|
||||
skipped_duplicate = 0
|
||||
failed = 0
|
||||
ingested = 0
|
||||
total_pages = 0
|
||||
|
||||
total = len(videos)
|
||||
logger.info(f"Found {total} videos to check for captions")
|
||||
|
||||
for i, video in enumerate(videos, 1):
|
||||
if _stop_check and _stop_check():
|
||||
logger.info(f"Shutdown requested — stopping after {i-1}/{total} videos")
|
||||
break
|
||||
uuid = video['uuid']
|
||||
|
||||
try:
|
||||
result = ingest_video(uuid, video, config)
|
||||
|
||||
if result is None:
|
||||
skipped_no_captions += 1
|
||||
elif result['status'] == 'duplicate':
|
||||
skipped_duplicate += 1
|
||||
else:
|
||||
ingested += 1
|
||||
total_pages += result.get('page_count', 0)
|
||||
results.append(result)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"[{i}/{total}] Failed: {video['name']} ({uuid}) — {e}")
|
||||
failed += 1
|
||||
|
||||
# Check for shutdown
|
||||
if _stop_check and _stop_check():
|
||||
logger.info(f"Shutdown requested — stopping after {i}/{total} videos")
|
||||
break
|
||||
|
||||
# Rate limit
|
||||
if i < total:
|
||||
time.sleep(ptc['rate_limit_delay'])
|
||||
|
||||
# Progress logging every 50 videos
|
||||
if i % 50 == 0:
|
||||
logger.info(
|
||||
f"Progress: {i}/{total} checked — "
|
||||
f"{ingested} ingested, {skipped_no_captions} no captions, "
|
||||
f"{skipped_duplicate} dupes, {failed} failed"
|
||||
)
|
||||
|
||||
logger.info(
|
||||
f"PeerTube ingestion complete: {ingested} ingested ({total_pages} pages), "
|
||||
f"{skipped_no_captions} no captions, {skipped_duplicate} duplicates, "
|
||||
f"{failed} failed out of {total} videos"
|
||||
)
|
||||
|
||||
return {
|
||||
'results': results,
|
||||
'summary': {
|
||||
'total_checked': total,
|
||||
'ingested': ingested,
|
||||
'skipped_no_captions': skipped_no_captions,
|
||||
'skipped_duplicate': skipped_duplicate,
|
||||
'failed': failed,
|
||||
'total_pages': total_pages,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
def get_instance_stats(config=None):
|
||||
"""Get PeerTube instance statistics for the dashboard."""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
db = StatusDB()
|
||||
|
||||
# Total videos on instance
|
||||
try:
|
||||
data = _api_get("/api/v1/videos", config, params={'count': 1})
|
||||
total_videos = data.get('total', 0)
|
||||
except Exception:
|
||||
total_videos = 0
|
||||
|
||||
# Videos ingested into RECON (from catalogue)
|
||||
conn = db._get_conn()
|
||||
ingested = conn.execute(
|
||||
"SELECT count(*) FROM catalogue WHERE source = 'stream.echo6.co'"
|
||||
).fetchone()[0]
|
||||
|
||||
# Status breakdown
|
||||
status_rows = conn.execute(
|
||||
"SELECT d.status, count(*) as cnt FROM documents d "
|
||||
"JOIN catalogue c ON d.hash = c.hash "
|
||||
"WHERE c.source = 'stream.echo6.co' "
|
||||
"GROUP BY d.status"
|
||||
).fetchall()
|
||||
status_breakdown = {row['status']: row['cnt'] for row in status_rows}
|
||||
|
||||
return {
|
||||
'total_videos': total_videos,
|
||||
'ingested': ingested,
|
||||
'status_breakdown': status_breakdown,
|
||||
}
|
||||
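The caption-selection step in `ingest_video` (prefer English, otherwise fall back to the first caption returned) can be isolated as a small helper. The following is a minimal sketch, not part of the commit: the name `pick_caption` and the `preferred` parameter are hypothetical, the caption paths are placeholders, and the dict layout (`language.id`, `captionPath`) is assumed from the way `ingest_video` reads each entry.

```python
def pick_caption(captions, preferred="en"):
    """Return the caption in the preferred language, else the first one, else None.

    Hypothetical helper: the dict layout mirrors the fields that
    ingest_video reads from each caption entry.
    """
    if not captions:
        return None
    for caption in captions:
        # .get() guards against entries that lack a 'language' key
        if caption.get("language", {}).get("id") == preferred:
            return caption
    # No caption in the preferred language: fall back to the first entry
    return captions[0]


# Placeholder data; real entries come from the PeerTube captions endpoint.
captions = [
    {"language": {"id": "de"}, "captionPath": "/captions/de.vtt"},
    {"language": {"id": "en"}, "captionPath": "/captions/en.vtt"},
]
print(pick_caption(captions)["captionPath"])
```

Pulling the choice into one function would make the fallback rule testable without a PeerTube instance, which the inline loop does not allow.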
508
lib/status.py
Normal file
|
|
@@ -0,0 +1,508 @@
|
|||
"""
|
||||
RECON Status Tracker
|
||||
|
||||
SQLite operations for catalogue and documents tables. WAL mode, thread-local connections.
|
||||
Status flow: catalogued -> queued -> extracting -> extracted -> enriching -> enriched -> embedding -> complete.
|
||||
|
||||
Config: paths.db
|
||||
"""
|
||||
import os
|
||||
import sqlite3
|
||||
import threading
|
||||
from datetime import datetime, timezone
|
||||
|
||||
from .utils import get_config
|
||||
|
||||
_local = threading.local()
|
||||
|
||||
|
||||
class StatusDB:
|
||||
def __init__(self, db_path=None):
|
||||
if db_path is None:
|
||||
db_path = get_config()['paths']['db']
|
||||
self.db_path = db_path
|
||||
os.makedirs(os.path.dirname(os.path.abspath(db_path)), exist_ok=True)
|
||||
self._init_db()
|
||||
|
||||
def _get_conn(self):
|
||||
if not hasattr(_local, 'conn') or _local.conn is None:
|
||||
_local.conn = sqlite3.connect(self.db_path, timeout=30)
|
||||
_local.conn.row_factory = sqlite3.Row
|
||||
_local.conn.execute("PRAGMA journal_mode=WAL")
|
||||
_local.conn.execute("PRAGMA busy_timeout=5000")
|
||||
return _local.conn
|
||||
|
||||
def _init_db(self):
|
||||
conn = self._get_conn()
|
||||
conn.executescript("""
|
||||
CREATE TABLE IF NOT EXISTS catalogue (
|
||||
hash TEXT PRIMARY KEY,
|
||||
filename TEXT NOT NULL,
|
||||
path TEXT NOT NULL,
|
||||
size_bytes INTEGER,
|
||||
source TEXT,
|
||||
category TEXT,
|
||||
status TEXT DEFAULT 'catalogued',
|
||||
discovered_at TEXT DEFAULT CURRENT_TIMESTAMP
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS documents (
|
||||
hash TEXT PRIMARY KEY,
|
||||
filename TEXT NOT NULL,
|
||||
path TEXT,
|
||||
size_bytes INTEGER,
|
||||
page_count INTEGER,
|
||||
book_title TEXT,
|
||||
book_author TEXT,
|
||||
collection TEXT DEFAULT 'survival',
|
||||
status TEXT DEFAULT 'pending',
|
||||
pages_extracted INTEGER DEFAULT 0,
|
||||
concepts_extracted INTEGER DEFAULT 0,
|
||||
vectors_inserted INTEGER DEFAULT 0,
|
||||
discovered_at TEXT DEFAULT CURRENT_TIMESTAMP,
|
||||
extracted_at TEXT,
|
||||
enriched_at TEXT,
|
||||
embedded_at TEXT,
|
||||
error_message TEXT,
|
||||
retry_count INTEGER DEFAULT 0
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS intel (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
source TEXT,
|
||||
timestamp TEXT,
|
||||
region TEXT,
|
||||
category TEXT,
|
||||
content TEXT,
|
||||
summary TEXT,
|
||||
key_facts TEXT,
|
||||
credibility_score REAL,
|
||||
verification_status TEXT,
|
||||
vector_id INTEGER,
|
||||
ingested_at TEXT DEFAULT CURRENT_TIMESTAMP
|
||||
);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS metrics_snapshots (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
timestamp TEXT NOT NULL,
|
||||
metric_type TEXT NOT NULL,
|
||||
data TEXT NOT NULL,
|
||||
UNIQUE(timestamp, metric_type)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_catalogue_status ON catalogue(status);
|
||||
CREATE INDEX IF NOT EXISTS idx_catalogue_source ON catalogue(source);
|
||||
CREATE INDEX IF NOT EXISTS idx_documents_status ON documents(status);
|
||||
""")
|
||||
# Migration: add path_updated_at column if missing
|
||||
try:
|
||||
conn.execute("ALTER TABLE catalogue ADD COLUMN path_updated_at TEXT")
|
||||
except sqlite3.OperationalError:
|
||||
pass # column already exists
|
||||
# Migration: add organized_at column to documents if missing
|
||||
try:
|
||||
conn.execute("ALTER TABLE documents ADD COLUMN organized_at TEXT")
|
||||
except sqlite3.OperationalError:
|
||||
pass # column already exists
|
||||
|
||||
# Stream B: file_operations + duplicate_review tables
|
||||
conn.executescript("""
|
||||
CREATE TABLE IF NOT EXISTS file_operations (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
doc_hash TEXT NOT NULL,
|
||||
operation TEXT NOT NULL,
|
||||
source_path TEXT NOT NULL,
|
||||
target_path TEXT NOT NULL,
|
||||
source_filename TEXT NOT NULL,
|
||||
target_filename TEXT NOT NULL,
|
||||
original_filename TEXT,
|
||||
collision_step INTEGER,
|
||||
qdrant_points_updated INTEGER DEFAULT 0,
|
||||
performed_at TEXT DEFAULT CURRENT_TIMESTAMP,
|
||||
reversed_at TEXT,
|
||||
notes TEXT
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_fileops_hash ON file_operations(doc_hash);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS duplicate_review (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
doc_hash TEXT NOT NULL,
|
||||
original_filename TEXT NOT NULL,
|
||||
sanitized_filename TEXT NOT NULL,
|
||||
collision_with_hash TEXT,
|
||||
collision_path TEXT,
|
||||
duplicate_path TEXT NOT NULL,
|
||||
domain TEXT,
|
||||
subdomain TEXT,
|
||||
book_author TEXT,
|
||||
book_title TEXT,
|
||||
status TEXT DEFAULT 'pending',
|
||||
resolution TEXT,
|
||||
discovered_at TEXT DEFAULT CURRENT_TIMESTAMP,
|
||||
resolved_at TEXT
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_dupreview_status ON duplicate_review(status);
|
||||
""")
|
||||
conn.commit()
|
||||
|
||||
def add_to_catalogue(self, file_hash, filename, path, size_bytes, source, category):
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"""INSERT INTO catalogue (hash, filename, path, size_bytes, source, category)
|
||||
VALUES (?, ?, ?, ?, ?, ?)
|
||||
ON CONFLICT(hash) DO UPDATE SET
|
||||
path = excluded.path,
|
||||
filename = excluded.filename,
|
||||
source = excluded.source,
|
||||
category = excluded.category,
|
||||
path_updated_at = CASE
|
||||
WHEN catalogue.path != excluded.path THEN CURRENT_TIMESTAMP
|
||||
ELSE catalogue.path_updated_at
|
||||
END""",
|
||||
(file_hash, filename, path, size_bytes, source, category)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def queue_document(self, file_hash):
|
||||
conn = self._get_conn()
|
||||
row = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (file_hash,)).fetchone()
|
||||
if not row:
|
||||
return False
|
||||
conn.execute("UPDATE catalogue SET status = 'queued' WHERE hash = ?", (file_hash,))
|
||||
conn.execute(
|
||||
"""INSERT INTO documents (hash, filename, path, size_bytes, status)
|
||||
VALUES (?, ?, ?, ?, 'queued')
|
||||
ON CONFLICT(hash) DO UPDATE SET
|
||||
path = excluded.path,
|
||||
filename = excluded.filename""",
|
||||
(row['hash'], row['filename'], row['path'], row['size_bytes'])
|
||||
)
|
||||
conn.commit()
|
||||
return True
|
||||
|
||||
def update_status(self, file_hash, status, **kwargs):
|
||||
conn = self._get_conn()
|
||||
sets = ["status = ?"]
|
||||
vals = [status]
|
||||
|
||||
ts_field = {
|
||||
'extracted': 'extracted_at',
|
||||
'enriched': 'enriched_at',
|
||||
'complete': 'embedded_at',
|
||||
}.get(status)
|
||||
if ts_field:
|
||||
sets.append(f"{ts_field} = ?")
|
||||
vals.append(datetime.now(timezone.utc).isoformat())
|
||||
|
||||
for k, v in kwargs.items():
|
||||
sets.append(f"{k} = ?")
|
||||
vals.append(v)
|
||||
|
||||
vals.append(file_hash)
|
||||
conn.execute(f"UPDATE documents SET {', '.join(sets)} WHERE hash = ?", vals)
|
||||
conn.commit()
|
||||
|
||||
def get_by_status(self, status, limit=None):
|
||||
conn = self._get_conn()
|
||||
q = "SELECT * FROM documents WHERE status = ? ORDER BY discovered_at"
|
||||
if limit:
|
||||
q += f" LIMIT {int(limit)}"
|
||||
return [dict(r) for r in conn.execute(q, (status,)).fetchall()]
|
||||
|
||||
def get_catalogued(self, source=None, category=None, limit=None):
|
||||
conn = self._get_conn()
|
||||
q = "SELECT * FROM catalogue WHERE status = 'catalogued'"
|
||||
params = []
|
||||
if source:
|
||||
q += " AND source = ?"
|
||||
params.append(source)
|
||||
if category:
|
||||
q += " AND category = ?"
|
||||
params.append(category)
|
||||
q += " ORDER BY discovered_at"
|
||||
if limit:
|
||||
q += f" LIMIT {int(limit)}"
|
||||
return [dict(r) for r in conn.execute(q, params).fetchall()]
|
||||
|
||||
def get_document(self, file_hash):
|
||||
conn = self._get_conn()
|
||||
row = conn.execute("SELECT * FROM documents WHERE hash = ?", (file_hash,)).fetchone()
|
||||
return dict(row) if row else None
|
||||
|
||||
def get_status_counts(self):
|
||||
conn = self._get_conn()
|
||||
cat_counts = {}
|
||||
for row in conn.execute("SELECT status, COUNT(*) as cnt FROM catalogue GROUP BY status"):
|
||||
cat_counts[row['status']] = row['cnt']
|
||||
|
||||
doc_counts = {}
|
||||
for row in conn.execute("SELECT status, COUNT(*) as cnt FROM documents GROUP BY status"):
|
||||
doc_counts[row['status']] = row['cnt']
|
||||
|
||||
return {'catalogue': cat_counts, 'documents': doc_counts}
|
||||
|
||||
def get_failures(self):
|
||||
conn = self._get_conn()
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT * FROM documents WHERE status = 'failed' ORDER BY discovered_at"
|
||||
).fetchall()]
|
||||
|
||||
def mark_failed(self, file_hash, error_msg):
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"UPDATE documents SET status = 'failed', error_message = ? WHERE hash = ?",
|
||||
(str(error_msg)[:1000], file_hash)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def increment_retry(self, file_hash):
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"UPDATE documents SET retry_count = retry_count + 1, status = 'queued', error_message = NULL WHERE hash = ?",
|
||||
(file_hash,)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def get_sources(self):
|
||||
conn = self._get_conn()
|
||||
return [r[0] for r in conn.execute(
|
||||
"SELECT DISTINCT source FROM catalogue ORDER BY source"
|
||||
).fetchall()]
|
||||
|
||||
def get_categories(self, source=None):
|
||||
conn = self._get_conn()
|
||||
if source:
|
||||
return [r[0] for r in conn.execute(
|
||||
"SELECT DISTINCT category FROM catalogue WHERE source = ? ORDER BY category", (source,)
|
||||
).fetchall()]
|
||||
return [r[0] for r in conn.execute(
|
||||
"SELECT DISTINCT category FROM catalogue ORDER BY category"
|
||||
).fetchall()]
|
||||
|
||||
def get_all_documents(self, status=None, source=None, category=None, limit=None, offset=None):
|
||||
conn = self._get_conn()
|
||||
q = """SELECT d.*, c.source, c.category FROM documents d
|
||||
LEFT JOIN catalogue c ON d.hash = c.hash WHERE 1=1"""
|
||||
params = []
|
||||
if status:
|
||||
q += " AND d.status = ?"
|
||||
params.append(status)
|
||||
if source:
|
||||
q += " AND c.source = ?"
|
||||
params.append(source)
|
||||
if category:
|
||||
q += " AND c.category = ?"
|
||||
params.append(category)
|
||||
q += " ORDER BY d.discovered_at DESC"
|
||||
if limit:
|
||||
q += f" LIMIT {int(limit)}"
|
||||
if offset:
|
||||
q += f" OFFSET {int(offset)}"
|
||||
return [dict(r) for r in conn.execute(q, params).fetchall()]
|
||||
|
||||
def count_documents(self, source=None, category=None):
|
||||
"""Count documents matching optional source/category filters."""
|
||||
conn = self._get_conn()
|
||||
q = """SELECT COUNT(*) FROM documents d
|
||||
LEFT JOIN catalogue c ON d.hash = c.hash WHERE 1=1"""
|
||||
params = []
|
||||
if source:
|
||||
q += " AND c.source = ?"
|
||||
params.append(source)
|
||||
if category:
|
||||
q += " AND c.category = ?"
|
||||
params.append(category)
|
||||
return conn.execute(q, params).fetchone()[0]
|
||||
|
||||
def catalogue_count(self):
|
||||
conn = self._get_conn()
|
||||
return conn.execute("SELECT COUNT(*) FROM catalogue").fetchone()[0]
|
||||
|
||||
def source_breakdown(self):
|
||||
conn = self._get_conn()
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT source, COUNT(*) as count, SUM(size_bytes) as total_bytes FROM catalogue GROUP BY source ORDER BY count DESC"
|
||||
).fetchall()]
|
||||
|
||||
def category_breakdown(self, source=None):
|
||||
conn = self._get_conn()
|
||||
if source:
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT category, COUNT(*) as count FROM catalogue WHERE source = ? GROUP BY category ORDER BY count DESC",
|
||||
(source,)
|
||||
).fetchall()]
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT source, category, COUNT(*) as count FROM catalogue GROUP BY source, category ORDER BY source, count DESC"
|
||||
).fetchall()]
|
||||
|
||||
def get_path_updates(self):
|
||||
"""Get catalogue entries where path was updated since last sync."""
|
||||
conn = self._get_conn()
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT hash, filename, path, source, category FROM catalogue "
|
||||
"WHERE path_updated_at IS NOT NULL"
|
||||
).fetchall()]
|
||||
|
||||
def clear_path_update(self, file_hash):
|
||||
"""Clear path_updated_at flag after Qdrant sync."""
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"UPDATE catalogue SET path_updated_at = NULL WHERE hash = ?",
|
||||
(file_hash,)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def sync_document_path(self, file_hash, path, filename):
|
||||
"""Update path and filename in documents table."""
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"UPDATE documents SET path = ?, filename = ? WHERE hash = ?",
|
||||
(path, filename, file_hash)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def status_breakdown(self):
|
||||
conn = self._get_conn()
|
||||
rows = conn.execute(
|
||||
"SELECT status, COUNT(*) as count FROM catalogue GROUP BY status ORDER BY count DESC"
|
||||
).fetchall()
|
||||
return [dict(r) for r in rows]
|
||||
|
||||
def get_unorganized(self, limit=None):
|
||||
"""Get completed documents that haven't been organized yet."""
|
||||
conn = self._get_conn()
|
||||
q = "SELECT hash, filename, path FROM documents WHERE status = 'complete' AND organized_at IS NULL ORDER BY embedded_at"
|
||||
if limit:
|
||||
q += f" LIMIT {int(limit)}"
|
||||
return [dict(r) for r in conn.execute(q).fetchall()]
|
||||
|
||||
|
||||
def get_ingest_pending(self, ingest_dir, limit=50):
|
||||
"""Get completed docs in _ingest/ that haven't been organized."""
|
||||
conn = self._get_conn()
|
||||
pattern = ingest_dir + '%'
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT hash, filename, path FROM documents "
|
||||
"WHERE status = 'complete' AND organized_at IS NULL AND path LIKE ? "
|
||||
"ORDER BY embedded_at LIMIT ?",
|
||||
(pattern, limit)
|
||||
).fetchall()]
|
||||
|
||||
def mark_organized(self, file_hash):
|
||||
"""Mark a document as organized (sets organized_at timestamp)."""
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"UPDATE documents SET organized_at = CURRENT_TIMESTAMP WHERE hash = ?",
|
||||
(file_hash,)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def update_catalogue_path(self, file_hash, new_path, new_filename):
|
||||
"""Update catalogue path/filename and flag for Qdrant sync."""
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"UPDATE catalogue SET path = ?, filename = ?, path_updated_at = CURRENT_TIMESTAMP WHERE hash = ?",
|
||||
(new_path, new_filename, file_hash)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
# ── Stream B: File Operations ───────────────────────────────────
|
||||
|
||||
def log_file_operation(self, doc_hash, operation, source_path, target_path,
|
||||
source_filename, target_filename, original_filename=None,
|
||||
collision_step=None, qdrant_points_updated=0, notes=None):
|
||||
"""Log a file move/rename operation for audit trail and rollback."""
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"""INSERT INTO file_operations
|
||||
(doc_hash, operation, source_path, target_path,
|
||||
source_filename, target_filename, original_filename,
|
||||
collision_step, qdrant_points_updated, notes)
|
||||
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
|
||||
(doc_hash, operation, source_path, target_path,
|
||||
source_filename, target_filename, original_filename,
|
||||
collision_step, qdrant_points_updated, notes)
|
||||
)
|
||||
conn.commit()
|
||||
return conn.execute("SELECT last_insert_rowid()").fetchone()[0]
|
||||
|
||||
def get_file_operations(self, doc_hash=None, limit=50):
|
||||
"""Get file operations, optionally filtered by doc_hash."""
|
||||
conn = self._get_conn()
|
||||
if doc_hash:
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT * FROM file_operations WHERE doc_hash = ? ORDER BY performed_at DESC LIMIT ?",
|
||||
(doc_hash, limit)
|
||||
).fetchall()]
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT * FROM file_operations WHERE reversed_at IS NULL ORDER BY performed_at DESC LIMIT ?",
|
||||
(limit,)
|
||||
).fetchall()]
|
||||
|
||||
def get_file_operation(self, op_id):
|
||||
"""Get a single file operation by ID."""
|
||||
conn = self._get_conn()
|
||||
row = conn.execute("SELECT * FROM file_operations WHERE id = ?", (op_id,)).fetchone()
|
||||
return dict(row) if row else None
|
||||
|
||||
def mark_operation_reversed(self, op_id):
|
||||
"""Mark a file operation as reversed."""
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"UPDATE file_operations SET reversed_at = CURRENT_TIMESTAMP WHERE id = ?",
|
||||
(op_id,)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def queue_duplicate_review(self, doc_hash, original_filename, sanitized_filename,
|
||||
collision_with_hash=None, collision_path=None,
|
||||
duplicate_path='', domain=None, subdomain=None,
|
||||
book_author=None, book_title=None):
|
||||
"""Queue a file for human duplicate review."""
|
||||
conn = self._get_conn()
|
||||
conn.execute(
|
||||
"""INSERT INTO duplicate_review
|
||||
(doc_hash, original_filename, sanitized_filename,
|
||||
collision_with_hash, collision_path, duplicate_path,
|
||||
domain, subdomain, book_author, book_title)
|
||||
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
|
||||
(doc_hash, original_filename, sanitized_filename,
|
||||
collision_with_hash, collision_path, duplicate_path,
|
||||
domain, subdomain, book_author, book_title)
|
||||
)
|
||||
conn.commit()
|
||||
|
||||
def get_duplicate_reviews(self, status='pending', limit=50):
|
||||
"""Get duplicate review queue."""
|
||||
conn = self._get_conn()
|
||||
return [dict(r) for r in conn.execute(
|
||||
"SELECT * FROM duplicate_review WHERE status = ? ORDER BY discovered_at DESC LIMIT ?",
|
||||
(status, limit)
|
||||
).fetchall()]
|
||||
|
||||
def get_pipeline_stats(self):
|
||||
"""Get Stream B pipeline statistics."""
|
||||
conn = self._get_conn()
|
||||
ops = conn.execute(
|
||||
"SELECT operation, COUNT(*) as cnt FROM file_operations WHERE reversed_at IS NULL GROUP BY operation"
|
||||
).fetchall()
|
||||
dupes = conn.execute(
|
||||
"SELECT status, COUNT(*) as cnt FROM duplicate_review GROUP BY status"
|
||||
).fetchall()
|
||||
acquired = 0
|
||||
ingest = 0
|
||||
try:
|
||||
acquired_dir = get_config().get('new_pipeline', {}).get('acquired_dir', '')
|
||||
ingest_dir = get_config().get('new_pipeline', {}).get('ingest_dir', '')
|
||||
if acquired_dir and os.path.isdir(acquired_dir):
|
||||
acquired = len([f for f in os.listdir(acquired_dir) if f.lower().endswith('.pdf')])
|
||||
if ingest_dir and os.path.isdir(ingest_dir):
|
||||
ingest = len([f for f in os.listdir(ingest_dir) if f.lower().endswith('.pdf')])
|
||||
except Exception:
|
||||
pass
|
||||
return {
|
||||
'operations': {r['operation']: r['cnt'] for r in ops},
|
||||
'duplicates': {r['status']: r['cnt'] for r in dupes},
|
||||
'acquired_pending': acquired,
|
||||
'ingest_pending': ingest,
|
||||
}
|
||||
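The upsert in `add_to_catalogue` sets `path_updated_at` only when the stored path differs from the incoming one. A minimal sketch of that behaviour against an in-memory database follows. The helper names `make_catalogue` and `upsert` are hypothetical, the table is reduced to the columns the statement touches, and the sample hash and paths are placeholders. It relies on SQLite evaluating every `SET` expression against the pre-update row, so `catalogue.path` inside the `CASE` is still the old value.

```python
import sqlite3


def make_catalogue():
    """Create an in-memory table with only the columns the upsert touches."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE catalogue ("
        " hash TEXT PRIMARY KEY,"
        " filename TEXT NOT NULL,"
        " path TEXT NOT NULL,"
        " path_updated_at TEXT)"
    )
    return conn


def upsert(conn, file_hash, filename, path):
    """Insert a row; on hash conflict, update it and flag a changed path."""
    conn.execute(
        """INSERT INTO catalogue (hash, filename, path)
           VALUES (?, ?, ?)
           ON CONFLICT(hash) DO UPDATE SET
               path = excluded.path,
               filename = excluded.filename,
               path_updated_at = CASE
                   WHEN catalogue.path != excluded.path THEN CURRENT_TIMESTAMP
                   ELSE catalogue.path_updated_at
               END""",
        (file_hash, filename, path),
    )
    conn.commit()


conn = make_catalogue()
upsert(conn, "abc", "a.pdf", "/lib/a.pdf")        # first insert: flag stays NULL
upsert(conn, "abc", "a.pdf", "/lib/a.pdf")        # same path: flag stays NULL
unchanged = conn.execute("SELECT path_updated_at FROM catalogue").fetchone()[0]
upsert(conn, "abc", "a.pdf", "/lib/moved/a.pdf")  # new path: flag is set
moved = conn.execute("SELECT path, path_updated_at FROM catalogue").fetchone()
print(unchanged, moved["path"], moved["path_updated_at"] is not None)
```

This is the flag that `get_path_updates` later reads and `clear_path_update` resets after the Qdrant sync.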
390
lib/utils.py
Normal file
|
|
@@ -0,0 +1,390 @@
|
|||
"""
|
||||
RECON Utilities
|
||||
|
||||
Content hashing (MD5), config loading (YAML), download URL generation,
|
||||
source/category derivation, logging setup, filename sanitization.
|
||||
|
||||
Config: Loads and caches config.yaml
|
||||
"""
|
||||
import hashlib
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import unicodedata
|
||||
from urllib.parse import quote
|
||||
|
||||
import yaml
|
||||
from logging.handlers import RotatingFileHandler
|
||||
|
||||
_config = None
|
||||
|
||||
|
||||
def get_config():
|
||||
global _config
|
||||
if _config is not None:
|
||||
return _config
|
||||
|
||||
config_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'config.yaml')
|
||||
with open(config_path) as f:
|
||||
_config = yaml.safe_load(f)
|
||||
|
||||
# Load Gemini keys from .env
|
||||
env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), '.env')
|
||||
_config['gemini_keys'] = []
|
||||
if os.path.exists(env_path):
|
||||
with open(env_path) as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line and not line.startswith('#') and '=' in line:
|
||||
key, val = line.split('=', 1)
|
||||
if key.startswith('GEMINI_KEY_') and val != 'PASTE_KEY_HERE':
|
||||
_config['gemini_keys'].append(val)
|
||||
|
||||
return _config
|
||||
|
||||
|
||||
def content_hash(filepath):
|
||||
h = hashlib.md5()
|
||||
with open(filepath, 'rb') as f:
|
||||
for chunk in iter(lambda: f.read(8192), b''):
|
||||
h.update(chunk)
|
||||
return h.hexdigest()
|
||||
|
||||
|
||||
def concept_id(doc_hash, page_num, concept_index):
|
||||
raw = f"{doc_hash}:{page_num}:{concept_index}"
|
||||
h = hashlib.md5(raw.encode()).hexdigest()[:15]
|
||||
return int(h, 16)
|
||||
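`concept_id` keeps the first 15 hex digits of an MD5 digest, which is 60 bits, so the result always fits in a signed 64-bit integer. The sketch below copies the function so that it runs standalone and checks those properties. The sample hash is a placeholder, and the likely purpose (a stable integer point ID for the vector store) is an assumption; the source does not state it.

```python
import hashlib


def concept_id(doc_hash, page_num, concept_index):
    """Derive a stable integer ID from document hash, page, and concept index."""
    raw = f"{doc_hash}:{page_num}:{concept_index}"
    # 15 hex digits = 60 bits, so the value is always below 2**60
    return int(hashlib.md5(raw.encode()).hexdigest()[:15], 16)


sample_hash = "0" * 32  # placeholder document hash
first = concept_id(sample_hash, 1, 0)
second = concept_id(sample_hash, 1, 0)
print(first == second, first < 2**60)
```

Because the ID depends only on its three inputs, re-running enrichment on the same page would regenerate the same IDs rather than create new ones.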
|
||||
|
||||
def setup_logging(name='recon'):
|
||||
config = get_config()
|
||||
log_dir = config['paths']['logs']
|
||||
os.makedirs(log_dir, exist_ok=True)
|
||||
os.makedirs(os.path.join(log_dir, 'errors'), exist_ok=True)
|
||||
|
||||
logger = logging.getLogger(name)
|
||||
if logger.handlers:
|
||||
return logger
|
||||
logger.setLevel(logging.DEBUG)
|
||||
|
||||
fmt = logging.Formatter('%(asctime)s [%(levelname)s] %(name)s: %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
|
||||
|
||||
fh = RotatingFileHandler(os.path.join(log_dir, 'recon.log'), maxBytes=10*1024*1024, backupCount=5)
|
||||
fh.setLevel(logging.DEBUG)
|
||||
fh.setFormatter(fmt)
|
||||
logger.addHandler(fh)
|
||||
|
||||
eh = RotatingFileHandler(os.path.join(log_dir, 'errors', 'errors.log'), maxBytes=5*1024*1024, backupCount=3)
|
||||
eh.setLevel(logging.ERROR)
|
||||
eh.setFormatter(fmt)
|
||||
logger.addHandler(eh)
|
||||
|
||||
ch = logging.StreamHandler()
|
||||
ch.setLevel(logging.INFO)
|
||||
ch.setFormatter(fmt)
|
||||
logger.addHandler(ch)
|
||||
|
||||
return logger
|
||||
|
||||
|
||||
def derive_source_and_category(filepath, library_root):
|
||||
rel = os.path.relpath(filepath, library_root)
|
||||
parts = rel.split(os.sep)
|
||||
source = parts[0] if parts else 'unknown'
|
||||
category = parts[1] if len(parts) > 2 else source
|
||||
return source, category
|
||||
|
||||
|
||||
def clean_filename_to_title(filename):
|
||||
"""Convert a PDF filename into a human-readable title."""
|
||||
# Strip extension
|
||||
name = os.path.splitext(filename)[0]
|
||||
# Remove common PDF download suffixes (with or without parens)
|
||||
name = re.sub(r'[\s_]*\(?\s*PDFDrive\s*\)?\s*_?', '', name, flags=re.IGNORECASE)
|
||||
name = re.sub(r'[\s_]*\(?\s*z-lib\.org\s*\)?\s*_?', '', name, flags=re.IGNORECASE)
|
||||
# Handle military manual prefixes: FM_23_10 -> FM 23-10, ATP_3_21 -> ATP 3-21
|
||||
name = re.sub(
|
||||
r'\b(FM|ATP|TC|TM|AR|STP|GTA|ATTP|FMFRP|ADP|ADRP)[-_](\d+)[-_](\d+)',
|
||||
lambda m: f"{m.group(1)} {m.group(2)}-{m.group(3)}",
|
||||
name
|
||||
)
|
||||
# Fix common abbreviations: U_S -> U.S., etc.
|
||||
name = re.sub(r'(?<![A-Za-z])U[_\s]S(?=[_\s]|$)', 'U.S.', name)
|
||||
# Replace underscores and hyphens with spaces (but not in manual numbers like FM 23-10)
|
||||
name = re.sub(r'(?<!\d)[-_](?!\d)', ' ', name)
|
||||
name = name.replace('_', ' ')
|
||||
# Move bracketed years like [1990] to a trailing "(1990)" suffix
|
||||
year_match = re.search(r'\[(\d{4})\]', name)
|
||||
year_suffix = f" ({year_match.group(1)})" if year_match else ''
|
||||
name = re.sub(r'\s*\[\d{4}\]\s*', ' ', name)
|
||||
# Collapse multiple spaces
|
||||
name = re.sub(r'\s+', ' ', name).strip()
|
||||
# Title-case, but preserve uppercase military abbreviations
|
||||
words = name.split()
|
||||
titled = []
|
||||
for w in words:
|
||||
if w.isupper() and len(w) >= 2:
|
||||
titled.append(w)
|
||||
elif re.match(r'^\d', w):
|
||||
titled.append(w)
|
||||
else:
|
||||
titled.append(w.capitalize() if w.islower() else w)
|
||||
name = ' '.join(titled) + year_suffix
|
||||
name = name.strip()
|
||||
if len(name) < 3:
|
||||
return os.path.splitext(filename)[0]
|
||||
return name
|
||||
|
||||
|
||||
# ── Mojibake fix table ──────────────────────────────────────────────
|
||||
_MOJIBAKE = {
|
||||
'\u00e2\u0080\u0099': "'", # ’ → ' (right single quote)
|
||||
'\u00e2\u0080\u0098': "'", # ‘ → ' (left single quote)
|
||||
'\u00e2\u0080\u009c': '"', # “ → " (left double quote)
|
||||
'\u00e2\u0080\u009d': '"', # â€\x9d → " (right double quote)
|
||||
'\u00e2\u0080\u0093': '-', # â€" → - (en dash)
|
||||
'\u00e2\u0080\u0094': '-', # â€" → - (em dash)
|
||||
'\u00e2\u0080\u00a6': '...', # … → ... (ellipsis)
|
||||
'\u00c3\u00a9': 'e', # é → e (e-acute)
|
||||
'\u00c3\u00a8': 'e', # è → e (e-grave)
|
||||
'\u00c3\u00b6': 'o', # ö → o (o-umlaut)
|
||||
'\u00c3\u00bc': 'u', # ü → u (u-umlaut)
|
||||
'\u00c3\u00a4': 'a', # ä → a (a-umlaut)
|
||||
'\u00c3\u00b1': 'n', # ñ → n (n-tilde)
|
||||
'\u00c3\u00ad': 'i', # Ã\xad → i (i-acute)
|
||||
'\u00c3\u00a1': 'a', # á → a (a-acute)
|
||||
'\u00c3\u00ba': 'u', # ú → u (u-acute)
|
||||
'\u00c3\u00b3': 'o', # ó → o (o-acute)
|
||||
'\u00c2\u00ae': '', # ® → (registered)
|
||||
'\u00c2\u00a9': '', # © → (copyright)
|
||||
'\u00c2\u00ab': '"', # « → " (guillemet left)
|
||||
'\u00c2\u00bb': '"', # » → " (guillemet right)
|
||||
}
|
||||
|
||||
# Pre-compile: replace longer sequences first to avoid partial matches
|
||||
_MOJIBAKE_PATTERN = re.compile(
|
||||
'|'.join(re.escape(k) for k in sorted(_MOJIBAKE.keys(), key=len, reverse=True))
|
||||
)
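The longest-first sort matters because Python's `re` alternation is ordered: the first alternative that matches at a position wins, even if a later one would match more text. In the table shown here no key appears to be a prefix of another, so the sort reads as a defensive measure. A minimal sketch of the failure it guards against, using a hypothetical two-entry table (`table`, `build_pattern`, and `sample` are illustrative names, not part of the module):

```python
import re

# Hypothetical table: the first key is a prefix of the second.
table = {
    '\u00e2\u0080': '-',
    '\u00e2\u0080\u0099': "'",
}


def build_pattern(keys, longest_first):
    """Join escaped keys into one alternation, optionally longest first."""
    ordered = sorted(keys, key=len, reverse=True) if longest_first else list(keys)
    return re.compile('|'.join(re.escape(k) for k in ordered))


sample = 'Soldier\u00e2\u0080\u0099s Guide'

naive = build_pattern(table, longest_first=False)
safe = build_pattern(table, longest_first=True)

# Insertion order puts the short key first, so it wins and strands U+0099.
naive_out = naive.sub(lambda m: table[m.group()], sample)
# Longest-first ordering lets the full three-character sequence match.
safe_out = safe.sub(lambda m: table[m.group()], sample)
```

With the short key listed first, the trailing `\u0099` is left behind in the output; with longest-first ordering the whole sequence is replaced by an apostrophe.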
|
||||
|
||||
|
||||
def sanitize_filename(filename, doc_hash=None):
|
||||
"""Sanitize a PDF filename for cross-platform filesystem safety.
|
||||
|
||||
Six-phase pipeline:
|
||||
1. Strip source-site metadata (Anna's Archive, PDFDrive, z-lib, torrent tags)
|
||||
2. Strip embedded identifiers (ISBN, MD5 hash, z-lib hex suffix)
|
||||
3. Fix character encoding (mojibake, NFKD normalization)
|
||||
4. Normalize structure (military prefixes, period-separated words, underscores)
|
||||
5. Clean characters (Windows-illegal, control chars, collapse whitespace)
|
||||
6. Validate and truncate (120 char max, word-boundary break)
|
||||
|
||||
Args:
|
||||
filename: Original filename (with extension)
|
||||
doc_hash: Optional doc_hash to verify z-lib suffix matches
|
||||
|
||||
Returns:
|
||||
Sanitized filename (with extension preserved)
|
||||
"""
|
||||
stem, ext = os.path.splitext(filename)
|
||||
ext = ext.lower()
|
||||
if not ext:
|
||||
ext = '.pdf'
|
||||
|
||||
# ── Phase 1: Strip source-site metadata ─────────────────────────
|
||||
# Anna's Archive pattern: Title -- Authors -- Edition -- ISBN -- Hash -- Source
|
||||
segments = stem.split(' -- ')
|
||||
if len(segments) >= 3:
|
||||
stem = segments[0]
|
||||
elif len(segments) == 2:
|
||||
second = segments[1]
|
||||
if re.search(r'97[89]\d{10}|[0-9a-f]{32}|(?:19|20)\d{2}|[Aa]nna', second):
|
||||
stem = segments[0]
|
||||
|
||||
# PDFDrive tags
|
||||
stem = re.sub(r'\s*\(\s*PDFDrive\s*\)\s*', ' ', stem, flags=re.IGNORECASE)
|
||||
stem = re.sub(r'\s*_PDFDrive_\s*', ' ', stem, flags=re.IGNORECASE)
|
||||
|
||||
# z-lib tags
|
||||
stem = re.sub(r'\s*\(\s*z-lib\.org\s*\)\s*', ' ', stem, flags=re.IGNORECASE)
|
||||
stem = re.sub(r'\s*_z-lib\.org_\s*', ' ', stem, flags=re.IGNORECASE)
|
||||
|
||||
# Torrent tags in curly braces
|
||||
stem = re.sub(r'\s*\{[A-Za-z0-9]+\}\s*', ' ', stem)
|
||||
|
||||
# ── Phase 2: Strip embedded identifiers ─────────────────────────
|
||||
# ISBN-13 (with optional dashes/spaces)
|
||||
stem = re.sub(r'\s*97[89][\s-]?\d[\s-]?\d{2}[\s-]?\d{5,6}[\s-]?\d\s*', ' ', stem)
|
||||
# ISBN-10 with dashes
|
||||
stem = re.sub(r'\s*\d[\s-]\d{2}[\s-]\d{5,6}[\s-][\dXx]\s*', ' ', stem)
|
||||
# MD5 hashes (32 hex chars, standalone)
|
||||
stem = re.sub(r'\s*\b[0-9a-f]{32}\b\s*', ' ', stem)
|
||||
# z-lib 8-char hex suffix like _4d969c3c
|
||||
if doc_hash:
|
||||
# Only strip if it matches the doc_hash prefix
|
||||
match = re.search(r'_([0-9a-f]{8})$', stem)
|
||||
if match and doc_hash.startswith(match.group(1)):
|
||||
stem = stem[:match.start()]
|
||||
else:
|
||||
# Strip any trailing 8-char hex suffix after underscore
|
||||
stem = re.sub(r'_[0-9a-f]{8}$', '', stem)
|
||||
|
||||
# ── Phase 3: Fix character encoding ─────────────────────────────
|
||||
# Fix known mojibake sequences
|
||||
stem = _MOJIBAKE_PATTERN.sub(lambda m: _MOJIBAKE[m.group()], stem)
|
||||
|
||||
# Leftover partial sequences that slip past the table above
|
||||
stem = stem.replace('\u00e2\u0080', '-') # partial em/en dash mojibake
|
||||
stem = stem.replace('H_', 'H. ') # Anna's Archive initial abbreviation pattern
|
||||
|
||||
# NFKD normalize: decompose accented chars, strip combining marks
|
||||
nfkd = unicodedata.normalize('NFKD', stem)
|
||||
cleaned = []
|
||||
for ch in nfkd:
|
||||
cat = unicodedata.category(ch)
|
||||
if cat.startswith('M'): # combining mark — skip
|
||||
continue
|
||||
if cat.startswith('C') and ch not in (' ', '\t'): # control char — skip
|
||||
continue
|
||||
# Keep ASCII + common punctuation; drop CJK/Cyrillic/etc if not transliteratable
|
||||
cp = ord(ch)
|
||||
if cp < 128:
|
||||
cleaned.append(ch)
|
||||
elif cat.startswith('L') or cat.startswith('N'):
|
||||
# Letter or number outside ASCII — try to keep if Latin-ish
|
||||
if cp < 0x0250: # Latin Extended range
|
||||
cleaned.append(ch)
|
||||
# else: drop CJK, Cyrillic, etc.
|
||||
elif cat.startswith('P') or cat.startswith('S'):
|
||||
# Punctuation/symbol — map to ASCII equivalent
|
||||
if ch in ('\u2018', '\u2019', '\u201a', '\u0060'):
|
||||
cleaned.append("'")
|
||||
elif ch in ('\u201c', '\u201d', '\u201e'):
|
||||
cleaned.append('"')
|
||||
elif ch in ('\u2013', '\u2014', '\u2012'):
|
||||
cleaned.append('-')
|
||||
elif ch == '\u2026':
|
||||
cleaned.append('...')
|
||||
elif ch in ('\u00ab', '\u00bb'):
|
||||
cleaned.append('"')
|
||||
else:
|
||||
cleaned.append(' ')
|
||||
elif cat.startswith('Z'):
|
||||
cleaned.append(' ')
|
||||
stem = ''.join(cleaned)
|
||||
|
||||
# ── Phase 4: Normalize structure ────────────────────────────────
|
||||
# Detect URL-derived filenames — skip aggressive normalization
|
||||
is_url_derived = bool(re.match(r'[a-z0-9-]+\.[a-z]{2,}[_/]', stem))
|
||||
|
||||
if not is_url_derived:
|
||||
# Military manual prefixes: FM_23_10 -> FM 23-10
|
||||
stem = re.sub(
|
||||
r'\b(FM|ATP|TC|TM|AR|STP|GTA|ATTP|FMFRP|ADP|ADRP)[-_](\d+)[-_](\d+)',
|
||||
lambda m: '{} {}-{}'.format(m.group(1), m.group(2), m.group(3)),
|
||||
stem
|
||||
)
|
||||
# Period-separated words (4+ segments = likely word-separated, not abbreviations like U.S.)
|
||||
if stem.count('.') >= 4:
|
||||
stem = re.sub(r'\.(?=[A-Za-z])', ' ', stem)
|
||||
|
||||
# Underscores to spaces (always)
|
||||
stem = stem.replace('_', ' ')
|
||||
|
||||
# ── Phase 5: Clean characters ───────────────────────────────────
|
||||
# Remove Windows-illegal chars and control chars
|
||||
stem = re.sub(r'[<>:"|?*\\\/]', '', stem)
|
||||
stem = re.sub(r'[\x00-\x1f\x7f]', '', stem)
|
||||
|
||||
# Collapse runs of spaces and hyphens (underscores were already converted to spaces in Phase 4)
|
||||
stem = re.sub(r' {2,}', ' ', stem)
|
||||
stem = re.sub(r'-{2,}', '-', stem)
|
||||
|
||||
# Strip leading/trailing dots, spaces, dashes
|
||||
stem = stem.strip('. -')
|
||||
|
||||
# ── Phase 6: Validate and truncate ──────────────────────────────
|
||||
stem = stem.strip()
|
||||
if not stem or len(stem) < 2:
|
||||
stem = 'untitled'
|
||||
|
||||
max_stem = 120 - len(ext)
|
||||
if len(stem) > max_stem:
|
||||
# Break at word boundary
|
||||
truncated = stem[:max_stem]
|
||||
last_space = truncated.rfind(' ')
|
||||
if last_space > max_stem * 0.6:
|
||||
truncated = truncated[:last_space]
|
||||
stem = truncated.rstrip('. -,')
|
||||
|
||||
return stem + ext
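Phase 6 can be read in isolation. The sketch below reimplements only the truncation step as a standalone helper so its behavior is easy to check; `truncate_stem` and `max_total` are illustrative names and not part of the module.

```python
def truncate_stem(stem, ext='.pdf', max_total=120):
    """Cut stem so that stem + ext fits in max_total characters.

    Prefers to break at the last space, but only if that space lies past
    60% of the allowed length; otherwise it performs a hard cut.
    """
    max_stem = max_total - len(ext)
    if len(stem) <= max_stem:
        return stem + ext
    truncated = stem[:max_stem]
    last_space = truncated.rfind(' ')
    if last_space > max_stem * 0.6:
        truncated = truncated[:last_space]
    # Avoid ending the stem on dangling punctuation or whitespace.
    return truncated.rstrip('. -,') + ext


spaced = truncate_stem(' '.join(['word'] * 40))   # 199-character stem
unbroken = truncate_stem('a' * 200)               # no spaces to break on
```

The 60% threshold keeps a single early space from throwing away most of a long title: when no late space exists, the hard cut is used instead.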
|
||||
|
||||
|
||||
def filename_needs_sanitization(filename, doc_hash=None):
|
||||
"""Return True if sanitize_filename() would change the filename."""
|
||||
return sanitize_filename(filename, doc_hash) != filename
|
||||
|
||||
|
||||
def resolve_collisions(entries):
|
||||
"""Resolve filename collisions after sanitization.
|
||||
|
||||
Args:
|
||||
entries: list of dicts, each with 'sanitized_filename', 'proposed_dir', 'hash'
|
||||
|
||||
Returns:
|
||||
Tuple of (entries, collision_count). Entries have collision suffixes applied where needed.
|
||||
Each entry gets 'collision' key (True/False) and possibly updated 'sanitized_filename'.
|
||||
"""
|
||||
from collections import defaultdict
|
||||
|
||||
# Group by (dir, lowercase filename) to find collisions
|
||||
groups = defaultdict(list)
|
||||
for i, e in enumerate(entries):
|
||||
key = (e['proposed_dir'], e['sanitized_filename'].lower())
|
||||
groups[key].append(i)
|
||||
|
||||
collision_count = 0
|
||||
for key, indices in groups.items():
|
||||
if len(indices) <= 1:
|
||||
for i in indices:
|
||||
entries[i]['collision'] = False
|
||||
continue
|
||||
|
||||
# Collision — add hash suffix to all but the first
|
||||
collision_count += len(indices) - 1
|
||||
entries[indices[0]]['collision'] = False
|
||||
|
||||
for i in indices[1:]:
|
||||
e = entries[i]
|
||||
h6 = e['hash'][:6]
|
||||
stem, ext = os.path.splitext(e['sanitized_filename'])
|
||||
new_name = '{} [{}]{}'.format(stem, h6, ext)
|
||||
# Re-check length
|
||||
if len(new_name) > 120:
|
||||
max_stem = 120 - len(ext) - 9 # 9 = len(' [XXXXXX]')
|
||||
stem = stem[:max_stem].rstrip('. -,')
|
||||
new_name = '{} [{}]{}'.format(stem, h6, ext)
|
||||
e['sanitized_filename'] = new_name
|
||||
e['collision'] = True
|
||||
|
||||
return entries, collision_count
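The grouping key lowercases the filename, which presumably targets case-insensitive filesystems where `FM 23-10.pdf` and `fm 23-10.pdf` would overwrite each other. A minimal sketch of just the detection step, with made-up entries (the hashes and directory names are illustrative):

```python
from collections import defaultdict

# Illustrative entries; hashes and directories are made up.
entries = [
    {'sanitized_filename': 'FM 23-10.pdf', 'proposed_dir': 'Military', 'hash': 'a1b2c3d4e5f6'},
    {'sanitized_filename': 'fm 23-10.pdf', 'proposed_dir': 'Military', 'hash': '0f9e8d7c6b5a'},
    {'sanitized_filename': 'FM 23-10.pdf', 'proposed_dir': 'Archive', 'hash': '123456abcdef'},
]

# Same directory plus same case-folded name means a collision.
groups = defaultdict(list)
for entry in entries:
    key = (entry['proposed_dir'], entry['sanitized_filename'].lower())
    groups[key].append(entry)

# Keep only keys shared by more than one entry, with 6-char hash prefixes.
colliding = {
    key: [e['hash'][:6] for e in members]
    for key, members in groups.items()
    if len(members) > 1
}
```

Only the two `Military` entries collide; the `Archive` copy has the same name but lives in a different directory.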
|
||||
|
||||
|
||||
def generate_download_url(filepath, library_root='/mnt/library', base_url='https://files.echo6.co'):
|
||||
"""Generate a download/source URL from a document path.
|
||||
|
||||
For web URLs (http/https): returns the URL directly -- it's already a link.
|
||||
For file paths: converts to files.echo6.co URL.
|
||||
"""
|
||||
if not filepath:
|
||||
return ''
|
||||
|
||||
# Web content -- path IS the source URL
|
||||
if filepath.startswith(('http://', 'https://')):
|
||||
return filepath
|
||||
|
||||
# File content -- convert to files.echo6.co URL
|
||||
rel = os.path.relpath(filepath, library_root)
|
||||
parts = rel.split(os.sep)
|
||||
encoded = '/'.join(quote(p) for p in parts)
|
||||
return f"{base_url}/{encoded}"
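The per-segment quoting is what keeps `/` as a separator while escaping spaces, `#`, and parentheses inside a name. The sketch below restates the file-path branch as a standalone function; it uses `posixpath` so the result does not depend on the host OS, and `https://files.example.org` is a placeholder base URL, not the project's real one.

```python
import posixpath
from urllib.parse import quote


def to_download_url(filepath, library_root='/mnt/library',
                    base_url='https://files.example.org'):
    """Map a library file path to a percent-encoded URL (illustrative)."""
    rel = posixpath.relpath(filepath, library_root)
    # Quote each segment separately so the separators stay literal.
    encoded = '/'.join(quote(part) for part in rel.split('/'))
    return f"{base_url}/{encoded}"


inside = to_download_url('/mnt/library/Medical/Field Guide #2 (2019).pdf')
# A path outside the library root produces '..' segments in the URL.
outside = to_download_url('/tmp/x.pdf')
```

The second call shows a case worth guarding against in callers: a path that is not under `library_root` still yields a URL, but one containing `..` segments.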
|
||||
324
lib/web_scraper.py
Normal file
|
|
@ -0,0 +1,324 @@
|
|||
"""
|
||||
RECON Web Scraper — URL-based content ingestion.
|
||||
|
||||
Fetches web pages, extracts clean text, chunks into pages,
|
||||
and feeds into the standard RECON enrichment pipeline.
|
||||
|
||||
Output format matches lib/extractor.py so the enricher
|
||||
processes web content identically to PDF content.
|
||||
"""
|
||||
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from urllib.parse import urlparse, unquote
|
||||
|
||||
import requests
|
||||
import trafilatura
|
||||
|
||||
from .utils import get_config, setup_logging
|
||||
from .status import StatusDB
|
||||
|
||||
logger = setup_logging('recon.web_scraper')
|
||||
|
||||
# Defaults (overridden by config.yaml web_scraper section)
|
||||
DEFAULT_WORDS_PER_PAGE = 2000
|
||||
DEFAULT_FETCH_TIMEOUT = 30
|
||||
DEFAULT_USER_AGENT = 'RECON/1.0 (Knowledge Extraction Pipeline)'
|
||||
DEFAULT_RATE_LIMIT_DELAY = 1.0
|
||||
|
||||
|
||||
def _get_scraper_config(config=None):
|
||||
"""Get web scraper settings from config, with defaults."""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
ws = config.get('web_scraper', {})
|
||||
return {
|
||||
'words_per_page': ws.get('words_per_page', DEFAULT_WORDS_PER_PAGE),
|
||||
'fetch_timeout': ws.get('fetch_timeout', DEFAULT_FETCH_TIMEOUT),
|
||||
'user_agent': ws.get('user_agent', DEFAULT_USER_AGENT),
|
||||
'rate_limit_delay': ws.get('rate_limit_delay', DEFAULT_RATE_LIMIT_DELAY),
|
||||
'max_batch_size': ws.get('max_batch_size', 50),
|
||||
}
|
||||
|
||||
|
||||
def fetch_url(url, config=None):
|
||||
"""
|
||||
Fetch a URL and extract clean text + metadata using trafilatura.
|
||||
|
||||
Returns dict with: text, title, author, date, description, url,
|
||||
sitename, raw_length, text_length.
|
||||
|
||||
Raises ValueError if fetch or extraction fails.
|
||||
"""
|
||||
sc = _get_scraper_config(config)
|
||||
logger.info(f"Fetching URL: {url}")
|
||||
|
||||
try:
|
||||
response = requests.get(
|
||||
url,
|
||||
headers={'User-Agent': sc['user_agent']},
|
||||
timeout=sc['fetch_timeout'],
|
||||
allow_redirects=True
|
||||
)
|
||||
response.raise_for_status()
|
||||
except requests.RequestException as e:
|
||||
raise ValueError(f"Failed to fetch {url}: {e}") from e
|
||||
|
||||
raw_html = response.text
|
||||
if not raw_html or len(raw_html) < 100:
|
||||
raise ValueError(f"Empty or too-short response from {url}")
|
||||
|
||||
text = trafilatura.extract(
|
||||
raw_html,
|
||||
include_comments=False,
|
||||
include_tables=True,
|
||||
include_links=False,
|
||||
include_images=False,
|
||||
favor_precision=False,
|
||||
deduplicate=True
|
||||
)
|
||||
|
||||
if not text or len(text.strip()) < 50:
|
||||
raise ValueError(f"No meaningful text extracted from {url}")
|
||||
|
||||
metadata = trafilatura.extract_metadata(raw_html)
|
||||
|
||||
result = {
|
||||
'text': text.strip(),
|
||||
'title': '',
|
||||
'author': '',
|
||||
'date': '',
|
||||
'description': '',
|
||||
'url': url,
|
||||
'sitename': '',
|
||||
'raw_length': len(raw_html),
|
||||
'text_length': len(text),
|
||||
}
|
||||
|
||||
if metadata:
|
||||
result['title'] = metadata.title or ''
|
||||
result['author'] = metadata.author or ''
|
||||
result['date'] = metadata.date or ''
|
||||
result['description'] = metadata.description or ''
|
||||
result['sitename'] = metadata.sitename or ''
|
||||
|
||||
if not result['title']:
|
||||
result['title'] = _title_from_url(url)
|
||||
|
||||
logger.info(f"Extracted {result['text_length']} chars from {url} — \"{result['title']}\"")
|
||||
return result
|
||||
|
||||
|
||||
def _title_from_url(url):
|
||||
"""Generate a readable title from a URL as fallback."""
|
||||
parsed = urlparse(url)
|
||||
path = unquote(parsed.path).strip('/')
|
||||
if path:
|
||||
segment = path.split('/')[-1]
|
||||
segment = re.sub(r'[-_]', ' ', segment)
|
||||
segment = re.sub(r'\.\w+$', '', segment)
|
||||
return segment.title() if segment else parsed.netloc
|
||||
return parsed.netloc
|
||||
|
||||
|
||||
def chunk_text(text, words_per_page=DEFAULT_WORDS_PER_PAGE):
|
||||
"""
|
||||
Split text into page-sized chunks for enrichment windows.
|
||||
|
||||
Breaks at paragraph boundaries. Each chunk is ~words_per_page words.
|
||||
Returns list of strings (each is one "page").
|
||||
"""
|
||||
paragraphs = text.split('\n\n')
|
||||
pages = []
|
||||
current_page = []
|
||||
current_words = 0
|
||||
|
||||
for para in paragraphs:
|
||||
para = para.strip()
|
||||
if not para:
|
||||
continue
|
||||
|
||||
para_words = len(para.split())
|
||||
|
||||
if para_words > words_per_page * 1.5:
|
||||
if current_page:
|
||||
pages.append('\n\n'.join(current_page))
|
||||
current_page = []
|
||||
current_words = 0
|
||||
|
||||
sentences = re.split(r'(?<=[.!?])\s+', para)
|
||||
for sentence in sentences:
|
||||
sentence_words = len(sentence.split())
|
||||
if current_words + sentence_words > words_per_page and current_page:
|
||||
pages.append('\n\n'.join(current_page))
|
||||
current_page = [sentence]
|
||||
current_words = sentence_words
|
||||
else:
|
||||
current_page.append(sentence)
|
||||
current_words += sentence_words
|
||||
elif current_words + para_words > words_per_page and current_page:
|
||||
pages.append('\n\n'.join(current_page))
|
||||
current_page = [para]
|
||||
current_words = para_words
|
||||
else:
|
||||
current_page.append(para)
|
||||
current_words += para_words
|
||||
|
||||
if current_page:
|
||||
pages.append('\n\n'.join(current_page))
|
||||
|
||||
if not pages:
|
||||
pages = [text]
|
||||
|
||||
return pages
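The core of the chunker is a greedy packer: paragraphs are appended to the current page until the next one would push it over the word budget. The sketch below shows only that packing rule and deliberately omits the sentence-level fallback for oversized paragraphs; `pack_paragraphs` is an illustrative name.

```python
def pack_paragraphs(text, words_per_page=2000):
    """Greedily pack blank-line-separated paragraphs into pages."""
    pages, current, count = [], [], 0
    for para in text.split('\n\n'):
        para = para.strip()
        if not para:
            continue
        n_words = len(para.split())
        # Flush the page before adding a paragraph that would overflow it.
        if current and count + n_words > words_per_page:
            pages.append('\n\n'.join(current))
            current, count = [], 0
        current.append(para)
        count += n_words
    if current:
        pages.append('\n\n'.join(current))
    return pages


# Five 40-word paragraphs with a 100-word budget.
text = '\n\n'.join(' '.join(['w'] * 40) for _ in range(5))
pages = pack_paragraphs(text, words_per_page=100)
page_sizes = [len(p.split()) for p in pages]
```

Because paragraphs are never split here, a page can fall well short of the budget (80 of 100 words in this example); that is the cost of breaking only at paragraph boundaries.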
|
||||
|
||||
|
||||
def _content_hash(text):
|
||||
"""MD5 hash of text content — same hash type as PDF pipeline."""
|
||||
return hashlib.md5(text.encode('utf-8')).hexdigest()
|
||||
|
||||
|
||||
def _display_filename(url):
|
||||
"""Create a display filename from a URL."""
|
||||
parsed = urlparse(url)
|
||||
name = f"{parsed.netloc}_{parsed.path.strip('/').replace('/', '_')}"
|
||||
name = re.sub(r'[^\w._-]', '_', name)[:200]
|
||||
if not name.endswith('.html'):
|
||||
name += '.html'
|
||||
return name
|
||||
|
||||
|
||||
def ingest_url(url, category='Web', source='web', config=None):
|
||||
"""
|
||||
Full URL ingestion: fetch -> extract -> chunk -> save -> catalogue -> queue as extracted.
|
||||
|
||||
Returns dict with hash, title, page_count, status.
|
||||
Raises ValueError on failure.
|
||||
"""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
sc = _get_scraper_config(config)
|
||||
db = StatusDB()
|
||||
|
||||
# Fetch and extract
|
||||
extracted = fetch_url(url, config)
|
||||
|
||||
# Hash the extracted text content
|
||||
doc_hash = _content_hash(extracted['text'])
|
||||
|
||||
# Check for duplicate in catalogue
|
||||
conn = db._get_conn()
|
||||
existing = conn.execute("SELECT * FROM catalogue WHERE hash = ?", (doc_hash,)).fetchone()
|
||||
if existing:
|
||||
# Also check documents table for status
|
||||
doc = db.get_document(doc_hash)
|
||||
existing_status = doc['status'] if doc else existing['status']
|
||||
logger.info(f"Duplicate content (hash {doc_hash[:12]}...) — already exists as '{existing['filename']}'")
|
||||
return {
|
||||
'hash': doc_hash,
|
||||
'status': 'duplicate',
|
||||
'title': doc.get('book_title', '') if doc else existing['filename'],
|
||||
'existing_status': existing_status,
|
||||
}
|
||||
|
||||
# Chunk into pages
|
||||
pages = chunk_text(extracted['text'], sc['words_per_page'])
|
||||
|
||||
# Save text files in extractor-compatible format:
|
||||
# data/text/{hash}/page_0001.txt, page_0002.txt, ... + meta.json
|
||||
text_dir = os.path.join(config['paths']['text'], doc_hash)
|
||||
os.makedirs(text_dir, exist_ok=True)
|
||||
|
||||
for i, page_text in enumerate(pages, 1):
|
||||
page_file = os.path.join(text_dir, f"page_{i:04d}.txt")
|
||||
with open(page_file, 'w', encoding='utf-8') as f:
|
||||
f.write(page_text)
|
||||
|
||||
meta = {
|
||||
'hash': doc_hash,
|
||||
'source_type': 'web',
|
||||
'url': url,
|
||||
'title': extracted['title'],
|
||||
'author': extracted['author'],
|
||||
'date': extracted['date'],
|
||||
'description': extracted['description'],
|
||||
'sitename': extracted['sitename'],
|
||||
'page_count': len(pages),
|
||||
'text_length': extracted['text_length'],
|
||||
'fetched_at': datetime.now(timezone.utc).isoformat(),
|
||||
}
|
||||
with open(os.path.join(text_dir, 'meta.json'), 'w') as f:
|
||||
json.dump(meta, f, indent=2)
|
||||
|
||||
display_name = _display_filename(url)
|
||||
|
||||
# Add to catalogue
|
||||
db.add_to_catalogue(doc_hash, display_name, url, extracted['text_length'], source, category)
|
||||
|
||||
# Queue (creates documents entry as 'queued')
|
||||
db.queue_document(doc_hash)
|
||||
|
||||
# Advance directly to 'extracted' — text is already saved, skip PDF extraction
|
||||
db.update_status(doc_hash, 'extracted',
|
||||
page_count=len(pages),
|
||||
pages_extracted=len(pages),
|
||||
book_title=extracted['title'],
|
||||
book_author=extracted['author'] or None)
|
||||
|
||||
logger.info(f"Ingested URL: {url} -> {doc_hash[:12]}... ({len(pages)} pages, \"{extracted['title']}\")")
|
||||
|
||||
return {
|
||||
'hash': doc_hash,
|
||||
'status': 'extracted',
|
||||
'title': extracted['title'],
|
||||
'author': extracted['author'],
|
||||
'page_count': len(pages),
|
||||
'url': url,
|
||||
}
|
||||
|
||||
|
||||
def ingest_urls(urls, category='Web', source='web', delay=None, config=None):
|
||||
"""
|
||||
Batch URL ingestion with rate limiting.
|
||||
Returns list of result dicts (one per URL).
|
||||
"""
|
||||
if config is None:
|
||||
config = get_config()
|
||||
if delay is None:
|
||||
delay = _get_scraper_config(config)['rate_limit_delay']
|
||||
|
||||
results = []
|
||||
total = len(urls)
|
||||
|
||||
for i, url in enumerate(urls, 1):
|
||||
url = url.strip()
|
||||
if not url or url.startswith('#'):
|
||||
continue
|
||||
|
||||
logger.info(f"[{i}/{total}] Processing: {url}")
|
||||
|
||||
try:
|
||||
result = ingest_url(url, category=category, source=source, config=config)
|
||||
result['url'] = url
|
||||
results.append(result)
|
||||
except Exception as e:
|
||||
logger.error(f"[{i}/{total}] Failed: {url} — {e}")
|
||||
results.append({
|
||||
'url': url,
|
||||
'status': 'failed',
|
||||
'error': str(e),
|
||||
})
|
||||
|
||||
if i < total and delay > 0:
|
||||
time.sleep(delay)
|
||||
|
||||
succeeded = sum(1 for r in results if r.get('status') not in ('failed', 'duplicate'))
|
||||
failed = sum(1 for r in results if r.get('status') == 'failed')
|
||||
dupes = sum(1 for r in results if r.get('status') == 'duplicate')
|
||||
logger.info(f"Batch complete: {succeeded} new, {dupes} duplicates, {failed} failed out of {total}")
|
||||
|
||||
return results
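One detail of the batch loop: `total` is taken from `len(urls)` before blank lines and `#` comments are skipped, so the `[i/total]` progress prefix and the final "out of" figure can include lines that were never fetched. A possible pre-filter and summary, sketched with illustrative helper names (`clean_url_list`, `summarize`) and placeholder URLs:

```python
from collections import Counter


def clean_url_list(lines):
    """Drop blank lines and '#' comments, mirroring the loop's skip rule."""
    cleaned = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith('#'):
            cleaned.append(line)
    return cleaned


def summarize(results):
    """Count result dicts by their 'status' key."""
    return Counter(r.get('status', 'unknown') for r in results)


urls = clean_url_list([
    'https://example.org/a',
    '',
    '# skipped',
    '  https://example.org/b  ',
])
summary = summarize([
    {'url': urls[0], 'status': 'extracted'},
    {'url': urls[1], 'status': 'duplicate'},
])
```

Filtering first would make `len(urls)` match the number of URLs actually attempted; whether that is worth changing depends on how the log output is consumed.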
|
||||
72
migrate_paths.py
Normal file
|
|
@ -0,0 +1,72 @@
|
|||
#!/usr/bin/env python3
|
||||
"""One-time migration: rescan library to detect moved files and sync paths to Qdrant.
|
||||
|
||||
This rescans all PDFs in the library. The upsert in add_to_catalogue() will
|
||||
detect any files whose paths changed since they were originally catalogued,
|
||||
and flag them with path_updated_at. Then sync_qdrant_paths() propagates
|
||||
those path changes to Qdrant download_url payloads.
|
||||
|
||||
Usage: cd /opt/recon && source venv/bin/activate && python3 migrate_paths.py [--dry-run]
|
||||
"""
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, '/opt/recon')
|
||||
|
||||
from recon import scan_library, sync_qdrant_paths
|
||||
from lib.status import StatusDB
|
||||
from lib.utils import setup_logging
|
||||
|
||||
logger = setup_logging('recon.migrate')
|
||||
|
||||
|
||||
def main():
|
||||
dry_run = '--dry-run' in sys.argv
|
||||
|
||||
db = StatusDB()
|
||||
conn = db._get_conn()
|
||||
|
||||
total_cat = conn.execute("SELECT COUNT(*) FROM catalogue").fetchone()[0]
|
||||
total_docs = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
|
||||
print(f"Before: {total_cat} catalogue entries, {total_docs} documents")
|
||||
|
||||
# Rescan library — upsert will detect and flag path changes
|
||||
print("\nScanning library (this will re-hash all files)...")
|
||||
count = scan_library()
|
||||
print(f"Scanned {count} PDFs")
|
||||
|
||||
# Check how many paths changed
|
||||
updates = db.get_path_updates()
|
||||
print(f"\nDetected {len(updates)} path changes")
|
||||
|
||||
if not updates:
|
||||
print("No paths need syncing — all up to date")
|
||||
return 0
|
||||
|
||||
# Show what changed
|
||||
for row in updates[:20]:
|
||||
print(f" {row['hash'][:8]} {row['filename']}")
|
||||
if len(updates) > 20:
|
||||
print(f" ... and {len(updates) - 20} more")
|
||||
|
||||
if dry_run:
|
||||
print(f"\n[DRY RUN] Would sync {len(updates)} paths to Qdrant. Re-run without --dry-run to apply.")
|
||||
return 0
|
||||
|
||||
# Sync to Qdrant
|
||||
print(f"\nSyncing {len(updates)} paths to Qdrant...")
|
||||
synced = sync_qdrant_paths()
|
||||
print(f"Synced {synced} document paths to Qdrant")
|
||||
|
||||
# Verify
|
||||
remaining = db.get_path_updates()
|
||||
if remaining:
|
||||
print(f"\nWARNING: {len(remaining)} paths still pending (Qdrant sync may have partially failed)")
|
||||
else:
|
||||
print("\nAll paths synced successfully")
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
69
requirements.txt
Normal file
|
|
@ -0,0 +1,69 @@
|
|||
annotated-types==0.7.0
|
||||
anyio==4.12.1
|
||||
babel==2.18.0
|
||||
beautifulsoup4==4.14.3
|
||||
blinker==1.9.0
|
||||
certifi==2026.1.4
|
||||
cffi==2.0.0
|
||||
charset-normalizer==3.4.4
|
||||
click==8.3.1
|
||||
courlan==1.3.2
|
||||
cryptography==46.0.5
|
||||
dateparser==1.3.0
|
||||
Flask==3.1.2
|
||||
google-ai-generativelanguage==0.6.15
|
||||
google-api-core==2.29.0
|
||||
google-api-python-client==2.190.0
|
||||
google-auth==2.48.0
|
||||
google-auth-httplib2==0.3.0
|
||||
google-generativeai==0.8.6
|
||||
googleapis-common-protos==1.72.0
|
||||
grpcio==1.78.0
|
||||
grpcio-status==1.71.2
|
||||
h11==0.16.0
|
||||
h2==4.3.0
|
||||
hpack==4.1.0
|
||||
htmldate==1.9.4
|
||||
httpcore==1.0.9
|
||||
httplib2==0.31.2
|
||||
httpx==0.28.1
|
||||
hyperframe==6.1.0
|
||||
idna==3.11
|
||||
itsdangerous==2.2.0
|
||||
Jinja2==3.1.6
|
||||
jusText==3.0.2
|
||||
lxml==6.0.2
|
||||
lxml_html_clean==0.4.3
|
||||
MarkupSafe==3.0.3
|
||||
numpy==2.4.2
|
||||
packaging==26.0
|
||||
pillow==12.1.1
|
||||
portalocker==3.2.0
|
||||
proto-plus==1.27.1
|
||||
protobuf==5.29.6
|
||||
pyasn1==0.6.2
|
||||
pyasn1_modules==0.4.2
|
||||
pycparser==3.0
|
||||
pydantic==2.12.5
|
||||
pydantic_core==2.41.5
|
||||
pyparsing==3.3.2
|
||||
PyPDF2==3.0.1
|
||||
pytesseract==0.3.13
|
||||
python-dateutil==2.9.0.post0
|
||||
pytz==2025.2
|
||||
PyYAML==6.0.3
|
||||
qdrant-client==1.16.2
|
||||
regex==2026.1.15
|
||||
requests==2.32.5
|
||||
rsa==4.9.1
|
||||
six==1.17.0
|
||||
soupsieve==2.8.3
|
||||
tld==0.13.1
|
||||
tqdm==4.67.3
|
||||
trafilatura==2.0.0
|
||||
typing-inspection==0.4.2
|
||||
typing_extensions==4.15.0
|
||||
tzlocal==5.3.1
|
||||
uritemplate==4.2.0
|
||||
urllib3==2.6.3
|
||||
Werkzeug==3.1.5
|
||||
67
run-pipeline-now.sh
Executable file
|
|
@ -0,0 +1,67 @@
|
|||
#!/bin/bash
|
||||
# RECON Pipeline — Skip scan, run extract + enrich in parallel, then embed
|
||||
# Scan already completed (10,162 catalogued). 6,211 extracted, 3,603 queued.
|
||||
|
||||
set -euo pipefail
|
||||
cd /opt/recon
|
||||
source venv/bin/activate
|
||||
|
||||
LOGDIR="logs"
|
||||
mkdir -p "$LOGDIR"
|
||||
TS=$(date +%Y%m%d_%H%M%S)
|
||||
MAIN_LOG="$LOGDIR/pipeline_${TS}.log"
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$MAIN_LOG"
|
||||
}
|
||||
|
||||
log "=== RECON Pipeline (parallel extract+enrich) ==="
|
||||
log "Skipping scan (already done). Starting extract + enrich concurrently."
|
||||
|
||||
# Reset any stuck docs from previous kill
|
||||
sqlite3 data/recon.db "UPDATE documents SET status='queued' WHERE status='extracting';"
|
||||
sqlite3 data/recon.db "UPDATE documents SET status='extracted' WHERE status='enriching';"
|
||||
sqlite3 data/recon.db "UPDATE documents SET status='enriched' WHERE status='embedding';"
|
||||
|
||||
# Status before
|
||||
log "Before:"
|
||||
sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents GROUP BY status;" | while read -r line; do log " $line"; done
|
||||
|
||||
# Start extract and enrich in parallel
|
||||
log "--- Starting Extract (4 workers) + Enrich (16 workers) ---"
|
||||
|
||||
python3 recon.py extract --workers 4 >> "$LOGDIR/extract_${TS}.log" 2>&1 &
|
||||
EXTRACT_PID=$!
|
||||
log " Extract PID: $EXTRACT_PID"
|
||||
|
||||
sleep 3
|
||||
|
||||
python3 recon.py enrich --workers 16 >> "$LOGDIR/enrich_${TS}.log" 2>&1 &
|
||||
ENRICH_PID=$!
|
||||
log " Enrich PID: $ENRICH_PID"
|
||||
|
||||
# Monitor loop — report progress every 5 minutes
|
||||
while kill -0 $EXTRACT_PID 2>/dev/null || kill -0 $ENRICH_PID 2>/dev/null; do
|
||||
sleep 300
|
||||
STATS=$(sqlite3 data/recon.db "SELECT status, COUNT(*) FROM documents GROUP BY status;" | tr '\n' ' ')
|
||||
log " Progress: $STATS"
|
||||
done
|
||||
|
||||
log " Extract + Enrich finished"
|
||||
|
||||
# Second enrich pass (catch docs extracted during first enrich)
|
||||
REMAINING=$(sqlite3 data/recon.db "SELECT COUNT(*) FROM documents WHERE status='extracted';")
|
||||
if [ "$REMAINING" -gt 0 ]; then
|
||||
log "--- Enrich pass 2: $REMAINING remaining ---"
|
||||
python3 recon.py enrich --workers 16 >> "$LOGDIR/enrich_${TS}.log" 2>&1
|
||||
log " Pass 2 complete"
|
||||
fi
|
||||
|
||||
# Embed
|
||||
log "--- Embed ---"
|
||||
python3 recon.py embed --workers 4 >> "$LOGDIR/embed_${TS}.log" 2>&1
|
||||
log " Embed complete"
|
||||
|
||||
log "=== Pipeline Complete ==="
|
||||
python3 recon.py status 2>&1 | tee -a "$MAIN_LOG"
|
||||
log "Finished: $(date)"
|
||||
0
scripts/__init__.py
Normal file
373
scripts/aa_download.py
Executable file
|
|
@ -0,0 +1,373 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
aa_download.py — Anna's Archive bulk downloader for RECON library acquisition.
|
||||
|
||||
For each target book:
|
||||
1. Searches annas-archive.org for the title + author
|
||||
2. Extracts the best PDF match (verified by author/page count)
|
||||
3. Gets the MD5 from the book page
|
||||
4. Attempts download from Libgen mirrors in order
|
||||
5. Verifies downloaded file is a valid PDF
|
||||
6. Writes full acquisition report
|
||||
|
||||
Usage:
|
||||
python3 /opt/recon/scripts/aa_download.py [--dry-run] [--limit N]
|
||||
|
||||
Report output: ~/projects/recon/aa_acquisition_report.md
|
||||
"""
|
||||
|
||||
import json
|
||||
import time
|
||||
import random
|
||||
import hashlib
|
||||
import logging
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
REPORT_PATH = Path.home() / "projects/recon/aa_acquisition_report.md"
|
||||
LOG_FILE = Path("/opt/recon/logs/aa_download.log")
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s %(levelname)s %(message)s",
|
||||
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
|
||||
)
|
||||
log = logging.getLogger("aa_download")
|
||||
|
||||
SESSION = requests.Session()
|
||||
SESSION.headers.update({
|
||||
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
|
||||
"Accept-Language": "en-US,en;q=0.9",
|
||||
})
|
||||
|
||||
BASE_AA = "https://annas-archive.gl"
|
||||
|
||||
# Download attempt order — try fastest mirrors first
|
||||
LIBGEN_MIRRORS = [
|
||||
"https://libgen.is/get.php?md5={md5}",
|
||||
"https://libgen.rs/get.php?md5={md5}",
|
||||
"https://libgen.st/get.php?md5={md5}",
|
||||
"https://libgen.li/ads.php?md5={md5}",
|
||||
]
|
||||
|
||||
# ── Target book list ──────────────────────────────────────────────────────────
|
||||
TARGETS = [
|
||||
# (title, author, dest_dir)
|
||||
|
||||
# Medical — Herbalism
|
||||
("Medical Herbalism", "David Hoffmann", "Medical/Herbalism"),
|
||||
("Making Plant Medicine", "Richo Cech", "Medical/Herbalism"),
|
||||
("The Earthwise Herbal Volume 1", "Matthew Wood", "Medical/Herbalism"),
|
||||
("The Earthwise Herbal Volume 2", "Matthew Wood", "Medical/Herbalism"),
|
||||
("Herbal Antibiotics", "Stephen Buhner", "Medical/Herbalism"),
|
||||
("Herbal Antivirals", "Stephen Buhner", "Medical/Herbalism"),
|
||||
("The Herbal Medicine-Maker's Handbook", "James Green", "Medical/Herbalism"),
|
||||
("Rosemary Gladstar's Medicinal Herbs", "Rosemary Gladstar", "Medical/Herbalism"),
|
||||
|
||||
# Medical — Austere
|
||||
("Wilderness Medicine", "Paul Auerbach", "Medical/Austere"),
|
||||
("Medicine for Mountaineering", "James Wilkerson", "Medical/Austere"),
|
||||
|
||||
# Medical — Veterinary
|
||||
("The Chicken Health Handbook", "Gail Damerow", "Medical/Veterinary"),
|
||||
("Goat Husbandry", "David Mackenzie", "Medical/Veterinary"),
|
||||
|
||||
# Power Systems
|
||||
("The Renewable Energy Handbook", "William Kemp", "Power"),
|
||||
("Homebrew Wind Power", "Dan Bartmann", "Power"),
|
||||
("Wind Energy Basics", "Paul Gipe", "Power"),
|
||||
("12-Volt Bible", "Brotherton", "Power"),
|
||||
("Wiring a House", "Rex Cauldwell", "Power"),
|
||||
|
||||
# Navigation
|
||||
("Wilderness Navigation", "Bob Burns", "Navigation"),
|
||||
("Be Expert with Map and Compass", "Bjorn Kjellstrom", "Navigation"),
|
||||
("Emergency Navigation", "David Burch", "Navigation"),
|
||||
("The Natural Navigator", "Tristan Gooley", "Navigation"),
|
||||
("The Essential Wilderness Navigator", "David Seidman", "Navigation"),
|
||||
|
||||
# Water Systems
|
||||
("Rainwater Harvesting for Drylands Volume 1", "Brad Lancaster", "Water"),
|
||||
("Rainwater Harvesting for Drylands Volume 2", "Brad Lancaster", "Water"),
|
||||
("Rainwater Harvesting for Drylands Volume 3", "Brad Lancaster", "Water"),
|
||||
("Water Storage", "Art Ludwig", "Water"),
|
||||
("The Home Water Supply", "Stu Campbell", "Water"),
|
||||
|
||||
# Food Systems
|
||||
("The Art of Fermentation", "Sandor Katz", "Food"),
|
||||
("Fermented Vegetables", "Kirsten Shockey", "Food"),
|
||||
("Mastering Artisan Cheesemaking", "Gianaclis Caldwell", "Food"),
|
||||
("Home Cheese Making", "Ricki Carroll", "Food"),
|
||||
("The Art of Natural Cheesemaking", "David Asher", "Food"),
|
||||
|
||||
# Permaculture
|
||||
("Edible Forest Gardens Volume 1", "Dave Jacke", "Permaculture"),
|
||||
("Edible Forest Gardens Volume 2", "Dave Jacke", "Permaculture"),
|
||||
("Creating a Forest Garden", "Martin Crawford", "Permaculture"),
|
||||
("Sepp Holzer's Permaculture", "Sepp Holzer", "Permaculture"),
|
||||
("The Permaculture Handbook", "Peter Bane", "Permaculture"),
|
||||
("The Market Gardener", "Jean-Martin Fortier", "Permaculture"),
|
||||
|
||||
# Scenario / Emergency
|
||||
("SAS Survival Handbook", "John Wiseman", "Scenario"),
|
||||
("Pocket Ref", "Thomas Glover", "Scenario"),
|
||||
("Deep Survival", "Laurence Gonzales", "Scenario"),
|
||||
|
||||
# Foundational Skills
|
||||
("Back to Basics", "Reader's Digest", "Skills"),
|
||||
("A Pattern Language", "Christopher Alexander", "Skills"),
|
||||
]
|
||||
|
||||
BASE_LIB = Path("/mnt/library/Acquired")
|
||||
|
||||
|
||||
def search_aa(title, author):
|
||||
"""Search Anna's Archive and return list of candidate result dicts."""
|
||||
query = f"{title} {author}"
|
||||
url = f"{BASE_AA}/search"
|
||||
params = {"q": query, "ext": "pdf", "lang": "en"}
|
||||
try:
|
||||
r = SESSION.get(url, params=params, timeout=20)
|
||||
r.raise_for_status()
|
||||
except Exception as e:
|
||||
log.warning(f"Search failed for '{title}': {e}")
|
||||
return []
|
||||
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
results = []
|
||||
|
||||
seen_md5 = set()
|
||||
for item in soup.select("a[href^='/md5/']"):
|
||||
href = item.get("href", "")
|
||||
md5 = href.split("/md5/")[-1].split("/")[0].split("?")[0].strip()
|
||||
if not md5 or len(md5) != 32:
|
||||
continue
|
||||
text = item.get_text(" ", strip=True)
|
||||
if not text or md5 in seen_md5:
|
||||
continue
|
||||
seen_md5.add(md5)
|
||||
results.append({"md5": md5, "text": text, "href": href})
|
||||
if len(results) >= 5:
|
||||
break
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def get_book_details(md5):
|
||||
"""Fetch the book detail page and extract useful metadata."""
|
||||
url = f"{BASE_AA}/md5/{md5}"
|
||||
try:
|
||||
r = SESSION.get(url, timeout=20)
|
||||
r.raise_for_status()
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
text = soup.get_text(" ", strip=True)
|
||||
# Extract page count if visible
|
||||
pages = None
|
||||
for word in text.split():
|
||||
if word.isdigit() and 50 < int(word) < 5000:
|
||||
pages = int(word)
|
||||
break
|
||||
return {"pages": pages, "text": text[:500]}
|
||||
except Exception as e:
|
||||
log.warning(f"Detail fetch failed for md5={md5}: {e}")
|
||||
return {}
|
||||
|
||||
|
||||
def try_download(md5, dest_path):
|
||||
"""Try each libgen mirror until one works. Returns True on success."""
|
||||
for mirror_tpl in LIBGEN_MIRRORS:
|
||||
url = mirror_tpl.format(md5=md5)
|
||||
try:
|
||||
r = SESSION.get(url, timeout=60, stream=True, allow_redirects=True)
|
||||
content_type = r.headers.get("content-type", "")
|
||||
if r.status_code != 200:
|
||||
continue
|
||||
# Some mirrors return an HTML ads page before the real file
|
||||
if "text/html" in content_type:
|
||||
# Parse redirect link from ads page
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
dl_link = soup.select_one("a[href*='.pdf']")
|
||||
if not dl_link:
|
||||
dl_link = soup.select_one("a[href*='get.php']")
|
||||
if not dl_link:
|
||||
continue
|
||||
actual_url = dl_link["href"]
|
||||
if not actual_url.startswith("http"):
|
||||
actual_url = f"https://libgen.is{actual_url}"
|
||||
r = SESSION.get(actual_url, timeout=120, stream=True)
|
||||
if r.status_code != 200:
|
||||
continue
|
||||
|
||||
# Stream to disk
|
||||
dest_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(dest_path, "wb") as f:
|
||||
for chunk in r.iter_content(8192):
|
||||
f.write(chunk)
|
||||
|
||||
# Verify it's a real PDF
|
||||
with open(dest_path, "rb") as f:
|
||||
header = f.read(4)
|
||||
if header == b"%PDF":
|
||||
size_mb = dest_path.stat().st_size / 1024 / 1024
|
||||
log.info(f" [OK] {dest_path.name} ({size_mb:.1f}MB) via {url}")
|
||||
return True
|
||||
else:
|
||||
log.warning(f" [BAD] Not a PDF from {url}")
|
||||
dest_path.unlink(missing_ok=True)
|
||||
|
||||
except Exception as e:
|
||||
log.warning(f" Mirror failed {url}: {e}")
|
||||
continue
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def process_book(title, author, subdir, dry_run):
|
||||
"""Full search + download pipeline for one book."""
|
||||
log.info(f"[SEARCH] '{title}' — {author}")
|
||||
result = {
|
||||
"title": title,
|
||||
"author": author,
|
||||
"status": "NOT FOUND",
|
||||
"md5": "",
|
||||
"pages": "",
|
||||
"file": "",
|
||||
"notes": "",
|
||||
}
|
||||
|
||||
candidates = search_aa(title, author)
|
||||
if not candidates:
|
||||
result["notes"] = "No results from AA search"
|
||||
return result
|
||||
|
||||
# Pick best candidate — prefer one whose text contains author name
|
||||
best = None
|
||||
for c in candidates:
|
||||
if author.split()[-1].lower() in c["text"].lower():
|
||||
best = c
|
||||
break
|
||||
if not best:
|
||||
best = candidates[0] # take first result if no author match
|
||||
|
||||
md5 = best["md5"]
|
||||
result["md5"] = md5
|
||||
|
||||
details = get_book_details(md5)
|
||||
result["pages"] = details.get("pages", "")
|
||||
|
||||
if dry_run:
|
||||
result["status"] = "DRY RUN — found"
|
||||
result["notes"] = f"MD5: {md5}"
|
||||
return result
|
||||
|
||||
# Build destination path
|
||||
safe_title = "".join(c if c.isalnum() or c in " ._-" else "_" for c in title)[:60]
|
||||
safe_author = author.split()[-1]
|
||||
filename = f"{safe_title}_{safe_author}.pdf"
|
||||
dest = BASE_LIB / subdir / filename
|
||||
|
||||
if dest.exists():
|
||||
result["status"] = "ALREADY EXISTS"
|
||||
result["file"] = str(dest)
|
||||
return result
|
||||
|
||||
log.info(f" MD5: {md5} — attempting download...")
|
||||
ok = try_download(md5, dest)
|
||||
|
||||
if ok:
|
||||
result["status"] = "DOWNLOADED"
|
||||
result["file"] = str(dest)
|
||||
else:
|
||||
result["status"] = "MD5 ONLY"
|
||||
result["notes"] = f"All mirrors failed. MD5: {md5}"
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def write_report(results):
|
||||
REPORT_PATH.parent.mkdir(parents=True, exist_ok=True)
|
||||
downloaded = [r for r in results if r["status"] == "DOWNLOADED"]
|
||||
md5_only = [r for r in results if r["status"] == "MD5 ONLY"]
|
||||
not_found = [r for r in results if r["status"] == "NOT FOUND"]
|
||||
already_have = [r for r in results if r["status"] == "ALREADY EXISTS"]
|
||||
|
||||
lines = [
|
||||
f"# Anna's Archive Acquisition Report",
|
||||
f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M')}",
|
||||
f"**Total searched:** {len(results)}",
|
||||
f"",
|
||||
f"| Status | Count |",
|
||||
f"|--------|-------|",
|
||||
f"| Downloaded | {len(downloaded)} |",
|
||||
f"| MD5 only (mirrors failed) | {len(md5_only)} |",
|
||||
f"| Not found on AA | {len(not_found)} |",
|
||||
f"| Already in library | {len(already_have)} |",
|
||||
f"",
|
||||
]
|
||||
|
||||
if downloaded:
|
||||
lines += ["## Downloaded", ""]
|
||||
lines += ["| Title | Author | Pages | File |", "|-------|--------|-------|------|"]
|
||||
for r in downloaded:
|
||||
lines.append(f"| {r['title']} | {r['author']} | {r['pages']} | `{Path(r['file']).name}` |")
|
||||
lines.append("")
|
||||
|
||||
if md5_only:
|
||||
lines += ["## Found on AA — Download Failed (use MD5 for manual retrieval)", ""]
|
||||
lines += ["| Title | Author | MD5 | Notes |", "|-------|--------|-----|-------|"]
|
||||
for r in md5_only:
|
||||
lines.append(f"| {r['title']} | {r['author']} | `{r['md5']}` | {r['notes']} |")
|
||||
lines.append("")
|
||||
|
||||
if not_found:
|
||||
lines += ["## Not Found on Anna's Archive", ""]
|
||||
lines += ["| Title | Author | Notes |", "|-------|--------|-------|"]
|
||||
for r in not_found:
|
||||
lines.append(f"| {r['title']} | {r['author']} | {r['notes']} |")
|
||||
lines.append("")
|
||||
|
||||
if already_have:
|
||||
lines += ["## Already in Library", ""]
|
||||
lines += ["| Title | Author |", "|-------|--------|"]
|
||||
for r in already_have:
|
||||
lines.append(f"| {r['title']} | {r['author']} |")
|
||||
lines.append("")
|
||||
|
||||
REPORT_PATH.write_text("\n".join(lines))
|
||||
log.info(f"Report written to {REPORT_PATH}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
parser.add_argument("--limit", type=int, default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
targets = TARGETS[:args.limit] if args.limit else TARGETS
|
||||
log.info(f"Starting AA acquisition: {len(targets)} books | dry_run={args.dry_run}")
|
||||
|
||||
results = []
|
||||
for i, (title, author, subdir) in enumerate(targets, 1):
|
||||
log.info(f"[{i}/{len(targets)}]")
|
||||
result = process_book(title, author, subdir, args.dry_run)
|
||||
results.append(result)
|
||||
log.info(f" -> {result['status']}")
|
||||
# Polite delay between requests
|
||||
time.sleep(random.uniform(8, 15))
|
||||
|
||||
write_report(results)
|
||||
|
||||
print(f"\n-- Summary -----------------------------------------------")
|
||||
for status in ["DOWNLOADED", "MD5 ONLY", "NOT FOUND", "ALREADY EXISTS", "DRY RUN — found"]:
|
||||
count = sum(1 for r in results if r["status"] == status)
|
||||
if count:
|
||||
print(f" {status:<35} {count:>3}")
|
||||
print(f" Report: {REPORT_PATH}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
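The `try_download` function in the script above streams a response to disk and only trusts the result if the file begins with `%PDF`; anything else is deleted. A minimal sketch of that validation step in isolation follows. The helper names `looks_like_pdf` and `keep_if_pdf` are illustrative assumptions and do not appear in the repository.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF"  # leading bytes of a well-formed PDF file


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file at `path` begins with the PDF magic bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        # Missing or unreadable files are treated as "not a PDF".
        return False


def keep_if_pdf(path: Path) -> bool:
    """Delete `path` unless it looks like a PDF; report whether it was kept."""
    if looks_like_pdf(path):
        return True
    path.unlink(missing_ok=True)
    return False
```

A header check of this kind only rejects obviously wrong payloads such as HTML error pages. A truncated PDF would still pass, so a stricter pipeline might also want to open the file with a PDF parser before accepting it.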
478
scripts/aa_download_pass2.py
Executable file
|
|
@ -0,0 +1,478 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
aa_download_pass2.py — Second-pass downloader for books that failed in pass 1.
|
||||
|
||||
Reads the MD5 list from pass 1 report and tries:
|
||||
1. Z-Library search by title/author (separate catalog from Libgen)
|
||||
2. IPFS gateways using AA's IPFS CID (different from MD5 but findable)
|
||||
3. Alternative Libgen mirrors not tried in pass 1
|
||||
4. Direct AA slow download with longer timeout + retry
|
||||
|
||||
Checkpoint: saves progress to /opt/recon/data/aa_pass2_checkpoint.json
|
||||
so interrupted runs resume where they left off.
|
||||
|
||||
Usage:
|
||||
python3 /opt/recon/scripts/aa_download_pass2.py [--dry-run]
|
||||
"""
|
||||
|
||||
import json
|
||||
import time
|
||||
import random
|
||||
import logging
|
||||
import hashlib
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
LOG_FILE = Path("/opt/recon/logs/aa_download_pass2.log")
|
||||
REPORT_IN = Path.home() / "projects/recon/aa_acquisition_report.md"
|
||||
REPORT_OUT = Path.home() / "projects/recon/aa_acquisition_report_pass2.md"
|
||||
CHECKPOINT = Path("/opt/recon/data/aa_pass2_checkpoint.json")
|
||||
BASE_LIB = Path("/mnt/library/Acquired")
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s %(levelname)s %(message)s",
|
||||
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
|
||||
)
|
||||
log = logging.getLogger("aa_pass2")
|
||||
|
||||
SESSION = requests.Session()
|
||||
SESSION.headers.update({
|
||||
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
|
||||
"Accept-Language": "en-US,en;q=0.9",
|
||||
})
|
||||
|
||||
# ── Mirrors to try in order ───────────────────────────────────────────────────
|
||||
MIRRORS = [
|
||||
# Libgen alternatives
|
||||
"https://libgen.li/ads.php?md5={md5}",
|
||||
"https://library.lol/main/{md5}",
|
||||
"https://libgen.rocks/get.php?md5={md5}",
|
||||
# Z-Library direct MD5 endpoint (sometimes works)
|
||||
"https://z-library.se/md5/{md5}",
|
||||
# IPFS public gateways — AA uses IPFS for storage
|
||||
"https://cloudflare-ipfs.com/ipfs/{md5}",
|
||||
"https://ipfs.io/ipfs/{md5}",
|
||||
"https://gateway.pinata.cloud/ipfs/{md5}",
|
||||
]
|
||||
|
||||
# ── Books that failed in pass 1 — title, author, md5, subdir ─────────────────
|
||||
PASS1_FAILURES = [
|
||||
# Medical/Herbalism
|
||||
("The Earthwise Herbal Volume 1", "Matthew Wood", "fc8dc19f5a17f38849a3979830dc95c1", "Medical/Herbalism"),
|
||||
("The Earthwise Herbal Volume 2", "Matthew Wood", "fc8dc19f5a17f38849a3979830dc95c1", "Medical/Herbalism"),
|
||||
("Herbal Antibiotics", "Stephen Buhner", "5839dab78edfdff0d7986fac62b814da", "Medical/Herbalism"),
|
||||
("The Herbal Medicine-Maker's Handbook", "James Green", "27e8e8a3585705ed194029b69c7d61b1", "Medical/Herbalism"),
|
||||
("Rosemary Gladstar's Medicinal Herbs", "Rosemary Gladstar", "9b1966f20a32ab4331bfece167be1dd0", "Medical/Herbalism"),
|
||||
|
||||
# Medical/Austere
|
||||
("Wilderness Medicine", "Paul Auerbach", "957818eaa4ec40527bb05902f9ef7c51", "Medical/Austere"),
|
||||
("Medicine for Mountaineering", "James Wilkerson", "39cb07998f2034206f0c9472e44cb0b4", "Medical/Austere"),
|
||||
|
||||
# Medical/Veterinary
|
||||
("The Chicken Health Handbook", "Gail Damerow", "0ba42fbea034b9a08ec8e2f8d7606efe", "Medical/Veterinary"),
|
||||
|
||||
# Power
|
||||
("The Renewable Energy Handbook", "William Kemp", "475d89fa80aea6c45aa4b1b4b9c5e274", "Power"),
|
||||
("Homebrew Wind Power", "Dan Bartmann", "0578696d5b1b6bceb3e5e3302c1a31aa", "Power"),
|
||||
("Wind Energy Basics", "Paul Gipe", "ccbe9d22e0a5e32d61921d20d66a8e05", "Power"),
|
||||
("12-Volt Bible", "Brotherton", "3f964fa6d730fdf2c3d3e231e87cf692", "Power"),
|
||||
("Wiring a House", "Rex Cauldwell", "5efcb53450e9eb560210eee40678adcf", "Power"),
|
||||
|
||||
# Navigation
|
||||
("Emergency Navigation", "David Burch", "25e4def9e777b3fa9ca935134732ff9d", "Navigation"),
|
||||
|
||||
# Water
|
||||
("Water Storage", "Art Ludwig", "17c965ec15c6cf4f09b5377b599a5266", "Water"),
|
||||
("The Home Water Supply", "Stu Campbell", "9b22677d2f8e8b39f7a6bf032187295b", "Water"),
|
||||
|
||||
# Food
|
||||
("Fermented Vegetables", "Kirsten Shockey", "74d3bde876b4c17be66c21fdfa85213e", "Food"),
|
||||
("The Art of Natural Cheesemaking", "David Asher", "bc0e0829d701fea9beca912d39f8cc74", "Food"),
|
||||
|
||||
# Permaculture
|
||||
("Edible Forest Gardens Volume 1", "Dave Jacke", "6b069c3bb077fdd89d487a363c070fbb", "Permaculture"),
|
||||
("Edible Forest Gardens Volume 2", "Dave Jacke", "699255bfde7f69285c132a94ec291bf4", "Permaculture"),
|
||||
("Creating a Forest Garden", "Martin Crawford", "96d71d70dba31ae86e14845f913e557e", "Permaculture"),
|
||||
("Sepp Holzer's Permaculture", "Sepp Holzer", "32be55a9fce3e31cacd6912069abb410", "Permaculture"),
|
||||
("The Permaculture Handbook", "Peter Bane", "08cb4492739fda4d01b5a868a408e4a0", "Permaculture"),
|
||||
("The Market Gardener", "Jean-Martin Fortier", "ac69f6c8c22305b42b539482dc761c19", "Permaculture"),
|
||||
|
||||
# Scenario
|
||||
("SAS Survival Handbook", "John Wiseman", "fa967fd5fcbeb3c9887e22f73e590c64", "Scenario"),
|
||||
("Pocket Ref", "Thomas Glover", "8e4988ce513a4aa75e7e6c00ee36692b", "Scenario"),
|
||||
("Deep Survival", "Laurence Gonzales", "9a907ab13b81ea597407fffdb8ea1b04", "Scenario"),
|
||||
|
||||
# Skills
|
||||
("A Pattern Language", "Christopher Alexander","7f5cc06b5399b65a278c4005ccd8d476", "Skills"),
|
||||
]
|
||||
|
||||
|
||||
def load_checkpoint():
|
||||
"""Load checkpoint: dict of {title: result_dict} for completed books."""
|
||||
if CHECKPOINT.exists():
|
||||
try:
|
||||
return json.loads(CHECKPOINT.read_text())
|
||||
except Exception:
|
||||
pass
|
||||
return {}
|
||||
|
||||
|
||||
def save_checkpoint(completed):
|
||||
"""Save checkpoint after each book."""
|
||||
CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
|
||||
tmp = str(CHECKPOINT) + ".tmp"
|
||||
with open(tmp, "w") as f:
|
||||
json.dump(completed, f, indent=2)
|
||||
Path(tmp).replace(CHECKPOINT)
|
||||
|
||||
|
||||
def load_md5s_from_report():
|
||||
"""Parse MD5 hashes from pass 1 report to pre-populate PASS1_FAILURES."""
|
||||
if not REPORT_IN.exists():
|
||||
return {}
|
||||
md5_map = {}
|
||||
for line in REPORT_IN.read_text().splitlines():
|
||||
if "`" in line and len(line) > 30:
|
||||
parts = line.split("|")
|
||||
if len(parts) >= 4:
|
||||
title = parts[1].strip()
|
||||
md5_cell = parts[3].strip().strip("`")
|
||||
if len(md5_cell) == 32 and md5_cell.isalnum():
|
||||
md5_map[title.lower()] = md5_cell
|
||||
return md5_map
|
||||
|
||||
|
||||
def search_zlib(title, author):
|
||||
"""Try Z-Library search endpoint."""
|
||||
try:
|
||||
url = "https://z-library.se/s/"
|
||||
params = {"q": f"{title} {author}", "extension[]": "pdf"}
|
||||
r = SESSION.get(url, params=params, timeout=15)
|
||||
if r.status_code != 200:
|
||||
return None
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
# Z-lib book links contain /book/
|
||||
for a in soup.select("a[href*='/book/']")[:3]:
|
||||
href = a.get("href", "")
|
||||
if href:
|
||||
book_url = f"https://z-library.se{href}" if href.startswith("/") else href
|
||||
return book_url
|
||||
except Exception as e:
|
||||
log.debug(f"Zlib search failed: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def try_zlib_download(book_url, dest_path):
|
||||
"""Download from Z-Library book page."""
|
||||
try:
|
||||
r = SESSION.get(book_url, timeout=15)
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
dl = soup.select_one("a.addDownloadedBook, a[href*='/dl/'], a.btn-primary[href*='download']")
|
||||
if not dl:
|
||||
return False
|
||||
dl_url = dl["href"]
|
||||
if not dl_url.startswith("http"):
|
||||
dl_url = f"https://z-library.se{dl_url}"
|
||||
r2 = SESSION.get(dl_url, timeout=120, stream=True)
|
||||
if r2.status_code != 200:
|
||||
return False
|
||||
dest_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(dest_path, "wb") as f:
|
||||
for chunk in r2.iter_content(8192):
|
||||
f.write(chunk)
|
||||
with open(dest_path, "rb") as f:
|
||||
if f.read(4) == b"%PDF":
|
||||
return True
|
||||
dest_path.unlink(missing_ok=True)
|
||||
except Exception as e:
|
||||
log.debug(f"Zlib download failed: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def try_mirrors(md5, dest_path):
|
||||
"""Try all mirrors with the MD5."""
|
||||
import re as _re
|
||||
for tpl in MIRRORS:
|
||||
url = tpl.format(md5=md5)
|
||||
try:
|
||||
r = SESSION.get(url, timeout=20, stream=True, allow_redirects=True)
|
||||
if r.status_code != 200:
|
||||
continue
|
||||
ctype = r.headers.get("content-type", "")
|
||||
if "html" in ctype:
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
# For libgen.li ads page, look for get.php with key
|
||||
dl = None
|
||||
match = _re.search(r'href="(get\.php\?md5=[^"]+)"', r.text)
|
||||
if match:
|
||||
actual = f"https://libgen.li/{match.group(1)}"
|
||||
else:
|
||||
dl = (soup.select_one("a[href*='.pdf']") or
|
||||
soup.select_one("a[href*='get.php']") or
|
||||
soup.select_one("a[href*='/get/']"))
|
||||
if not dl:
|
||||
continue
|
||||
actual = dl["href"]
|
||||
if not actual.startswith("http"):
|
||||
base = url.split("/")[0] + "//" + url.split("/")[2]
|
||||
actual = base + ("/" if not actual.startswith("/") else "") + actual
|
||||
|
||||
r = SESSION.get(actual, timeout=60, stream=True)
|
||||
if r.status_code != 200:
|
||||
continue
|
||||
|
||||
dest_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(dest_path, "wb") as f:
|
||||
for chunk in r.iter_content(8192):
|
||||
f.write(chunk)
|
||||
with open(dest_path, "rb") as f:
|
||||
if f.read(4) == b"%PDF":
|
||||
size_mb = dest_path.stat().st_size / 1024 / 1024
|
||||
log.info(f" [OK] {size_mb:.1f}MB via {url}")
|
||||
return True
|
||||
dest_path.unlink(missing_ok=True)
|
||||
except Exception as e:
|
||||
log.debug(f"Mirror {url} failed: {e}")
|
||||
time.sleep(2)
|
||||
return False
|
||||
|
||||
|
||||
def get_ipfs_cids(md5):
|
||||
"""Fetch IPFS CIDs from AA book detail page."""
|
||||
import re as _re
|
||||
cids = []
|
||||
try:
|
||||
r = SESSION.get(f"https://annas-archive.gl/md5/{md5}", timeout=20)
|
||||
if r.status_code == 200:
|
||||
for m in _re.finditer(r'ipfs_cid[:\s]+([A-Za-z0-9]{46,})', r.text):
|
||||
cids.append(m.group(1))
|
||||
# Also check for CIDs in href attributes
|
||||
for m in _re.finditer(r'ipfs://([A-Za-z0-9]{46,})', r.text):
|
||||
if m.group(1) not in cids:
|
||||
cids.append(m.group(1))
|
||||
except Exception as e:
|
||||
log.debug(f"IPFS CID fetch failed: {e}")
|
||||
return cids
|
||||
|
||||
|
||||
def try_ipfs_download(cids, dest_path):
|
||||
"""Try downloading via IPFS public gateways."""
|
||||
gateways = [
|
||||
"https://cloudflare-ipfs.com/ipfs/{}",
|
||||
"https://dweb.link/ipfs/{}",
|
||||
]
|
||||
for cid in cids[:3]: # limit to first 3 CIDs
|
||||
for gw_tpl in gateways:
|
||||
url = gw_tpl.format(cid)
|
||||
try:
|
||||
r = SESSION.get(url, timeout=15, stream=True)
|
||||
if r.status_code != 200:
|
||||
continue
|
||||
dest_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(dest_path, "wb") as f:
|
||||
for chunk in r.iter_content(8192):
|
||||
f.write(chunk)
|
||||
with open(dest_path, "rb") as f:
|
||||
if f.read(4) == b"%PDF":
|
||||
size_mb = dest_path.stat().st_size / 1024 / 1024
|
||||
log.info(f" [OK] {size_mb:.1f}MB via IPFS {url[:60]}...")
|
||||
return True
|
||||
dest_path.unlink(missing_ok=True)
|
||||
except Exception as e:
|
||||
log.debug(f"IPFS {url} failed: {e}")
|
||||
time.sleep(1)
|
||||
return False
|
||||
|
||||
|
||||
def search_aa_fresh(title, author):
|
||||
"""Fresh AA search on .gl domain for books that weren't found before."""
|
||||
for domain in ["annas-archive.gl", "annas-archive.se", "annas-archive.org"]:
|
||||
try:
|
||||
url = f"https://{domain}/search"
|
||||
params = {"q": f"{title} {author}", "ext": "pdf", "lang": "en"}
|
||||
r = SESSION.get(url, params=params, timeout=15)
|
||||
if r.status_code != 200:
|
||||
continue
|
||||
soup = BeautifulSoup(r.text, "html.parser")
|
||||
for a in soup.select("a[href^='/md5/']"):
|
||||
text = a.get_text(" ", strip=True)
|
||||
if not text:
|
||||
continue
|
||||
md5 = a["href"].split("/md5/")[-1].split("/")[0].strip()
|
||||
if len(md5) == 32:
|
||||
if author.split()[-1].lower() in text.lower() or title.split()[0].lower() in text.lower():
|
||||
return md5
|
||||
except Exception:
|
||||
continue
|
||||
return None
|
||||
|
||||
|
||||
def process_book(title, author, md5_hint, subdir, dry_run):
|
||||
result = {
|
||||
"title": title, "author": author,
|
||||
"status": "NOT FOUND", "md5": md5_hint,
|
||||
"file": "", "notes": "",
|
||||
}
|
||||
|
||||
safe_title = "".join(c if c.isalnum() or c in " ._-" else "_" for c in title)[:60]
|
||||
safe_author = author.split()[-1]
|
||||
dest = BASE_LIB / subdir / f"{safe_title}_{safe_author}.pdf"
|
||||
|
||||
if dest.exists():
|
||||
result["status"] = "ALREADY EXISTS"
|
||||
result["file"] = str(dest)
|
||||
return result
|
||||
|
||||
if dry_run:
|
||||
result["status"] = "DRY RUN"
|
||||
return result
|
||||
|
||||
# 1. Try Z-Library first (different catalog)
|
||||
log.info(f" Trying Z-Library...")
|
||||
zlib_url = search_zlib(title, author)
|
||||
if zlib_url:
|
||||
if try_zlib_download(zlib_url, dest):
|
||||
result["status"] = "DOWNLOADED (Z-Library)"
|
||||
result["file"] = str(dest)
|
||||
return result
|
||||
|
||||
# 2. If no MD5 from pass 1, do a fresh AA search
|
||||
md5 = md5_hint
|
||||
if not md5:
|
||||
log.info(f" Searching AA for fresh MD5...")
|
||||
md5 = search_aa_fresh(title, author)
|
||||
if md5:
|
||||
result["md5"] = md5
|
||||
log.info(f" Found MD5: {md5}")
|
||||
|
||||
# 3. Try IPFS with real CIDs from AA detail page
|
||||
if md5:
|
||||
log.info(f" Fetching IPFS CIDs from AA...")
|
||||
cids = get_ipfs_cids(md5)
|
||||
if cids:
|
||||
log.info(f" Found {len(cids)} IPFS CID(s), trying gateways...")
|
||||
if try_ipfs_download(cids, dest):
|
||||
result["status"] = "DOWNLOADED (IPFS)"
|
||||
result["file"] = str(dest)
|
||||
return result
|
||||
|
||||
# 4. Try all mirrors with MD5
|
||||
if md5:
|
||||
log.info(f" Trying mirrors with MD5 {md5}...")
|
||||
if try_mirrors(md5, dest):
|
||||
result["status"] = "DOWNLOADED (mirror)"
|
||||
result["file"] = str(dest)
|
||||
return result
|
||||
result["status"] = "MD5 ONLY"
|
||||
result["notes"] = f"MD5 confirmed, all mirrors failed: {md5}"
|
||||
else:
|
||||
result["notes"] = "Not found on AA or Z-Library"
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def write_report(results):
|
||||
downloaded = [r for r in results if "DOWNLOADED" in r["status"]]
|
||||
md5_only = [r for r in results if r["status"] == "MD5 ONLY"]
|
||||
not_found = [r for r in results if r["status"] == "NOT FOUND"]
|
||||
existing = [r for r in results if r["status"] == "ALREADY EXISTS"]
|
||||
|
||||
lines = [
|
||||
"# AA Acquisition Report -- Pass 2",
|
||||
f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M')}",
|
||||
f"**Searched:** {len(results)} | **Downloaded:** {len(downloaded)} | "
|
||||
f"**MD5 only:** {len(md5_only)} | **Not found:** {len(not_found)}",
|
||||
"",
|
||||
]
|
||||
if downloaded:
|
||||
lines += ["## Downloaded", "",
|
||||
"| Title | Author | Via | File |",
|
||||
"|-------|--------|-----|------|"]
|
||||
for r in downloaded:
|
||||
lines.append(f"| {r['title']} | {r['author']} | {r['status']} | `{Path(r['file']).name}` |")
|
||||
lines.append("")
|
||||
|
||||
if existing:
|
||||
lines += ["## Already in Library", "",
|
||||
"| Title | Author |",
|
||||
"|-------|--------|"]
|
||||
for r in existing:
|
||||
lines.append(f"| {r['title']} | {r['author']} |")
|
||||
lines.append("")
|
||||
|
||||
if md5_only:
|
||||
lines += ["## MD5 Known -- All Mirrors Failed", "",
|
||||
"| Title | Author | MD5 |",
|
||||
"|-------|--------|-----|"]
|
||||
for r in md5_only:
|
||||
lines.append(f"| {r['title']} | {r['author']} | `{r['md5']}` |")
|
||||
lines.append("")
|
||||
|
||||
if not_found:
|
||||
lines += ["## Not Found Anywhere", "",
|
||||
"| Title | Author | Notes |",
|
||||
"|-------|--------|-------|"]
|
||||
for r in not_found:
|
||||
lines.append(f"| {r['title']} | {r['author']} | {r['notes']} |")
|
||||
lines.append("")
|
||||
|
||||
REPORT_OUT.parent.mkdir(parents=True, exist_ok=True)
|
||||
REPORT_OUT.write_text("\n".join(lines))
|
||||
log.info(f"Report written to {REPORT_OUT}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load any MD5s captured in pass 1
|
||||
md5_map = load_md5s_from_report()
|
||||
targets = []
|
||||
for title, author, md5_hint, subdir in PASS1_FAILURES:
|
||||
md5 = md5_hint or md5_map.get(title.lower(), "")
|
||||
targets.append((title, author, md5, subdir))
|
||||
|
||||
# Load checkpoint
|
||||
completed = load_checkpoint()
|
||||
if completed:
|
||||
log.info(f"Resuming: {len(completed)} books already processed in previous run")
|
||||
|
||||
log.info(f"Pass 2: {len(targets)} books | dry_run={args.dry_run}")
|
||||
results = []
|
||||
for i, (title, author, md5, subdir) in enumerate(targets, 1):
|
||||
# Check checkpoint — skip already-processed books
|
||||
if title in completed and not args.dry_run:
|
||||
result = completed[title]
|
||||
results.append(result)
|
||||
log.info(f"[{i}/{len(targets)}] {title} — SKIPPED (checkpoint: {result['status']})")
|
||||
continue
|
||||
|
||||
log.info(f"[{i}/{len(targets)}] {title} -- {author}")
|
||||
result = process_book(title, author, md5, subdir, args.dry_run)
|
||||
results.append(result)
|
||||
log.info(f" -> {result['status']}")
|
||||
|
||||
# Save checkpoint after each book (not in dry-run)
|
||||
if not args.dry_run:
|
||||
completed[title] = result
|
||||
save_checkpoint(completed)
|
||||
|
||||
time.sleep(random.uniform(6, 12))
|
||||
|
||||
write_report(results)
|
||||
print(f"\n-- Pass 2 Summary ----------------------------------------")
|
||||
for status in ["DOWNLOADED (Z-Library)", "DOWNLOADED (IPFS)", "DOWNLOADED (mirror)", "MD5 ONLY", "NOT FOUND", "ALREADY EXISTS", "DRY RUN"]:
|
||||
count = sum(1 for r in results if r["status"] == status)
|
||||
if count:
|
||||
print(f" {status:<35} {count:>3}")
|
||||
print(f" Report: {REPORT_OUT}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
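The `save_checkpoint` and `load_checkpoint` functions above implement a resume pattern: progress is written to a temporary file, renamed over the real checkpoint, and reloaded on the next run so finished items are skipped. The sketch below shows the same pattern on generic work items. The `run_with_checkpoint` driver and the `os.fsync` call are assumptions added for illustration and are not part of the original script.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return saved progress, or an empty dict if none exists or it is corrupt."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_checkpoint(path: Path, completed: dict) -> None:
    """Write progress to a temp file, then atomically rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w") as f:
        json.dump(completed, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # assumption: push bytes to disk before the rename
    os.replace(tmp, path)  # readers see either the old or the new file, never a partial one


def run_with_checkpoint(keys, path: Path, work) -> dict:
    """Call work(key) for every key not already recorded in the checkpoint."""
    completed = load_checkpoint(path)
    for key in keys:
        if key in completed:
            continue  # finished in an earlier run
        completed[key] = work(key)
        save_checkpoint(path, completed)  # persist after every item
    return completed
```

Saving after every item keeps the worst-case loss to a single unit of work, at the cost of rewriting the whole JSON file each time. That trade-off is reasonable for a few dozen entries but would need revisiting for much larger runs.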
64
scripts/backup.sh
Executable file
|
|
@ -0,0 +1,64 @@
|
|||
#!/bin/bash
# RECON Backup Script
# Backs up the precious data: concept JSONs, text extracts, SQLite DB
# Qdrant is NOT backed up; it is rebuilt from JSONs via `recon rebuild`
# Destination: Contabo VPS (100.64.0.1) via rsync+SSH

set -euo pipefail

RECON_DIR="/opt/recon"
DATA_DIR="$RECON_DIR/data"
LOG_FILE="$RECON_DIR/logs/backup.log"
DATE=$(date +%Y%m%d_%H%M%S)

BACKUP_HOST="root@100.64.0.1"
BACKUP_BASE="/opt/backups/recon"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

mkdir -p "$RECON_DIR/logs"

log "=== RECON Backup Starting ==="

# ── 1. SQLite DB (small, fast, critical) ──
log "Backing up recon.db..."
LOCAL_DB_BACKUP="/tmp/recon_${DATE}.db"
# Remove the temporary snapshot even if a later step aborts under `set -e`
trap 'rm -f "$LOCAL_DB_BACKUP"' EXIT
sqlite3 "$DATA_DIR/recon.db" ".backup '$LOCAL_DB_BACKUP'"
rsync -az "$LOCAL_DB_BACKUP" "$BACKUP_HOST:$BACKUP_BASE/recon_${DATE}.db"
rm -f "$LOCAL_DB_BACKUP"
# Keep last 7 daily DB backups on remote
ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/recon_*.db 2>/dev/null | tail -n +8 | xargs rm -f 2>/dev/null || true"
log " recon.db backed up"

# ── 2. Concept JSONs (THE PRECIOUS DATA: $130+ of Gemini work) ──
log "Syncing concept JSONs..."
rsync -az --delete "$DATA_DIR/concepts/" "$BACKUP_HOST:$BACKUP_BASE/concepts/"
CONCEPT_COUNT=$(find "$DATA_DIR/concepts/" -name "*.json" 2>/dev/null | wc -l)
log " concepts synced ($CONCEPT_COUNT JSON files)"

# ── 3. Text extracts (regenerable but expensive in time) ──
log "Syncing text extracts..."
rsync -az --delete "$DATA_DIR/text/" "$BACKUP_HOST:$BACKUP_BASE/text/"
# find also lists the text/ directory itself, hence the "- 1" below
TEXT_COUNT=$(find "$DATA_DIR/text/" -maxdepth 1 -type d 2>/dev/null | wc -l)
log " text synced ($((TEXT_COUNT - 1)) document dirs)"

# ── 4. Intel feeds ──
if [ -d "$DATA_DIR/intel" ]; then
    log "Syncing intel feeds..."
    rsync -az --delete "$DATA_DIR/intel/" "$BACKUP_HOST:$BACKUP_BASE/intel/"
    log " intel synced"
fi

# ── 5. Config files ──
log "Backing up config..."
rsync -az "$RECON_DIR/config.yaml" "$BACKUP_HOST:$BACKUP_BASE/config_${DATE}.yaml"
rsync -az "$RECON_DIR/.env" "$BACKUP_HOST:$BACKUP_BASE/env_${DATE}" 2>/dev/null || true
# Keep last 3 config and env snapshots on remote
ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/config_*.yaml 2>/dev/null | tail -n +4 | xargs rm -f 2>/dev/null || true"
ssh "$BACKUP_HOST" "ls -t $BACKUP_BASE/env_* 2>/dev/null | tail -n +4 | xargs rm -f 2>/dev/null || true"
log " config backed up"

# ── Summary ──
BACKUP_SIZE=$(ssh "$BACKUP_HOST" "du -sh $BACKUP_BASE" | cut -f1)
log "=== Backup Complete: $BACKUP_SIZE on Contabo ==="
|
||||
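The retention step in `backup.sh` uses `ls -t ... | tail -n +8 | xargs rm -f`: files are listed newest first, the first seven lines are skipped, and everything from the eighth entry onward is deleted. The Python sketch below mirrors that rule for a local directory so the selection logic is easier to follow. The function name `prune_old_backups` is hypothetical, and the real script runs its pruning on the remote host over SSH rather than locally.

```python
from pathlib import Path
from typing import List


def prune_old_backups(directory: Path, pattern: str, keep: int) -> List[Path]:
    """Delete all but the `keep` most recently modified files matching `pattern`.

    Returns the list of paths that were removed.
    """
    # Newest first, the same ordering `ls -t` produces.
    files = sorted(
        directory.glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    stale = files[keep:]  # equivalent to `tail -n +(keep + 1)`
    for path in stale:
        path.unlink()
    return stale
```

Called as `prune_old_backups(Path("/opt/backups/recon"), "recon_*.db", keep=7)`, this would apply the same "last 7" policy the shell pipeline describes. Ordering by modification time rather than by the timestamp in the filename matches `ls -t`, but it also means a restored or touched file can outlive newer snapshots.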
449
scripts/cleanup_outliers.py
Executable file
|
|
@ -0,0 +1,449 @@
|
|||
#!/usr/bin/env python3
"""
cleanup_outliers.py — Three-pass cleanup of RECON concept data.

Pass 1: Remap ~160 non-canonical domain strings in concept JSONs + Qdrant payloads
Pass 2: Re-enrich 434 concepts with empty domain arrays via Gemini
Pass 3: Purge junk/noise URLs from Qdrant + SQLite DB

Usage:
    python3 /opt/recon/scripts/cleanup_outliers.py [--dry-run] [--skip-pass N]
"""

import json
import time
import random
import logging
import argparse
import threading
import sqlite3
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict

import google.generativeai as genai
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, MatchAny, Filter

import sys, os
sys.path.insert(0, '/opt/recon')
from lib.utils import get_config, setup_logging

LOG_FILE = Path("/opt/recon/logs/cleanup_outliers.log")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)
log = logging.getLogger("cleanup_outliers")

CONCEPTS_DIR = Path("/opt/recon/data/concepts")
DB_PATH = Path("/opt/recon/data/recon.db")

CANONICAL_DOMAINS = {
    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
    "Foundational Skills", "Communications", "Medical", "Food Systems",
    "Navigation", "Logistics", "Power Systems", "Leadership",
    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
}

# Non-canonical → canonical remap
OUTLIER_MAP = {
    "Zoology": "Sustainment Systems",
    "Botany": "Sustainment Systems",
    "Nature Lore": "Sustainment Systems",
    "Ecology": "Sustainment Systems",
    "Navigational Astronomy": "Navigation",
    "Troubleshooting": "Foundational Skills",
    "Chemistry": "Foundational Skills",
    "Metallurgy": "Foundational Skills",
    "Weird Science": "Foundational Skills",
    "Philosophy of physics": "Foundational Skills",
    "Physics": "Foundational Skills",
    "Cell biology": "Foundational Skills",
    "Economics": "Leadership",
    "Business": "Leadership",
    "Safety": "Security",
    "Law Enforcement": "Security",
    "Security & Intelligence": "Security",
    "Fire Weather": "Scenario Playbooks",
    "Legal": "Leadership",
    # Discard — replace with closest real domain
    "Site News": "Foundational Skills",
    "Paleogeography": "Foundational Skills",
    "Chemical Manipulation": "Foundational Skills",
}

# Junk URL patterns — pages with no knowledge value
JUNK_URL_PATTERNS = [
    # rocketstoves.com nav/template garbage
    "rocketstoves.com/favicon",
    "rocketstoves.com/cropped-favicon",
    "rocketstoves.com/layouts/",
    "rocketstoves.com/sample",
    "rocketstoves.com/templates/",
    "rocketstoves.com/hello-world",
    "rocketstoves.com/blog-forthcoming",
    "rocketstoves.com/contact",
    "rocketstoves.com/acknowledgements",
    "rocketstoves.com/ja3",
    "rocketstoves.com/juxtapositions",
    "rocketstoves.com/no-name-soi",
    "rocketstoves.com/big4",
    "rocketstoves.com/roof",
    "rocketstoves.com/rmh_dloadcover",
    "rocketstoves.com/pedcover",
    "rocketstoves.com/laundry-to-landscape",
    "rocketstoves.com/barreloven",
    # NRCS calendar/event noise
    "nrcs.usda.gov/events/",
    "nrcs.usda.gov/state-offices/massachusetts",
    "nrcs.usda.gov/state-offices/nebraska",
    "nrcs.usda.gov/state-offices/oklahoma",
    "nrcs.usda.gov/state-offices/utah",
    "nrcs.usda.gov/conservation-basics/natural-resource-concerns/soil/western-call-for-abstracts",
    # deeranddeerhunting trophy hunt videos (no knowledge value)
    "deeranddeerhunting.com/trophy-whitetails-exclusive-videos/",
    # eattheweeds non-content pages
    "eattheweeds.com/media-interviews-with-green-deane",
    "eattheweeds.com/motorcycles-and-mushrooms",
    "eattheweeds.com/sunny-savage",
    # foragersharvest nav pages
    "foragersharvest.com/contact",
    "foragersharvest.com/podcasts",
    # motherearthnews classifieds/nav
    "motherearthnews.com/classifieds/",
    "motherearthnews.com/biographies/",
]

CLASSIFY_PROMPT = """\
Classify this knowledge concept into one or more domains.

VALID DOMAINS (use ONLY these exact strings):
Defense & Tactics, Sustainment Systems, Off-Grid Systems, Foundational Skills,
Communications, Medical, Food Systems, Navigation, Logistics, Power Systems,
Leadership, Scenario Playbooks, Water Systems, Security, Community Coordination

Concept title: {title}
Concept tags: {subdomain}
Concept preview: {content}

Return ONLY valid JSON, no markdown:
{{"domain": ["Domain Name"]}}

Rules:
- Never return empty domain list
- Medical content, herbs, first aid, veterinary → Medical
- Food growing, foraging, hunting, livestock → Sustainment Systems
- Food preservation, canning, storage → Food Systems
- Solar, wind, batteries, generators → Power Systems
- Water sourcing, filtration, sanitation → Water Systems
"""

def load_gemini_keys():
    keys = []
    for line in Path("/opt/recon/.env").read_text().splitlines():
        if line.startswith("GEMINI_KEY_"):
            keys.append(line.split("=", 1)[1].strip())
    return keys

class KeyRotator:
    def __init__(self, keys):
        self.keys = keys
        self._i = 0
        self._lock = threading.Lock()
    def next(self):
        with self._lock:
            key = self.keys[self._i % len(self.keys)]
            self._i += 1
            return key

def classify_concept(title, subdomains, content, key):
    prompt = CLASSIFY_PROMPT.format(
        title=title or "(untitled)",
        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
        content=str(content)[:300] if content else "(none)",
    )
    genai.configure(api_key=key)
    model = genai.GenerativeModel(
        "gemini-2.0-flash",
        generation_config={"response_mime_type": "application/json"}
    )
    for attempt in range(4):
        try:
            resp = model.generate_content(prompt)
            data = json.loads(resp.text)
            domains = [d for d in data.get("domain", []) if d in CANONICAL_DOMAINS]
            if domains:
                return domains
        except Exception as e:
            err = str(e).lower()
            if any(s in err for s in ["429", "quota", "rate", "503"]):
                time.sleep(min(5 * (2 ** attempt) + random.uniform(0, 3), 60))
            else:
                break
    return ["Foundational Skills"]

# ── PASS 1: Remap outlier domains ────────────────────────────────────────────

def remap_concept_domains(domains):
    """Remap any outlier domain names in a domain list."""
    result = set()
    changed = False
    for d in domains:
        if d in CANONICAL_DOMAINS:
            result.add(d)
        elif d in OUTLIER_MAP:
            result.add(OUTLIER_MAP[d])
            changed = True
        else:
            changed = True  # drop unknown
    return list(result), changed

def pass1_remap_outliers(qdrant, collection, dry_run):
    log.info("=== PASS 1: Remapping non-canonical outlier domains ===")
    outlier_names = list(OUTLIER_MAP.keys())
    stats = defaultdict(int)

    # Scroll through Qdrant finding affected vectors
    offset = None
    affected_points = []

    while True:
        results, offset = qdrant.scroll(
            collection_name=collection,
            scroll_filter=Filter(
                must=[FieldCondition(
                    key="domain",
                    match=MatchAny(any=outlier_names)
                )]
            ),
            limit=500,
            with_payload=True,
            with_vectors=False,
            offset=offset,
        )
        affected_points.extend(results)
        if offset is None:
            break

    log.info(f"Found {len(affected_points)} Qdrant points with outlier domains")

    for point in affected_points:
        payload = point.payload
        old_domains = payload.get("domain", [])
        if isinstance(old_domains, str):
            old_domains = [old_domains]

        new_domains, changed = remap_concept_domains(old_domains)
        if not new_domains:
            new_domains = ["Foundational Skills"]

        if changed:
            stats["qdrant_updated"] += 1
            if not dry_run:
                qdrant.set_payload(
                    collection_name=collection,
                    payload={"domain": new_domains},
                    points=[point.id],
                )

    # Also fix concept JSON files on disk
    json_fixed = 0
    for window_file in CONCEPTS_DIR.rglob("window_*.json"):
        try:
            with open(window_file, "r", encoding="utf-8") as f:
                concepts = json.load(f)
        except Exception:
            continue

        if not isinstance(concepts, list):
            continue

        file_changed = False
        for concept in concepts:
            if not isinstance(concept, dict):
                continue
            raw = concept.get("domain", [])
            if isinstance(raw, str):
                raw = [raw]
            new, changed = remap_concept_domains(raw)
            if changed:
                concept["domain"] = new if new else ["Foundational Skills"]
                file_changed = True

        if file_changed:
            json_fixed += 1
            if not dry_run:
                with open(window_file, "w", encoding="utf-8") as f:
                    json.dump(concepts, f, indent=2, ensure_ascii=False)

    log.info(f"Pass 1 complete: {stats['qdrant_updated']} Qdrant points updated, {json_fixed} JSON files updated")
    return stats

# ── PASS 2: Re-enrich empty domain concepts ──────────────────────────────────

def pass2_empty_domains(qdrant, collection, key_rotator, dry_run):
    log.info("=== PASS 2: Re-enriching empty domain concepts ===")
    stats = defaultdict(int)

    # Find empty domain points in Qdrant
    offset = None
    empty_points = []
    while True:
        results, offset = qdrant.scroll(
            collection_name=collection,
            limit=500,
            with_payload=True,
            with_vectors=False,
            offset=offset,
        )
        for r in results:
            d = r.payload.get("domain", [])
            # `not d` already covers None, "" and []; [""] is the only extra case
            if not d or d == [""]:
                empty_points.append(r)
        if offset is None:
            break

    log.info(f"Found {len(empty_points)} points with empty domains")

    for point in empty_points:
        payload = point.payload
        title = payload.get("title", "")
        subdomains = payload.get("subdomain", [])
        content = payload.get("content", payload.get("summary", ""))

        # NOTE: the Gemini call is made in dry-run mode too; only the writes are skipped
        key = key_rotator.next()
        new_domains = classify_concept(title, subdomains, content, key)
        stats["classified"] += 1

        if not dry_run:
            qdrant.set_payload(
                collection_name=collection,
                payload={"domain": new_domains},
                points=[point.id],
            )

        # Also update the concept JSON on disk
        doc_hash = payload.get("doc_hash", "")
        if doc_hash:
            doc_concepts_dir = CONCEPTS_DIR / doc_hash
            if doc_concepts_dir.exists():
                for wf in doc_concepts_dir.glob("window_*.json"):
                    try:
                        with open(wf, "r", encoding="utf-8") as f:
                            concepts = json.load(f)
                        changed = False
                        for c in concepts:
                            if isinstance(c, dict) and c.get("title") == title:
                                d = c.get("domain", [])
                                if not d:
                                    c["domain"] = new_domains
                                    changed = True
                        if changed and not dry_run:
                            with open(wf, "w", encoding="utf-8") as f:
                                json.dump(concepts, f, indent=2, ensure_ascii=False)
                    except Exception:
                        pass

        time.sleep(0.05)

    log.info(f"Pass 2 complete: {stats['classified']} concepts re-classified")
    return stats

# ── PASS 3: Purge junk URLs ──────────────────────────────────────────────────

def is_junk_url(url):
    url_lower = url.lower()
    return any(pattern.lower() in url_lower for pattern in JUNK_URL_PATTERNS)

def pass3_purge_junk(qdrant, collection, dry_run):
    log.info("=== PASS 3: Purging junk URLs ===")
    stats = defaultdict(int)

    # Scroll all web-source points and find junk
    offset = None
    junk_point_ids = []
    junk_doc_hashes = set()

    while True:
        results, offset = qdrant.scroll(
            collection_name=collection,
            scroll_filter=Filter(
                must=[FieldCondition(key="source_type", match=MatchAny(any=["web"]))]
            ),
            limit=500,
            with_payload=True,
            with_vectors=False,
            offset=offset,
        )
        for r in results:
            filename = r.payload.get("filename", "")
            doc_hash = r.payload.get("doc_hash", "")
            if is_junk_url(filename):
                junk_point_ids.append(r.id)
                if doc_hash:
                    junk_doc_hashes.add(doc_hash)
        if offset is None:
            break

    log.info(f"Found {len(junk_point_ids)} junk vectors across {len(junk_doc_hashes)} documents")

    if not dry_run and junk_point_ids:
        # Delete in batches
        batch_size = 500
        for i in range(0, len(junk_point_ids), batch_size):
            batch = junk_point_ids[i:i + batch_size]
            qdrant.delete(collection_name=collection, points_selector=batch)
        log.info(f"Deleted {len(junk_point_ids)} junk vectors from Qdrant")

        # Mark junk docs as skipped in SQLite
        conn = sqlite3.connect(str(DB_PATH))
        for doc_hash in junk_doc_hashes:
            conn.execute(
                "UPDATE documents SET status = 'skipped', error_message = 'junk content purged' WHERE hash = ?",
                (doc_hash,)
            )
        conn.commit()
        conn.close()
        log.info(f"Marked {len(junk_doc_hashes)} documents as skipped in DB")

    stats["junk_vectors"] = len(junk_point_ids)
    stats["junk_docs"] = len(junk_doc_hashes)
    log.info(f"Pass 3 complete: {stats['junk_vectors']} vectors, {stats['junk_docs']} docs purged")
    return stats


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--skip-pass", type=int, action="append", default=[])
    args = parser.parse_args()

    config = get_config()
    keys = load_gemini_keys()
    # Pass 2 needs at least one key: KeyRotator.next() takes the index modulo len(keys)
    if not keys and 2 not in args.skip_pass:
        log.error("No Gemini keys found in .env")
        return
    rotator = KeyRotator(keys)

    qdrant = QdrantClient(
        host=config['vector_db']['host'],
        port=config['vector_db']['port'],
        timeout=60
    )
    collection = config['vector_db']['collection']

    log.info(f"Starting cleanup | dry_run={args.dry_run} | skipping passes: {args.skip_pass}")

    if 1 not in args.skip_pass:
        pass1_remap_outliers(qdrant, collection, args.dry_run)

    if 2 not in args.skip_pass:
        pass2_empty_domains(qdrant, collection, rotator, args.dry_run)

    if 3 not in args.skip_pass:
        pass3_purge_junk(qdrant, collection, args.dry_run)

    log.info("All passes complete.")


if __name__ == "__main__":
    main()
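Pass 1 of `cleanup_outliers.py` keeps canonical domains, rewrites known outliers through `OUTLIER_MAP`, and drops anything else, with callers falling back to "Foundational Skills" when nothing survives. The sketch below reproduces that rule on a trimmed, hypothetical pair of maps so the three cases can be seen side by side. Unlike the script, it folds the fallback into the function and sorts the result for a stable order; the names `CANONICAL`, `OUTLIERS`, `FALLBACK`, and `remap` are illustrative only.

```python
# Trimmed stand-ins for CANONICAL_DOMAINS and OUTLIER_MAP (assumption: sample data only).
CANONICAL = {"Navigation", "Security", "Foundational Skills"}
OUTLIERS = {"Navigational Astronomy": "Navigation", "Safety": "Security"}
FALLBACK = "Foundational Skills"


def remap(domains):
    """Return (new_domains, changed) following the pass-1 rule."""
    result = set()
    changed = False
    for d in domains:
        if d in CANONICAL:
            result.add(d)            # already canonical: keep as-is
        elif d in OUTLIERS:
            result.add(OUTLIERS[d])  # known outlier: rewrite
            changed = True
        else:
            changed = True           # unknown string: drop it
    # Fold in the fallback the script applies at the call site.
    return (sorted(result) or [FALLBACK]), changed


if __name__ == "__main__":
    print(remap(["Navigational Astronomy", "Navigation"]))
    print(remap(["Security"]))
    print(remap(["Site Gossip"]))
```

Because the result is built as a set, an outlier that maps onto a domain already present collapses into a single entry rather than a duplicate.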
215
scripts/domain_reenrich.py
Executable file
@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""
domain_reenrich.py — Re-enriches solo-Reference concepts that domain_remap.py
couldn't fix via subdomain lookup. Reads remap_unknowns.jsonl, calls Gemini
with a lightweight classification-only prompt, updates domain in-place.

Usage:
    python3 /opt/recon/scripts/domain_reenrich.py [--workers 16] [--limit N]

Reads: /opt/recon/data/remap_unknowns.jsonl
Writes: domain field in-place in window JSON files
Log: /opt/recon/logs/domain_reenrich.log
"""

import json
import time
import random
import logging
import argparse
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict

import google.generativeai as genai

UNKNOWNS_FILE = Path("/opt/recon/data/remap_unknowns.jsonl")
LOG_FILE = Path("/opt/recon/logs/domain_reenrich.log")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(LOG_FILE),
        logging.StreamHandler(),
    ]
)
log = logging.getLogger("domain_reenrich")

CANONICAL_DOMAINS = [
    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
    "Foundational Skills", "Communications", "Medical", "Food Systems",
    "Navigation", "Logistics", "Power Systems", "Leadership",
    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
]

DOMAIN_SET = set(CANONICAL_DOMAINS)

CLASSIFY_PROMPT = """\
Classify this knowledge concept into one or more domains.

VALID DOMAINS (use ONLY these exact strings, no others):
{domains}

Concept title: {title}
Concept tags: {subdomain}
Concept preview: {content}

Return ONLY valid JSON, no markdown, no explanation:
{{"domain": ["Domain Name"]}}

Rules:
- Use only the domain strings listed above, spelled exactly
- If genuinely multi-domain assign all that apply
- Never return empty domain list — pick the closest match
- Medical content, herbs, first aid, veterinary → Medical
- Food growing, foraging, hunting, livestock → Sustainment Systems
- Food preservation, canning, storage → Food Systems
- Solar, wind, batteries, generators → Power Systems
- Water sourcing, filtration, sanitation → Water Systems
"""

def load_gemini_keys():
    env = Path("/opt/recon/.env")
    keys = []
    for line in env.read_text().splitlines():
        if line.startswith("GEMINI_KEY_"):
            keys.append(line.split("=", 1)[1].strip())
    return keys

class KeyRotator:
    def __init__(self, keys):
        self.keys = keys
        self._i = 0
        self._lock = threading.Lock()
    def next(self):
        with self._lock:
            key = self.keys[self._i % len(self.keys)]
            self._i += 1
            return key

def classify_concept(title, subdomains, content, key):
    prompt = CLASSIFY_PROMPT.format(
        domains="\n".join(f" {d}" for d in CANONICAL_DOMAINS),
        title=title or "(untitled)",
        subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
        content=content[:300] if content else "(none)",
    )
    genai.configure(api_key=key)
    model = genai.GenerativeModel(
        "gemini-2.0-flash",
        generation_config={"response_mime_type": "application/json"}
    )
    for attempt in range(4):
        try:
            resp = model.generate_content(prompt)
            data = json.loads(resp.text)
            domains = [d for d in data.get("domain", []) if d in DOMAIN_SET]
            if domains:
                return domains
        except Exception as e:
            err = str(e).lower()
            if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
                delay = min(5 * (2 ** attempt) + random.uniform(0, 3), 60)
                time.sleep(delay)
            else:
                break
    return ["Foundational Skills"]  # last-resort fallback

def process_unknown(item, key_rotator):
    filepath = Path(item["filepath"])
    title = item.get("title", "")
    subdomains = item.get("subdomain", [])
    content = item.get("content_preview", "")

    if not filepath.exists():
        return "file_missing"

    try:
        with open(filepath, "r", encoding="utf-8") as f:
            concepts = json.load(f)
    except Exception:
        return "read_error"

    if not isinstance(concepts, list):
        return "not_list"

    # Find this concept by title and update its domain
    matched = False
    for concept in concepts:
        if not isinstance(concept, dict):
            continue
        if concept.get("title", "") == title:
            raw = concept.get("domain", [])
            if isinstance(raw, str):
                raw = [raw]
            # Only re-enrich if still stuck on Reference
            if raw == ["Reference"] or raw == []:
                key = key_rotator.next()
                new_domains = classify_concept(title, subdomains, content, key)
                concept["domain"] = new_domains
                concept["_reenriched"] = True
                matched = True
                break

    if not matched:
        return "already_fixed"

    try:
        with open(filepath, "w", encoding="utf-8") as f:
            json.dump(concepts, f, indent=2, ensure_ascii=False)
    except Exception:
        return "write_error"

    return "ok"

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=16)
    parser.add_argument("--limit", type=int, default=None)
    args = parser.parse_args()

    keys = load_gemini_keys()
    if not keys:
        log.error("No Gemini keys found in .env")
        return
    rotator = KeyRotator(keys)

    unknowns = []
    with open(UNKNOWNS_FILE, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                unknowns.append(json.loads(line))

    if args.limit:
        unknowns = unknowns[:args.limit]

    total = len(unknowns)
    log.info(f"Re-enriching {total:,} concepts | {args.workers} workers | {len(keys)} API keys")
    log.info(f"Estimated Gemini Flash cost: ~${total * 0.0004:.2f} (conservative)")

    results = defaultdict(int)
    lock = threading.Lock()
    done = 0

    with ThreadPoolExecutor(max_workers=args.workers) as ex:
        futures = {ex.submit(process_unknown, item, rotator): item for item in unknowns}
        for future in as_completed(futures):
            status = future.result()
            with lock:
                results[status] += 1
                done += 1
                if done % 5000 == 0:
                    pct = done / total * 100
                    log.info(f" Progress: {done:,}/{total:,} ({pct:.1f}%) | {dict(results)}")
            time.sleep(0.05)

    log.info("── Final Results ─────────────────────────────────────────────")
    for status, count in sorted(results.items(), key=lambda x: -x[1]):
        log.info(f" {status:<25} {count:>10,}")
    log.info(f" Total: {total:,}")

if __name__ == "__main__":
    main()
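Both `classify_concept` implementations retry rate-limited Gemini calls with the delay `min(5 * 2**attempt + uniform(0, 3), 60)`. The sketch below isolates that formula so the schedule is easy to inspect. The helper names `backoff_delay` and `delay_bounds` and their parameters are hypothetical, introduced only for this illustration.

```python
import random


def backoff_delay(attempt, base=5.0, jitter=3.0, cap=60.0, rng=random):
    """Exponential backoff with additive jitter, capped at `cap` seconds."""
    return min(base * (2 ** attempt) + rng.uniform(0, jitter), cap)


def delay_bounds(attempt, base=5.0, jitter=3.0, cap=60.0):
    """Smallest and largest delay the formula can produce for one attempt."""
    low = min(base * (2 ** attempt), cap)
    high = min(base * (2 ** attempt) + jitter, cap)
    return low, high


if __name__ == "__main__":
    # The scripts use range(4), so attempts run from 0 to 3.
    for attempt in range(4):
        print(attempt, delay_bounds(attempt))
```

With four attempts the largest possible delay is 43 seconds, so the 60-second cap only takes effect if the retry count is raised. Note also that, as the loops are written in the scripts, a rate-limit error on the final attempt still sleeps before the fallback domain is returned.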
428
scripts/domain_remap.py
Executable file
@ -0,0 +1,428 @@
#!/usr/bin/env python3
"""
domain_remap.py — Fix RECON concept domain classifications without API calls.

What this does:
1. Strips "Reference" from concepts that have other real domains
2. Remaps variant domain spellings to canonical names
3. Reclassifies solo-Reference concepts using their subdomain tags
4. Writes a JSONL file of true unknowns for API re-enrichment

Each window file is a JSON array of concept dicts.
Field names: "domain" (list), "subdomain" (list)

Usage:
    python3 /opt/recon/scripts/domain_remap.py --dry-run # report only
    python3 /opt/recon/scripts/domain_remap.py # apply fixes
    python3 /opt/recon/scripts/domain_remap.py --workers 16
"""

import json
import argparse
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict

CONCEPTS_DIR = Path("/opt/recon/data/concepts")
UNKNOWNS_OUTPUT = Path("/opt/recon/data/remap_unknowns.jsonl")

CANONICAL_DOMAINS = {
    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
    "Foundational Skills", "Communications", "Medical", "Food Systems",
    "Navigation", "Logistics", "Power Systems", "Leadership",
    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
}

# Variant → Canonical mapping
VARIANT_MAP = {
    # Defense & Tactics
    "Tactical Ops": "Defense & Tactics",
    "Tactical_Ops": "Defense & Tactics",
    "Tactical Operations": "Defense & Tactics",
    "Tactical": "Defense & Tactics",
    "Tactical Skills": "Defense & Tactics",
    "Tactics": "Defense & Tactics",
    "Tactics & Defense": "Defense & Tactics",
    "Reconnaissance": "Defense & Tactics",
    "Fire Support": "Defense & Tactics",
    "Improvised Munitions": "Defense & Tactics",
    "Military Intelligence": "Defense & Tactics",
    "Military History": "Defense & Tactics",
    "Military Engineering": "Defense & Tactics",
    # Medical
    "Medical Care": "Medical",
    "Medical Alternatives": "Medical",
    "Medical/Dental": "Medical",
    "Medical & Dental": "Medical",
    "medical": "Medical",
    "Medical Awareness": "Medical",
    "Medical Disasters": "Medical",
    "Medical Emergency Survival": "Medical",
    "Medical Procedures": "Medical",
    "Medical Treatment": "Medical",
    "Medical Science": "Medical",
    "Medical History": "Medical",
    "Medical Diagnosis": "Medical",
    "Medical Skills": "Medical",
    "Medical Supply": "Medical",
    "Medical Gear": "Medical",
    "Medical Kits": "Medical",
    "Medical Logistics": "Logistics",
    "Medical First Aid": "Medical",
    "Medical Ethics": "Medical",
    "Medical Reference Ranges": "Medical",
    "Medical andSurgical Hints": "Medical",
    "Medical Aspects of Radiation Injury": "Medical",
    "Medical Uses": "Medical",
    "Medical Care in Developing Countries": "Medical",
    "Survival Medicine": "Medical",
    "Emergency War Surgery": "Medical",
    "First Aid": "Medical",
    "First Aid and Life Saving": "Medical",
    "Veterinary Medicine": "Medical",
    "Veterinary Hygiene": "Medical",
    "Veterinary": "Medical",
    "Pharmacology": "Medical",
    "Public Health": "Medical",
    "Health": "Medical",
    # Food Systems
    "Food_Systems": "Food Systems",
    "Food_systems": "Food Systems",
    "food_systems": "Food Systems",
    "Food Preservation": "Food Systems",
    "Food Safety": "Food Systems",
    "Food Security": "Food Systems",
    "Food & Nutrition": "Food Systems",
    "Diet & Nutrition": "Food Systems",
    "Culinary Arts": "Food Systems",
    "Foodprocessing": "Food Systems",
    "Food": "Food Systems",
    # Sustainment Systems
    "Sustainment_Systems": "Sustainment Systems",
    "Agriculture": "Sustainment Systems",
    "Agriculture & Natural Resources": "Sustainment Systems",
    "Agriculture and Natural Resources": "Sustainment Systems",
    "Horticulture": "Sustainment Systems",
    "Gardening": "Sustainment Systems",
    "Hydroponics": "Sustainment Systems",
    "Survival Skills": "Sustainment Systems",
    # Foundational Skills
    "Foundational_Skills": "Foundational Skills",
    "Primitive Living Skills": "Foundational Skills",
    "Woodcraft": "Foundational Skills",
    "Home Workshop": "Foundational Skills",
    "Science": "Foundational Skills",
    "Engineering": "Foundational Skills",
    "Construction": "Foundational Skills",
    "Industrial Processes": "Foundational Skills",
    "Machine Technology": "Foundational Skills",
    "Training": "Foundational Skills",
    "Education": "Foundational Skills",
    # Off-Grid Systems
    "Off-Grid_Systems": "Off-Grid Systems",
    "Appropriate Technology": "Off-Grid Systems",
    # Power Systems
    "Homebrewed Electricity": "Power Systems",
    "Renewable Energy": "Power Systems",
    "Renewable Energy FAQs": "Power Systems",
    "Alternative Fuels": "Power Systems",
    "Power_Systems": "Power Systems",
    # Water Systems
    "Water_Systems": "Water Systems",
    # Community Coordination
    "Community_Coordination": "Community Coordination",
    "Community_coordination": "Community Coordination",
    "Community": "Community Coordination",
    # Leadership
    "Leadership & Planning": "Leadership",
    "Planning": "Leadership",
    "Administration": "Leadership",
    "Governance": "Leadership",
    "Government": "Leadership",
    # Communications
    "Emergency Communications": "Communications",
    # Security
    "Security Systems": "Security",
    # Logistics
    "Transportation": "Logistics",
    # Scenario Playbooks
    "General Preparedness": "Scenario Playbooks",
    "Emergency Preparedness": "Scenario Playbooks",
    "Emergency Management": "Scenario Playbooks",
    "Wilderness Preparedness": "Scenario Playbooks",
    "Urban Preparedness": "Scenario Playbooks",
    "Winter Preparedness": "Scenario Playbooks",
    # Discard (noise domains)
    "Humor": None,
    "Recreation": None,
    "Business": None,
    "Finance": None,
    "Economics": None,
    "Economics/Finances": None,
    "Weird Science": None,
}

# Subdomain keyword → canonical domain (for solo-Reference reclassification)
SUBDOMAIN_MAP = {
    "first aid": "Medical",
    "emergency care": "Medical",
    "emergency medicine": "Medical",
    "trauma": "Medical",
    "anatomy": "Medical",
    "oral rehydration": "Medical",
    "ors": "Medical",
    "pharmacology": "Medical",
    "toxicology": "Medical",
    "antidote": "Medical",
    "nerve agent": "Defense & Tactics",
    "chemical warfare": "Defense & Tactics",
    "biological warfare": "Defense & Tactics",
    "nbc": "Defense & Tactics",
    "infectious disease": "Medical",
    "microbiology": "Medical",
    "virology": "Medical",
    "bacteriology": "Medical",
    "pediatric": "Medical",
    "surgery": "Medical",
    "wound care": "Medical",
    "veterinary": "Medical",
    "dental": "Medical",
    "dentistry": "Medical",
    "herbal": "Medical",
    "medicinal plant": "Medical",
    "medicinal herb": "Medical",
    "herbalism": "Medical",
    "food preservation": "Food Systems",
    "canning": "Food Systems",
    "fermentation": "Food Systems",
    "food storage": "Food Systems",
    "food safety": "Food Systems",
    "cooking": "Food Systems",
    "food processing": "Food Systems",
    "agriculture": "Sustainment Systems",
    "soil": "Sustainment Systems",
    "permaculture": "Sustainment Systems",
    "agroforestry": "Sustainment Systems",
    "livestock": "Sustainment Systems",
    "animal husbandry": "Sustainment Systems",
    "beekeeping": "Sustainment Systems",
    "foraging": "Sustainment Systems",
    "hunting": "Sustainment Systems",
    "fishing": "Sustainment Systems",
    "gardening": "Sustainment Systems",
    "mycology": "Sustainment Systems",
    "mushroom": "Sustainment Systems",
    "water purification": "Water Systems",
    "water filtration": "Water Systems",
    "water sanitation": "Water Systems",
    "water disinfection": "Water Systems",
    "water storage": "Water Systems",
    "well construction": "Water Systems",
    "rainwater": "Water Systems",
    "solar": "Power Systems",
    "wind turbine": "Power Systems",
    "battery": "Power Systems",
    "batteries": "Power Systems",
    "generator": "Power Systems",
    "photovoltaic": "Power Systems",
    "charge controller": "Power Systems",
    "inverter": "Power Systems",
    "biogas": "Off-Grid Systems",
    "biomass": "Off-Grid Systems",
    "wood gasification": "Off-Grid Systems",
    "rocket stove": "Off-Grid Systems",
    "mechanical system": "Off-Grid Systems",
    "power transmission": "Off-Grid Systems",
    "radio": "Communications",
    "ham radio": "Communications",
    "amateur radio": "Communications",
    "antenna": "Communications",
    "meshtastic": "Communications",
    "encryption": "Communications",
    "navigation": "Navigation",
    "celestial navigation": "Navigation",
    "land navigation": "Navigation",
    "map reading": "Navigation",
    "compass": "Navigation",
    "pottery": "Foundational Skills",
    "ceramics": "Foundational Skills",
    "blacksmithing": "Foundational Skills",
    "woodworking": "Foundational Skills",
    "leatherwork": "Foundational Skills",
    "textile": "Foundational Skills",
    "masonry": "Foundational Skills",
    "metalworking": "Foundational Skills",
    "historical technology": "Foundational Skills",
    "weapons": "Defense & Tactics",
    "firearms": "Defense & Tactics",
    "ballistics": "Defense & Tactics",
    "tactics": "Defense & Tactics",
    "perimeter": "Security",
    "surveillance": "Security",
    "supply chain": "Logistics",
    "logistics": "Logistics",
    "leadership": "Leadership",
    "governance": "Leadership",
    "community": "Community Coordination",
    "emergency preparedness": "Scenario Playbooks",
    "disaster": "Scenario Playbooks",
    "evacuation": "Scenario Playbooks",
}


def remap_domains(domains):
|
||||
"""Remap a list of domain strings — variants to canonical, strip Reference."""
|
||||
result = set()
|
||||
for d in domains:
|
||||
if d == "Reference":
|
||||
continue
|
||||
if d in CANONICAL_DOMAINS:
|
||||
result.add(d)
|
||||
elif d in VARIANT_MAP:
|
||||
mapped = VARIANT_MAP[d]
|
||||
if mapped: # None means discard
|
||||
result.add(mapped)
|
||||
# Unknown non-canonical domains: drop them
|
||||
return list(result)
|
||||
|
||||
|
||||
def classify_by_subdomain(subdomains):
|
||||
"""Try to infer canonical domain(s) from subdomain keyword matching."""
|
||||
found = set()
|
||||
for sd in subdomains:
|
||||
sd_lower = sd.lower().strip()
|
||||
for key, domain in SUBDOMAIN_MAP.items():
|
||||
if key in sd_lower:
|
||||
found.add(domain)
|
||||
return list(found) if found else None
|
||||
|
||||
|
||||
def process_window_file(filepath, dry_run):
|
||||
"""Process one window JSON file (array of concepts). Returns per-file stats."""
|
||||
stats = defaultdict(int)
|
||||
unknowns = []
|
||||
|
||||
try:
|
||||
with open(filepath, "r", encoding="utf-8") as f:
|
||||
concepts = json.load(f)
|
||||
    except (OSError, ValueError):
        # Unreadable file or malformed JSON: count the file and move on
        return {"parse_error": 1}, []
|
||||
|
||||
if not isinstance(concepts, list):
|
||||
return {"skip_not_list": 1}, []
|
||||
|
||||
modified = False
|
||||
|
||||
for concept in concepts:
|
||||
if not isinstance(concept, dict):
|
||||
continue
|
||||
|
||||
raw_domains = concept.get("domain", [])
|
||||
if isinstance(raw_domains, str):
|
||||
raw_domains = [raw_domains]
|
||||
|
||||
subdomains = concept.get("subdomain", [])
|
||||
if isinstance(subdomains, str):
|
||||
subdomains = [subdomains]
|
||||
|
||||
has_reference = "Reference" in raw_domains
|
||||
non_reference = [d for d in raw_domains if d != "Reference"]
|
||||
|
||||
if not has_reference:
|
||||
# No Reference — just fix any variant names
|
||||
remapped = remap_domains(raw_domains)
|
||||
if set(remapped) != set(raw_domains):
|
||||
concept["domain"] = remapped
|
||||
modified = True
|
||||
stats["variant_remapped"] += 1
|
||||
else:
|
||||
stats["no_change"] += 1
|
||||
continue
|
||||
|
||||
# Has Reference — what else does it have?
|
||||
remapped_others = remap_domains(non_reference)
|
||||
|
||||
if remapped_others:
|
||||
# Reference + real domains: drop Reference, keep the rest
|
||||
concept["domain"] = remapped_others
|
||||
modified = True
|
||||
stats["reference_stripped"] += 1
|
||||
continue
|
||||
|
||||
# Solo Reference (or Reference + only-noise): try subdomain lookup
|
||||
inferred = classify_by_subdomain(subdomains)
|
||||
if inferred:
|
||||
concept["domain"] = inferred
|
||||
concept["_reclassified_from_reference"] = True
|
||||
modified = True
|
||||
stats["subdomain_reclassified"] += 1
|
||||
continue
|
||||
|
||||
# True unknown — needs API re-enrichment
|
||||
unknowns.append({
|
||||
"filepath": str(filepath),
|
||||
"title": concept.get("title", ""),
|
||||
"subdomain": subdomains,
|
||||
"content_preview": str(concept.get("content", concept.get("summary", "")))[:300],
|
||||
})
|
||||
stats["needs_enrichment"] += 1
|
||||
|
||||
if modified and not dry_run:
|
||||
with open(filepath, "w", encoding="utf-8") as f:
|
||||
json.dump(concepts, f, indent=2, ensure_ascii=False)
|
||||
|
||||
return dict(stats), unknowns
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Remap RECON concept domains")
|
||||
parser.add_argument("--dry-run", action="store_true", help="Report without writing")
|
||||
parser.add_argument("--workers", type=int, default=16)
|
||||
args = parser.parse_args()
|
||||
|
||||
print(f"[REMAP] Scanning {CONCEPTS_DIR}")
|
||||
print(f"[REMAP] Dry run: {args.dry_run} | Workers: {args.workers}")
|
||||
|
||||
window_files = [
|
||||
f for f in CONCEPTS_DIR.rglob("window_*.json")
|
||||
]
|
||||
print(f"[REMAP] Found {len(window_files):,} window files")
|
||||
|
||||
total_stats = defaultdict(int)
|
||||
all_unknowns = []
|
||||
lock = threading.Lock()
|
||||
done = 0
|
||||
|
||||
with ThreadPoolExecutor(max_workers=args.workers) as ex:
|
||||
futures = {ex.submit(process_window_file, f, args.dry_run): f for f in window_files}
|
||||
for future in as_completed(futures):
|
||||
file_stats, unknowns = future.result()
|
||||
with lock:
|
||||
for k, v in file_stats.items():
|
||||
total_stats[k] += v
|
||||
all_unknowns.extend(unknowns)
|
||||
done += 1
|
||||
if done % 5000 == 0:
|
||||
print(f" {done:,}/{len(window_files):,} files processed...")
|
||||
|
||||
print("\n── Results ─────────────────────────────────────────────────")
|
||||
for status, count in sorted(total_stats.items(), key=lambda x: -x[1]):
|
||||
print(f" {status:<35} {count:>10,}")
|
||||
|
||||
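    # parse_error and skip_not_list count files, not concepts, so exclude them
    total_concepts = sum(v for k, v in total_stats.items()
                         if k not in ("parse_error", "skip_not_list"))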
|
||||
print(f"\n Total concepts processed: {total_concepts:>10,}")
|
||||
print(f" True unknowns for re-enrichment:{len(all_unknowns):>10,}")
|
||||
|
||||
if not args.dry_run and all_unknowns:
|
||||
with open(UNKNOWNS_OUTPUT, "w", encoding="utf-8") as f:
|
||||
for item in all_unknowns:
|
||||
f.write(json.dumps(item) + "\n")
|
||||
print(f"\n Unknowns written to: {UNKNOWNS_OUTPUT}")
|
||||
|
||||
if args.dry_run:
|
||||
print("\n [DRY RUN] No files were modified.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
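The subdomain lookup in `classify_by_subdomain` uses `key in sd_lower`, which is a substring test, so a short key such as "radio" would also match "radiology". The sketch below shows one possible word-boundary variant. `DEMO_MAP` and `classify_by_subdomain_words` are hypothetical names used only for illustration, and the three-entry map is a stand-in for the real `SUBDOMAIN_MAP`.

```python
import re

# Hypothetical stand-in for SUBDOMAIN_MAP; three entries for illustration only.
DEMO_MAP = {
    "radio": "Communications",
    "solar": "Power Systems",
    "water storage": "Water Systems",
}

# One compiled pattern per key; \b keeps "radio" from matching "radiology".
_PATTERNS = {
    key: re.compile(r"\b" + re.escape(key) + r"\b")
    for key in DEMO_MAP
}


def classify_by_subdomain_words(subdomains):
    """Return sorted domains whose key appears as a whole word or phrase."""
    found = set()
    for sd in subdomains:
        sd_lower = sd.lower().strip()
        for key, pattern in _PATTERNS.items():
            if pattern.search(sd_lower):
                found.add(DEMO_MAP[key])
    return sorted(found) if found else None


print(classify_by_subdomain_words(["Ham Radio Operation"]))  # ['Communications']
print(classify_by_subdomain_words(["Radiology"]))            # None
```

The trade-off is that whole-word matching no longer catches plurals ("mushroom" would not match "mushrooms"), so a map used this way would probably need explicit plural entries, as the original already does for "battery" and "batteries".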
469
scripts/migrate_domains.py
Normal file
|
|
@@ -0,0 +1,469 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
migrate_domains.py — Reclassify 5 legacy domains via Gemini Flash.
|
||||
|
||||
Targets: Sustainment Systems, Off-Grid Systems, Defense & Tactics,
|
||||
Community Coordination, Leadership
|
||||
|
||||
Maps each to one of the 18 approved domains. 16 parallel workers,
|
||||
checkpoint file, crash-safe, incremental saves, progress every 5,000.
|
||||
|
||||
Usage:
|
||||
python3 /tmp/migrate_domains.py [--dry-run] [--workers 16] [--limit N]
|
||||
"""
|
||||
|
||||
import json
|
||||
import time
|
||||
import random
|
||||
import logging
|
||||
import argparse
|
||||
import threading
|
||||
from pathlib import Path
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from collections import defaultdict
|
||||
|
||||
import google.generativeai as genai
|
||||
from qdrant_client import QdrantClient
|
||||
from qdrant_client.models import FieldCondition, MatchValue, Filter
|
||||
|
||||
# Suppress noisy HTTP logs
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("qdrant_client").setLevel(logging.WARNING)
|
||||
|
||||
LOG_FILE = Path("/opt/recon/logs/migrate_domains.log")
|
||||
CHECKPOINT_FILE = Path("/opt/recon/data/migrate_domains_checkpoint.json")
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s %(levelname)s %(message)s",
|
||||
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
|
||||
)
|
||||
log = logging.getLogger("migrate_domains")
|
||||
|
||||
# ── Constants ───────────────────────────────────────────────────────────────
|
||||
|
||||
VALID_DOMAINS = {
|
||||
'Agriculture & Livestock', 'Civil Organization', 'Communications',
|
||||
'Food Systems', 'Foundational Skills', 'Logistics', 'Medical',
|
||||
'Navigation', 'Operations', 'Power Systems', 'Preservation & Storage',
|
||||
'Security', 'Shelter & Construction', 'Technology', 'Tools & Equipment',
|
||||
'Vehicles', 'Water Systems', 'Wilderness Skills',
|
||||
}
|
||||
|
||||
SOURCE_DOMAINS = {
|
||||
'Sustainment Systems', 'Off-Grid Systems', 'Defense & Tactics',
|
||||
'Community Coordination', 'Leadership',
|
||||
}
|
||||
|
||||
DOMAIN_LIST_STR = ', '.join(sorted(VALID_DOMAINS))
|
||||
|
||||
CLASSIFY_PROMPT = """\
|
||||
Classify this knowledge concept into exactly one domain from this list:
|
||||
Agriculture & Livestock, Civil Organization, Communications, Food Systems, Foundational Skills, Logistics, Medical, Navigation, Operations, Power Systems, Preservation & Storage, Security, Shelter & Construction, Technology, Tools & Equipment, Vehicles, Water Systems, Wilderness Skills
|
||||
|
||||
Return ONLY the exact domain string, nothing else. No explanation, no punctuation, no quotes.
|
||||
|
||||
Content: {content}
|
||||
Summary: {summary}
|
||||
Subdomain: {subdomain}
|
||||
"""
|
||||
|
||||
DOMAIN_FALLBACK = 'Foundational Skills'
|
||||
|
||||
# ── Key management ──────────────────────────────────────────────────────────
|
||||
|
||||
def load_gemini_keys():
|
||||
keys = []
|
||||
env_path = Path("/opt/recon/.env")
|
||||
if not env_path.exists():
|
||||
raise FileNotFoundError(f"{env_path} not found")
|
||||
for line in env_path.read_text().splitlines():
|
||||
if line.startswith("GEMINI_KEY_"):
|
||||
keys.append(line.split("=", 1)[1].strip())
|
||||
if not keys:
|
||||
raise ValueError("No GEMINI_KEY_* found in .env")
|
||||
return keys
|
||||
|
||||
|
||||
class KeyRotator:
|
||||
def __init__(self, keys):
|
||||
self.keys = keys
|
||||
self._i = 0
|
||||
self._lock = threading.Lock()
|
||||
|
||||
def next(self):
|
||||
with self._lock:
|
||||
key = self.keys[self._i % len(self.keys)]
|
||||
self._i += 1
|
||||
return key
|
||||
|
||||
|
||||
# ── Classification ──────────────────────────────────────────────────────────
|
||||
|
||||
def classify_domain(content, summary, subdomains, key):
|
||||
"""Call Gemini Flash to classify into one of 18 domains."""
|
||||
prompt = CLASSIFY_PROMPT.format(
|
||||
content=str(content)[:400] if content else "(none)",
|
||||
summary=str(summary)[:200] if summary else "(none)",
|
||||
subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
|
||||
)
|
||||
genai.configure(api_key=key)
|
||||
model = genai.GenerativeModel(
|
||||
"gemini-2.0-flash",
|
||||
generation_config={"response_mime_type": "text/plain"}
|
||||
)
|
||||
|
||||
for retry in range(4):
|
||||
try:
|
||||
resp = model.generate_content(prompt)
|
||||
            # Normalize: strip whitespace, quotes, and a trailing period
            value = resp.text.strip().strip('"').strip("'").strip().rstrip('.')
            if value in VALID_DOMAINS:
                return value
            # Try case-insensitive match
            for valid in VALID_DOMAINS:
                if value.lower() == valid.lower():
                    return valid
|
||||
# Invalid — retry with stricter prompt
|
||||
if retry < 3:
|
||||
prompt = (
|
||||
f"Your previous response '{value}' was invalid. "
|
||||
f"You must return ONLY one of these exact strings: {DOMAIN_LIST_STR}\n\n"
|
||||
f"Content: {str(content)[:300]}\n"
|
||||
f"Return ONLY the exact domain string."
|
||||
)
|
||||
continue
|
||||
except Exception as e:
|
||||
err = str(e).lower()
|
||||
if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
|
||||
time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
|
||||
else:
|
||||
log.warning(f"Gemini error (attempt {retry+1}): {e}")
|
||||
if retry >= 2:
|
||||
break
|
||||
|
||||
return heuristic_fallback(content, summary, subdomains)
|
||||
|
||||
|
||||
def heuristic_fallback(content, summary, subdomains):
|
||||
"""Last-resort heuristic when Gemini fails or returns invalid."""
|
||||
text = f"{summary or ''} {' '.join(subdomains or [])} {str(content or '')[:200]}".lower()
|
||||
|
||||
mapping = [
|
||||
(["farming", "agriculture", "livestock", "animal husbandry", "poultry",
|
||||
"cattle", "crop", "soil fertility", "irrigation for crops"], "Agriculture & Livestock"),
|
||||
(["foraging", "hunting", "fishing", "bushcraft", "wilderness", "survival skill",
|
||||
"fire starting", "shelter building", "trapping", "tracking"], "Wilderness Skills"),
|
||||
(["food preservation", "canning", "dehydration", "smoking", "pickling",
|
||||
"fermentation", "food storage", "freeze dry"], "Preservation & Storage"),
|
||||
(["cooking", "recipe", "nutrition", "food preparation", "baking",
|
||||
"food production", "meal"], "Food Systems"),
|
||||
(["first aid", "medical", "trauma", "surgery", "anatomy", "pharmacology",
|
||||
"wound", "triage", "diagnosis", "disease", "infection", "veterinary",
|
||||
"herbal medicine", "medicinal plant"], "Medical"),
|
||||
(["radio", "antenna", "ham radio", "communication", "signal",
|
||||
"networking", "meshtastic", "comms"], "Communications"),
|
||||
(["solar", "battery", "generator", "wind turbine", "hydroelectric",
|
||||
"power grid", "inverter", "photovoltaic", "electricity"], "Power Systems"),
|
||||
(["water purification", "water filter", "well", "rainwater",
|
||||
"sanitation", "water treatment", "desalination"], "Water Systems"),
|
||||
(["navigation", "compass", "map reading", "gps", "celestial",
|
||||
"orienteering", "land nav"], "Navigation"),
|
||||
(["security", "opsec", "perimeter", "surveillance", "threat",
|
||||
"intrusion detection", "physical security"], "Security"),
|
||||
(["vehicle", "engine", "motor", "aircraft", "boat", "motorcycle",
|
||||
"truck", "maintenance", "diesel", "transmission"], "Vehicles"),
|
||||
(["tool", "equipment", "wrench", "saw", "drill", "hammer",
|
||||
"hand tool", "power tool", "blade", "sharpening"], "Tools & Equipment"),
|
||||
(["construction", "building", "shelter", "carpentry", "masonry",
|
||||
"roofing", "concrete", "framing", "plumbing"], "Shelter & Construction"),
|
||||
(["electronics", "computer", "software", "circuit", "programming",
|
||||
"technology", "digital", "engineering"], "Technology"),
|
||||
(["supply chain", "logistics", "transport", "distribution",
|
||||
"inventory", "supply", "stockpile"], "Logistics"),
|
||||
(["governance", "civil", "community", "administration", "organization",
|
||||
"council", "democratic", "municipal"], "Civil Organization"),
|
||||
(["tactics", "combat", "military", "mission", "patrol", "ambush",
|
||||
"defensive position", "fire team", "maneuver", "engagement",
|
||||
"search and rescue", "sar", "reconnaissance"], "Operations"),
|
||||
]
|
||||
|
||||
for keywords, domain in mapping:
|
||||
if any(kw in text for kw in keywords):
|
||||
return domain
|
||||
|
||||
return DOMAIN_FALLBACK
|
||||
|
||||
|
||||
# ── Checkpoint ──────────────────────────────────────────────────────────────
|
||||
|
||||
class Checkpoint:
|
||||
"""Thread-safe checkpoint tracker for crash recovery."""
|
||||
def __init__(self, path):
|
||||
self.path = path
|
||||
self._lock = threading.Lock()
|
||||
self._completed = set()
|
||||
self._dirty = 0
|
||||
self._load()
|
||||
|
||||
def _load(self):
|
||||
if self.path.exists():
|
||||
try:
|
||||
data = json.loads(self.path.read_text())
|
||||
self._completed = set(data.get("completed", []))
|
||||
log.info(f"Loaded checkpoint: {len(self._completed):,} completed points")
|
||||
except Exception:
|
||||
self._completed = set()
|
||||
|
||||
def is_done(self, point_id):
|
||||
return point_id in self._completed
|
||||
|
||||
def mark_done(self, point_id):
|
||||
with self._lock:
|
||||
self._completed.add(point_id)
|
||||
self._dirty += 1
|
||||
if self._dirty >= 1000:
|
||||
self._flush()
|
||||
|
||||
def _flush(self):
|
||||
tmp = self.path.with_suffix('.tmp')
|
||||
tmp.write_text(json.dumps({"completed": list(self._completed)}))
|
||||
tmp.rename(self.path)
|
||||
self._dirty = 0
|
||||
|
||||
def flush(self):
|
||||
with self._lock:
|
||||
self._flush()
|
||||
|
||||
def count(self):
|
||||
return len(self._completed)
|
||||
|
||||
|
||||
# ── Per-point processing ───────────────────────────────────────────────────
|
||||
|
||||
def process_point(point, qdrant, collection, key_rotator, checkpoint, dry_run, stats):
|
||||
point_id = point.id
|
||||
if checkpoint.is_done(point_id):
|
||||
return "skipped"
|
||||
|
||||
payload = point.payload
|
||||
content = payload.get("content", payload.get("summary", ""))
|
||||
summary = payload.get("summary", "")
|
||||
subdomains = payload.get("subdomain", [])
|
||||
if isinstance(subdomains, str):
|
||||
subdomains = [subdomains]
|
||||
old_domain = payload.get("domain", [])
|
||||
if isinstance(old_domain, list):
|
||||
old_domain_str = old_domain[0] if old_domain else "(empty)"
|
||||
else:
|
||||
old_domain_str = str(old_domain)
|
||||
|
||||
key = key_rotator.next()
|
||||
new_domain = classify_domain(content, summary, subdomains, key)
|
||||
|
||||
# Track the mapping
|
||||
stats_key = f"{old_domain_str} -> {new_domain}"
|
||||
stats[stats_key] = stats.get(stats_key, 0) + 1
|
||||
|
||||
if dry_run:
|
||||
return f"would: {old_domain_str} -> {new_domain}"
|
||||
|
||||
# Write new domain as single string
|
||||
qdrant.set_payload(
|
||||
collection_name=collection,
|
||||
payload={"domain": new_domain},
|
||||
points=[point_id],
|
||||
)
|
||||
|
||||
checkpoint.mark_done(point_id)
|
||||
return "ok"
|
||||
|
||||
|
||||
# ── Main loop ───────────────────────────────────────────────────────────────
|
||||
|
||||
SCROLL_BATCH = 5000
|
||||
|
||||
|
||||
def count_source_domains(qdrant, collection):
|
||||
"""Count vectors with source domains."""
|
||||
counts = {}
|
||||
for domain in SOURCE_DOMAINS:
|
||||
result = qdrant.count(
|
||||
collection_name=collection,
|
||||
count_filter=Filter(
|
||||
must=[FieldCondition(key="domain", match=MatchValue(value=domain))]
|
||||
),
|
||||
exact=True,
|
||||
)
|
||||
counts[domain] = result.count
|
||||
return counts
|
||||
|
||||
|
||||
def stream_and_process(qdrant, collection, rotator, checkpoint, workers, limit=None, dry_run=False):
|
||||
"""Scroll source domains in batches, process with thread pool."""
|
||||
lock = threading.Lock()
|
||||
done = 0
|
||||
skipped_checkpoint = 0
|
||||
start = time.time()
|
||||
stats = {} # shared mapping stats
|
||||
|
||||
for source_domain in sorted(SOURCE_DOMAINS):
|
||||
log.info(f"\n--- Processing domain: {source_domain} ---")
|
||||
offset = None
|
||||
domain_done = 0
|
||||
|
||||
while True:
|
||||
scroll_results, offset = qdrant.scroll(
|
||||
collection_name=collection,
|
||||
limit=SCROLL_BATCH,
|
||||
with_payload=True,
|
||||
with_vectors=False,
|
||||
offset=offset,
|
||||
scroll_filter=Filter(
|
||||
must=[FieldCondition(key="domain", match=MatchValue(value=source_domain))]
|
||||
),
|
||||
)
|
||||
|
||||
if not scroll_results:
|
||||
if offset is None:
|
||||
break
|
||||
continue
|
||||
|
||||
            # Filter already checkpointed
            pending = [p for p in scroll_results if not checkpoint.is_done(p.id)]
            skipped_checkpoint += len(scroll_results) - len(pending)
            if limit:
                # Honor --limit inside a batch, not only at batch boundaries
                pending = pending[:max(limit - done, 0)]
|
||||
|
||||
if pending:
|
||||
with ThreadPoolExecutor(max_workers=workers) as ex:
|
||||
futures = {
|
||||
ex.submit(process_point, p, qdrant, collection, rotator,
|
||||
checkpoint, dry_run, stats): p
|
||||
for p in pending
|
||||
}
|
||||
for future in as_completed(futures):
|
||||
try:
|
||||
future.result()
|
||||
except Exception as e:
|
||||
log.error(f"Worker error: {e}")
|
||||
with lock:
|
||||
done += 1
|
||||
domain_done += 1
|
||||
if done % 5000 == 0:
|
||||
elapsed = time.time() - start
|
||||
rate = done / elapsed * 60
|
||||
log.info(f" {done:,} done | {rate:.0f}/min | "
|
||||
f"elapsed {elapsed/60:.1f}min")
|
||||
checkpoint.flush()
|
||||
time.sleep(0.02)
|
||||
|
||||
if limit and done >= limit:
|
||||
break
|
||||
if offset is None:
|
||||
break
|
||||
|
||||
log.info(f" {source_domain}: {domain_done:,} vectors processed")
|
||||
|
||||
if limit and done >= limit:
|
||||
break
|
||||
|
||||
checkpoint.flush()
|
||||
return done, skipped_checkpoint, stats, start
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Classify 20 samples without writing")
|
||||
parser.add_argument("--workers", type=int, default=16)
|
||||
parser.add_argument("--limit", type=int, default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
keys = load_gemini_keys()
|
||||
rotator = KeyRotator(keys)
|
||||
|
||||
qdrant = QdrantClient(host="localhost", port=6333, timeout=120)
|
||||
collection = "recon_knowledge"
|
||||
checkpoint = Checkpoint(CHECKPOINT_FILE)
|
||||
|
||||
# Count source domains
|
||||
counts = count_source_domains(qdrant, collection)
|
||||
total_source = sum(counts.values())
|
||||
pre_checkpoint = checkpoint.count()
|
||||
|
||||
log.info(f"Source domain counts:")
|
||||
for domain, count in sorted(counts.items(), key=lambda x: -x[1]):
|
||||
log.info(f" {domain:30s} {count:>10,}")
|
||||
log.info(f" {'TOTAL':30s} {total_source:>10,}")
|
||||
log.info(f"Checkpoint: {pre_checkpoint:,} already completed")
|
||||
log.info(f"Workers: {args.workers} | Keys: {len(keys)}")
|
||||
|
||||
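    # Cost estimate
    # Migrated points no longer match the source-domain filter, so the live
    # count already excludes checkpointed work; do not subtract it again.
    remaining = total_source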
|
||||
input_tokens = remaining * 200
|
||||
output_tokens = remaining * 5
|
||||
input_cost = input_tokens / 1_000_000 * 0.10
|
||||
output_cost = output_tokens / 1_000_000 * 0.40
|
||||
total_cost = input_cost + output_cost
|
||||
log.info(f"\nEstimated Gemini 2.0 Flash cost:")
|
||||
log.info(f" Vectors to process: {remaining:,}")
|
||||
log.info(f" Input: ~{input_tokens/1_000_000:.1f}M tokens = ${input_cost:.2f}")
|
||||
log.info(f" Output: ~{output_tokens/1_000_000:.1f}M tokens = ${output_cost:.2f}")
|
||||
log.info(f" TOTAL: ~${total_cost:.2f}")
|
||||
|
||||
if args.dry_run:
|
||||
log.info(f"\nDRY RUN: classifying 20 samples...\n")
|
||||
for source_domain in sorted(SOURCE_DOMAINS):
|
||||
scroll_results, _ = qdrant.scroll(
|
||||
collection_name=collection,
|
||||
limit=5,
|
||||
with_payload=True,
|
||||
with_vectors=False,
|
||||
scroll_filter=Filter(
|
||||
must=[FieldCondition(key="domain", match=MatchValue(value=source_domain))]
|
||||
),
|
||||
)
|
||||
for p in scroll_results[:4]:
|
||||
pay = p.payload
|
||||
title = pay.get("title", "(no title)")
|
||||
content = pay.get("content", pay.get("summary", ""))
|
||||
summary = pay.get("summary", "")
|
||||
subdomains = pay.get("subdomain", [])
|
||||
if isinstance(subdomains, str):
|
||||
subdomains = [subdomains]
|
||||
|
||||
key = rotator.next()
|
||||
new_domain = classify_domain(content, summary, subdomains, key)
|
||||
|
||||
old = pay.get("domain", [])
|
||||
if isinstance(old, list):
|
||||
old = old[0] if old else "?"
|
||||
print(f" [{old:25s}] -> [{new_domain:25s}] {title[:60]}")
|
||||
|
||||
print(f"\nDRY RUN complete. ~{remaining:,} vectors would be migrated.")
|
||||
print(f"Estimated cost: ~${total_cost:.2f}")
|
||||
return
|
||||
|
||||
# ── Full migration ──────────────────────────────────────────────────
|
||||
log.info(f"\nStarting full migration...")
|
||||
|
||||
done, skipped_ckpt, stats, start = stream_and_process(
|
||||
qdrant, collection, rotator, checkpoint, args.workers, args.limit
|
||||
)
|
||||
|
||||
elapsed = time.time() - start
|
||||
log.info(f"\n{'='*70}")
|
||||
log.info(f"MIGRATION COMPLETE in {elapsed/60:.1f}min:")
|
||||
log.info(f" Processed: {done:,}")
|
||||
log.info(f" Skipped (checkpoint): {skipped_ckpt:,}")
|
||||
log.info(f" Rate: {done/elapsed*60:.0f}/min")
|
||||
log.info(f"\nMapping distribution:")
|
||||
for mapping, count in sorted(stats.items(), key=lambda x: -x[1])[:30]:
|
||||
log.info(f" {mapping:<55s} {count:>8,}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
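On rate-limit errors the retry loop in `classify_domain` sleeps for `min(5 * (2 ** retry) + random.uniform(0, 3), 60)` seconds. As a rough worked example, the sketch below tabulates the lower and upper bound of that delay per attempt. `backoff_bounds` is a hypothetical helper, not part of the script; its defaults simply mirror the constants in the expression above.

```python
def backoff_bounds(retry, base=5, jitter=3, cap=60):
    """Return (min, max) sleep in seconds for a given retry index."""
    core = base * (2 ** retry)
    # Jitter adds between 0 and `jitter` seconds; the cap applies afterwards.
    return min(core, cap), min(core + jitter, cap)


for retry in range(4):
    low, high = backoff_bounds(retry)
    print(f"retry {retry}: {low}-{high} s")
# retry 0: 5-8 s
# retry 1: 10-13 s
# retry 2: 20-23 s
# retry 3: 40-43 s
```

With four attempts the 60-second cap is never reached, and a point that is rate-limited on every attempt waits at most about 87 seconds in total before the heuristic fallback runs.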
|
||||
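The `Checkpoint` class above persists completed point IDs by writing a `.tmp` file and renaming it over the real one, so a crash mid-write should leave the previous checkpoint intact. A minimal standalone sketch of that write-then-swap pattern follows. `save_checkpoint` and `load_checkpoint` are hypothetical names for illustration, and `os.replace` is swapped in for `Path.rename` because it overwrites an existing target on both POSIX and Windows.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, completed):
    """Write the completed IDs to a temp file, then swap it into place."""
    tmp = path.with_suffix(".tmp")
    # sorted() assumes all IDs share one type (all str or all int).
    tmp.write_text(json.dumps({"completed": sorted(completed)}))
    os.replace(tmp, path)  # atomic when tmp and path share a filesystem


def load_checkpoint(path):
    """Return the set of completed IDs, or an empty set if absent or corrupt."""
    try:
        return set(json.loads(path.read_text()).get("completed", []))
    except (FileNotFoundError, json.JSONDecodeError):
        return set()


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "demo_checkpoint.json"
    save_checkpoint(ckpt, {"a1", "b2"})
    resumed = load_checkpoint(ckpt)
    print(sorted(resumed))  # ['a1', 'b2']
```

Because the swap is a single rename, a reader sees either the old file or the new one, never a half-written mix.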
469
scripts/migrate_skill_level.py
Executable file
|
|
@@ -0,0 +1,469 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
migrate_skill_level.py — Replaces skill_level with knowledge_type + complexity
|
||||
on all vectors in Qdrant and on-disk concept JSONs.
|
||||
|
||||
Scrolls entire collection, classifies each concept via Gemini Flash,
|
||||
writes knowledge_type + complexity, deletes skill_level.
|
||||
|
||||
Crash-safe: completed point IDs tracked in checkpoint file.
|
||||
|
||||
Usage:
|
||||
python3 /opt/recon/scripts/migrate_skill_level.py [--dry-run] [--workers 16] [--limit N]
|
||||
"""
|
||||
|
||||
import json
|
||||
import time
|
||||
import random
|
||||
import logging
|
||||
import argparse
|
||||
import threading
|
||||
from pathlib import Path
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from collections import defaultdict
|
||||
|
||||
import google.generativeai as genai
|
||||
from qdrant_client import QdrantClient
|
||||
from qdrant_client.models import FieldCondition, MatchValue, Filter
|
||||
|
||||
import sys
|
||||
sys.path.insert(0, '/opt/recon')
|
||||
from lib.utils import get_config, setup_logging
|
||||
|
||||
# Suppress noisy HTTP request logging from qdrant_client/httpx
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("qdrant_client").setLevel(logging.WARNING)
|
||||
|
||||
LOG_FILE = Path("/opt/recon/logs/migrate_skill_level.log")
|
||||
CHECKPOINT_FILE = Path("/opt/recon/data/migrate_skill_level_checkpoint.json")
|
||||
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s %(levelname)s %(message)s",
|
||||
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
|
||||
)
|
||||
log = logging.getLogger("migrate_skill_level")
|
||||
|
||||
# ── Prompt ──────────────────────────────────────────────────────────────────
|
||||
|
||||
CLASSIFY_PROMPT = """\
|
||||
You are a knowledge classification engine. Given a concept, assign two fields:
|
||||
|
||||
knowledge_type — what KIND of knowledge this is:
|
||||
foundational — concepts, definitions, theory, background knowledge, explanations of how things work
|
||||
procedural — step-by-step techniques, instructions, how-to skills, methods you execute
|
||||
operational — application under real conditions, decision-making, mission execution, judgment calls in context
|
||||
|
||||
complexity — how much prior knowledge is needed:
|
||||
basic — requires little or no prior knowledge, introductory material, simple concepts
|
||||
intermediate — requires some domain familiarity, assumes foundational knowledge is in place
|
||||
advanced — requires significant experience or expertise, high-stakes or highly technical material
|
||||
|
||||
EXAMPLES:
|
||||
- "Needle chest decompression procedure" → procedural, advanced
|
||||
- "What is soil texture and why does it matter" → foundational, basic
|
||||
- "Coordinating a fire team withdrawal under contact" → operational, advanced
|
||||
- "How to start a campfire with a ferro rod" → procedural, basic
|
||||
- "Antenna gain and radiation patterns explained" → foundational, intermediate
|
||||
- "Triage decision-making in a mass casualty event" → operational, advanced
|
||||
- "Step-by-step: building a Dakota fire hole" → procedural, intermediate
|
||||
- "Understanding the water cycle" → foundational, basic
|
||||
|
||||
Concept title: {title}
|
||||
Concept domain: {domain}
|
||||
Concept subdomain: {subdomain}
|
||||
Concept content: {content}
|
||||
|
||||
Return ONLY valid JSON, no markdown, no explanation:
|
||||
{{"knowledge_type": "foundational|procedural|operational", "complexity": "basic|intermediate|advanced"}}
|
||||
"""
|
||||
|
||||
VALID_KNOWLEDGE_TYPES = {"foundational", "procedural", "operational"}
|
||||
VALID_COMPLEXITIES = {"basic", "intermediate", "advanced"}
|
||||
|
||||
# ── Key management ──────────────────────────────────────────────────────────
|
||||
|
||||
def load_gemini_keys():
    keys = []
    for line in Path("/opt/recon/.env").read_text().splitlines():
        if line.startswith("GEMINI_KEY_"):
            keys.append(line.split("=", 1)[1].strip())
    if not keys:
        # KeyRotator.next() would otherwise fail with ZeroDivisionError
        raise ValueError("No GEMINI_KEY_* found in .env")
    return keys
|
||||
|
||||
|
||||
class KeyRotator:
|
||||
def __init__(self, keys):
|
||||
self.keys = keys
|
||||
self._i = 0
|
||||
self._lock = threading.Lock()
|
||||
|
||||
def next(self):
|
||||
with self._lock:
|
||||
key = self.keys[self._i % len(self.keys)]
|
||||
self._i += 1
|
||||
return key
|
||||
|
||||
# ── Classification ──────────────────────────────────────────────────────────
|
||||
|
||||
def classify(title, domains, subdomains, content, key):
|
||||
"""Call Gemini Flash to classify knowledge_type + complexity."""
|
||||
prompt = CLASSIFY_PROMPT.format(
|
||||
title=title or "(untitled)",
|
||||
domain=", ".join(domains[:5]) if domains else "(none)",
|
||||
subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
|
||||
content=str(content)[:400] if content else "(none)",
|
||||
)
|
||||
genai.configure(api_key=key)
|
||||
model = genai.GenerativeModel(
|
||||
"gemini-2.0-flash",
|
||||
generation_config={"response_mime_type": "application/json"}
|
||||
)
|
||||
for retry in range(4):
|
||||
try:
|
||||
resp = model.generate_content(prompt)
|
||||
data = json.loads(resp.text)
|
||||
kt = data.get("knowledge_type", "").lower().strip()
|
||||
cx = data.get("complexity", "").lower().strip()
|
||||
if kt in VALID_KNOWLEDGE_TYPES and cx in VALID_COMPLEXITIES:
|
||||
return kt, cx
|
||||
            # Invalid values: fall through to the next attempt (4 in total)
|
||||
except Exception as e:
|
||||
err = str(e).lower()
|
||||
if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
|
||||
time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
|
||||
else:
|
||||
break
|
||||
|
||||
    # Fallback: keyword heuristic over title, subdomains, and content
|
||||
return heuristic_fallback(title, subdomains, content)
|
||||
|
||||
|
||||
def heuristic_fallback(title, subdomains, content):
|
||||
"""Last-resort heuristic when Gemini fails."""
|
||||
    # Guard against None so a missing title or subdomain list cannot raise
    text = f"{title or ''} {' '.join(subdomains or [])} {str(content or '')[:200]}".lower()
|
||||
|
||||
# Knowledge type heuristic
|
||||
procedural_signals = ["how to", "step-by-step", "procedure", "instructions",
|
||||
"method", "technique", "build", "make", "construct",
|
||||
"install", "assemble", "recipe", "prepare"]
|
||||
operational_signals = ["decision", "coordinate", "execute", "deploy",
|
||||
"mission", "triage", "under fire", "in the field",
|
||||
"real-world", "scenario", "assessment", "plan"]
|
||||
|
||||
if any(s in text for s in operational_signals):
|
||||
kt = "operational"
|
||||
elif any(s in text for s in procedural_signals):
|
||||
kt = "procedural"
|
||||
else:
|
||||
kt = "foundational"
|
||||
|
||||
# Complexity heuristic — default intermediate (safest middle ground)
|
||||
cx = "intermediate"
|
||||
basic_signals = ["introduction", "what is", "basic", "beginner", "overview",
|
||||
"definition", "simple", "fundamentals"]
|
||||
advanced_signals = ["advanced", "expert", "complex", "critical", "high-stakes",
|
||||
"surgery", "trauma", "tactical", "classified"]
|
||||
if any(s in text for s in basic_signals):
|
||||
cx = "basic"
|
||||
elif any(s in text for s in advanced_signals):
|
||||
cx = "advanced"
|
||||
|
||||
return kt, cx
|
||||
|
||||
# ── Checkpoint management ───────────────────────────────────────────────────
|
||||
|
||||
class Checkpoint:
    """Thread-safe checkpoint tracker for crash recovery."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()
        self._completed = set()
        self._dirty = 0  # points marked done since the last flush
        self._load()

    def _load(self):
        if self.path.exists():
            try:
                data = json.loads(self.path.read_text())
                self._completed = set(data.get("completed", []))
                log.info(f"Loaded checkpoint: {len(self._completed):,} completed points")
            except Exception as e:
                # An unreadable checkpoint means every point is reprocessed, so say so
                log.warning(f"Could not read checkpoint {self.path}, starting fresh: {e}")
                self._completed = set()

    def is_done(self, point_id):
        return point_id in self._completed

    def mark_done(self, point_id):
        with self._lock:
            self._completed.add(point_id)
            self._dirty += 1
            if self._dirty >= 1000:
                self._flush()

    def _flush(self):
        """Write to a temp file and swap it in. Caller must hold self._lock."""
        tmp = self.path.with_suffix('.tmp')
        tmp.write_text(json.dumps({"completed": list(self._completed)}))
        # replace() overwrites an existing target on every platform;
        # rename() fails on Windows if the target exists
        tmp.replace(self.path)
        self._dirty = 0

    def flush(self):
        with self._lock:
            self._flush()

    def count(self):
        return len(self._completed)
|
||||
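The checkpoint above relies on a write-to-temp-then-swap pattern so that a crash mid-write never leaves a half-written checkpoint behind. A self-contained sketch of that pattern, using hypothetical function names `save_checkpoint` and `load_checkpoint` and string ids only:

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(path, completed):
    """Write the id set to a sibling temp file, then swap it into place."""
    path = Path(path)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"completed": sorted(completed)}))
    tmp.replace(path)  # atomic on POSIX when both paths share a filesystem


def load_checkpoint(path):
    """Return the saved id set, or an empty set if no checkpoint exists."""
    path = Path(path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()).get("completed", []))


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "checkpoint.json"
    save_checkpoint(ckpt, {"a", "b"})
    save_checkpoint(ckpt, {"a", "b", "c"})  # overwrites the first version
    restored = load_checkpoint(ckpt)
    print(sorted(restored))
```

Readers of the checkpoint file only ever see the old complete version or the new complete version, never a partial one.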
|
||||
# ── Concept JSON update ────────────────────────────────────────────────────
|
||||
|
||||
# Worker threads can update concepts that live in the same window file at the
# same time, so serialize the read-modify-write and swap the file in atomically.
_CONCEPT_JSON_LOCK = threading.Lock()


def update_concept_json(doc_hash, title, knowledge_type, complexity):
    """Update on-disk concept JSON: add knowledge_type + complexity, remove skill_level."""
    doc_dir = CONCEPTS_DIR / doc_hash
    if not doc_dir.exists():
        return False
    with _CONCEPT_JSON_LOCK:
        for wf in doc_dir.glob("window_*.json"):
            try:
                with open(wf, "r", encoding="utf-8") as f:
                    concepts = json.load(f)
                changed = False
                for c in concepts:
                    if not isinstance(c, dict):
                        continue
                    if c.get("title") == title:
                        c["knowledge_type"] = knowledge_type
                        c["complexity"] = complexity
                        c.pop("skill_level", None)
                        changed = True
                if changed:
                    # Write to a temp file first so a crash cannot truncate the window
                    tmp = wf.with_suffix(".json.tmp")
                    with open(tmp, "w", encoding="utf-8") as f:
                        json.dump(concepts, f, indent=2, ensure_ascii=False)
                    tmp.replace(wf)
                    return True
            except Exception as e:
                log.warning(f"Could not update {wf}: {e}")
    return False
|
||||
|
||||
# ── Per-point processing ───────────────────────────────────────────────────
|
||||
|
||||
def process_point(point, qdrant, collection, key_rotator, checkpoint, dry_run):
|
||||
point_id = point.id
|
||||
if checkpoint.is_done(point_id):
|
||||
return "skipped"
|
||||
|
||||
payload = point.payload
|
||||
title = payload.get("title", "")
|
||||
domains = payload.get("domain", [])
|
||||
if isinstance(domains, str):
|
||||
domains = [domains]
|
||||
subdomains = payload.get("subdomain", [])
|
||||
if isinstance(subdomains, str):
|
||||
subdomains = [subdomains]
|
||||
content = payload.get("content", payload.get("summary", ""))
|
||||
doc_hash = payload.get("doc_hash", "")
|
||||
|
||||
key = key_rotator.next()
|
||||
knowledge_type, complexity = classify(title, domains, subdomains, content, key)
|
||||
|
||||
if dry_run:
|
||||
return f"kt={knowledge_type}, cx={complexity}"
|
||||
|
||||
# Write new fields
|
||||
qdrant.set_payload(
|
||||
collection_name=collection,
|
||||
payload={"knowledge_type": knowledge_type, "complexity": complexity},
|
||||
points=[point_id],
|
||||
)
|
||||
|
||||
# Delete old field
|
||||
qdrant.delete_payload(
|
||||
collection_name=collection,
|
||||
keys=["skill_level"],
|
||||
points=[point_id],
|
||||
)
|
||||
|
||||
# Update JSON on disk
|
||||
if doc_hash:
|
||||
update_concept_json(doc_hash, title, knowledge_type, complexity)
|
||||
|
||||
checkpoint.mark_done(point_id)
|
||||
return "ok"
|
||||
|
||||
# ── Streaming batch processor ───────────────────────────────────────────────
|
||||
|
||||
SCROLL_BATCH = 5000 # vectors per scroll batch — keeps memory bounded (~50MB)
|
||||
|
||||
|
||||
def count_collection(qdrant, collection):
|
||||
"""Quick count of total vectors via collection info."""
|
||||
info = qdrant.get_collection(collection)
|
||||
return info.points_count
|
||||
|
||||
|
||||
def stream_and_process(qdrant, collection, rotator, checkpoint, workers, limit=None):
|
||||
"""Scroll in batches, process each batch with thread pool, then discard.
|
||||
|
||||
Memory-bounded: only holds SCROLL_BATCH payloads at any time (~50MB).
|
||||
"""
|
||||
results_agg = defaultdict(int)
|
||||
lock = threading.Lock()
|
||||
done = 0
|
||||
skipped_checkpoint = 0
|
||||
skipped_no_skill = 0
|
||||
total_estimate = count_collection(qdrant, collection)
|
||||
start = time.time()
|
||||
|
||||
offset = None
|
||||
batch_num = 0
|
||||
|
||||
while True:
|
||||
batch_num += 1
|
||||
scroll_results, offset = qdrant.scroll(
|
||||
collection_name=collection,
|
||||
limit=SCROLL_BATCH,
|
||||
with_payload=True,
|
||||
with_vectors=False,
|
||||
offset=offset,
|
||||
)
|
||||
|
||||
# Filter to points needing migration
|
||||
pending = []
|
||||
for p in scroll_results:
|
||||
if "skill_level" not in p.payload:
|
||||
skipped_no_skill += 1
|
||||
continue
|
||||
if checkpoint.is_done(p.id):
|
||||
skipped_checkpoint += 1
|
||||
continue
|
||||
pending.append(p)
|
||||
|
||||
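        # Honour --limit within a batch, not only at batch boundaries
        if limit:
            pending = pending[:max(0, limit - done)]

        if pending: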
|
||||
with ThreadPoolExecutor(max_workers=workers) as ex:
|
||||
futures = {
|
||||
ex.submit(process_point, p, qdrant, collection, rotator, checkpoint, False): p
|
||||
for p in pending
|
||||
}
|
||||
for future in as_completed(futures):
|
||||
try:
|
||||
status = future.result()
|
||||
except Exception as e:
|
||||
status = f"error: {str(e)[:80]}"
|
||||
log.error(f"Worker error: {e}")
|
||||
with lock:
|
||||
results_agg[status] += 1
|
||||
done += 1
|
||||
if done % 5000 == 0:
|
||||
elapsed = time.time() - start
|
||||
rate = done / elapsed * 60
|
||||
remaining = total_estimate - done - skipped_checkpoint - skipped_no_skill
|
||||
eta = remaining / (done / elapsed) / 60 if done > 0 else 0
|
||||
log.info(f" {done:,} done | {rate:.0f}/min | "
|
||||
f"ETA ~{eta:.0f}min | {dict(results_agg)}")
|
||||
checkpoint.flush()
|
||||
time.sleep(0.02)
|
||||
|
||||
if limit and done >= limit:
|
||||
break
|
||||
if offset is None:
|
||||
break
|
||||
|
||||
checkpoint.flush()
|
||||
return done, skipped_checkpoint, skipped_no_skill, results_agg, start
|
||||
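The streaming loop above is cursor pagination: each call returns one page plus the offset of the next page, and a `None` offset marks the end. The sketch below isolates that control flow with a stand-in data source; `scroll_batches` and `fake_fetch` are hypothetical names, and `fake_fetch` only imitates the `(points, next_offset)` return shape that the script unpacks from `qdrant.scroll`.

```python
def scroll_batches(fetch_page, batch_size):
    """Yield pages until the backend reports no further offset."""
    offset = None
    while True:
        page, offset = fetch_page(offset, batch_size)
        yield page
        if offset is None:
            break


# Stand-in for a paginated store: 12 records addressed by integer offset.
RECORDS = list(range(12))


def fake_fetch(offset, limit):
    start = offset or 0
    page = RECORDS[start:start + limit]
    next_offset = start + limit if start + limit < len(RECORDS) else None
    return page, next_offset


sizes = [len(page) for page in scroll_batches(fake_fetch, 5)]
print(sizes)
```

Only one page is alive at a time when the generator is consumed lazily, which is what keeps memory bounded regardless of collection size.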
|
||||
|
||||
# ── Main ────────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Classify 20 samples without writing anything")
|
||||
parser.add_argument("--workers", type=int, default=16)
|
||||
parser.add_argument("--limit", type=int, default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
config = get_config()
|
||||
keys = load_gemini_keys()
|
||||
rotator = KeyRotator(keys)
|
||||
|
||||
qdrant = QdrantClient(
|
||||
host=config['vector_db']['host'],
|
||||
port=config['vector_db']['port'],
|
||||
timeout=120
|
||||
)
|
||||
collection = config['vector_db']['collection']
|
||||
checkpoint = Checkpoint(CHECKPOINT_FILE)
|
||||
|
||||
total_vectors = count_collection(qdrant, collection)
|
||||
pre_checkpoint = checkpoint.count()
|
||||
|
||||
log.info(f"Collection has {total_vectors:,} vectors")
|
||||
log.info(f"Checkpoint: {pre_checkpoint:,} already completed")
|
||||
log.info(f"Workers: {args.workers} | Keys: {len(keys)} | Dry run: {args.dry_run}")
|
||||
log.info(f"Estimated Gemini Flash cost: ~${(total_vectors - pre_checkpoint) * 0.0004:.2f}")
|
||||
log.info(f"Streaming in batches of {SCROLL_BATCH:,} (memory-bounded)")
|
||||
|
||||
if args.dry_run:
|
||||
# Scroll one batch, classify 20 diverse samples
|
||||
log.info(f"\nDRY RUN: classifying 20 samples...\n")
|
||||
scroll_results, _ = qdrant.scroll(
|
||||
collection_name=collection,
|
||||
limit=200,
|
||||
with_payload=True,
|
||||
with_vectors=False,
|
||||
)
|
||||
samples = []
|
||||
seen_domains = set()
|
||||
for p in scroll_results:
|
||||
if "skill_level" not in p.payload:
|
||||
continue
|
||||
domains = p.payload.get("domain", [])
|
||||
if isinstance(domains, str):
|
||||
domains = [domains]
|
||||
d_key = tuple(sorted(domains[:2]))
|
||||
if d_key not in seen_domains:
|
||||
samples.append(p)
|
||||
seen_domains.add(d_key)
|
||||
if len(samples) >= 20:
|
||||
break
|
||||
|
||||
for i, p in enumerate(samples, 1):
|
||||
pay = p.payload
|
||||
title = pay.get("title", "(no title)")
|
||||
domains = pay.get("domain", [])
|
||||
old_skill = pay.get("skill_level", "?")
|
||||
subdomains = pay.get("subdomain", [])
|
||||
if isinstance(subdomains, str):
|
||||
subdomains = [subdomains]
|
||||
content = pay.get("content", pay.get("summary", ""))
|
||||
|
||||
key = rotator.next()
|
||||
kt, cx = classify(title, domains, subdomains, content, key)
|
||||
|
||||
print(f"\n--- Sample {i}/{len(samples)} ---")
|
||||
print(f" Title: {title}")
|
||||
print(f" Domain: {domains}")
|
||||
print(f" Old skill: {old_skill}")
|
||||
print(f" → knowledge_type: {kt}")
|
||||
print(f" → complexity: {cx}")
|
||||
est = total_vectors - pre_checkpoint
|
||||
print(f"\nDRY RUN complete. ~{est:,} vectors would be migrated.")
|
||||
print(f"Estimated Gemini Flash cost: ~${est * 0.0004:.2f}")
|
||||
return
|
||||
|
||||
# ── Full migration run (streaming) ──────────────────────────────────────
|
||||
done, skipped_ckpt, skipped_no_skill, results, start = stream_and_process(
|
||||
qdrant, collection, rotator, checkpoint, args.workers, args.limit
|
||||
)
|
||||
|
||||
elapsed = time.time() - start
|
||||
log.info(f"\nComplete in {elapsed/60:.1f}min:")
|
||||
log.info(f" Processed: {done:,}")
|
||||
log.info(f" Skipped (checkpoint): {skipped_ckpt:,}")
|
||||
log.info(f" Skipped (no skill): {skipped_no_skill:,}")
|
||||
for status, count in sorted(results.items(), key=lambda x: -x[1]):
|
||||
log.info(f" {status:<30} {count:>10,}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
227
scripts/rebuild_qdrant.py
Executable file
|
|
@ -0,0 +1,227 @@
|
|||
"""
|
||||
RECON Qdrant Rebuilder — patched for headless parallel execution
|
||||
|
||||
Deletes and recreates the Qdrant collection, then re-embeds ALL concept JSONs
|
||||
from disk using parallel workers. Pass --confirm to skip interactive prompt.
|
||||
|
||||
Usage:
|
||||
python3 scripts/rebuild_qdrant.py --confirm [--workers 8]
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
import argparse
|
||||
import threading
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from collections import defaultdict
|
||||
|
||||
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
|
||||
import requests as http_requests
|
||||
from qdrant_client import QdrantClient
|
||||
from qdrant_client.models import VectorParams, Distance, PointStruct
|
||||
|
||||
from lib.utils import get_config, concept_id, setup_logging
|
||||
from lib.status import StatusDB
|
||||
|
||||
logger = setup_logging('recon.rebuild')
|
||||
|
||||
|
||||
def embed_content(config, content):
|
||||
try:
|
||||
tei_url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/embed"
|
||||
resp = http_requests.post(tei_url, json={"inputs": content}, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()[0]
|
||||
except Exception as tei_err:
|
||||
logger.debug(f"TEI failed, trying Ollama: {tei_err}")
|
||||
|
||||
ollama_url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/embed"
|
||||
resp = http_requests.post(ollama_url, json={
|
||||
"model": config['embedding']['model'],
|
||||
"input": content
|
||||
}, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()['embeddings'][0]
|
||||
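`embed_content` tries one embedding backend and falls back to a second if the first raises. A generic sketch of that ordering, with toy callables in place of the HTTP requests; `embed_with_fallback`, `broken_primary` and `toy_secondary` are hypothetical names, and unlike the script this version wraps the final failure in a `RuntimeError`:

```python
def embed_with_fallback(text, backends):
    """Try each (name, embed_fn) in order; return (name, vector) from the first that works."""
    last_error = None
    for name, embed_fn in backends:
        try:
            return name, embed_fn(text)
        except Exception as err:
            last_error = err  # remember why this backend failed, then try the next
    raise RuntimeError(f"all embedding backends failed: {last_error}")


def broken_primary(text):
    raise ConnectionError("primary unavailable")


def toy_secondary(text):
    # Not a real embedding, just a deterministic placeholder vector
    return [float(len(text)), 0.0]


used, vector = embed_with_fallback(
    "hello", [("primary", broken_primary), ("secondary", toy_secondary)]
)
print(used, vector)
```

Returning the backend name alongside the vector makes it possible to notice when a run silently spent most of its time on the fallback, which matters if the two backends do not produce comparable vectors.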
|
||||
|
||||
def process_doc(doc_hash, config, db, qdrant, collection):
|
||||
"""Embed and upsert all concepts for a single document. Returns (inserted, failed)."""
|
||||
doc_dir = os.path.join(config['paths']['concepts'], doc_hash)
|
||||
doc = db.get_document(doc_hash)
|
||||
filename = doc['filename'] if doc else doc_hash[:8]
|
||||
|
||||
window_files = sorted([
|
||||
f for f in os.listdir(doc_dir)
|
||||
if f.startswith('window_') and f.endswith('.json')
|
||||
])
|
||||
|
||||
all_concepts = []
|
||||
for wf in window_files:
|
||||
path = os.path.join(doc_dir, wf)
|
||||
try:
|
||||
with open(path, encoding='utf-8') as f:
|
||||
concepts = json.load(f)
|
||||
if isinstance(concepts, list):
|
||||
all_concepts.extend(concepts)
|
||||
except Exception as e:
|
||||
logger.warning(f"Skipping corrupted window {wf} in {doc_hash}: {e}")
|
||||
|
||||
if not all_concepts:
|
||||
return 0, 0
|
||||
|
||||
is_web = doc.get('path', '').startswith(('http://', 'https://')) if doc else False
|
||||
|
||||
# Check meta.json for explicit source_type (e.g. 'transcript')
|
||||
source_type = 'web' if is_web else 'document'
|
||||
text_dir = os.path.join(config['paths']['text'], doc_hash)
|
||||
meta_path = os.path.join(text_dir, 'meta.json')
|
||||
if os.path.exists(meta_path):
|
||||
try:
|
||||
with open(meta_path) as mf:
|
||||
meta = json.load(mf)
|
||||
if meta.get('source_type'):
|
||||
source_type = meta['source_type']
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
points = []
|
||||
failed = 0
|
||||
batch_size = config['processing']['embed_batch_size']
|
||||
|
||||
for idx, concept in enumerate(all_concepts):
|
||||
content = concept.get('content', '')
|
||||
if not content or len(content.strip()) < 10:
|
||||
continue
|
||||
try:
|
||||
vector = embed_content(config, content)
|
||||
except Exception as e:
|
||||
logger.warning(f"Embedding failed {doc_hash}:{idx}: {e}")
|
||||
failed += 1
|
||||
continue
|
||||
|
||||
start_page = concept.get('_start_page', 0)
|
||||
point_id = concept_id(doc_hash, start_page, idx)
|
||||
|
||||
payload = {
|
||||
'doc_hash': doc_hash,
|
||||
'filename': filename,
|
||||
'book_title': doc.get('book_title', '') if doc else '',
|
||||
'book_author': doc.get('book_author', '') if doc else '',
|
||||
'source_type': source_type,
|
||||
'verification_status': 'unverified',
|
||||
'credibility_score': 0.7,
|
||||
'language': 'en',
|
||||
}
|
||||
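        # 'knowledge_type' and 'complexity' replace 'skill_level' in concepts that
        # the skill_level migration has rewritten; copy whichever fields are present.
        for field in ['content', 'summary', 'title', 'domain', 'subdomain',
                      'keywords', 'skill_level', 'knowledge_type', 'complexity',
                      'key_facts', 'scenario_applicable',
                      'cross_domain_tags', 'chapter', 'page_ref', 'notes',
                      '_window', '_start_page']: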
|
||||
if field in concept:
|
||||
payload[field] = concept[field]
|
||||
|
||||
points.append(PointStruct(id=point_id, vector=vector, payload=payload))
|
||||
|
||||
if len(points) >= batch_size:
|
||||
qdrant.upsert(collection_name=collection, points=points)
|
||||
points = []
|
||||
|
||||
if points:
|
||||
qdrant.upsert(collection_name=collection, points=points)
|
||||
|
||||
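    # Count only concepts that were actually embedded; empty or very short ones are skipped above
    eligible = sum(1 for c in all_concepts if len((c.get('content') or '').strip()) >= 10)
    inserted = eligible - failed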
|
||||
if doc:
|
||||
db.update_status(doc_hash, 'complete', vectors_inserted=inserted)
|
||||
|
||||
return inserted, failed
|
||||
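`process_doc` buffers points and upserts them whenever the buffer reaches `embed_batch_size`, with one final upsert for the remainder. The same buffering, reduced to a sketch with a plain callable as the sink; `flush_in_batches` is a hypothetical helper, and its return value counts only what was actually handed to the sink:

```python
def flush_in_batches(items, batch_size, sink):
    """Send items to `sink` in lists of at most `batch_size`; return how many were sent."""
    buffer = []
    flushed = 0
    for item in items:
        buffer.append(item)
        if len(buffer) >= batch_size:
            sink(buffer)
            flushed += len(buffer)
            buffer = []
    if buffer:  # remainder smaller than one full batch
        sink(buffer)
        flushed += len(buffer)
    return flushed


calls = []
total = flush_in_batches(range(7), 3, lambda batch: calls.append(list(batch)))
print(total, calls)
```

Deriving the reported count from what was flushed, rather than from the size of the input, keeps status figures honest when some inputs are skipped before they reach the buffer.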
|
||||
|
||||
def run_rebuild(workers=8):
|
||||
config = get_config()
|
||||
db = StatusDB()
|
||||
|
||||
qdrant = QdrantClient(
|
||||
host=config['vector_db']['host'],
|
||||
port=config['vector_db']['port'],
|
||||
timeout=60
|
||||
)
|
||||
collection = config['vector_db']['collection']
|
||||
|
||||
# Delete and recreate
|
||||
try:
|
||||
qdrant.delete_collection(collection)
|
||||
logger.info(f"Deleted collection: {collection}")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
qdrant.create_collection(
|
||||
collection_name=collection,
|
||||
vectors_config=VectorParams(
|
||||
size=config['embedding']['dimensions'],
|
||||
distance=Distance.COSINE
|
||||
)
|
||||
)
|
||||
logger.info(f"Created collection: {collection} ({config['embedding']['dimensions']}d, Cosine)")
|
||||
|
||||
concepts_root = config['paths']['concepts']
|
||||
doc_dirs = sorted([
|
||||
d for d in os.listdir(concepts_root)
|
||||
if os.path.isdir(os.path.join(concepts_root, d))
|
||||
])
|
||||
logger.info(f"Found {len(doc_dirs)} document concept directories | {workers} workers")
|
||||
|
||||
total_inserted = 0
|
||||
total_failed = 0
|
||||
done = 0
|
||||
lock = threading.Lock()
|
||||
start = time.time()
|
||||
|
||||
with ThreadPoolExecutor(max_workers=workers) as ex:
|
||||
futures = {
|
||||
ex.submit(process_doc, h, config, StatusDB(), qdrant, collection): h
|
||||
for h in doc_dirs
|
||||
}
|
||||
for future in as_completed(futures):
|
||||
doc_hash = futures[future]
|
||||
try:
|
||||
inserted, failed = future.result()
|
||||
except Exception as e:
|
||||
logger.error(f"Worker error {doc_hash}: {e}")
|
||||
inserted, failed = 0, 0
|
||||
|
||||
with lock:
|
||||
total_inserted += inserted
|
||||
total_failed += failed
|
||||
done += 1
|
||||
if done % 500 == 0:
|
||||
elapsed = time.time() - start
|
||||
rate = total_inserted / elapsed if elapsed > 0 else 0
|
||||
remaining = (len(doc_dirs) - done) / (done / elapsed) if elapsed > 0 else 0
|
||||
logger.info(
|
||||
f" [{done}/{len(doc_dirs)}] "
|
||||
f"{total_inserted:,} vectors | "
|
||||
f"{rate:.0f}/sec | "
|
||||
f"ETA {remaining/60:.0f}min"
|
||||
)
|
||||
|
||||
elapsed = time.time() - start
|
||||
logger.info(f"\nRebuild complete in {elapsed/60:.1f} min: "
|
||||
f"{total_inserted:,} inserted, {total_failed:,} failed")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--confirm', action='store_true', help='Skip interactive prompt')
|
||||
parser.add_argument('--workers', type=int, default=8)
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.confirm:
|
||||
print("WARNING: This will DELETE and RECREATE the Qdrant collection.")
|
||||
confirm = input("Type 'REBUILD' to proceed: ")
|
||||
if confirm != 'REBUILD':
|
||||
print("Aborted.")
|
||||
sys.exit(0)
|
||||
|
||||
run_rebuild(workers=args.workers)
|
||||
314
scripts/reenrich_reference.py
Executable file
|
|
@ -0,0 +1,314 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
reenrich_reference.py — Re-classifies all remaining Reference-tagged concepts.
|
||||
|
||||
Scrolls Qdrant for vectors with domain == ["Reference"] or containing "Reference",
|
||||
calls Gemini with a hardened prompt that rejects Reference as a valid response,
|
||||
updates both Qdrant payload and concept JSON on disk.
|
||||
|
||||
Usage:
|
||||
python3 /opt/recon/scripts/reenrich_reference.py [--dry-run] [--workers 16] [--limit N]
|
||||
"""
|
||||
|
||||
import json
|
||||
import time
|
||||
import random
|
||||
import logging
|
||||
import argparse
|
||||
import threading
|
||||
from pathlib import Path
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from collections import defaultdict
|
||||
|
||||
import google.generativeai as genai
|
||||
from qdrant_client import QdrantClient
|
||||
from qdrant_client.models import FieldCondition, MatchAny, Filter
|
||||
|
||||
import sys
|
||||
sys.path.insert(0, '/opt/recon')
|
||||
from lib.utils import get_config, setup_logging
|
||||
|
||||
LOG_FILE = Path("/opt/recon/logs/reenrich_reference.log")
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s %(levelname)s %(message)s",
|
||||
handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
|
||||
)
|
||||
log = logging.getLogger("reenrich_reference")
|
||||
|
||||
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
|
||||
|
||||
CANONICAL_DOMAINS = {
|
||||
"Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
|
||||
"Foundational Skills", "Communications", "Medical", "Food Systems",
|
||||
"Navigation", "Logistics", "Power Systems", "Leadership",
|
||||
"Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
|
||||
}
|
||||
|
||||
# Hardened prompt — Reference explicitly forbidden, classification rules detailed
|
||||
CLASSIFY_PROMPT = """\
|
||||
You are a knowledge classification engine. Classify this concept into its correct domain.
|
||||
|
||||
VALID DOMAINS — use ONLY these exact strings:
|
||||
Defense & Tactics
|
||||
Sustainment Systems
|
||||
Off-Grid Systems
|
||||
Foundational Skills
|
||||
Communications
|
||||
Medical
|
||||
Food Systems
|
||||
Navigation
|
||||
Logistics
|
||||
Power Systems
|
||||
Leadership
|
||||
Scenario Playbooks
|
||||
Water Systems
|
||||
Security
|
||||
Community Coordination
|
||||
|
||||
FORBIDDEN: Do NOT output "Reference" under any circumstances. It is not a valid domain.
|
||||
FORBIDDEN: Do NOT output an empty domain list.
|
||||
|
||||
CLASSIFICATION RULES:
|
||||
- First aid, anatomy, pharmacology, herbs, veterinary, austere medicine, wound care → Medical
|
||||
- Food growing, foraging, hunting, fishing, animal husbandry, livestock → Sustainment Systems
|
||||
- Food preservation, canning, fermentation, food storage, dehydrating → Food Systems
|
||||
- Solar, wind, hydro, batteries, generators, inverters, charge controllers → Power Systems
|
||||
- Water sourcing, filtration, purification, sanitation, wells, rainwater → Water Systems
|
||||
- Radio, antennas, mesh networking, SIGINT, amateur radio → Communications
|
||||
- Weapons, tactics, NBC, security operations, field craft → Defense & Tactics
|
||||
- Permaculture, soil science, agroforestry, composting → Sustainment Systems
|
||||
- Shelter, construction, masonry, blacksmithing, woodworking, crafts → Foundational Skills
|
||||
- Navigation, land nav, celestial nav, map reading, compass → Navigation
|
||||
- Emergency planning, disaster prep, scenario planning → Scenario Playbooks
|
||||
- Leadership, governance, community organization → Leadership
|
||||
- Supply chain, transportation, inventory → Logistics
|
||||
- Physical security, perimeter, surveillance → Security
|
||||
- Community building, cooperation, mutual aid → Community Coordination
|
||||
- Biogas, wood gasification, rocket stoves, appropriate technology → Off-Grid Systems
|
||||
|
||||
If uncertain between two domains, pick the most actionable one for a self-reliant household.
|
||||
|
||||
Concept title: {title}
|
||||
Concept subdomain tags: {subdomain}
|
||||
Concept content: {content}
|
||||
|
||||
Return ONLY valid JSON, no markdown, no explanation:
|
||||
{{"domain": ["Domain Name"]}}
|
||||
"""
|
||||
|
||||
def load_gemini_keys():
|
||||
keys = []
|
||||
for line in Path("/opt/recon/.env").read_text().splitlines():
|
||||
if line.startswith("GEMINI_KEY_"):
|
||||
keys.append(line.split("=", 1)[1].strip())
|
||||
return keys
|
||||
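`load_gemini_keys` collects every `.env` line that starts with `GEMINI_KEY_` and takes the text after the first `=`. A sketch of that parsing on an in-memory string; `parse_prefixed_keys` is a hypothetical name, the key values are made up, and skipping empty values is an added assumption that the original function does not make:

```python
def parse_prefixed_keys(env_text, prefix="GEMINI_KEY_"):
    """Collect non-empty values of all NAME=value lines whose NAME starts with `prefix`."""
    keys = []
    for line in env_text.splitlines():
        if line.startswith(prefix) and "=" in line:
            value = line.split("=", 1)[1].strip()
            if value:  # assumption: an empty value is a placeholder, not a key
                keys.append(value)
    return keys


sample = (
    "# comment\n"
    "GEMINI_KEY_1=abc123\n"
    "OTHER=zzz\n"
    "GEMINI_KEY_2= def456 \n"
    "GEMINI_KEY_3=\n"
)
print(parse_prefixed_keys(sample))
```

Neither this sketch nor the original strips surrounding quotes, so a line such as `GEMINI_KEY_1="abc"` would yield a value that still includes the quote characters.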
|
||||
class KeyRotator:
|
||||
def __init__(self, keys):
|
||||
self.keys = keys
|
||||
self._i = 0
|
||||
self._lock = threading.Lock()
|
||||
def next(self):
|
||||
with self._lock:
|
||||
key = self.keys[self._i % len(self.keys)]
|
||||
self._i += 1
|
||||
return key
|
||||
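`KeyRotator` hands out keys round-robin, and the lock makes the index update safe when many worker threads call `next()` at once. A small sketch showing that the distribution stays exactly even under concurrency; `RoundRobin` is a hypothetical stand-in with placeholder key strings:

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class RoundRobin:
    """Thread-safe round-robin selector over a fixed list of items."""

    def __init__(self, items):
        self._items = list(items)
        self._i = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:  # read and increment the index as one step
            item = self._items[self._i % len(self._items)]
            self._i += 1
            return item


rotator = RoundRobin(["key-a", "key-b", "key-c"])
with ThreadPoolExecutor(max_workers=8) as pool:
    picks = list(pool.map(lambda _: rotator.next(), range(300)))

usage = Counter(picks)
print(usage)
```

An empty item list raises `ZeroDivisionError` on the modulo, here and in the original class, so it may be worth checking that at least one key was loaded before starting workers.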
|
||||
def classify(title, subdomains, content, key, attempt=0):
|
||||
"""Call Gemini. Rejects Reference. Falls back to subdomain heuristic if needed."""
|
||||
prompt = CLASSIFY_PROMPT.format(
|
||||
title=title or "(untitled)",
|
||||
subdomain=", ".join(subdomains[:10]) if subdomains else "(none)",
|
||||
content=str(content)[:400] if content else "(none)",
|
||||
)
|
||||
genai.configure(api_key=key)
|
||||
model = genai.GenerativeModel(
|
||||
"gemini-2.0-flash",
|
||||
generation_config={"response_mime_type": "application/json"}
|
||||
)
|
||||
for retry in range(4):
|
||||
try:
|
||||
resp = model.generate_content(prompt)
|
||||
data = json.loads(resp.text)
|
||||
domains = [
|
||||
d for d in data.get("domain", [])
|
||||
if d in CANONICAL_DOMAINS # strips Reference automatically
|
||||
]
|
||||
if domains:
|
||||
return domains
|
||||
            # Gemini returned only Reference/unknown domains or an empty list:
            # fall through and retry with the same prompt (up to 4 attempts in total)
|
||||
except Exception as e:
|
||||
err = str(e).lower()
|
||||
if any(s in err for s in ["429", "quota", "rate", "503", "unavailable"]):
|
||||
time.sleep(min(5 * (2 ** retry) + random.uniform(0, 3), 60))
|
||||
else:
|
||||
break
|
||||
|
||||
# Last resort: subdomain keyword heuristic
|
||||
return subdomain_fallback(subdomains)
|
||||
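`classify` trusts the model only as far as the allow-list: anything outside `CANONICAL_DOMAINS`, including "Reference", is dropped, and an empty result triggers a retry or the fallback. A defensive sketch of that filtering step on raw JSON text; `parse_domains` is a hypothetical helper and `ALLOWED` is an illustrative subset of the real set:

```python
import json

ALLOWED = {"Medical", "Water Systems", "Navigation"}  # illustrative subset


def parse_domains(raw_json, allowed=ALLOWED):
    """Return the allow-listed domains found in a JSON reply, or [] if none are usable."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    domains = data.get("domain", [])
    if isinstance(domains, str):
        domains = [domains]  # tolerate a bare string instead of a list
    if not isinstance(domains, list):
        return []
    return [d for d in domains if isinstance(d, str) and d in allowed]


print(parse_domains('{"domain": ["Reference", "Medical"]}'))
print(parse_domains('{"domain": "Navigation"}'))
print(parse_domains("not json"))
```

Treating every malformed reply as an empty list gives the caller a single condition to test before retrying or falling back.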
|
||||
SUBDOMAIN_FALLBACK_MAP = [
|
||||
(["first aid", "trauma", "wound", "anatomy", "pharmacol", "herbal", "medicin", "veterinar", "dental", "surgery"], "Medical"),
|
||||
(["foraging", "hunting", "fishing", "livestock", "permaculture", "soil", "agroforestry", "mycolog", "mushroom"], "Sustainment Systems"),
|
||||
(["canning", "preservation", "fermentation", "food storage", "dehydrat"], "Food Systems"),
|
||||
(["solar", "battery", "generator", "inverter", "wind turbine", "photovoltaic"], "Power Systems"),
|
||||
(["water purif", "filtration", "sanitation", "well", "rainwater"], "Water Systems"),
|
||||
(["radio", "antenna", "mesh", "sigint", "amateur radio", "meshtastic"], "Communications"),
|
||||
(["weapon", "firearm", "tactic", "nbc", "chemical warfare", "ballistic"], "Defense & Tactics"),
|
||||
(["navigation", "compass", "land nav", "celestial"], "Navigation"),
|
||||
(["blacksmith", "woodwork", "masonry", "construct", "craft", "pottery"], "Foundational Skills"),
|
||||
(["biogas", "gasif", "rocket stove", "appropriate tech"], "Off-Grid Systems"),
|
||||
(["disaster", "emergency prep", "evacuation", "scenario"], "Scenario Playbooks"),
|
||||
(["leadership", "governance", "community"], "Leadership"),
|
||||
(["logistics", "supply chain", "transport"], "Logistics"),
|
||||
(["security", "perimeter", "surveillance"], "Security"),
|
||||
]
|
||||
|
||||
def subdomain_fallback(subdomains):
|
||||
combined = " ".join(s.lower() for s in subdomains)
|
||||
for keywords, domain in SUBDOMAIN_FALLBACK_MAP:
|
||||
if any(kw in combined for kw in keywords):
|
||||
return [domain]
|
||||
return ["Foundational Skills"] # absolute last resort
|
||||
|
||||
def update_concept_json(doc_hash, title, new_domains):
|
||||
"""Update domain in concept JSON files on disk."""
|
||||
doc_dir = CONCEPTS_DIR / doc_hash
|
||||
if not doc_dir.exists():
|
||||
return False
|
||||
for wf in doc_dir.glob("window_*.json"):
|
||||
try:
|
||||
with open(wf, "r", encoding="utf-8") as f:
|
||||
concepts = json.load(f)
|
||||
changed = False
|
||||
for c in concepts:
|
||||
if not isinstance(c, dict):
|
||||
continue
|
||||
if c.get("title") == title:
|
||||
raw = c.get("domain", [])
|
||||
if isinstance(raw, str):
|
||||
raw = [raw]
|
||||
if "Reference" in raw or not [d for d in raw if d in CANONICAL_DOMAINS]:
|
||||
c["domain"] = new_domains
|
||||
changed = True
|
||||
if changed:
|
||||
with open(wf, "w", encoding="utf-8") as f:
|
||||
json.dump(concepts, f, indent=2, ensure_ascii=False)
|
||||
return True
|
||||
except Exception:
|
||||
pass
|
||||
return False
|
||||
|
||||
def process_point(point, qdrant, collection, key_rotator, dry_run):
|
||||
payload = point.payload
|
||||
title = payload.get("title", "")
|
||||
subdomains = payload.get("subdomain", [])
|
||||
if isinstance(subdomains, str):
|
||||
subdomains = [subdomains]
|
||||
content = payload.get("content", payload.get("summary", ""))
|
||||
doc_hash = payload.get("doc_hash", "")
|
||||
|
||||
key = key_rotator.next()
|
||||
new_domains = classify(title, subdomains, content, key)
|
||||
|
||||
if dry_run:
|
||||
return "would_classify"
|
||||
|
||||
# Update Qdrant payload
|
||||
qdrant.set_payload(
|
||||
collection_name=collection,
|
||||
payload={"domain": new_domains},
|
||||
points=[point.id],
|
||||
)
|
||||
|
||||
# Update JSON on disk
|
||||
if doc_hash:
|
||||
update_concept_json(doc_hash, title, new_domains)
|
||||
|
||||
return "ok"
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
parser.add_argument("--workers", type=int, default=16)
|
||||
parser.add_argument("--limit", type=int, default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
config = get_config()
|
||||
keys = load_gemini_keys()
|
||||
rotator = KeyRotator(keys)
|
||||
|
||||
qdrant = QdrantClient(
|
||||
host=config['vector_db']['host'],
|
||||
port=config['vector_db']['port'],
|
||||
timeout=60
|
||||
)
|
||||
collection = config['vector_db']['collection']
|
||||
|
||||
log.info("Scrolling Qdrant for Reference-tagged concepts...")
|
||||
|
||||
# Scroll all points containing Reference in domain
|
||||
offset = None
|
||||
reference_points = []
|
||||
while True:
|
||||
results, offset = qdrant.scroll(
|
||||
collection_name=collection,
|
||||
scroll_filter=Filter(
|
||||
must=[FieldCondition(
|
||||
key="domain",
|
||||
match=MatchAny(any=["Reference"])
|
||||
)]
|
||||
),
|
||||
limit=1000,
|
||||
with_payload=True,
|
||||
with_vectors=False,
|
||||
offset=offset,
|
||||
)
|
||||
reference_points.extend(results)
|
||||
if offset is None:
|
||||
break
|
||||
if args.limit and len(reference_points) >= args.limit:
|
||||
reference_points = reference_points[:args.limit]
|
||||
break
|
||||
|
||||
total = len(reference_points)
|
||||
log.info(f"Found {total:,} Reference-tagged vectors")
|
||||
log.info(f"Workers: {args.workers} | Keys: {len(keys)} | Dry run: {args.dry_run}")
|
||||
log.info(f"Estimated Gemini Flash cost: ~${total * 0.0004:.2f}")
|
||||
|
||||
if args.dry_run:
|
||||
log.info(f"DRY RUN: would re-classify {total:,} concepts. Exiting.")
|
||||
return
|
||||
|
||||
results = defaultdict(int)
|
||||
lock = threading.Lock()
|
||||
done = 0
|
||||
start = time.time()
|
||||
|
||||
with ThreadPoolExecutor(max_workers=args.workers) as ex:
|
||||
futures = {
|
||||
ex.submit(process_point, p, qdrant, collection, rotator, False): p
|
||||
for p in reference_points
|
||||
}
|
||||
for future in as_completed(futures):
|
||||
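            try:
                status = future.result()
            except Exception as e:
                # One failed point should not abort the whole run
                status = f"error: {str(e)[:80]}"
                log.error(f"Worker error: {e}")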
|
||||
with lock:
|
||||
results[status] += 1
|
||||
done += 1
|
||||
if done % 5000 == 0:
|
||||
elapsed = time.time() - start
|
||||
rate = done / elapsed * 60
|
||||
eta = (total - done) / (done / elapsed) / 60
|
||||
log.info(f" {done:,}/{total:,} | {rate:.0f}/min | ETA {eta:.0f}min | {dict(results)}")
|
||||
time.sleep(0.02)
|
||||
|
||||
elapsed = time.time() - start
|
||||
log.info(f"\nComplete in {elapsed/60:.1f}min:")
|
||||
for status, count in sorted(results.items(), key=lambda x: -x[1]):
|
||||
log.info(f" {status:<20} {count:>10,}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
315
scripts/repair_corrupted.py
Executable file
|
|
@ -0,0 +1,315 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
repair_corrupted.py — Repairs window files corrupted by concurrent writes.
|
||||
|
||||
Strategy:
|
||||
1. Read corrupted_windows.txt to get the list of bad files
|
||||
2. For each bad file, identify the parent doc hash from the path
|
||||
3. Check if the text directory still exists for that doc
|
||||
4. If yes: re-run Gemini enrichment on just that window
|
||||
5. If no text: mark as unrecoverable
|
||||
6. Report summary
|
||||
|
||||
Usage:
|
||||
python3 /opt/recon/scripts/repair_corrupted.py [--dry-run] [--workers 8]
|
||||
"""
|
||||
|
||||
import json
import time
import random
import logging
import argparse
import re
import threading
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import defaultdict

import google.generativeai as genai

CORRUPTED_LIST = Path("/opt/recon/data/corrupted_windows.txt")
TEXT_DIR = Path("/opt/recon/data/text")
CONCEPTS_DIR = Path("/opt/recon/data/concepts")
LOG_FILE = Path("/opt/recon/logs/repair_corrupted.log")
UNRECOVERABLE_LOG = Path("/opt/recon/data/unrecoverable_windows.txt")

# logs/ is git-ignored, so it may not exist on a fresh checkout;
# FileHandler raises if the parent directory is missing.
LOG_FILE.parent.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(LOG_FILE),
        logging.StreamHandler(),
    ]
)
log = logging.getLogger("repair_corrupted")

CANONICAL_DOMAINS = [
    "Defense & Tactics", "Sustainment Systems", "Off-Grid Systems",
    "Foundational Skills", "Communications", "Medical", "Food Systems",
    "Navigation", "Logistics", "Power Systems", "Leadership",
    "Scenario Playbooks", "Water Systems", "Security", "Community Coordination"
]

ENRICH_PROMPT = """Extract knowledge concepts from this document text.

A concept is a SELF-CONTAINED piece of knowledge that can stand alone.

For each concept, provide ALL fields:

Required:
- content: Full text of the concept (complete procedure, definition, etc.)
- summary: 1-2 sentence summary
- title: Brief descriptive title
- domain: Array of 1-5 from ONLY these exact strings (no others):
  Defense & Tactics, Sustainment Systems, Off-Grid Systems, Foundational Skills,
  Communications, Medical, Food Systems, Navigation, Logistics, Power Systems,
  Leadership, Scenario Playbooks, Water Systems, Security, Community Coordination
  CRITICAL: Do NOT use "Reference". Every concept belongs somewhere specific.
- subdomain: Array of specific subcategories (up to 10)
- keywords: Array of 3-30 searchable terms
- skill_level: novice | intermediate | advanced
- key_facts: Array of specific extractable claims, measurements, data points

Optional (include when present):
- scenario_applicable: Array from: tuesday_prepper, month_prepper, year_prepper, multi_year, eotwawki
- cross_domain_tags: Array from: sustainment, medical, security, communications, leadership, logistics, navigation, power_systems, water_systems, food_systems, tactical_ops, community_coordination
- chapter: Chapter name if identifiable
- page_ref: Page reference

Return JSON array. If no extractable concepts, return [].

Document text:
"""

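def load_gemini_keys():
    """Read GEMINI_KEY_* entries from /opt/recon/.env, skipping blank values."""
    env = Path("/opt/recon/.env")
    keys = []
    for line in env.read_text().splitlines():
        line = line.strip()
        # Require an "=" so a bare "GEMINI_KEY_1" line cannot raise IndexError
        if line.startswith("GEMINI_KEY_") and "=" in line:
            value = line.split("=", 1)[1].strip()
            if value:
                keys.append(value)
    return keys
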
class KeyRotator:
    """Thread-safe round-robin over the configured API keys."""

    def __init__(self, keys):
        self.keys = keys
        self._i = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            # Fail with a clear message instead of a ZeroDivisionError on an empty list
            if not self.keys:
                raise RuntimeError("No Gemini API keys configured")
            key = self.keys[self._i % len(self.keys)]
            self._i += 1
            return key

def repair_json_truncated(text):
    """Last-ditch attempt to salvage a truncated JSON array."""
    # Strip control characters, then trailing commas before a closing bracket
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
    text = re.sub(r',\s*([}\]])', r'\1', text)
    try:
        return json.loads(text)
    except Exception:
        pass

    # Find the end of the last complete outermost object, ignoring braces inside strings
    last_close = -1
    depth = 0
    in_str = False
    esc = False
    for i, ch in enumerate(text):
        if esc:
            esc = False
            continue
        if ch == '\\' and in_str:
            esc = True
            continue
        if ch == '"':
            in_str = not in_str
            continue
        if in_str:
            continue
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                last_close = i

    if last_close > 0:
        trimmed = text[:last_close + 1].rstrip().rstrip(',')
        # Close arrays left open by the truncation (approximate: brackets inside strings are counted too)
        open_brackets = trimmed.count('[') - trimmed.count(']')
        try:
            return json.loads(trimmed + ']' * open_brackets)
        except Exception:
            pass
    return None

def enrich_window_text(text, key):
    """Call Gemini on raw window text; return a concepts list, or None on failure."""
    # NOTE: genai.configure() appears to set process-wide state, so with several
    # worker threads a request may go out with a key configured by another thread.
    genai.configure(api_key=key)
    model = genai.GenerativeModel(
        "gemini-2.0-flash",
        generation_config={"response_mime_type": "application/json"}
    )
    for attempt in range(4):
        try:
            resp = model.generate_content(ENRICH_PROMPT + text)
            raw = resp.text
            try:
                result = json.loads(raw)
            except Exception:
                result = repair_json_truncated(raw)
            if isinstance(result, list):
                return [c for c in result if isinstance(c, dict)]
            elif isinstance(result, dict):
                return [result]
            if result is None:
                # Unparseable response: count it as a failed attempt instead of
                # returning [] and overwriting the window with "no concepts"
                log.warning("  Unparseable Gemini response, retrying")
                continue
            return []
        except Exception as e:
            err = str(e).lower()
            # Whole-word match: a bare substring test for "rate" also matches "generate"
            if re.search(r"\b(429|quota|rate|503|unavailable)\b", err):
                delay = min(5 * (2 ** attempt) + random.uniform(0, 3), 60)
                time.sleep(delay)
            else:
                log.warning(f"  Non-transient error: {e}")
                break
    return None  # failed

def get_window_text(doc_hash, window_filename):
    """Reconstruct window text from page files."""
    # Window filename: window_NNNN.json -> window index is NNNN (1-based on disk)
    try:
        w_idx = int(Path(window_filename).stem.split('_')[1]) - 1
    except (IndexError, ValueError):
        return None

    text_path = TEXT_DIR / doc_hash
    if not text_path.exists():
        return None

    # Lexicographic sort: correct only if page numbers in the file names are zero-padded
    page_files = sorted([
        f for f in text_path.iterdir()
        if f.name.startswith('page_') and f.name.endswith('.txt')
    ])
    if not page_files:
        return None

    # Re-derive which pages this window covered (window_size=5 from config)
    window_size = 5
    start = w_idx * window_size
    window_pages = page_files[start:start + window_size]
    if not window_pages:
        return None

    parts = []
    for j, pf in enumerate(window_pages):
        try:
            text = pf.read_text(encoding='utf-8')
            parts.append(f"--- Page {start + j + 1} ---\n{text}")
        except Exception:
            # Unreadable pages are skipped; the window is rebuilt from the rest
            pass
    return "\n\n".join(parts) if parts else None

def repair_file(corrupted_path, key_rotator, dry_run):
    """Attempt to repair a single corrupted window file."""
    path = Path(corrupted_path)

    # Sanity check: the file may already parse (e.g. it was rewritten since the scan)
    try:
        with open(path, encoding='utf-8') as f:
            json.load(f)
        return "already_valid"
    except Exception:
        pass

    # Extract doc hash and window name from path structure
    # Expected: /opt/recon/data/concepts/{hash}/window_NNNN.json
    doc_hash = path.parent.name
    window_filename = path.name

    # Get source text for this window
    window_text = get_window_text(doc_hash, window_filename)
    if not window_text:
        return "no_source_text"

    if dry_run:
        return "would_repair"

    # Re-enrich from source text
    key = key_rotator.next()
    concepts = enrich_window_text(window_text, key)

    if concepts is None:
        return "enrichment_failed"

    # Tag concepts with metadata
    try:
        w_idx = int(Path(window_filename).stem.split('_')[1]) - 1
        window_size = 5
        start_page = w_idx * window_size + 1
    except Exception:
        w_idx = 0
        start_page = 0

    for c in concepts:
        c['_window'] = w_idx + 1
        c['_start_page'] = start_page
        c['_doc_hash'] = doc_hash
        c['_repaired'] = True

    # Write repaired file
    try:
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(concepts, f, indent=2, ensure_ascii=False)
        return "repaired"
    except Exception as e:
        log.warning(f"  Write failed for {path}: {e}")
        return "write_error"

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--workers", type=int, default=8)
    args = parser.parse_args()

    if not CORRUPTED_LIST.exists():
        log.error(f"Corrupted list not found: {CORRUPTED_LIST}")
        log.error("Run Task 1 first to generate it.")
        return

    keys = load_gemini_keys()
    if not keys and not args.dry_run:
        log.error("No GEMINI_KEY_* entries found in /opt/recon/.env")
        return
    rotator = KeyRotator(keys)

    corrupted = []
    with open(CORRUPTED_LIST) as f:
        for line in f:
            parts = line.strip().split('\t')
            # split() returns [''] for a blank line, so check the first field too
            if parts and parts[0]:
                corrupted.append(parts[0])

    log.info(f"Repairing {len(corrupted):,} corrupted window files")
    log.info(f"Dry run: {args.dry_run} | Workers: {args.workers} | Keys: {len(keys)}")

    results = defaultdict(int)
    unrecoverable = []
    lock = threading.Lock()

    with ThreadPoolExecutor(max_workers=args.workers) as ex:
        futures = {ex.submit(repair_file, p, rotator, args.dry_run): p for p in corrupted}
        done = 0
        for future in as_completed(futures):
            path = futures[future]
            try:
                status = future.result()
            except Exception as e:
                # One failing worker should not abort the whole run
                log.warning(f"  Unexpected error for {path}: {e}")
                status = "error"
            with lock:
                results[status] += 1
                if status in ("no_source_text", "enrichment_failed", "write_error", "error"):
                    unrecoverable.append((path, status))
                done += 1
            if done % 100 == 0:
                log.info(f"  {done:,}/{len(corrupted):,} | {dict(results)}")
            time.sleep(0.05)

    log.info("── Results ─────────────────────────────────────────────────")
    for status, count in sorted(results.items(), key=lambda x: -x[1]):
        log.info(f"  {status:<25} {count:>8,}")

    if unrecoverable:
        with open(UNRECOVERABLE_LOG, 'w') as f:
            for path, reason in unrecoverable:
                f.write(f"{path}\t{reason}\n")
        log.info(f"\n  Unrecoverable: {len(unrecoverable)}; logged to {UNRECOVERABLE_LOG}")
    elif args.dry_run:
        log.info("\n  Dry run complete; no files were modified.")
    else:
        log.info("\n  All files repaired successfully.")


if __name__ == "__main__":
    main()
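
The retry delay used in `enrich_window_text` is `min(5 * 2**attempt + uniform(0, 3), 60)`. As a rough illustration only, the sketch below computes the lower and upper bound of that delay for each of the four attempts. The helper name `backoff_bounds` and its parameters are hypothetical and not part of the script.

```python
def backoff_bounds(attempt, base=5, jitter=3, cap=60):
    """Return (min_delay, max_delay) in seconds for a 0-based retry attempt.

    Mirrors min(base * 2**attempt + uniform(0, jitter), cap) by evaluating
    the two extremes of the jitter term. Hypothetical helper for illustration.
    """
    exponential = base * (2 ** attempt)
    return min(exponential, cap), min(exponential + jitter, cap)


for attempt in range(4):
    print(attempt, backoff_bounds(attempt))
```

With the script's constants, the fourth and last attempt still waits under the 60 second cap, so the cap only matters if the attempt count is raised.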
|
||||
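
The window-to-page arithmetic in `get_window_text` and `repair_file` can be sketched as follows. This is a minimal illustration, assuming the hard-coded `window_size = 5` and the `window_NNNN.json` naming shown above; `window_page_range` is a hypothetical helper, not a function in the repository.

```python
from pathlib import Path

WINDOW_SIZE = 5  # assumption: matches the hard-coded window_size in the script


def window_page_range(window_filename, window_size=WINDOW_SIZE):
    """Return (first_page, last_page), 1-based and inclusive, for window_NNNN.json."""
    # File names are 1-based (window_0001.json), the slice index is 0-based
    w_idx = int(Path(window_filename).stem.split("_")[1]) - 1
    first = w_idx * window_size + 1
    return first, first + window_size - 1


print(window_page_range("window_0001.json"))
print(window_page_range("window_0003.json"))
```

The last window of a document may cover fewer pages than this nominal range, since the script simply slices whatever page files exist.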
178
scripts/validate.py
Executable file
|
|
@ -0,0 +1,178 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
RECON Pipeline Validator
|
||||
|
||||
Checks pipeline consistency: paths, DB state, file integrity, and service connectivity.
|
||||
Validates TEI, Ollama, and Qdrant are reachable. Deep mode checks every document on disk.
|
||||
|
||||
Usage: python3 scripts/validate.py [--deep]
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
|
||||
from lib.utils import get_config, setup_logging
|
||||
from lib.status import StatusDB
|
||||
|
||||
logger = setup_logging('recon.validate')
|
||||
|
||||
|
||||
def run_validation(deep=False):
|
||||
config = get_config()
|
||||
db = StatusDB()
|
||||
|
||||
issues = []
|
||||
warnings = []
|
||||
|
||||
print("=== RECON Validation ===\n")
|
||||
|
||||
# Check paths
|
||||
for name, path in config['paths'].items():
|
||||
if name == 'db':
|
||||
if not os.path.exists(path):
|
||||
issues.append(f"Database not found: {path}")
|
||||
else:
|
||||
if not os.path.exists(path):
|
||||
warnings.append(f"Directory missing: {name} = {path}")
|
||||
|
||||
# Check library
|
||||
if not os.path.exists(config['library_root']):
|
||||
issues.append(f"Library root not found: {config['library_root']}")
|
||||
|
||||
# Check Gemini keys
|
||||
keys = config.get('gemini_keys', [])
|
||||
if not keys:
|
||||
warnings.append("No Gemini API keys configured in .env")
|
||||
else:
|
||||
print(f" Gemini keys: {len(keys)} configured")
|
||||
|
||||
# DB status counts
|
||||
counts = db.get_status_counts()
|
||||
cat = counts.get('catalogue', {})
|
||||
doc = counts.get('documents', {})
|
||||
|
||||
print(f" Catalogue: {sum(cat.values())} entries")
|
||||
print(f" Documents: {sum(doc.values())} entries")
|
||||
print(f" Complete: {doc.get('complete', 0)}")
|
||||
print(f" Failed: {doc.get('failed', 0)}")
|
||||
|
||||
if deep:
|
||||
print("\n--- Deep Validation ---\n")
|
||||
|
||||
# Check every document in pipeline has corresponding files
|
||||
all_docs = db.get_all_documents()
|
||||
text_dir = config['paths']['text']
|
||||
concepts_dir = config['paths']['concepts']
|
||||
|
||||
for d in all_docs:
|
||||
h = d['hash']
|
||||
status = d['status']
|
||||
|
||||
if status in ('extracted', 'enriched', 'complete'):
|
||||
doc_text_dir = os.path.join(text_dir, h)
|
||||
if not os.path.exists(doc_text_dir):
|
||||
issues.append(f"[{h[:8]}] {d['filename']}: text dir missing but status={status}")
|
||||
elif deep:
|
||||
pages = [f for f in os.listdir(doc_text_dir) if f.startswith('page_')]
|
||||
if not pages:
|
||||
issues.append(f"[{h[:8]}] {d['filename']}: no page files in text dir")
|
||||
|
||||
if status in ('enriched', 'complete'):
|
||||
doc_concepts_dir = os.path.join(concepts_dir, h)
|
||||
if not os.path.exists(doc_concepts_dir):
|
||||
issues.append(f"[{h[:8]}] {d['filename']}: concepts dir missing but status={status}")
|
||||
elif deep:
|
||||
windows = [f for f in os.listdir(doc_concepts_dir) if f.startswith('window_')]
|
||||
if not windows:
|
||||
issues.append(f"[{h[:8]}] {d['filename']}: no window files in concepts dir")
|
||||
else:
|
||||
for wf in windows:
|
||||
try:
|
||||
with open(os.path.join(doc_concepts_dir, wf)) as f:
|
||||
data = json.load(f)
|
||||
if not isinstance(data, list):
|
||||
issues.append(f"[{h[:8]}] {wf}: not a JSON array")
|
||||
except (json.JSONDecodeError, UnicodeDecodeError):
|
||||
issues.append(f"[{h[:8]}] {wf}: invalid JSON")
|
||||
|
||||
# Check orphaned directories
|
||||
if os.path.exists(text_dir):
|
||||
doc_hashes = {d['hash'] for d in all_docs}
|
||||
for dirname in os.listdir(text_dir):
|
||||
if dirname not in doc_hashes:
|
||||
warnings.append(f"Orphaned text dir: {dirname}")
|
||||
|
||||
if os.path.exists(concepts_dir):
|
||||
for dirname in os.listdir(concepts_dir):
|
||||
if dirname not in doc_hashes:
|
||||
warnings.append(f"Orphaned concepts dir: {dirname}")
|
||||
|
||||
print(f" Checked {len(all_docs)} documents")
|
||||
|
||||
# Connectivity checks
|
||||
print("\n--- Connectivity ---\n")
|
||||
|
||||
import requests as http_requests
|
||||
|
||||
# Check TEI (primary embedding backend)
|
||||
try:
|
||||
tei_url = f"http://{config['embedding']['tei_host']}:{config['embedding']['tei_port']}/info"
|
||||
resp = http_requests.get(tei_url, timeout=10)
|
||||
if resp.status_code == 200:
|
||||
print(f" TEI: OK (bge-m3 at {config['embedding']['tei_host']}:{config['embedding']['tei_port']})")
|
||||
else:
|
||||
issues.append(f"TEI: HTTP {resp.status_code}")
|
||||
except Exception as e:
|
||||
issues.append(f"TEI: unreachable ({e})")
|
||||
|
||||
# Check Ollama (fallback)
|
||||
try:
|
||||
ollama_url = f"http://{config['embedding']['ollama_host']}:{config['embedding']['ollama_port']}/api/tags"
|
||||
resp = http_requests.get(ollama_url, timeout=10)
|
||||
if resp.status_code == 200:
|
||||
print(f" Ollama: OK (fallback at {config['embedding']['ollama_host']}:{config['embedding']['ollama_port']})")
|
||||
else:
|
||||
warnings.append(f"Ollama: HTTP {resp.status_code}")
|
||||
except Exception as e:
|
||||
warnings.append(f"Ollama: unreachable ({e}) — fallback only, not critical")
|
||||
|
||||
try:
|
||||
from qdrant_client import QdrantClient
|
||||
qdrant = QdrantClient(
|
||||
host=config['vector_db']['host'],
|
||||
port=config['vector_db']['port'],
|
||||
timeout=10
|
||||
)
|
||||
collections = [c.name for c in qdrant.get_collections().collections]
|
||||
target = config['vector_db']['collection']
|
||||
if target in collections:
|
||||
info = qdrant.get_collection(target)
|
||||
print(f" Qdrant: OK ({target}: {info.points_count} points)")
|
||||
else:
|
||||
issues.append(f"Qdrant: collection {target} not found")
|
||||
except Exception as e:
|
||||
issues.append(f"Qdrant: unreachable ({e})")
|
||||
|
||||
# Summary
|
||||
print("\n--- Summary ---\n")
|
||||
|
||||
if warnings:
|
||||
print(f"Warnings ({len(warnings)}):")
|
||||
for w in warnings:
|
||||
print(f" ⚠ {w}")
|
||||
|
||||
if issues:
|
||||
print(f"\nIssues ({len(issues)}):")
|
||||
for i in issues:
|
||||
print(f" ✗ {i}")
|
||||
print(f"\nValidation FAILED: {len(issues)} issue(s)")
|
||||
else:
|
||||
print("Validation PASSED")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
deep = '--deep' in sys.argv
|
||||
run_validation(deep=deep)
|
||||
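
The per-window integrity check in deep mode can be sketched in isolation as below. This is a simplified illustration under stated assumptions, not the validator's actual code: `check_window_file` is a hypothetical helper, and the temporary files stand in for entries under the concepts directory.

```python
import json
import tempfile
from pathlib import Path


def check_window_file(path):
    """Classify a window file as 'ok', 'invalid JSON', or 'not a JSON array'."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Truncated or interleaved writes usually end up here
        return "invalid JSON"
    return "ok" if isinstance(data, list) else "not a JSON array"


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "window_0001.json"
    good.write_text('[{"title": "a"}]', encoding="utf-8")
    truncated = Path(tmp) / "window_0002.json"
    truncated.write_text('[{"title": "a"}, {"ti', encoding="utf-8")
    results = [check_window_file(good), check_window_file(truncated)]

print(results)
```

Catching `UnicodeDecodeError` alongside `JSONDecodeError` matters here because a file damaged by concurrent writes may contain bytes that fail to decode before the JSON parser ever runs.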
316
static/css/recon.css
Normal file
|
|
@ -0,0 +1,316 @@
|
|||
/* RECON Design System
|
||||
* Knowledge Extraction Pipeline — Dashboard CSS
|
||||
*/
|
||||
|
||||
:root {
|
||||
--bg-primary: #0a0a0a;
|
||||
--bg-secondary: #111;
|
||||
--bg-tertiary: #1a1a1a;
|
||||
--border: #222;
|
||||
--border-light: #333;
|
||||
--text-primary: #c0c0c0;
|
||||
--text-muted: #888;
|
||||
--text-dim: #666;
|
||||
--text-faint: #555;
|
||||
--green: #00ff41;
|
||||
--green-dim: #16a34a;
|
||||
--red: #ff4444;
|
||||
--red-dim: #dc2626;
|
||||
--orange: #ffa500;
|
||||
--blue: #00bfff;
|
||||
--blue-sky: #0ea5e9;
|
||||
--blue-dark: #0284c7;
|
||||
--purple: #7c3aed;
|
||||
--yellow: #fbbf24;
|
||||
|
||||
/* Pipeline colors */
|
||||
--pipe-queued: #555;
|
||||
--pipe-extracting: #b45309;
|
||||
--pipe-extracted: #d97706;
|
||||
--pipe-enriching: #0284c7;
|
||||
--pipe-enriched: #0ea5e9;
|
||||
--pipe-embedding: #7c3aed;
|
||||
--pipe-complete: #16a34a;
|
||||
--pipe-failed: #dc2626;
|
||||
|
||||
--font-mono: 'Courier New', monospace;
|
||||
--radius: 3px;
|
||||
--radius-md: 4px;
|
||||
}
|
||||
|
||||
* { margin: 0; padding: 0; box-sizing: border-box; }
|
||||
body { font-family: var(--font-mono); background: var(--bg-primary); color: var(--text-primary); }
|
||||
|
||||
/* ── Header ── */
|
||||
.header {
|
||||
background: var(--bg-secondary);
|
||||
border-bottom: 1px solid var(--border-light);
|
||||
padding: 10px 24px;
|
||||
flex-shrink: 0;
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
}
|
||||
.header-left {
|
||||
display: flex;
|
||||
align-items: baseline;
|
||||
gap: 12px;
|
||||
}
|
||||
.header-subtitle {
|
||||
font-size: 11px;
|
||||
color: var(--text-dim);
|
||||
letter-spacing: 1px;
|
||||
text-transform: uppercase;
|
||||
}
|
||||
.header h1 { color: var(--green); font-size: 18px; font-weight: 700; letter-spacing: 3px; }
|
||||
.header .stats { font-size: 12px; color: var(--text-dim); }
|
||||
.header .quick-stats { font-size: 11px; color: var(--text-muted); display: flex; gap: 16px; }
|
||||
.header .quick-stats span { white-space: nowrap; }
|
||||
|
||||
/* Heartbeat indicator */
|
||||
.heartbeat {
|
||||
display: inline-block;
|
||||
width: 8px;
|
||||
height: 8px;
|
||||
border-radius: 50%;
|
||||
background: var(--green);
|
||||
margin-right: 6px;
|
||||
vertical-align: middle;
|
||||
animation: pulse 2s ease-in-out infinite;
|
||||
}
|
||||
.heartbeat.dead {
|
||||
background: var(--red);
|
||||
animation: none;
|
||||
}
|
||||
@keyframes pulse {
|
||||
0%, 100% { opacity: 1; }
|
||||
50% { opacity: 0.4; }
|
||||
}
|
||||
|
||||
/* ── Navigation ── */
|
||||
.nav-domain {
|
||||
background: #0d0d0d;
|
||||
border-bottom: 1px solid var(--border);
|
||||
padding: 0 24px;
|
||||
display: flex;
|
||||
gap: 0;
|
||||
flex-shrink: 0;
|
||||
}
|
||||
.nav-domain a {
|
||||
color: var(--text-muted);
|
||||
text-decoration: none;
|
||||
font-size: 13px;
|
||||
text-transform: uppercase;
|
||||
letter-spacing: 1px;
|
||||
padding: 10px 16px;
|
||||
border-bottom: 2px solid transparent;
|
||||
transition: color 0.15s, border-color 0.15s;
|
||||
}
|
||||
.nav-domain a:hover { color: var(--text-primary); }
|
||||
.nav-domain a.active {
|
||||
color: var(--green);
|
||||
border-bottom-color: var(--green);
|
||||
}
|
||||
|
||||
.nav-sub {
|
||||
background: var(--bg-primary);
|
||||
border-bottom: 1px solid var(--border);
|
||||
padding: 6px 24px;
|
||||
}
|
||||
.nav-sub a {
|
||||
color: var(--text-dim);
|
||||
text-decoration: none;
|
||||
margin-right: 16px;
|
||||
font-size: 12px;
|
||||
transition: color 0.15s;
|
||||
}
|
||||
.nav-sub a:hover { color: var(--text-primary); }
|
||||
.nav-sub a.active { color: var(--green); }
|
||||
|
||||
/* ── Content ── */
|
||||
.content { padding: 24px; max-width: 1400px; margin: 0 auto; }
|
||||
|
||||
/* ── Panels ── */
|
||||
.panel {
|
||||
background: var(--bg-secondary);
|
||||
border: 1px solid var(--border);
|
||||
padding: 24px;
|
||||
margin-bottom: 24px;
|
||||
}
|
||||
|
||||
/* ── Forms ── */
|
||||
.search-box {
|
||||
width: 100%;
|
||||
padding: 10px 16px;
|
||||
background: var(--bg-secondary);
|
||||
border: 1px solid var(--border-light);
|
||||
color: var(--text-primary);
|
||||
font-family: inherit;
|
||||
font-size: 14px;
|
||||
margin-bottom: 16px;
|
||||
}
|
||||
.search-box:focus { outline: none; border-color: var(--green); }
|
||||
|
||||
/* ── Tables ── */
|
||||
table { width: 100%; border-collapse: collapse; font-size: 13px; }
|
||||
th { background: var(--bg-secondary); color: var(--green); text-align: left; padding: 8px 12px; border-bottom: 1px solid var(--border-light); }
|
||||
td { padding: 6px 12px; border-bottom: 1px solid var(--bg-tertiary); }
|
||||
tr:hover { background: var(--bg-secondary); }
|
||||
|
||||
/* ── Status badges ── */
|
||||
.status { padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
|
||||
.status-complete { color: var(--green); }
|
||||
.status-enriched { color: var(--blue); }
|
||||
.status-extracted { color: var(--orange); }
|
||||
.status-failed { color: var(--red); }
|
||||
.status-queued { color: var(--text-muted); }
|
||||
.status-duplicate { color: var(--text-muted); }
|
||||
|
||||
/* ── Stat cards ── */
|
||||
.stat-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
|
||||
.stat-card { background: var(--bg-secondary); border: 1px solid var(--border); padding: 16px; }
|
||||
.stat-card .label { color: var(--text-dim); font-size: 11px; text-transform: uppercase; }
|
||||
.stat-card .value { color: var(--green); font-size: 28px; margin-top: 4px; }
|
||||
.stat-card .sublabel { color: var(--text-faint); font-size: 10px; margin-top: 2px; }
|
||||
|
||||
/* ── Search results ── */
|
||||
.result { background: var(--bg-secondary); border: 1px solid var(--border); padding: 16px; margin-bottom: 12px; }
|
||||
.result .title { color: var(--green); font-size: 14px; margin-bottom: 4px; }
|
||||
.result .meta { color: var(--text-dim); font-size: 11px; margin-bottom: 8px; }
|
||||
.result .content-text { color: #999; font-size: 12px; line-height: 1.5; }
|
||||
.result .score { color: var(--orange); font-size: 12px; float: right; }
|
||||
|
||||
/* ── Buttons ── */
|
||||
.btn {
|
||||
background: var(--bg-tertiary);
|
||||
border: 1px solid var(--border-light);
|
||||
color: var(--text-primary);
|
||||
padding: 6px 14px;
|
||||
cursor: pointer;
|
||||
font-family: inherit;
|
||||
font-size: 12px;
|
||||
}
|
||||
.btn:hover { border-color: var(--green); color: var(--green); }
|
||||
.btn:disabled { opacity: 0.5; cursor: not-allowed; }
|
||||
.btn.active { border-color: var(--green); color: var(--green); }
|
||||
.btn-danger { color: var(--red); }
|
||||
.btn-danger:hover { border-color: var(--red); }
|
||||
.btn-warn { color: #ff8800; }
|
||||
.btn-warn:hover { border-color: #ff8800; }
|
||||
|
||||
/* ── Tags ── */
|
||||
.domain-tag {
|
||||
display: inline-block;
|
||||
background: var(--bg-tertiary);
|
||||
border: 1px solid var(--border-light);
|
||||
padding: 1px 6px;
|
||||
margin: 1px;
|
||||
font-size: 10px;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
.badge-web { background: #1e3a5f; color: #60a5fa; padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
|
||||
.badge-pdf { background: #2d5a2d; color: #4ade80; padding: 2px 8px; border-radius: var(--radius); font-size: 11px; }
|
||||
|
||||
/* ── Trend indicators ── */
|
||||
.trend { font-size: 11px; margin-left: 6px; }
|
||||
.trend-up { color: var(--green); }
|
||||
.trend-down { color: var(--red); }
|
||||
.trend-flat { color: var(--text-faint); }
|
||||
|
||||
/* ── Pipeline bar ── */
|
||||
.pipeline-bar {
|
||||
height: 24px;
|
||||
background: var(--bg-secondary);
|
||||
border: 1px solid var(--border);
|
||||
border-radius: var(--radius-md);
|
||||
overflow: hidden;
|
||||
display: flex;
|
||||
}
|
||||
.pipeline-bar .segment { height: 100%; transition: width 0.3s ease; }
|
||||
|
||||
.pipeline-legend { display: flex; gap: 14px; margin-top: 6px; font-size: 10px; color: var(--text-muted); flex-wrap: wrap; }
|
||||
.legend-dot {
|
||||
display: inline-block;
|
||||
width: 10px; height: 10px;
|
||||
border-radius: 2px;
|
||||
margin-right: 4px;
|
||||
vertical-align: middle;
|
||||
}
|
||||
|
||||
/* ── Service status dots ── */
|
||||
.svc-dot {
|
||||
display: inline-block;
|
||||
width: 10px;
|
||||
height: 10px;
|
||||
border-radius: 50%;
|
||||
margin-right: 6px;
|
||||
vertical-align: middle;
|
||||
}
|
||||
.svc-dot.active { background: var(--green); }
|
||||
.svc-dot.inactive { background: var(--red); }
|
||||
.svc-dot.unknown { background: var(--text-faint); }
|
||||
|
||||
/* ── Service status row ── */
|
||||
.svc-row {
|
||||
display: flex;
|
||||
gap: 24px;
|
||||
background: var(--bg-secondary);
|
||||
border: 1px solid var(--border);
|
||||
padding: 12px 16px;
|
||||
margin-bottom: 24px;
|
||||
font-size: 12px;
|
||||
}
|
||||
.svc-row .svc-item { display: flex; align-items: center; }
|
||||
|
||||
/* ── Pagination ── */
|
||||
.pagination {
|
||||
display: flex;
|
||||
gap: 4px;
|
||||
margin-top: 16px;
|
||||
justify-content: center;
|
||||
}
|
||||
.pagination a, .pagination span {
|
||||
padding: 4px 10px;
|
||||
border: 1px solid var(--border-light);
|
||||
color: var(--text-muted);
|
||||
text-decoration: none;
|
||||
font-size: 12px;
|
||||
}
|
||||
.pagination a:hover { border-color: var(--green); color: var(--green); }
|
||||
.pagination .current {
|
||||
border-color: var(--green);
|
||||
color: var(--green);
|
||||
background: var(--bg-tertiary);
|
||||
}
|
||||
|
||||
/* ── Misc helpers ── */
|
||||
.section-title { color: var(--green); margin-bottom: 12px; }
|
||||
.mt-24 { margin-top: 24px; }
|
||||
.mb-16 { margin-bottom: 16px; }
|
||||
.mb-24 { margin-bottom: 24px; }
|
||||
.text-muted { color: var(--text-muted); }
|
||||
.text-dim { color: var(--text-dim); }
|
||||
.text-faint { color: var(--text-faint); }
|
||||
.text-green { color: var(--green); }
|
||||
.text-red { color: var(--red); }
|
||||
.text-orange { color: var(--orange); }
|
||||
.text-blue { color: var(--blue); }
|
||||
.text-small { font-size: 12px; }
|
||||
.text-xs { font-size: 11px; }
|
||||
.text-xxs { font-size: 10px; }
|
||||
.mono { font-family: monospace; }
|
||||
|
||||
.flex { display: flex; }
|
||||
.flex-between { display: flex; justify-content: space-between; }
|
||||
.flex-center { display: flex; align-items: center; }
|
||||
.gap-8 { gap: 8px; }
|
||||
.gap-16 { gap: 16px; }
|
||||
|
||||
.grid-2 { display: grid; grid-template-columns: 1fr 1fr; gap: 24px; }
|
||||
.grid-3 { display: grid; grid-template-columns: repeat(3, 1fr); gap: 16px; }
|
||||
|
||||
/* ── Collapsible errors panel ── */
|
||||
.errors-panel { display: none; }
|
||||
.errors-panel.has-errors { display: block; }
|
||||
.errors-panel summary { color: var(--red); cursor: pointer; font-size: 13px; margin-bottom: 8px; }
|
||||
.errors-panel .error-line { color: var(--text-muted); font-size: 11px; padding: 2px 0; border-bottom: 1px solid var(--border); }
|
||||
120
static/js/channels.js
Normal file
|
|
@ -0,0 +1,120 @@
|
|||
/* RECON PeerTube Channels page JS */
|
||||
(function() {
|
||||
'use strict';
|
||||
|
||||
async function loadChannelStats() {
|
||||
try {
|
||||
var resp = await fetch('/api/peertube/channels/stats');
|
||||
var data = await resp.json();
|
||||
if (resp.ok) {
|
||||
document.getElementById('pt-total-ch').textContent = data.total_channels;
|
||||
document.getElementById('pt-total-vid').textContent = data.total_videos;
|
||||
var dlEl = document.getElementById('pt-dl-status');
|
||||
dlEl.textContent = data.downloader_active ? 'Active' : 'Stopped';
|
||||
dlEl.style.color = data.downloader_active ? '#00ff41' : '#ff4444';
|
||||
}
|
||||
} catch(e) {
|
||||
console.error('Stats error:', e);
|
||||
}
|
||||
}
|
||||
|
||||
    // Escape API-provided text before interpolating it into HTML strings
    function esc(s) {
        return String(s == null ? '' : s)
            .replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;').replace(/'/g, '&#39;');
    }

    async function loadChannels() {
        try {
            var resp = await fetch('/api/peertube/channels');
            var data = await resp.json();
            if (!resp.ok) throw new Error(data.error || 'Failed');
            var tbody = document.getElementById('pt-channel-tbody');
            if (!data.length) {
                tbody.innerHTML = '<tr><td colspan="6" style="text-align:center;padding:20px;color:#555;">No channels configured</td></tr>';
                return;
            }
            var cats = [];
            var catSet = {};
            data.forEach(function(c) { if (c.category && !catSet[c.category]) { catSet[c.category] = true; cats.push(c.category); } });
            document.getElementById('pt-cat-list').innerHTML = cats.map(function(c) { return '<option value="' + esc(c) + '">'; }).join('');

            var html = '';
            data.forEach(function(ch) {
                var vids = ch.videos_in_peertube || 0;
                var statusColor = vids > 0 ? '#00ff41' : '#ffa500';
                var statusText = vids > 0 ? 'syncing' : 'new';
                var name = esc(ch.channel_name);
                // Only link http(s) URLs so a javascript: URL cannot end up in href
                var ytLink = /^https?:\/\//i.test(ch.youtube_url || '')
                    ? '<a href="' + esc(ch.youtube_url) + '" target="_blank" rel="noopener" style="color:#00a0d0;text-decoration:none;">' + name + '</a>'
                    : name;
                html += '<tr style="border-bottom:1px solid #1a1a1a;">' +
                    '<td style="padding:8px 10px;">' + ytLink + '</td>' +
                    '<td style="padding:8px 10px;text-align:center;">' + esc(vids) + '</td>' +
                    '<td style="padding:8px 10px;color:#888;">' + esc(ch.category || '') + '</td>' +
                    '<td style="padding:8px 10px;text-align:center;">' + esc(ch.priority || 'M') + '</td>' +
                    '<td style="padding:8px 10px;text-align:center;"><span style="color:' + statusColor + ';">' + statusText + '</span></td>' +
                    // The actor name travels in a data attribute instead of being spliced into inline JS
                    '<td style="padding:8px 10px;text-align:center;"><button data-actor="' + esc(ch.actor_name) + '" onclick="removeChannel(this.dataset.actor)" style="background:none;border:1px solid #333;color:#ff4444;cursor:pointer;padding:2px 8px;font-size:11px;font-family:inherit;">x</button></td>' +
                    '</tr>';
            });
            tbody.innerHTML = html;
        } catch(e) {
            document.getElementById('pt-channel-tbody').innerHTML = '<tr><td colspan="6" style="text-align:center;padding:20px;color:#ff4444;">Error: ' + esc(e.message) + '</td></tr>';
        }
    }
|
||||
|
||||
window.addChannel = async function() {
|
||||
var fb = document.getElementById('pt-feedback');
|
||||
var url = document.getElementById('pt-yt-url').value.trim();
|
||||
if (!url) {
|
||||
fb.style.color = '#ff4444';
|
||||
fb.textContent = 'Enter a YouTube channel URL';
|
||||
return;
|
||||
}
|
||||
var category = document.getElementById('pt-category').value.trim();
|
||||
var priority = document.getElementById('pt-priority').value;
|
||||
var btn = document.getElementById('pt-add-btn');
|
||||
btn.disabled = true;
|
||||
fb.style.color = '#ffa500';
|
||||
fb.textContent = 'Resolving channel...';
|
||||
try {
|
||||
var resp = await fetch('/api/peertube/channels/add', {
|
||||
method: 'POST',
|
||||
headers: {'Content-Type': 'application/json'},
|
||||
body: JSON.stringify({youtube_url: url, category: category, priority: priority})
|
||||
});
|
||||
var data = await resp.json();
|
||||
if (resp.ok) {
|
||||
fb.style.color = '#00ff41';
|
||||
fb.textContent = 'Added: ' + (data.channel_name || 'channel');
|
||||
document.getElementById('pt-yt-url').value = '';
|
||||
loadChannels();
|
||||
loadChannelStats();
|
||||
} else {
|
||||
fb.style.color = '#ff4444';
|
||||
fb.textContent = data.error || 'Failed to add channel';
|
||||
}
|
||||
} catch(e) {
|
||||
fb.style.color = '#ff4444';
|
||||
fb.textContent = 'Error: ' + e.message;
|
||||
}
|
||||
btn.disabled = false;
|
||||
};
|
||||
|
||||
    window.removeChannel = async function(actorName) {
        if (!confirm('Remove channel ' + actorName + '?')) return;
        var fb = document.getElementById('pt-feedback');
        fb.style.color = '#ffa500';
        fb.textContent = 'Removing...';
        try {
            var resp = await fetch('/api/peertube/channels/' + encodeURIComponent(actorName), {method: 'DELETE'});
            var data = await resp.json();
            if (resp.ok) {
                fb.style.color = '#00ff41';
                fb.textContent = data.message || 'Removed';
                loadChannels();
                loadChannelStats();
            } else {
                fb.style.color = '#ff4444';
                fb.textContent = data.error || 'Failed';
            }
        } catch(e) {
            fb.style.color = '#ff4444';
            fb.textContent = 'Error: ' + e.message;
        }
    };

    loadChannelStats();
    loadChannels();
})();
186
static/js/charts.js
Normal file
@ -0,0 +1,186 @@
/* RECON Lightweight Canvas Line Chart
 * No dependencies. drawLineChart(canvasId, datasets, opts)
 * DPI-aware rendering for sharp lines on all displays.
 */
var ReconChart = (function() {
    'use strict';

    var COLORS = ['#00ff41', '#0ea5e9', '#ffa500', '#ff4444', '#7c3aed', '#fbbf24'];

    function drawLineChart(canvasId, datasets, opts) {
        opts = opts || {};
        var canvas = document.getElementById(canvasId);
        if (!canvas) return;

        // DPI-aware sizing: match canvas bitmap to actual CSS pixels
        var dpr = window.devicePixelRatio || 1;
        var rect = canvas.getBoundingClientRect();
        var cssW = rect.width || 800;
        var cssH = rect.height || 200;
        canvas.width = cssW * dpr;
        canvas.height = cssH * dpr;

        var ctx = canvas.getContext('2d');
        ctx.scale(dpr, dpr);

        var W = cssW;
        var H = cssH;
        var pad = {top: 20, right: 20, bottom: 30, left: 60};
        var plotW = W - pad.left - pad.right;
        var plotH = H - pad.top - pad.bottom;

        // Clear
        ctx.fillStyle = '#111';
        ctx.fillRect(0, 0, W, H);

        if (!datasets || datasets.length === 0) {
            ctx.fillStyle = '#666';
            ctx.font = '12px Courier New';
            ctx.textAlign = 'center';
            ctx.fillText('No data', W/2, H/2);
            return;
        }

        // Collect every X and Y value to find the global ranges
        var allY = [];
        var allX = [];
        datasets.forEach(function(ds) {
            ds.points.forEach(function(p) {
                allY.push(p.y);
                allX.push(p.x);
            });
        });
        if (allY.length === 0) return;

        var minY = Math.min.apply(null, allY);
        var maxY = Math.max.apply(null, allY);
        var minX = Math.min.apply(null, allX);
        var maxX = Math.max.apply(null, allX);

        // Pad the Y range: 5% below (clamped at 0) and 10% above
        var yRange = maxY - minY || 1;
        minY = Math.max(0, minY - yRange * 0.05);
        maxY = maxY + yRange * 0.1;
        var xRange = maxX - minX || 1;

        function xToCanvas(x) { return pad.left + ((x - minX) / xRange) * plotW; }
        function yToCanvas(y) { return pad.top + plotH - ((y - minY) / (maxY - minY)) * plotH; }

        // Grid lines
        ctx.strokeStyle = '#222';
        ctx.lineWidth = 1;
        var ySteps = 5;
        for (var i = 0; i <= ySteps; i++) {
            var yVal = minY + (maxY - minY) * (i / ySteps);
            var cy = yToCanvas(yVal);
            ctx.beginPath();
            ctx.moveTo(pad.left, cy);
            ctx.lineTo(W - pad.right, cy);
            ctx.stroke();

            // Y labels
            ctx.fillStyle = '#666';
            ctx.font = '10px Courier New';
            ctx.textAlign = 'right';
            ctx.fillText(Math.round(yVal).toLocaleString(), pad.left - 6, cy + 3);
        }

        // X labels (time)
        ctx.textAlign = 'center';
        ctx.fillStyle = '#666';
        var xSteps = Math.min(6, allX.length);
        for (var j = 0; j < xSteps; j++) {
            var xVal = minX + xRange * (j / (xSteps - 1 || 1));
            var cx = xToCanvas(xVal);
            var d = new Date(xVal);
            var label = d.getHours().toString().padStart(2, '0') + ':' + d.getMinutes().toString().padStart(2, '0');
            ctx.fillText(label, cx, H - 8);
        }

        // Draw lines + dots at each data point
        datasets.forEach(function(ds, idx) {
            var color = ds.color || COLORS[idx % COLORS.length];
            ctx.strokeStyle = color;
            ctx.lineWidth = 2;
            ctx.beginPath();
            // Sort a copy so the caller's points array is not reordered
            var pts = ds.points.slice().sort(function(a, b) { return a.x - b.x; });
            pts.forEach(function(p, i) {
                var x = xToCanvas(p.x);
                var y = yToCanvas(p.y);
                if (i === 0) ctx.moveTo(x, y);
                else ctx.lineTo(x, y);
            });
            ctx.stroke();

            // Draw dots at each point for visibility with sparse data
            ctx.fillStyle = color;
            pts.forEach(function(p) {
                var x = xToCanvas(p.x);
                var y = yToCanvas(p.y);
                ctx.beginPath();
                ctx.arc(x, y, 3, 0, Math.PI * 2);
                ctx.fill();
            });

            // Legend label
            if (ds.label) {
                ctx.fillStyle = color;
                ctx.font = '10px Courier New';
                ctx.textAlign = 'left';
                ctx.fillText(ds.label, pad.left + idx * 100, 12);
            }
        });
    }

    function loadAndDraw(canvasId, metricType, keys, labels, hours) {
        hours = hours || 24;
        RECON.fetchJSON('/api/metrics/history?type=' + metricType + '&hours=' + hours).then(function(data) {
            if (!data.points || data.points.length < 2) {
                // Show "collecting data" message instead of hiding
                var canvas = document.getElementById(canvasId);
                if (!canvas) return;
                var container = canvas.parentElement;
                if (container) container.style.display = 'block';
                var dpr = window.devicePixelRatio || 1;
                var rect = canvas.getBoundingClientRect();
                // Use the same fallback size for the bitmap, the fill, and the text position
                var cssW = rect.width || 800;
                var cssH = rect.height || 200;
                canvas.width = cssW * dpr;
                canvas.height = cssH * dpr;
                var ctx = canvas.getContext('2d');
                ctx.scale(dpr, dpr);
                ctx.fillStyle = '#111';
                ctx.fillRect(0, 0, cssW, cssH);
                ctx.fillStyle = '#555';
                ctx.font = '12px Courier New';
                ctx.textAlign = 'center';
                var msg = data.points && data.points.length === 1
                    ? 'Collecting data... (1 snapshot, need 2+)'
                    : 'Collecting data... (snapshots every 2 min)';
                ctx.fillText(msg, cssW / 2, cssH / 2);
                return;
            }

            var container = document.getElementById(canvasId).parentElement;
            if (container) container.style.display = 'block';

            var datasets = keys.map(function(key, i) {
                return {
                    label: labels[i] || key,
                    color: COLORS[i % COLORS.length],
                    points: data.points.map(function(p) {
                        return {
                            x: new Date(p.timestamp).getTime(),
                            y: p.data[key] || 0
                        };
                    })
                };
            });

            drawLineChart(canvasId, datasets);
        }).catch(function() {});
    }

    return {
        drawLineChart: drawLineChart,
        loadAndDraw: loadAndDraw
    };
})();
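The coordinate mapping inside `drawLineChart` can be isolated as a pure function, which makes the padding rule easy to check without a canvas. The sketch below is illustrative only: `makeScale` and its argument shape are hypothetical names and are not part of `charts.js`.

```javascript
// Hypothetical helper (not in charts.js): builds the same data-to-pixel
// mapping that drawLineChart uses, as pure functions.
function makeScale(points, width, height, pad) {
    var xs = points.map(function(p) { return p.x; });
    var ys = points.map(function(p) { return p.y; });
    var minX = Math.min.apply(null, xs), maxX = Math.max.apply(null, xs);
    var minY = Math.min.apply(null, ys), maxY = Math.max.apply(null, ys);

    // Same padding rule as the chart: 5% below (clamped at 0), 10% above.
    var yRange = maxY - minY || 1;
    minY = Math.max(0, minY - yRange * 0.05);
    maxY = maxY + yRange * 0.1;
    var xRange = maxX - minX || 1;

    var plotW = width - pad.left - pad.right;
    var plotH = height - pad.top - pad.bottom;
    return {
        x: function(x) { return pad.left + ((x - minX) / xRange) * plotW; },
        // Canvas Y grows downward, so larger values map to smaller pixel rows.
        y: function(y) { return pad.top + plotH - ((y - minY) / (maxY - minY)) * plotH; }
    };
}

var scale = makeScale(
    [{x: 0, y: 0}, {x: 10, y: 100}],
    800, 200,
    {top: 20, right: 20, bottom: 30, left: 60}
);
console.log(scale.x(0), scale.x(10)); // left and right edges of the plot area
```

Keeping the mapping separate from the drawing calls would let the padding and clamping be unit-tested in Node, where no canvas is available.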
163
static/js/common.js
Normal file
@ -0,0 +1,163 @@
/* RECON Common Utilities
 * Shared fetch helpers, formatters, auto-refresh
 */

var RECON = (function() {
    'use strict';

    // Pipeline color/label maps
    var pipeColors = {
        queued: '#555', extracting: '#b45309', extracted: '#d97706',
        enriching: '#0284c7', enriched: '#0ea5e9', embedding: '#7c3aed',
        complete: '#16a34a', failed: '#dc2626'
    };
    var pipeLabels = {
        queued: 'Queued', extracting: 'Extracting', extracted: 'Extracted',
        enriching: 'Enriching', enriched: 'Enriched', embedding: 'Embedding',
        complete: 'Complete', failed: 'Failed'
    };

    var _refreshTimers = [];
    var _heartbeatEl = null;

    function fetchJSON(url) {
        return fetch(url).then(function(r) {
            if (!r.ok) throw new Error('HTTP ' + r.status);
            return r.json();
        });
    }

    function postJSON(url, body) {
        return fetch(url, {
            method: 'POST',
            headers: {'Content-Type': 'application/json'},
            body: JSON.stringify(body || {})
        }).then(function(r) { return r.json(); });
    }

    function set(id, text) {
        var el = document.getElementById(id);
        if (el) el.textContent = text;
    }

    function setHTML(id, html) {
        var el = document.getElementById(id);
        if (el) el.innerHTML = html;
    }

    function fmt(n) {
        if (typeof n !== 'number' || isNaN(n)) return '—';
        return n.toLocaleString();
    }

    function fmtBytes(bytes) {
        if (!bytes || bytes <= 0) return '0 B';
        var units = ['B', 'KB', 'MB', 'GB', 'TB'];
        var i = Math.floor(Math.log(bytes) / Math.log(1024));
        // Clamp so fractional byte counts and values past the TB range stay inside `units`
        i = Math.max(0, Math.min(i, units.length - 1));
        return (bytes / Math.pow(1024, i)).toFixed(1) + ' ' + units[i];
    }

    function pct(n, total) {
        if (!total || total === 0) return '0';
        return (n / total * 100).toFixed(1);
    }

    // Trend indicator: compare current to previous
    function trend(current, previous) {
        if (previous === undefined || previous === null) return '';
        var diff = current - previous;
        if (diff > 0) return '<span class="trend trend-up">+' + fmt(diff) + ' ▲</span>';
        if (diff < 0) return '<span class="trend trend-down">' + fmt(diff) + ' ▼</span>';
        return '<span class="trend trend-flat">— ▶</span>';
    }

    // Build a segmented pipeline progress bar
    function progressBar(segments, total) {
        var html = '';
        segments.forEach(function(seg) {
            var w = total > 0 ? (seg.count / total * 100) : 0;
            if (w > 0) {
                html += '<div class="segment" style="width:' + w + '%;background:' +
                    (seg.color || pipeColors[seg.status] || '#555') + ';" title="' +
                    (seg.label || pipeLabels[seg.status] || seg.status) + ': ' + fmt(seg.count) + '"></div>';
            }
        });
        return html;
    }

    // Build legend for pipeline bar
    function progressLegend(segments) {
        var html = '';
        segments.forEach(function(seg) {
            if (seg.count > 0) {
                html += '<span><span class="legend-dot" style="background:' +
                    (seg.color || pipeColors[seg.status] || '#555') + ';"></span>' +
                    (seg.label || pipeLabels[seg.status] || seg.status) + ': ' + fmt(seg.count) + '</span>';
            }
        });
        return html;
    }

    // Auto-refresh with heartbeat
    function startRefresh(callback, intervalMs) {
        _heartbeatEl = document.getElementById('heartbeat');

        function tick() {
            try {
                var result = callback();
                if (result && typeof result.then === 'function') {
                    result.then(function() {
                        if (_heartbeatEl) {
                            _heartbeatEl.classList.remove('dead');
                        }
                    }).catch(function() {
                        if (_heartbeatEl) {
                            _heartbeatEl.classList.add('dead');
                        }
                    });
                } else {
                    if (_heartbeatEl) _heartbeatEl.classList.remove('dead');
                }
            } catch(e) {
                if (_heartbeatEl) _heartbeatEl.classList.add('dead');
            }
        }

        // Initial load
        tick();
        var timer = setInterval(tick, intervalMs || 30000);
        _refreshTimers.push(timer);
        return timer;
    }

    function stopRefresh(timer) {
        if (timer) clearInterval(timer);
    }

    // Quick-stats loader for header
    function loadQuickStats() {
        fetchJSON('/api/quick-stats').then(function(data) {
            setHTML('qs-docs', fmt(data.catalogued));
            setHTML('qs-vectors', fmt(data.vectors));
            setHTML('qs-pipeline', fmt(data.in_pipeline));
        }).catch(function() {});
    }

    return {
        fetchJSON: fetchJSON,
        postJSON: postJSON,
        set: set,
        setHTML: setHTML,
        fmt: fmt,
        fmtBytes: fmtBytes,
        pct: pct,
        trend: trend,
        progressBar: progressBar,
        progressLegend: progressLegend,
        startRefresh: startRefresh,
        stopRefresh: stopRefresh,
        loadQuickStats: loadQuickStats,
        pipeColors: pipeColors,
        pipeLabels: pipeLabels
    };
})();
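`setHTML`, `progressBar`, and `progressLegend` assemble markup by string concatenation, so any label that originates in an API response reaches `innerHTML` unescaped. One possible guard is a small escaping helper applied to such values before they are concatenated. `escapeHTML` below is a hypothetical addition for illustration and is not part of this commit.

```javascript
// Hypothetical helper (not in common.js): escapes the five characters that
// are significant in HTML text and attribute values.
function escapeHTML(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

// Example: a label as it might arrive from an API response.
var label = '<img src=x onerror="alert(1)">';
console.log('<span>' + escapeHTML(label) + '</span>');
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.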
232
static/js/dashboard.js
Normal file
@ -0,0 +1,232 @@
/* RECON Knowledge Dashboard */
(function() {
    'use strict';

    var pipeColors = RECON.pipeColors;
    var pipeLabels = RECON.pipeLabels;

    function loadDashboard() {
        return RECON.fetchJSON('/api/knowledge-stats').then(function(data) {
            var t = data.totals;

            // Top cards
            RECON.set('kv-catalogued', RECON.fmt(t.catalogued || 0));
            RECON.set('kv-pipeline', RECON.fmt(t.in_pipeline || 0));
            var pipeSub = document.getElementById('kv-pipeline-sub');
            if (t.in_pipeline > 0) {
                var active = data.pipeline.filter(function(p) { return ['extracting','enriching','embedding'].indexOf(p.status) >= 0; });
                var activeText = active.map(function(p) { return p.count + ' ' + p.status; }).join(', ');
                pipeSub.textContent = activeText || 'processing';
            } else { pipeSub.textContent = 'idle'; }
            RECON.set('kv-complete', RECON.fmt(t.complete || 0));
            var failEl = document.getElementById('kv-failed');
            failEl.textContent = RECON.fmt(t.failed || 0);
            failEl.style.color = t.failed > 0 ? '#ff4444' : '#00ff41';
            RECON.set('kv-concepts', RECON.fmt(t.concepts || 0));
            RECON.set('kv-vectors', RECON.fmt(t.vectors || 0));
            RECON.set('kv-pages', RECON.fmt(t.pages_processed || 0));

            // Progress bar
            var total = t.catalogued || 1;
            var notYetQueued = total - (t.documents || 0);
            var segments = [];
            if (notYetQueued > 0) {
                segments.push({status: 'unqueued', count: notYetQueued, color: '#1a1a1a', label: 'Not queued'});
            }
            data.pipeline.forEach(function(p) {
                if (p.count > 0) segments.push(p);
            });
            RECON.setHTML('progress-bar', RECON.progressBar(segments, total));
            // Default a missing complete count to 0 so the label never shows NaN
            var completePct = total > 0 ? ((t.complete || 0) / total * 100).toFixed(1) : 0;
            RECON.set('progress-pct', completePct + '% complete (' + RECON.fmt(t.complete || 0) + ' / ' + RECON.fmt(total) + ')');

            // Legend
            var legendSegments = [];
            if (notYetQueued > 0) legendSegments.push({status: 'unqueued', count: notYetQueued, color: '#1a1a1a', label: 'Not queued'});
            data.pipeline.forEach(function(p) { if (p.count > 0) legendSegments.push(p); });
            RECON.setHTML('progress-legend', RECON.progressLegend(legendSegments));

            // Pipeline activity
            var activeStatuses = data.pipeline.filter(function(p) { return ['extracting','enriching','embedding'].indexOf(p.status) >= 0 && p.count > 0; });
            var actDiv = document.getElementById('pipeline-activity');
            if (activeStatuses.length > 0) {
                actDiv.style.display = 'block';
                var actHtml = '';
                activeStatuses.forEach(function(p) {
                    actHtml += '<div style="margin:4px 0;"><span style="color:' + (pipeColors[p.status]||'#ffa500') + ';">● ' + (pipeLabels[p.status]||p.status) + ':</span> ' + p.count + ' documents</div>';
                });
                if (data.active_titles) {
                    Object.keys(data.active_titles).forEach(function(st) {
                        var titles = data.active_titles[st];
                        if (titles.length > 0) actHtml += '<div style="color:#666;font-size:11px;margin-left:16px;">' + titles.slice(0,3).join(', ') + (titles.length > 3 ? ', ...' : '') + '</div>';
                    });
                }
                RECON.setHTML('activity-content', actHtml);
            } else { actDiv.style.display = 'none'; }

            // Qdrant health
            var q = data.qdrant;
            var qEl = document.getElementById('qdrant-status');
            if (q.error) {
                qEl.innerHTML = '<span style="color:#ff4444;">● Offline</span> — ' + q.error;
            } else {
                var idxType = q.index_type || (q.vectors >= 20000 ? 'HNSW' : 'brute-force');
                var idxColor = idxType === 'HNSW' ? '#00ff41' : '#ffa500';
                qEl.innerHTML = '<span style="color:#00ff41;">● Online</span> | ' +
                    RECON.fmt(q.vectors) + ' vectors | ' +
                    '<span style="color:' + idxColor + ';">' + idxType + '</span>' +
                    (idxType === 'HNSW' ? ' (' + RECON.fmt(q.indexed||0) + ' indexed)' : ' (HNSW auto-builds at 20K)') +
                    ' | <span style="color:#555;">recon_knowledge</span>';
            }

            // Sources table
            var tbody = document.getElementById('sources-tbody');
            var totalCat = 0, totalComp = 0, totalPipe = 0, totalConcepts = 0, totalVectors = 0;
            tbody.innerHTML = data.sources.map(function(s) {
                var catCount = s.catalogued || 0;
                var compCount = s.complete || 0;
                var pipeCount = s.in_pipeline || 0;
                totalCat += catCount; totalComp += compCount; totalPipe += pipeCount;
                // Default missing counts to 0 so one incomplete row cannot turn the totals into NaN
                totalConcepts += s.concepts || 0; totalVectors += s.vectors || 0;
                var badge = s.type === 'web' ? '<span class="badge-web">WEB</span>' : '<span class="badge-pdf">PDF</span>';
                var compPct = catCount > 0 ? (compCount / catCount * 100) : 0;
                var pipePct = catCount > 0 ? (pipeCount / catCount * 100) : 0;
                var compColor = compPct >= 100 ? '#00ff41' : compPct > 0 ? '#ffa500' : '#666';
                var pipeColor = pipeCount > 0 ? '#0ea5e9' : '#555';
                var barW = 80;
                var compW = (compPct / 100 * barW).toFixed(1);
                var pipeW = (pipePct / 100 * barW).toFixed(1);
                var miniBar = '<div style="display:flex;align-items:center;gap:6px;">' +
                    '<div style="width:' + barW + 'px;height:10px;background:#1a1a1a;border-radius:3px;overflow:hidden;display:flex;">' +
                    '<div style="width:' + compW + 'px;background:#16a34a;height:100%;"></div>' +
                    '<div style="width:' + pipeW + 'px;background:#0284c7;height:100%;"></div>' +
                    '</div><span style="color:#888;font-size:10px;">' + compPct.toFixed(0) + '%</span></div>';
                return '<tr><td>' + s.name + '</td><td>' + badge + '</td><td>' +
                    RECON.fmt(catCount) + '</td><td><span style="color:' + compColor + ';">' +
                    RECON.fmt(compCount) + '</span></td><td><span style="color:' + pipeColor + ';">' +
                    RECON.fmt(pipeCount) + '</span></td><td>' + miniBar + '</td><td>' +
                    RECON.fmt(s.concepts) + '</td><td>' + RECON.fmt(s.vectors) + '</td></tr>';
            }).join('');
            RECON.setHTML('sources-tfoot',
                '<tr style="border-top:1px solid #333;font-weight:bold;"><td>TOTAL</td><td></td><td>' +
                RECON.fmt(totalCat) + '</td><td>' + RECON.fmt(totalComp) + '</td><td>' +
                RECON.fmt(totalPipe) + '</td><td></td><td>' +
                RECON.fmt(totalConcepts) + '</td><td>' + RECON.fmt(totalVectors) + '</td></tr>');

            // Domain bars
            var dc = document.getElementById('domain-bars');
            // Tolerate a missing domains map, matching the knowledge-type and complexity sections
            var domEntries = Object.entries(data.domains || {});
            if (domEntries.length === 0) {
                dc.innerHTML = '<span class="text-dim">No domain data</span>';
            } else {
                var maxD = Math.max.apply(null, domEntries.map(function(e) { return e[1]; }));
                dc.innerHTML = domEntries.map(function(entry) {
                    var name = entry[0], count = entry[1];
                    var pct = (count / maxD * 100).toFixed(1);
                    return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
                        '<span style="width:160px;text-align:right;color:#aaa;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;">' + name + '</span>' +
                        '<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
                        '<div style="height:100%;background:#00cc66;border-radius:3px;width:' + pct + '%;"></div></div>' +
                        '<span style="width:50px;color:#ccc;text-align:right;">' + RECON.fmt(count) + '</span></div>';
                }).join('');
            }

            // Knowledge Type bars
            var ktEl = document.getElementById('knowledge-type-bars');
            var ktEntries = Object.entries(data.knowledge_types || {});
            var totalKt = ktEntries.reduce(function(a, e) { return a + e[1]; }, 0);
            if (ktEntries.length === 0) {
                ktEl.innerHTML = '<span class="text-dim">No data yet (migration in progress)</span>';
            } else {
                var ktColors = {foundational: '#60a5fa', procedural: '#4ade80', operational: '#fbbf24'};
                var maxKt = Math.max.apply(null, ktEntries.map(function(e) { return e[1]; }));
                ktEl.innerHTML = ktEntries.map(function(entry) {
                    var name = entry[0], count = entry[1];
                    var pctVal = totalKt > 0 ? (count / totalKt * 100).toFixed(0) : 0;
                    var barPct = (count / maxKt * 100).toFixed(1);
                    var color = ktColors[name] || '#888';
                    return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
                        '<span style="width:100px;text-align:right;color:' + color + ';">' + name + '</span>' +
                        '<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
                        '<div style="height:100%;background:' + color + ';opacity:0.6;border-radius:3px;width:' + barPct + '%;"></div></div>' +
                        '<span style="width:80px;color:#ccc;text-align:right;">' + RECON.fmt(count) + ' (' + pctVal + '%)</span></div>';
                }).join('');
            }
            var ktMig = document.getElementById('knowledge-type-migration');
            ktMig.textContent = RECON.fmt(totalKt) + ' / ' + RECON.fmt(data.sample_size) + ' migrated';

            // Complexity bars
            var cxEl = document.getElementById('complexity-bars');
            var cxEntries = Object.entries(data.complexities || {});
            var totalCx = cxEntries.reduce(function(a, e) { return a + e[1]; }, 0);
            if (cxEntries.length === 0) {
                cxEl.innerHTML = '<span class="text-dim">No data yet (migration in progress)</span>';
            } else {
                var cxColors = {basic: '#4ade80', intermediate: '#fbbf24', advanced: '#f87171'};
                var maxCx = Math.max.apply(null, cxEntries.map(function(e) { return e[1]; }));
                cxEl.innerHTML = cxEntries.map(function(entry) {
                    var name = entry[0], count = entry[1];
                    var pctVal = totalCx > 0 ? (count / totalCx * 100).toFixed(0) : 0;
                    var barPct = (count / maxCx * 100).toFixed(1);
                    var color = cxColors[name] || '#888';
                    return '<div style="display:flex;align-items:center;gap:10px;margin:5px 0;">' +
                        '<span style="width:100px;text-align:right;color:' + color + ';">' + name + '</span>' +
                        '<div style="flex:1;height:18px;background:#1a1a1a;border-radius:3px;overflow:hidden;">' +
                        '<div style="height:100%;background:' + color + ';opacity:0.6;border-radius:3px;width:' + barPct + '%;"></div></div>' +
                        '<span style="width:80px;color:#ccc;text-align:right;">' + RECON.fmt(count) + ' (' + pctVal + '%)</span></div>';
                }).join('');
            }
            var cxMig = document.getElementById('complexity-migration');
            cxMig.textContent = RECON.fmt(totalCx) + ' / ' + RECON.fmt(data.sample_size) + ' migrated';

            // Recent completions
            var rtb = document.getElementById('recent-tbody');
            if (data.recent_complete.length === 0) {
                rtb.innerHTML = '<tr><td colspan="4" class="text-dim">None yet</td></tr>';
            } else {
                rtb.innerHTML = data.recent_complete.map(function(r) {
                    var badge = r.type === 'web' ? '<span class="badge-web">WEB</span>' : '<span class="badge-pdf">PDF</span>';
                    return '<tr><td>' + r.title + '</td><td>' + badge + '</td><td>' +
                        r.concepts + '</td><td>' + r.vectors + '</td></tr>';
                }).join('');
            }
        });
    }

    function loadCharts() {
        if (typeof ReconChart !== 'undefined') {
            ReconChart.loadAndDraw('kb-chart', 'knowledge',
                ['complete', 'concepts'], ['Completed', 'Concepts'], 24);
        }
    }

    function initSourcesToggle() {
        var toggle = document.getElementById('sources-toggle');
        var arrow = document.getElementById('sources-arrow');
        var thead = document.getElementById('sources-thead');
        var tbody = document.getElementById('sources-tbody');
        var expanded = localStorage.getItem('recon-sources-expanded') === 'true';

        function apply() {
            var show = expanded ? '' : 'none';
            thead.style.display = show;
            tbody.style.display = show;
            arrow.innerHTML = expanded ? '▼' : '▶';
        }

        toggle.addEventListener('click', function() {
            expanded = !expanded;
            localStorage.setItem('recon-sources-expanded', expanded);
            apply();
        });

        apply();
    }

    document.addEventListener('DOMContentLoaded', function() {
        initSourcesToggle();
        RECON.startRefresh(loadDashboard, 30000);
        loadCharts();
        setInterval(loadCharts, 300000); // refresh charts every 5 min
    });
})();
106
static/js/peertube.js
Normal file
@ -0,0 +1,106 @@
/* RECON PeerTube Dashboard JS */
(function() {
    'use strict';

    function loadPTDashboard() {
        return RECON.fetchJSON('/api/peertube/dashboard').then(function(data) {
            // Video states
            var vs = data.video_states || {};
            // PeerTube state codes: 1=published, 2=to_transcode, 3=to_import, 4=waiting_for_live, 5=live_ended, 6=to_move_to_external_storage, 7=transcoding_failed, 8=to_edit, 9=waiting_for_live_to_end
            var published = vs['1'] || 0;
            var inPipeline = (vs['2'] || 0) + (vs['3'] || 0) + (vs['6'] || 0) + (vs['8'] || 0);
            var failed = vs['7'] || 0;
            RECON.set('pt-published', RECON.fmt(published));
            RECON.set('pt-in-pipeline', RECON.fmt(inPipeline));
            var failEl = document.getElementById('pt-failed');
            failEl.textContent = RECON.fmt(failed);
            failEl.style.color = failed > 0 ? '#ff4444' : '#00ff41';

            // Import rate from downloader state
            var ds = data.downloader_state || {};
            var rate = ds.imports_last_hour || 0;
            RECON.set('pt-import-rate', RECON.fmt(rate));

            // GPU
            var gpu = data.gpu || {};
            if (gpu.name) {
                RECON.set('pt-gpu-util', gpu.utilization_gpu || '—');
                RECON.set('pt-gpu-temp', gpu.temperature_gpu || '—');
                var gpuPanel = document.getElementById('pt-gpu-panel');
                gpuPanel.style.display = 'block';
                // Parse memory figures as base-10 integers
                document.getElementById('pt-gpu-detail').innerHTML =
                    '<strong>' + gpu.name + '</strong> | VRAM: ' +
                    RECON.fmt(parseInt(gpu.memory_used || 0, 10)) + ' / ' + RECON.fmt(parseInt(gpu.memory_total || 0, 10)) + ' MiB | ' +
                    'Util: ' + (gpu.utilization_gpu || '?') + '% | ' +
                    'Temp: ' + (gpu.temperature_gpu || '?') + '°C';
            } else {
                RECON.set('pt-gpu-util', '—');
                RECON.set('pt-gpu-temp', '—');
                document.getElementById('pt-gpu-panel').style.display = 'none';
            }

            // Services
            var svcs = data.services || {};
            ['downloader', 'importer', 'transcoder', 'runner'].forEach(function(s) {
                var el = document.getElementById('svc-' + s);
                el.className = 'svc-dot ' + (svcs[s] === 'active' ? 'active' : svcs[s] === 'inactive' ? 'inactive' : 'unknown');
            });

            // Pipeline dirs
            var dirs = data.pipeline_dirs || {};
            var storageHtml = '';
            var dirOrder = ['staging', 'completed', 'transcoded', 'failed'];
            var dirLabels = {staging: 'Downloaded', completed: 'Awaiting Transcode', transcoded: 'Ready to Import', failed: 'Failed'};
            var dirColors = {staging: '#b45309', completed: '#0284c7', transcoded: '#7c3aed', failed: '#dc2626'};
            var totalVideos = 0;
            dirOrder.forEach(function(d) {
                var info = dirs[d] || {};
                var videos = info.videos || 0;
                var bytes = info.bytes || 0;
                totalVideos += videos;
                storageHtml += '<div class="flex-between" style="margin:4px 0;">' +
                    '<span><span class="legend-dot" style="background:' + (dirColors[d] || '#555') + ';"></span>' + (dirLabels[d] || d) + '</span>' +
                    '<span>' + videos + ' videos / ' + RECON.fmtBytes(bytes) + '</span></div>';
            });
            RECON.setHTML('pt-storage-content', storageHtml);

            // Pipeline bar (using video counts)
            var segments = dirOrder.map(function(d) {
                return {status: d, count: (dirs[d] || {}).videos || 0, color: dirColors[d], label: dirLabels[d] || d};
            });
            RECON.setHTML('pt-pipeline-bar', RECON.progressBar(segments, totalVideos || 1));
            RECON.setHTML('pt-pipeline-legend', RECON.progressLegend(segments));
            RECON.set('pt-pipeline-summary', totalVideos + ' videos in pipeline');

            // Errors
            var errors = data.recent_errors || [];
            var errPanel = document.getElementById('pt-errors-panel');
            RECON.set('pt-error-count', errors.length);
            if (errors.length > 0) {
                errPanel.classList.add('has-errors');
                var errHtml = '';
                errors.forEach(function(e) {
                    errHtml += '<div class="error-line">' + e + '</div>';
                });
                RECON.setHTML('pt-errors-content', errHtml);
            } else {
                errPanel.classList.remove('has-errors');
            }
        }).catch(function(err) {
            console.error('PT dashboard error:', err);
        });
    }

    function loadCharts() {
        if (typeof ReconChart !== 'undefined') {
            ReconChart.loadAndDraw('pt-chart', 'peertube',
                ['published', 'backlog'], ['Published', 'Backlog'], 24);
        }
    }

    document.addEventListener('DOMContentLoaded', function() {
        RECON.startRefresh(loadPTDashboard, 30000);
        loadCharts();
        setInterval(loadCharts, 300000);
    });
})();
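The state bucketing at the top of `loadPTDashboard` could be pulled out as a pure function so it can be exercised without a DOM. This is a sketch only: `summarizeStates` is a hypothetical name, and the grouping of numeric codes simply copies the grouping used in the code above rather than an independently checked PeerTube reference.

```javascript
// Hypothetical helper (not in peertube.js): groups per-state video counts
// into the three numbers the dashboard displays. The code groupings are
// copied from loadPTDashboard and are an assumption here.
function summarizeStates(videoStates) {
    var vs = videoStates || {};
    function count(code) { return vs[code] || 0; }
    return {
        published: count('1'),
        inPipeline: count('2') + count('3') + count('6') + count('8'),
        failed: count('7')
    };
}

var summary = summarizeStates({'1': 40, '2': 3, '3': 2, '7': 1});
console.log(summary);
```

Missing states fall back to zero, which matches the `|| 0` defaults in the dashboard code.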
193
static/js/web-ingest.js
Normal file
@ -0,0 +1,193 @@
/* RECON Web Ingest page JS */
(function() {
    'use strict';

    window.showSection = function(name) {
        document.getElementById('section-single').style.display = name === 'single' ? '' : 'none';
        document.getElementById('section-crawl').style.display = name === 'crawl' ? '' : 'none';
        document.getElementById('tab-single').className = 'btn' + (name === 'single' ? ' active' : '');
        document.getElementById('tab-crawl').className = 'btn' + (name === 'crawl' ? ' active' : '');
    };

    window.doWebIngest = async function() {
        var btn = document.getElementById('wi-btn');
        var status = document.getElementById('wi-status');
        var resultsDiv = document.getElementById('wi-results');
        var urlText = document.getElementById('wi-urls').value.trim();
        var category = document.getElementById('wi-category').value.trim() || 'Web';

        if (!urlText) {
            status.style.color = '#ff4444';
            status.textContent = 'Enter at least one URL';
            return;
        }

        var urls = urlText.split('\n').map(function(u) { return u.trim(); }).filter(function(u) { return u && !u.startsWith('#'); });
        if (urls.length === 0) {
            status.style.color = '#ff4444';
            status.textContent = 'No valid URLs';
            return;
        }

        btn.disabled = true;
        status.style.color = '#ffa500';
        resultsDiv.style.display = 'none';

        if (urls.length === 1) {
            status.textContent = 'Fetching and extracting...';
            try {
                var resp = await fetch('/api/ingest-url', {
                    method: 'POST',
                    headers: {'Content-Type': 'application/json'},
                    body: JSON.stringify({ url: urls[0], category: category, process: true })
                });
                var data = await resp.json();
                if (resp.ok || resp.status === 409) {
                    var color = data.status === 'duplicate' ? '#888' : '#00ff41';
                    status.style.color = color;
                    status.textContent = data.status.toUpperCase() + ': ' + (data.title || urls[0]);
                    resultsDiv.style.display = 'block';
                    resultsDiv.innerHTML = '<span style="color:' + color + ';">' + data.status.toUpperCase() + '</span><br>' +
                        '<span class="text-dim">Hash: ' + data.hash + '</span><br>' +
                        (data.page_count ? '<span class="text-dim">Pages: ' + data.page_count + '</span><br>' : '') +
                        (data.title ? '<span class="text-dim">Title: ' + data.title + '</span><br>' : '') +
                        (data.pipeline ? '<span style="color:#00ff41;">Pipeline: enriched ' + (data.pipeline.enriched || 0) + ', embedded ' + (data.pipeline.embedded || 0) + '</span>' : '');
                } else {
                    status.style.color = '#ff4444';
                    status.textContent = data.error || 'Ingestion failed';
                }
            } catch (err) {
                status.style.color = '#ff4444';
                status.textContent = 'Network error: ' + err.message;
            }
} else {
|
||||
status.textContent = 'Processing ' + urls.length + ' URLs...';
|
||||
try {
|
||||
var resp = await fetch('/api/ingest-urls', {
|
||||
method: 'POST',
|
||||
headers: {'Content-Type': 'application/json'},
|
||||
body: JSON.stringify({ urls: urls, category: category, process: true })
|
||||
});
|
||||
var data = await resp.json();
|
||||
if (resp.ok) {
|
||||
var s = data.summary;
|
||||
status.style.color = '#00ff41';
|
||||
var batchPipe = data.pipeline && data.pipeline.enriched ? ' | enriched: ' + data.pipeline.enriched + ', embedded: ' + data.pipeline.embedded : '';
|
||||
status.textContent = s.succeeded + ' new, ' + s.duplicates + ' dupes, ' + s.failed + ' failed' + batchPipe;
|
||||
resultsDiv.style.display = 'block';
|
||||
var html = '';
|
||||
for (var i = 0; i < data.results.length; i++) {
|
||||
var r = data.results[i];
|
||||
var c = r.status === 'failed' ? '#ff4444' : r.status === 'duplicate' ? '#888' : '#00ff41';
|
||||
html += '<div style="margin-bottom:4px;"><span style="color:' + c + ';">' +
|
||||
r.status.toUpperCase() + '</span> ' + (r.title || r.url) + '</div>';
|
||||
}
|
||||
resultsDiv.innerHTML = html;
|
||||
} else {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = data.error || 'Batch ingestion failed';
|
||||
}
|
||||
} catch (err) {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = 'Network error: ' + err.message;
|
||||
}
|
||||
}
|
||||
btn.disabled = false;
|
||||
};
|
||||
|
||||
window.doCrawl = async function(dryRun) {
|
||||
var status = document.getElementById('crawl-status');
|
||||
var resultsDiv = document.getElementById('crawl-results');
|
||||
var url = document.getElementById('crawl-url').value.trim();
|
||||
var category = document.getElementById('crawl-category').value.trim() || 'Web';
|
||||
var maxPages = parseInt(document.getElementById('crawl-max-pages').value, 10) || 500;
|
||||
var includeRaw = document.getElementById('crawl-include').value.trim();
|
||||
var excludeRaw = document.getElementById('crawl-exclude').value.trim();
|
||||
|
||||
if (!url) {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = 'Enter a site URL';
|
||||
return;
|
||||
}
|
||||
|
||||
var include = includeRaw ? includeRaw.split(',').map(function(s) { return s.trim(); }).filter(Boolean) : null;
|
||||
var exclude = excludeRaw ? excludeRaw.split(',').map(function(s) { return s.trim(); }).filter(Boolean) : null;
|
||||
|
||||
var btnP = document.getElementById('crawl-preview-btn');
|
||||
var btnC = document.getElementById('crawl-btn');
|
||||
btnP.disabled = true;
|
||||
btnC.disabled = true;
|
||||
status.style.color = '#ffa500';
|
||||
status.textContent = dryRun ? 'Discovering URLs...' : 'Starting crawl...';
|
||||
resultsDiv.style.display = 'none';
|
||||
|
||||
try {
|
||||
var body = { url: url, category: category, max_pages: maxPages, dry_run: dryRun };
|
||||
if (include) body.include = include;
|
||||
if (exclude) body.exclude = exclude;
|
||||
|
||||
var resp = await fetch('/api/crawl', {
|
||||
method: 'POST',
|
||||
headers: {'Content-Type': 'application/json'},
|
||||
body: JSON.stringify(body)
|
||||
});
|
||||
var data = await resp.json();
|
||||
|
||||
if (dryRun && resp.ok) {
|
||||
var urls = data.urls || [];
|
||||
status.style.color = '#00ff41';
|
||||
status.textContent = urls.length + ' URLs found (' + (data.discovery_method || 'unknown') + ')';
|
||||
resultsDiv.style.display = 'block';
|
||||
var html = '<div style="color:#00ff41;margin-bottom:8px;">Discovery: ' + (data.discovery_method || 'unknown') + ' — ' + urls.length + ' URLs</div>';
|
||||
urls.forEach(function(u, i) {
|
||||
html += '<div class="text-muted">' + (i+1) + '. ' + u + '</div>';
|
||||
});
|
||||
resultsDiv.innerHTML = html;
|
||||
} else if (data.crawl_id) {
|
||||
status.style.color = '#00ff41';
|
||||
status.textContent = 'Crawl started — ID: ' + data.crawl_id;
|
||||
resultsDiv.style.display = 'block';
|
||||
resultsDiv.innerHTML = '<div style="color:#ffa500;">Crawl running in background...</div>' +
|
||||
'<div class="text-dim" style="margin-top:4px;">ID: ' + data.crawl_id + '</div>';
|
||||
pollCrawl(data.crawl_id, resultsDiv);
|
||||
} else {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = data.error || 'Crawl failed';
|
||||
}
|
||||
} catch (err) {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = 'Network error: ' + err.message;
|
||||
}
|
||||
btnP.disabled = false;
|
||||
btnC.disabled = false;
|
||||
};
|
||||
|
||||
function pollCrawl(crawlId, resultsDiv) {
|
||||
var check = async function() {
|
||||
try {
|
||||
var resp = await fetch('/api/crawl/' + crawlId + '/status');
|
||||
var data = await resp.json();
|
||||
if (data.status === 'running') {
|
||||
var stageText = data.stage ? ' (' + data.stage + ')' : '';
|
||||
resultsDiv.innerHTML = '<div style="color:#ffa500;">Pipeline running' + stageText + '...</div>' +
|
||||
'<div class="text-dim">Site: ' + (data.site || '') + '</div>';
|
||||
setTimeout(check, 5000);
|
||||
} else if (data.summary) {
|
||||
var s = data.summary;
|
||||
var pipeInfo = data.pipeline ? ' | Enriched: ' + (data.pipeline.enriched || 0) + ' | Embedded: ' + (data.pipeline.embedded || 0) : '';
|
||||
resultsDiv.innerHTML = '<div style="color:#00ff41;">Pipeline complete!</div>' +
|
||||
'<div class="text-dim" style="margin-top:4px;">New: ' + s.succeeded + ' | Duplicates: ' + s.duplicates + ' | Failed: ' + s.failed + ' | Total: ' + s.total + pipeInfo + '</div>';
|
||||
document.getElementById('crawl-status').style.color = '#00ff41';
|
||||
document.getElementById('crawl-status').textContent = 'Complete: ' + s.succeeded + ' new' + pipeInfo;
|
||||
} else if (data.error) {
|
||||
resultsDiv.innerHTML = '<div style="color:#ff4444;">Crawl failed: ' + data.error + '</div>';
|
||||
}
|
||||
} catch (err) {
|
||||
resultsDiv.innerHTML += '<div style="color:#ff4444;">Poll error: ' + err.message + '</div>';
|
||||
}
|
||||
};
|
||||
setTimeout(check, 5000);
|
||||
}
|
||||
|
||||
showSection('single');
|
||||
})();
|
||||
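The input handling in `doWebIngest` above (split the textarea on newlines, trim each line, drop blanks and `#` comments, then pick the single or batch endpoint) can be sketched outside the browser as follows. This is a minimal Python sketch, not part of the repository: the helper names `parse_url_list` and `build_ingest_request` are hypothetical, and only the endpoint paths and payload keys are taken from the script above.

```python
def parse_url_list(text):
    """Split a textarea value into URLs: trim lines, drop blanks and '#' comments."""
    urls = []
    for line in text.split("\n"):
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def build_ingest_request(text, category="Web"):
    """Return (path, payload) the way the page chooses between the two endpoints.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    urls = parse_url_list(text)
    if not urls:
        raise ValueError("No valid URLs")
    if len(urls) == 1:
        # Single URL goes to the single-document endpoint.
        return "/api/ingest-url", {"url": urls[0], "category": category, "process": True}
    # Two or more URLs go to the batch endpoint.
    return "/api/ingest-urls", {"urls": urls, "category": category, "process": True}
```

Building the request separately from sending it keeps the branching testable without a running RECON server.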
115
sweep_gated.sh
Executable file
|
|
@ -0,0 +1,115 @@
|
|||
#!/usr/bin/env bash
|
||||
# sweep_gated.sh — Qdrant-gated sweep wrapper for Stream B.2 Phase 4
|
||||
# Runs recon.py pipeline sweep in bounded chunks with Qdrant health checks
|
||||
# between each invocation. Aborts cleanly if Qdrant becomes unreachable.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
QDRANT_URL="${QDRANT_URL:-http://192.168.1.150:6333/collections/recon_knowledge_hybrid}"
|
||||
BATCH_SIZE="${BATCH_SIZE:-500}"
|
||||
MAX_ENTRIES="${MAX_ENTRIES:-500}"
|
||||
PLAN_FILE="${PLAN_FILE:-/opt/recon/data/sweep/sweep_plan.json}"
|
||||
RECON_DIR="/opt/recon"
|
||||
# Checkpoint co-locates with plan file: plan.json -> plan_checkpoint.json
|
||||
CHECKPOINT_FILE="${PLAN_FILE%.json}_checkpoint.json"
|
||||
|
||||
log() { echo "[$(date +%Y-%m-%dT%H:%M:%S)] $*"; }
|
||||
|
||||
probe_qdrant() {
|
||||
local resp
|
||||
resp=$(curl -sf -o /dev/null -w '%{http_code}' --connect-timeout 5 --max-time 10 "$QDRANT_URL" 2>/dev/null) || true
|
||||
if [ "$resp" = "200" ]; then
|
||||
return 0
|
||||
else
|
||||
return 1
|
||||
fi
|
||||
}
|
||||
|
||||
report_progress() {
|
||||
if [ -f "$CHECKPOINT_FILE" ]; then
|
||||
python3 -c "
|
||||
import json
|
||||
cp = json.load(open('$CHECKPOINT_FILE'))
|
||||
s = cp['stats']
|
||||
idx = cp['last_completed_index']
|
||||
print(f' last_completed_index={idx}')
|
||||
print(f' relocated={s[\"relocated\"]} rescued={s[\"rescued\"]} unclassified={s[\"unclassified_moved\"]}')
|
||||
print(f' noop={s[\"no_op_marked\"]} dup={s[\"duplicates\"]} skip={s[\"skipped\"]} fail={s[\"failed\"]}')
|
||||
print(f' qdrant_updated={s[\"qdrant_updated\"]}')
|
||||
" 2>/dev/null || log " (could not read checkpoint)"
|
||||
else
|
||||
log " no checkpoint file at $CHECKPOINT_FILE"
|
||||
fi
|
||||
}
|
||||
|
||||
parse_processed() {
|
||||
# Parse the sweep output to count total entries processed this iteration
|
||||
python3 -c "
|
||||
import sys, re
|
||||
lines = sys.stdin.read()
|
||||
total = 0
|
||||
for key in ['Relocated', 'Rescued', 'Unclassified moved', 'No-op .marked.', 'Duplicates', 'Skipped', 'Failed']:
    m = re.search(key + r':\s+(\d+)', lines)
    if m:
        total += int(m.group(1))
print(total)
|
||||
" 2>/dev/null || echo "-1"
|
||||
}
|
||||
|
||||
log "Plan file: $PLAN_FILE"
|
||||
log "Batch size: $BATCH_SIZE, Max entries per chunk: $MAX_ENTRIES"
|
||||
|
||||
iteration=0
|
||||
|
||||
while true; do
|
||||
iteration=$((iteration + 1))
|
||||
log "=== Iteration $iteration ==="
|
||||
|
||||
# Pre-flight Qdrant probe
|
||||
log "Probing Qdrant at $QDRANT_URL ..."
|
||||
if ! probe_qdrant; then
|
||||
log "ABORT: Qdrant unreachable before iteration $iteration"
|
||||
report_progress
|
||||
exit 1
|
||||
fi
|
||||
log "Qdrant OK"
|
||||
|
||||
# Run sweep chunk
|
||||
log "Running: recon.py pipeline sweep --execute --resume --batch-size $BATCH_SIZE --max-entries $MAX_ENTRIES --plan-file $PLAN_FILE"
|
||||
set +e
|
||||
output=$(cd "$RECON_DIR" && python3 recon.py pipeline sweep --execute --resume \
|
||||
--batch-size "$BATCH_SIZE" --max-entries "$MAX_ENTRIES" --plan-file "$PLAN_FILE" 2>&1)
|
||||
rc=$?
|
||||
set -e
|
||||
|
||||
echo "$output"
|
||||
|
||||
if [ "$rc" -ne 0 ]; then
|
||||
log "ABORT: recon.py exited with code $rc"
|
||||
report_progress
|
||||
exit 2
|
||||
fi
|
||||
|
||||
# Check if sweep is done (all counters zero = nothing left to process)
|
||||
processed=$(echo "$output" | parse_processed)
|
||||
|
||||
if [ "$processed" = "0" ]; then
|
||||
log "Sweep complete — nothing left to process"
|
||||
report_progress
|
||||
exit 0
|
||||
fi
|
||||
|
||||
log "Chunk processed $processed entries"
|
||||
|
||||
# Post-flight Qdrant probe
|
||||
log "Post-flight Qdrant probe..."
|
||||
if ! probe_qdrant; then
|
||||
log "ABORT: Qdrant unreachable after iteration $iteration"
|
||||
log "Last chunk may have filesystem/Qdrant drift — verify with: recon.py pipeline sweep --verify"
|
||||
report_progress
|
||||
exit 3
|
||||
fi
|
||||
log "Qdrant still healthy, continuing..."
|
||||
report_progress
|
||||
echo
|
||||
done
|
||||
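The completion check in `sweep_gated.sh` rests on `parse_processed`, which sums the per-category counters printed by the sweep. A standalone Python sketch of that counting step is shown below. The function name `count_processed` and the sample output text are assumptions made for illustration; only the counter labels and the regular expression come from the script.

```python
import re

# Counter labels as they appear in the wrapper script; the dots in
# "No-op .marked." are regex wildcards, so they match "(marked)".
COUNTER_LABELS = [
    "Relocated",
    "Rescued",
    "Unclassified moved",
    "No-op .marked.",
    "Duplicates",
    "Skipped",
    "Failed",
]


def count_processed(output):
    """Sum every '<label>: <n>' counter found in the sweep output."""
    total = 0
    for label in COUNTER_LABELS:
        m = re.search(label + r":\s+(\d+)", output)
        if m:
            total += int(m.group(1))
    return total


# Hypothetical sweep output, invented for this example.
sample = "Relocated:   3\nRescued: 1\nNo-op (marked): 2\nFailed: 0\n"
print(count_processed(sample))
```

One property this makes visible: output that contains none of the labels also sums to 0, which the wrapper treats as "nothing left to process", so a change in the sweep's output format would look like a completed sweep.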
39
templates/base.html
Normal file
|
|
@ -0,0 +1,39 @@
|
|||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<title>RECON // Aurora Intelligence Pipeline{% if page_title %} — {{ page_title }}{% endif %}</title>
|
||||
<meta charset="utf-8">
|
||||
<link rel="stylesheet" href="/static/css/recon.css">
|
||||
</head>
|
||||
<body>
|
||||
<div class="header">
|
||||
<div class="header-left"><h1><span id="heartbeat" class="heartbeat"></span>RECON</h1><span class="header-subtitle">AURORA INTELLIGENCE PIPELINE</span></div>
|
||||
<div class="flex gap-16">
|
||||
<div class="quick-stats">
|
||||
<span>Docs: <span id="qs-docs">—</span></span>
|
||||
<span>Vectors: <span id="qs-vectors">—</span></span>
|
||||
<span>Pipeline: <span id="qs-pipeline">—</span></span>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="nav-domain">
|
||||
<a href="/"{% if domain == 'knowledge' %} class="active"{% endif %}>Knowledge</a>
|
||||
<a href="/peertube"{% if domain == 'peertube' %} class="active"{% endif %}>PeerTube</a>
|
||||
<a href="/search"{% if domain == 'search' %} class="active"{% endif %}>Search</a>
|
||||
<a href="/settings/keys"{% if domain == 'settings' %} class="active"{% endif %}>Settings</a>
|
||||
</div>
|
||||
{% if subnav %}
|
||||
<div class="nav-sub">
|
||||
{% for item in subnav %}
|
||||
<a href="{{ item.href }}"{% if item.href == active_page %} class="active"{% endif %}>{{ item.label }}</a>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endif %}
|
||||
<div class="content" id="main">
|
||||
{% block content %}{% endblock %}
|
||||
</div>
|
||||
<script src="/static/js/common.js"></script>
|
||||
<script>document.addEventListener('DOMContentLoaded', function() { RECON.loadQuickStats(); });</script>
|
||||
{% block scripts %}{% endblock %}
|
||||
</body>
|
||||
</html>
|
||||
53
templates/knowledge/catalogue.html
Normal file
|
|
@ -0,0 +1,53 @@
|
|||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
<h3 class="section-title mb-16">Document Catalogue</h3>
|
||||
|
||||
{% if sources %}
|
||||
<div class="mb-16">
|
||||
<a href="/catalogue" class="btn{% if not current_source %} active{% endif %}" style="margin-right:4px;">All</a>
|
||||
{% for s in sources %}
|
||||
<a href="/catalogue?source={{ s|urlencode }}" class="btn{% if current_source == s %} active{% endif %}" style="margin-right:4px;">{{ s }}</a>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<div class="text-dim text-xs mb-16">
|
||||
Showing {{ docs|length }}{% if total_count %} of {{ total_count }}{% endif %} documents
|
||||
{% if current_source %} in <strong>{{ current_source }}</strong>{% endif %}
|
||||
(page {{ page }} of {{ total_pages }})
|
||||
</div>
|
||||
|
||||
<table>
|
||||
<tr><th>Filename</th><th>Source</th><th>Status</th><th>Pages</th><th>Concepts</th><th>Vectors</th></tr>
|
||||
{% for d in docs %}
|
||||
<tr>
|
||||
<td>{{ d.filename or '?' }}</td>
|
||||
<td>{{ d.source or '' }}</td>
|
||||
<td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
|
||||
<td>{{ d.pages_extracted or 0 }}</td>
|
||||
<td>{{ d.concepts_extracted or 0 }}</td>
|
||||
<td>{{ d.vectors_inserted or 0 }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</table>
|
||||
|
||||
{% if total_pages > 1 %}
|
||||
<div class="pagination">
|
||||
{% if page > 1 %}
|
||||
<a href="/catalogue?page={{ page - 1 }}{% if current_source %}&source={{ current_source|urlencode }}{% endif %}&per_page={{ per_page }}">«</a>
|
||||
{% endif %}
|
||||
{% for p in range(1, total_pages + 1) %}
|
||||
{% if p == page %}
|
||||
<span class="current">{{ p }}</span>
|
||||
{% elif p <= 3 or p > total_pages - 3 or (p >= page - 2 and p <= page + 2) %}
|
||||
<a href="/catalogue?page={{ p }}{% if current_source %}&source={{ current_source|urlencode }}{% endif %}&per_page={{ per_page }}">{{ p }}</a>
|
||||
{% elif p == 4 or p == total_pages - 3 %}
|
||||
<span class="text-dim">...</span>
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
{% if page < total_pages %}
|
||||
<a href="/catalogue?page={{ page + 1 }}{% if current_source %}&source={{ current_source|urlencode }}{% endif %}&per_page={{ per_page }}">»</a>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endif %}
|
||||
{% endblock %}
|
||||
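The pagination block in `catalogue.html` decides per page number whether to render a link, the current-page marker, an ellipsis, or nothing. The same branching can be sketched in Python to see which items a given page produces. The function name `page_window` is hypothetical, and the sketch returns page numbers and `"..."` markers rather than HTML.

```python
def page_window(page, total_pages):
    """Return the page numbers and '...' markers the template would render."""
    items = []
    for p in range(1, total_pages + 1):
        if p == page:
            # Current page (rendered as a span, not a link).
            items.append(p)
        elif p <= 3 or p > total_pages - 3 or (page - 2 <= p <= page + 2):
            # First three, last three, and two neighbours either side.
            items.append(p)
        elif p == 4 or p == total_pages - 3:
            # One ellipsis for each collapsed gap.
            items.append("...")
    return items


print(page_window(10, 20))
```

Because the ellipsis is tied to the fixed positions 4 and `total_pages - 3`, a gap of a single page (for example page 4 of 7 when viewing page 1) is still shown as `...` rather than as the page number itself.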
72
templates/knowledge/dashboard.html
Normal file
|
|
@ -0,0 +1,72 @@
|
|||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
<div id="kb-dashboard">
|
||||
<div class="stat-grid">
|
||||
<div class="stat-card"><div class="label">Catalogued</div><div class="value" id="kv-catalogued">—</div><div class="sublabel">total known documents</div></div>
|
||||
<div class="stat-card"><div class="label">In Pipeline</div><div class="value" id="kv-pipeline">—</div><div class="sublabel" id="kv-pipeline-sub">processing</div></div>
|
||||
<div class="stat-card"><div class="label">Complete</div><div class="value" id="kv-complete">—</div><div class="sublabel">in Qdrant</div></div>
|
||||
<div class="stat-card"><div class="label">Failed</div><div class="value" id="kv-failed">—</div><div class="sublabel"> </div></div>
|
||||
</div>
|
||||
|
||||
<div class="mb-24">
|
||||
<div class="flex-between mb-16" style="margin-bottom:4px;font-size:11px;color:#888;">
|
||||
<span id="progress-label">Pipeline Progress</span>
|
||||
<span id="progress-pct"></span>
|
||||
</div>
|
||||
<div id="progress-bar" class="pipeline-bar"></div>
|
||||
<div id="progress-legend" class="pipeline-legend"></div>
|
||||
</div>
|
||||
|
||||
<div class="stat-grid grid-3">
|
||||
<div class="stat-card"><div class="label">Concepts</div><div class="value" id="kv-concepts">—</div><div class="sublabel">extracted</div></div>
|
||||
<div class="stat-card"><div class="label">Vectors</div><div class="value" id="kv-vectors">—</div><div class="sublabel">in Qdrant</div></div>
|
||||
<div class="stat-card"><div class="label">Pages</div><div class="value" id="kv-pages">—</div><div class="sublabel">processed</div></div>
|
||||
</div>
|
||||
|
||||
<div id="pipeline-activity" class="panel" style="display:none;">
|
||||
<h3 style="color:#ffa500;font-size:13px;margin-bottom:8px;">Pipeline Activity</h3>
|
||||
<div id="activity-content" style="font-size:12px;color:#ccc;"></div>
|
||||
</div>
|
||||
|
||||
<div id="qdrant-health" class="panel" style="padding:10px 16px;font-size:12px;color:#888;">
|
||||
Qdrant: <span id="qdrant-status">checking...</span>
|
||||
</div>
|
||||
|
||||
<div id="kb-chart-container" class="panel" style="display:none;">
|
||||
<h3 class="section-title" style="margin-bottom:8px;">Pipeline Activity (24h)</h3>
|
||||
<canvas id="kb-chart" width="800" height="200" style="width:100%;height:200px;"></canvas>
|
||||
</div>
|
||||
|
||||
<h3 class="section-title" id="sources-toggle" style="cursor:pointer;user-select:none;"><span id="sources-arrow">▶</span> Sources</h3>
|
||||
<table>
|
||||
<thead id="sources-thead" style="display:none;"><tr><th>Source</th><th>Type</th><th>Catalogued</th><th>Complete</th><th>In Pipeline</th><th>Progress</th><th>Concepts</th><th>Vectors</th></tr></thead>
|
||||
<tbody id="sources-tbody" style="display:none;"><tr><td colspan="8" class="text-dim">Loading...</td></tr></tbody>
|
||||
<tfoot id="sources-tfoot"></tfoot>
|
||||
</table>
|
||||
|
||||
<div class="grid-2 mt-24">
|
||||
<div>
|
||||
<h3 class="section-title">Domain Distribution</h3>
|
||||
<div id="domain-bars" class="text-small">Loading...</div>
|
||||
</div>
|
||||
<div>
|
||||
<h3 class="section-title">Knowledge Type</h3>
|
||||
<div id="knowledge-type-bars" class="text-small">Loading...</div>
|
||||
<div id="knowledge-type-migration" class="text-small" style="margin-top:6px;color:#666;font-size:11px;"></div>
|
||||
<h3 class="section-title" style="margin-top:16px;">Complexity</h3>
|
||||
<div id="complexity-bars" class="text-small">Loading...</div>
|
||||
<div id="complexity-migration" class="text-small" style="margin-top:6px;color:#666;font-size:11px;"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3 class="section-title mt-24">Recently Completed</h3>
|
||||
<table>
|
||||
<thead><tr><th>Title</th><th>Type</th><th>Concepts</th><th>Vectors</th></tr></thead>
|
||||
<tbody id="recent-tbody"><tr><td colspan="4" class="text-dim">Loading...</td></tr></tbody>
|
||||
</table>
|
||||
</div>
|
||||
{% endblock %}
|
||||
{% block scripts %}
|
||||
<script src="/static/js/charts.js"></script>
|
||||
<script src="/static/js/dashboard.js"></script>
|
||||
{% endblock %}
|
||||
56
templates/knowledge/failures.html
Normal file
|
|
@ -0,0 +1,56 @@
|
|||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
<h3 style="color:#ff4444;margin-bottom:16px;">Failed Documents</h3>
|
||||
{% if not failures %}
|
||||
<p class="text-dim">No failures.</p>
|
||||
{% else %}
|
||||
<div style="margin-bottom:16px;">
|
||||
<button class="btn" id="retry-all-btn" onclick="retryAll()">Retry All ({{ failures|length }})</button>
|
||||
<span id="retry-all-status" style="margin-left:12px;font-size:12px;"></span>
|
||||
</div>
|
||||
<table>
|
||||
<tr><th>Filename</th><th>Error</th><th>Age</th><th>Retries</th><th>Actions</th></tr>
|
||||
{% for f in failures %}
|
||||
<tr>
|
||||
<td>{{ f.filename or '?' }}</td>
|
||||
<td style="color:#ff4444;font-size:11px;">{{ (f.error_message or 'unknown')[:100] }}</td>
|
||||
<td class="text-dim text-xs">{{ f.discovered_at or '' }}</td>
|
||||
<td>{{ f.retry_count or 0 }}</td>
|
||||
<td>
|
||||
<form method="post" action="/api/retry/{{ f.hash }}" style="display:inline;">
|
||||
<button class="btn" type="submit">Retry</button>
|
||||
</form>
|
||||
</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</table>
|
||||
{% endif %}
|
||||
{% endblock %}
|
||||
{% block scripts %}
|
||||
<script>
|
||||
async function retryAll() {
|
||||
var btn = document.getElementById('retry-all-btn');
|
||||
var status = document.getElementById('retry-all-status');
|
||||
if (!confirm('Retry all {{ failures|length }} failed documents?')) return;
|
||||
btn.disabled = true;
|
||||
status.style.color = '#ffa500';
|
||||
status.textContent = 'Retrying...';
|
||||
try {
|
||||
var resp = await fetch('/api/retry-all', {method: 'POST'});
|
||||
var data = await resp.json();
|
||||
if (resp.ok) {
|
||||
status.style.color = '#00ff41';
|
||||
status.textContent = 'Retried ' + data.count + ' documents';
|
||||
setTimeout(function() { location.reload(); }, 2000);
|
||||
} else {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = data.error || 'Failed';
|
||||
}
|
||||
} catch(e) {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = 'Error: ' + e.message;
|
||||
}
|
||||
btn.disabled = false;
|
||||
}
|
||||
</script>
|
||||
{% endblock %}
|
||||
83
templates/knowledge/upload.html
Normal file
|
|
@ -0,0 +1,83 @@
|
|||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
<h3 class="section-title mb-16">Upload PDF</h3>
|
||||
<div class="panel">
|
||||
<form id="upload-form" enctype="multipart/form-data">
|
||||
<div class="mb-16">
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">PDF File</label>
|
||||
<input type="file" name="file" accept=".pdf" id="upload-file"
|
||||
style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:8px;width:100%;font-family:inherit;">
|
||||
</div>
|
||||
<div class="mb-16">
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
|
||||
<input type="text" name="category" id="upload-category" list="cat-list" class="search-box"
|
||||
placeholder="Select or type a category..." style="margin-bottom:0;">
|
||||
<datalist id="cat-list">{{ options_html|safe }}</datalist>
|
||||
</div>
|
||||
<button type="submit" class="btn" id="upload-btn">Upload</button>
|
||||
<span id="upload-status" style="margin-left:12px;font-size:12px;"></span>
|
||||
</form>
|
||||
</div>
|
||||
<div id="upload-result" style="display:none;" class="panel"></div>
|
||||
|
||||
<h3 class="section-title">Recent Documents</h3>
|
||||
<table>
|
||||
<tr><th>Filename</th><th>Source</th><th>Status</th></tr>
|
||||
{% for d in recent %}
|
||||
<tr>
|
||||
<td>{{ d.filename or '?' }}</td>
|
||||
<td>{{ d.source or '' }}</td>
|
||||
<td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</table>
|
||||
{% endblock %}
|
||||
{% block scripts %}
|
||||
<script>
|
||||
document.getElementById('upload-form').addEventListener('submit', async function(e) {
|
||||
e.preventDefault();
|
||||
var btn = document.getElementById('upload-btn');
|
||||
var status = document.getElementById('upload-status');
|
||||
var result = document.getElementById('upload-result');
|
||||
var fileInput = document.getElementById('upload-file');
|
||||
var category = document.getElementById('upload-category').value;
|
||||
|
||||
if (!fileInput.files.length) {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = 'No file selected';
|
||||
return;
|
||||
}
|
||||
|
||||
btn.disabled = true;
|
||||
status.style.color = '#ffa500';
|
||||
status.textContent = 'Uploading...';
|
||||
result.style.display = 'none';
|
||||
|
||||
var formData = new FormData();
|
||||
formData.append('file', fileInput.files[0]);
|
||||
formData.append('category', category);
|
||||
|
||||
try {
|
||||
var resp = await fetch('/api/upload', { method: 'POST', body: formData });
|
||||
var data = await resp.json();
|
||||
if (resp.ok) {
|
||||
status.style.color = '#00ff41';
|
||||
status.textContent = 'Upload successful';
|
||||
result.style.display = 'block';
|
||||
result.innerHTML = '<span style="color:#00ff41;">Queued for processing</span><br>' +
|
||||
'<span class="text-dim">Hash: ' + data.hash + '</span><br>' +
|
||||
'<span class="text-dim">File: ' + data.filename + '</span><br>' +
|
||||
'<span class="text-dim">Category: ' + data.source + '/' + data.category + '</span>';
|
||||
fileInput.value = '';
|
||||
} else {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = data.error || 'Upload failed';
|
||||
}
|
||||
} catch (err) {
|
||||
status.style.color = '#ff4444';
|
||||
status.textContent = 'Network error: ' + err.message;
|
||||
}
|
||||
btn.disabled = false;
|
||||
});
|
||||
</script>
|
||||
{% endblock %}
|
||||
76
templates/knowledge/web_ingest.html
Normal file
|
|
@ -0,0 +1,76 @@
|
|||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
<h3 class="section-title mb-16">Web Ingest</h3>
|
||||
<div style="margin-bottom:8px;">
|
||||
<a href="#single" class="btn active" onclick="showSection('single')" id="tab-single">Single/Batch URL</a>
|
||||
<a href="#crawl" class="btn" onclick="showSection('crawl')" id="tab-crawl">Site Crawl</a>
|
||||
</div>
|
||||
|
||||
<div id="section-single">
|
||||
<div class="panel">
|
||||
<div class="mb-16">
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">URL(s) — one per line for batch</label>
|
||||
<textarea id="wi-urls" class="search-box" rows="4" placeholder="https://example.com/article" style="resize:vertical;margin-bottom:0;"></textarea>
|
||||
</div>
|
||||
<div class="mb-16">
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
|
||||
<input type="text" id="wi-category" list="wi-cat-list" class="search-box" value="Web"
|
||||
placeholder="Category..." style="margin-bottom:0;">
|
||||
<datalist id="wi-cat-list">{{ options_html|safe }}</datalist>
|
||||
</div>
|
||||
<button class="btn" id="wi-btn" onclick="doWebIngest()">Ingest</button>
|
||||
<span id="wi-status" style="margin-left:12px;font-size:12px;"></span>
|
||||
</div>
|
||||
<div id="wi-results" class="panel" style="display:none;max-height:300px;overflow-y:auto;"></div>
|
||||
</div>
|
||||
|
||||
<div id="section-crawl" style="display:none;">
|
||||
<div class="panel">
|
||||
<div class="mb-16">
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Site URL</label>
|
||||
<input type="text" id="crawl-url" class="search-box" placeholder="https://example.com" style="margin-bottom:0;">
|
||||
</div>
|
||||
<div class="grid-2 mb-16">
|
||||
<div>
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
|
||||
<input type="text" id="crawl-category" list="wi-cat-list" class="search-box" value="Web" style="margin-bottom:0;">
|
||||
</div>
|
||||
<div>
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Max Pages</label>
|
||||
<input type="number" id="crawl-max-pages" class="search-box" value="500" min="1" max="5000" style="margin-bottom:0;">
|
||||
</div>
|
||||
</div>
|
||||
<div class="grid-2 mb-16">
|
||||
<div>
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Include Paths (comma-separated)</label>
|
||||
<input type="text" id="crawl-include" class="search-box" placeholder="/docs/, /blog/" style="margin-bottom:0;">
|
||||
</div>
|
||||
<div>
|
||||
<label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Exclude Paths (comma-separated)</label>
|
||||
<input type="text" id="crawl-exclude" class="search-box" placeholder="/search, /login" style="margin-bottom:0;">
|
||||
</div>
|
||||
</div>
|
||||
<button class="btn" id="crawl-preview-btn" onclick="doCrawl(true)">Preview</button>
|
||||
<button class="btn" id="crawl-btn" onclick="doCrawl(false)" style="margin-left:8px;">Crawl & Ingest</button>
|
||||
<span id="crawl-status" style="margin-left:12px;font-size:12px;"></span>
|
||||
</div>
|
||||
<div id="crawl-results" class="panel" style="display:none;max-height:400px;overflow-y:auto;font-size:12px;"></div>
|
||||
</div>
|
||||
|
||||
<h3 class="section-title mt-24">Recent Web Ingestions</h3>
|
||||
<table>
|
||||
<tr><th>Title</th><th>Source/Category</th><th>Status</th><th>Pages</th><th>Concepts</th></tr>
|
||||
{% for d in web_docs %}
|
||||
<tr>
|
||||
<td title="{{ d.path or '' }}" style="max-width:400px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;">{{ d.book_title or d.filename or '?' }}</td>
|
||||
<td>{{ d.source or '' }}/{{ d.category or '' }}</td>
|
||||
<td><span class="status status-{{ d.status or 'unknown' }}">{{ d.status or 'unknown' }}</span></td>
|
||||
<td>{{ d.pages_extracted or 0 }}</td>
|
||||
<td>{{ d.concepts_extracted or 0 }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</table>
|
||||
{% endblock %}
|
||||
{% block scripts %}
|
||||
<script src="/static/js/web-ingest.js"></script>
|
||||
{% endblock %}
|
||||
53
templates/peertube/channels.html
Normal file
|
|
@ -0,0 +1,53 @@
|
|||
{% extends "base.html" %}
|
||||
{% block content %}
|
||||
<h3 class="section-title mb-16">PeerTube Channels</h3>
|
||||
|
||||
<div class="stat-grid" id="pt-stats" style="margin-bottom:24px;">
|
||||
<div class="stat-card"><div class="value" id="pt-total-ch">—</div><div class="label">Channels</div></div>
|
||||
<div class="stat-card"><div class="value" id="pt-total-vid">—</div><div class="label">Videos</div></div>
|
||||
    <div class="stat-card"><div class="value" id="pt-dl-status">—</div><div class="label">Downloader</div></div>
</div>

<div class="panel">
    <div class="flex gap-8" style="flex-wrap:wrap;align-items:flex-end;margin-bottom:12px;">
        <div style="flex:1;min-width:250px;">
            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">YouTube URL</label>
            <input type="text" id="pt-yt-url" class="search-box" placeholder="https://www.youtube.com/@ChannelName" style="margin-bottom:0;width:100%;">
        </div>
        <div style="min-width:150px;">
            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Category</label>
            <input type="text" id="pt-category" list="pt-cat-list" class="search-box" placeholder="e.g. OPSEC/Privacy" style="margin-bottom:0;width:100%;">
            <datalist id="pt-cat-list"></datalist>
        </div>
        <div style="min-width:60px;">
            <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Priority</label>
            <select id="pt-priority" style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px 10px;font-family:inherit;font-size:12px;width:100%;">
                <option value="M">M</option>
                <option value="H">H</option>
                <option value="L">L</option>
            </select>
        </div>
        <button class="btn" id="pt-add-btn" onclick="addChannel()">Add Channel</button>
    </div>
    <div id="pt-feedback" style="font-size:12px;min-height:18px;"></div>
</div>

<div style="background:#111;border:1px solid #222;overflow-x:auto;">
    <table style="width:100%;border-collapse:collapse;font-size:12px;" id="pt-channel-table">
        <thead>
            <tr style="border-bottom:1px solid #222;">
                <th style="text-align:left;padding:10px;">Channel</th>
                <th style="text-align:center;padding:10px;">Videos</th>
                <th style="text-align:left;padding:10px;">Category</th>
                <th style="text-align:center;padding:10px;">Pri</th>
                <th style="text-align:center;padding:10px;">Status</th>
                <th style="text-align:center;padding:10px;width:60px;"></th>
            </tr>
        </thead>
        <tbody id="pt-channel-tbody"><tr><td colspan="6" style="text-align:center;padding:20px;color:#555;">Loading...</td></tr></tbody>
    </table>
</div>
{% endblock %}
{% block scripts %}
<script src="/static/js/channels.js"></script>
{% endblock %}
53
templates/peertube/dashboard.html
Normal file
@@ -0,0 +1,53 @@
{% extends "base.html" %}
{% block content %}
<div id="pt-dashboard">
    <div class="stat-grid" style="grid-template-columns:repeat(6, 1fr);">
        <div class="stat-card"><div class="label">Published</div><div class="value" id="pt-published">—</div></div>
        <div class="stat-card"><div class="label">In Pipeline</div><div class="value" id="pt-in-pipeline">—</div></div>
        <div class="stat-card"><div class="label">Failed</div><div class="value" id="pt-failed">—</div></div>
        <div class="stat-card"><div class="label">Import Rate</div><div class="value" id="pt-import-rate">—</div><div class="sublabel">/hour</div></div>
        <div class="stat-card"><div class="label">GPU Util</div><div class="value" id="pt-gpu-util">—</div><div class="sublabel">%</div></div>
        <div class="stat-card"><div class="label">GPU Temp</div><div class="value" id="pt-gpu-temp">—</div><div class="sublabel">°C</div></div>
    </div>

    <div class="mb-24">
        <div class="flex-between" style="margin-bottom:4px;font-size:11px;color:#888;">
            <span>Pipeline Flow</span>
            <span id="pt-pipeline-summary"></span>
        </div>
        <div id="pt-pipeline-bar" class="pipeline-bar"></div>
        <div id="pt-pipeline-legend" class="pipeline-legend"></div>
    </div>

    <div class="svc-row">
        <div class="svc-item"><span class="svc-dot unknown" id="svc-downloader"></span>Downloader</div>
        <div class="svc-item"><span class="svc-dot unknown" id="svc-importer"></span>Importer</div>
        <div class="svc-item"><span class="svc-dot unknown" id="svc-transcoder"></span>Transcoder</div>
        <div class="svc-item"><span class="svc-dot unknown" id="svc-runner"></span>Runner</div>
    </div>

    <div id="pt-gpu-panel" class="panel" style="display:none;">
        <h3 class="section-title" style="margin-bottom:8px;">GPU Status</h3>
        <div id="pt-gpu-detail" class="text-small text-muted"></div>
    </div>

    <div id="pt-chart-container" class="panel" style="display:none;">
        <h3 class="section-title" style="margin-bottom:8px;">Pipeline Activity (24h)</h3>
        <canvas id="pt-chart" width="800" height="200" style="width:100%;height:200px;"></canvas>
    </div>

    <div id="pt-storage" class="panel">
        <h3 class="section-title" style="margin-bottom:12px;">Pipeline Storage</h3>
        <div id="pt-storage-content" class="text-small text-muted">Loading...</div>
    </div>

    <details id="pt-errors-panel" class="errors-panel panel">
        <summary>Recent Errors (<span id="pt-error-count">0</span>)</summary>
        <div id="pt-errors-content" style="margin-top:8px;"></div>
    </details>
</div>
{% endblock %}
{% block scripts %}
<script src="/static/js/charts.js"></script>
<script src="/static/js/peertube.js"></script>
{% endblock %}
41
templates/search.html
Normal file
@@ -0,0 +1,41 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">Semantic Search</h3>
<form method="get" action="/search">
    <input type="text" name="q" class="search-box" placeholder="Search the knowledge base..." value="{{ query or '' }}" autofocus>
</form>

{% if not query %}
<p class="text-dim text-small" style="margin-top:8px;">Enter a query to search across all embedded concepts.</p>
{% elif results is defined %}
<p class="text-dim text-small mb-16">{{ results|length }} results for: <strong class="text-green">{{ query }}</strong></p>

{% for r in results %}
<div class="result">
    <span class="score">{{ '%.4f'|format(r.score) }}</span>
    <div class="title">{{ r.title }}</div>
    <div class="meta">
        {{ r.citation }}
        {% if r.download_url %}
            {% if r.source_type == 'web' or (r.download_url.startswith('http') and 'files.echo6.co' not in r.download_url) %}
                | <a href="{{ r.download_url }}" target="_blank" style="color:#00bfff;text-decoration:none;">Web</a>
            {% else %}
                | <a href="{{ r.download_url }}" style="color:#00bfff;text-decoration:none;">PDF</a>
            {% endif %}
        {% endif %}
        {% if r.knowledge_type %}| {{ r.knowledge_type }}{% endif %}
        {% if r.complexity %}/ {{ r.complexity }}{% endif %}
    </div>
    <div class="content-text">{{ r.summary }}</div>
    <div style="margin-top:6px;">
        {% for d in r.domains %}
        <span class="domain-tag">{{ d }}</span>
        {% endfor %}
    </div>
</div>
{% endfor %}

{% elif error %}
<p style="color:#ff4444;">Search error: {{ error }}</p>
{% endif %}
{% endblock %}
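
The Web/PDF link choice in `templates/search.html` is easy to misread when nested inside Jinja tags. As a minimal sketch, the same decision can be written as a plain JavaScript function. `linkLabel` is a hypothetical name introduced here for illustration only; it is not part of the commit, and it assumes a result object with the `download_url` and `source_type` fields the template reads.

```javascript
// Hypothetical helper (not in the repo) mirroring the template's branch:
// no download_url  -> no link at all
// source_type web, or an http(s) URL outside files.echo6.co -> "Web"
// everything else  -> "PDF"
function linkLabel(result) {
  const url = result.download_url;
  if (!url) {
    return null; // the template renders no link in this case
  }
  const external = url.startsWith('http') && !url.includes('files.echo6.co');
  return (result.source_type === 'web' || external) ? 'Web' : 'PDF';
}

console.log(linkLabel({ download_url: 'https://files.echo6.co/a.pdf', source_type: 'pdf' }));
console.log(linkLabel({ download_url: 'https://example.org/article' }));
```

Note that the check is a substring match, as in the template, so any URL that merely contains `files.echo6.co` is treated as a library PDF.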
94
templates/settings/cookies.html
Normal file
@@ -0,0 +1,94 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">YouTube Cookies</h3>
<div class="panel">
    <div id="cookie-status" style="margin-bottom:16px;font-size:12px;color:#666;">Loading cookie status...</div>
    <div class="mb-16">
        <label class="text-dim text-xs" style="text-transform:uppercase;display:block;margin-bottom:4px;">Cookies.txt File (Netscape format)</label>
        <input type="file" id="cookie-file" accept=".txt"
               style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:8px;width:100%;font-family:inherit;">
    </div>
    <button class="btn" id="cookie-btn" onclick="uploadCookies()">Upload Cookies</button>
    <span id="cookie-upload-status" style="margin-left:12px;font-size:12px;"></span>
    <div id="cookie-result" style="display:none;background:#0a0a0a;border:1px solid #222;padding:12px;margin-top:16px;font-size:11px;white-space:pre-wrap;color:#888;max-height:200px;overflow-y:auto;"></div>
</div>
{% endblock %}
{% block scripts %}
<script>
// Fetch cookie freshness from the backend and render it into #cookie-status.
async function loadCookieStatus() {
    try {
        var resp = await fetch('/api/cookies/status');
        var data = await resp.json();
        if (resp.ok) {
            var age = data.age_hours;
            var ageStr, ageColor;
            if (age < 24) {
                ageStr = Math.round(age) + ' hours ago';
                ageColor = '#00ff41';
            } else {
                var days = Math.round(age / 24);
                ageStr = days + ' days ago';
                // Green up to 7 days, orange up to 14, red beyond that.
                ageColor = days > 14 ? '#ff4444' : days > 7 ? '#ffa500' : '#00ff41';
            }
            var html = '<span style="color:' + ageColor + ';">Last updated: ' + ageStr + '</span>';
            if (data.is_stale) {
                html += ' <span style="color:#ff4444;font-weight:bold;">[STALE - cookies likely expired]</span>';
            }
            if (data.recent_rate_limits > 0) {
                html += '<br><span style="color:#ffa500;">YouTube rate limits in last 30min: ' + data.recent_rate_limits + '</span>';
            }
            html += '<br><span class="text-faint">Downloader: ' + (data.downloader_active ? 'active' : 'stopped') + '</span>';
            document.getElementById('cookie-status').innerHTML = html;
        } else {
            document.getElementById('cookie-status').innerHTML = '<span class="text-red">Could not check cookie status</span>';
        }
    } catch(e) {
        document.getElementById('cookie-status').innerHTML = '<span class="text-red">Error: ' + e.message + '</span>';
    }
}

// Upload the selected cookies.txt, show the backend's test output, then refresh the status line.
async function uploadCookies() {
    var fileInput = document.getElementById('cookie-file');
    var btn = document.getElementById('cookie-btn');
    var status = document.getElementById('cookie-upload-status');
    var result = document.getElementById('cookie-result');
    if (!fileInput.files.length) {
        status.style.color = '#ff4444';
        status.textContent = 'No file selected';
        return;
    }
    btn.disabled = true;
    status.style.color = '#ffa500';
    status.textContent = 'Uploading and testing cookies...';
    result.style.display = 'none';
    var formData = new FormData();
    formData.append('file', fileInput.files[0]);
    try {
        var resp = await fetch('/api/cookies/upload', { method: 'POST', body: formData });
        var data = await resp.json();
        if (data.ok) {
            status.style.color = '#00ff41';
            status.textContent = 'Cookies updated and verified';
            result.style.display = 'block';
            result.style.borderColor = '#00ff41';
            result.innerHTML = '<span style="color:#00ff41;">SUCCESS</span><br>' + (data.test_output || '') + '<br>Data lines: ' + data.data_lines;
            loadCookieStatus();
        } else {
            // Red for a hard error, orange when the backend only returned a message.
            status.style.color = data.error ? '#ff4444' : '#ffa500';
            status.textContent = data.error || data.message || 'Upload issue';
            if (data.test_output) {
                result.style.display = 'block';
                result.style.borderColor = '#ff4444';
                result.textContent = data.test_output;
            }
        }
    } catch(e) {
        status.style.color = '#ff4444';
        status.textContent = 'Network error: ' + e.message;
    }
    btn.disabled = false;
}

loadCookieStatus();
</script>
{% endblock %}
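
The age thresholds in `loadCookieStatus()` (hours below one day, then days with green, orange, and red bands) can be isolated into a pure function so they are checkable without a browser. The following is a minimal sketch under that assumption. `describeCookieAge` is a hypothetical name, not something the commit defines; the thresholds and colors are copied from the inline script.

```javascript
// Hypothetical extraction of the age formatting used in loadCookieStatus().
// Input: cookie age in hours. Output: display text and the color used by the page.
function describeCookieAge(ageHours) {
  if (ageHours < 24) {
    return { text: Math.round(ageHours) + ' hours ago', color: '#00ff41' };
  }
  const days = Math.round(ageHours / 24);
  // Same bands as the template: red after 14 days, orange after 7, green otherwise.
  const color = days > 14 ? '#ff4444' : days > 7 ? '#ffa500' : '#00ff41';
  return { text: days + ' days ago', color: color };
}

console.log(describeCookieAge(5));
console.log(describeCookieAge(24 * 10));
```

Like the original, this sketch does not special-case singular units, so an age of one hour reads "1 hours ago".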
68
templates/settings/health.html
Normal file
@@ -0,0 +1,68 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">Service Health</h3>

<div id="health-grid" class="stat-grid" style="grid-template-columns:repeat(auto-fit, minmax(250px, 1fr));">
    <div class="stat-card">
        <div class="label">Qdrant</div>
        <div class="value text-small" id="h-qdrant"><span class="svc-dot unknown"></span>Checking...</div>
    </div>
    <div class="stat-card">
        <div class="label">TEI Embeddings</div>
        <div class="value text-small" id="h-tei"><span class="svc-dot unknown"></span>Checking...</div>
    </div>
    <div class="stat-card">
        <div class="label">NFS Mount</div>
        <div class="value text-small" id="h-nfs"><span class="svc-dot unknown"></span>Checking...</div>
    </div>
    <div class="stat-card">
        <div class="label">Gemini API</div>
        <div class="value text-small" id="h-gemini"><span class="svc-dot unknown"></span>Checking...</div>
    </div>
</div>

<h3 class="section-title mt-24">Pipeline Status</h3>
<div id="h-pipeline" class="panel text-small text-dim">Loading...</div>
{% endblock %}
{% block scripts %}
<script>
// Poll /api/health and update each service card plus the pipeline summary.
async function loadHealth() {
    try {
        var resp = await fetch('/api/health');
        var data = await resp.json();
        var c = data.components || {};

        // 'up' and 'configured' both render as healthy; anything else renders as down.
        function dot(status) {
            var cls = (status === 'up' || status === 'configured') ? 'active' : 'inactive';
            return '<span class="svc-dot ' + cls + '"></span>';
        }

        var q = c.qdrant || {};
        document.getElementById('h-qdrant').innerHTML = dot(q.status) + (q.status === 'up' ? 'Online — ' + RECON.fmt(q.vectors) + ' vectors' : 'Offline' + (q.error ? ' — ' + q.error : ''));

        var t = c.tei || {};
        document.getElementById('h-tei').innerHTML = dot(t.status) + (t.status === 'up' ? 'Online' : 'Offline' + (t.error ? ' — ' + t.error : ''));

        var n = c.nfs || {};
        document.getElementById('h-nfs').innerHTML = dot(n.status) + (n.status === 'up' ? 'Mounted' : 'Not mounted');

        var g = c.gemini || {};
        document.getElementById('h-gemini').innerHTML = dot(g.status === 'configured' ? 'up' : 'down') + (g.status === 'configured' ? g.keys + ' keys configured' : 'No keys');

        // Pipeline
        var p = data.pipeline || {};
        var html = '';
        Object.keys(p).forEach(function(k) {
            html += '<div style="margin:4px 0;"><span class="status status-' + k + '">' + k + '</span>: ' + p[k] + '</div>';
        });
        document.getElementById('h-pipeline').innerHTML = html || '<span class="text-dim">No pipeline data</span>';
    } catch(e) {
        // A failed request says nothing about any single service, so flag every card
        // rather than leaving three of them stuck on "Checking...".
        ['h-qdrant', 'h-tei', 'h-nfs', 'h-gemini'].forEach(function(id) {
            document.getElementById(id).innerHTML = '<span class="svc-dot inactive"></span>Error: ' + e.message;
        });
    }
}

document.addEventListener('DOMContentLoaded', function() {
    RECON.startRefresh(loadHealth, 30000);
});
</script>
{% endblock %}
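
`loadHealth()` concatenates backend strings (`q.error`, pipeline status names) straight into `innerHTML`. If those values can ever contain markup, escaping them first is the usual precaution. The sketch below shows one way that could look; `escapeHtml`, `dotClass`, and `pipelineRows` are hypothetical helpers introduced for illustration, and nothing here is claimed to be how the repository handles it.

```javascript
// Hypothetical helpers, not part of the commit.

// Escape the five characters that matter inside HTML text and attribute values.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Same mapping as dot() in the template: 'up' and 'configured' are healthy.
function dotClass(status) {
  return (status === 'up' || status === 'configured') ? 'active' : 'inactive';
}

// Build the pipeline rows with every dynamic value escaped.
function pipelineRows(pipeline) {
  return Object.keys(pipeline).map(function (k) {
    const key = escapeHtml(k);
    return '<div style="margin:4px 0;"><span class="status status-' + key + '">' +
      key + '</span>: ' + escapeHtml(pipeline[k]) + '</div>';
  }).join('');
}

console.log(pipelineRows({ complete: 3, failed: 1 }));
```

The output markup matches what the template builds for ordinary status names; only values containing HTML-special characters would render differently.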
137
templates/settings/keys.html
Normal file
@@ -0,0 +1,137 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">API Keys</h3>
<div style="margin-bottom:20px;">
    <button class="btn" onclick="validateAll()" id="btn-validate">Validate All</button>
    <button class="btn" onclick="reloadKeys()" style="margin-left:8px;">Reload from .env</button>
    <button class="btn btn-warn" onclick="restartService()" style="margin-left:8px;">Restart Service</button>
    <span id="validate-status" style="margin-left:12px;color:#666;font-size:12px;"></span>
</div>
<table id="keys-table">
    <tr><th>#</th><th>Key</th><th>Status</th><th>Calls</th><th>Errors</th><th>Last Used</th><th>Actions</th></tr>
    {% for k in keys_data %}
    <tr id="key-row-{{ k.index }}">
        <td>{{ k.index + 1 }}</td>
        <td class="mono text-small">{{ k.masked }}</td>
        <td>
            {% if k.valid is true %}
            <span class="text-green">Valid</span>
            {% elif k.valid is false %}
            <span class="text-red">Invalid</span>
            {% else %}
            <span class="text-dim">—</span>
            {% endif %}
        </td>
        <td>{{ k.calls }}</td>
        <td class="{% if k.errors %}text-red{% else %}text-muted{% endif %}">{{ k.errors }}</td>
        <td class="text-dim text-xs">{{ k.last_used or '—' }}</td>
        <td>
            <button class="btn text-xs" onclick="validateKey({{ k.index }})">Test</button>
            <button class="btn btn-danger text-xs" onclick="removeKey({{ k.index }})">Remove</button>
        </td>
    </tr>
    {% endfor %}
</table>

<div style="margin-top:24px;border-top:1px solid #222;padding-top:16px;">
    <h4 class="text-muted" style="margin-bottom:12px;">Add Key</h4>
    <div class="flex gap-8" style="align-items:center;">
        <input type="text" id="new-key" placeholder="Paste Gemini API key..."
               style="flex:1;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px 12px;border-radius:4px;font-family:monospace;font-size:13px;">
        <button class="btn" onclick="addKey()">Add</button>
    </div>
    <div id="add-result" style="margin-top:8px;font-size:12px;"></div>
</div>

<div style="margin-top:24px;border-top:1px solid #222;padding-top:16px;">
    <h4 class="text-muted" style="margin-bottom:12px;">Replace Key</h4>
    <div class="flex gap-8" style="align-items:center;">
        <input type="number" id="replace-index" placeholder="#" min="1" max="9"
               style="width:50px;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px;border-radius:4px;text-align:center;">
        <input type="text" id="replace-key" placeholder="New Gemini API key..."
               style="flex:1;background:#1a1a1a;border:1px solid #333;color:#ccc;padding:8px 12px;border-radius:4px;font-family:monospace;font-size:13px;">
        <button class="btn" onclick="replaceKey()">Replace</button>
    </div>
    <div id="replace-result" style="margin-top:8px;font-size:12px;"></div>
</div>
{% endblock %}
{% block scripts %}
<script>
// Validate every configured key, report the valid/total count, then reload the table.
async function validateAll() {
    document.getElementById('btn-validate').disabled = true;
    document.getElementById('validate-status').textContent = 'Validating...';
    try {
        var r = await fetch('/api/keys/validate', {method:'POST'});
        var data = await r.json();
        var validCount = data.results.filter(function(res) { return res.valid; }).length;
        document.getElementById('validate-status').textContent = 'Done — ' + validCount + '/' + data.results.length + ' valid';
        setTimeout(function() { location.reload(); }, 1000);
    } catch(e) {
        document.getElementById('validate-status').textContent = 'Error: ' + e;
    }
    document.getElementById('btn-validate').disabled = false;
}

// idx is the zero-based key index; the UI shows it one-based.
async function validateKey(idx) {
    try {
        var r = await fetch('/api/keys/' + idx + '/validate', {method:'POST'});
        var data = await r.json();
        alert('Key ' + (idx+1) + ': ' + data.message);
        location.reload();
    } catch(e) { alert('Error: ' + e); }
}

async function removeKey(idx) {
    if (!confirm('Remove key ' + (idx+1) + '? Pipeline needs at least 1 key.')) return;
    try {
        var r = await fetch('/api/keys/' + idx, {method:'DELETE'});
        var data = await r.json();
        if (data.error) { alert(data.error); return; }
        location.reload();
    } catch(e) { alert('Error: ' + e); }
}

async function addKey() {
    var key = document.getElementById('new-key').value.trim();
    if (!key) return;
    try {
        var r = await fetch('/api/keys', {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify({key:key})});
        var data = await r.json();
        if (data.error) { document.getElementById('add-result').innerHTML = '<span class="text-red">' + data.error + '</span>'; return; }
        document.getElementById('add-result').innerHTML = '<span class="text-green">Added at position ' + (data.index+1) + '</span>';
        setTimeout(function() { location.reload(); }, 1000);
    } catch(e) { document.getElementById('add-result').innerHTML = '<span class="text-red">' + e + '</span>'; }
}

async function replaceKey() {
    // The form takes the one-based number shown in the table; the API expects a zero-based index.
    var idx = parseInt(document.getElementById('replace-index').value, 10) - 1;
    var key = document.getElementById('replace-key').value.trim();
    if (isNaN(idx) || idx < 0 || !key) return;
    try {
        var r = await fetch('/api/keys/' + idx, {method:'PUT', headers:{'Content-Type':'application/json'}, body:JSON.stringify({key:key})});
        var data = await r.json();
        if (data.error) { document.getElementById('replace-result').innerHTML = '<span class="text-red">' + data.error + '</span>'; return; }
        document.getElementById('replace-result').innerHTML = '<span class="text-green">Replaced key ' + (idx+1) + '</span>';
        setTimeout(function() { location.reload(); }, 1000);
    } catch(e) { document.getElementById('replace-result').innerHTML = '<span class="text-red">' + e + '</span>'; }
}

async function restartService() {
    if (!confirm('Restart RECON service? Pipeline will pause for ~10 seconds.')) return;
    document.getElementById('validate-status').textContent = 'Restarting...';
    try {
        await fetch('/api/service/restart', {method:'POST'});
    } catch(e) {
        // The connection is expected to drop while the service restarts; ignore it.
    }
    // The message matches the 30000 ms reload delay below.
    document.getElementById('validate-status').innerHTML = '<span style="color:#ff8800;">Restarting... page will reload in 30s</span>';
    setTimeout(function() { location.reload(); }, 30000);
}

async function reloadKeys() {
    try {
        var r = await fetch('/api/keys/reload', {method:'POST'});
        var data = await r.json();
        alert('Reloaded ' + data.count + ' key(s) from .env');
        location.reload();
    } catch(e) { alert('Error: ' + e); }
}
</script>
{% endblock %}
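
The key table is numbered from 1 (`k.index + 1`) while the `/api/keys/<idx>` routes take the zero-based index, so the Replace form has to convert between the two. A minimal sketch of that conversion with range checking follows. `toKeyIndex` is a hypothetical helper, and the `keyCount` parameter is an assumption; the template as committed does not pass the number of keys to the script.

```javascript
// Hypothetical helper, not part of the commit.
// Converts the one-based number typed into the form into the zero-based
// index used in the API path, or returns null when the input is unusable.
function toKeyIndex(raw, keyCount) {
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n) || n < 1 || n > keyCount) {
    return null;
  }
  return n - 1;
}

console.log(toKeyIndex('1', 3));
console.log(toKeyIndex('0', 3));
```

Returning `null` for out-of-range input keeps a stray `0` or empty field from turning into a request for index `-1`.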
97
templates/settings/vpn.html
Normal file
@@ -0,0 +1,97 @@
{% extends "base.html" %}
{% block content %}
<h3 class="section-title mb-16">NordVPN</h3>
<div class="panel">
    <div id="vpn-status" style="margin-bottom:16px;font-size:12px;color:#666;">Loading VPN status...</div>
    <div class="flex gap-8" style="flex-wrap:wrap;margin-bottom:12px;">
        <button class="btn" onclick="vpnRotate()" id="vpn-rotate-btn">Rotate</button>
        <button class="btn" onclick="vpnDisconnect()" id="vpn-disconnect-btn">Disconnect</button>
        <select id="vpn-country" style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px;font-family:inherit;font-size:12px;">
            <option value="United_States">United States</option>
            <option value="Canada">Canada</option>
            <option value="United_Kingdom">United Kingdom</option>
            <option value="Germany">Germany</option>
            <option value="Netherlands">Netherlands</option>
            <option value="Sweden">Sweden</option>
        </select>
        <button class="btn" onclick="vpnConnect()" id="vpn-connect-btn">Connect</button>
    </div>
    <span id="vpn-action-status" style="font-size:12px;"></span>
    <details style="margin-top:16px;">
        <summary class="text-faint" style="cursor:pointer;font-size:11px;">Setup (one-time)</summary>
        <div style="margin-top:8px;">
            <input type="password" id="vpn-token" placeholder="NordVPN token"
                   style="background:#0a0a0a;border:1px solid #333;color:#c0c0c0;padding:6px;width:300px;font-family:inherit;font-size:12px;">
            <button class="btn" onclick="vpnLogin()">Login</button>
            <span id="vpn-login-status" style="font-size:11px;margin-left:8px;"></span>
        </div>
    </details>
</div>
{% endblock %}
{% block scripts %}
<script>
// Render the current VPN connection state into #vpn-status.
async function loadVpnStatus() {
    try {
        var resp = await fetch('/api/vpn/status');
        var data = await resp.json();
        if (resp.ok) {
            var dot = data.connected ? '<span style="color:#00ff41;">●</span>' : '<span style="color:#ff4444;">●</span>';
            var html = dot + ' ' + (data.connected ? 'Connected' : 'Disconnected');
            if (data.connected) {
                html += ' — <span style="color:#00ff41;">' + data.country + '</span>';
                html += ' <span class="text-faint">(' + data.ip + ')</span>';
            }
            if (data.rotations_today > 0) {
                html += '<br><span class="text-faint">Rotations today: ' + data.rotations_today + '</span>';
            }
            document.getElementById('vpn-status').innerHTML = html;
        } else {
            // Without this branch a non-2xx response leaves "Loading VPN status..." on screen.
            document.getElementById('vpn-status').innerHTML = '<span class="text-red">Could not check VPN status</span>';
        }
    } catch(e) {
        document.getElementById('vpn-status').innerHTML = '<span class="text-red">Error: ' + e.message + '</span>';
    }
}

// Shared request wrapper: shows progress in statusEl (default #vpn-action-status),
// reports the outcome, then refreshes the status line.
async function vpnAction(url, opts, statusEl) {
    var el = document.getElementById(statusEl || 'vpn-action-status');
    el.style.color = '#ffa500';
    el.textContent = 'Working...';
    try {
        var resp = await fetch(url, opts);
        var data = await resp.json();
        if (data.ok) {
            el.style.color = '#00ff41';
            el.textContent = data.country ? (data.country + ' (' + data.ip + ')') : (data.message || 'Done');
        } else {
            el.style.color = '#ff4444';
            el.textContent = data.error || data.message || 'Failed';
        }
        loadVpnStatus();
    } catch(e) {
        el.style.color = '#ff4444';
        el.textContent = 'Error: ' + e.message;
    }
}

function vpnRotate() { vpnAction('/api/vpn/rotate', {method:'POST'}); }
function vpnDisconnect() { vpnAction('/api/vpn/disconnect', {method:'POST'}); }
function vpnConnect() {
    var country = document.getElementById('vpn-country').value;
    vpnAction('/api/vpn/connect', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({country: country})
    });
}
function vpnLogin() {
    var token = document.getElementById('vpn-token').value;
    if (!token) return;
    vpnAction('/api/vpn/login', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({token: token})
    }, 'vpn-login-status');
}

loadVpnStatus();
</script>
{% endblock %}
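
`vpnAction()` turns every backend reply into one colored status line, with a fallback chain for missing fields. The sketch below separates that mapping from the DOM so the fallbacks are easy to check. `describeVpnResult` is a hypothetical name, and the response shape (`ok`, `country`, `ip`, `message`, `error`) is taken only from what the inline script reads; the actual API contract is not shown in this diff.

```javascript
// Hypothetical extraction of the response handling in vpnAction().
// Returns the text and color the page would show for a parsed JSON reply.
function describeVpnResult(data) {
  if (data.ok) {
    const text = data.country
      ? data.country + ' (' + data.ip + ')'
      : (data.message || 'Done');
    return { color: '#00ff41', text: text };
  }
  // Failure: prefer the explicit error, then any message, then a generic label.
  return { color: '#ff4444', text: data.error || data.message || 'Failed' };
}

console.log(describeVpnResult({ ok: true, country: 'Sweden', ip: '203.0.113.7' }));
console.log(describeVpnResult({ ok: false }));
```

The IP address in the usage line comes from the documentation-only range 203.0.113.0/24 and is purely illustrative.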