Matt efae4023f6 Phase 6c: remove vestigial extract worker, dead crawler, .bak files
recon.py:
- Remove extract stage_loop thread from cmd_service(). Confirmed
  vestigial: 0 queued items, silent logs over 24+ hour run. The new
  processors do extraction inline in pre_flight().
- Remove cmd_crawl CLI subcommand and its argparse registration.
- Clean up associated imports and variables.

Deleted:
- lib/crawler.py (432 lines) -- old web crawler subsystem, only
  referenced by the removed CLI subcommand.
- 24 .bak files (untracked pre-edit safety backups, originals
  preserved in git history).

Investigation found the four old loop function definitions
(scanner_loop, peertube_scanner_loop, crawler_scheduler_loop,
organizer_loop) were already deleted in Phase 5c-1.

Modules investigated and KEPT:
- lib/web_scraper.py -- exports chunk_text() used by transcript_processor
- lib/new_pipeline.py -- active Stream B library management CLI tool
- lib/peertube_scraper.py -- only mechanism for transcript ingestion
- lib/extractor.py -- would activate for new PDFs via cmd_run CLI

Service restart verified: 6 threads (dispatcher, enrich, embed,
filing, progress, dashboard), no extract worker, zero errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 23:46:00 +00:00
Name                 Last commit                                                          Date
lib                  Phase 6c: remove vestigial extract worker, dead crawler, .bak files  2026-04-14 23:46:00 +00:00
scripts              Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
static               Phase 6b: fix dashboard Untitled/WEB bug for transcripts             2026-04-14 23:05:29 +00:00
templates            Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
.gitignore           Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
api.py               Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
config.yaml          Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
enricher.py          Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
migrate_paths.py     Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
PROJECT-BIBLE.md     Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
README.md            Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
recon.py             Phase 6c: remove vestigial extract worker, dead crawler, .bak files  2026-04-14 23:46:00 +00:00
requirements.txt     Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
run-pipeline-now.sh  Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00
sweep_gated.sh       Initial commit: RECON codebase baseline                              2026-04-14 14:57:23 +00:00

RECON -- Knowledge Extraction Pipeline

Extracts structured knowledge from PDFs and web content into a Qdrant vector database for RAG retrieval by Aurora.

Quick Start

# Activate
cd /opt/recon && source venv/bin/activate

# Scan library for new PDFs
recon scan

# Queue and process
recon queue
recon extract
recon enrich
recon embed

# Or run full pipeline
recon run

# Ingest a web page
recon ingest-url "https://example.com/article" --category "Category" --process

# Crawling an entire docs site is no longer available: the crawl subcommand and lib/crawler.py were removed in Phase 6c.

# Upload a PDF
recon upload --file /path/to/document.pdf --category "Category"

# Search
recon search "water purification methods"

# Check status
recon status
recon failures
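
The staged commands above (scan, queue, extract, enrich, embed) can also be driven from a script that stops at the first failing stage. The following is a minimal sketch, not part of RECON itself: the helper names build_commands and run_pipeline are hypothetical, and it assumes the recon CLI signals failure with a non-zero exit code, which this README does not confirm.

```python
import subprocess
from typing import Callable, List, Sequence

# Stage order copied from the Quick Start commands.
STAGES = ["scan", "queue", "extract", "enrich", "embed"]


def build_commands(stages: Sequence[str] = STAGES, binary: str = "recon") -> List[List[str]]:
    """Build one argv list per stage, e.g. ["recon", "scan"]."""
    return [[binary, stage] for stage in stages]


def run_pipeline(commands: Sequence[List[str]], runner: Callable = subprocess.run) -> List[str]:
    """Run the commands in order and return the stages that completed.

    Assumption: a non-zero return code means the stage failed.
    Later stages are skipped once one stage fails.
    """
    completed = []
    for cmd in commands:
        result = runner(cmd, check=False)
        if result.returncode != 0:
            break
        completed.append(cmd[-1])
    return completed
```

Calling run_pipeline(build_commands()) from the activated venv runs the five stages in sequence. For ordinary use, recon run already covers the full pipeline; the sketch is only useful when a wrapper needs to know which stage stopped.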

Dashboard

http://100.64.0.24:8420

Services

Service          Location                      Purpose
RECON Dashboard  recon:8420                    Pipeline management + API
Qdrant           cortex:6333                   Vector database
TEI              cortex:8090                   Embeddings (1,711/sec)
Ollama           cortex:11434                  Chat + fallback embeddings
OpenWebUI        cortex:8080 (ai.echo6.co)     Aurora chat with RAG
File Server      recon:8888 (files.echo6.co)   PDF downloads
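
A quick way to confirm that the services listed above are reachable is a plain TCP connect to each host and port. The sketch below is illustrative only: the function names are hypothetical, the hostnames recon and cortex are taken from the table and only resolve inside the RECON network, and an open port does not prove the service behind it is healthy.

```python
import socket
from typing import Dict, Tuple

# Endpoints copied from the Services table.
SERVICES: Dict[str, str] = {
    "RECON Dashboard": "recon:8420",
    "Qdrant": "cortex:6333",
    "TEI": "cortex:8090",
    "Ollama": "cortex:11434",
    "OpenWebUI": "cortex:8080",
    "File Server": "recon:8888",
}


def parse_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split "host:port" into (host, port)."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_all(services: Dict[str, str] = SERVICES, timeout: float = 2.0) -> Dict[str, bool]:
    """Map each service name to whether its port accepted a connection."""
    return {
        name: is_listening(*parse_endpoint(endpoint), timeout=timeout)
        for name, endpoint in services.items()
    }
```

Running check_all() from a host inside the network returns a name-to-bool mapping that can be printed or fed into an alert.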

Key Paths

Path                        Contents
/opt/recon/                 Application code
/opt/recon/data/concepts/   Gemini extractions (CRITICAL -- back these up)
/opt/recon/data/text/       Extracted text
/opt/recon/data/recon.db    SQLite status DB
/mnt/library/               PDF library (NFS from pi-nas)
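
The SQLite status DB can be inspected directly when the CLI is not at hand. This README does not document the schema of recon.db, so the sketch below deliberately assumes nothing about table names: it reads them from sqlite_master and counts rows in each. The function name table_row_counts is hypothetical.

```python
import sqlite3
from typing import Dict


def table_row_counts(db_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    # Open read-only so inspection cannot modify the status DB.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names come from the database itself, quoted as identifiers.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

For example, table_row_counts("/opt/recon/data/recon.db") gives a rough picture of how much the pipeline is tracking; recon status remains the supported way to read pipeline state.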

Backups

Backups run automatically every 6 hours to a Contabo VPS via /opt/recon/scripts/backup.sh. The concept JSONs are the most valuable data ($130+ of Gemini API work). Qdrant is NOT backed up; it is rebuilt from the JSONs in ~10 minutes via recon rebuild.
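
Because backups run on a 6-hour interval, any concept JSON written since the last run may exist only on the local disk. The sketch below lists such files by modification time. It is an illustration under assumptions: the function name is hypothetical, it assumes concept files carry a .json extension somewhere under /opt/recon/data/concepts/, and it does not read the actual schedule or state of backup.sh.

```python
import time
from pathlib import Path
from typing import List, Optional

# Interval taken from the README: backups run every 6 hours.
BACKUP_INTERVAL_SECONDS = 6 * 60 * 60


def unprotected_concepts(concepts_dir: str, now: Optional[float] = None) -> List[Path]:
    """Return .json files modified within the last backup interval.

    These files may not have reached the backup target yet.
    """
    now = time.time() if now is None else now
    cutoff = now - BACKUP_INTERVAL_SECONDS
    return sorted(
        path
        for path in Path(concepts_dir).rglob("*.json")
        if path.stat().st_mtime > cutoff
    )
```

A non-empty result from unprotected_concepts("/opt/recon/data/concepts/") is a hint to run the backup script manually before risky maintenance.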

Monitoring

# Pipeline status
recon status

# Tail logs
tail -f /opt/recon/logs/recon.log

# Pipeline run log
tail -f /opt/recon/pipeline.log

# Validate consistency
recon validate --deep

Full Documentation

See PROJECT-BIBLE.md for complete system documentation.