recon

matt/recon

mirror of https://github.com/zvx-echo6/recon.git synced 2026-05-20 14:44:54 +02:00

Commit graph

Author	SHA1	Message	Date

Matt

83a21854c3

fix: PDF extraction quality — word-boundary checks and layout mode

Adds _text_quality_ok() gate that replaces the bare 50-char length
check at each stage of the extraction fallback chain. Checks:
- Word-boundary ratio (≥60% of tokens must be real words)
- Concatenation ratio (lc→UC transitions must be <10% of word count)
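The two checks listed above could be sketched roughly as follows. This is not the repository's `_text_quality_ok()`: only the 50-character floor and the 60% and 10% thresholds come from the commit message. The definition of a "real word" (letters only, at most 20 characters, after stripping surrounding punctuation) and the name `text_quality_ok` are assumptions made for illustration.

```python
import re

MIN_CHARS = 50           # the earlier bare length check, kept here as a floor (assumption)
MIN_WORD_RATIO = 0.60    # from the commit message
MAX_CONCAT_RATIO = 0.10  # from the commit message

# Assumption: a "real word" is purely alphabetic, optionally with inner
# apostrophes or hyphens, and no longer than 20 characters.
_WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")
_LC_UC_RE = re.compile(r"[a-z][A-Z]")  # lowercase followed directly by uppercase


def text_quality_ok(text: str) -> bool:
    """Return True if extracted text looks like properly spaced prose."""
    if len(text.strip()) < MIN_CHARS:
        return False
    tokens = text.split()
    if not tokens:
        return False

    # Word-boundary ratio: share of tokens that look like real words.
    stripped = [t.strip(".,;:!?()[]{}\"'") for t in tokens]
    words = [t for t in stripped if len(t) <= 20 and _WORD_RE.match(t)]
    word_ratio = len(words) / len(tokens)

    # Concatenation ratio: lc->UC transitions relative to the word count.
    transitions = len(_LC_UC_RE.findall(text))
    concat_ratio = transitions / len(tokens)

    return word_ratio >= MIN_WORD_RATIO and concat_ratio < MAX_CONCAT_RATIO
```

With these assumed definitions, a run of fused tokens such as 'byMike oftheGuild' fails on the concatenation ratio even though every token is alphabetic, which is the case a plain length check cannot catch.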

When PyPDF2 default extraction fails quality check, retries with
space_width=100 for tighter word-boundary detection. This fixes
Haynes/workshop manuals where tight kerning produces concatenated
words like 'byMike' and 'oftheGuild'.

Also adds -layout flag to pdftotext subprocess calls for better
spatial awareness in the poppler fallback stage.

Note: PyPDF2 3.0.1 does not support layout=True parameter.
The space_width parameter serves the same purpose.
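A minimal sketch of how such a fallback chain might be wired, under stated assumptions: the stage order, the function names (`pypdf2_text`, `pdftotext_layout`, `first_good_text`, `default_stages`), and the error handling are hypothetical and not taken from the repository. The `space_width` keyword of PyPDF2's `extract_text()` and the `-layout` flag of `pdftotext` are the only details drawn from the message above. The quality gate is passed in as a parameter so the chain does not depend on any particular check.

```python
import subprocess
from typing import Callable, Iterable, Optional


def pypdf2_text(path: str, space_width: float = 200.0) -> str:
    """Extract text with PyPDF2; 200.0 is the library's default space_width."""
    from PyPDF2 import PdfReader  # lazy import so the sketch loads without PyPDF2

    reader = PdfReader(path)
    return "\n".join(
        page.extract_text(space_width=space_width) or "" for page in reader.pages
    )


def pdftotext_layout(path: str) -> str:
    """Extract text with poppler's pdftotext in layout mode, writing to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def first_good_text(
    path: str,
    stages: Iterable[Callable[[str], str]],
    quality_ok: Callable[[str], bool],
) -> Optional[str]:
    """Run each stage in order and return the first output that passes the gate."""
    for extract in stages:
        try:
            text = extract(path)
        except Exception:
            continue  # a failing stage falls through to the next one
        if quality_ok(text):
            return text
    return None


def default_stages() -> list:
    # Assumed order: PyPDF2 default, PyPDF2 retry with space_width=100, then poppler.
    return [
        pypdf2_text,
        lambda p: pypdf2_text(p, space_width=100),
        pdftotext_layout,
    ]
```

A call might look like `first_good_text("manual.pdf", default_stages(), my_gate)`, where `my_gate` is whatever quality check is in use; a `None` result means every stage failed the gate.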

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-07 01:36:23 +00:00

Matt

563c16bb71

Initial commit: RECON codebase baseline

Current state of the pipeline code as of 2026-04-14 (Phase 1 scaffolding complete).
Config has new_pipeline.enabled=false and crawler.sites=[] per refactor plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-14 14:57:23 +00:00

2 commits