Mirror of https://github.com/zvx-echo6/recon.git (synced 2026-05-20 06:34:40 +02:00)
Pre-processes HTML tree before lxml .text_content() to prevent element concatenation:

- <table> cells joined with ' | ' delimiter, rows with newlines
- <br> tags produce newlines
- <li> items get '- ' prefix and newline separation
- <dt>/<dd> definition list items get newline separation

Fixes ~868 mangled Qdrant points where table content was jammed together (e.g. 'Freq51Primary1A==' instead of 'Freq51 | Primary | 1A==').

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

| File |
|---|
| `__init__.py` |
| `pdf_processor.py` |
| `text_processor.py` |
| `transcript_processor.py` |
| `zim_processor.py` |
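
The pre-processing described in the commit message above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function names `preprocess_tree` and `html_to_text` are hypothetical, and the sketch assumes the delimiters are injected by editing each element's `.text` and `.tail` before calling `.text_content()`.

```python
import lxml.html


def preprocess_tree(root):
    """Insert delimiters in place so text_content() keeps element boundaries.

    Hypothetical helper, written only to illustrate the commit message.
    """
    # <br>: the text following the tag starts on a new line.
    for br in root.iter("br"):
        br.tail = "\n" + (br.tail or "")

    # <li>: "- " prefix, and a newline after each item.
    for li in root.iter("li"):
        li.text = "- " + (li.text or "")
        li.tail = "\n" + (li.tail or "")

    # <dt>/<dd>: a newline after each term and each definition.
    for item in root.iter("dt", "dd"):
        item.tail = "\n" + (item.tail or "")

    # <tr>: flatten each row to "cell | cell | cell" followed by a newline.
    # Reversed document order processes rows of nested tables first.
    for tr in reversed(list(root.iter("tr"))):
        cells = [c for c in tr if c.tag in ("td", "th")]
        row_text = " | ".join(c.text_content().strip() for c in cells)
        for child in list(tr):
            tr.remove(child)
        tr.text = row_text
        tr.tail = "\n" + (tr.tail or "")


def html_to_text(html):
    """Parse an HTML fragment, pre-process it, and return its plain text."""
    root = lxml.html.fromstring(html)
    preprocess_tree(root)
    return root.text_content().strip()
```

For example, `html_to_text("<table><tr><td>Freq51</td><td>Primary</td><td>1A==</td></tr></table>")` yields `'Freq51 | Primary | 1A=='`, whereas calling `.text_content()` on the unmodified tree would give `'Freq51Primary1A=='`. Editing `.text` and `.tail` keeps the change local to the parsed tree, so the text extraction that follows does not need to change.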