# LLM Memory - Quick Reference Card ## The Problem Current MeshAI sends full conversation history every request → wastes tokens, slow, expensive. ## The Solution **Rolling Summary Memory**: Keep recent messages + LLM-generated summary of older messages. ## Results - 70-80% token reduction for long conversations - Zero dependencies - Works with existing stack (AsyncOpenAI + SQLite) - ~100 lines of code --- ## How It Works (5-Second Version) ``` Long conversation (30 messages): Messages 1-22: "User discussed weather and hiking trails" (summary) Messages 23-30: [sent in full] Total tokens: ~600 instead of ~2400 (75% savings) ``` --- ## Implementation Checklist - [ ] Create `meshai/memory.py` - RollingSummaryMemory class - [ ] Modify `meshai/history.py` - Add summary table + storage methods - [ ] Modify `meshai/backends/openai_backend.py` - Integrate memory manager - [ ] Modify `meshai/responder.py` - Pass user_id, persist summaries - [ ] Modify `meshai/commands/reset.py` - Clear summaries on reset --- ## Configuration ```python # In memory.py initialization RollingSummaryMemory( client=self._client, model=config.model, window_size=4, # Keep last 4 exchanges (8 messages) summarize_threshold=8, # Re-summarize after 8 new messages ) ``` **Tune based on:** - `window_size`: Smaller = more summarization, larger = more recent context - `summarize_threshold`: Smaller = more frequent re-summarization --- ## Database Schema Addition ```sql CREATE TABLE conversation_summaries ( user_id TEXT PRIMARY KEY, summary TEXT NOT NULL, message_count INTEGER NOT NULL, updated_at REAL NOT NULL ); ``` --- ## Testing ```bash # Run proof-of-concept comparison python examples/memory_comparison.py # Update these first: # - BASE_URL (your LLM endpoint) # - API_KEY (your key) # - MODEL (your model name) ``` **Expected output:** ``` Approach Tokens Savings ---------------------------------------------- Full History 1847 (baseline) Rolling Summary 512 72.3% Window Only 398 78.4% ``` --- ## Key Code Snippets ### Memory Manager Usage ```python # Get optimized context summary, recent_messages = await memory.get_context_messages( user_id=user_id, full_history=all_messages, ) # Build message list if summary: system_prompt += f"\n\nPrevious conversation: {summary}" context = [system] + recent_messages else: context = [system] + all_messages ``` ### Store Summary ```python await history.store_summary( user_id=user_id, summary=summary_text, message_count=len(old_messages) ) ``` ### Load Summary on Startup ```python summary_data = await history.get_summary(user_id) if summary_data: backend.load_summary_cache(user_id, summary_data) ``` --- ## Performance Metrics | Messages | Full History | With Summary | Savings | |----------|--------------|--------------|---------| | 10 | 800 tokens | 800 tokens | 0% | | 20 | 1600 tokens | 550 tokens | 66% | | 30 | 2400 tokens | 600 tokens | 75% | | 50 | 4000 tokens | 650 tokens | 84% | **Cost Impact** (at $0.50/1M input tokens, 1000 requests/day): - Before: $36/month - After: $9/month - **Savings: $27/month** --- ## When to Use Alternatives | Use Case | Recommendation | |----------|----------------| | Simple stateless chat | Window-only memory | | MeshAI (your project) | **Rolling Summary** | | Want library solution | LangChain SummaryMemory | | Need semantic search | ChromaDB vector store | | Complex multi-day agent | MemGPT/Letta | --- ## Troubleshooting **Summary too short/long?** → Adjust `max_tokens` in `_summarize()` method (default: 150) **Summary quality poor?** → Modify prompt in `_summarize()`, lower temperature **Too much overhead?** → Increase `summarize_threshold` (re-summarize less often) **Want more context?** → Increase `window_size` (keep more recent messages) --- ## Documentation Files 1. **MEMORY_SUMMARY.md** - Overview and recommendation (this started here) 2. **MEMORY_RESEARCH.md** - Detailed evaluation of all 5 approaches 3. **MEMORY_IMPLEMENTATION_GUIDE.md** - Complete step-by-step implementation 4. **examples/memory_comparison.py** - Runnable proof-of-concept 5. **docs/memory_approaches_comparison.txt** - Visual comparison diagrams 6. **docs/QUICK_REFERENCE.md** - This cheat sheet --- ## One-Liner Summary **Use Rolling Summary**: Zero deps, 75% token savings, 100 lines of code, works with your stack.