# LLM Conversation Memory Research for MeshAI ## Current Implementation Analysis **Current approach:** MeshAI stuffs full conversation history into every LLM API call - Storage: SQLite via aiosqlite - Retrieval: `get_history_for_llm()` returns all messages (up to `max_messages_per_user * 2`) - Backend: OpenAI-compatible API (works with LiteLLM, local models) - Context: 150 char max per message, per-user conversations **Problem:** Inefficient - sends entire history even when unnecessary, wastes tokens and latency. --- ## 1. LangChain Memory Modules ### Installation ```bash pip install langchain langchain-community langchain-openai ``` ### A. ConversationBufferMemory (Simplest) **What it does:** Stores raw messages in memory, returns all messages. ```python from langchain.memory import ConversationBufferMemory from langchain_openai import ChatOpenAI from langchain.chains import ConversationChain # Initialize llm = ChatOpenAI( base_url="http://192.168.1.239:8000/v1", # LiteLLM api_key="your-key", model="gpt-4o-mini" ) memory = ConversationBufferMemory() chain = ConversationChain( llm=llm, memory=memory, verbose=False ) # Use it response = chain.predict(input="What's the weather?") print(response) # Access history print(memory.load_memory_variables({})) # {'history': 'Human: What's the weather?\nAI: ...'} ``` **Integration with MeshAI:** ```python # In meshai/backends/openai_backend.py from langchain.memory import ConversationBufferMemory from langchain_openai import ChatOpenAI from langchain.chains import ConversationChain class OpenAIBackendWithMemory(LLMBackend): def __init__(self, config: LLMConfig, api_key: str): self.config = config self._llm = ChatOpenAI( base_url=config.base_url, api_key=api_key, model=config.model, temperature=0.7, max_tokens=300 ) # Per-user memory storage self._user_memories: dict[str, ConversationBufferMemory] = {} def _get_memory(self, user_id: str) -> ConversationBufferMemory: if user_id not in self._user_memories: self._user_memories[user_id] = ConversationBufferMemory() return self._user_memories[user_id] async def generate( self, messages: list[dict], system_prompt: str, user_id: str, # NEW: need user_id for memory max_tokens: int = 300, ) -> str: memory = self._get_memory(user_id) # Create chain with memory chain = ConversationChain( llm=self._llm, memory=memory, verbose=False ) # Extract last user message last_msg = messages[-1]["content"] # Generate with memory response = await chain.apredict(input=last_msg) return response.strip() ``` **Pros:** - Dead simple, drop-in replacement - Works with any OpenAI-compatible API - No external dependencies - LangChain handles message formatting **Cons:** - Still sends full history (no real efficiency gain) - Stores everything in RAM (lost on restart) - Need to manage per-user memory dicts - Adds LangChain dependency (~50MB) **Verdict:** Not worth it - adds complexity without solving core problem. --- ### B. ConversationBufferWindowMemory (Better) **What it does:** Only keeps last N messages in context. ```python from langchain.memory import ConversationBufferWindowMemory # Keep only last 5 interactions (10 messages = 5 pairs) memory = ConversationBufferWindowMemory(k=5) chain = ConversationChain( llm=llm, memory=memory ) # Only last 5 exchanges sent to LLM response = chain.predict(input="Hello") ``` **Integration:** ```python class OpenAIBackendWithWindow(LLMBackend): def __init__(self, config: LLMConfig, api_key: str): self.config = config self._llm = ChatOpenAI( base_url=config.base_url, api_key=api_key, model=config.model ) # Per-user windowed memory self._user_memories: dict[str, ConversationBufferWindowMemory] = {} self._window_size = 5 # Last 5 exchanges def _get_memory(self, user_id: str) -> ConversationBufferWindowMemory: if user_id not in self._user_memories: self._user_memories[user_id] = ConversationBufferWindowMemory( k=self._window_size ) return self._user_memories[user_id] ``` **Pros:** - Simple sliding window approach - Reduces token usage automatically - Works with any OpenAI-compatible API - Configurable window size **Cons:** - Still in-memory only (lost on restart) - Forgets old context completely - Need to integrate with existing SQLite storage - Adds LangChain dependency **Verdict:** Better than full buffer, but loses long-term context. --- ### C. ConversationSummaryMemory (Most Interesting) **What it does:** Uses LLM to summarize conversation, keeps summary + recent messages. ```python from langchain.memory import ConversationSummaryMemory memory = ConversationSummaryMemory(llm=llm) chain = ConversationChain( llm=llm, memory=memory ) # After multiple messages, memory contains: # - Summary of old conversation # - Recent raw messages response = chain.predict(input="What did we talk about?") # AI can reference both summary and recent context ``` **Integration with SQLite persistence:** ```python from langchain.memory import ConversationSummaryMemory from langchain_openai import ChatOpenAI class OpenAIBackendWithSummary(LLMBackend): def __init__(self, config: LLMConfig, api_key: str, history: ConversationHistory): self.config = config self.history = history # Existing SQLite history self._llm = ChatOpenAI( base_url=config.base_url, api_key=api_key, model=config.model ) # Per-user summaries (load from DB) self._user_summaries: dict[str, str] = {} self._window_size = 4 # Keep last 4 messages raw async def generate( self, messages: list[dict], system_prompt: str, user_id: str, max_tokens: int = 300, ) -> str: # Get full history from SQLite full_history = await self.history.get_history(user_id) if len(full_history) <= self._window_size * 2: # Small conversation, just use raw messages context_messages = messages else: # Large conversation: summarize old + keep recent old_messages = full_history[:-self._window_size * 2] recent_messages = full_history[-self._window_size * 2:] # Get or create summary summary = await self._get_summary(user_id, old_messages) # Build context: system + summary + recent messages context_messages = [ {"role": "system", "content": f"{system_prompt}\n\nConversation summary: {summary}"} ] context_messages.extend([ {"role": msg.role, "content": msg.content} for msg in recent_messages ]) # Generate response response = await self._client.chat.completions.create( model=self.config.model, messages=context_messages, max_tokens=max_tokens, temperature=0.7, ) return response.choices[0].message.content.strip() async def _get_summary(self, user_id: str, messages: list) -> str: """Summarize old messages using LLM.""" if user_id in self._user_summaries: return self._user_summaries[user_id] # Create summary prompt conversation_text = "\n".join([ f"{msg.role}: {msg.content}" for msg in messages ]) summary_prompt = f"""Summarize this conversation in 2-3 sentences, focusing on key topics and user preferences: {conversation_text} Summary:""" response = await self._client.chat.completions.create( model=self.config.model, messages=[{"role": "user", "content": summary_prompt}], max_tokens=150, temperature=0.3, ) summary = response.choices[0].message.content.strip() # Store in SQLite await self._store_summary(user_id, summary) self._user_summaries[user_id] = summary return summary async def _store_summary(self, user_id: str, summary: str): """Store summary in SQLite for persistence.""" # Add new table for summaries await self.history._db.execute(""" CREATE TABLE IF NOT EXISTS conversation_summaries ( user_id TEXT PRIMARY KEY, summary TEXT NOT NULL, updated_at REAL NOT NULL ) """) await self.history._db.execute(""" INSERT OR REPLACE INTO conversation_summaries (user_id, summary, updated_at) VALUES (?, ?, ?) """, (user_id, summary, time.time())) await self.history._db.commit() ``` **Pros:** - Best balance: compact summary + recent context - Significantly reduces token usage for long conversations - Works with existing OpenAI-compatible APIs - Preserves long-term context - Can persist summaries in SQLite **Cons:** - Costs extra tokens to generate summaries - Adds latency when summarizing - Need to decide when to re-summarize - Still requires LangChain **Verdict:** BEST LANGCHAIN OPTION for MeshAI - balances efficiency and context retention. --- ## 2. LlamaIndex ### Installation ```bash pip install llama-index llama-index-llms-openai ``` ### Chat Memory ```python from llama_index.core.memory import ChatMemoryBuffer from llama_index.llms.openai import OpenAI from llama_index.core.llms import ChatMessage # Initialize llm = OpenAI( api_base="http://192.168.1.239:8000/v1", api_key="your-key", model="gpt-4o-mini" ) # Create memory buffer memory = ChatMemoryBuffer.from_defaults(token_limit=1500) # Add messages memory.put(ChatMessage(role="user", content="Hello")) memory.put(ChatMessage(role="assistant", content="Hi there!")) # Get messages for LLM messages = memory.get() # Generate with context response = llm.chat(messages) ``` **Integration:** ```python from llama_index.core.memory import ChatMemoryBuffer from llama_index.llms.openai import OpenAI from llama_index.core.llms import ChatMessage class LlamaIndexBackend(LLMBackend): def __init__(self, config: LLMConfig, api_key: str): self.config = config self._llm = OpenAI( api_base=config.base_url, api_key=api_key, model=config.model ) # Per-user memory buffers self._user_memories: dict[str, ChatMemoryBuffer] = {} self._token_limit = 1500 def _get_memory(self, user_id: str) -> ChatMemoryBuffer: if user_id not in self._user_memories: self._user_memories[user_id] = ChatMemoryBuffer.from_defaults( token_limit=self._token_limit ) return self._user_memories[user_id] async def generate( self, messages: list[dict], system_prompt: str, user_id: str, max_tokens: int = 300, ) -> str: memory = self._get_memory(user_id) # Add new message to memory user_msg = messages[-1]["content"] memory.put(ChatMessage(role="user", content=user_msg)) # Get messages within token limit context_messages = memory.get() # Add system prompt full_messages = [ChatMessage(role="system", content=system_prompt)] full_messages.extend(context_messages) # Generate response = self._llm.chat(full_messages) # Store assistant response memory.put(ChatMessage(role="assistant", content=response.message.content)) return response.message.content ``` **Pros:** - Token-aware buffering (auto-prunes to stay under limit) - Simple API - Works with OpenAI-compatible backends - Better than manual message counting **Cons:** - In-memory only (need custom persistence) - Heavy dependency (~100MB) - Overkill for simple chat - Less mature than LangChain **Verdict:** Token limiting is nice, but not worth the dependency weight. --- ## 3. MemGPT / Letta (Self-Editing Memory) ### Installation ```bash pip install letta ``` ### Usage **What it does:** Agent manages its own memory, decides what to keep/forget/summarize. ```python from letta import create_client client = create_client() # Create agent with memory management agent = client.create_agent( name="meshai_agent", llm_config={ "model": "gpt-4o-mini", "model_endpoint": "http://192.168.1.239:8000/v1" }, embedding_config={ "embedding_endpoint_type": "openai", "embedding_model": "text-embedding-ada-002" } ) # Agent manages memory automatically response = client.send_message( agent_id=agent.id, message="What's the weather?", role="user" ) print(response.messages[-1].text) ``` **Architecture:** - Core memory: Persistent facts the agent always sees - Recall memory: Searchable vector store of past conversations - Archival memory: Long-term storage **Pros:** - Most sophisticated memory system - Agent decides what's important - Built-in vector search - Handles very long conversations **Cons:** - HEAVY (~200MB+ with dependencies) - Requires vector embeddings (extra API calls/costs) - Complex setup and learning curve - Overkill for 150-char mesh messages - Opinionated architecture (hard to integrate) **Verdict:** Way too heavy for MeshAI. Only worth it for complex, long-form agents. --- ## 4. Vector Stores (Semantic Memory) ### ChromaDB (Simplest) ```bash pip install chromadb ``` ```python import chromadb from chromadb.config import Settings # Initialize client = chromadb.Client(Settings( persist_directory="/path/to/meshai/memory", anonymized_telemetry=False )) # Create collection per user collection = client.get_or_create_collection( name=f"user_{user_id}", metadata={"user_id": user_id} ) # Add messages collection.add( documents=["What's the weather in Seattle?"], metadatas=[{"role": "user", "timestamp": time.time()}], ids=["msg_1"] ) # Semantic search for relevant past messages results = collection.query( query_texts=["weather"], n_results=3 ) # Use retrieved messages as context relevant_context = results['documents'][0] ``` **Integration:** ```python import chromadb from chromadb.config import Settings class VectorMemoryBackend(LLMBackend): def __init__(self, config: LLMConfig, api_key: str, db_path: str): self.config = config self._client = AsyncOpenAI( api_key=api_key, base_url=config.base_url, ) # ChromaDB for semantic memory self._chroma = chromadb.Client(Settings( persist_directory=db_path, anonymized_telemetry=False )) self._window_size = 4 # Keep last 4 messages raw def _get_collection(self, user_id: str): return self._chroma.get_or_create_collection( name=f"user_{user_id.replace('!', '_')}" # Sanitize ID ) async def generate( self, messages: list[dict], system_prompt: str, user_id: str, max_tokens: int = 300, ) -> str: collection = self._get_collection(user_id) # Get current query current_query = messages[-1]["content"] # Search for semantically similar past messages try: results = collection.query( query_texts=[current_query], n_results=3, where={"role": "assistant"} # Get past responses ) relevant_history = results['documents'][0] if results['documents'] else [] except: relevant_history = [] # Build context: system + relevant history + recent messages context = system_prompt if relevant_history: context += "\n\nRelevant past exchanges:\n" context += "\n".join(relevant_history[:2]) # Top 2 relevant context_messages = [{"role": "system", "content": context}] context_messages.extend(messages[-self._window_size*2:]) # Recent messages # Generate response = await self._client.chat.completions.create( model=self.config.model, messages=context_messages, max_tokens=max_tokens, temperature=0.7, ) reply = response.choices[0].message.content.strip() # Store in vector DB msg_id = f"{user_id}_{int(time.time()*1000)}" collection.add( documents=[f"User: {current_query}\nAssistant: {reply}"], metadatas=[{"role": "assistant", "timestamp": time.time()}], ids=[msg_id] ) return reply ``` **Pros:** - Semantic search - finds relevant past context - Works great for sparse conversations - Persistent storage - Lightweight (~20MB) - No extra API calls (uses local embeddings) **Cons:** - Adds dependency - Embedding computation overhead - May surface irrelevant "similar" messages - Overkill for very short conversations **Verdict:** Interesting for long-term memory, but maybe overkill for 150-char messages. --- ### Qdrant (Production Alternative) ```bash pip install qdrant-client ``` ```python from qdrant_client import QdrantClient from qdrant_client.models import Distance, VectorParams, PointStruct # Can run in-memory or with server client = QdrantClient(path="/path/to/meshai/qdrant") # Create collection client.create_collection( collection_name="meshai_memory", vectors_config=VectorParams(size=1536, distance=Distance.COSINE), ) # Store with embedding (from OpenAI or local model) client.upsert( collection_name="meshai_memory", points=[ PointStruct( id=msg_id, vector=embedding, # 1536-dim from text-embedding-ada-002 payload={"user_id": user_id, "content": content, "role": role} ) ] ) # Search results = client.search( collection_name="meshai_memory", query_vector=query_embedding, query_filter={"user_id": user_id}, limit=3 ) ``` **Pros:** - Production-ready, fast - Better than ChromaDB for scale - Rich filtering options - Can run in-memory or server mode **Cons:** - More complex than ChromaDB - Still requires embeddings - Heavier dependency **Verdict:** Better than ChromaDB for production, but still overkill for MeshAI's use case. --- ## 5. Simple Rolling Summary (RECOMMENDED) **The lightest, most practical approach for MeshAI.** ### Implementation ```python import asyncio import time from dataclasses import dataclass from typing import Optional from openai import AsyncOpenAI @dataclass class ConversationSummary: """Summary of conversation history.""" summary: str last_updated: float message_count: int class SimpleRollingSummary: """Lightweight rolling summary memory manager.""" def __init__( self, client: AsyncOpenAI, model: str, window_size: int = 4, # Recent messages to keep raw summarize_threshold: int = 10, # Messages before summarizing ): self._client = client self._model = model self._window_size = window_size self._summarize_threshold = summarize_threshold # Per-user summaries (would be in SQLite in production) self._summaries: dict[str, ConversationSummary] = {} async def get_context_messages( self, user_id: str, full_history: list[dict], # From SQLite ) -> list[dict]: """Get optimized context messages (summary + recent).""" # If conversation is short, just return it if len(full_history) <= self._window_size * 2: return full_history # Split into old and recent old_messages = full_history[:-self._window_size * 2] recent_messages = full_history[-self._window_size * 2:] # Get or create summary of old messages summary = await self._get_or_create_summary(user_id, old_messages) # Return summary as system message + recent raw messages context = [ {"role": "system", "content": f"Previous conversation summary: {summary.summary}"} ] context.extend(recent_messages) return context async def _get_or_create_summary( self, user_id: str, messages: list[dict], ) -> ConversationSummary: """Get existing summary or create new one.""" # Check if we have a recent summary if user_id in self._summaries: existing = self._summaries[user_id] # If summary covers roughly the same messages, reuse it if abs(existing.message_count - len(messages)) < self._summarize_threshold: return existing # Create new summary summary_text = await self._summarize(messages) summary = ConversationSummary( summary=summary_text, last_updated=time.time(), message_count=len(messages) ) self._summaries[user_id] = summary return summary async def _summarize(self, messages: list[dict]) -> str: """Summarize a list of messages using the LLM.""" # Format conversation conversation = "\n".join([ f"{msg['role'].upper()}: {msg['content']}" for msg in messages ]) prompt = f"""Summarize this conversation in 2-3 concise sentences. Focus on: - Main topics discussed - Any important user preferences or context - Key information that should be remembered Conversation: {conversation} Summary (2-3 sentences):""" try: response = await self._client.chat.completions.create( model=self._model, messages=[{"role": "user", "content": prompt}], max_tokens=150, temperature=0.3, ) return response.choices[0].message.content.strip() except Exception as e: # Fallback: simple truncation if summarization fails return f"Previous conversation covered {len(messages)} messages." ``` ### Integration with MeshAI ```python # In meshai/backends/openai_backend.py class OpenAIBackend(LLMBackend): """OpenAI-compatible backend with rolling summary memory.""" def __init__(self, config: LLMConfig, api_key: str): self.config = config self._client = AsyncOpenAI( api_key=api_key, base_url=config.base_url, ) # Add rolling summary manager self._memory = SimpleRollingSummary( client=self._client, model=config.model, window_size=4, # Keep last 4 exchanges (8 messages) summarize_threshold=10, # Summarize after 10 messages ) async def generate( self, messages: list[dict], system_prompt: str, user_id: str, # NEW: need user_id max_tokens: int = 300, ) -> str: """Generate with optimized context.""" # Get optimized context (summary + recent) context_messages = await self._memory.get_context_messages( user_id=user_id, full_history=messages, ) # Add system prompt full_messages = [{"role": "system", "content": system_prompt}] full_messages.extend(context_messages) # Generate response = await self._client.chat.completions.create( model=self.config.model, messages=full_messages, max_tokens=max_tokens, temperature=0.7, ) return response.choices[0].message.content.strip() ``` ### Persist Summaries in SQLite ```python # Add to meshai/history.py async def store_summary(self, user_id: str, summary: str, message_count: int) -> None: """Store conversation summary.""" if not self._db: raise RuntimeError("Database not initialized") async with self._lock: await self._db.execute(""" CREATE TABLE IF NOT EXISTS conversation_summaries ( user_id TEXT PRIMARY KEY, summary TEXT NOT NULL, message_count INTEGER NOT NULL, updated_at REAL NOT NULL ) """) await self._db.execute(""" INSERT OR REPLACE INTO conversation_summaries (user_id, summary, message_count, updated_at) VALUES (?, ?, ?, ?) """, (user_id, summary, message_count, time.time())) await self._db.commit() async def get_summary(self, user_id: str) -> Optional[ConversationSummary]: """Retrieve conversation summary.""" if not self._db: raise RuntimeError("Database not initialized") async with self._lock: cursor = await self._db.execute(""" SELECT summary, message_count, updated_at FROM conversation_summaries WHERE user_id = ? """, (user_id,)) row = await cursor.fetchone() if not row: return None return ConversationSummary( summary=row[0], message_count=row[1], last_updated=row[2] ) ``` **Pros:** - NO external dependencies - Works with existing SQLite storage - Significantly reduces token usage - Simple to understand and maintain - Preserves recent context + summarized history - Configurable window and threshold **Cons:** - Costs tokens to generate summaries - Slight latency when summarizing - Need to tune window/threshold params **Verdict:** BEST OPTION for MeshAI - simple, effective, no dependencies. --- ## Comparison Matrix | Approach | Dependencies | Complexity | Token Savings | Persistence | OpenAI-Compatible | |----------|-------------|------------|---------------|-------------|-------------------| | **LangChain BufferMemory** | langchain (~50MB) | Low | None | No | Yes | | **LangChain WindowMemory** | langchain (~50MB) | Low | Medium | No | Yes | | **LangChain SummaryMemory** | langchain (~50MB) | Medium | High | No (DIY) | Yes | | **LlamaIndex** | llama-index (~100MB) | Medium | Medium | No (DIY) | Yes | | **MemGPT/Letta** | letta (~200MB) | Very High | Very High | Yes | Yes (complex) | | **ChromaDB** | chromadb (~20MB) | Medium | Medium | Yes | Yes | | **Qdrant** | qdrant (~30MB) | High | Medium | Yes | Yes | | **Rolling Summary (DIY)** | None | Low | High | Yes (SQLite) | Yes | --- ## RECOMMENDATION **Use Simple Rolling Summary (Option 5)** for MeshAI because: 1. **Zero dependencies** - No LangChain, LlamaIndex, or vector stores 2. **Works with current stack** - Uses existing AsyncOpenAI client and SQLite 3. **Significant efficiency gains** - Keeps last 4-6 exchanges + summary of older messages 4. **Persistent** - Summaries stored in SQLite, survive restarts 5. **Simple to tune** - Two params: `window_size` and `summarize_threshold` 6. **OpenAI-compatible** - Works with LiteLLM, local models, anything 7. **Lightweight** - ~100 lines of code ### Implementation Steps 1. Add `SimpleRollingSummary` class (shown above) 2. Add summary table to SQLite schema 3. Modify `OpenAIBackend.generate()` to use `_memory.get_context_messages()` 4. Add summary storage methods to `ConversationHistory` 5. Configure: `window_size=4` (8 messages), `summarize_threshold=10` ### Expected Performance **Before (full history):** - 20 message pairs = ~3000 tokens sent every request - Latency: higher, costs more **After (rolling summary):** - Summary (~100 tokens) + 4 recent pairs (~400 tokens) = ~500 tokens - **83% token reduction** for long conversations - Faster responses, lower costs ### When to Consider Alternatives - **Vector stores (ChromaDB)**: If you need semantic search across users or topics - **LangChain SummaryMemory**: If you want a batteries-included solution (accept dependency) - **MemGPT**: If conversations become complex multi-day dialogues (they won't on mesh) --- ## Example Usage ```python # Initialize backend = OpenAIBackend(config, api_key) # First few messages - full history sent await backend.generate( messages=[ {"role": "user", "content": "What's the weather?"}, {"role": "assistant", "content": "It's sunny!"}, {"role": "user", "content": "Should I bring an umbrella?"}, {"role": "assistant", "content": "No need, it's clear!"}, # ... 6 more exchanges ... ], system_prompt="You are a helpful assistant.", user_id="!abc123", ) # After 10+ messages - summary + recent sent # Context sent to LLM: # [ # {"role": "system", "content": "Previous conversation summary: User asked about weather and outdoor activities. Confirmed sunny weather, no rain expected."}, # {"role": "user", "content": "Should I bring an umbrella?"}, # {"role": "assistant", "content": "No need, it's clear!"}, # ... (last 4 exchanges) # ] ``` --- ## Code Files to Modify 1. **`meshai/memory.py`** (NEW) - Add `SimpleRollingSummary` class 2. **`meshai/history.py`** - Add summary storage methods + table schema 3. **`meshai/backends/openai_backend.py`** - Integrate memory manager 4. **`meshai/responder.py`** - Pass `user_id` to backend.generate() 5. **`meshai/config.py`** - Add config for window_size, summarize_threshold Let me know if you want me to implement this!