# LLM Conversation Memory Research for MeshAI

## Current Implementation Analysis

**Current approach:** MeshAI stuffs the full conversation history into every LLM API call.

- Storage: SQLite via aiosqlite
- Retrieval: `get_history_for_llm()` returns all messages (up to `max_messages_per_user * 2`)
- Backend: OpenAI-compatible API (works with LiteLLM, local models)
- Context: 150 char max per message, per-user conversations

**Problem:** This is inefficient. It sends the entire history even when unnecessary, which wastes tokens and adds latency.

---

## 1. LangChain Memory Modules

### Installation

```bash
pip install langchain langchain-community langchain-openai
```

### A. ConversationBufferMemory (Simplest)

**What it does:** Stores raw messages in memory and returns all of them on every call.

```python
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

# Initialize
llm = ChatOpenAI(
    base_url="http://192.168.1.239:8000/v1",  # LiteLLM
    api_key="your-key",
    model="gpt-4o-mini"
)

memory = ConversationBufferMemory()

chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=False
)

# Use it
response = chain.predict(input="What's the weather?")
print(response)

# Access history
print(memory.load_memory_variables({}))
# {'history': "Human: What's the weather?\nAI: ..."}
```

**Integration with MeshAI:**
```python
# In meshai/backends/openai_backend.py
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

class OpenAIBackendWithMemory(LLMBackend):
    def __init__(self, config: LLMConfig, api_key: str):
        self.config = config
        self._llm = ChatOpenAI(
            base_url=config.base_url,
            api_key=api_key,
            model=config.model,
            temperature=0.7,
            max_tokens=300
        )
        # Per-user memory storage
        self._user_memories: dict[str, ConversationBufferMemory] = {}

    def _get_memory(self, user_id: str) -> ConversationBufferMemory:
        if user_id not in self._user_memories:
            self._user_memories[user_id] = ConversationBufferMemory()
        return self._user_memories[user_id]

    async def generate(
        self,
        messages: list[dict],
        system_prompt: str,
        user_id: str,  # NEW: need user_id for memory
        max_tokens: int = 300,
    ) -> str:
        # Note: system_prompt and max_tokens are not applied in this sketch.
        # ConversationChain uses its own default prompt, and the token limit
        # is the max_tokens value set on ChatOpenAI above.
        memory = self._get_memory(user_id)

        # Create chain with memory
        chain = ConversationChain(
            llm=self._llm,
            memory=memory,
            verbose=False
        )

        # Extract last user message (earlier history comes from memory)
        last_msg = messages[-1]["content"]

        # Generate with memory
        response = await chain.apredict(input=last_msg)
        return response.strip()
```

**Pros:**
- Dead simple, drop-in replacement
- Works with any OpenAI-compatible API
- No external services required
- LangChain handles message formatting

**Cons:**
- Still sends full history (no real efficiency gain)
- Stores everything in RAM (lost on restart)
- Need to manage per-user memory dicts
- Adds LangChain dependency (~50MB)

**Verdict:** Not worth it - adds complexity without solving the core problem.

---

### B. ConversationBufferWindowMemory (Better)

**What it does:** Only keeps the last N exchanges in context.

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

# Keep only last 5 interactions (10 messages = 5 pairs)
memory = ConversationBufferWindowMemory(k=5)

# `llm` is the ChatOpenAI instance created in the previous example
chain = ConversationChain(
    llm=llm,
    memory=memory
)

# Only last 5 exchanges sent to LLM
response = chain.predict(input="Hello")
```

**Integration:**
```python
class OpenAIBackendWithWindow(LLMBackend):
    def __init__(self, config: LLMConfig, api_key: str):
        self.config = config
        self._llm = ChatOpenAI(
            base_url=config.base_url,
            api_key=api_key,
            model=config.model
        )
        # Per-user windowed memory
        self._user_memories: dict[str, ConversationBufferWindowMemory] = {}
        self._window_size = 5  # Last 5 exchanges

    def _get_memory(self, user_id: str) -> ConversationBufferWindowMemory:
        if user_id not in self._user_memories:
            self._user_memories[user_id] = ConversationBufferWindowMemory(
                k=self._window_size
            )
        return self._user_memories[user_id]
```

**Pros:**
- Simple sliding window approach
- Reduces token usage automatically
- Works with any OpenAI-compatible API
- Configurable window size

**Cons:**
- Still in-memory only (lost on restart)
- Forgets old context completely
- Need to integrate with existing SQLite storage
- Adds LangChain dependency

**Verdict:** Better than full buffer, but loses long-term context.

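For comparison, the same sliding-window behaviour can be sketched without LangChain by slicing the history MeshAI already loads from SQLite. This is a minimal illustration, not MeshAI's code: the helper name `last_exchanges` and the message dict shape (`role`/`content`) are assumptions. The `k <= 0` guard matters because `history[-0:]` would return the whole list rather than an empty one.

```python
def last_exchanges(history: list[dict], k: int = 5) -> list[dict]:
    """Return the most recent k user/assistant exchanges (2 * k messages)."""
    if k <= 0:
        return []
    return history[-2 * k:]


# Hypothetical history of 14 alternating user/assistant messages
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(14)
]

window = last_exchanges(history, k=5)
print(len(window))           # 10
print(window[0]["content"])  # msg 4
```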
---

### C. ConversationSummaryMemory (Most Interesting)

**What it does:** Uses the LLM to maintain a running summary of the conversation and sends that summary instead of the raw history. (The LangChain class that keeps a summary plus recent raw messages is `ConversationSummaryBufferMemory`.)

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryMemory

# `llm` is the ChatOpenAI instance created earlier; it is also used
# to generate the summaries
memory = ConversationSummaryMemory(llm=llm)

chain = ConversationChain(
    llm=llm,
    memory=memory
)

# After multiple messages, memory contains a running summary of the
# conversation so far (no raw messages are kept by this class)
response = chain.predict(input="What did we talk about?")
# AI can reference the summary as context
```

**Integration with SQLite persistence:**
```python
import time

from openai import AsyncOpenAI

class OpenAIBackendWithSummary(LLMBackend):
    def __init__(self, config: LLMConfig, api_key: str, history: ConversationHistory):
        self.config = config
        self.history = history  # Existing SQLite history

        # Hand-rolled variant of the summary idea: summary + recent raw
        # messages, calling the OpenAI-compatible API directly
        self._client = AsyncOpenAI(
            base_url=config.base_url,
            api_key=api_key,
        )

        # Per-user summaries (load from DB)
        self._user_summaries: dict[str, str] = {}
        self._window_size = 4  # Keep last 4 exchanges (8 messages) raw

    async def generate(
        self,
        messages: list[dict],
        system_prompt: str,
        user_id: str,
        max_tokens: int = 300,
    ) -> str:
        # Get full history from SQLite
        full_history = await self.history.get_history(user_id)

        if len(full_history) <= self._window_size * 2:
            # Small conversation, just use raw messages
            context_messages = [
                {"role": "system", "content": system_prompt},
                *messages,
            ]
        else:
            # Large conversation: summarize old + keep recent
            old_messages = full_history[:-self._window_size * 2]
            recent_messages = full_history[-self._window_size * 2:]

            # Get or create summary
            summary = await self._get_summary(user_id, old_messages)

            # Build context: system + summary + recent messages
            context_messages = [
                {"role": "system", "content": f"{system_prompt}\n\nConversation summary: {summary}"}
            ]
            context_messages.extend([
                {"role": msg.role, "content": msg.content}
                for msg in recent_messages
            ])

        # Generate response
        response = await self._client.chat.completions.create(
            model=self.config.model,
            messages=context_messages,
            max_tokens=max_tokens,
            temperature=0.7,
        )

        return response.choices[0].message.content.strip()

    async def _get_summary(self, user_id: str, messages: list) -> str:
        """Summarize old messages using LLM."""
        # Note: a cached summary is reused as-is; this sketch never
        # refreshes it as the list of old messages grows
        if user_id in self._user_summaries:
            return self._user_summaries[user_id]

        # Create summary prompt
        conversation_text = "\n".join([
            f"{msg.role}: {msg.content}" for msg in messages
        ])

        summary_prompt = f"""Summarize this conversation in 2-3 sentences, focusing on key topics and user preferences:

{conversation_text}

Summary:"""

        response = await self._client.chat.completions.create(
            model=self.config.model,
            messages=[{"role": "user", "content": summary_prompt}],
            max_tokens=150,
            temperature=0.3,
        )

        summary = response.choices[0].message.content.strip()

        # Store in SQLite
        await self._store_summary(user_id, summary)
        self._user_summaries[user_id] = summary

        return summary

    async def _store_summary(self, user_id: str, summary: str):
        """Store summary in SQLite for persistence."""
        # Add new table for summaries
        await self.history._db.execute("""
            CREATE TABLE IF NOT EXISTS conversation_summaries (
                user_id TEXT PRIMARY KEY,
                summary TEXT NOT NULL,
                updated_at REAL NOT NULL
            )
        """)

        await self.history._db.execute("""
            INSERT OR REPLACE INTO conversation_summaries (user_id, summary, updated_at)
            VALUES (?, ?, ?)
        """, (user_id, summary, time.time()))

        await self.history._db.commit()
```

**Pros:**
- Best balance: compact summary + recent context
- Significantly reduces token usage for long conversations
- Works with existing OpenAI-compatible APIs
- Preserves long-term context
- Can persist summaries in SQLite

**Cons:**
- Costs extra tokens to generate summaries
- Adds latency when summarizing
- Need to decide when to re-summarize
- Requires LangChain when using the library's memory classes

**Verdict:** BEST LANGCHAIN OPTION for MeshAI - balances efficiency and context retention.

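The "when to re-summarize" question can be reduced to a small policy function. The sketch below is one possible rule, not something taken from LangChain or MeshAI: `SummaryState`, `needs_resummarize`, and the default threshold of 10 are all hypothetical. It refreshes the summary only once enough new messages have aged out of the raw window since the summary was last built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryState:
    """Hypothetical record of what an existing summary covers."""
    summary: str
    covered_count: int  # number of old messages the summary was built from


def needs_resummarize(
    state: Optional[SummaryState],
    old_count: int,
    threshold: int = 10,
) -> bool:
    """Decide whether the summary should be regenerated.

    old_count is the number of messages currently outside the raw window.
    """
    if state is None:
        # No summary yet: summarize as soon as anything has aged out
        return old_count > 0
    # Refresh only when enough unsummarized messages have accumulated
    return old_count - state.covered_count >= threshold


state = SummaryState(summary="User asked about weather.", covered_count=12)
print(needs_resummarize(state, old_count=15))  # False
print(needs_resummarize(state, old_count=22))  # True
```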
---

## 2. LlamaIndex

### Installation

```bash
pip install llama-index llama-index-llms-openai
```

### Chat Memory

```python
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

# Initialize
llm = OpenAI(
    api_base="http://192.168.1.239:8000/v1",
    api_key="your-key",
    model="gpt-4o-mini"
)

# Create memory buffer
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

# Add messages
memory.put(ChatMessage(role="user", content="Hello"))
memory.put(ChatMessage(role="assistant", content="Hi there!"))

# Get messages for LLM
messages = memory.get()

# Generate with context
response = llm.chat(messages)
```

**Integration:**
```python
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

class LlamaIndexBackend(LLMBackend):
    def __init__(self, config: LLMConfig, api_key: str):
        self.config = config
        self._llm = OpenAI(
            api_base=config.base_url,
            api_key=api_key,
            model=config.model
        )

        # Per-user memory buffers
        self._user_memories: dict[str, ChatMemoryBuffer] = {}
        self._token_limit = 1500

    def _get_memory(self, user_id: str) -> ChatMemoryBuffer:
        if user_id not in self._user_memories:
            self._user_memories[user_id] = ChatMemoryBuffer.from_defaults(
                token_limit=self._token_limit
            )
        return self._user_memories[user_id]

    async def generate(
        self,
        messages: list[dict],
        system_prompt: str,
        user_id: str,
        max_tokens: int = 300,  # Not applied in this sketch
    ) -> str:
        memory = self._get_memory(user_id)

        # Add new message to memory
        user_msg = messages[-1]["content"]
        memory.put(ChatMessage(role="user", content=user_msg))

        # Get messages within token limit
        context_messages = memory.get()

        # Add system prompt
        full_messages = [ChatMessage(role="system", content=system_prompt)]
        full_messages.extend(context_messages)

        # Generate (async variant, so the event loop is not blocked)
        response = await self._llm.achat(full_messages)

        # Store assistant response
        memory.put(ChatMessage(role="assistant", content=response.message.content))

        return response.message.content
```

**Pros:**
- Token-aware buffering (auto-prunes to stay under limit)
- Simple API
- Works with OpenAI-compatible backends
- Better than manual message counting

**Cons:**
- In-memory only (need custom persistence)
- Heavy dependency (~100MB)
- Overkill for simple chat
- Less mature than LangChain

**Verdict:** Token limiting is nice, but not worth the dependency weight.

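The token-limiting idea itself is small enough to sketch without the dependency. The example below is an approximation, not LlamaIndex's implementation: it assumes a crude "about 4 characters per token" estimate instead of a real tokenizer, and the names `estimate_tokens` and `trim_to_budget` are hypothetical. It walks the history from newest to oldest and stops at the first message that would exceed the budget, so the kept messages are always a contiguous recent tail.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate (assumption): ~4 characters per token, minimum 1."""
    return max(1, len(text) // 4)


def trim_to_budget(history: list[dict], token_limit: int) -> list[dict]:
    """Keep the newest messages whose estimated tokens fit within token_limit."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(history):
        cost = estimate_tokens(msg["content"])
        if used + cost > token_limit:
            break  # Stop at the first message that does not fit
        kept.append(msg)
        used += cost
    kept.reverse()  # Restore chronological order
    return kept


# Five messages of 40 characters each, so roughly 10 tokens apiece
history = [{"role": "user", "content": "x" * 40} for _ in range(5)]
print(len(trim_to_budget(history, token_limit=25)))  # 2
```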
---

## 3. MemGPT / Letta (Self-Editing Memory)

### Installation

```bash
pip install letta
```

### Usage

**What it does:** The agent manages its own memory and decides what to keep, forget, or summarize.

```python
# Note: the Letta client API has changed between releases;
# check the current Letta documentation before relying on these calls
from letta import create_client

client = create_client()

# Create agent with memory management
agent = client.create_agent(
    name="meshai_agent",
    llm_config={
        "model": "gpt-4o-mini",
        "model_endpoint": "http://192.168.1.239:8000/v1"
    },
    embedding_config={
        "embedding_endpoint_type": "openai",
        "embedding_model": "text-embedding-ada-002"
    }
)

# Agent manages memory automatically
response = client.send_message(
    agent_id=agent.id,
    message="What's the weather?",
    role="user"
)

print(response.messages[-1].text)
```

**Architecture:**
- Core memory: Persistent facts the agent always sees
- Recall memory: Searchable vector store of past conversations
- Archival memory: Long-term storage

**Pros:**
- Most sophisticated memory system
- Agent decides what's important
- Built-in vector search
- Handles very long conversations

**Cons:**
- HEAVY (~200MB+ with dependencies)
- Requires vector embeddings (extra API calls/costs)
- Complex setup and learning curve
- Overkill for 150-char mesh messages
- Opinionated architecture (hard to integrate)

**Verdict:** Way too heavy for MeshAI. Only worth it for complex, long-form agents.

---

## 4. Vector Stores (Semantic Memory)

### ChromaDB (Simplest)

```bash
pip install chromadb
```

```python
import time

import chromadb
from chromadb.config import Settings

# Initialize (PersistentClient stores data on disk at `path`)
client = chromadb.PersistentClient(
    path="/path/to/meshai/memory",
    settings=Settings(anonymized_telemetry=False)
)

user_id = "abc123"  # Placeholder user ID

# Create collection per user
collection = client.get_or_create_collection(
    name=f"user_{user_id}",
    metadata={"user_id": user_id}
)

# Add messages
collection.add(
    documents=["What's the weather in Seattle?"],
    metadatas=[{"role": "user", "timestamp": time.time()}],
    ids=["msg_1"]
)

# Semantic search for relevant past messages
results = collection.query(
    query_texts=["weather"],
    n_results=3
)

# Use retrieved messages as context
relevant_context = results['documents'][0]
```

**Integration:**
```python
import time

import chromadb
from chromadb.config import Settings
from openai import AsyncOpenAI

class VectorMemoryBackend(LLMBackend):
    def __init__(self, config: LLMConfig, api_key: str, db_path: str):
        self.config = config
        self._client = AsyncOpenAI(
            api_key=api_key,
            base_url=config.base_url,
        )

        # ChromaDB for semantic memory (persisted on disk at db_path)
        self._chroma = chromadb.PersistentClient(
            path=db_path,
            settings=Settings(anonymized_telemetry=False)
        )

        self._window_size = 4  # Keep last 4 exchanges (8 messages) raw

    def _get_collection(self, user_id: str):
        return self._chroma.get_or_create_collection(
            name=f"user_{user_id.replace('!', '_')}"  # Sanitize ID
        )

    async def generate(
        self,
        messages: list[dict],
        system_prompt: str,
        user_id: str,
        max_tokens: int = 300,
    ) -> str:
        collection = self._get_collection(user_id)

        # Get current query
        current_query = messages[-1]["content"]

        # Search for semantically similar past messages
        try:
            results = collection.query(
                query_texts=[current_query],
                n_results=3,
                where={"role": "assistant"}  # Get past responses
            )
            relevant_history = results['documents'][0] if results['documents'] else []
        except Exception:
            # Treat any lookup failure as "no relevant history"
            relevant_history = []

        # Build context: system + relevant history + recent messages
        context = system_prompt
        if relevant_history:
            context += "\n\nRelevant past exchanges:\n"
            context += "\n".join(relevant_history[:2])  # Top 2 relevant

        context_messages = [{"role": "system", "content": context}]
        context_messages.extend(messages[-self._window_size*2:])  # Recent messages

        # Generate
        response = await self._client.chat.completions.create(
            model=self.config.model,
            messages=context_messages,
            max_tokens=max_tokens,
            temperature=0.7,
        )

        reply = response.choices[0].message.content.strip()

        # Store the exchange in the vector DB
        msg_id = f"{user_id}_{int(time.time()*1000)}"
        collection.add(
            documents=[f"User: {current_query}\nAssistant: {reply}"],
            metadatas=[{"role": "assistant", "timestamp": time.time()}],
            ids=[msg_id]
        )

        return reply
```

**Pros:**
- Semantic search - finds relevant past context
- Works great for sparse conversations
- Persistent storage
- Lightweight (~20MB)
- No extra API calls (uses local embeddings by default)

**Cons:**
- Adds dependency
- Embedding computation overhead
- May surface irrelevant "similar" messages
- Overkill for very short conversations

**Verdict:** Interesting for long-term memory, but probably overkill for 150-char messages.

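One detail worth sketching is the collection name. Chroma restricts collection names (length limits and a restricted character set, per its documentation, and the exact rules may differ between versions), while mesh node IDs such as `!abc123` contain characters outside that set. The helper below is a conservative assumption rather than Chroma's own validation: `collection_name_for` is a hypothetical name, and it keeps only letters, digits, underscores and hyphens, caps the length at 63, and makes sure the name ends with an alphanumeric character. Note that different IDs can map to the same name (`!abc` and `_abc`), so a real implementation may want to append a hash.

```python
import re


def collection_name_for(user_id: str) -> str:
    """Build a conservative collection name from a mesh user ID."""
    # Replace anything outside a safe character set with an underscore
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", user_id)
    # Prefix guarantees an alphanumeric first character; cap the length
    name = f"user_{cleaned}"[:63]
    # Ensure the name also ends with an alphanumeric character
    return name.rstrip("_-")


print(collection_name_for("!abc123"))  # user__abc123
```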
---

### Qdrant (Production Alternative)

```bash
pip install qdrant-client
```

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

# Local on-disk mode; QdrantClient(":memory:") runs in-memory,
# QdrantClient(url=...) connects to a server
client = QdrantClient(path="/path/to/meshai/qdrant")

# Create collection
client.create_collection(
    collection_name="meshai_memory",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# msg_id, embedding, query_embedding, user_id, content and role are
# placeholders supplied by the caller. Point IDs must be unsigned
# integers or UUIDs.

# Store with embedding (from OpenAI or local model)
client.upsert(
    collection_name="meshai_memory",
    points=[
        PointStruct(
            id=msg_id,
            vector=embedding,  # 1536-dim from text-embedding-ada-002
            payload={"user_id": user_id, "content": content, "role": role}
        )
    ]
)

# Search, restricted to this user's messages
results = client.search(
    collection_name="meshai_memory",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
    ),
    limit=3
)
```

**Pros:**
- Production-ready, fast
- Better than ChromaDB for scale
- Rich filtering options
- Can run in-memory or server mode

**Cons:**
- More complex than ChromaDB
- Still requires embeddings
- Heavier dependency

**Verdict:** Better than ChromaDB for production, but still overkill for MeshAI's use case.

---

## 5. Simple Rolling Summary (RECOMMENDED)

**The lightest, most practical approach for MeshAI.**

### Implementation

```python
import time
from dataclasses import dataclass

from openai import AsyncOpenAI

@dataclass
class ConversationSummary:
    """Summary of conversation history."""
    summary: str
    last_updated: float
    message_count: int

class SimpleRollingSummary:
    """Lightweight rolling summary memory manager."""

    def __init__(
        self,
        client: AsyncOpenAI,
        model: str,
        window_size: int = 4,  # Recent exchanges (user + assistant pairs) to keep raw
        summarize_threshold: int = 10,  # New old messages before re-summarizing
    ):
        self._client = client
        self._model = model
        self._window_size = window_size
        self._summarize_threshold = summarize_threshold

        # Per-user summaries (would be in SQLite in production)
        self._summaries: dict[str, ConversationSummary] = {}

    async def get_context_messages(
        self,
        user_id: str,
        full_history: list[dict],  # From SQLite
    ) -> list[dict]:
        """Get optimized context messages (summary + recent)."""

        # If conversation is short, just return it
        if len(full_history) <= self._window_size * 2:
            return full_history

        # Split into old and recent
        old_messages = full_history[:-self._window_size * 2]
        recent_messages = full_history[-self._window_size * 2:]

        # Get or create summary of old messages
        summary = await self._get_or_create_summary(user_id, old_messages)

        # Return summary as system message + recent raw messages
        context = [
            {"role": "system", "content": f"Previous conversation summary: {summary.summary}"}
        ]
        context.extend(recent_messages)

        return context

    async def _get_or_create_summary(
        self,
        user_id: str,
        messages: list[dict],
    ) -> ConversationSummary:
        """Get existing summary or create new one."""

        # Check if we have a recent summary
        if user_id in self._summaries:
            existing = self._summaries[user_id]

            # If summary covers roughly the same messages, reuse it
            if abs(existing.message_count - len(messages)) < self._summarize_threshold:
                return existing

        # Create new summary
        summary_text = await self._summarize(messages)

        summary = ConversationSummary(
            summary=summary_text,
            last_updated=time.time(),
            message_count=len(messages)
        )

        self._summaries[user_id] = summary
        return summary

    async def _summarize(self, messages: list[dict]) -> str:
        """Summarize a list of messages using the LLM."""

        # Format conversation
        conversation = "\n".join([
            f"{msg['role'].upper()}: {msg['content']}"
            for msg in messages
        ])

        prompt = f"""Summarize this conversation in 2-3 concise sentences. Focus on:
- Main topics discussed
- Any important user preferences or context
- Key information that should be remembered

Conversation:
{conversation}

Summary (2-3 sentences):"""

        try:
            response = await self._client.chat.completions.create(
                model=self._model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=150,
                temperature=0.3,
            )

            return response.choices[0].message.content.strip()

        except Exception:
            # Fallback: generic placeholder if summarization fails
            return f"Previous conversation covered {len(messages)} messages."
```

### Integration with MeshAI

```python
# In meshai/backends/openai_backend.py

class OpenAIBackend(LLMBackend):
    """OpenAI-compatible backend with rolling summary memory."""

    def __init__(self, config: LLMConfig, api_key: str):
        self.config = config
        self._client = AsyncOpenAI(
            api_key=api_key,
            base_url=config.base_url,
        )

        # Add rolling summary manager
        self._memory = SimpleRollingSummary(
            client=self._client,
            model=config.model,
            window_size=4,  # Keep last 4 exchanges (8 messages)
            summarize_threshold=10,  # Summarize after 10 messages
        )

    async def generate(
        self,
        messages: list[dict],
        system_prompt: str,
        user_id: str,  # NEW: need user_id
        max_tokens: int = 300,
    ) -> str:
        """Generate with optimized context."""

        # Get optimized context (summary + recent)
        context_messages = await self._memory.get_context_messages(
            user_id=user_id,
            full_history=messages,
        )

        # Add system prompt
        full_messages = [{"role": "system", "content": system_prompt}]
        full_messages.extend(context_messages)

        # Generate
        response = await self._client.chat.completions.create(
            model=self.config.model,
            messages=full_messages,
            max_tokens=max_tokens,
            temperature=0.7,
        )

        return response.choices[0].message.content.strip()
```

### Persist Summaries in SQLite

```python
# Add to meshai/history.py (methods of ConversationHistory)
# Requires: import time, from typing import Optional,
# and the ConversationSummary dataclass shown above

async def store_summary(self, user_id: str, summary: str, message_count: int) -> None:
    """Store conversation summary."""
    if not self._db:
        raise RuntimeError("Database not initialized")

    async with self._lock:
        await self._db.execute("""
            CREATE TABLE IF NOT EXISTS conversation_summaries (
                user_id TEXT PRIMARY KEY,
                summary TEXT NOT NULL,
                message_count INTEGER NOT NULL,
                updated_at REAL NOT NULL
            )
        """)

        await self._db.execute("""
            INSERT OR REPLACE INTO conversation_summaries
            (user_id, summary, message_count, updated_at)
            VALUES (?, ?, ?, ?)
        """, (user_id, summary, message_count, time.time()))

        await self._db.commit()

async def get_summary(self, user_id: str) -> Optional[ConversationSummary]:
    """Retrieve conversation summary."""
    if not self._db:
        raise RuntimeError("Database not initialized")

    async with self._lock:
        cursor = await self._db.execute("""
            SELECT summary, message_count, updated_at
            FROM conversation_summaries
            WHERE user_id = ?
        """, (user_id,))

        row = await cursor.fetchone()

        if not row:
            return None

        return ConversationSummary(
            summary=row[0],
            message_count=row[1],
            last_updated=row[2]
        )
```

**Pros:**
- No new external dependencies (uses the existing OpenAI client and SQLite)
- Works with existing SQLite storage
- Significantly reduces token usage
- Simple to understand and maintain
- Preserves recent context + summarized history
- Configurable window and threshold

**Cons:**
- Costs tokens to generate summaries
- Slight latency when summarizing
- Need to tune window/threshold params

**Verdict:** BEST OPTION for MeshAI - simple, effective, no new dependencies.

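One way to limit the summarization cost noted above is to fold only the newly aged-out messages into the existing summary, rather than re-summarizing every old message each time. The sketch below only builds the prompt for that step and is an illustration under assumptions: `build_incremental_prompt` is a hypothetical helper, and the prompt wording is not from MeshAI. The resulting string would be sent as the single user message in the same kind of `chat.completions.create` call used by `_summarize`.

```python
from typing import Optional


def build_incremental_prompt(
    previous_summary: Optional[str],
    new_messages: list[dict],
) -> str:
    """Build a prompt that updates an existing summary with new messages."""
    lines = [f"{m['role'].upper()}: {m['content']}" for m in new_messages]
    parts = []
    if previous_summary:
        # Carry the old summary forward instead of the old raw messages
        parts.append(f"Existing summary:\n{previous_summary}")
    parts.append("New messages:\n" + "\n".join(lines))
    parts.append(
        "Update the summary in 2-3 concise sentences, "
        "keeping user preferences and key facts."
    )
    return "\n\n".join(parts)


prompt = build_incremental_prompt(
    "User asked about weather in Seattle.",
    [{"role": "user", "content": "Should I bring an umbrella?"}],
)
print(prompt)
```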
---

## Comparison Matrix

| Approach | Dependencies | Complexity | Token Savings | Persistence | OpenAI-Compatible |
|----------|-------------|------------|---------------|-------------|-------------------|
| **LangChain BufferMemory** | langchain (~50MB) | Low | None | No | Yes |
| **LangChain WindowMemory** | langchain (~50MB) | Low | Medium | No | Yes |
| **LangChain SummaryMemory** | langchain (~50MB) | Medium | High | No (DIY) | Yes |
| **LlamaIndex** | llama-index (~100MB) | Medium | Medium | No (DIY) | Yes |
| **MemGPT/Letta** | letta (~200MB) | Very High | Very High | Yes | Yes (complex) |
| **ChromaDB** | chromadb (~20MB) | Medium | Medium | Yes | Yes |
| **Qdrant** | qdrant (~30MB) | High | Medium | Yes | Yes |
| **Rolling Summary (DIY)** | None | Low | High | Yes (SQLite) | Yes |

---

## RECOMMENDATION

**Use Simple Rolling Summary (Option 5)** for MeshAI because:

1. **Zero new dependencies** - No LangChain, LlamaIndex, or vector stores
2. **Works with current stack** - Uses existing AsyncOpenAI client and SQLite
3. **Significant efficiency gains** - Keeps last 4-6 exchanges + summary of older messages
4. **Persistent** - Summaries stored in SQLite, survive restarts
5. **Simple to tune** - Two params: `window_size` and `summarize_threshold`
6. **OpenAI-compatible** - Works with LiteLLM, local models, anything
7. **Lightweight** - ~100 lines of code

### Implementation Steps

1. Add `SimpleRollingSummary` class (shown above)
2. Add summary table to SQLite schema
3. Modify `OpenAIBackend.generate()` to use `_memory.get_context_messages()`
4. Add summary storage methods to `ConversationHistory`
5. Configure: `window_size=4` (8 messages), `summarize_threshold=10`

### Expected Performance

**Before (full history):**
- 20 message pairs = roughly 3000 tokens sent with every request
- Higher latency and higher cost

**After (rolling summary):**
- Summary (~100 tokens) + 4 recent pairs (~400 tokens) = ~500 tokens
- **~83% token reduction** for long conversations
- Faster responses, lower costs

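The reduction figure follows directly from the estimates above. The snippet below just reproduces that arithmetic; the token counts are the rough figures from this section, not measurements, and `token_reduction` is a hypothetical helper name. It does not include the occasional extra call needed to generate a summary.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of prompt tokens saved per request."""
    return 1 - after / before


full_history_tokens = 3000  # ~20 message pairs (estimate from above)
summary_tokens = 100        # rolling summary (estimate)
recent_tokens = 400         # 4 recent pairs (estimate)

after = summary_tokens + recent_tokens
print(after)                                                # 500
print(f"{token_reduction(full_history_tokens, after):.0%}")  # 83%
```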
### When to Consider Alternatives

- **Vector stores (ChromaDB)**: If you need semantic search across users or topics
- **LangChain SummaryMemory**: If you want a batteries-included solution (accept dependency)
- **MemGPT**: If conversations become complex multi-day dialogues (they won't on mesh)

---

## Example Usage

```python
# Initialize
backend = OpenAIBackend(config, api_key)

# Call with a long history (8 exchanges = 16 messages).
# Histories of up to window_size * 2 = 8 messages are sent in full.
# Run inside an async function.
await backend.generate(
    messages=[
        {"role": "user", "content": "What's the weather?"},
        {"role": "assistant", "content": "It's sunny!"},
        {"role": "user", "content": "Should I bring an umbrella?"},
        {"role": "assistant", "content": "No need, it's clear!"},
        # ... 6 more exchanges ...
    ],
    system_prompt="You are a helpful assistant.",
    user_id="!abc123",
)

# Once the history exceeds 8 messages - summary + recent sent
# Context sent to LLM:
# [
#   {"role": "system", "content": "You are a helpful assistant."},
#   {"role": "system", "content": "Previous conversation summary: User asked about weather and outdoor activities. Confirmed sunny weather, no rain expected."},
#   {"role": "user", "content": "Should I bring an umbrella?"},
#   {"role": "assistant", "content": "No need, it's clear!"},
#   ... (last 4 exchanges)
# ]
```

---

## Code Files to Modify

1. **`meshai/memory.py`** (NEW) - Add `SimpleRollingSummary` class
2. **`meshai/history.py`** - Add summary storage methods + table schema
3. **`meshai/backends/openai_backend.py`** - Integrate memory manager
4. **`meshai/responder.py`** - Pass `user_id` to `backend.generate()`
5. **`meshai/config.py`** - Add config for `window_size`, `summarize_threshold`

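For the `meshai/config.py` change, a small validated settings object may be enough. This is a sketch under assumptions: the class name `MemoryConfig` and its placement are hypothetical, and only the two parameters and their defaults come from the recommendation above. Validation runs at construction time so a bad value fails early rather than producing an empty or oversized context later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical settings for the rolling summary memory."""
    window_size: int = 4           # Recent exchanges kept raw
    summarize_threshold: int = 10  # New old messages before re-summarizing

    def __post_init__(self) -> None:
        if self.window_size < 1:
            raise ValueError("window_size must be at least 1")
        if self.summarize_threshold < 1:
            raise ValueError("summarize_threshold must be at least 1")

    @property
    def max_raw_messages(self) -> int:
        """Number of raw messages kept (two per exchange)."""
        return self.window_size * 2


cfg = MemoryConfig()
print(cfg.max_raw_messages)  # 8
```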
Let me know if you want me to implement this!