Unified Chat Memory Design
Overview
This document outlines the design for a unified chat memory system that consolidates linear and graph-based memory into a single, configurable class. The system separates storage and retrieval concerns while providing multiple API levels for different use cases.
Quick Start
import llamabot as lmb
# Simple linear memory (fast, no LLM calls)
memory = lmb.ChatMemory()
# Intelligent graph memory with threading
memory = lmb.ChatMemory.threaded(model="gpt-4o-mini")
# Use with any bot
bot = lmb.SimpleBot(system_prompt="You are helpful", memory=memory)
response = bot("Hello!") # Memory automatically stores and retrieves context
Core Design Principles
- Unified Interface: Single class handles both linear and graph-based memory
- Separation of Concerns: Storage operations (append) are separate from retrieval operations (search/context)
- Configuration at Instantiation: Memory mode and behavior are set once and never change
- NetworkX Backend: Direct use of NetworkX for graph operations without over-abstraction
- Optional Summarization: Summarization is optional and can be disabled for performance
Key Concepts
Linear vs Graph Memory
| Feature | Linear Memory | Graph Memory |
| --- | --- | --- |
| Speed | Fast (no LLM calls) | Slower (LLM for threading) |
| Intelligence | Simple (last N messages) | Smart (semantic threading) |
| Use Case | Simple conversations | Complex multi-threaded chats |
| LLM Calls | None | 1-2 per message (optional) |
Conversation Threading
Linear Memory: Messages are stored in order, retrieved as recent history
H1 → A1 → H2 → A2 → H3 → A3
Graph Memory: Messages are intelligently connected based on content
H1: "Let's talk about Python"
└── A1: "Python is great for data science"
├── H2: "What about machine learning?" → A2: "ML libraries include..."
└── H3: "Tell me about databases" → A3: "SQL databases are..."
Architecture
API Levels
High-Level API (Opinionated)
# Default linear memory
memory = ChatMemory() # Uses LinearNodeSelector by default
# Graph memory with LLM-based threading
memory = ChatMemory.threaded(model="gpt-4o-mini")
Low-Level API (Full Configurability)
memory = ChatMemory(
node_selector=LLMNodeSelector(model="gpt-4o-mini"),
summarizer=LLMSummarizer(model="gpt-4o-mini"), # Optional
context_depth=5 # Default context depth for retrieval
)
Factory Methods Implementation
@classmethod
def threaded(cls, model: str = "gpt-4o-mini", **kwargs) -> "ChatMemory":
    """Create ChatMemory with LLM-based threading.

    :param model: LLM model name for node selection and summarization
    :param kwargs: Additional arguments passed to the ChatMemory constructor
    """
    return cls(
        node_selector=LLMNodeSelector(model=model),
        summarizer=LLMSummarizer(model=model),  # Optional but recommended for threading
        **kwargs,
    )
Under the Hood: What .threaded() Actually Does
When you call ChatMemory.threaded(model="gpt-4o-mini"), here's exactly what happens:
# 1. Factory method creates LLMNodeSelector
llm_selector = LLMNodeSelector(model="gpt-4o-mini")
# This creates a selector that will use GPT-4o-mini to choose conversation threads
# 2. Factory method creates LLMSummarizer
llm_summarizer = LLMSummarizer(model="gpt-4o-mini")
# This creates a summarizer that will generate message summaries for better threading
# 3. Factory method calls the main constructor
memory = ChatMemory(
node_selector=llm_selector,
summarizer=llm_summarizer,
context_depth=5 # Default value
)
# 4. Constructor initializes the memory system
def __init__(self, node_selector, summarizer, context_depth=5):
    self.graph = nx.DiGraph()           # Empty conversation graph
    self.node_selector = node_selector  # Will use the LLM selector for thread selection
    self.summarizer = summarizer        # Will generate message summaries
    self.context_depth = context_depth  # How far back to look for context
    self._next_node_id = 1              # Start numbering nodes from 1
Result: You get a ChatMemory instance that:
- Uses LLM-based intelligent threading instead of linear memory
- Automatically generates message summaries for better thread selection
- Maintains a conversation graph with parent-child relationships
- Can retrieve context by traversing conversation threads
Equivalent Manual Creation:
# This is exactly what .threaded() does internally
memory = ChatMemory(
node_selector=LLMNodeSelector(model="gpt-4o-mini"),
summarizer=LLMSummarizer(model="gpt-4o-mini"),
context_depth=5
)
Note: We chose the factory method pattern over alternatives like a constructor with mode parameters or separate classes. The factory pattern provides clearer intent through descriptive method names while keeping the __init__ method clean and focused on low-level configuration. This approach makes the API more readable and maintainable, especially as we add more memory modes and configuration options.
Data Model
ConversationNode
@dataclass
class ConversationNode:
    id: int                                   # Auto-incremented based on number of nodes in graph
    message: BaseMessage                      # Single message (not conversation turn)
    summary: Optional[MessageSummary] = None
    parent_id: Optional[int] = None
    timestamp: datetime = field(default_factory=datetime.now)
Key Points:
- Each node represents a single message (human or assistant)
- id: Auto-incremented integer providing natural ordering
- parent_id: Creates threading relationships (None for root)
- message: Contains role information (human/assistant) via BaseMessage
- timestamp: Metadata for when the message was created
- summary: Optional for performance
- Immutable once created
MessageSummary
class MessageSummary(BaseModel):
    title: str = Field(..., description="Title of the message")
    summary: str = Field(..., description="Summary of the message. Two sentences max.")
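For illustration, constructing a node from this data model might look like the following sketch; the HumanMessage import path is an assumption about llamabot's message classes.
# Sketch only: shows the data model in use, not production code
from llamabot.components.messages import HumanMessage  # assumed import path

node = ConversationNode(
    id=1,
    message=HumanMessage(content="Let's talk about Python"),
    summary=MessageSummary(
        title="Python Discussion Start",
        summary="User wants to discuss Python programming.",
    ),
    parent_id=None,  # First human message is the root, so it has no parent
)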
Usage Examples
Basic SimpleBot with Linear Memory
import llamabot as lmb
# Create a bot with simple linear memory (fast, no LLM calls)
memory = lmb.ChatMemory() # Default linear
bot = lmb.SimpleBot(
system_prompt="You are a helpful assistant.",
model_name="gpt-4o-mini",
memory=memory
)
# Chat loop automatically uses memory
response1 = bot("Hello! How are you?")
response2 = bot("What did I just ask you?") # Bot can reference previous conversation
SimpleBot with Graph Memory
import llamabot as lmb
# Create a bot with intelligent threading (uses LLM for smart connections)
memory = lmb.ChatMemory.threaded(model="gpt-4o-mini")
bot = lmb.SimpleBot(
system_prompt="You are a helpful assistant.",
model_name="gpt-4o-mini",
memory=memory
)
# Bot can now handle conversation threading intelligently
response1 = bot("Let's talk about Python programming.")
response2 = bot("What are the benefits of using Python?") # Continues Python thread
response3 = bot("Now let's discuss machine learning.") # Starts new thread
response4 = bot("What libraries should I use for ML in Python?") # Connects back to Python thread
Custom Memory Configuration
import llamabot as lmb
# Custom memory configuration (advanced users)
memory = lmb.ChatMemory(
node_selector=lmb.LLMNodeSelector(model="gpt-4o-mini"),
summarizer=lmb.LLMSummarizer(model="gpt-4o-mini"), # Optional for better threading
context_depth=10 # How far back to look for context
)
bot = lmb.SimpleBot(
system_prompt="You are a coding assistant.",
model_name="gpt-4o-mini",
memory=memory
)
Memory Retrieval in Chat Loop
import llamabot as lmb
# Bot with memory that can retrieve context
memory = lmb.ChatMemory.threaded(model="gpt-4o-mini")
bot = lmb.SimpleBot(
system_prompt="You are a helpful assistant.",
model_name="gpt-4o-mini",
memory=memory
)
# Simulate a conversation
bot("I'm working on a Python project.")
bot("I need to handle file I/O.")
bot("What's the best way to read CSV files?")
bot("Can you remind me what we discussed about file I/O?") # Bot retrieves relevant context
Memory Export and Visualization
from llamabot.bot.simplebot import SimpleBot
from llamabot.components.chat_memory import ChatMemory
# Create bot with graph memory
memory = ChatMemory.threaded(model="gpt-4o-mini")
bot = SimpleBot(
system_prompt="You are a helpful assistant.",
model_name="gpt-4o-mini",
memory=memory
)
# Have a conversation
bot("Let's discuss Python.")
bot("What about data structures?")
bot("Now let's talk about machine learning.")
bot("What ML libraries work well with Python?")
# Export conversation graph
mermaid_diagram = memory.to_mermaid()
print(mermaid_diagram)
# Get conversation statistics
print(f"Total messages: {len(memory.graph.nodes())}")
print(f"Conversation threads: {len([n for n in memory.graph.nodes() if memory.graph.out_degree(n) == 0])}")
When to Use Each Memory Type
| Use Case | Memory Type | Why |
| --- | --- | --- |
| Simple Q&A | lmb.ChatMemory() | Fast, no LLM calls needed |
| Multi-topic conversations | lmb.ChatMemory.threaded() | Smart threading connects related topics |
| Performance critical | lmb.ChatMemory() | No additional LLM latency |
| Complex discussions | lmb.ChatMemory.threaded() | Maintains conversation context across topics |
| Real-time chat | lmb.ChatMemory() | Immediate responses |
| Research/analysis | lmb.ChatMemory.threaded() | Can reference earlier parts of conversation |
Memory in Different Bot Types
import llamabot as lmb
# Linear memory for simple conversations (fast)
linear_memory = lmb.ChatMemory() # Default linear
simple_bot = lmb.SimpleBot(
system_prompt="You are a helpful assistant.",
memory=linear_memory
)
# Graph memory for complex conversations with threading (smart)
graph_memory = lmb.ChatMemory.threaded(model="gpt-4o-mini")
query_bot = lmb.QueryBot(
system_prompt="You are a helpful assistant.",
memory=graph_memory
)
# Custom memory for specific needs (advanced)
custom_memory = lmb.ChatMemory(
node_selector=lmb.LLMNodeSelector(model="gpt-4o-mini"),
summarizer=None # No summarization for performance
)
structured_bot = lmb.StructuredBot(
system_prompt="You are a helpful assistant.",
pydantic_model=SomeModel,  # SomeModel is a placeholder for your own Pydantic model
memory=custom_memory
)
Memory Reset and State Management
from llamabot.bot.simplebot import SimpleBot
from llamabot.components.chat_memory import ChatMemory
# Create bot with memory
memory = ChatMemory() # Default linear
bot = SimpleBot(
system_prompt="You are a helpful assistant.",
model_name="gpt-4o-mini",
memory=memory
)
# Have a conversation
bot("Hello!")
bot("How are you?")
# Reset memory for new conversation
memory.reset()
# Bot no longer remembers previous conversation
response = bot("What did we just talk about?") # Bot won't remember
Core Operations
Storage Operations
append(human_message: BaseMessage, assistant_message: BaseMessage)
- Adds both messages to the graph
- Creates parent-child relationship between messages
- Uses node selector to determine threading (linear vs graph mode)
- Prevents cycles and orphaned nodes
reset()
- Clears all stored messages
- Resets graph to empty state
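In use, the storage API looks like the following sketch; the message constructors are assumed to come from llamabot.components.messages.
from llamabot.components.chat_memory import ChatMemory
from llamabot.components.messages import HumanMessage, AIMessage  # assumed import path

memory = ChatMemory()
memory.append(
    HumanMessage(content="What's a good way to read CSV files?"),
    AIMessage(content="pandas.read_csv is the most common choice."),
)
memory.reset()  # Clears the graph; the next append starts a new conversation tree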
Retrieval Operations
retrieve(query: str, n_results: int = 10, context_depth: int = 5) -> List[BaseMessage]
- Smart retrieval that adapts based on memory configuration
- Linear memory: Ignores query, returns recent messages (fast)
- Graph memory: Uses semantic search with BM25 or similar algorithm, then traverses up thread paths
- n_results: Number of relevant messages to find via semantic search
- context_depth: Number of nodes to traverse up each thread path for context
- Returns most relevant messages with their conversation context
- Works like existing docstore implementations
Context Depth Example:
H1: "Let's talk about Python" (root)
└── A1: "Python is great for data science"
├── H2: "What about machine learning?"
│ └── A2: "ML libraries include scikit-learn"
└── H3: "Tell me about databases"
└── A3: "SQL databases are..."
# Thread path for A2: A2 ← H2 ← A1 ← H1 (root)
memory.retrieve(query="machine learning", n_results=1, context_depth=2)
# Returns: [A2, H2, A1] (relevant message + 2 messages up thread path)
Threading Model
Conversation Structure
The graph memory uses a tree structure where all nodes are connected in a single conversation tree:
H1: "Let's talk about Python" (root - first human message)
└── A1: "Python is great for data science"
├── H2: "What about machine learning?"
│ └── A2: "ML libraries include scikit-learn"
└── H3: "Tell me about databases"
└── A3: "SQL databases are..."
# Thread paths:
# Thread 1: H1 → A1 → H2 → A2
# Thread 2: H1 → A1 → H3 → A3
Threading Rules
- Root Node: The first human message becomes the root of the conversation tree
- Branching Rules:
- Human messages can only branch from assistant messages
- Assistant messages can only branch from human messages
- This enforces the conversation turn structure: Human → Assistant → Human → Assistant...
- Thread Definition: Threads are paths from root to leaf nodes (active conversation endpoints)
- Leaf nodes: Nodes with no out-edges (no children)
- Root node: First human message with no parent (parent_id = None)
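These rules can be checked mechanically. The helper below is hypothetical (not part of the proposed API) and only illustrates the turn-structure invariant, assuming the node storage layout described in this document.
import networkx as nx

def validate_turn_structure(graph: nx.DiGraph) -> bool:
    """Return True if every edge alternates roles (human -> assistant -> human ...)."""
    for parent, child in graph.edges():
        parent_role = graph.nodes[parent]["node"].message.role
        child_role = graph.nodes[child]["node"].message.role
        if parent_role == child_role:
            return False  # Two same-role messages in a row violates the branching rules
    return True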
Node Selection Strategies
LinearNodeSelector
- Always selects the leaf assistant node (node with no out-edges that is an assistant message)
- Creates linear conversation flow
- No LLM calls required
- Used in linear mode
- Constraint: Can only select assistant messages as parents for human messages
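A minimal sketch of LinearNodeSelector, assuming the select_parent interface used by append_with_threading later in this document and the imports (networkx as nx, BaseMessage, Optional) used elsewhere in this design:
class LinearNodeSelector:
    """Always pick the most recent leaf assistant node as the parent."""

    def select_parent(self, graph: nx.DiGraph, human_message: BaseMessage) -> Optional[int]:
        candidates = [
            n for n in graph.nodes()
            if graph.out_degree(n) == 0
            and graph.nodes[n]["node"].message.role == "assistant"
        ]
        # Highest ID is the most recent node, since IDs auto-increment
        return max(candidates) if candidates else None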
LLMNodeSelector
- Uses LLM to intelligently select which assistant message to branch from
- Considers message content and conversation context
- Supports retry logic with feedback
- Used in graph mode
- Constraint: Can only select assistant messages as parents for human messages
- First message handling: If no assistant messages exist, creates root node (parent_id = None)
Node Selection Logic
- First message: If no assistant nodes exist, the message becomes the root (parent_id = None)
- Subsequent messages: LLM selects the best assistant message as parent
- No fallbacks: LLM selection should be reliable; if it fails, the message becomes root (see the sketch below)
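A hedged sketch of how LLMNodeSelector could implement this logic; _choose_with_llm is a hypothetical helper standing in for whatever prompt and model call the real implementation uses:
class LLMNodeSelector:
    def __init__(self, model: str):
        self.model = model

    def select_parent(self, graph: nx.DiGraph, human_message: BaseMessage) -> Optional[int]:
        candidates = [
            n for n in graph.nodes()
            if graph.nodes[n]["node"].message.role == "assistant"
        ]
        if not candidates:
            return None  # First message: becomes the root node
        # Hypothetical helper: prompts the model with candidate summaries plus the
        # new message, and returns the chosen candidate ID (or None on failure).
        chosen = self._choose_with_llm(candidates, human_message, graph)
        return chosen if chosen in candidates else None  # Invalid pick -> message becomes root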
Usage Patterns
Memory Usage in Chat Loop
The following example shows how memory is used inside a bot's __call__ method. This is the standard pattern that all memory types should follow:
def __call__(self, *human_messages):
    # 1. Process incoming messages
    processed_messages = to_basemessage(human_messages)

    # 2. RETRIEVAL: Get relevant context from memory
    memory_messages = []
    if self.memory:
        # Memory system handles the complexity internally
        memory_messages = self.memory.retrieve(
            query=f"From our conversation history, give me the most relevant information to the query, {[p.content for p in processed_messages]}",
            n_results=10,
            context_depth=5,
        )

    # 3. Build message list with context
    messages = [self.system_prompt] + memory_messages + processed_messages

    # 4. Generate response (the LLM call is elided here; it produces content and tool_calls)
    response_message = AIMessage(content=content, tool_calls=tool_calls)

    # 5. STORAGE: Save conversation turn to memory
    if self.memory:
        self.memory.append(processed_messages[-1], response_message)
    return response_message
Key Points:
- Retrieval happens before response generation to provide context
- Storage happens after response generation to save the conversation turn
- Memory is self-aware - the retrieve() method automatically chooses the best strategy based on memory type
- No mode checking needed - bot implementers don't need to know about memory internals
- Performance optimization is automatic - linear memory skips expensive semantic search
- Memory is optional - bot works without memory, just with less context
- Unified API - same method calls work for all memory types
Note: This example shows the core memory operations with logging stripped out for clarity. Real implementations should include appropriate logging and error handling.
Export and Visualization
Mermaid Export
# Export conversation graph
mermaid_diagram = memory.to_mermaid()
# Filter by role for cleaner visualization
assistant_nodes = [n for n in memory.graph.nodes()
if memory.graph.nodes[n]['node'].message.role == 'assistant']
Implementation Details
Modular Architecture
The implementation uses a modular approach to keep the main ChatMemory class clean and focused:
llamabot/components/chat_memory/
├── __init__.py # Exports main classes and functions
├── memory.py # Main ChatMemory class
├── retrieval.py # Retrieval functions
├── storage.py # Storage functions
├── visualization.py # Visualization functions
└── selectors.py # Node selection strategies
Module Exports in __init__.py:
# Main classes
from .memory import ChatMemory
from .selectors import LinearNodeSelector, LLMNodeSelector
from .storage import append_linear, append_with_threading
from .retrieval import get_recent_messages, semantic_search_with_context
from .visualization import to_mermaid
__all__ = [
"ChatMemory",
"LinearNodeSelector",
"LLMNodeSelector",
"append_linear",
"append_with_threading",
"get_recent_messages",
"semantic_search_with_context",
"to_mermaid"
]
Test Structure Should Mirror Components:
tests/components/chat_memory/
├── test_memory.py # Test main ChatMemory class
├── test_retrieval.py # Test retrieval functions
├── test_storage.py # Test storage functions
├── test_visualization.py # Test visualization functions
└── test_selectors.py # Test node selection strategies
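As an example of the testing style this enables, a unit test for linear retrieval might look like this sketch; the message constructor imports are assumptions.
# tests/components/chat_memory/test_retrieval.py (sketch)
from llamabot.components.chat_memory import ChatMemory
from llamabot.components.messages import HumanMessage, AIMessage  # assumed import path

def test_linear_memory_returns_recent_messages():
    memory = ChatMemory()  # LinearNodeSelector by default, so no LLM calls are made
    memory.append(HumanMessage(content="Hello"), AIMessage(content="Hi there!"))
    messages = memory.retrieve(query="ignored for linear memory", n_results=2)
    contents = [m.content for m in messages]
    assert "Hello" in contents and "Hi there!" in contents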
Main Class (Clean and Focused)
class ChatMemory:
    def __init__(self,
                 node_selector: Optional[NodeSelector] = None,
                 summarizer: Optional[Summarizer] = None,
                 context_depth: int = 5):
        """Initialize chat memory with configuration.

        :param node_selector: Strategy for selecting parent nodes (None = LinearNodeSelector)
        :param summarizer: Optional summarization strategy (None = no summarization)
        :param context_depth: Default depth for context retrieval
        """
        # Initialize NetworkX graph for storage
        self.graph = nx.DiGraph()
        # Set node selector (linear by default, LLM-based if provided)
        self.node_selector = node_selector or LinearNodeSelector()
        # Set optional summarizer
        self.summarizer = summarizer
        # Validate and store context depth
        if context_depth < 0:
            raise ValueError("context_depth must be non-negative")
        self.context_depth = context_depth
        # Track next node ID for auto-incrementing
        self._next_node_id = 1

    def retrieve(self, query: str, n_results: int = 10, context_depth: Optional[int] = None) -> List[BaseMessage]:
        """Smart retrieval that adapts based on memory configuration."""
        if context_depth is None:
            context_depth = self.context_depth
        if isinstance(self.node_selector, LinearNodeSelector):
            return get_recent_messages(self.graph, n_results)
        return semantic_search_with_context(self.graph, query, n_results, context_depth)

    def append(self, human_message: BaseMessage, assistant_message: BaseMessage):
        """Add conversation turn to memory."""
        if isinstance(self.node_selector, LinearNodeSelector):
            append_linear(self.graph, human_message, assistant_message, self._next_node_id)
        else:
            append_with_threading(self.graph, human_message, assistant_message, self.node_selector, self._next_node_id)
        self._next_node_id += 2  # Increment for both messages
Separate Functions (Implementation Details)
# retrieval.py
def get_recent_messages(graph: nx.DiGraph, n_results: int) -> List[BaseMessage]:
    """Get the most recent N messages from linear memory."""

def semantic_search_with_context(graph: nx.DiGraph, query: str, n_results: int, context_depth: int) -> List[BaseMessage]:
    """Find relevant nodes via semantic search, then traverse up their thread paths for context."""

def traverse_thread_path(graph: nx.DiGraph, node: int, depth: int) -> List[BaseMessage]:
    """Traverse up a conversation thread path from a given node."""
# storage.py
def append_linear(graph: nx.DiGraph, human_message: BaseMessage, assistant_message: BaseMessage, next_node_id: int):
    """Append messages to linear memory."""
    # Create human node
    human_node = ConversationNode(
        id=next_node_id,
        message=human_message,
        parent_id=find_leaf_assistant_node(graph) if graph.nodes() else None,
    )
    graph.add_node(next_node_id, node=human_node)

    # Create assistant node
    assistant_node = ConversationNode(
        id=next_node_id + 1,
        message=assistant_message,
        parent_id=next_node_id,
    )
    graph.add_node(next_node_id + 1, node=assistant_node)

    # Add edges
    if human_node.parent_id:
        graph.add_edge(human_node.parent_id, next_node_id)
    graph.add_edge(next_node_id, next_node_id + 1)
def append_with_threading(graph: nx.DiGraph, human_message: BaseMessage, assistant_message: BaseMessage, node_selector, next_node_id: int):
    """Append messages with intelligent threading following conversation turn structure."""
    # Use node selector to find best parent for human message
    parent_id = node_selector.select_parent(graph, human_message)

    # Create human node
    human_node = ConversationNode(
        id=next_node_id,
        message=human_message,
        parent_id=parent_id,
    )
    graph.add_node(next_node_id, node=human_node)

    # Create assistant node
    assistant_node = ConversationNode(
        id=next_node_id + 1,
        message=assistant_message,
        parent_id=next_node_id,
    )
    graph.add_node(next_node_id + 1, node=assistant_node)

    # Add edges
    if parent_id:
        graph.add_edge(parent_id, next_node_id)
    graph.add_edge(next_node_id, next_node_id + 1)
# visualization.py
def to_mermaid(graph: nx.DiGraph, **kwargs) -> str:
    """Convert graph to Mermaid diagram."""
Benefits:
- Cleaner main class - focuses on high-level API
- Easier testing - can test functions independently
- Better separation of concerns - each function has one job
- More modular - functions can be reused or swapped
- Easier to understand - main class shows the "what", functions show the "how"
Testing Benefits:
- Unit tests for each function in isolation
- Integration tests for the main ChatMemory class
- Mock testing of LLM components without real API calls
- Test coverage for each component independently
- Regression testing when modifying individual functions
Import Benefits:
- Clean imports: from llamabot.components.chat_memory import ChatMemory
- Function access: from llamabot.components.chat_memory import append_linear
- Selector access: from llamabot.components.chat_memory import LLMNodeSelector
- Top-level exports: All main functionality available from module root
NetworkX Backend
- Direct use of NetworkX DiGraph for storage
- No abstraction layer needed
- Leverages NetworkX algorithms for graph operations
- Efficient for small to medium conversation graphs
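For example, the traverse_thread_path helper declared in retrieval.py needs nothing beyond NetworkX's predecessor lookup; a sketch under the data model above:
def traverse_thread_path(graph: nx.DiGraph, node: int, depth: int) -> List[BaseMessage]:
    """Walk up to depth parents from node, returning messages nearest-first."""
    messages = [graph.nodes[node]["node"].message]
    current = node
    for _ in range(depth):
        parents = list(graph.predecessors(current))  # At most one parent in a tree
        if not parents:
            break  # Reached the root
        current = parents[0]
        messages.append(graph.nodes[current]["node"].message)
    return messages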
Implementation Details
Auto-Incremented IDs
- Node IDs start at 1 and increment for each new node
- Provides natural chronological ordering
- Simple integer-based identification
- No UUID complexity or collision concerns
NetworkX Graph Storage
- Each node stores a ConversationNode object as node data
- Node ID is the NetworkX node identifier
- Edges represent parent-child relationships
- Graph maintains conversation tree structure
Node Selection Process
- Linear Mode: Find leaf assistant node (no out-edges, role="assistant")
- Graph Mode:
- Get all assistant nodes as candidates
- Use LLM to select best parent based on message content
- Validate selection is an assistant node
- If no candidates exist, message becomes root
Error Handling
The system uses actionable error handling - only raising errors for issues that humans can actually fix:
Actionable Errors (User Can Fix)
- Configuration errors: Invalid parameters like a negative context_depth
- File system errors: Permission denied, disk full, invalid file paths
- Input validation: Wrong message types, empty message content
Graceful Handling (No Errors)
- Empty memory: Returns empty list instead of error
- LLM selection failures: Falls back to most recent valid node
- Summarization failures: Continues without summary
- Graph corruption: Clear error message with reset instruction
Error Message Examples
Configuration Error (Actionable):
# Validated at instantiation
if context_depth < 0:
    raise ValueError("context_depth must be non-negative")
File System Error (Actionable):
if "Permission denied" in str(e):
raise PersistenceError(
f"Cannot save to {file_path}. Check file permissions or choose a different location."
)
elif "No space left" in str(e):
raise PersistenceError(
f"Disk is full. Free up space or choose a different location."
)
Graph Corruption (Actionable):
raise InvalidGraphStateError(
"Conversation graph has become corrupted. "
"This can happen if the same message was processed multiple times. "
"Use memory.reset() to clear the conversation and start fresh."
)
Graceful Handling Examples:
# Empty memory - no error, just empty result
if not graph.nodes():
    return []

# LLM failure - fallback to most recent node
if llm_response not in valid_candidates:
    return valid_candidates[-1] if valid_candidates else None

# Summarization failure - continue without summary
try:
    summary = summarizer.summarize(message)
except Exception:
    summary = None
Performance Considerations
- Optional summarization for linear mode
- Lazy loading of summaries when needed
- Efficient graph traversal for retrieval
- Memory-efficient storage of large conversations
Benefits
- Simplified API: Single class for all memory operations
- Better Performance: Optional summarization reduces LLM calls
- Clearer Separation: Storage and retrieval are distinct concerns
- Easier Testing: Smaller, focused components
- Future Extensibility: Pluggable node selectors and retrieval strategies
- Type Safety: Clear interfaces and error handling
Persistence Design
Storage Format
The conversation memory uses a JSON-based format for persistence that is easily parseable and human-readable:
{
"version": "1.0",
"metadata": {
"created_at": "2024-01-15T10:30:00Z",
"last_modified": "2024-01-15T14:45:00Z",
"mode": "graph",
"total_messages": 12
},
"nodes": [
{
"id": 1,
"role": "user",
"content": "Let's talk about Python",
"timestamp": "2024-01-15T10:30:00Z",
"summary": {
"title": "Python Discussion Start",
"summary": "User wants to discuss Python programming."
},
"parent_id": null
},
{
"id": 2,
"role": "assistant",
"content": "Python is great for data science",
"timestamp": "2024-01-15T10:30:05Z",
"summary": {
"title": "Python Benefits",
"summary": "Assistant explains Python's benefits for data science."
},
"parent_id": 1
}
],
"edges": [
{"from": 1, "to": 2},
{"from": 2, "to": 3},
{"from": 2, "to": 5}
]
}
Persistence Operations
save(file_path: str) -> None
- Serializes conversation memory to JSON file
- Includes metadata for versioning and tracking
- Preserves all node data and edge relationships
- Handles BaseMessage serialization
load(file_path: str) -> ChatMemory
- Deserializes JSON file to recreate memory
- Validates graph structure integrity
- Reconstructs NetworkX graph from JSON data
- Handles version compatibility
export(format: str = "json") -> str
- Exports conversation in various formats
- JSON: Full conversation with metadata
- JSONL: OpenAI-compatible format for fine-tuning
- Mermaid: Visualization format
- Plain text: Simple conversation transcript
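A sketch of how save and load could wrap the serialization helpers shown below; the file handling details (pathlib, indent=2) are assumptions.
import json
from pathlib import Path

def save(self, file_path: str) -> None:
    """Serialize the conversation memory to a JSON file."""
    Path(file_path).write_text(json.dumps(self.to_json(), indent=2))

def load(file_path: str) -> "ChatMemory":
    """Recreate a ChatMemory instance from a previously saved JSON file."""
    return from_json(json.loads(Path(file_path).read_text()))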
Implementation Details
JSON Serialization Strategy
def to_json(self) -> dict:
    """Convert conversation memory to JSON-serializable dict."""
    return {
        "version": "1.0",
        "metadata": {
            "created_at": self.created_at.isoformat(),
            "last_modified": datetime.now().isoformat(),
            "mode": self.mode,
            "total_messages": len(self.graph.nodes()),
        },
        "nodes": [
            {
                "id": node_id,
                "role": node_data["node"].message.role,
                "content": node_data["node"].message.content,
                "timestamp": node_data["node"].timestamp.isoformat(),
                "summary": node_data["node"].summary.dict() if node_data["node"].summary else None,
                "parent_id": node_data["node"].parent_id,
            }
            for node_id, node_data in self.graph.nodes(data=True)
        ],
        "edges": [
            {"from": u, "to": v}
            for u, v in self.graph.edges()
        ],
    }
Graph Reconstruction
def from_json(data: dict) -> ChatMemory:
    """Reconstruct conversation memory from JSON data."""
    memory = ChatMemory(mode=data["metadata"]["mode"])

    # Reconstruct nodes
    for node_data in data["nodes"]:
        message = create_message(node_data["role"], node_data["content"])
        node = ConversationNode(
            id=node_data["id"],
            message=message,
            summary=MessageSummary(**node_data["summary"]) if node_data["summary"] else None,
            parent_id=node_data["parent_id"],
            timestamp=datetime.fromisoformat(node_data["timestamp"]),
        )
        memory.graph.add_node(node_data["id"], node=node)

    # Reconstruct edges
    for edge in data["edges"]:
        memory.graph.add_edge(edge["from"], edge["to"])
    return memory
Benefits of JSON Format
- Human-readable: Easy to inspect and debug
- Version control friendly: Diff-able and merge-able
- Language agnostic: Can be parsed by any language
- Extensible: Easy to add new fields
- Standard format: Well-supported across tools
- No security risks: Unlike pickle, no code execution
File Naming Convention
conversations/
├── session_2024-01-15_10-30-00.json
├── session_2024-01-15_14-45-00.json
└── backup_2024-01-15_18-00-00.json
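A small sketch of generating these session file names; session_path is a hypothetical helper, not part of the proposed API.
from datetime import datetime
from pathlib import Path

def session_path(base_dir: str = "conversations") -> Path:
    """Build a timestamped session file path matching the convention above."""
    Path(base_dir).mkdir(exist_ok=True)
    return Path(base_dir) / f"session_{datetime.now():%Y-%m-%d_%H-%M-%S}.json"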
Open Questions and Future Enhancements
Concurrency Handling
- How should multiple threads/processes access the same memory file?
- Should we use file locking or database backend for concurrent access?
- What happens if two processes try to append simultaneously?
Advanced Retrieval Strategies
- Semantic search across message content
- Time-based retrieval (messages from last hour/day)
- User-specific retrieval (only messages from specific user)
- Context-aware retrieval (messages related to current topic)
Performance Optimizations
- Lazy loading of large conversation histories
- Caching frequently accessed message paths
- Compression for long conversations
- Incremental graph updates
Integration with External Systems
- Export to chat platforms (Slack, Discord, etc.)
- Integration with vector databases for semantic search
- Webhook support for real-time updates
- API endpoints for external access
Migration Strategy
Phase 1: Deprecation (v0.13.0)
- Add deprecation warnings to the existing ChatMemory class
- Document the new ChatMemory API
- Update examples to use the new API
- Ensure ChatMemory is top-level in llamabot/__init__.py ✅
- Update tests to reflect the modular component structure ✅
Phase 2: Transition (v0.14.0)
- Make the new ChatMemory implementation the default
- Keep the old ChatMemory as an alias with a deprecation warning
- Update all internal usage
Phase 3: Removal (v0.15.0)
- Remove the old ChatMemory implementation entirely
- Remove deprecated methods
- Clean up imports and references
Migration Guide
Old API:
from llamabot.components.chat_memory import ChatMemory
memory = ChatMemory()
memory.add_message("user", "Hello")
messages = memory.get_messages()
New API:
from llamabot.components.chat_memory import ChatMemory
memory = ChatMemory()
memory.append(human_message, assistant_message)  # Stores a full conversation turn
messages = memory.retrieve(query="Hello")
Key Changes:
- ChatMemory → ChatMemory (same name, new implementation)
- add_message() → append()
- get_messages() → retrieve()
- New threading support with ChatMemory.threaded()
- New persistence methods: save(), load(), export()
Imports:
import llamabot as lmb
memory = lmb.ChatMemory.threaded(model="gpt-4o-mini")
Conclusion
This unified design addresses the core issues with the current implementation while maintaining the flexibility needed for different use cases. The separation of storage and retrieval concerns makes the system more maintainable and easier to understand, while the multiple API levels provide the right level of abstraction for different users.