written by Eric J. Ma on 2024-01-10 | tags: llamabot api chromadb openai mistral anthropic claude mixtral simplebot chatbot querybot llm large language model
In my spare time (if I can find any), I've been hacking on LlamaBot to make a bunch of internal improvements to the package. It's about ready, and I'd like to document what's changed here.
The first thing worth discussing is how the text-based models now use LiteLLM behind the scenes. With the explosion of available models, being able to switch between them without building extensive internal infrastructure is what makes experimentation feasible. In my case, I initially built against OpenAI's GPT-4 API. Later, I experimented with Ollama for local LLMs on my tiny MacBook Air. Then I became curious about Claude (by Anthropic) and Mixtral (by Mistral) and realized what a headache it would be to maintain my own switchboard for dispatching to different APIs. LiteLLM solved that problem for me efficiently, providing a uniform API interface to the various models I wanted to try. In short, LiteLLM became the API switchboard I desperately needed, and I'd recommend checking it out!
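To give a sense of why this matters, here is a minimal sketch of LiteLLM's uniform interface. The model identifier strings below are illustrative assumptions; the exact names depend on which providers and local models you have configured.

```python
from litellm import completion

# One call signature across providers; only the model string changes.
# (Model identifiers here are illustrative.)
messages = [{"role": "user", "content": "Summarize LlamaBot in one sentence."}]

gpt4_reply = completion(model="gpt-4", messages=messages)
claude_reply = completion(model="claude-2", messages=messages)
local_reply = completion(model="ollama/mistral", messages=messages)

print(gpt4_reply.choices[0].message.content)
```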
The second thing I'd like to discuss here is the new DocumentStore class available in LlamaBot. I ripped out the internals of QueryBot, made DocumentStore an independent class, and then wired DocumentStore back into the QueryBot class.
What was the motivation for doing so? It was primarily my realization that I needed an interface for document storage and retrieval that (1) is consistent across different storage backends and (2) can be customized internally to work with different forms of storage and retrieval logic.
As such, I started out with the following DocumentStore API:

```python
class DocumentStore:
    def __init__(
        self,
        collection_name: str,
        storage_path: Path = Path.home() / ".llamabot" / "chroma.db",
    ):
        client = chromadb.PersistentClient(path=str(storage_path))
        collection = client.create_collection(collection_name, get_or_create=True)
        self.storage_path = storage_path
        self.client = client
        self.collection = collection
        self.collection_name = collection_name

    def append(self, document: str, metadata: dict = {}):
        doc_id = sha256(document.encode()).hexdigest()
        self.collection.add(documents=document, ids=doc_id, metadatas=metadata)

    def extend(self, documents: list[str]):
        for document in documents:
            self.append(document)

    def retrieve(self, query: str, n_results: int = 10) -> list[str]:
        results: QueryResult = self.collection.query(
            query_texts=query, n_results=n_results
        )
        return results["documents"][0]

    def reset(self):
        self.client.delete_collection(self.collection_name)
        self.collection = self.client.create_collection(
            self.collection_name, get_or_create=True
        )

    def add_documents(
        self,
        document_paths: Path | list[Path],
        chunk_size: int = 2_000,
        chunk_overlap: int = 500,
    ):
        if isinstance(document_paths, Path):
            document_paths = [document_paths]
        for document_path in document_paths:
            document = magic_load_doc(document_path)
            splitted_document = split_document(
                document, chunk_size=chunk_size, chunk_overlap=chunk_overlap
            )
            splitted_document = [doc.text for doc in splitted_document]
            self.extend(splitted_document)
```
Here, append, extend, retrieve, and reset are the core APIs, while add_documents is a higher-level API that lets us add documents to the DocumentStore based on file paths. What's cool is that this core interface (init, append, extend, retrieve, and reset) applies to both document storage and chat history! Consider chat history: within a ChatBot setting, we are always appending to it (or extending it) and retrieving from it (the last K messages). Re-imagining chat history as a collection of documents lets us build a uniform interface for both, which simplifies the internals of our Bots. Here's what the History class looks like, then, with essentially the same interface plus a __getitem__ method:
```python
class History:
    """History of messages."""

    def __init__(self, session_name: str):
        self.messages: list[BaseMessage] = []
        self.session_name = session_name

    def append(self, message: BaseMessage):
        """Append a message to the history."""
        self.messages.append(message)

    def retrieve(
        self, query: BaseMessage, character_budget: int
    ) -> list[BaseMessage]:
        """Retrieve messages from the history up to the character budget.

        We use the character budget in order to simplify how we retrieve messages.

        :param query: The query to use to retrieve messages. Not used in this class.
        :param character_budget: The number of characters to retrieve.
        """
        return retrieve_messages_up_to_budget(self.messages, character_budget)

    def reset(self):
        self.messages = []

    def __getitem__(self, index):
        """Get the message at the given index."""
        return self.messages[index]
```
You may have noticed that I used ChromaDB instead of LlamaIndex's tooling. There were multiple reasons for doing so, but the primary drivers were as follows.
The first was the layers of complexity in LlamaIndex's tooling, which primarily revolves around a VectorStoreIndex, an LLMPredictor, a ServiceContext, and a StorageContext. The documentation does not clearly explain what each of these abstractions is all about. Additionally, it felt peculiar to pair an LLMPredictor with a VectorStoreIndex when the LLMPredictor was primarily present to synthesize answers - at most, we need an embedding function. By contrast, ChromaDB's abstractions are much more natural: we have documents (text) that are stored as vectors (numbers), and they are linked together; when we query the vector database with a query_text, we get back a collection of results that have the document and metadata linked together.
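As a concrete illustration of those abstractions, here is a minimal ChromaDB sketch; the collection name, documents, and query are made up for illustration.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma.db")
collection = client.create_collection("blog-posts", get_or_create=True)

# Documents go in as text; ChromaDB embeds and stores them as vectors.
collection.add(
    documents=["LlamaBot now uses LiteLLM as its API switchboard."],
    ids=["post-1"],
    metadatas=[{"source": "blog"}],
)

# A query_text comes back with documents and metadata linked together.
results = collection.query(
    query_texts=["What does LlamaBot use to talk to LLM APIs?"], n_results=1
)
print(results["documents"][0], results["metadatas"][0])
```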
The other was the so-called "cost of sane defaults": ChromaDB defaults to using SentenceTransformer from HuggingFace to compute embeddings (which is zero-cost out of the box), while LlamaIndex's examples commonly default to using OpenAI's API, which costs some money.
Taken together, though I had already become somewhat familiar with LlamaIndex's API, ChromaDB felt much more natural for the way that LlamaBot's internal APIs were being designed -- bots that do text-in/text-out and document stores with customizable retrieval.
One critical insight I arrived at in building LlamaBot is that there is one general and valuable use case for LLMs: we can use natural language to build text-in, text-out robots. Granted, this is less rigorous than using formal programming languages, but this is really useful for applications that need human-sounding natural language outputs. I'm also not the first to arrive at this conclusion: OpenAI's "GPTs" feature is also the result of this insight!
Additionally, the mechanics of sending messages to these APIs mean that we need to compose a collection of messages for each request. The APIs (OpenAI, Mistral, etc.) are stateless, meaning they do not remember the previous context; this is a natural consequence of how the underlying neural network models are trained. What usually differs between SimpleBot, ChatBot, QueryBot, and perhaps future XBots is how the messages are composed before being sent to the API.
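Because the APIs are stateless, every request has to carry whatever context we want the model to "remember". Here is a rough sketch of that mechanic using LiteLLM's completion function; the model string and messages are illustrative.

```python
from litellm import completion

system = {"role": "system", "content": "You are a terse assistant."}
turn_1 = {"role": "user", "content": "My name is Eric."}

# First request: system prompt plus the first user message.
reply_1 = completion(model="gpt-4", messages=[system, turn_1])
assistant_1 = {"role": "assistant", "content": reply_1.choices[0].message.content}

# Second request: the API has no memory of the first call,
# so we must re-send the earlier turns ourselves.
turn_2 = {"role": "user", "content": "What is my name?"}
reply_2 = completion(model="gpt-4", messages=[system, turn_1, assistant_1, turn_2])
```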
Finally, the mechanics of streaming, which usually involve copying and pasting the same chunk of code, feel stable enough that they should be abstracted behind a boolean toggle.
Taking all of this together, I thought that if I could use SimpleBot to simplify the interface to LiteLLM, we could then use SimpleBot to bang out specialized little bots that operate within a bigger XBot for various other applications. The result is a new SimpleBot interface:
```python
class SimpleBot:
    def __init__(
        self,
        system_prompt: str,
        temperature=0.0,
        model_name=default_language_model(),
        stream=True,
        json_mode: bool = False,
    ):
        self.system_prompt: SystemMessage = SystemMessage(content=system_prompt)
        self.model_name = model_name
        self.temperature = temperature
        self.stream = stream
        self.json_mode = json_mode

    def __call__(self, human_message: str) -> AIMessage:
        messages: list[BaseMessage] = [
            self.system_prompt,
            HumanMessage(content=human_message),
        ]
        response = self.generate_response(messages)
        autorecord(human_message, response.content)
        return response

    def generate_response(self, messages: list[BaseMessage]) -> AIMessage:
        """Generate a response from the given messages."""
        messages_dumped: list[dict] = [m.model_dump() for m in messages]
        completion_kwargs = dict(
            model=self.model_name,
            messages=messages_dumped,
            temperature=self.temperature,
            stream=self.stream,
        )
        if self.json_mode:
            completion_kwargs["response_format"] = {"type": "json_object"}
        response = completion(**completion_kwargs)

        if self.stream:
            ai_message = ""
            for chunk in response:
                delta = chunk.choices[0].delta.content
                if delta is not None:
                    print(delta, end="")
                    ai_message += delta
            return AIMessage(content=ai_message)

        return AIMessage(content=response.choices[0].message.content)
```
(I omitted the docstrings here for ease of reading, but the actual thing has full docstrings!)
Here, __call__ is the high-level interface that does str --> AIMessage, while generate_response is a lower-level interface that does list[{Human/System}Message] --> AIMessage. The key difference is the input: __call__ lets us pass in a single string, while generate_response lets a developer compose a more complex suite of messages to send to the downstream APIs. In both cases, to ensure uniformity across downstream bots, __call__ is intentionally designed to return an AIMessage rather than a str, though I may revisit that design decision in the future. After all, turning __call__ into a str --> str interface is attractive too!

Additionally, note how generate_response contains the code for streaming. By abstracting that code out, we can toggle between streamed and non-streamed results at will without needing to bother with that chunk of code. As has been my rule of thumb, the time for a refactor is as soon as we copy/paste the code!
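As a quick usage sketch (the system prompt and question below are made up):

```python
# Hypothetical usage of the refactored SimpleBot.
explainer = SimpleBot(
    system_prompt="You explain concepts simply and concretely.",
    stream=False,
)

# High-level interface: a single string in, an AIMessage out.
reply = explainer("Why is the sky blue?")
print(reply.content)

# Lower-level interface: compose the message list yourself.
messages = [explainer.system_prompt, HumanMessage(content="Why is the sky blue?")]
reply = explainer.generate_response(messages)
```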
With this refactor, SimpleBot can be used standalone or made part of other bots. Here's an example from ChatBot: we use the Mixin pattern to compose ChatBot from SimpleBot and History.
```python
class ChatBot(SimpleBot, History):
    def __init__(
        self,
        system_prompt: str,
        session_name: str,
        temperature=0.0,
        model_name=default_language_model(),
        stream=True,
        response_budget=2_000,
    ):
        SimpleBot.__init__(
            self,
            system_prompt=system_prompt,
            temperature=temperature,
            model_name=model_name,
            stream=stream,
        )
        History.__init__(self, session_name=session_name)
        self.response_budget = response_budget

    def __call__(self, message: str) -> AIMessage:
        human_message = HumanMessage(content=message)
        history = self.retrieve(
            query=human_message, character_budget=self.response_budget
        )
        messages = [self.system_prompt] + history + [human_message]
        response = self.generate_response(messages)
        autorecord(message, response.content)
        self.append(human_message)
        self.append(response)
        return response
```
Notice how ChatBot now inherits all of the class methods from SimpleBot and History. Therefore, it can use generate_response from SimpleBot and append from History. Its __call__ is a bit more complicated than SimpleBot's, so the custom logic gets placed there. After all, ChatBot is nothing more than the two things mashed together. This is called the Mixin pattern, where a composite class inherits from two or more parent classes, each with its own unique set of attributes and class methods. Using the Mixin pattern results in more composable class definitions, but it can also be a heavier mental burden for a developer.
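A quick usage sketch of the mixed-in result (the session name and questions are made up):

```python
# Hypothetical usage of the Mixin-based ChatBot.
chat = ChatBot(
    system_prompt="You are a helpful coding assistant.",
    session_name="llamabot-refactor-demo",
    stream=False,
)

chat("How do I reverse a list in Python?")  # appended to history
chat("And a string?")  # prior turns are retrieved and re-sent automatically

# Because History is mixed in, its attributes live directly on the bot.
print(len(chat.messages))
```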
That said, this isn't the only way to build a ChatBot. Instead of using the Mixin pattern, we could hard-code a self.bot = SimpleBot(...) and a self.history = History(...) inside the class. Indeed, this was the original updated design of ChatBot before I experimented with the Mixin pattern. It has some advantages; for example, I could have both a DocumentStore and a History system (which, if you remember, share the same interface); those interfaces would invariably clash if we tried to mix in both. But it also has some disadvantages: if I wanted to access an attribute such as the model_name, I would have to do self.bot.model_name; and if I wanted to use self.model_name instead, I would have had to set the attribute in __init__ as well -- a duplication of information.
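For contrast, here is a rough sketch of what that composition-based design might look like. This is my reconstruction of the alternative, not code that ships in LlamaBot.

```python
# A hypothetical composition-based ChatBot, for contrast with the Mixin version.
class ComposedChatBot:
    def __init__(self, system_prompt: str, session_name: str, response_budget: int = 2_000):
        self.bot = SimpleBot(system_prompt=system_prompt)
        self.history = History(session_name=session_name)
        self.response_budget = response_budget

    def __call__(self, message: str) -> AIMessage:
        human_message = HumanMessage(content=message)
        context = self.history.retrieve(
            query=human_message, character_budget=self.response_budget
        )
        messages = [self.bot.system_prompt] + context + [human_message]
        response = self.bot.generate_response(messages)
        self.history.append(human_message)
        self.history.append(response)
        return response

# Attribute access now goes through the sub-objects, e.g. bot.bot.model_name.
```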
The same can be said for QueryBot; here's what it looks like:
```python
class QueryBot(SimpleBot, DocumentStore):
    def __init__(
        self,
        system_prompt: str,
        document_paths: Path | list[Path],
        collection_name: str,
        temperature: float = 0.0,
        model_name: str = default_language_model(),
        stream=True,
    ):
        SimpleBot.__init__(
            self,
            system_prompt=system_prompt,
            temperature=temperature,
            model_name=model_name,
            stream=stream,
        )
        DocumentStore.__init__(self, collection_name=collection_name)
        self.add_documents(document_paths=document_paths)
        self.response_budget = 2_000

    def __call__(self, query: str, n_results: int = 20) -> AIMessage:
        messages = [self.system_prompt]
        context_budget = model_context_window_sizes.get(
            self.model_name, DEFAULT_TOKEN_BUDGET
        )
        retrieved = retrieve_messages_up_to_budget(
            messages=[
                RetrievedMessage(content=chunk)
                for chunk in self.retrieve(query, n_results=n_results)
            ],
            character_budget=context_budget - self.response_budget,
        )
        messages.extend(retrieved)
        messages.append(HumanMessage(content=query))
        response: AIMessage = self.generate_response(messages)
        return response
```
Similar to ChatBot, __call__ is the only overridden class method. Because we inherit from DocumentStore, the add_documents class method becomes available to QueryBot, so we can do things like:
```python
qb = QueryBot(system_prompt=..., collection_name=..., document_paths=[..., ...])

# do stuff with qb
# ...

# then realize we need another document added in:
qb.add_documents(document_paths=...)
```
The Mixin pattern is one that I have come to appreciate. It encourages composability through modularity in our Python classes, and in doing so, it helps force clarity in thinking when designing them. It's like being spiritually functional in style, even though we have stateful objects. As you can tell, many other bots we can build are nothing more than a SimpleBot composed together with some storage system, whether that storage system is History or DocumentStore. The biggest thing that differentiates SimpleBot, ChatBot, QueryBot, and maybe even a ChatQueryBot is how messages get put together to be sent over the wire (the API) and how documents are retrieved. LlamaBot's new design reflects this updated knowledge.
At the same time, I will freely admit that as a trained biologist, I fantasized for a moment about building bots composed of other bots... just like how biological systems are composed!
While working with the OpenAI API, measuring the length of texts in tokens was natural. However, now that I've worked with multiple LLMs and their providers, I've found it opaque to know what tokenization scheme each model uses. As such, I decided to switch back to counting the length of strings by the number of characters. Because a character count over-estimates the number of tokens used, doing so keeps us conservatively under the actual token limit, and it also gives a more humanly understandable way of measuring sequence length, which is a bit more user-friendly.
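The retrieve_messages_up_to_budget helper used above isn't shown in this post; here is a minimal sketch of how a character-budget retrieval like it might work, assuming it walks the history backwards from the most recent message.

```python
def retrieve_messages_up_to_budget(messages, character_budget: int):
    """Hypothetical sketch: keep the most recent messages whose
    combined character count fits within the budget."""
    kept, used = [], 0
    for message in reversed(messages):
        used += len(message.content)
        if used > character_budget:
            break
        kept.append(message)
    return list(reversed(kept))
```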
Pausing actual coding on LlamaBot to think carefully about how to organize the underlying code is something I had felt the need to do for a while, especially as the tooling for building LLM applications begins to mature. The highlight here was changing the abstractions so that composite bots -- e.g., bots with document retrieval -- can be built more efficiently. Greg Brockman once said that,
Much of the challenge in machine learning engineering is fitting the complexity into your head. Best way to approach is with great patience & a desire to dig into any detail.
Source: Twitter
The same goes for any software system being built, including LLM applications. This was the primary motivation for the refactor; the previous set of abstractions made it challenging for me to wrap my head around what was happening.
More than that, abstractions reflect working knowledge. If my working understanding of a problem domain is clean, I'll write code that cleanly maps to those abstractions. By contrast, if my working knowledge is fuzzy, I'll write fuzzy code. The current state of LlamaBot's code reflects my working knowledge of applied LLM usage. Hopefully, this iteration is much clearer than before!