
How to run Ollama with LlamaBot

written by Eric J. Ma on 2023-10-22 | tags: python large language models llms gitbot zotero library local llms ollama langchain openai gpt-4 prompt engineering llamabot


If you've been following the LlamaBot project, you know it's my pet Pythonic project to interface with Large Language Models (LLMs). I've had fun building some cool stuff, like GitBot, a chatbot for your Zotero library, and even a blogging assistant (more on that later, promise!).

However, there's one area I've shied away from until now: using local LLMs. The setup can be daunting, but I finally found a way to simplify it with Ollama.

A request sparks an idea

A community member posted an issue on GitHub:

"Hi @ericmjl, thank you for making this! I'd sent a PR but it's a little beyond me but I was wondering if there is a simple way to use local models such as Ollama?"

And I thought, why not? Can it work? Let's find out.

First steps with Ollama

Ollama made setting up local LLMs a breeze. I was pleasantly surprised at the smooth installation process; you simply need to follow the instructions on their main page.

Two ways to run Ollama models

Once it's installed, there are two ways to access Ollama models.

  1. Chat in the Terminal: Run ollama run <model name>
  2. Local API Mode: Run ollama serve (see the sketch below for how to call it)

One thing to note: Ollama pulls in models on the fly. They range from 3GB to 16GB, so you may need to be patient while they download.
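If you go the local API route, you can talk to the server over plain HTTP. Here's a minimal sketch, assuming Ollama is serving on its default port (11434) and that you've already pulled the mistral model; swap in whichever model you actually downloaded:

import json

import requests

# Assumes `ollama pull mistral` has completed and `ollama serve` is running.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?"},
    stream=True,
)

# Ollama streams back newline-delimited JSON; print tokens as they arrive.
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)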

Ollama + LlamaBot: How I integrated them

A happy architectural decision

My earlier decision to use LangChain paid off, despite all the frustration of tracking a fast-evolving Python package. LangChain's architecture made it straightforward to write a model dispatcher.

from langchain.callbacks.base import BaseCallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chat_models import ChatOllama, ChatOpenAI

# ollama_model_keywords is a list of Ollama model names
# (e.g. "llama2", "mistral") defined elsewhere in LlamaBot.


def create_model(
    model_name,
    temperature=0.0,
    streaming=True,
    verbose=True,
):
    """Dispatch and create the right model.

    This is necessary to validate b/c LangChain doesn't do the validation for us.

    :param model_name: The name of the model to use.
    :param temperature: The model temperature to use.
    :param streaming: (LangChain config) Whether to stream the output to stdout.
    :param verbose: (LangChain config) Whether to print debug messages.
    :return: The model.
    """
    # Default to OpenAI; switch to Ollama when the model name (before any
    # ":tag" suffix, e.g. "llama2:13b") matches a known Ollama model.
    ModelClass = ChatOpenAI
    if model_name.split(":")[0] in ollama_model_keywords:
        ModelClass = ChatOllama

    return ModelClass(
        model_name=model_name,
        temperature=temperature,
        streaming=streaming,
        verbose=verbose,
        callback_manager=BaseCallbackManager(
            handlers=[StreamingStdOutCallbackHandler()] if streaming else []
        ),
    )
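To make the dispatch concrete, here's roughly how it behaves (assuming "llama2" is among the Ollama keywords; this is an illustration, not LlamaBot's actual test code):

model = create_model("llama2:13b")  # prefix "llama2" matches -> ChatOllama
model = create_model("gpt-4")  # no match -> falls through to ChatOpenAI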

Making the adjustments

I made some tweaks to SimpleBot, ChatBot, and QueryBot to ensure that they work with Ollama models.
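I won't reproduce the diffs here, but the shape of the change is roughly the following sketch (simplified, not the exact LlamaBot source): each bot builds its model through create_model instead of constructing ChatOpenAI directly, so an Ollama model name transparently yields a ChatOllama instance.

class SimpleBot:
    def __init__(self, system_prompt, temperature=0.0, model_name="gpt-4"):
        self.system_prompt = system_prompt
        # Dispatch on the model name: Ollama names get ChatOllama,
        # everything else gets ChatOpenAI.
        self.model = create_model(model_name, temperature=temperature)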

How to use Ollama models with LlamaBot

So, how exactly do we use Ollama models with LlamaBot?

First, pull the model you want and start the Ollama server:

ollama pull <ollama model name>
ollama serve

Then, in your Jupyter notebook, initialize the bot:

bot = SimpleBot(..., model_name=<ollama model name>)

And that's it!
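Concretely, using mistral as the example model name (substitute whichever model you pulled; the system prompt and question are just placeholders), that looks like:

from llamabot import SimpleBot

# Assumes `ollama pull mistral` has finished and `ollama serve` is running.
bot = SimpleBot(
    "You are a helpful assistant.",  # system prompt
    model_name="mistral",
)
bot("Explain what a large language model is in one sentence.")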

A Quick Demo with Zotero Chat

To give you a taste, I've already enabled Zotero chat to use Ollama models. Try it out:

llamabot zotero chat --model-name <ollama model name>

What's more

While OpenAI's GPT-4 still sets the benchmark in speed and response quality, local models offer the freedom of running without API costs. This opens up new avenues for fine-tuning and prompt engineering!

Thanks for reading, and happy coding!


I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.

If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!

Finally, I do free 30-minute GenAI strategy calls for organizations who are seeking guidance on how to best leverage this technology. Consider booking a call on Calendly if you're interested!