
Deploying Ollama on Modal

written by Eric J. Ma on 2024-11-14 | tags: modal deployment open source api cloud gpu software models ollama large language models


In this blog post, I share my journey of deploying Ollama to Modal, enhancing my understanding of Modal's capabilities. I detail the script used, the setup of the Modal app, and the deployment process, which includes ensuring the Ollama service is ready and operational. I also implement an OpenAI-compatible endpoint that makes it easy to use the deployment with existing tools and libraries. This exploration not only expanded my technical skills but also created a practical solution for using open-source models in production. Curious about how this deployment could streamline your projects?

I recently learned how to deploy Ollama to Modal! I mostly copied code from another source, but I modified it enough that I think I've upgraded my mental model of Modal, so I want to leave some notes. My motivation was to gain access to open source models that are larger than what fits comfortably on my 16GB M1 MacBook Air.

Credits

In this case, I feel obliged to give credit where credit is due:

  • The Modal Blog has a lot of great resources.
  • The original code by Irfan Sharif was great for my learning journey.

The script

If you're here just for the script, this is the Modal-specific part you'll want (the FastAPI api object that serve() returns, along with the OpenAI-compatible route, lives in the same endpoint.py and is shown later in this post):

# file: endpoint.py
import modal
import os
import subprocess
import time

MODEL = os.environ.get("MODEL", "llama3.1")

DEFAULT_MODELS = ["llama3.1", "gemma2:9b", "phi3", "qwen2.5:32b"]


def pull():
    subprocess.run(["systemctl", "daemon-reload"])
    subprocess.run(["systemctl", "enable", "ollama"])
    subprocess.run(["systemctl", "start", "ollama"])
    wait_for_ollama()
    for model in DEFAULT_MODELS:
        subprocess.run(["ollama", "pull", model], stdout=subprocess.PIPE)


def wait_for_ollama(timeout: int = 30, interval: int = 2) -> None:
    """Wait for the Ollama service to be ready.

    :param timeout: Maximum time to wait in seconds
    :param interval: Time between checks in seconds
    :raises TimeoutError: If the service is not ready within `timeout` seconds
    """
    import httpx
    from loguru import logger

    start_time = time.time()
    while True:
        try:
            response = httpx.get("http://localhost:11434/api/version")
            if response.status_code == 200:
                logger.info("Ollama service is ready")
                return
        except httpx.ConnectError:
            # Not accepting connections yet; fall through and retry below.
            pass
        if time.time() - start_time > timeout:
            raise TimeoutError("Ollama service failed to start")
        logger.info(
            f"Waiting for Ollama service... ({int(time.time() - start_time)}s)"
        )
        time.sleep(interval)


image = (
    modal.Image.debian_slim()
    .apt_install("curl", "systemctl")
    .run_commands(  # from https://github.com/ollama/ollama/blob/main/docs/linux.md
        "curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz",
        "tar -C /usr -xzf ollama-linux-amd64.tgz",
        "useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama",
        "usermod -a -G ollama $(whoami)",
    )
    .copy_local_file("ollama.service", "/etc/systemd/system/ollama.service")
    .pip_install("ollama", "httpx", "loguru")
    .run_function(pull)
)
app = modal.App(name="ollama", image=image)


@app.cls(
    gpu=modal.gpu.A10G(count=1),
    container_idle_timeout=300,
)
class Ollama:
    @modal.build()
    def build(self):
        subprocess.run(["systemctl", "daemon-reload"])
        subprocess.run(["systemctl", "enable", "ollama"])

    @modal.enter()
    def enter(self):
        subprocess.run(["systemctl", "start", "ollama"])
        wait_for_ollama()
        subprocess.run(["ollama", "pull", MODEL])

    @modal.asgi_app()
    def serve(self):
        """Serve the FastAPI application.

        :return: FastAPI application instance
        """
        return api

Breakdown

Let me break down what I'm doing in each function.

First, the pull function is used during the image build (more on that later). I decided that my image should bake in at least the default models listed in DEFAULT_MODELS.

DEFAULT_MODELS = ["llama3.1", "gemma2:9b", "phi3", "qwen2.5:32b"]

def pull():
    subprocess.run(["systemctl", "daemon-reload"])
    subprocess.run(["systemctl", "enable", "ollama"])
    subprocess.run(["systemctl", "start", "ollama"])
    wait_for_ollama()
    for model in DEFAULT_MODELS:
        subprocess.run(["ollama", "pull", model], stdout=subprocess.PIPE)

The image build definition starts from a debian_slim base image. We then install Linux system packages, download and install Ollama following its Linux install instructions, copy in the ollama.service systemd unit (a local file that needs to sit next to endpoint.py so copy_local_file can find it; the deploy output below shows it being mounted), and pip install additional Python dependencies. Finally, we run the pull function so that the pre-baked models are ready within the container. This design choice adds gigabytes to the image size, but it lets us sidestep downloading models on the fly while the container is running live.

image = (
    modal.Image.debian_slim()
    .apt_install("curl", "systemctl")
    .run_commands(  # from https://github.com/ollama/ollama/blob/main/docs/linux.md
        "curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz",
        "tar -C /usr -xzf ollama-linux-amd64.tgz",
        "useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama",
        "usermod -a -G ollama $(whoami)",
    )
    .copy_local_file("ollama.service", "/etc/systemd/system/ollama.service")
    .pip_install("ollama", "httpx", "loguru")
    .run_function(pull)
)

With the container image defined, I then define the Modal App:

app = modal.App(name="ollama", image=image)

This creates a Modal app. From there, I can create the actual app class where execution happens.

@app.cls(
    gpu=modal.gpu.A10G(count=1),
    container_idle_timeout=300,
)
class Ollama:
    @modal.build()
    def build(self):
        subprocess.run(["systemctl", "daemon-reload"])
        subprocess.run(["systemctl", "enable", "ollama"])

    @modal.enter()
    def enter(self):
        subprocess.run(["systemctl", "start", "ollama"])
        wait_for_ollama()
        subprocess.run(["ollama", "pull", MODEL])

    @modal.asgi_app()
    def serve(self):
        """Serve the FastAPI application.

        :return: FastAPI application instance
        """
        return api

In this class, I am taking advantage of the fact that I can explicitly define what happens at each stage of the container lifecycle. At build time, after the image is built, we make sure the ollama service is enabled; as soon as we enter the container runtime, we start ollama, wait for it to come up, and pull the model named by the MODEL environment variable. Finally, we have the serve method decorated with @modal.asgi_app(), which exposes our FastAPI application. This is the crucial piece that ties everything together: it tells Modal to serve our FastAPI application, including our OpenAI-compatible endpoint, over HTTP. I also make sure that we are using a GPU-enabled container.

The serve() method is particularly important here. By decorating it with @modal.asgi_app(), we're telling Modal that this method should be used to serve our web application. The method simply returns our FastAPI api instance, which contains all our route definitions, including the OpenAI-compatible /v1/chat/completions endpoint. Modal takes care of all the underlying infrastructure needed to expose this API to the internet, including setting up the proper HTTP server and routing.
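
One thing the script at the top doesn't show is where api comes from: it's a plain FastAPI instance defined at module level in the same endpoint.py, next to the route definitions that appear later in this post. Roughly, that glue looks like this (trimmed to the imports the OpenAI-compatible route needs):

# Module-level setup that serve() and the /v1/chat/completions route rely on.
# Rough sketch: trimmed to what the code in this post actually uses.
import json
from typing import Any, AsyncGenerator, List, Optional

import ollama
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

# The ASGI app that @modal.asgi_app() exposes.
api = FastAPI()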

I also have a wait_for_ollama helper function, which is called both during the image build (in pull) and at container startup (in enter):

def wait_for_ollama(timeout: int = 30, interval: int = 2) -> None:
    """Wait for the Ollama service to be ready.

    :param timeout: Maximum time to wait in seconds
    :param interval: Time between checks in seconds
    :raises TimeoutError: If the service is not ready within `timeout` seconds
    """
    import httpx
    from loguru import logger

    start_time = time.time()
    while True:
        try:
            response = httpx.get("http://localhost:11434/api/version")
            if response.status_code == 200:
                logger.info("Ollama service is ready")
                return
        except httpx.ConnectError:
            # Not accepting connections yet; fall through and retry below.
            pass
        if time.time() - start_time > timeout:
            raise TimeoutError("Ollama service failed to start")
        logger.info(
            f"Waiting for Ollama service... ({int(time.time() - start_time)}s)"
        )
        time.sleep(interval)

This function exists because I noticed there were occasions when the Ollama service needed additional time to come up before I could interact with it. In those circumstances, wait_for_ollama let me avoid having the program error out just because Ollama wasn't ready at that exact moment.

Deploy

Now, with this Python script in place, we can deploy it onto Modal:

modal deploy endpoint.py

Modal will build the container on the cloud and deploy it:

 modal deploy endpoint.py
✓ Created objects.
├── 🔨 Created mount /Users/ericmjl/github/incubator/ollama-modal/endpoint.py
├── 🔨 Created mount ollama.service
├── 🔨 Created function pull.
├── 🔨 Created function Ollama.build.
├── 🔨 Created function Ollama.*.
└── 🔨 Created web function Ollama.v1_chat_completions => https://<autogenerated_subdomain>.modal.run
✓ App deployed in 3.405s! 🎉

View Deployment: https://modal.com/apps/<username>/main/deployed/<app_name>

It won't always take ~3 seconds; on the first build, expect something closer to 5 minutes.

Now, because FastAPI auto-generates interactive documentation, I have access to a Swagger UI at the deployment's /docs path, which lets me hand-test the API before I try to call it programmatically:

Swagger API
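
If I want to poke at the endpoint from code rather than the Swagger UI, a quick smoke test with httpx does the trick. This is just a sketch: swap in the autogenerated *.modal.run URL that modal deploy prints, and any of the pre-baked models.

import httpx

# Substitute the URL that `modal deploy` printed for your deployment.
URL = "https://<autogenerated_subdomain>.modal.run/v1/chat/completions"

response = httpx.post(
    URL,
    json={
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    },
    timeout=120,  # cold starts plus model loading can take a while
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])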

OpenAI-Compatible Endpoint

After my initial deployment, I realized that making the endpoint OpenAI-compatible would make it much easier to integrate with existing tools. The updated code now includes a proper OpenAI-compatible chat completions endpoint that follows the same format as OpenAI's API:

class ChatMessage(BaseModel):
    """A single message in a chat completion request.

    Represents one message in the conversation history, following OpenAI's chat format.
    """
    role: str = Field(..., description="The role of the message sender (e.g. 'user', 'assistant')")
    content: str = Field(..., description="The content of the message")


class ChatCompletionRequest(BaseModel):
    """Request model for chat completions.

    Follows OpenAI's chat completion request format, supporting both streaming
    and non-streaming responses.
    """
    model: Optional[str] = Field(default=MODEL, description="The model to use for completion")
    messages: List[ChatMessage] = Field(..., description="The messages to generate a completion for")
    stream: bool = Field(default=False, description="Whether to stream the response")


@api.post("/v1/chat/completions")
async def v1_chat_completions(request: ChatCompletionRequest) -> Any:
    """Handle chat completion requests in OpenAI-compatible format.

    :param request: Chat completion parameters
    :return: Chat completion response in OpenAI-compatible format, or StreamingResponse if streaming
    :raises HTTPException: If the request is invalid or processing fails
    """
    try:
        if not request.messages:
            raise HTTPException(
                status_code=400,
                detail="Messages array is required and cannot be empty",
            )

        if request.stream:
            async def generate_stream() -> AsyncGenerator[str, None]:
                response = ollama.chat(
                    model=request.model,
                    messages=[msg.model_dump() for msg in request.messages],
                    stream=True,
                )

                for chunk in response:
                    chunk_data = {
                        "id": "chat-" + str(int(time.time())),
                        "object": "chat.completion.chunk",
                        "created": int(time.time()),
                        "model": request.model,
                        "choices": [
                            {
                                "index": 0,
                                "delta": {
                                    "role": "assistant",
                                    "content": chunk["message"]["content"],
                                },
                                "finish_reason": None,
                            }
                        ],
                    }
                    yield f"data: {json.dumps(chunk_data)}\n\n"

                final_chunk = {
                    "id": "chat-" + str(int(time.time())),
                    "object": "chat.completion.chunk",
                    "created": int(time.time()),
                    "model": request.model,
                    "choices": [
                        {
                            "index": 0,
                            "delta": {},
                            "finish_reason": "stop",
                        }
                    ],
                }
                yield f"data: {json.dumps(final_chunk)}\n\n"
                yield "data: [DONE]\n\n"

            return StreamingResponse(
                generate_stream(),
                media_type="text/event-stream",
            )

        # Non-streaming response
        response = ollama.chat(
            model=request.model,
            messages=[msg.model_dump() for msg in request.messages]
        )

        return {
            "id": "chat-" + str(int(time.time())),
            "object": "chat.completion",
            "created": int(time.time()),
            "model": request.model,
            "choices": [
                {
                    "index": 0,
                    "message": {
                        "role": "assistant",
                        "content": response["message"]["content"],
                    },
                    "finish_reason": "stop",
                }
            ],
            "usage": {
                "prompt_tokens": -1,  # Ollama doesn't provide token counts
                "completion_tokens": -1,
                "total_tokens": -1,
            },
        }

    except HTTPException:
        # Re-raise client errors (like the 400 above) instead of wrapping them in a 500.
        raise
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Error processing chat completion: {str(e)}",
        )

This implementation provides both streaming and non-streaming responses in the same format as OpenAI's API (token usage is stubbed with -1, since Ollama's chat response here doesn't expose counts). The endpoint is accessible at /v1/chat/completions, making it a drop-in replacement for OpenAI's API in tools like LlamaBot that use LiteLLM as their API switchboard.

The endpoint supports:

  • Full chat history with multiple messages
  • Streaming responses for real-time text generation (see the consumption sketch after this list)
  • OpenAI-compatible response format
  • Error handling with appropriate HTTP status codes

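For a sense of what consuming the streaming variant looks like, here's a rough client-side sketch that reads the server-sent events emitted above (again, swap in your own deployment's URL):

import json

import httpx

# Substitute your deployment's URL.
URL = "https://<autogenerated_subdomain>.modal.run/v1/chat/completions"
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": True,
}

with httpx.stream("POST", URL, json=payload, timeout=None) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        data = line.removeprefix("data: ")
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
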
With this in place, I can now use my Modal-deployed Ollama endpoint with any tool that supports OpenAI's API format. For example, with LlamaBot, I just need to specify the base URL of my Modal deployment:

from llamabot import SimpleBot

bot = SimpleBot(
    model_name="ollama/qwq",  # Specifying the Ollama model
    api_base="https://<my-modal-deployment>.modal.run/v1",  # Point to Modal deployment
    system_prompt="You are a helpful assistant.",
)

response = bot("Hello!")

This makes it incredibly easy to experiment with different open-source models while maintaining compatibility with existing tools and workflows.
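
And because the endpoint speaks OpenAI's wire format, the official openai Python client should work pointed at the same base URL too. Roughly (there's no auth on this endpoint, but the client insists on an api_key value):

from openai import OpenAI

client = OpenAI(
    base_url="https://<my-modal-deployment>.modal.run/v1",
    api_key="not-needed",  # no auth on this endpoint, but the client requires a value
)

completion = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)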


Cite this blog post:
@article{
    ericmjl-2024-deploying-modal,
    author = {Eric J. Ma},
    title = {Deploying Ollama on Modal},
    year = {2024},
    month = {11},
    day = {14},
    howpublished = {\url{https://ericmjl.github.io}},
    journal = {Eric J. Ma's Blog},
    url = {https://ericmjl.github.io/blog/2024/11/14/deploying-ollama-on-modal},
}
  

I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.

If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!

Finally, I do free 30-minute GenAI strategy calls for teams that are looking to leverage GenAI for maximum impact. Consider booking a call on Calendly if you're interested!