written by Eric J. Ma on 2024-11-14 | tags: modal deployment open source api cloud gpu software models ollama large language models
In this blog post, I share my journey of deploying Ollama to Modal, enhancing my understanding of Modal's capabilities. I detail the script used, the setup of the Modal app, and the deployment process, which includes ensuring the Ollama service is ready and operational. I also implement an OpenAI-compatible endpoint that makes it easy to use the deployment with existing tools and libraries. This exploration not only expanded my technical skills but also created a practical solution for using open-source models in production. Curious about how this deployment could streamline your projects?
I recently learned how to deploy Ollama to Modal! I mostly copied code from another source but modified it just enough that I think I have upgraded my mental model of Modal, and I want to leave notes. My motivation was to gain access to open-source models that are too large to fit comfortably on my 16GB M1 MacBook Air.
In this case, I feel obliged to give credit where credit is due:
If you're here just for the code, then you'll want to check out my modal-deployments repository!
I have also embedded the code below for reference:
"""FastAPI endpoint for Ollama chat completions with OpenAI-compatible API. This module provides a FastAPI application that serves as a bridge between clients and Ollama models, offering an OpenAI-compatible API interface. It supports both streaming and non-streaming responses. """ import modal import os import subprocess import time from fastapi import FastAPI, HTTPException from fastapi.responses import StreamingResponse from typing import List, Any, Optional, AsyncGenerator from pydantic import BaseModel, Field MODEL = os.environ.get("MODEL", "gemma2:27b") DEFAULT_MODELS = ["gemma2:27b"] def pull() -> None: """Initialize and pull the Ollama model. Sets up the Ollama service using systemctl and pulls the specified model. """ subprocess.run(["systemctl", "daemon-reload"]) subprocess.run(["systemctl", "enable", "ollama"]) subprocess.run(["systemctl", "start", "ollama"]) wait_for_ollama() subprocess.run(["ollama", "pull", MODEL], stdout=subprocess.PIPE) def wait_for_ollama(timeout: int = 30, interval: int = 2) -> None: """Wait for Ollama service to be ready. :param timeout: Maximum time to wait in seconds :param interval: Time between checks in seconds :raises TimeoutError: If the service doesn't start within the timeout period """ import httpx from loguru import logger start_time = time.time() while True: try: response = httpx.get("http://localhost:11434/api/version") if response.status_code == 200: logger.info("Ollama service is ready") return except httpx.ConnectError: if time.time() - start_time > timeout: raise TimeoutError("Ollama service failed to start") logger.info( f"Waiting for Ollama service... ({int(time.time() - start_time)}s)" ) time.sleep(interval) # Configure Modal image with Ollama dependencies image = ( modal.Image.debian_slim() .apt_install("curl", "systemctl") .run_commands( # from https://github.com/ollama/ollama/blob/main/docs/linux.md "curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz", "tar -C /usr -xzf ollama-linux-amd64.tgz", "useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama", "usermod -a -G ollama $(whoami)", ) .copy_local_file("ollama.service", "/etc/systemd/system/ollama.service") .pip_install("ollama", "httpx", "loguru") .run_function(pull) ) app = modal.App(name="ollama", image=image) api = FastAPI() class ChatMessage(BaseModel): """A single message in a chat completion request. Represents one message in the conversation history, following OpenAI's chat format. """ role: str = Field( ..., description="The role of the message sender (e.g. 'user', 'assistant')" ) content: str = Field(..., description="The content of the message") class ChatCompletionRequest(BaseModel): """Request model for chat completions. Follows OpenAI's chat completion request format, supporting both streaming and non-streaming responses. """ model: Optional[str] = Field( default=MODEL, description="The model to use for completion" ) messages: List[ChatMessage] = Field( ..., description="The messages to generate a completion for" ) stream: bool = Field(default=False, description="Whether to stream the response") @api.post("/v1/chat/completions") async def v1_chat_completions(request: ChatCompletionRequest) -> Any: """Handle chat completion requests in OpenAI-compatible format. 
:param request: Chat completion parameters :return: Chat completion response in OpenAI-compatible format, or StreamingResponse if streaming :raises HTTPException: If the request is invalid or processing fails """ import ollama # Import here to ensure it's available in the Modal container import json try: if not request.messages: raise HTTPException( status_code=400, detail="Messages array is required and cannot be empty", ) if request.stream: async def generate_stream() -> AsyncGenerator[str, None]: """Generate streaming response chunks. :return: AsyncGenerator yielding SSE-formatted JSON strings """ response = ollama.chat( model=request.model, messages=[msg.dict() for msg in request.messages], stream=True, ) for chunk in response: chunk_data = { "id": "chat-" + str(int(time.time())), "object": "chat.completion.chunk", "created": int(time.time()), "model": request.model, "choices": [ { "index": 0, "delta": { "role": "assistant", "content": chunk["message"]["content"], }, "finish_reason": None, } ], } yield f"data: {json.dumps(chunk_data)}\n\n" # Send final chunk with finish_reason final_chunk = { "id": "chat-" + str(int(time.time())), "object": "chat.completion.chunk", "created": int(time.time()), "model": request.model, "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", } ], } yield f"data: {json.dumps(final_chunk)}\n\n" yield "data: [DONE]\n\n" return StreamingResponse( generate_stream(), media_type="text/event-stream", ) # Non-streaming response response = ollama.chat( model=request.model, messages=[msg.model_dump() for msg in request.messages] ) return { "id": "chat-" + str(int(time.time())), "object": "chat.completion", "created": int(time.time()), "model": request.model, "choices": [ { "index": 0, "message": { "role": "assistant", "content": response["message"]["content"], }, "finish_reason": "stop", } ], "usage": { "prompt_tokens": -1, # Ollama doesn't provide token counts "completion_tokens": -1, "total_tokens": -1, }, } except Exception as e: raise HTTPException( status_code=500, detail=f"Error processing chat completion: {str(e)}" ) @app.cls( gpu=modal.gpu.A10G(count=1), container_idle_timeout=10, ) class Ollama: """Modal container class for running Ollama service. Handles initialization, startup, and serving of the Ollama model through FastAPI. """ def __init__(self): """Initialize the Ollama service.""" self.serve() @modal.build() def build(self): """Build step for Modal container setup.""" subprocess.run(["systemctl", "daemon-reload"]) subprocess.run(["systemctl", "enable", "ollama"]) @modal.enter() def enter(self): """Entry point for Modal container. Starts Ollama service and pulls the specified model. """ subprocess.run(["systemctl", "start", "ollama"]) wait_for_ollama() subprocess.run(["ollama", "pull", MODEL]) @modal.asgi_app() def serve(self): """Serve the FastAPI application. :return: FastAPI application instance """ return api ## Code Walkthrough Let's walk through the code step by step to understand how it works. ### Core Configuration and Dependencies At the start, we define our default model and a list of models we want to pre-bake into our container: ```python MODEL = os.environ.get("MODEL", "gemma2:27b") DEFAULT_MODELS = ["gemma2:27b"]
We have two important helper functions that manage the Ollama service:
`pull()`: This function initializes the Ollama service and pulls the specified model:

```python
def pull() -> None:
    subprocess.run(["systemctl", "daemon-reload"])
    subprocess.run(["systemctl", "enable", "ollama"])
    subprocess.run(["systemctl", "start", "ollama"])
    wait_for_ollama()
    subprocess.run(["ollama", "pull", MODEL], stdout=subprocess.PIPE)
```
`wait_for_ollama()`: This function ensures the Ollama service is ready before proceeding:

```python
def wait_for_ollama(timeout: int = 30, interval: int = 2) -> None:
    """Wait for Ollama service to be ready.

    :param timeout: Maximum time to wait in seconds
    :param interval: Time between checks in seconds
    """
    import httpx
    from loguru import logger

    start_time = time.time()
    while True:
        try:
            response = httpx.get("http://localhost:11434/api/version")
            if response.status_code == 200:
                logger.info("Ollama service is ready")
                return
        except httpx.ConnectError:
            if time.time() - start_time > timeout:
                raise TimeoutError("Ollama service failed to start")
            logger.info(
                f"Waiting for Ollama service... ({int(time.time() - start_time)}s)"
            )
            time.sleep(interval)
```
We define our Modal container image with all necessary dependencies:
```python
image = (
    modal.Image.debian_slim()
    .apt_install("curl", "systemctl")
    .run_commands(
        "curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz",
        "tar -C /usr -xzf ollama-linux-amd64.tgz",
        "useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama",
        "usermod -a -G ollama $(whoami)",
    )
    .copy_local_file("ollama.service", "/etc/systemd/system/ollama.service")
    .pip_install("ollama", "httpx", "loguru")
    .run_function(pull)
)
```
This image definition:

- starts from Modal's Debian slim base and apt-installs `curl` and `systemctl`,
- downloads and unpacks the Ollama Linux binary and creates an `ollama` system user,
- copies the local `ollama.service` systemd unit into the image,
- pip-installs the `ollama`, `httpx`, and `loguru` Python packages, and
- runs `pull()` at build time so the model weights are baked into the image.
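One thing worth flagging: `DEFAULT_MODELS` is defined, but as written only `MODEL` gets pulled. If you wanted to pre-bake several models into the image, a minimal sketch of how `pull()` could loop over the list (my own variation, not part of the script above) might look like this:

```python
def pull() -> None:
    """Start Ollama and pre-pull every model listed in DEFAULT_MODELS."""
    subprocess.run(["systemctl", "daemon-reload"])
    subprocess.run(["systemctl", "enable", "ollama"])
    subprocess.run(["systemctl", "start", "ollama"])
    wait_for_ollama()
    # Loop over the list instead of pulling a single model.
    for model in DEFAULT_MODELS:
        subprocess.run(["ollama", "pull", model], stdout=subprocess.PIPE)
```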
We define our API models using Pydantic for request validation:
```python
class ChatMessage(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    model: Optional[str] = Field(default=MODEL)
    messages: List[ChatMessage]
    stream: bool = Field(default=False)
```
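To make the request shape concrete, here's a small illustrative snippet (the payload is made up) showing how these Pydantic models validate an incoming request body:

```python
# Hypothetical request body, just to show what the endpoint expects.
payload = {
    "model": "gemma2:27b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}

request = ChatCompletionRequest.model_validate(payload)
print(request.messages[0].content)  # "Hello!"
```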
The `/v1/chat/completions` endpoint is OpenAI-compatible and handles both streaming and non-streaming responses:
```python
@api.post("/v1/chat/completions")
async def v1_chat_completions(request: ChatCompletionRequest) -> Any:
    ...
```
This endpoint:

- validates that the `messages` array is present and non-empty, returning a 400 error if not,
- streams responses as server-sent events in OpenAI's `chat.completion.chunk` format (ending with a `[DONE]` marker) when `stream=True`,
- otherwise returns a single OpenAI-style `chat.completion` response, with token counts reported as `-1` since Ollama doesn't provide them, and
- wraps any processing error in a 500 `HTTPException`.
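To see what a call looks like from the client side, here's a rough sketch using `httpx` against the deployed endpoint; the base URL is a placeholder you'd swap for your own Modal deployment URL:

```python
import httpx

# Placeholder: substitute your actual Modal deployment URL.
BASE_URL = "https://<your-modal-deployment>.modal.run"

response = httpx.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "gemma2:27b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```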
Finally, we tie everything together in the Modal app class:
```python
@app.cls(
    gpu=modal.gpu.A10G(count=1),
    container_idle_timeout=10,
)
class Ollama:
    ...
```
This class defines three key lifecycle methods:
- `build()`: Runs during container build time to set up systemd services
- `enter()`: Runs when the container starts to initialize Ollama
- `serve()`: Exposes our FastAPI application

To deploy this application, simply run:

```bash
modal deploy endpoint.py
```
Modal will:

- build the container image, running `pull()` so the model weights are baked in,
- register the `Ollama` class to run on an A10G GPU, and
- expose the FastAPI app at a public `*.modal.run` URL.
The deployment includes automatic Swagger documentation at `/docs`, allowing you to test the API directly from your browser.
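As an aside, if you want to iterate on the endpoint before committing to a full deployment, Modal also has a `modal serve` command that runs the app ephemerally and reloads it as you edit the file (check Modal's docs for the exact behavior on your version):

```bash
modal serve endpoint.py
```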
Because we've made the endpoint OpenAI-compatible, you can use it with any OpenAI-compatible client. For example, with LlamaBot:
```python
import llamabot as lmb

bot = lmb.SimpleBot(
    model_name="openai/gemma2:27b",
    api_base="https://<your-modal-deployment>.modal.run/v1",
    system_prompt="You are a helpful assistant.",
)
response = bot("Hello!")
```
This compatibility makes it easy to experiment with different open-source models while maintaining compatibility with existing tools and workflows.
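The same endpoint should also work with the official `openai` Python client, since all it needs is a base URL pointed at the deployment. A rough sketch (the API key is a dummy value, because this endpoint doesn't check one):

```python
from openai import OpenAI

# The endpoint doesn't validate API keys, so any placeholder string works.
client = OpenAI(
    base_url="https://<your-modal-deployment>.modal.run/v1",
    api_key="not-needed",
)

completion = client.chat.completions.create(
    model="gemma2:27b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```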