Ollama
Overview
Ollama is a local AI platform enabling users to run large language models directly on their devices without cloud dependencies. The platform emphasizes privacy, accessibility, and ease of use through a simple command-line interface and REST API. Ollama manages model downloading, loading, execution, and lifecycle through a persistent daemon service running on port 11434.
The framework addresses the problem of making LLM inference accessible on consumer hardware through efficient model packaging, quantization, and resource management. Ollama uses the GGUF format for model storage. It initially relied on llama.cpp for inference and later introduced its own engine for enhanced multimodal support.
Key technical components covered:
- Architecture and inference engine evolution
- Model format and registry system
- REST API endpoints and capabilities
- Modelfile configuration and templates
- Quantization formats and performance
- Concurrent request handling and batching
- Multimodal and embedding support
- Version history and feature releases
Architecture and Inference Engine
Ollama’s initial architecture integrated llama.cpp through Go bindings (CGo), which let the Go orchestration code call the C++ inference library. This integration provided:
- Model loading and management: uses llama.cpp to load models from GGUF files, configure GPU acceleration parameters, and manage memory efficiently
- Context and batch operations: uses llama.cpp to handle inference contexts, manage key-value caches, and process batches of tokens or embeddings
- Sampling and generation control: implements sampling algorithms that control token generation and support structured output generation
- Memory and GPU management: uses llama.cpp’s memory management features, including GPU enumeration and backend initialization, to optimize resource utilization across NVIDIA, AMD, and integrated graphics
Transition to a custom engine: As demand for complex multimodal models grew, Ollama developed its own engine designed to handle the intricacies of such models. This shift aimed to improve reliability, accuracy, and performance for local inference tasks.
New engine features:
- Model modularity: Each model is self-contained with its own projection layer, simplifying integration
- Enhanced memory management: Introduces image caching and KVCache optimizations improving inference speed and memory efficiency
- Advanced techniques support: Supports chunked attention and 2D rotary embedding for complex models like Meta’s Llama 4 Scout
- Hardware collaboration: Optimizes memory estimation and inference performance across various devices through hardware partnerships
The daemon architecture runs as a persistent service accepting HTTP requests on port 11434, managing model lifecycle (loading, unloading, memory management), and serving multiple concurrent requests.
Model Format and Registry System
Ollama employs a content-addressable storage system using SHA-256 hashes to uniquely identify and store model components as immutable blobs.
Layered architecture: Models comprise multiple layers representing specific components:
- Model weights
- Adapters (LoRA fine-tuning)
- Templates (prompt formatting)
- System prompts
- Parameters (configuration)
Each layer is stored as an independent blob identified by its SHA-256 digest. A manifest file (JSON format) references these blobs by their digests, describing the model’s composition.
Manifest structure includes:
- SchemaVersion: Version of manifest schema
- MediaType: Content type for manifest
- Config: References configuration layer containing model metadata
- Layers: Array of layers, each with MediaType, Digest (SHA-256), Size, and optional From field for parent model reference
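To make the manifest layout concrete, the following sketch models one as JSON and totals the layer sizes. The lowercase key names, media-type strings, digests, and sizes are illustrative placeholders chosen for this example, not values copied from a real model.

```python
import json

# Illustrative manifest; media types, digests and sizes are made-up placeholders.
MANIFEST_JSON = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.example.manifest+json",
  "config": {"mediaType": "application/vnd.example.config+json",
             "digest": "sha256:cfg0", "size": 500},
  "layers": [
    {"mediaType": "application/vnd.example.model", "digest": "sha256:aaa1", "size": 4000000000},
    {"mediaType": "application/vnd.example.template", "digest": "sha256:bbb2", "size": 1400},
    {"mediaType": "application/vnd.example.params", "digest": "sha256:ccc3", "size": 100}
  ]
}
"""


def summarize_manifest(manifest: dict) -> dict:
    """Return the referenced layer digests and the total size of all layers."""
    layers = manifest["layers"]
    return {
        "digests": [layer["digest"] for layer in layers],
        "total_bytes": sum(layer["size"] for layer in layers),
    }


manifest = json.loads(MANIFEST_JSON)
summary = summarize_manifest(manifest)
print(summary["total_bytes"])
```

Because the manifest only lists digests, resolving a model amounts to looking up each digest in the blob store.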
Blob storage: Blobs are stored under ~/.ollama/models/blobs/, with each blob named after its SHA-256 digest (e.g., sha256-abc123...). This naming guarantees unique identification and makes retrieval by digest straightforward.
Deduplication: Multiple models sharing common components (weights, templates) store those components only once. For example, 10 variants of Llama 3 with different system prompts require only one copy of weights.
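The deduplication behavior follows directly from naming blobs by their hash. A minimal sketch, assuming a hypothetical `put_blob` helper and a temporary directory standing in for the blob store (this is not Ollama's actual storage code):

```python
import hashlib
import tempfile
from pathlib import Path


def put_blob(store: Path, data: bytes) -> str:
    """Store data under a name derived from its SHA-256 digest; return the name."""
    name = "sha256-" + hashlib.sha256(data).hexdigest()
    path = store / name
    if not path.exists():  # identical content is written only once
        path.write_bytes(data)
    return name


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    weights = b"pretend these bytes are model weights"
    # Two model variants share the weights but differ in their system prompt.
    variant_a = [put_blob(store, weights), put_blob(store, b"You are a pirate.")]
    variant_b = [put_blob(store, weights), put_blob(store, b"You are a poet.")]
    blob_count = len(list(store.iterdir()))

print(blob_count)  # one shared weights blob plus two distinct prompt blobs
```

Both variants end up referencing the same weights blob, so only the small prompt blobs add to disk usage.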
Model distribution: The manifest-based system enables efficient versioning and distribution. A manifest can be updated to reference new or modified layers without duplicating the entire model content, similar to container image layers.
REST API Endpoints
Ollama provides a comprehensive REST API at http://localhost:11434 with no authentication required by default. The API accepts JSON payloads with Content-Type: application/json and supports streaming via HTTP chunked encoding. CORS is enabled for browser-based applications.
Text Generation (/api/generate):
- Method: POST
- Parameters: model (required), prompt, suffix, images (base64-encoded for multimodal), stream, format, options, system, template, context, raw
- Generates text completions from the specified model with optional image inputs
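A minimal Python sketch of calling this endpoint with only the standard library. It assumes a daemon on the default port, a model named llama3.2, and that each streamed line is a JSON object carrying a `response` fragment and a `done` flag. The sample chunks are hand-written so the parsing logic can be exercised without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, stream: bool = True) -> bytes:
    """Encode the JSON body for /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")


def collect_stream(lines) -> str:
    """Join the 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model: str, prompt: str) -> str:
    """Send the request to a running daemon and return the full generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return collect_stream(resp)  # the response object iterates line by line


# Offline demonstration with hand-written chunks in the assumed streaming shape.
sample = [
    b'{"model":"llama3.2","response":"Hel","done":false}',
    b'{"model":"llama3.2","response":"lo","done":false}',
    b'{"model":"llama3.2","response":"","done":true}',
]
text = collect_stream(sample)
print(text)
```

With a daemon running, `generate("llama3.2", "Why is the sky blue?")` would perform the actual request.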
Chat Completions (/api/chat):
- Method: POST
- Parameters: model (required), messages (array with role/content/images/tool_calls), tools, format, options, stream, keep_alive
- Facilitates multi-turn conversations with message history
- Supports tool calling through tool_calls in messages
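The tool-calling flow can be sketched as follows. The `get_weather` tool, the `run_tool_calls` helper, and the hand-written assistant message are hypothetical. The sketch assumes tool definitions use the JSON-schema function style and that each entry in `tool_calls` carries a function name with an `arguments` object.

```python
def build_chat_request(model, messages, tools=None, stream=False):
    """Assemble the JSON-serializable body for /api/chat."""
    body = {"model": model, "messages": messages, "stream": stream}
    if tools:
        body["tools"] = tools
    return body


# Hypothetical tool definition in the JSON-schema function style.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real lookup


def run_tool_calls(message: dict, registry: dict) -> list:
    """Execute each requested tool and return 'tool' role messages with the results."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        output = registry[fn["name"]](**fn["arguments"])
        results.append({"role": "tool", "content": output})
    return results


history = [{"role": "user", "content": "What is the weather in Toronto?"}]
request_body = build_chat_request("llama3.2", history, tools=[weather_tool])

# Hand-written assistant message in the assumed response shape.
assistant_msg = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Toronto"}}}],
}
history.append(assistant_msg)
history.extend(run_tool_calls(assistant_msg, {"get_weather": get_weather}))
```

Sending the extended history back in a second /api/chat request lets the model phrase a final answer from the tool output.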
Embeddings (/api/embeddings or /api/embed):
- Method: POST
- Parameters: model (required), input (string or array), truncate, options, keep_alive
- Generates vector embeddings for text inputs
- Returns array of floating-point vectors representing semantic meaning
Model Management:
List Models (/api/tags):
- Method: GET
- Returns list of locally available models with metadata including name, modified timestamp, size, digest, format, family, parameter count, quantization level
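As an illustration of consuming this listing, the sketch below formats a hand-written response. It assumes the payload has a top-level `models` array whose entries carry `name`, `size` (bytes), and a `details` object with `quantization_level`. The sample values are made up.

```python
def human_size(num_bytes: float) -> str:
    """Format a byte count using decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


def summarize_models(tags_response: dict) -> list:
    """Build one summary line per locally available model."""
    rows = []
    for model in tags_response.get("models", []):
        details = model.get("details", {})
        quant = details.get("quantization_level", "?")
        rows.append(f'{model["name"]}  {human_size(model["size"])}  {quant}')
    return rows


# Hand-written sample in the assumed response shape.
sample = {
    "models": [
        {
            "name": "llama3.2:latest",
            "size": 2019393189,
            "details": {"family": "llama", "quantization_level": "Q4_K_M"},
        }
    ]
}
rows = summarize_models(sample)
print("\n".join(rows))
```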
Create Model (/api/create):
- Method: POST
- Parameters: model (required name), from (existing model), modelfile (string), quantize, stream
- Creates a new model from an existing model, safetensors directory, or GGUF file
- Supports quantization during creation
Pull Model (/api/pull):
- Method: POST
- Downloads model from Ollama registry
- Streams progress updates
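A small sketch of turning streamed progress updates into percentages. It assumes each update is a JSON object with a `status` string and, while downloading, `total` and `completed` byte counts. The sample lines are hand-written rather than captured from a real pull.

```python
import json


def pull_progress(line: str):
    """Return (status, percent or None) for one streamed progress object."""
    event = json.loads(line)
    total = event.get("total")
    completed = event.get("completed", 0)
    percent = round(100 * completed / total, 1) if total else None
    return event.get("status", ""), percent


sample_lines = [
    '{"status": "pulling manifest"}',
    '{"status": "downloading", "total": 2000, "completed": 500}',
    '{"status": "success"}',
]
updates = [pull_progress(line) for line in sample_lines]
for status, percent in updates:
    print(status if percent is None else f"{status}: {percent}%")
```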
Push Model (/api/push):
- Method: POST
- Uploads custom model to registry
Delete Model (/api/delete):
- Method: DELETE
- Removes model from local storage
Copy Model (/api/copy):
- Method: POST
- Creates copy of existing model with new name
Show Model Info (/api/show):
- Method: POST
- Returns detailed model information including modelfile, parameters, template, license, system message
The final response from the generation endpoints (/api/generate and /api/chat) includes timing metrics for performance monitoring: total_duration, load_duration, prompt_eval_duration, eval_duration, prompt_eval_count, and eval_count. Durations are reported in nanoseconds.
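These fields are enough to derive throughput. The sketch below computes tokens per second from a hand-written set of metrics, assuming the durations are expressed in nanoseconds. The numbers are invented for the example.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    if duration_ns <= 0:
        return 0.0
    return count * 1e9 / duration_ns


# Hand-written metrics; durations are assumed to be in nanoseconds.
final = {
    "total_duration": 5_000_000_000,
    "load_duration": 1_000_000_000,
    "prompt_eval_count": 50,
    "prompt_eval_duration": 500_000_000,
    "eval_count": 300,
    "eval_duration": 3_000_000_000,
}
prompt_tps = tokens_per_second(final["prompt_eval_count"], final["prompt_eval_duration"])
gen_tps = tokens_per_second(final["eval_count"], final["eval_duration"])
print(f"prompt: {prompt_tps:.1f} tok/s, generation: {gen_tps:.1f} tok/s")
```

Comparing load_duration against total_duration in the same way shows how much of a slow first request is model loading rather than inference.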
Modelfile Configuration
A Modelfile serves as the blueprint for creating and customizing models using a declarative syntax:
# comment
INSTRUCTION arguments
FROM (required) specifies the base model:
FROM llama3.2
PARAMETER sets runtime parameters:
- temperature: Controls randomness (0.0-2.0)
- num_ctx: Context window size (e.g., 4096)
- repeat_penalty: Penalizes repetitions
- top_k: Limits token selection to top k tokens
- top_p: Nucleus sampling threshold
- seed: Random seed for reproducibility
- stop: Stop sequences to halt generation
- num_predict: Maximum tokens to predict
- mirostat, mirostat_eta, mirostat_tau: Mirostat sampling controls
- min_p: Minimum probability threshold
- tfs_z: Tail-free sampling parameter
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "<|endoftext|>"
TEMPLATE defines the full prompt template using Go template syntax:
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
Template variables include {{ .System }}, {{ .Prompt }}, {{ .Response }} for structuring interactions.
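To show what the template above expands to, here is a Python function that mirrors its conditionals by hand. It is only an illustration of the resulting prompt layout. It does not use or emulate the Go template engine, and `render_chatml` is a hypothetical name.

```python
def render_chatml(system: str = "", prompt: str = "", response: str = "") -> str:
    """Mirror the template: optional system block, optional user block, assistant block."""
    out = []
    if system:  # corresponds to {{ if .System }} ... {{ end }}
        out.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    if prompt:  # corresponds to {{ if .Prompt }} ... {{ end }}
        out.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    out.append(f"<|im_start|>assistant\n{response}<|im_end|>")
    return "".join(out)


rendered = render_chatml(system="Be brief.", prompt="Is Toronto in Canada?", response="Yes.")
print(rendered)
```

Leaving `system` empty drops the whole system block, which is what the `{{ if .System }}` guard achieves in the template.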
SYSTEM specifies system message guiding behavior:
SYSTEM """You are a helpful assistant specialized in technical documentation."""
ADAPTER applies a fine-tuned LoRA adapter:
ADAPTER ./adapter.bin
LICENSE specifies the legal license:
LICENSE """MIT License"""
MESSAGE defines conversation history:
MESSAGE user Is Toronto in Canada?
MESSAGE assistant yes
MESSAGE user Is Ontario in Canada?
MESSAGE assistant yes
Modelfile instructions are not case-sensitive and can appear in any order. Creating a custom model:
ollama create mymodel -f Modelfile
Quantization Formats
Ollama supports multiple GGUF quantization formats balancing model size, performance, and accuracy:
Q4_0: 4-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q * block_scale. Legacy format, not widely used. Results in approximately 4 bits per weight.
Q4_K_M: 4-bit k-quant format (the "K" does not refer to k-means clustering) with super-blocks of 8 blocks, each containing 32 weights. Weight formula: w = q * block_scale(6-bit) + block_min(6-bit). Results in roughly 4.5 bits per weight. Recommended over the legacy formats for its balanced quality/size tradeoff.
Q8_0: 8-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q * block_scale. Legacy format, not widely used. Provides higher quality than 4-bit but larger model size.
Additional K-quant formats: Ollama supports full K-quant family including Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q5_K_S, Q5_K_M, Q6_K offering various quality/size tradeoffs.
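The round-to-nearest formula w = q * block_scale used by the legacy formats can be sketched in a few lines. This is a simplified illustration of the idea in the style of Q8_0 (one scale shared by 32 weights, signed 8-bit values). It does not reproduce the actual GGUF bit layout, and the function names are hypothetical.

```python
import random

BLOCK_SIZE = 32


def quantize_block(weights):
    """Round-to-nearest 8-bit quantization of one block with a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Reconstruct approximate weights: w = q * block_scale."""
    return [qi * scale for qi in q]


random.seed(0)
block = [random.uniform(-1, 1) for _ in range(BLOCK_SIZE)]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

The reconstruction error per weight is bounded by half the block scale, which is why blocks with a few large outliers lose more precision than blocks with evenly sized weights.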
Quantization during model creation:
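Modelfile contents:
FROM llama3.1:8b-instruct-fp16
Command (quantization is requested with the --quantize flag rather than a Modelfile instruction):
ollama create --quantize q4_K_M llama3.1:quantized -f Modelfile
Quantization reduces model size significantly (e.g., a 7B-parameter model shrinks from about 14 GB in FP16 to roughly 4 GB in Q4_K_M), typically with only a small loss of accuracy, enabling larger models on consumer hardware.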
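The size reduction is simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. A small sketch, assuming 16 bits per weight for FP16 and the 4.5 bits per weight quoted above for the 4-bit K-quant. Real files are somewhat larger because of metadata and because some tensors are kept at higher precision.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


fp16_gb = model_size_gb(7, 16)
q4_gb = model_size_gb(7, 4.5)
print(f"FP16: {fp16_gb:.1f} GB, 4.5 bits/weight: {q4_gb:.1f} GB")
```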
Concurrent Request Handling and Batching
Ollama supports concurrent request processing and batching for enhanced performance:
Concurrent requests are managed through the OLLAMA_NUM_PARALLEL environment variable, which sets the maximum number of parallel requests per model. The default is typically 1 or 4, depending on available memory.
OLLAMA_NUM_PARALLEL=4 ollama serve
Batch processing groups multiple requests into a single batch processed together in one forward pass. It is particularly beneficial for GPU inference, where it can improve throughput and memory efficiency. Batching effectiveness depends on:
- Model architecture supporting batch processing
- Available GPU VRAM accommodating batch size
- Request arrival patterns enabling batching
The model is loaded once, and the loaded model serves multiple concurrent requests. Models are unloaded automatically after an idle timeout (configurable via OLLAMA_KEEP_ALIVE).
GPU acceleration requires sufficient VRAM for model and batch size. Ollama automatically detects and utilizes available GPUs (NVIDIA CUDA, AMD ROCm, Intel, Apple Metal). Monitoring tools like nvidia-smi help assess GPU utilization.
Configuration recommendations:
- Set OLLAMA_NUM_PARALLEL based on system memory capacity
- Enable batching when supported by the model
- Monitor GPU and system memory usage
- Adjust OLLAMA_MAX_LOADED_MODELS for multi-model scenarios
The daemon’s efficient request handling enables serving multiple users from single Ollama instance with shared model loading and optimized resource utilization.
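From the client side, taking advantage of parallel request handling just means issuing requests concurrently. A minimal sketch using a thread pool: `fake_send` is a hypothetical stand-in for an HTTP call to the daemon so the example runs offline, and `max_workers=4` is chosen to match the OLLAMA_NUM_PARALLEL=4 example above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(prompts, send, max_workers=4):
    """Call send(prompt) for every prompt in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    # Stand-in for an HTTP request to the daemon; replace with a real client call.
    return prompt.upper()


answers = run_concurrently(["one", "two", "three"], fake_send)
print(answers)
```

Sending more simultaneous requests than the server is configured to process in parallel does not fail. The extra requests simply wait, so matching the client pool size to the server setting avoids needless queuing.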
Multimodal and Embedding Support
Multimodal vision models process both text and images locally. Ollama supports models including:
- Meta’s Llama 4 Scout
- Google’s Gemma 3
- Alibaba’s Qwen 2.5 VL
- LLaVA variants
Usage:
ollama pull llama4:scout
ollama run llama4:scout
>>> what do you see in this image? /path/to/image.png
Via API:
curl http://localhost:11434/api/generate -d '{
"model": "llama4:scout",
"prompt": "describe this image",
"images": ["base64_encoded_image"]
}'
Embedding models generate vector embeddings for semantic search and RAG applications. Supported models include:
- mxbai-embed-large
- nomic-embed-text
- all-minilm
Usage:
ollama pull mxbai-embed-large
Via the Python library:
import ollama
response = ollama.embed(
model='mxbai-embed-large',
input='Your text here',
)
embeddings = response['embeddings']
Embeddings are floating-point vectors (typically 384-1024 dimensions) representing semantic meaning. These vectors can be stored in vector databases (Pinecone, Weaviate, Chroma) for similarity search, enabling retrieval-augmented generation, where relevant documents are retrieved before the LLM generates a response.
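The similarity-search step can be sketched without a vector database. The example below ranks documents by cosine similarity, using toy 3-dimensional vectors as stand-ins for real embeddings. The document texts and vector values are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec, documents):
    """Return the text of the document whose vector is most similar to the query."""
    return max(documents, key=lambda doc: cosine_similarity(query_vec, doc[1]))[0]


# Toy vectors standing in for real embeddings.
docs = [
    ("llamas are camelids", [0.9, 0.1, 0.0]),
    ("GPUs accelerate inference", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]
best = top_match(query, docs)
print(best)
```

In a retrieval-augmented setup, the text returned by `top_match` would be placed into the prompt before asking the model to answer.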
Version History and Features
2024 releases:
Llama 3 Integration (April 18, 2024): Added support for Meta’s Llama 3 models with improved performance.
Embedding Models (April 8, 2024): Introduced vector embedding generation for RAG applications.
AMD Graphics Support (March 14, 2024): Expanded GPU acceleration to AMD cards on Windows and Linux.
Windows Preview (February 15, 2024): Released Windows version with GPU acceleration and full model library.
OpenAI Compatibility (February 8, 2024): Achieved compatibility with OpenAI Chat Completions API enabling use of existing OpenAI tools.
Vision Models (February 2, 2024): Introduced LLaVA 1.6 supporting higher-resolution images and improved text recognition.
2025 releases:
Desktop Application (July 30, 2025): Launched native desktop app for macOS and Windows with drag-and-drop for PDFs and images.
Cloud Inference Service (August 2025): Introduced “Turbo” cloud service ($20/month) for datacenter-grade hardware access.
Secure Minions Protocol (June 3, 2025): Enabled encrypted local-remote communication for private collaboration between local and cloud models (Stanford collaboration).
Thinking Mode (May 30, 2025): Added ability to enable/disable model’s reasoning behavior for different applications.
Streaming with Tool Calling (May 28, 2025): Supported streaming responses with function execution during streaming.
Multimodal Engine (May 15, 2025): Launched Ollama's own engine supporting multimodal models with enhanced capabilities.
Web Search API (September 24, 2025): Added web search API with generous free tier.
Model Scheduling System (September 23, 2025): Improved scheduling reducing memory-related crashes and maximizing multi-GPU utilization.
Cloud Models Preview (September 19, 2025): Introduced cloud models allowing larger models on datacenter hardware while maintaining local tool usage.
Ollama maintains rapid development pace with frequent feature releases focusing on expanding model support, improving performance, and enhancing hardware compatibility.