Ollama

Overview

Ollama is a local inference server for large language models. It downloads, quantizes, loads, and serves models through a REST API on localhost:11434. The goal: make running an LLM locally as simple as docker pull + docker run.

Ollama uses the GGUF format for model storage and runs inference via Metal (macOS), CUDA (NVIDIA), or ROCm (AMD). It provides an OpenAI-compatible API, so any tool that speaks the OpenAI /v1/chat/completions format can use Ollama as a drop-in local backend.

Architecture

The Daemon

Ollama runs as a persistent Go daemon on port 11434. The daemon:

Accepts HTTP requests (REST API)
Loads the requested model into memory (if not already loaded)
Runs inference using the loaded model
Keeps the model in memory for a configurable duration (keep_alive, default 5 minutes)
Unloads the model when the timeout expires or when memory pressure requires it

Multiple models can be loaded simultaneously if memory permits. The environment variable OLLAMA_MAX_LOADED_MODELS sets the limit; the Ollama FAQ documents the default as 3 × the number of GPUs, or 3 for CPU inference.
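The load, keep-alive, and unload cycle described above can be sketched as a toy simulation. This is not Ollama's scheduler: the class name LoadedModels is hypothetical, eviction here is least-recently-used (an assumption), and memory pressure is not modeled.

```python
from collections import OrderedDict


class LoadedModels:
    """Toy model of keep-alive expiry plus a cap on resident models."""

    def __init__(self, max_loaded, keep_alive_s=300.0):
        self.max_loaded = max_loaded
        self.keep_alive_s = keep_alive_s
        self._expires = OrderedDict()  # model name -> expiry time, least recent first

    def request(self, model, now):
        """Serve one request at time `now` (seconds); return 'warm' or 'cold'."""
        # Drop every model whose keep-alive window has passed.
        for name in [n for n, t in self._expires.items() if t <= now]:
            del self._expires[name]
        status = "warm" if model in self._expires else "cold"
        if status == "cold":
            # Assumption: evict the least recently used model to make room.
            while len(self._expires) >= self.max_loaded:
                self._expires.popitem(last=False)
        # Every request restarts the keep-alive timer for that model.
        self._expires[model] = now + self.keep_alive_s
        self._expires.move_to_end(model)
        return status
```

With max_loaded=1, alternating between two models makes every request cold; with max_loaded=2, both stay warm as long as each is used within its keep-alive window.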

Inference Engine

Ollama’s inference engine is built on llama.cpp, a C++ inference library, integrated through Go’s CGo foreign function interface. llama.cpp handles the core inference loop: loading GGUF tensors, managing the KV cache (the attention key and value tensors stored for every token processed so far, so that generating the next token does not recompute them), and running the forward pass through the transformer layers.

Starting with the Multimodal Engine release (May 2025), Ollama also ships a custom inference backend for model architectures that llama.cpp does not support. The model format is still GGUF — what differs is the architecture: models like Llama 4 Scout use chunked attention and 2D rotary embeddings, which require inference code that llama.cpp hadn’t implemented. The custom engine covers these architectures; llama.cpp continues to handle the rest.

GPU Acceleration

Ollama auto-detects the available GPU compute backend and uses it for inference:

Platform	Backend	Detection
macOS (Apple Silicon)	Metal	Automatic
Linux/Windows (NVIDIA)	CUDA	Requires NVIDIA drivers
Linux (AMD)	ROCm	Requires ROCm 6.0+

On discrete GPUs (NVIDIA, AMD), model weights must fit in the GPU’s dedicated VRAM. On Apple Silicon, the CPU and GPU share a single unified memory pool — a 64 GB Mac can load a 40 GB model with no data copying between CPU and GPU. See Mac Mini with unified memory vs Mini ITX with eGPU for a cost/performance comparison.

When a model doesn’t fit entirely in GPU memory, Ollama splits it: some layers run on the GPU, the rest on CPU. This is called partial offloading — it’s slower than full GPU inference but faster than pure CPU.
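A rough way to picture partial offloading is a layer budget. The sketch below assumes equal-sized layers and a fixed VRAM reserve for the KV cache and scratch buffers; both are simplifications, and Ollama's own memory estimate is more detailed. The function name split_layers and the reserve_gb parameter are hypothetical.

```python
def split_layers(n_layers, layer_gb, vram_gb, reserve_gb=1.0):
    """Return (gpu_layers, cpu_layers) for a model with equal-sized layers.

    reserve_gb is an assumed allowance for KV cache and scratch buffers.
    """
    usable = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable // layer_gb))
    return gpu_layers, n_layers - gpu_layers
```

For example, a hypothetical 80-layer model at 0.5 GB per layer on a 24 GB card with 2 GB reserved gives split_layers(80, 0.5, 24.0, reserve_gb=2.0) == (44, 36): 44 layers on the GPU and 36 on the CPU.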

Model Registry

Ollama uses a content-addressable storage system — the same design pattern as Docker image layers and Git objects. Each piece of a model (weights, template, system prompt, adapter) is stored as an immutable blob named by its SHA-256 hash.

A manifest (JSON file) lists the blobs that compose a model:

{
  "schemaVersion": 2,
  "layers": [
    {"mediaType": "application/vnd.ollama.image.model", "digest": "sha256:abc...", "size": 4109853696},
    {"mediaType": "application/vnd.ollama.image.template", "digest": "sha256:def...", "size": 1286},
    {"mediaType": "application/vnd.ollama.image.params", "digest": "sha256:ghi...", "size": 96}
  ]
}

Deduplication: Model variants that share the same weights blob reference the same hash, so ten models with different system prompts but identical weights store the weights only once. ollama list reports each model’s full size, so shared blobs are counted once per model; du -sh ~/.ollama/models/blobs/ shows the actual disk usage.
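The deduplication effect follows directly from naming blobs by their hash. A minimal in-memory sketch, assuming nothing about Ollama's on-disk layout (the BlobStore class and its method names are hypothetical):

```python
import hashlib


class BlobStore:
    """In-memory content-addressable store: blobs are keyed by SHA-256 digest."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes
        self.manifests = {}  # model name -> list of digests

    def put(self, data):
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so it is stored once.
        self.blobs.setdefault(digest, data)
        return digest

    def create_model(self, name, layers):
        self.manifests[name] = [self.put(layer) for layer in layers]

    def apparent_size(self):
        """Sum of every model's layers, counting shared blobs repeatedly."""
        return sum(len(self.blobs[d]) for ds in self.manifests.values() for d in ds)

    def actual_size(self):
        """Bytes actually stored."""
        return sum(len(b) for b in self.blobs.values())
```

Two models built from the same weights bytes but different system prompts end up with three blobs in total, and actual_size() stays close to the size of one copy of the weights.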

REST API

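The daemon serves a REST API on http://localhost:11434 with no authentication by default. Cross-origin browser requests are allowed from local origins by default; other origins must be listed in OLLAMA_ORIGINS. Endpoints take JSON request bodies, and the generation endpoints stream their output as newline-delimited JSON objects over HTTP chunked encoding unless "stream": false is set.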

Chat Completions (POST /api/chat) — multi-turn conversation with message history, tool calling, and streaming. This is the primary endpoint most applications use.
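When streaming is on, /api/chat returns one JSON object per line, each carrying a fragment of the assistant message, with "done": true on the last one. A sketch of how a client might reassemble the reply (assemble_stream is a hypothetical helper; it accepts any iterable of lines, such as an HTTP response body):

```python
import json


def assemble_stream(lines):
    """Join streamed chat fragments; return (full_text, final_chunk)."""
    parts = []
    final = None
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            final = chunk  # the last chunk carries the timing metrics
            break
    return "".join(parts), final
```

Because the function only needs an iterable of lines, it can be exercised with canned data before pointing it at a live daemon.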

Text Generation (POST /api/generate) — single-turn completion. Simpler than chat but less commonly used.

Embeddings (POST /api/embed) — generates vector embeddings for text. Used in RAG pipelines to convert documents into vectors for similarity search.
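The similarity-search step of a RAG pipeline can be sketched with plain cosine similarity. This assumes the vectors have already been fetched from /api/embed; the helper names cosine and rank are hypothetical, and a real pipeline would usually use a vector index instead of a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]
```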

Model Management — GET /api/tags (list models), POST /api/pull (download), POST /api/create (build from Modelfile), DELETE /api/delete (remove), POST /api/show (inspect metadata).

OpenAI-compatible endpoints — Ollama also serves /v1/chat/completions and /v1/models in the OpenAI API format. Any client library or application that targets OpenAI can point at http://localhost:11434/v1 instead, with no code changes. This is how FluidVoice and most third-party tools integrate.
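A minimal sketch of calling the OpenAI-compatible endpoint with only the Python standard library. The helper names build_chat_request and send are hypothetical; the request and response shapes follow the OpenAI chat completions format, and the model tag is whatever has been pulled locally.

```python
import json
import urllib.request


def build_chat_request(model, user_text, base_url="http://localhost:11434/v1"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running daemon and return the reply text."""
    with urllib.request.urlopen(request) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the daemon running, send(build_chat_request("qwen2.5:7b", "Hello")) returns the assistant's reply. Switching to another OpenAI-compatible server only changes base_url.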

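The final response object of a generate or chat call includes timing metrics for benchmarking: total_duration, load_duration, prompt_eval_duration (time to process the input), and eval_duration (time to generate tokens), all in nanoseconds, plus the token counts prompt_eval_count and eval_count.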
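Turning these metrics into tokens per second is a one-line division once the units are right. The sketch below assumes the durations are reported in nanoseconds and that the token counts are the prompt_eval_count and eval_count fields of the final response; the function name throughput is hypothetical.

```python
def throughput(resp):
    """Derive prompt and decode speeds from a final generate/chat response."""
    ns = 1e9  # durations are assumed to be in nanoseconds
    return {
        "prompt_tok_per_s": resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / ns),
        "decode_tok_per_s": resp["eval_count"] / (resp["eval_duration"] / ns),
        "load_s": resp["load_duration"] / ns,
    }
```

For a made-up response with eval_count 200 and eval_duration 2,000,000,000, the decode speed works out to 100 tokens per second.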

Modelfile

A Modelfile is a declarative recipe for creating a custom model variant — similar to a Dockerfile. It specifies a base model, system prompt, template, and parameter overrides.

FROM qwen2.5:14b
SYSTEM """You are a concise technical assistant. Answer in 1-3 sentences unless asked for more detail."""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192

ollama create my-assistant -f Modelfile
ollama run my-assistant "What is a KV cache?"

Why Template?

The TEMPLATE directive controls how Ollama formats the conversation into the raw text the model sees. Different model families expect different prompt formats (ChatML, Llama-style, Mistral-style). Ollama ships default templates per model family, but a Modelfile lets you override this.

Practical use case — wrapping a base model with a domain-specific system prompt and conservative sampling for a customer-facing chatbot:

FROM llama3.1:8b
SYSTEM """You are a customer support agent for Acme Corp. Only answer questions about Acme products. If unsure, say you don't know."""
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>
{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"

The template uses Go’s text/template syntax. {{ .System }} injects the SYSTEM message, {{ range .Messages }} iterates over conversation turns, and {{ .Role }} / {{ .Content }} are the role (user/assistant) and message text for each turn. The special tokens (<|start_header_id|>, <|eot_id|>) are model-family-specific delimiters that the model was trained to recognize.
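To see what raw text the model receives, the template above can be mirrored in a few lines of Python. This is only an illustration of the string the template produces as written here, not Ollama's Go renderer; render_prompt is a hypothetical helper.

```python
def render_prompt(system, messages):
    """Mirror the Llama-3-style template: optional system block, turns, open assistant header."""
    out = []
    if system:
        out.append(f"<|start_header_id|>system<|end_header_id|>\n{system}<|eot_id|>")
    for m in messages:
        out.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n{m['content']}<|eot_id|>"
        )
    # The prompt ends with an open assistant header so the model writes the reply.
    out.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(out)
```

Printing render_prompt("Be brief.", [{"role": "user", "content": "Hi"}]) shows why the stop parameter is set to <|eot_id|>: the model is expected to close its own turn with that token.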

Sampling Parameters

When the model produces a probability distribution over the next token, sampling decides which token to actually pick. The Modelfile can tune this:

Parameter	What it controls	Default
`temperature`	Randomness. 0 = deterministic (always pick the most likely token). Higher = more creative/random.	0.8
`top_k`	Only consider the top K most likely tokens. Cuts off the long tail.	40
`top_p`	Only consider tokens whose cumulative probability reaches P (nucleus sampling).	0.9
`min_p`	Discard tokens with probability below this fraction of the top token’s probability.	0.0
`repeat_penalty`	Penalize tokens that already appeared in the output. Reduces repetition.	1.1
`seed`	Fix the random seed for reproducible outputs.	-1 (random)

These interact: top_k filters first, then top_p filters the survivors, then temperature scales the remaining probabilities before sampling. For deterministic output (e.g., JSON extraction), set temperature to 0. For creative writing, try temperature 1.0 with top_p 0.95.
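The filtering order described above can be sketched in pure Python. This is an illustration of that order only; the real sampler lives in the inference engine and may differ in detail, and min_p, repeat_penalty, and other samplers are left out. All function names are hypothetical.

```python
import random


def filter_candidates(probs, top_k=40, top_p=0.9):
    """Apply top_k, then top_p; return renormalized (token, prob) pairs."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p / total))
        cumulative += p / total
        if cumulative >= top_p:
            break  # smallest prefix whose cumulative probability reaches top_p
    norm = sum(p for _, p in kept)
    return [(t, p / norm) for t, p in kept]


def apply_temperature(candidates, temperature):
    """Sharpen (T < 1) or flatten (T > 1) the distribution; T == 0 means greedy."""
    if temperature == 0:
        best = max(candidates, key=lambda tp: tp[1])[0]
        return [(best, 1.0)]
    # p ** (1/T), renormalized, is equivalent to softmax(logits / T).
    scaled = [(t, p ** (1.0 / temperature)) for t, p in candidates]
    norm = sum(p for _, p in scaled)
    return [(t, p / norm) for t, p in scaled]


def sample(probs, top_k=40, top_p=0.9, temperature=0.8, seed=None):
    """Pick one token from a dict of token -> probability."""
    rng = random.Random(seed)
    cands = apply_temperature(filter_candidates(probs, top_k, top_p), temperature)
    tokens, weights = zip(*cands)
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With temperature 0 the result is always the most likely token, which is why that setting suits extraction tasks.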

Concurrent Requests and Parallelism

The daemon serves multiple clients. Key configuration:

OLLAMA_NUM_PARALLEL — how many requests a single loaded model handles concurrently (default: 1 or 4 depending on memory). Each parallel slot needs its own KV cache allocation, so more parallel requests = more memory per model. With OLLAMA_NUM_PARALLEL=4 and a 7B Q4_K_M model, expect ~8 GB total (model weights + 4 KV caches).
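The per-slot cost can be estimated from the model's attention shape. The sketch below uses the standard KV cache size formula and assumes 16-bit cache values; the layer count, KV head count, and head dimension in the usage note are illustrative assumptions, not figures from Ollama, and real usage adds compute buffers on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of one KV cache: a key and a value tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def total_memory_gib(weights_gib, parallel_slots, **kv_args):
    """Weights plus one KV cache per parallel slot, in GiB."""
    return weights_gib + parallel_slots * kv_cache_bytes(**kv_args) / 2**30
```

For an assumed shape of 32 layers, 8 KV heads, and head dimension 128, one 8192-token cache is exactly 1 GiB, so four parallel slots at that context length add about 4 GiB on top of the weights.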

OLLAMA_MAX_LOADED_MODELS — how many models stay in memory simultaneously. If you switch between qwen2.5:7b and qwen2.5:14b frequently, setting this to 2 keeps both loaded and avoids reload delays. The cost is memory: both models’ weights resident simultaneously.

Integration Patterns

Any application that speaks the OpenAI Chat Completions API format can use Ollama with zero code changes — just point the base URL at http://localhost:11434/v1.

Localhost detection: Ollama does not check API keys, so requests from localhost need no credentials. Applications that detect a local endpoint (localhost, 127.0.0.1, private IPs) can skip authentication. FluidVoice, LM Studio, and most OpenAI client libraries support this; a client library that insists on a key can be given any placeholder string, which Ollama ignores.

keep_alive semantics: After a request, the model stays in memory for 5 minutes by default. The first request after an idle period therefore pays a cold start while the model loads (about 2-3 seconds for a 7B model), while later requests within the window skip the load entirely. Control this per request via the keep_alive field in the API payload, or globally via OLLAMA_KEEP_ALIVE.

# Keep model loaded for 30 minutes
curl http://localhost:11434/api/chat -d '{"model": "qwen2.5:7b", "keep_alive": "30m", ...}'

# Unload immediately after response (number of seconds, not a string)
curl http://localhost:11434/api/chat -d '{"model": "qwen2.5:7b", "keep_alive": 0, ...}'

# Keep loaded indefinitely (until server restart); any negative number works
curl http://localhost:11434/api/chat -d '{"model": "qwen2.5:7b", "keep_alive": -1, ...}'
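The same payloads can be built programmatically. A small sketch, assuming keep_alive is either a duration string such as "30m" or a number of seconds (0 to unload immediately, negative to keep the model loaded); chat_payload is a hypothetical helper.

```python
import json


def chat_payload(model, user_text, keep_alive=None):
    """Build the JSON body for POST /api/chat with an optional keep_alive."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    # Compare against None explicitly: keep_alive=0 is falsy but meaningful.
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive
    return json.dumps(payload)
```

Leaving keep_alive out of the payload falls back to the server-wide setting.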

Practical Model Selection for Apple Silicon

See notebook for an interactive walkthrough with live controls.

Which model to pull depends on the task, the available unified memory, and how much latency is acceptable.

Memory Budget

The memory needed for model weights:

$\text{Memory (GB)} \approx \text{Parameters (B)} \times \frac{\text{Bits per weight}}{8}$

The “bits per weight” depends on the quantization format. On top of the weights, the KV cache adds ~1-2 GB depending on context window size — a 4096-token context uses less memory than 32K tokens.

On an M4 Max with 64 GB, the OS and background apps typically use 8-15 GB, leaving ~50 GB:

Model class	Quant	Memory	Fits?
7-8B (Llama 3.1 8B, Qwen 2.5 7B)	Q4_K_M	~6 GB	Comfortable. Room for other apps.
14B (Qwen 2.5 14B)	Q4_K_M	~9 GB	Comfortable.
32B (Qwen 2.5 32B, Gemma 3 27B)	Q4_K_M	~20 GB	Fits. Noticeable memory pressure with heavy apps.
70B (Llama 3.1 70B)	Q4_K_M	~40 GB	Fits but leaves little headroom.
70B	Q8_0	~70 GB	Does not fit. Requires Q4 or lower.
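The weight formula and the fit check in the table can be reproduced with a few lines. The bits-per-weight figure used for Q4_K_M in the usage note (about 4.5) is a rough assumption, as is the flat 2 GB KV cache allowance; the function names are hypothetical.

```python
def weights_gb(params_billion, bits_per_weight):
    """Memory for model weights: parameters (billions) x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, available_gb, kv_cache_gb=2.0):
    """True if weights plus an assumed KV cache allowance fit in available memory."""
    return weights_gb(params_billion, bits_per_weight) + kv_cache_gb <= available_gb
```

With roughly 50 GB available, fits(70, 4.5, 50) is true (about 39 GB of weights), while fits(70, 8, 50) is false because the weights alone need 70 GB.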

Model Recommendations by Task

Task	Recommended	Memory	Why
Dictation cleanup	`qwen2.5:7b` or `gemma3:4b`	4-6 GB	Simple task. Sub-second latency on M4 Max.
Text rewriting	`qwen2.5:14b`	~9 GB	Better instruction-following than 7B.
Function calling / agentic	`qwen2.5:14b` or `qwen2.5:32b`	9-20 GB	Needs tool-use training and multi-step reasoning.
General assistant	`qwen2.5:14b`	~9 GB	Good all-rounder.

Not all models support function calling

Function calling requires the model to emit structured JSON tool call syntax. Models explicitly trained for it (Qwen 2.5, Llama 3.1, Mistral) work reliably. Smaller or older models may hallucinate tool calls.

Expected Performance (M4 Max, 40 GPU cores, Q4_K_M)

Model	Tokens/sec	Time for 200-token response
Gemma 3 4B	150-250	~1 sec
Qwen 2.5 7B	100-180	~1-2 sec
Qwen 2.5 14B	60-100	~2-3 sec
Qwen 2.5 32B	30-50	~4-7 sec
Llama 3.1 70B	12-20	~10-17 sec

These are decode (generation) speeds. The first token takes longer because the model must process the entire prompt first; this delay is the time to first token (TTFT). For short prompts, TTFT is 200-500 ms on 7B models. For long conversations on 70B, TTFT can be several seconds.
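The response-time column is simply output length divided by decode speed, plus TTFT. A minimal sketch (response_time_s is a hypothetical helper; the inputs are the ranges quoted in the table, not new measurements):

```python
def response_time_s(output_tokens, decode_tok_per_s, ttft_s=0.0):
    """Approximate wall-clock time: time to first token plus decode time."""
    return ttft_s + output_tokens / decode_tok_per_s
```

A 200-token reply at 100 tokens/sec takes 2 seconds of decode time; at 20 tokens/sec with a 0.5 second TTFT it takes 10.5 seconds.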

Model Swapping Overhead

If you use different models for different tasks (e.g., 7B for dictation, 14B for function calling), Ollama must unload one model and load the other when you switch. This has real costs:

Scenario	What happens	Delay
Same model, within `keep_alive` window	Model already loaded. Instant.	~0 ms
Same model, after `keep_alive` expired	Model reloaded from disk.	2-3 sec (7B), 5-8 sec (32B)
Switch to a different model	Current model unloaded, new model loaded.	Same as reload time above
Two models, both within `OLLAMA_MAX_LOADED_MODELS`	Both stay resident. No swap needed.	~0 ms

Practical advice: If you frequently alternate between tasks, use one model for everything (qwen2.5:14b is a good all-rounder) to avoid swap delays. Alternatively, set OLLAMA_MAX_LOADED_MODELS=2 and keep both loaded — at the cost of ~15 GB total memory for a 7B + 14B pair.

Resource Impact

Idle (model loaded, no requests): Memory occupied but CPU/GPU idle. A 7B Q4_K_M model holds ~6 GB — negligible on 64 GB. A 32B model holds ~20 GB — noticeable with Xcode or Docker.

Active inference: GPU cores fully utilized, memory bandwidth saturated. Duration: 1-5 seconds for short tasks, up to 30 seconds for complex multi-turn conversations. Other GPU-heavy apps (video playback, 3D rendering) may stutter briefly.

Monitoring:

ollama ps                                    # loaded models and their memory
sysctl -n hw.memsize | awk '{print $1/1073741824 " GB"}'  # total memory

Installation

brew install ollama
ollama serve                    # start the daemon (or: brew services start ollama)
ollama pull qwen2.5:7b          # download a model
ollama run qwen2.5:7b "Hello"   # test it
ollama ps                       # verify it's loaded and check memory

Keeping Up to Date

Models and quantization methods evolve rapidly. These resources help refresh model knowledge:

Ollama model library — https://ollama.com/library — browse available models, sizes, and quantization variants
Open LLM Leaderboard — https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard — community benchmarks comparing open models
LM Arena — https://lmarena.ai/ — blind human preference rankings
Berkeley Function Calling Leaderboard — https://gorilla.cs.berkeley.edu/leaderboard.html — ranks models on function calling accuracy
r/LocalLLaMA — https://reddit.com/r/LocalLLaMA — new releases, Apple Silicon benchmarks, practical tips
Ollama releases — https://github.com/ollama/ollama/releases
MLX Community — https://huggingface.co/mlx-community — Apple Silicon performance baselines

Periodic refresh checklist

Every 3-6 months:

Check the Open LLM Leaderboard for new top models in the 7B-14B range

Check the Berkeley Function Calling Leaderboard if using tool calling

Run ollama --version and update if behind (brew upgrade ollama)

Test a newer model to see if quality or speed improved
