Model Formats

Overview

Model formats define how neural network weights, architecture metadata, and configuration are stored and loaded for inference or training. Different formats optimize for specific concerns including security, performance, hardware compatibility, and model size through quantization. Understanding these formats is essential for selecting appropriate tools (Ollama, LM Studio, llama.cpp) and deploying models efficiently on target hardware.

The landscape includes several major formats: GGUF for CPU and mixed CPU-GPU inference with advanced quantization, MLX for Apple Silicon optimization, SafeTensors for secure serialization, PyTorch checkpoints for training workflows, and ONNX for cross-framework deployment.

Key technical aspects covered:

GGUF format structure and evolution from GGML
Quantization formats and K-quants family
MLX format for Apple Silicon
SafeTensors security advantages
Format conversion tools and workflows
Quantization algorithms (GPTQ, AWQ)
Performance and quality trade-offs

GGUF Format Structure

GGUF (GGML Universal File) is a binary format designed for efficient storage and loading of large language models, particularly for CPU and mixed CPU-GPU inference. It encapsulates model weights (tensors) and metadata in a single file, enabling rapid loading.

File structure comprises four sections:

Header Section identifies file and provides parsing information:

Magic Number: 4-byte sequence 0x47 0x47 0x55 0x46 (ASCII “GGUF”)
Version: 4-byte unsigned integer indicating format version
Tensor Count: 8-byte unsigned integer specifying number of tensors
Metadata Key-Value Count: 8-byte unsigned integer for metadata pairs
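The header fields above can be decoded with a few lines of Python. This is a minimal sketch, assuming little-endian byte order and the 64-bit count fields listed above; the function name parse_gguf_header and the sample values are made up for illustration and are not part of any GGUF library.

```python
import struct

GGUF_MAGIC = b"GGUF"  # the bytes 0x47 0x47 0x55 0x46

def parse_gguf_header(buf: bytes) -> dict:
    """Decode the fixed 24-byte header at the start of a GGUF file."""
    if len(buf) < 24:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if buf[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file, magic was {buf[:4]!r}")
    # "<" = little-endian (assumed), I = 4-byte unsigned, Q = 8-byte unsigned
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for illustration: version 3, 291 tensors, 24 metadata pairs
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = parse_gguf_header(sample)
```

To inspect a real file, the same function can be given the first 24 bytes read in binary mode.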

Metadata Section contains key-value pairs describing model attributes including architecture details, tokenizer information, quantization parameters. Structure per pair:

Key: UTF-8 encoded string
Value Type: 4-byte unsigned integer indicating data type
Value: Actual value (integers, floats, booleans, strings, arrays)

Tensor Information Section provides metadata for each tensor:

Name: UTF-8 string identifying tensor (e.g., “blk.0.attn_q.weight”)
Number of Dimensions: 4-byte unsigned integer
Dimensions: Array of 8-byte unsigned integers for size of each dimension
Type: 4-byte unsigned integer indicating data type and quantization format
Offset: 8-byte unsigned integer specifying byte offset of the tensor's data, relative to the start of the Tensor Data Section (not the start of the file)

Tensor Data Section contains actual tensor data stored sequentially. Data aligned according to general.alignment metadata field (default 32 bytes). Each tensor begins at offset specified in Tensor Information Section.
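Alignment means each tensor's starting position is rounded up to a multiple of general.alignment, with padding bytes filling the gap. A small sketch of that rounding, assuming the default of 32 bytes; align_offset is a hypothetical helper name, and the byte position used in the example is invented.

```python
def align_offset(offset: int, alignment: int = 32) -> int:
    """Round offset up to the next multiple of alignment (no change if already aligned)."""
    return offset + (alignment - offset % alignment) % alignment

# If the tensor information section happened to end at byte 1_000_003,
# the tensor data section would begin at the next 32-byte boundary.
data_section_start = align_offset(1_000_003)
```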

Advantages over predecessor GGML:

mmap compatibility: Faster loading through memory mapping
Single-file format: All information in one file versus multiple files
Extensibility: New features added without breaking compatibility
Special token support: Essential for prompt engineering and fine-tuning
Broader architecture support: Beyond LLaMA to Falcon, Bloom, Phi, Mistral, Qwen

Quantization Formats and K-Quants

GGUF supports multiple quantization formats balancing model size, inference speed, and accuracy:

Legacy quantization formats:

Q4_0: 4-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q × block_scale. Simple but less accurate than modern methods.

Q8_0: 8-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q × block_scale. Closer to original precision but larger size.
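The formula w = q × block_scale can be illustrated with a simplified 8-bit round-to-nearest quantizer. This is a sketch of the idea only: it assumes the scale is the block's largest absolute value divided by 127, and it does not reproduce the exact bit packing used in GGUF files. The function names are illustrative.

```python
BLOCK_SIZE = 32  # weights per block, as in Q4_0 and Q8_0

def quantize_block_q8(weights):
    """Quantize one block to signed 8-bit integers sharing a single scale."""
    if len(weights) != BLOCK_SIZE:
        raise ValueError("expected exactly 32 weights")
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    """Apply w = q * block_scale to recover approximate weights."""
    return [scale * v for v in q]

weights = [(i - 16) / 16 for i in range(BLOCK_SIZE)]  # toy values in [-1.0, 0.9375]
scale, q = quantize_block_q8(weights)
restored = dequantize_block(scale, q)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The worst-case error per weight is half of one quantization step, which is why a single outlier in a block (a large amax) degrades precision for the other 31 weights.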

K-quants family (modern, recommended):

K-quantization divides model weights into super-blocks containing multiple sub-blocks. Each sub-block has its own scale and, for some types (such as Q4_K), its own minimum value. These scales and minimums are themselves quantized to a limited number of bits (8, 6, or 4, depending on the type). This hierarchical approach maintains accuracy despite reduced precision.

Q2_K: 2-bit quantization with super-blocks. Extremely compressed but significant quality loss.

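Q3_K_S/M/L: 3-bit quantization variants (Small, Medium, Large). The suffix does not change the block layout; it selects how many tensors are kept at a higher-precision K-quant type, trading file size for quality.

Q4_K_S: 4-bit quantization that uses the 4-bit K-quant type for nearly all tensors, prioritizing size reduction.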

Q4_K_M (recommended): 4-bit mixed precision using:

q4_k tensors: Majority of model with 4-bit weights
q6_k tensors: A subset of sensitive tensors (selected attention and feed-forward weights) with 6-bit precision
Weight formula for q4_k tensors: w = q × block_scale(6-bit) + block_min(6-bit), costing 4.5 bits per weight; the q6_k tensors raise the file-wide average slightly above that

Q5_K_S/M: 5-bit quantization variants offering better precision than 4-bit with moderate size increase.

Q6_K: 6-bit quantization with 8-bit scaling factors. Weight formula: w = q × block_scale(8-bit). Near-original quality with reasonable compression.

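Q8_K: 8-bit K-quants block type. In llama.cpp it is used for quantizing intermediate results during computation rather than as a format for stored weights; Q8_0 is the usual 8-bit choice for model files.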

Importance matrix (imatrix) enhances K-quants effectiveness by identifying the weights that matter most. It is computed by running a calibration dataset through the model and accumulating activation statistics for each tensor. During quantization, these statistics weight the rounding error so that scales are chosen to protect the important weights. The llama-imatrix tool computes the imatrix used to guide quantization.

Implementation details: Super-block structures defined in code contain arrays for scales, quantized values, and dequantization factors. Sub-block organization enables efficient vectorized operations during inference.
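The bits-per-weight figures follow from counting every field stored in a super-block. The sketch below does that arithmetic. The layout numbers are assumptions taken from the commonly described llama.cpp K-quant layout, not from this note: 256 weights per super-block, 8 sub-blocks for the 4-bit type, 16 sub-blocks for the 6-bit type, and 16-bit floating-point factors at super-block level.

```python
def bits_per_weight(n_weights, quant_bits, n_sub_blocks, scale_bits,
                    has_min, n_fp16_factors):
    """Total stored bits in one super-block divided by the weights it holds."""
    quant_total = n_weights * quant_bits
    per_sub_block = scale_bits * (2 if has_min else 1)  # scale, plus min if present
    sub_block_total = n_sub_blocks * per_sub_block
    super_block_total = n_fp16_factors * 16
    return (quant_total + sub_block_total + super_block_total) / n_weights

# Assumed 4-bit K-quant layout: 6-bit scale and 6-bit min per sub-block, two fp16 factors
q4_k_bpw = bits_per_weight(256, 4, 8, 6, True, 2)

# Assumed 6-bit K-quant layout: 8-bit scale per sub-block, one fp16 factor
q6_k_bpw = bits_per_weight(256, 6, 16, 8, False, 1)
```

Under these assumptions the 4-bit type comes to 4.5 bits per weight and the 6-bit type to 6.5625, so the metadata overhead is 0.5 bits per weight or slightly more.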

Performance vs quality trade-offs (compression figures are nominal, relative to 16-bit weights):

Q2_K through Q4_K: Significant size reduction (roughly 4-8x), noticeable quality degradation at the low end
Q4_K_M: Balanced choice at roughly 4.5 bits/weight, good quality retention
Q5_K through Q6_K: Minimal quality loss (roughly 2.5-3x compression)
Q8_0: Near-original quality (roughly 2x compression)

Quantization Algorithms: GPTQ and AWQ

Beyond format-level quantization, specific algorithms optimize the quantization process:

GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot weight quantization method using approximate second-order information. Process involves:

Quantizing weights sequentially within each channel or block
Adjusting remaining weights using inverse Hessian matrix calculated from activation values
Maintaining accuracy while reducing model size

Symmetric vs asymmetric quantization:

Symmetric: Quantization range symmetric around zero; same scaling for positive and negative values. Computationally efficient when data distribution centered around zero.
Asymmetric: Quantization range includes zero-point offset. Better for skewed distributions or significant offset from zero.
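The difference between the two schemes shows up clearly on data that is not centered on zero. The following sketch quantizes the same skewed values both ways at 4 bits; the function names and the toy data are illustrative, and real implementations usually work per group or per channel.

```python
def quantize_symmetric(values, bits=4):
    """Range is [-amax, +amax]; one scale, no zero-point."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [scale * x for x in q]

def quantize_asymmetric(values, bits=4):
    """Range is [min, max]; a zero-point shifts the integer grid onto it."""
    levels = 2 ** bits - 1  # 15 steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(levels, round(v / scale) + zero_point)) for v in values]
    return [scale * (x - zero_point) for x in q]

def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

skewed = [1.0 + 0.1 * i for i in range(11)]  # all values lie in [1.0, 2.0]
sym_error = max_error(skewed, quantize_symmetric(skewed))
asym_error = max_error(skewed, quantize_asymmetric(skewed))
```

On this data the symmetric grid wastes its negative half, so its step size is several times coarser and its worst-case error is larger than the asymmetric one.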

GPTQ allows efficient quantization of models with billions of parameters through this sequential approach with second-order compensation.

AWQ (Activation-aware Weight Quantization) focuses on preserving salient weights by:

Analyzing distribution of activations and weights per channel
Identifying the small portion of critical (salient) weights, on the order of 1% or less
Applying larger scaling factor to important weights before quantization
Minimizing accuracy loss through selective precision
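Why scaling helps can be shown with a toy experiment. Multiplying a salient weight by s before quantization and dividing its activation by s leaves the exact output unchanged, but the rounding error on that channel is divided by s as long as the group's largest weight (and therefore the quantization step) stays the same. The sketch below is an illustration of that effect only, not the AWQ algorithm itself: the fixed factor s = 2, the weights, and the activations are invented, whereas AWQ searches for per-channel scales using calibration data.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest with one scale shared by the whole group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [scale * max(-qmax, min(qmax, round(w / scale))) for w in weights]

def output_error(weights, activations, channel_scales):
    """Absolute error of the dot product after scaling, quantizing, and unscaling."""
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    scaled_x = [x / s for x, s in zip(activations, channel_scales)]
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(w * x for w, x in zip(quantize_group(scaled_w), scaled_x))
    return abs(exact - approx)

activations = [0.1, 0.2, 10.0, 0.1]  # channel 2 carries a large activation
rows = [[0.8, -0.5, 0.01 * k, 0.3] for k in range(1, 40)]  # vary the salient weight

plain_error = sum(output_error(r, activations, [1, 1, 1, 1]) for r in rows) / len(rows)
scaled_error = sum(output_error(r, activations, [1, 1, 2, 1]) for r in rows) / len(rows)
```

Averaged over the rows, the scaled variant shows a clearly smaller output error, which is the effect the scaling step relies on.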

Weight clipping strategies align with quantization method:

Asymmetric quantization uses asymmetric clipping
Symmetric quantization uses symmetric clipping
Alignment improves performance in low-bit scenarios

Calibration methods determine optimal quantization parameters:

MinMax: Uses minimum and maximum values from calibration data
Percentile: Uses percentile values to reduce outlier influence
MSE: Minimizes mean squared error between original and quantized weights
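The first two calibration methods differ only in how they pick the range to quantize. A minimal sketch, assuming a simple symmetric trim of each tail for the percentile variant; the function names and the sample data (a uniform spread plus one outlier) are made up for illustration.

```python
def minmax_range(samples):
    """MinMax: use the full observed range, outliers included."""
    return min(samples), max(samples)

def percentile_range(samples, pct=99.0):
    """Percentile: drop the most extreme (100 - pct)% of values from each tail."""
    ordered = sorted(samples)
    n = len(ordered)
    cut = int(n * (100.0 - pct) / 100.0)  # number of values trimmed per tail
    return ordered[cut], ordered[n - 1 - cut]

# 201 values spread over [-1, 1] plus a single outlier at 50
samples = [i / 100 for i in range(-100, 101)] + [50.0]
full_lo, full_hi = minmax_range(samples)
trim_lo, trim_hi = percentile_range(samples, pct=99.0)
```

With MinMax the single outlier stretches the range to 50, so nearly all quantization levels are spent on empty space; the percentile range stays close to [-1, 1] at the cost of clipping the outlier.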

Both GPTQ and AWQ require calibration datasets to determine parameters. Choice depends on model characteristics, target hardware, and accuracy requirements.

MLX Format for Apple Silicon

MLX is Apple’s open-source machine learning framework optimized for Apple Silicon (M1/M2/M3/M4 chips). MLX format is native format for models in this framework.

Unified Memory Architecture is MLX’s primary advantage:

CPU and GPU share same physical memory pool
No data copying between CPU and GPU reducing latency
Larger models fit in memory through efficient sharing
Seamless data movement without explicit transfers

Metal GPU Acceleration utilizes Apple’s Metal API for optimized GPU operations. MLX targets the CPU and GPU; it does not currently schedule work on the Neural Engine, which is reached through Core ML instead.

Lazy Computation materializes arrays only when necessary, optimizing memory usage. Dynamic graph construction allows flexible model architectures without recompilation overhead.

Multi-language support provides APIs for:

Python (primary interface)
Swift (native iOS/macOS integration)
C++ (low-level operations)
C (system integration)

Performance characteristics: Reported benchmarks show MLX achieving ~40% higher throughput than PyTorch on Apple Silicon with batch size 16, and ~15% improvement at optimal batch sizes; such figures vary with model, chip, and library version. The efficiency is particularly beneficial for resource-intensive tasks on Mac hardware.

Format structure: MLX does not define a separate binary container. Converted models are typically stored as a directory holding SafeTensors weight files, a config.json with architecture information, and tokenizer files; quantized variants record their quantization settings in the config.

Conversion from other formats: Tools like mlx-lm convert Hugging Face models (PyTorch/SafeTensors) to MLX format. Process involves loading original model, adapting architecture to MLX operations, and saving in MLX-compatible format.

Use cases: MLX format optimal for:

On-device inference on Mac/iPad/iPhone
Development and testing on Apple Silicon Macs
Applications requiring Unified Memory benefits
Integration with native Apple frameworks

SafeTensors Security and Structure

SafeTensors is a serialization format designed for secure tensor storage without arbitrary code execution risks.

Security advantages over PyTorch pickle:

PyTorch's default checkpoint format traditionally uses Python’s pickle module for model serialization. Pickle can execute arbitrary code during deserialization, creating severe security vulnerabilities:

Attackers embed malicious code in pickled models
Loading untrusted models executes embedded code
Multiple CVE advisories document pickle-related exploits
Requires trusting model sources completely
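The mechanism behind these risks is that a pickled object can name any importable callable to be invoked at load time. The sketch below demonstrates it with a deliberately harmless built-in (len); the Payload class is a made-up stand-in, and an attacker would name something like os.system instead.

```python
import pickle

class Payload:
    """Stand-in for a malicious object hidden inside a checkpoint file."""

    def __reduce__(self):
        # pickle records "call this function with these arguments" and
        # performs the call during loading. len is harmless; the point is
        # that the file, not the loader, chooses what gets called.
        return (len, ("code ran during unpickling",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # runs len(...) instead of rebuilding a Payload
```

After loading, result is the integer returned by the call rather than a Payload instance, which shows that deserialization executed a function chosen by the file's author.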

SafeTensors design eliminates code execution by focusing purely on tensor data serialization:

No code execution during loading
Pure data format without execution capabilities
Metadata and tensor data only
Safe to load from untrusted sources

Format structure:

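Header size: 8-byte unsigned little-endian integer giving length of the JSON header
JSON header: Tensor metadata (data types, shapes, byte offsets) plus optional free-form metadata under the __metadata__ key
Tensor data stored as raw bytes in a single buffer after the header
No executable code or serialized Python objects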
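Because the format is just a length prefix, a JSON header, and raw bytes, it can be written and read with the standard library alone. This is a simplified sketch under the assumption that the file starts with an 8-byte little-endian header length followed by a JSON object whose entries carry dtype, shape, and data_offsets; the helper names are hypothetical, and real code should use the safetensors library, which also validates the header.

```python
import json
import struct

def build_safetensors(tensors: dict) -> bytes:
    """tensors maps name -> (dtype string, shape list, raw bytes)."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        start = len(payload)
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [start, start + len(raw)],  # relative to the data buffer
        }
        payload += raw
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_safetensors_header(blob: bytes):
    """Return the parsed JSON header and the position where tensor data begins."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len

raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # a 2x2 float32 tensor
blob = build_safetensors({"w": ("F32", [2, 2], raw)})
header, data_start = read_safetensors_header(blob)
begin, end = header["w"]["data_offsets"]
values = struct.unpack("<4f", blob[data_start + begin:data_start + end])
```

Loading involves only JSON parsing and byte slicing, so there is no step at which the file can request code execution.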

Performance considerations:

Optimized for tensor operations
Fast loading through memory-mapped files
Efficient for large model weights
Comparable or better performance than pickle for tensor-heavy models

Adoption and compatibility:

Hugging Face Hub standardized on SafeTensors
PyTorch models can be saved/loaded as SafeTensors
Growing ecosystem support
Conversion tools available for legacy formats

Trade-offs:

Cannot serialize arbitrary Python objects (by design)
Focused on tensor data only
May require workflow adjustments from pickle-based systems
Benefits outweigh limitations for model distribution

Format Conversion Tools

Converting between formats enables model deployment across different frameworks and hardware:

SafeTensors to GGUF conversion:

safetensorstogguf toolkit converts SafeTensors to GGUF supporting:

Mixture of Experts (MoE) architecture
Custom tokenizer formats
Quantization features for GGUF output
Command-line interface for batch conversion

llm-gguf-tools Python-based utility provides:

Conversion from SafeTensors to GGUF
Quantization during conversion
Integration with Hugging Face Hub
Bridge between HF ecosystem and llama.cpp

PyTorch to ONNX conversion:

ONNX Runtime Transformer Optimization Tool enables:

Offline optimization of transformer models
Conversion to ONNX format
Various optimization passes for performance
Support for different precision levels

Quark for PyTorch supports:

Export to ONNX with various quantization schemes
int4, int8, fp8, float16, bfloat16 support
Hardware-specific optimizations
AMD ROCm integration

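ONNX to GGUF conversion:

ComfyUI-GGUF adds GGUF loading to ComfyUI for image-generation models and includes scripts for converting SafeTensors checkpoints to GGUF. Direct ONNX to GGUF conversion is not a common, well-supported path; converting from the original PyTorch or SafeTensors weights is generally more reliable.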

Conversion workflow considerations:

Model compatibility: Verify target format supports source architecture and features
Quantization during conversion: Apply quantization to reduce size and improve speed
Validation: Compare outputs between formats to ensure correctness
Metadata preservation: Ensure tokenizer configs, special tokens, and architecture details transfer correctly
Tool documentation: Follow specific tool requirements for successful conversion
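The validation step can be as simple as feeding the same prompt to both versions of the model and comparing the resulting logits. A minimal sketch, assuming the two logit vectors have already been obtained as plain lists of floats; compare_outputs and the sample numbers are hypothetical, and acceptable thresholds depend on the quantization level.

```python
import math

def compare_outputs(reference, candidate):
    """Summarize how closely a converted model's logits track the original's."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    max_abs_diff = max(abs(a - b) for a, b in zip(reference, candidate))
    dot = sum(a * b for a, b in zip(reference, candidate))
    norm = math.sqrt(sum(a * a for a in reference)) * math.sqrt(sum(b * b for b in candidate))
    cosine = dot / norm if norm else 0.0
    top_ref = max(range(len(reference)), key=reference.__getitem__)
    top_cand = max(range(len(candidate)), key=candidate.__getitem__)
    return {
        "max_abs_diff": max_abs_diff,
        "cosine_similarity": cosine,
        "same_top_token": top_ref == top_cand,
    }

# Invented logits for a three-token vocabulary
report = compare_outputs([2.0, 0.5, -1.0], [1.9, 0.6, -1.1])
```

A changed top token or a sharp drop in cosine similarity on ordinary prompts usually points to a conversion problem (for example a mismatched tokenizer or tensor layout) rather than to quantization noise.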

Common conversion paths:

Hugging Face (SafeTensors/PyTorch) → GGUF: For CPU/mixed inference with Ollama/LM Studio
PyTorch → ONNX: For cross-framework deployment and optimization
Hugging Face → MLX: For Apple Silicon optimization
ONNX → GGUF: For specialized model formats to CPU inference

Quantization during conversion: Many tools support applying quantization simultaneously with format conversion, enabling one-step process from full-precision source to quantized target format.

Performance and Quality Trade-offs

Selecting appropriate format and quantization involves balancing multiple factors:

Model size considerations:

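Full precision (FP32/FP16): Original quality; FP16 is the usual baseline for the compression figures below, and FP32 is twice that size
Q8 quantization: ~2x compression relative to FP16, minimal quality loss
Q4_K_M quantization: ~4x nominal compression, good quality retention
Q2_K quantization: Up to ~8x nominal compression, significant quality degradation; real files are somewhat larger because scales and higher-precision tensors add overhead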
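File size can be estimated from parameter count and effective bits per weight. The sketch below applies that arithmetic to a hypothetical 7-billion-parameter model. The 4.5 bits-per-weight figure is the one quoted for Q4_K earlier in this note; the 8.5 figure for Q8_0 is an assumption (32 one-byte weights plus a 16-bit scale per block). Metadata and mixed-precision tensors are ignored.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7e9  # hypothetical 7B-parameter model

sizes = {
    "FP16": model_size_gb(N_PARAMS, 16),
    "Q8_0 (assumed 8.5 bpw)": model_size_gb(N_PARAMS, 8.5),
    "Q4_K (4.5 bpw)": model_size_gb(N_PARAMS, 4.5),
}
compression_q4 = sizes["FP16"] / sizes["Q4_K (4.5 bpw)"]
```

Under these assumptions the 4-bit file is about 3.6 times smaller than FP16 rather than exactly 4 times, which is the overhead of storing scales and minimums.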

Inference speed:

Lower precision generally faster due to reduced memory bandwidth
Quantized operations can use specialized CPU instructions (AVX2, AVX-512)
GPU quantization benefits depend on hardware support
Memory-bound models benefit more from quantization than compute-bound

Hardware-specific considerations:

Apple Silicon: MLX format optimal for Unified Memory architecture. GGUF Q4_K_M reasonable alternative when MLX unavailable.

NVIDIA GPUs: GGUF with CUDA acceleration or native PyTorch/SafeTensors. With sufficient VRAM, higher-precision quantizations (Q6_K, Q8_0) become practical.

AMD GPUs: GGUF with ROCm or Vulkan support. Format support varies by tool.

CPUs: GGUF specifically optimized for CPU inference. Lower precision (Q4_K_M) balances quality and performance on limited CPU resources.

Quality metrics:

Perplexity: Lower is better; measures how well the model predicts held-out text
Task-specific benchmarks: MMLU, HumanEval, TruthfulQA
Subjective quality: Human evaluation for conversational use cases
Quantization typically increases perplexity by 1-10% depending on level
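Perplexity is the exponential of the average negative log-probability the model assigns to each token of a test text. A small sketch of the calculation and of the relative increase between two models; the probabilities are toy values chosen to make the arithmetic easy, not measurements.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_increase(baseline, quantized):
    """Percentage by which the quantized model's perplexity exceeds the baseline."""
    return (quantized - baseline) / baseline * 100.0

# Toy case: the baseline gives every token probability 0.5, the quantized model 0.4
baseline_ppl = perplexity([math.log(0.5)] * 4)
quantized_ppl = perplexity([math.log(0.4)] * 4)
increase_pct = relative_increase(baseline_ppl, quantized_ppl)
```

A model that assigns probability 0.5 to every token has perplexity 2, meaning it is as uncertain as a fair choice between two options; real quantization effects are far smaller than in this toy case.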

Use case alignment:

Production deployment: Q4_K_M or Q5_K_M for balanced performance
Research and development: Full precision or Q8_0 for accuracy
Edge devices: Q2_K or Q3_K for extreme size constraints
Interactive applications: Q4_K_M or higher for quality user experience

The optimal choice depends on specific application requirements, available hardware, and tolerance for quality degradation. Experimentation with different formats and quantization levels recommended to find best balance.
