Model Formats
Overview
Model formats define how neural network weights, architecture metadata, and configuration are stored and loaded for inference or training. Different formats optimize for specific concerns including security, performance, hardware compatibility, and model size through quantization. Understanding these formats is essential for selecting appropriate tools (Ollama, LM Studio, llama.cpp) and deploying models efficiently on target hardware.
The landscape includes several major formats: GGUF for CPU and mixed CPU-GPU inference with advanced quantization, MLX for Apple Silicon optimization, SafeTensors for secure serialization, PyTorch checkpoints for training workflows, and ONNX for cross-framework deployment.
Key technical aspects covered:
- GGUF format structure and evolution from GGML
- Quantization formats and K-quants family
- MLX format for Apple Silicon
- SafeTensors security advantages
- Format conversion tools and workflows
- Quantization algorithms (GPTQ, AWQ)
- Performance and quality trade-offs
GGUF Format Structure
GGUF (GGML Universal File) is a binary format designed for efficient storage and loading of large language models, particularly for CPU and mixed CPU-GPU inference. It encapsulates model weights (tensors) and metadata in a single file, enabling rapid loading.
File structure comprises four sections:
Header Section identifies file and provides parsing information:
- Magic Number: 4-byte sequence 0x47 0x47 0x55 0x46 (ASCII “GGUF”)
- Version: 4-byte unsigned integer indicating format version
- Tensor Count: 8-byte unsigned integer specifying number of tensors
- Metadata Key-Value Count: 8-byte unsigned integer for metadata pairs
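The header fields listed above can be read with a few lines of Python. This is a minimal sketch, assuming little-endian byte order and a synthetic in-memory header. The function name and the example counts are illustrative, not taken from any real model file.

```python
import struct

GGUF_MAGIC = b"GGUF"  # bytes 0x47 0x47 0x55 0x46

def parse_gguf_header(buf: bytes) -> dict:
    """Parse the fixed-size GGUF header (little-endian layout assumed)."""
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic number")
    # I = 4-byte unsigned int (version), Q = 8-byte unsigned int (the two counts)
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for illustration: version 3, 291 tensors, 24 metadata pairs.
example = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = parse_gguf_header(example)
```

On a real file the same function could be applied to the first 24 bytes read from disk.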
Metadata Section contains key-value pairs describing model attributes, including architecture details, tokenizer information, and quantization parameters. Structure per pair:
- Key: UTF-8 encoded string, stored with an 8-byte length prefix rather than a terminator
- Value Type: 4-byte unsigned integer indicating data type
- Value: Actual value (integers, floats, booleans, strings, arrays)
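As an illustration of the key / value-type / value layout, the sketch below reads two synthetic pairs. It assumes length-prefixed strings and the value-type codes 4 (uint32) and 8 (string) from the GGUF specification. Only those two types are handled, and the helper names are made up for this example.

```python
import struct

# Value-type codes assumed from the GGUF specification (subset only).
GGUF_TYPE_UINT32 = 4
GGUF_TYPE_STRING = 8

def read_string(buf, pos):
    """Read a length-prefixed UTF-8 string: 8-byte length, then the bytes."""
    (length,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    return buf[pos:pos + length].decode("utf-8"), pos + length

def read_kv(buf, pos):
    """Read one key / value-type / value triple; returns (key, value, new_pos)."""
    key, pos = read_string(buf, pos)
    (vtype,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    if vtype == GGUF_TYPE_UINT32:
        (value,) = struct.unpack_from("<I", buf, pos)
        pos += 4
    elif vtype == GGUF_TYPE_STRING:
        value, pos = read_string(buf, pos)
    else:
        raise NotImplementedError(f"value type {vtype} not handled in this sketch")
    return key, value, pos

def pack_string(text):
    """Inverse of read_string, used only to build the synthetic example."""
    data = text.encode("utf-8")
    return struct.pack("<Q", len(data)) + data

# Two synthetic pairs: a string value and a uint32 value.
blob = (
    pack_string("general.architecture") + struct.pack("<I", GGUF_TYPE_STRING) + pack_string("llama")
    + pack_string("general.alignment") + struct.pack("<II", GGUF_TYPE_UINT32, 32)
)
k1, v1, pos = read_kv(blob, 0)
k2, v2, pos = read_kv(blob, pos)
```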
Tensor Information Section provides metadata for each tensor:
- Name: UTF-8 string identifying tensor (e.g., “blk.0.attn_q.weight”)
- Number of Dimensions: 4-byte unsigned integer
- Dimensions: Array of 8-byte unsigned integers for size of each dimension
- Type: 4-byte unsigned integer indicating data type and quantization format
- Offset: 8-byte unsigned integer specifying byte offset to tensor data
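Tensor Data Section contains the actual tensor data, stored sequentially. Data is aligned according to the general.alignment metadata field (default 32 bytes). Each tensor begins at the offset given in the Tensor Information Section, and that offset is measured from the start of the tensor data section, not from the start of the file.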
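The alignment rule amounts to rounding each tensor's start up to the next multiple of the alignment. A small sketch follows, with hypothetical tensor sizes in bytes and offsets counted from the start of the tensor data:

```python
def align_offset(offset: int, alignment: int = 32) -> int:
    """Round offset up to the next multiple of alignment."""
    return offset + (alignment - offset % alignment) % alignment

def layout_tensors(sizes, alignment=32):
    """Return the start offset of each tensor, relative to the tensor data section."""
    offsets, cursor = [], 0
    for size in sizes:
        cursor = align_offset(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets

# Hypothetical tensor sizes: 100, 64 and 7 bytes.
offsets = layout_tensors([100, 64, 7])
```

The gap between byte 100 and byte 128 in this example is padding, which is what allows every tensor to be memory-mapped at an aligned address.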
Advantages over predecessor GGML:
- mmap compatibility: Faster loading through memory mapping
- Single-file format: All information in one file versus multiple files
- Extensibility: New features added without breaking compatibility
- Special token support: Essential for prompt engineering and fine-tuning
- Broader architecture support: Beyond LLaMA to Falcon, Bloom, Phi, Mistral, Qwen
Quantization Formats and K-Quants
GGUF supports multiple quantization formats balancing model size, inference speed, and accuracy:
Legacy quantization formats:
Q4_0: 4-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q × block_scale. Simple but less accurate than modern methods.
Q8_0: 8-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q × block_scale. Closer to original precision but larger size.
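The formula w = q × block_scale can be illustrated with a toy Q8_0-style block quantizer. This is a sketch under simplifying assumptions: the scale is kept as a Python float (real implementations store it in reduced precision), and the example weights are synthetic.

```python
BLOCK_SIZE = 32  # weights per block, as stated above

def quantize_q8_0(block):
    """Round-to-nearest 8-bit quantization of one block: returns (scale, ints)."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(w) for w in block)
    scale = amax / 127 if amax > 0 else 1.0
    q = [round(w / scale) for w in block]
    return scale, q

def dequantize(scale, q):
    """Apply w = q * block_scale."""
    return [scale * v for v in q]

# Synthetic weights in roughly [-0.8, 0.8].
weights = [((i * 37) % 64 - 32) / 40 for i in range(BLOCK_SIZE)]
scale, q = quantize_q8_0(weights)
restored = dequantize(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each weight is rounded to the nearest multiple of the scale, the reconstruction error per weight is at most half the scale.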
K-quants family (modern, recommended):
K-quantization divides model weights into super-blocks (256 weights in llama.cpp), each containing multiple sub-blocks. Each sub-block has its own scale and, in some variants, its own minimum value. These scales and minimums are themselves quantized to a limited number of bits (8, 6, or 4, depending on the method). This hierarchical approach maintains accuracy despite the reduced precision.
Q2_K: 2-bit quantization with super-blocks. Extremely compressed but significant quality loss.
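Q3_K_S/M/L: 3-bit quantization variants (Small, Medium, Large). The suffix reflects how many tensors are kept at a higher-precision type rather than a different super-block size, so larger variants trade file size for quality.
Q4_K_S: 4-bit quantization that uses q4_k for nearly all tensors, prioritizing size reduction.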
Q4_K_M (recommended): 4-bit mixed precision using:
- q4_k tensors: Majority of model with 4-bit weights
- q6_k tensors: A subset of sensitive tensors (some attention and feed-forward weights) with 6-bit precision
- Weight formula for q4_k tensors: w = q × block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits per weight
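The 4.5 bits-per-weight figure can be reproduced with simple arithmetic. The layouts below are assumptions based on the llama.cpp block definitions (256-weight super-blocks, 16-bit super-block factors), and the helper function is purely illustrative.

```python
def bits_per_weight(n_weights, weight_bits, n_sub_blocks, scale_bits, has_min,
                    super_scale_bits=16):
    """Average storage cost per weight for one super-block."""
    per_sub_block = scale_bits * (2 if has_min else 1)  # scale, plus min if present
    n_super = 2 if has_min else 1                       # super-block scale (and min scale)
    total = (n_weights * weight_bits
             + n_sub_blocks * per_sub_block
             + n_super * super_scale_bits)
    return total / n_weights

# Assumed q4_k layout: 256 weights, 8 sub-blocks of 32, 6-bit scales and mins.
q4_k = bits_per_weight(256, 4, 8, 6, has_min=True)
# Assumed q6_k layout: 256 weights, 16 sub-blocks of 16, 8-bit scales, no mins.
q6_k = bits_per_weight(256, 6, 16, 8, has_min=False)
```

Under these assumptions q4_k costs 4.5 bits per weight and q6_k costs 6.5625. A Q4_K_M file mixes both tensor types, so its overall average sits somewhat above 4.5.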
Q5_K_S/M: 5-bit quantization variants offering better precision than 4-bit with moderate size increase.
Q6_K: 6-bit quantization with 8-bit scaling factors. Weight formula: w = q × block_scale(8-bit). Near-original quality with reasonable compression.
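Q8_K: 8-bit quantization using the K-quants super-block structure. In llama.cpp it is used for quantizing intermediate results rather than as a model file type; Q8_0 is the usual 8-bit choice for stored weights.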
An importance matrix (imatrix) enhances K-quants effectiveness by estimating which weights matter most. It is computed by running forward passes over a calibration dataset and accumulating activation statistics for each weight tensor. During quantization, these statistics weight the rounding error so that block parameters are chosen to protect the most influential weights. The llama-imatrix tool computes the imatrix used to guide quantization.
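One way to picture importance-guided quantization is as a search for the block scale that minimizes an importance-weighted error. The sketch below is a simplified illustration of that idea, not the llama.cpp implementation. The importance values, the 4-bit range, and the grid search are all assumptions.

```python
def quantize(block, scale, qmax=7):
    """Round each weight to the nearest level in [-qmax, qmax] at the given scale."""
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in block]

def weighted_error(block, restored, importance):
    """Squared reconstruction error, weighted per weight by its importance."""
    return sum(imp * (w - r) ** 2 for w, r, imp in zip(block, restored, importance))

def best_scale(block, importance, qmax=7, steps=50):
    """Grid-search the scale (0.5x to 1.5x of max/qmax) with the lowest weighted error."""
    amax = max(abs(w) for w in block)
    candidates = [amax / qmax * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates,
               key=lambda s: weighted_error(block, quantize(block, s, qmax), importance))

block = [0.9, -0.13, 0.27, 0.05, -0.41, 0.33, -0.08, 0.6]
uniform = [1.0] * len(block)
skewed = [1.0, 1.0, 50.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical: third weight matters most

s_uniform = best_scale(block, uniform)
s_skewed = best_scale(block, skewed)
err_uniform = weighted_error(block, quantize(block, s_uniform), skewed)
err_skewed = weighted_error(block, quantize(block, s_skewed), skewed)
```

Since both scales come from the same candidate grid, the scale chosen with the importance weights can never do worse on the importance-weighted error than the scale chosen without them.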
Implementation details: Super-block structures defined in code contain arrays for scales, quantized values, and dequantization factors. Sub-block organization enables efficient vectorized operations during inference.
Performance vs quality trade-offs:
- Q2_K through Q4_K: Significant size reduction (4-8x), noticeable quality degradation
- Q4_K_M: Balanced choice with 4.5 bits/weight, good quality retention
- Q5_K through Q6_K: Minimal quality loss (2-4x compression)
- Q8_0: Near-original quality (2x compression)
Quantization Algorithms: GPTQ and AWQ
Beyond format-level quantization, specific algorithms optimize the quantization process:
GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot weight quantization method using approximate second-order information. Process involves:
- Quantizing weights sequentially within each channel or block
- Adjusting remaining weights using inverse Hessian matrix calculated from activation values
- Maintaining accuracy while reducing model size
Symmetric vs asymmetric quantization:
- Symmetric: Quantization range symmetric around zero; same scaling for positive and negative values. Computationally efficient when data distribution centered around zero.
- Asymmetric: Quantization range includes zero-point offset. Better for skewed distributions or significant offset from zero.
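The difference between the two schemes shows up clearly on data that is not centered on zero. The following sketch compares them at 4 bits on a synthetic, skewed range. The function names and data are illustrative.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric: one scale, integer range centered on zero."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def quantize_asymmetric(values, bits=4):
    """Asymmetric: scale plus zero-point so the integer range covers [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)
    q = [max(0, min(levels, round(v / scale) + zero_point)) for v in values]
    return [(x - zero_point) * scale for x in q]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic skewed data: evenly spaced values from -0.5 to 3.0.
skewed = [-0.5 + 0.035 * i for i in range(101)]
err_sym = mse(skewed, quantize_symmetric(skewed))
err_asym = mse(skewed, quantize_asymmetric(skewed))
```

The symmetric scheme spends half of its integer range on negative values that barely occur here, so its step size is larger and its error higher than the asymmetric scheme on this data.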
GPTQ allows efficient quantization of models with billions of parameters through this sequential approach with second-order compensation.
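The sequential compensation idea can be sketched on a single weight row. This is a deliberately naive illustration, assuming a tiny Hessian built from hypothetical calibration inputs and an explicit matrix inverse. Practical GPTQ implementations use a Cholesky-based formulation and process weights in batches, neither of which is shown here.

```python
def invert(matrix):
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def gptq_row(weights, hessian, step=1.0):
    """Quantize one row left to right, compensating the not-yet-quantized weights."""
    w = list(weights)
    h_inv = invert(hessian)
    n = len(w)
    for i in range(n):
        q = round(w[i] / step) * step
        err = (w[i] - q) / h_inv[i][i]
        w[i] = q
        for j in range(i + 1, n):
            w[j] -= err * h_inv[j][i]  # push the rounding error onto later weights
        # Remove index i from the inverse Hessian (rank-one downdate).
        d = h_inv[i][i]
        col = [h_inv[r][i] for r in range(n)]
        row = list(h_inv[i])
        for r in range(i + 1, n):
            for c in range(i + 1, n):
                h_inv[r][c] -= col[r] * row[c] / d
    return w

def output_error(original, quantized, samples):
    """Sum of squared differences in the layer output over calibration inputs."""
    return sum(sum((o - q) * x for o, q, x in zip(original, quantized, s)) ** 2
               for s in samples)

samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # hypothetical calibration inputs
hessian = [[sum(s[i] * s[j] for s in samples) for j in range(2)] for i in range(2)]
weights = [0.4, 0.4]

naive = [float(round(w)) for w in weights]   # plain round-to-nearest: [0.0, 0.0]
compensated = gptq_row(weights, hessian)     # second weight absorbs the first one's error
```

In this toy case rounding the first weight down pushes the second weight up to 0.6, so it rounds to 1 instead of 0, and the output error on the calibration inputs drops from 0.96 to 0.56.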
AWQ (Activation-aware Weight Quantization) focuses on preserving salient weights by:
- Analyzing the distribution of activations per channel to find the weight channels that matter most
- Identifying the small portion of critical weights (roughly 1% or less)
- Applying larger scaling factor to important weights before quantization
- Minimizing accuracy loss through selective precision
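The effect of scaling a salient weight before quantization can be shown on a single group of four weights. This is a toy sketch: the weights, the choice of salient channel, and the scaling factor of 2 are hypothetical. In a real pipeline the inverse scale would be folded into the preceding activations rather than divided back out of the weights.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest with one shared scale for the group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def awq_style_quantize(weights, channel_scales, bits=4):
    """Scale weights up per input channel, quantize, then undo the scaling."""
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    restored = quantize_group(scaled, bits)
    return [r / s for r, s in zip(restored, channel_scales)]

weights = [0.17, -0.8, 0.33, 0.52]  # one output row; each entry multiplies one input channel
scales = [2.0, 1.0, 1.0, 1.0]       # hypothetical: channel 0 was found to be salient

plain = quantize_group(weights)
protected = awq_style_quantize(weights, scales)
err_plain = abs(weights[0] - plain[0])
err_protected = abs(weights[0] - protected[0])
```

Because the scaled weight still lies below the group maximum, the shared quantization step is unchanged, and dividing by the scaling factor afterwards shrinks the rounding error on the salient weight.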
Weight clipping strategies align with quantization method:
- Asymmetric quantization uses asymmetric clipping
- Symmetric quantization uses symmetric clipping
- Alignment improves performance in low-bit scenarios
Calibration methods determine optimal quantization parameters:
- MinMax: Uses minimum and maximum values from calibration data
- Percentile: Uses percentile values to reduce outlier influence
- MSE: Minimizes mean squared error between original and quantized weights
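The three calibration methods differ only in how they pick the clipping threshold. A rough sketch on synthetic data with one outlier follows. The 99th percentile, the 4-bit width, and the grid of 100 candidate thresholds are arbitrary choices for illustration.

```python
def fake_quant(values, clip, bits=4):
    """Clip to [-clip, clip], then symmetric round-to-nearest."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def minmax_clip(values):
    """MinMax: use the largest magnitude seen in the calibration data."""
    return max(abs(v) for v in values)

def percentile_clip(values, pct=99.0):
    """Percentile: ignore the most extreme magnitudes."""
    ordered = sorted(abs(v) for v in values)
    return ordered[int(pct / 100 * (len(ordered) - 1))]

def mse_clip(values, bits=4, steps=100):
    """MSE: try thresholds from 1% to 100% of the max and keep the lowest-error one."""
    top = minmax_clip(values)

    def squared_error(clip):
        return sum((v - q) ** 2 for v, q in zip(values, fake_quant(values, clip, bits)))

    return min((top * i / steps for i in range(1, steps + 1)), key=squared_error)

# Bulk of the data in [-0.5, 0.5] plus a single outlier at 8.0.
data = [0.01 * i for i in range(-50, 51)] + [8.0]
clips = {
    "minmax": minmax_clip(data),
    "percentile": percentile_clip(data),
    "mse": mse_clip(data),
}
```

MinMax follows the outlier, the percentile method ignores it, and the MSE search keeps whichever threshold reconstructs the data best, which depends on how costly clipping the outlier is compared with the precision gained on the bulk.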
Both GPTQ and AWQ require calibration datasets to determine parameters. Choice depends on model characteristics, target hardware, and accuracy requirements.
MLX Format for Apple Silicon
MLX is Apple’s open-source machine learning framework optimized for Apple Silicon (M1/M2/M3/M4 chips). The MLX format is the native format for models in this framework.
Unified Memory Architecture is MLX’s primary advantage:
- CPU and GPU share same physical memory pool
- No data copying between CPU and GPU reducing latency
- Larger models fit in memory through efficient sharing
- Seamless data movement without explicit transfers
Metal GPU Acceleration utilizes Apple’s Metal API for optimized GPU operations. MLX currently targets the CPU and GPU and can run operations on either device; it does not schedule work on the Neural Engine, which is reached through Core ML instead.
Lazy Computation materializes arrays only when necessary, optimizing memory usage. Dynamic graph construction allows flexible model architectures without recompilation overhead.
Multi-language support provides APIs for:
- Python (primary interface)
- Swift (native iOS/macOS integration)
- C++ (low-level operations)
- C (system integration)
Performance characteristics: Reported benchmarks show MLX achieving ~40% higher throughput than PyTorch on Apple Silicon with batch size 16, and ~15% improvement at optimal batch sizes. These figures vary with model, precision, and chip generation. The efficiency is particularly beneficial for resource-intensive tasks on Mac hardware.
Format structure: An MLX model is typically stored as a directory containing the weights in SafeTensors files, a config.json describing the architecture, and the tokenizer files. MLX can also load weights from NumPy .npz archives and GGUF files.
Conversion from other formats: Tools like mlx-lm convert Hugging Face models (PyTorch/SafeTensors) to MLX format. Process involves loading original model, adapting architecture to MLX operations, and saving in MLX-compatible format.
Use cases: MLX format optimal for:
- On-device inference on Mac/iPad/iPhone
- Development and testing on Apple Silicon Macs
- Applications requiring Unified Memory benefits
- Integration with native Apple frameworks
SafeTensors Security and Structure
SafeTensors is a serialization format designed for secure tensor storage without arbitrary code execution risks.
Security advantages over PyTorch pickle:
PyTorch traditionally uses Python’s pickle module for model serialization (torch.save and torch.load). Pickle can execute arbitrary code during deserialization, creating severe security vulnerabilities:
- Attackers embed malicious code in pickled models
- Loading untrusted models executes embedded code
- Multiple CVE advisories document pickle-related exploits
- Requires trusting model sources completely
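The mechanism behind these risks is that a pickled object may name any callable to be invoked at load time. The harmless sketch below demonstrates this with an arithmetic expression. The class name is made up, and the payload is intentionally benign.

```python
import pickle

class Payload:
    """An object whose pickled form names a callable to run at load time."""

    def __reduce__(self):
        # Benign stand-in: a malicious file could name a far more dangerous callable.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the unpickler calls eval("6 * 7") here
```

Loading the bytes does not return a Payload instance at all: it returns whatever the named callable produced, which shows that the file author, not the loader, decides what runs.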
SafeTensors design eliminates code execution by focusing purely on tensor data serialization:
- No code execution during loading
- Pure data format without execution capabilities
- Metadata and tensor data only
- Safe to load from untrusted sources
Format structure:
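- 8-byte little-endian unsigned integer giving the size of the header
- JSON header mapping each tensor name to its data type, shape, and byte offsets, with an optional __metadata__ entry for additional information
- Tensor data stored as raw bytes after the header
- No executable code or serialized Python objects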
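The layout is simple enough to write and read with the standard library alone. The sketch below assumes the documented SafeTensors layout (8-byte little-endian header length, JSON header, raw byte buffer) and omits the validation a production reader would need. In practice the safetensors library should be used instead.

```python
import json
import struct

def write_safetensors(tensors: dict) -> bytes:
    """tensors maps name -> (dtype string, shape list, raw bytes)."""
    header, payload, cursor = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [cursor, cursor + len(raw)],
        }
        payload += raw
        cursor += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_safetensors(blob: bytes) -> dict:
    """Parse the header, then slice each tensor's bytes out of the data buffer."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    data = blob[8 + header_len:]
    out = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue  # free-form string metadata, not a tensor
        start, end = info["data_offsets"]
        out[name] = (info["dtype"], info["shape"], data[start:end])
    return out

raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # a 2x2 float32 tensor
blob = write_safetensors({"layer.weight": ("F32", [2, 2], raw)})
loaded = read_safetensors(blob)
```

Nothing in the read path evaluates code: the header is parsed as JSON and the tensors are plain byte slices, which is the property the format is designed around.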
Performance considerations:
- Optimized for tensor operations
- Fast loading through memory-mapped files
- Efficient for large model weights
- Comparable or better performance than pickle for tensor-heavy models
Adoption and compatibility:
- Hugging Face Hub standardized on SafeTensors
- PyTorch models can be saved/loaded as SafeTensors
- Growing ecosystem support
- Conversion tools available for legacy formats
Trade-offs:
- Cannot serialize arbitrary Python objects (by design)
- Focused on tensor data only
- May require workflow adjustments from pickle-based systems
- Benefits outweigh limitations for model distribution
Format Conversion Tools
Converting between formats enables model deployment across different frameworks and hardware:
SafeTensors to GGUF conversion:
safetensorstogguf toolkit converts SafeTensors to GGUF supporting:
- Mixture of Experts (MoE) architecture
- Custom tokenizer formats
- Quantization features for GGUF output
- Command-line interface for batch conversion
llm-gguf-tools Python-based utility provides:
- Conversion from SafeTensors to GGUF
- Quantization during conversion
- Integration with Hugging Face Hub
- Bridge between HF ecosystem and llama.cpp
PyTorch to ONNX conversion:
ONNX Runtime Transformer Optimization Tool enables:
- Offline optimization of transformer models
- Conversion to ONNX format
- Various optimization passes for performance
- Support for different precision levels
Quark for PyTorch supports:
- Export to ONNX with various quantization schemes
- int4, int8, fp8, float16, bfloat16 support
- Hardware-specific optimizations
- AMD ROCm integration
ONNX to GGUF conversion:
ComfyUI-GGUF facilitates conversion from multiple formats to GGUF including ONNX, SafeTensors, and PyTorch. Provides workflow for preparing models for GGUF-based inference engines.
Conversion workflow considerations:
- Model compatibility: Verify target format supports source architecture and features
- Quantization during conversion: Apply quantization to reduce size and improve speed
- Validation: Compare outputs between formats to ensure correctness
- Metadata preservation: Ensure tokenizer configs, special tokens, and architecture details transfer correctly
- Tool documentation: Follow specific tool requirements for successful conversion
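For the validation step, one common approach is to run the same prompt through both models and compare the output logits. A minimal sketch follows, with hypothetical logit values and tolerances that would need tuning per model and quantization level:

```python
import math

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def outputs_match(reference, converted, atol=1e-2, min_cosine=0.999):
    """True when two logit vectors agree within the chosen tolerances."""
    return (max_abs_diff(reference, converted) <= atol
            and cosine_similarity(reference, converted) >= min_cosine)

reference = [2.31, -0.47, 0.05, 1.12]      # hypothetical logits from the source model
converted = [2.309, -0.472, 0.051, 1.118]  # hypothetical logits after conversion
ok = outputs_match(reference, converted)
```

A lossless conversion should pass with tight tolerances, while a quantized target will need looser ones, so the thresholds here are placeholders rather than recommendations.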
Common conversion paths:
- Hugging Face (SafeTensors/PyTorch) → GGUF: For CPU/mixed inference with Ollama/LM Studio
- PyTorch → ONNX: For cross-framework deployment and optimization
- Hugging Face → MLX: For Apple Silicon optimization
- ONNX → GGUF: For specialized model formats to CPU inference
Quantization during conversion: Many tools support applying quantization simultaneously with format conversion, enabling one-step process from full-precision source to quantized target format.
Performance and Quality Trade-offs
Selecting appropriate format and quantization involves balancing multiple factors:
Model size considerations:
- Full precision (FP32/FP16): Largest files (4 bytes per weight for FP32, 2 bytes for FP16) but original quality
- Q8 quantization: 2x compression, minimal quality loss
- Q4_K_M quantization: ~4x compression, good quality retention
- Q2_K quantization: ~8x compression, significant quality degradation
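These ratios follow from bits per weight. The sketch below estimates weight storage for a hypothetical 7-billion-parameter model. The per-type figures are assumptions based on the llama.cpp block layouts, and real files mix tensor types and add metadata, so actual sizes differ somewhat.

```python
# Assumed bits per weight for individual tensor types.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_K": 4.5}

def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model.
sizes = {name: estimated_size_gb(7e9, bpw) for name, bpw in BITS_PER_WEIGHT.items()}
```

Under these assumptions the model needs about 14 GB at FP16, about 7.4 GB at Q8_0, and about 3.9 GB with q4_k tensors.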
Inference speed:
- Lower precision generally faster due to reduced memory bandwidth
- Quantized operations can use specialized CPU instructions (AVX2, AVX-512)
- GPU quantization benefits depend on hardware support
- Memory-bound models benefit more from quantization than compute-bound
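The memory-bandwidth point can be made concrete with a back-of-the-envelope bound. It assumes that generating one token requires reading every weight once (single-stream decoding of a dense model). The bandwidth and model sizes are hypothetical.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when weight reads dominate."""
    return bandwidth_bytes_per_s / model_bytes

BANDWIDTH = 100e9  # hypothetical 100 GB/s of memory bandwidth

fp16_bound = max_tokens_per_second(14e9, BANDWIDTH)  # hypothetical 14 GB FP16 model
q4_bound = max_tokens_per_second(4e9, BANDWIDTH)     # same model at roughly 4 GB
```

Real throughput will be lower than these bounds, but the ratio between them illustrates why a smaller quantized model can decode several times faster on the same memory system.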
Hardware-specific considerations:
Apple Silicon: MLX format optimal for Unified Memory architecture. GGUF Q4_K_M reasonable alternative when MLX unavailable.
NVIDIA GPUs: GGUF with CUDA acceleration or native PyTorch/SafeTensors. Higher precision (Q6_K, Q8_0) leverages GPU compute better.
AMD GPUs: GGUF with ROCm or Vulkan support. Format support varies by tool.
CPUs: GGUF specifically optimized for CPU inference. Lower precision (Q4_K_M) balances quality and performance on limited CPU resources.
Quality metrics:
- Perplexity: Lower is better, measures prediction accuracy
- Task-specific benchmarks: MMLU, HumanEval, TruthfulQA
- Subjective quality: Human evaluation for conversational use cases
- Quantization typically increases perplexity by 1-10% depending on level
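Perplexity is the exponential of the average negative log-probability that the model assigns to the correct tokens. A short sketch, with made-up log-probabilities and helper names:

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability of the true tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_increase(ppl_reference, ppl_quantized):
    """Fractional perplexity increase of a quantized model over its reference."""
    return (ppl_quantized - ppl_reference) / ppl_reference

# A model that assigns probability 0.25 to every correct token has perplexity 4.
uniform = [math.log(0.25)] * 10
ppl = perplexity(uniform)
increase = relative_increase(5.0, 5.25)  # hypothetical perplexities
```

The hypothetical pair 5.0 and 5.25 corresponds to a 5% increase, which falls inside the 1-10% range quoted above.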
Use case alignment:
- Production deployment: Q4_K_M or Q5_K_M for balanced performance
- Research and development: Full precision or Q8_0 for accuracy
- Edge devices: Q2_K or Q3_K for extreme size constraints
- Interactive applications: Q4_K_M or higher for quality user experience
The optimal choice depends on specific application requirements, available hardware, and tolerance for quality degradation. Experimenting with different formats and quantization levels is recommended to find the best balance.