Model Formats

Overview

Model formats define how neural network weights, architecture metadata, and configuration are stored and loaded for inference or training. Different formats optimize for specific concerns including security, performance, hardware compatibility, and model size through quantization. Understanding these formats is essential for selecting appropriate tools (Ollama, LM Studio, llama.cpp) and deploying models efficiently on target hardware.

The landscape includes several major formats: GGUF for CPU and mixed CPU-GPU inference with advanced quantization, MLX for Apple Silicon optimization, SafeTensors for secure serialization, PyTorch checkpoints for training workflows, and ONNX for cross-framework deployment.

Key technical aspects covered:

  • GGUF format structure and evolution from GGML
  • Quantization formats and K-quants family
  • MLX format for Apple Silicon
  • SafeTensors security advantages
  • Format conversion tools and workflows
  • Quantization algorithms (GPTQ, AWQ)
  • Performance and quality trade-offs

GGUF Format Structure

GGUF (GGML Universal File) is a binary format designed for efficient storage and loading of large language models, particularly for CPU and mixed CPU-GPU inference. It encapsulates model weights (tensors) and metadata in a single file, enabling rapid loading.

File structure comprises four sections:

Header Section identifies file and provides parsing information:

  • Magic Number: 4-byte sequence 0x47 0x47 0x55 0x46 (ASCII “GGUF”)
  • Version: 4-byte unsigned integer indicating format version
  • Tensor Count: 8-byte unsigned integer specifying number of tensors
  • Metadata Key-Value Count: 8-byte unsigned integer for metadata pairs
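The header fields listed above can be read with a few lines of code. The following is a minimal sketch, assuming the little-endian layout with 8-byte counts; the function name and the sample values (version 3, 291 tensors, 24 metadata pairs) are hypothetical and chosen only for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"  # bytes 0x47 0x47 0x55 0x46


def parse_gguf_header(buf: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumed little-endian)."""
    # 4-byte magic, 4-byte version, 8-byte tensor count, 8-byte KV count
    magic, version, tensor_count, kv_count = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header built in memory; real files continue with metadata after it.
sample = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(sample)
print(header)
```

To inspect a real file, the first 24 bytes read from disk can be passed to the same function.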

Metadata Section contains key-value pairs describing model attributes including architecture details, tokenizer information, quantization parameters. Structure per pair:

  • Key: UTF-8 encoded string, stored with an 8-byte length prefix
  • Value Type: 4-byte unsigned integer indicating data type
  • Value: Actual value (integers, floats, booleans, strings, arrays)
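A metadata pair can be decoded as sketched below. This is an illustrative reader, not a complete one: it assumes little-endian data, length-prefixed strings, and the value-type codes of the GGUF specification for four scalar types only (arrays and the other integer widths are left out). The helper names are hypothetical.

```python
import struct

# Subset of GGUF value-type codes (assumed from the GGUF specification).
GGUF_TYPE_UINT32 = 4
GGUF_TYPE_FLOAT32 = 6
GGUF_TYPE_BOOL = 7
GGUF_TYPE_STRING = 8


def read_string(buf: bytes, pos: int):
    """Read an 8-byte length followed by that many UTF-8 bytes."""
    (length,) = struct.unpack_from("<Q", buf, pos)
    start = pos + 8
    return buf[start:start + length].decode("utf-8"), start + length


def read_kv(buf: bytes, pos: int):
    """Read one key, its value type, and its value; return the new position."""
    key, pos = read_string(buf, pos)
    (value_type,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    if value_type == GGUF_TYPE_UINT32:
        (value,) = struct.unpack_from("<I", buf, pos)
        pos += 4
    elif value_type == GGUF_TYPE_FLOAT32:
        (value,) = struct.unpack_from("<f", buf, pos)
        pos += 4
    elif value_type == GGUF_TYPE_BOOL:
        (value,) = struct.unpack_from("<?", buf, pos)
        pos += 1
    elif value_type == GGUF_TYPE_STRING:
        value, pos = read_string(buf, pos)
    else:
        raise NotImplementedError(f"value type {value_type} not handled in this sketch")
    return key, value, pos


def encode_string(text: str) -> bytes:
    data = text.encode("utf-8")
    return struct.pack("<Q", len(data)) + data


# Two synthetic pairs: an integer and a string.
sample = encode_string("general.alignment") + struct.pack("<II", GGUF_TYPE_UINT32, 32)
sample += encode_string("general.architecture") + struct.pack("<I", GGUF_TYPE_STRING)
sample += encode_string("llama")

key1, value1, pos = read_kv(sample, 0)
key2, value2, pos = read_kv(sample, pos)
print(key1, value1)
print(key2, value2)
```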

Tensor Information Section provides metadata for each tensor:

  • Name: UTF-8 string identifying tensor (e.g., “blk.0.attn_q.weight”)
  • Number of Dimensions: 4-byte unsigned integer
  • Dimensions: Array of 8-byte unsigned integers for size of each dimension
  • Type: 4-byte unsigned integer indicating data type and quantization format
  • Offset: 8-byte unsigned integer specifying the byte offset of the tensor data, relative to the start of the Tensor Data Section

Tensor Data Section contains the actual tensor data, stored sequentially. Data is aligned according to the general.alignment metadata field (default 32 bytes), so padding bytes may separate consecutive tensors. Each tensor begins at the offset specified in the Tensor Information Section.
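The alignment rule amounts to rounding each offset up to the next multiple of the alignment value. A small sketch, with hypothetical tensor sizes in bytes:

```python
def align_offset(offset: int, alignment: int = 32) -> int:
    """Round offset up to the next multiple of alignment."""
    return offset + (alignment - offset % alignment) % alignment


def layout_offsets(tensor_sizes, alignment: int = 32):
    """Return the aligned start offset of each tensor stored back to back."""
    offsets = []
    position = 0
    for size in tensor_sizes:
        position = align_offset(position, alignment)
        offsets.append(position)
        position += size
    return offsets


# Hypothetical byte sizes; the 100-byte tensor is followed by 28 bytes of padding.
print(layout_offsets([100, 64, 7]))
```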

Advantages over predecessor GGML:

  • mmap compatibility: Faster loading through memory mapping
  • Single-file format: All information in one file versus multiple files
  • Extensibility: New features added without breaking compatibility
  • Special token support: Essential for prompt engineering and fine-tuning
  • Broader architecture support: Beyond LLaMA to Falcon, Bloom, Phi, Mistral, Qwen

Quantization Formats and K-Quants

GGUF supports multiple quantization formats balancing model size, inference speed, and accuracy:

Legacy quantization formats:

Q4_0: 4-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q × block_scale. Simple but less accurate than modern methods.

Q8_0: 8-bit round-to-nearest quantization with 32 weights per block. Weight formula: w = q × block_scale. Closer to original precision but larger size.
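The formula w = q × block_scale can be illustrated with a simplified Q8_0-style block quantizer. This is a sketch of the idea only: the scale is kept as a Python float here, whereas the on-disk format stores it in reduced precision, and the function names are hypothetical.

```python
import random

BLOCK_SIZE = 32  # weights per block, as described above


def quantize_q8_0_block(weights):
    """Return (scale, int8-range values) for one block of 32 weights."""
    assert len(weights) == BLOCK_SIZE
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    # Round to nearest and clamp to the signed 8-bit range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Reconstruct weights as w = q * block_scale."""
    return [scale * v for v in q]


random.seed(0)
block = [random.uniform(-1.0, 1.0) for _ in range(BLOCK_SIZE)]
scale, q = quantize_q8_0_block(block)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

With round-to-nearest, the reconstruction error of each weight stays within half a quantization step (scale / 2).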

K-quants family (modern, recommended):

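K-quantization divides model weights into super-blocks (256 weights in llama.cpp), each containing multiple sub-blocks of 16 or 32 weights. Each sub-block has its own scale and, for some types (Q2_K, Q4_K, Q5_K), its own minimum value. These scales and minimums are themselves quantized to a limited number of bits (8, 6, or 4, depending on the method) relative to a higher-precision super-block scale. This hierarchical approach maintains accuracy despite reduced precision.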

Q2_K: 2-bit quantization with super-blocks. Extremely compressed but significant quality loss.

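Q3_K_S/M/L: 3-bit quantization variants (Small, Medium, Large). The suffix selects how many tensors are stored in a higher-precision K-quant type, not the super-block size: S uses the 3-bit type throughout, while M and L promote selected attention and feed-forward tensors to 4-bit or 5-bit types.

Q4_K_S: 4-bit quantization that uses the q4_k type for all tensors, prioritizing size reduction.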

Q4_K_M (recommended): 4-bit mixed precision using:

  • q4_k tensors: Majority of model with 4-bit weights
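  • q6_k tensors: Selected sensitive tensors (a subset of the attention and feed-forward weights) with 6-bit precision
  • Weight formula for q4_k tensors: w = q × block_scale(6-bit) + block_min(6-bit), resulting in 4.5 bits per weight for that type; the mixed file averages slightly more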

Q5_K_S/M: 5-bit quantization variants offering better precision than 4-bit with moderate size increase.

Q6_K: 6-bit quantization with 8-bit scaling factors. Weight formula: w = q × block_scale(8-bit). Near-original quality with reasonable compression.
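The bits-per-weight figures follow from simple arithmetic over one block. The sketch below assumes the llama.cpp block layouts (256-weight super-blocks; 8 sub-blocks of 32 weights for the 4-bit K type with 6-bit scales and minimums plus two 16-bit super-block fields; 16 sub-blocks of 16 weights for Q6_K with 8-bit scales plus one 16-bit field). These layout details are assumptions not stated in the text above, and the function name is hypothetical.

```python
def bits_per_weight(n_weights, weight_bits, n_sub_blocks, scale_bits,
                    has_min, fp16_fields):
    """Average storage cost per weight for one quantization block."""
    payload = n_weights * weight_bits
    per_sub_block = scale_bits * (2 if has_min else 1)  # scale, plus min if present
    overhead = n_sub_blocks * per_sub_block + fp16_fields * 16
    return (payload + overhead) / n_weights


# Assumed layouts (see lead-in); legacy formats have no sub-blocks.
q4_k = bits_per_weight(256, 4, 8, 6, has_min=True, fp16_fields=2)
q6_k = bits_per_weight(256, 6, 16, 8, has_min=False, fp16_fields=1)
q4_0 = bits_per_weight(32, 4, 0, 0, has_min=False, fp16_fields=1)
q8_0 = bits_per_weight(32, 8, 0, 0, has_min=False, fp16_fields=1)

print(q4_k, q6_k, q4_0, q8_0)  # 4.5 6.5625 4.5 8.5
```

Under these assumptions the 4-bit K type costs the same 4.5 bits per weight as Q4_0 while also storing a per-sub-block minimum, which is where its accuracy advantage comes from.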

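Q8_K: 8-bit quantization using the K-quants structure. In llama.cpp it is used for quantizing intermediate results during computation rather than as a model file type; Q8_0 is the usual 8-bit choice for stored weights, with minimal quality loss from full precision.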

An importance matrix (imatrix) enhances K-quants effectiveness by estimating which weights matter most. It is computed by running a calibration dataset through the model and accumulating activation statistics for each weight position. During quantization, these statistics weight the rounding error so that scales are chosen to reduce error on the most important weights. The llama-imatrix tool computes the imatrix that guides quantization.
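One way to picture importance-guided quantization is a scale search that minimizes an importance-weighted error. The sketch below illustrates that idea only and is not the routine llama.cpp uses; the function names, candidate grid, and sample values are hypothetical.

```python
def quantize(weights, scale, qmax=7):
    """Symmetric round-to-nearest onto a grid with the given scale."""
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, restored, importance):
    """Squared error with each weight scaled by its importance."""
    return sum(i * (w - r) ** 2 for w, r, i in zip(weights, restored, importance))


def best_scale(weights, importance, qmax=7, steps=20):
    """Try the naive max-based scale plus smaller candidates; keep the best."""
    base = max(abs(w) for w in weights) / qmax
    candidates = [base] + [base * (0.7 + 0.3 * k / steps) for k in range(steps)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, qmax), importance),
    )


weights = [0.91, 0.20, -0.47, 0.33, -0.08, 0.61, -0.77, 0.05]
importance = [0.1, 5.0, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1]  # hypothetical statistics

naive = max(abs(w) for w in weights) / 7
chosen = best_scale(weights, importance)
err_naive = weighted_error(weights, quantize(weights, naive), importance)
err_chosen = weighted_error(weights, quantize(weights, chosen), importance)
print(f"naive={err_naive:.6f} chosen={err_chosen:.6f}")
```

Because the naive scale is always among the candidates, the chosen scale can never do worse under the weighted metric.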

Implementation details: Super-block structures defined in code contain arrays for scales, quantized values, and dequantization factors. Sub-block organization enables efficient vectorized operations during inference.

Performance vs quality trade-offs:

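  • Q2_K through Q4_K: Significant size reduction (roughly 3-5x relative to FP16), with quality degradation that becomes noticeable toward the 2-bit and 3-bit end
  • Q4_K_M: Balanced choice at roughly 4.5-5 bits/weight, good quality retention
  • Q5_K through Q6_K: Minimal quality loss (roughly 2.4-3x compression relative to FP16)
  • Q8_0: Near-original quality (just under 2x compression relative to FP16)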

Quantization Algorithms: GPTQ and AWQ

Beyond format-level quantization, specific algorithms optimize the quantization process:

GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot weight quantization method using approximate second-order information. Process involves:

  1. Quantizing weights sequentially within each channel or block
  2. Adjusting remaining weights using inverse Hessian matrix calculated from activation values
  3. Maintaining accuracy while reducing model size
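The quantize-then-compensate loop can be sketched for a single row of weights. This is a simplified illustration in the spirit of the steps above, not the full GPTQ algorithm (which processes column blocks across a whole layer and uses a Cholesky form of the inverse Hessian). The Hessian values, grid scale, and function names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse; assumes a small positive-definite matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_grid(value, scale):
    return round(value / scale) * scale


def quantize_with_compensation(weights, hessian, scale):
    """Quantize weights one by one, pushing each error onto the remaining ones."""
    w = list(weights)
    n = len(w)
    h_inv = invert(hessian)
    quantized = [0.0] * n
    for i in range(n):
        quantized[i] = round_to_grid(w[i], scale)
        err = (w[i] - quantized[i]) / h_inv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * h_inv[i][j]  # second-order compensation
        # Remove weight i from the inverse Hessian of the remaining weights.
        pivot = h_inv[i][i]
        col_i = [h_inv[r][i] for r in range(n)]
        for r in range(i + 1, n):
            for c in range(i + 1, n):
                h_inv[r][c] -= col_i[r] * h_inv[i][c] / pivot
    return quantized


def output_error(weights, quantized, hessian):
    """Quadratic form d^T H d, proportional to the layer output error."""
    d = [a - b for a, b in zip(weights, quantized)]
    n = len(d)
    return sum(d[r] * hessian[r][c] * d[c] for r in range(n) for c in range(n))


weights = [0.4, 0.4]
hessian = [[1.0, 0.9], [0.9, 1.0]]  # strongly correlated inputs (hypothetical)
rtn = [round_to_grid(w, 1.0) for w in weights]
compensated = quantize_with_compensation(weights, hessian, 1.0)
print(rtn, output_error(weights, rtn, hessian))
print(compensated, output_error(weights, compensated, hessian))
```

In this example plain rounding sends both weights to 0, while compensation rounds the second weight up to offset the error made on the first, which lowers the output error.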

Symmetric vs asymmetric quantization:

  • Symmetric: Quantization range symmetric around zero; same scaling for positive and negative values. Computationally efficient when data distribution centered around zero.
  • Asymmetric: Quantization range includes zero-point offset. Better for skewed distributions or significant offset from zero.
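The difference between the two schemes is easiest to see side by side. A minimal sketch, assuming 8-bit integers and a small skewed sample; the function names and data are hypothetical.

```python
def quantize_symmetric(values, bits=8):
    """Signed grid centered on zero; one scale, no offset."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [v * scale for v in q]


def quantize_asymmetric(values, bits=8):
    """Unsigned grid covering [min, max]; the zero-point maps to real 0."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]


data = [-2.0, 0.0, 3.1, 8.0]  # skewed toward positive values

q_sym, scale_sym = quantize_symmetric(data)
q_asym, scale_asym, zero_point = quantize_asymmetric(data)

err_sym = max(abs(a - b) for a, b in zip(data, dequantize_symmetric(q_sym, scale_sym)))
err_asym = max(abs(a - b) for a, b in
               zip(data, dequantize_asymmetric(q_asym, scale_asym, zero_point)))
print(f"symmetric:  step={scale_sym:.4f} max_error={err_sym:.4f}")
print(f"asymmetric: step={scale_asym:.4f} max_error={err_asym:.4f}")
```

On skewed data the symmetric grid spends half of its levels on a negative range that is barely used, so its step size is larger than the asymmetric one.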

GPTQ allows efficient quantization of models with billions of parameters through this sequential approach with second-order compensation.

AWQ (Activation-aware Weight Quantization) focuses on preserving salient weights by:

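  1. Analyzing the distribution of activations and weights per channel
  2. Identifying the small portion of salient weights (roughly 0.1-1%), based on activation magnitude
  3. Scaling up the salient weight channels before quantization, with the inverse scale applied to the corresponding activations
  4. Minimizing accuracy loss through this selective scaling rather than mixed-precision storage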
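The effect of scaling a salient channel can be shown on a toy row of weights. This sketch uses a fixed, hand-picked scaling factor, whereas AWQ itself searches for per-channel scales from calibration activations; all values and names below are hypothetical.

```python
def quantize_sym(values, qmax=7):
    """4-bit symmetric round-to-nearest with one scale for the whole group."""
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


weights = [0.9, 0.2, -0.5, 0.3]        # one output row, four input channels
activations = [1.0, 10.0, 1.0, 1.0]    # channel 1 carries large activations
channel_scales = [1.0, 2.0, 1.0, 1.0]  # s > 1 only for the salient channel

exact = sum(w * x for w, x in zip(weights, activations))

# Baseline: quantize the weights directly.
plain_q = quantize_sym(weights)
plain_out = sum(w * x for w, x in zip(plain_q, activations))

# Activation-aware: scale weights up, quantize, and scale activations down.
scaled_q = quantize_sym([w * s for w, s in zip(weights, channel_scales)])
awq_out = sum(w * (x / s) for w, x, s in zip(scaled_q, activations, channel_scales))

print(f"exact={exact:.4f} plain={plain_out:.4f} scaled={awq_out:.4f}")
```

Multiplying a weight by s and dividing its activation by s leaves the exact product unchanged, but the salient weight now lands on a relatively finer part of the grid, so its rounding error shrinks after the division.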

Weight clipping strategies align with quantization method:

  • Asymmetric quantization uses asymmetric clipping
  • Symmetric quantization uses symmetric clipping
  • Alignment improves performance in low-bit scenarios

Calibration methods determine optimal quantization parameters:

  • MinMax: Uses minimum and maximum values from calibration data
  • Percentile: Uses percentile values to reduce outlier influence
  • MSE: Minimizes mean squared error between original and quantized weights
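The three calibration strategies can be compared on a sample containing one outlier. A minimal sketch with hypothetical function names, a nearest-rank percentile, and a simple grid search for the MSE variant:

```python
def minmax_range(values):
    """Use the observed extremes directly."""
    return min(values), max(values)


def percentile_range(values, lower=1.0, upper=99.0):
    """Nearest-rank percentiles, which ignore a few extreme values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lower / 100 * last)], ordered[round(upper / 100 * last)]


def quant_mse(values, clip, qmax=127):
    """Mean squared error of symmetric quantization with clipping at +/- clip."""
    scale = clip / qmax
    total = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        total += (v - q * scale) ** 2
    return total / len(values)


def mse_clip(values, qmax=127, steps=50):
    """Search clipping thresholds up to the absolute maximum; keep the best."""
    amax = max(abs(v) for v in values)
    candidates = [amax * k / steps for k in range(1, steps + 1)]
    return min(candidates, key=lambda c: quant_mse(values, c, qmax))


values = [float(v) for v in range(100)] + [1000.0]  # one large outlier

print(minmax_range(values))
print(percentile_range(values))
best = mse_clip(values)
print(best, quant_mse(values, best))
```

Which method wins depends on the bit width and on how much the outliers matter downstream, so the choice is usually validated on held-out data.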

Both GPTQ and AWQ require calibration datasets to determine parameters. Choice depends on model characteristics, target hardware, and accuracy requirements.

MLX Format for Apple Silicon

MLX is Apple’s open-source machine learning framework optimized for Apple Silicon (M-series chips such as M1 through M4). "MLX format" refers to model checkpoints laid out for loading with this framework.

Unified Memory Architecture is MLX’s primary advantage:

  • CPU and GPU share same physical memory pool
  • No data copying between CPU and GPU reducing latency
  • Larger models fit in memory through efficient sharing
  • Seamless data movement without explicit transfers

Metal GPU Acceleration utilizes Apple’s Metal API for optimized GPU operations. MLX targets the CPU and GPU; it does not currently run on the Neural Engine, which is reached through Core ML instead. Because of unified memory, operations can be dispatched to either device without copying arrays.

Lazy Computation materializes arrays only when necessary, optimizing memory usage. Dynamic graph construction allows flexible model architectures without recompilation overhead.

Multi-language support provides APIs for:

  • Python (primary interface)
  • Swift (native iOS/macOS integration)
  • C++ (low-level operations)
  • C (system integration)

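Performance characteristics: Reported benchmarks show MLX achieving ~40% higher throughput than PyTorch on Apple Silicon at batch size 16, and ~15% improvement at optimal batch sizes. These figures depend on the model, chip, and framework versions, so they should be re-measured for a given workload. The efficiency is particularly beneficial for resource-intensive tasks on Mac hardware.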

Format structure: MLX does not define a separate binary container. Models converted for MLX are typically stored as a directory containing SafeTensors weight files, a config.json describing the architecture, and tokenizer files. MLX can also load weights from NumPy .npz and GGUF files.

Conversion from other formats: Tools like mlx-lm convert Hugging Face models (PyTorch/SafeTensors) to MLX format. Process involves loading original model, adapting architecture to MLX operations, and saving in MLX-compatible format.

Use cases: MLX format optimal for:

  • On-device inference on Mac/iPad/iPhone
  • Development and testing on Apple Silicon Macs
  • Applications requiring Unified Memory benefits
  • Integration with native Apple frameworks

SafeTensors Security and Structure

SafeTensors is a serialization format designed for secure tensor storage without arbitrary code execution risks.

Security advantages over PyTorch pickle:

PyTorch checkpoints traditionally rely on Python’s pickle module for model serialization. Pickle can execute arbitrary code during deserialization, creating severe security vulnerabilities:

  • Attackers embed malicious code in pickled models
  • Loading untrusted models executes embedded code
  • Multiple CVE advisories document pickle-related exploits
  • Requires trusting model sources completely
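The mechanism behind these risks is that a pickled object can name any callable to be invoked at load time. The following harmless sketch demonstrates it with an arithmetic expression; the class name is hypothetical, and a real attack would substitute a damaging call.

```python
import pickle


class Payload:
    """An object that tells pickle to call eval() when it is loaded."""

    def __reduce__(self):
        # (callable, args): pickle.loads will run eval("6 * 7").
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload instance; it executes the callable.
result = pickle.loads(blob)
print(result)  # 42
```

Nothing in the byte stream looks like a model weight, yet loading it runs code, which is why pickle-based checkpoints from untrusted sources are dangerous.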

SafeTensors design eliminates code execution by focusing purely on tensor data serialization:

  • No code execution during loading
  • Pure data format without execution capabilities
  • Metadata and tensor data only
  • Loading a file from an untrusted source cannot trigger code execution through the format itself

Format structure:

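  • 8-byte little-endian unsigned integer giving the length of the header
  • JSON header mapping each tensor name to its data type, shape, and byte offsets, with an optional __metadata__ entry for additional information
  • Tensor data stored as raw bytes after the header
  • No executable code or serialized Python objects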
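Because the header is plain JSON behind a length prefix, it can be read without any special library. The sketch below builds a tiny file in memory and parses it back using only the standard library; the helper names and the tensor name "w" are hypothetical.

```python
import json
import struct


def build_safetensors(tensors: dict) -> bytes:
    """tensors maps name -> (dtype string, shape list, raw little-endian bytes)."""
    header = {}
    data = b""
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(data), len(data) + len(raw)],
        }
        data += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def read_safetensors_header(blob: bytes):
    """Return (parsed header, byte position where tensor data starts)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len


raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
blob = build_safetensors({"w": ("F32", [2, 2], raw)})

header, data_start = read_safetensors_header(blob)
begin, end = header["w"]["data_offsets"]
values = struct.unpack("<4f", blob[data_start + begin:data_start + end])
print(header["w"], values)
```

Parsing involves only an integer read and a JSON decode, so there is no step at which the file could cause code to run.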

Performance considerations:

  • Optimized for tensor operations
  • Fast loading through memory-mapped files
  • Efficient for large model weights
  • Comparable or better performance than pickle for tensor-heavy models

Adoption and compatibility:

  • Hugging Face libraries and the Hub use SafeTensors as the default weight format
  • PyTorch models can be saved/loaded as SafeTensors
  • Growing ecosystem support
  • Conversion tools available for legacy formats

Trade-offs:

  • Cannot serialize arbitrary Python objects (by design)
  • Focused on tensor data only
  • May require workflow adjustments from pickle-based systems
  • Benefits outweigh limitations for model distribution

Format Conversion Tools

Converting between formats enables model deployment across different frameworks and hardware:

SafeTensors to GGUF conversion:

safetensorstogguf toolkit converts SafeTensors to GGUF supporting:

  • Mixture of Experts (MoE) architecture
  • Custom tokenizer formats
  • Quantization features for GGUF output
  • Command-line interface for batch conversion

llm-gguf-tools Python-based utility provides:

  • Conversion from SafeTensors to GGUF
  • Quantization during conversion
  • Integration with Hugging Face Hub
  • Bridge between HF ecosystem and llama.cpp

PyTorch to ONNX conversion:

ONNX Runtime Transformer Optimization Tool enables:

  • Offline optimization of transformer models
  • Conversion to ONNX format
  • Various optimization passes for performance
  • Support for different precision levels

Quark for PyTorch supports:

  • Export to ONNX with various quantization schemes
  • int4, int8, fp8, float16, bfloat16 support
  • Hardware-specific optimizations
  • AMD ROCm integration

ONNX to GGUF conversion:

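ComfyUI-GGUF adds GGUF support to ComfyUI and includes tooling for preparing models for GGUF-based inference. Its conversion tooling is oriented toward SafeTensors checkpoints, and direct ONNX to GGUF conversion is not a well-established path; when the original PyTorch or SafeTensors checkpoint is available, converting from that source is generally more reliable.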

Conversion workflow considerations:

  1. Model compatibility: Verify target format supports source architecture and features
  2. Quantization during conversion: Apply quantization to reduce size and improve speed
  3. Validation: Compare outputs between formats to ensure correctness
  4. Metadata preservation: Ensure tokenizer configs, special tokens, and architecture details transfer correctly
  5. Tool documentation: Follow specific tool requirements for successful conversion

Common conversion paths:

  • Hugging Face (SafeTensors/PyTorch) → GGUF: For CPU/mixed inference with Ollama/LM Studio
  • PyTorch → ONNX: For cross-framework deployment and optimization
  • Hugging Face → MLX: For Apple Silicon optimization
  • ONNX → GGUF: For bringing specialized models to CPU inference (tool support is limited; prefer the original checkpoint when available)

Quantization during conversion: Many tools support applying quantization simultaneously with format conversion, enabling one-step process from full-precision source to quantized target format.

Performance and Quality Trade-offs

Selecting appropriate format and quantization involves balancing multiple factors:

Model size considerations:

  • Full precision (FP32/FP16): 2-4x larger than an 8-bit quantization, but original quality
  • Q8 quantization: ~2x compression relative to FP16, minimal quality loss
  • Q4_K_M quantization: roughly 3.3x compression relative to FP16, good quality retention
  • Q2_K quantization: roughly 5x compression relative to FP16, significant quality degradation
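File size can be estimated from the parameter count and the average bits per weight. A rough sketch for a hypothetical 7-billion-parameter model; the per-type figures are nominal block costs (assumed from the llama.cpp layouts), and real files differ because they mix tensor types and carry metadata.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # hypothetical model size

# Nominal bits per weight for single tensor types (assumed values).
nominal_bpw = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_K": 4.5}

for name, bpw in nominal_bpw.items():
    size = model_size_gb(N_PARAMS, bpw)
    ratio = nominal_bpw["FP16"] / bpw
    print(f"{name:5s} {size:6.2f} GB  {ratio:4.2f}x smaller than FP16")
```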

Inference speed:

  • Lower precision generally faster due to reduced memory bandwidth
  • Quantized operations can use specialized CPU instructions (AVX2, AVX-512)
  • GPU quantization benefits depend on hardware support
  • Memory-bound models benefit more from quantization than compute-bound
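For memory-bound generation, a common back-of-the-envelope estimate divides memory bandwidth by model size. The sketch below assumes every generated token reads all weights once and ignores compute, caches, and the KV cache, so it gives a rough ceiling rather than a prediction; the bandwidth figure is hypothetical.

```python
def max_tokens_per_second(model_size_gb: float, memory_bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed when memory-bound."""
    return memory_bandwidth_gb_s / model_size_gb


BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth

for label, size_gb in [("FP16, 14 GB", 14.0), ("8-bit, 7.4 GB", 7.4), ("4-bit, 3.9 GB", 3.9)]:
    bound = max_tokens_per_second(size_gb, BANDWIDTH_GB_S)
    print(f"{label:14s} <= {bound:5.1f} tokens/s")
```

The estimate shows why halving the model size roughly doubles the attainable decode speed on bandwidth-limited hardware.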

Hardware-specific considerations:

Apple Silicon: MLX format optimal for Unified Memory architecture. GGUF Q4_K_M reasonable alternative when MLX unavailable.

NVIDIA GPUs: GGUF with CUDA acceleration or native PyTorch/SafeTensors. Higher-precision quantizations (Q6_K, Q8_0) are practical when the model still fits in VRAM.

AMD GPUs: GGUF with ROCm or Vulkan support. Format support varies by tool.

CPUs: GGUF specifically optimized for CPU inference. Lower precision (Q4_K_M) balances quality and performance on limited CPU resources.

Quality metrics:

  • Perplexity: Lower is better, measures prediction accuracy
  • Task-specific benchmarks: MMLU, HumanEval, TruthfulQA
  • Subjective quality: Human evaluation for conversational use cases
  • Quantization typically increases perplexity by roughly 1-10% depending on level; the exact figure varies by model and should be measured rather than assumed
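Perplexity is the exponential of the average negative log-probability that the model assigns to the correct tokens. A minimal sketch with hypothetical per-token probabilities for a baseline and a quantized model:

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def relative_increase(baseline: float, quantized: float) -> float:
    """Percentage change in perplexity caused by quantization."""
    return (quantized - baseline) / baseline * 100


# Hypothetical probabilities assigned to the correct next token.
baseline_probs = [0.50, 0.25, 0.40, 0.10, 0.30]
quantized_probs = [0.48, 0.24, 0.38, 0.09, 0.29]

ppl_base = perplexity(baseline_probs)
ppl_quant = perplexity(quantized_probs)
print(f"baseline={ppl_base:.3f} quantized={ppl_quant:.3f} "
      f"increase={relative_increase(ppl_base, ppl_quant):.1f}%")
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, which matches the intuition of choosing among four equally likely options.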

Use case alignment:

  • Production deployment: Q4_K_M or Q5_K_M for balanced performance
  • Research and development: Full precision or Q8_0 for accuracy
  • Edge devices: Q2_K or Q3_K for extreme size constraints
  • Interactive applications: Q4_K_M or higher for quality user experience

The optimal choice depends on specific application requirements, available hardware, and tolerance for quality degradation. Experimenting with different formats and quantization levels is recommended to find the best balance.