Historical Context and Design Goals

The PDF file format was designed as a final-form interchange format whose primary invariant is visual fidelity. Its architecture derives from PostScript, a page description language intended to drive printers deterministically. PDF drops PostScript’s general-purpose programmability (procedures, loops, conditionals) while preserving the assumption that a document is a fixed sequence of pages defined by imperative rendering instructions. The format optimizes for device independence, reproducibility, and long-term stability rather than semantic expressiveness or reflowability.

Standardization under ISO 32000 codified this rendering-first model. Later extensions introduced optional facilities for metadata, tagging, and accessibility without altering the authoritative role of the rendering model.

Page Description Languages vs Semantic Document Formats

A page description language specifies how marks are painted on a page: glyph placement, vector paths, raster images, and transformations. Meaning is implicit and incidental. A semantic document format encodes logical entities (paragraphs, headings, tables, equations) and derives layout as a view over structure.

PDF is fundamentally a page description container. Although it can embed semantic annotations, the authoritative representation of a page is the ordered sequence of graphics operators. Logical reading order, text continuity, and document hierarchy are not intrinsic properties of the format and must be inferred or externally supplied.

Internal Architecture of PDF

Core Object Model

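A PDF file consists of a graph of indirect objects located through a cross-reference table (or, since PDF 1.5, a cross-reference stream). Objects include dictionaries, arrays, primitive values, and streams. Streams carry page content, images, and font programs. The format supports incremental updates by appending new objects and cross-reference sections, which enables edits without rewriting the file. Superseded objects remain in the file but are no longer reachable from the latest cross-reference section, so the visible document can obscure its own revision history and provenance.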
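The lookup behavior described above can be sketched as a toy model. This is not a PDF parser: each revision is represented as a plain dictionary standing in for one cross-reference section, and the names `Ref`, `resolve`, `original`, and `update` are hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as `4 0 R` (object number, generation)."""
    num: int
    gen: int = 0


def resolve(revisions, ref):
    """Look up an object, letting later revisions shadow earlier ones."""
    for xref in reversed(revisions):
        if (ref.num, ref.gen) in xref:
            return xref[(ref.num, ref.gen)]
    raise KeyError(ref)


# Original file body: a catalog, a page tree, one page, one content stream.
original = {
    (1, 0): {"Type": "Catalog", "Pages": Ref(2)},
    (2, 0): {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
    (3, 0): {"Type": "Page", "Parent": Ref(2), "Contents": Ref(4)},
    (4, 0): b"BT /F1 12 Tf 72 720 Td (Draft) Tj ET",
}
# Incremental update appended later: only the changed object is rewritten.
update = {(4, 0): b"BT /F1 12 Tf 72 720 Td (Final) Tj ET"}

revisions = [original, update]
catalog = resolve(revisions, Ref(1))
pages = resolve(revisions, catalog["Pages"])
page = resolve(revisions, pages["Kids"][0])

current = resolve(revisions, page["Contents"])       # what a viewer shows
previous = resolve(revisions[:1], page["Contents"])  # still present in the file
```

In this model the earlier content stream is never deleted; it is simply shadowed, which is why an incrementally updated file can still carry text that no longer appears on the page.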

The object model imposes minimal structure. Pages are dictionaries organized in a page tree; higher-level semantics exist only if explicitly encoded by the producer.

Rendering Model

Rendering is defined by a graphics state machine. Operators mutate state (coordinate transforms, clipping paths, font selection) and emit marks: paths, glyphs, images. Text rendering consists of selecting a font, setting a text matrix, and showing strings of character codes that the font maps to glyphs. The renderer needs no notion of Unicode characters, words, or language; it executes drawing instructions.

This model guarantees deterministic visual output while making semantic recovery non-canonical.
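A minimal sketch of such a state machine follows. It assumes the content stream has already been tokenized into a Python list, handles only the operators `q`, `Q`, `cm`, `BT`, `ET`, `Tf`, `Td`, and `Tj`, and deliberately ignores glyph widths, so the text matrix does not advance after `Tj`. The function names `mul` and `run` are hypothetical.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)


def mul(m, n):
    """Multiply two affine matrices in PDF's [a b c d e f] row-vector form."""
    a, b, c, d, e, f = m
    a2, b2, c2, d2, e2, f2 = n
    return (a * a2 + b * c2, a * b2 + b * d2,
            c * a2 + d * c2, c * b2 + d * d2,
            e * a2 + f * c2 + e2, e * b2 + f * d2 + f2)


def run(tokens):
    """Execute a tiny operator subset and return the marks that get painted."""
    state = {"ctm": IDENTITY, "font": None, "size": None}
    saved = []          # graphics state stack for q / Q
    tm = IDENTITY       # text matrix, reset by BT
    operands, marks = [], []
    for tok in tokens:
        # Numbers, byte strings, and /Names are operands; bare strings are operators.
        if not isinstance(tok, str) or tok.startswith("/"):
            operands.append(tok)
            continue
        if tok == "q":
            saved.append(dict(state))
        elif tok == "Q":
            state = saved.pop()
        elif tok == "cm":
            state["ctm"] = mul(tuple(operands), state["ctm"])
        elif tok == "BT":
            tm = IDENTITY
        elif tok == "Tf":
            state["font"], state["size"] = operands
        elif tok == "Td":
            tm = mul((1, 0, 0, 1, operands[0], operands[1]), tm)
        elif tok == "Tj":
            x, y = mul(tm, state["ctm"])[4:]
            marks.append((state["font"], state["size"], x, y, operands[0]))
        operands = []   # ET and unknown operators fall through as no-ops
    return marks


tokens = ["q", 2, 0, 0, 2, 10, 20, "cm",
          "BT", "/F1", 12, "Tf", 5, 5, "Td", b"\x01\x02", "Tj", "ET", "Q",
          "BT", "/F2", 9, "Tf", 5, 5, "Td", b"\x03", "Tj", "ET"]
marks = run(tokens)
```

The output is a list of positioned byte strings tagged with a font name. Nothing in it identifies words, lines, or reading order, which is the point the section makes.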

Text and Semantics in PDF

Glyph-Centric Text Representation

Text in PDF is glyph-centric, not character-centric. Strings in a content stream carry character codes that select glyphs in a font program; the mapping from those codes to Unicode code points is supplied by an optional ToUnicode CMap. If these mappings are missing or incorrect, text extraction becomes heuristic. Even when present, logical order is not guaranteed, as glyphs may be painted in an order optimized for layout rather than reading.

Search and copy/paste behavior therefore depend on producer choices rather than on format guarantees.
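Both failure modes can be illustrated with a short sketch. The ToUnicode map is modeled here as a plain dictionary from character code to string, and the position-based sort is an assumed heuristic that works only for simple single-column layouts. The names `extract_text`, `content_order`, and `top_to_bottom` are hypothetical.

```python
def extract_text(codes, to_unicode=None):
    """Map character codes to text; U+FFFD marks codes with no mapping."""
    if to_unicode is None:
        return None  # no map at all: a real extractor would fall back to guesses
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


def content_order(runs):
    """Join text runs in the order the producer painted them."""
    return " ".join(text for _, _, text in runs)


def top_to_bottom(runs):
    """Heuristic reading order: higher y first, then left to right."""
    ordered = sorted(runs, key=lambda r: (-r[1], r[0]))
    return " ".join(text for _, _, text in ordered)


# A subset font whose codes are arbitrary; code 4 has no ToUnicode entry.
cmap = {1: "P", 2: "D", 3: "F", 5: "fi"}  # one code may map to several characters
mapped = extract_text([1, 2, 3, 4], cmap)
ligature = extract_text([5], cmap)
unmapped = extract_text([1, 2, 3], None)

# Runs as (x, y, text), listed in paint order rather than reading order.
runs = [(300, 700, "world"), (72, 700, "Hello"), (72, 680, "again")]
painted = content_order(runs)
sorted_text = top_to_bottom(runs)
```

The sort recovers a plausible order for this example, but it would interleave the lines of a two-column page, which is why position-based extraction remains a heuristic rather than a guarantee.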

Logical Structure and Tagging

PDF supports an optional logical structure tree that associates rendered content with semantic roles such as paragraphs, headings, and lists. This structure enables reliable accessibility and reading order but has no effect on rendering. Because it is optional and costly to maintain, it is frequently absent, especially in scanned or legacy documents.

The absence of tagging does not affect visual correctness but renders semantic extraction lossy.
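As a rough illustration of what tagging supplies, the sketch below models a structure tree whose leaves are marked-content identifiers (MCIDs) on a page. The class `StructElem`, the function `reading_order`, and the sample page are hypothetical; a real structure tree is stored as PDF dictionaries, not Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class StructElem:
    role: str                                  # e.g. "Document", "H1", "P"
    kids: list = field(default_factory=list)   # child StructElem or int MCID


def reading_order(elem):
    """Depth-first walk yielding (role, mcid) pairs in logical order."""
    for kid in elem.kids:
        if isinstance(kid, StructElem):
            yield from reading_order(kid)
        else:
            yield (elem.role, kid)


# Marked content as painted on the page: the footer first, the title last.
painted = {0: "Page 7 of 12", 1: "Body paragraph.", 2: "Chapter Title"}

# The tree references only real content; the footer is left out as an artifact.
tree = StructElem("Document", [
    StructElem("H1", [2]),
    StructElem("P", [1]),
])

logical = [(role, painted[mcid]) for role, mcid in reading_order(tree)]
paint_order = [painted[mcid] for mcid in sorted(painted)]
```

With the tree, the heading and paragraph come out in reading order with their roles attached. Without it, an extractor has only `paint_order` to work from.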

Scanned and Scan-Derived PDFs

Image-Only PDFs

A scanned PDF typically embeds one raster image per page. From the renderer’s perspective, such a document is indistinguishable from a sequence of photographs. No text objects exist; search and text extraction are impossible without OCR because there is no symbolic layer.
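One rough way to spot such pages is to check whether a decoded content stream paints an XObject but never shows text. The sketch below is a heuristic under stated assumptions: it splits on whitespace instead of tokenizing properly, so it can be fooled by strings that contain spaces or by operators written without separating whitespace, and it does not look inside form XObjects, which may themselves contain text. The name `looks_image_only` is hypothetical.

```python
# Text-showing operators: Tj, TJ, ' (next line and show), " (set spacing and show).
TEXT_SHOWING = {b"Tj", b"TJ", b"'", b'"'}


def looks_image_only(content: bytes) -> bool:
    """Guess whether a decoded page content stream paints images but no text."""
    tokens = content.split()
    paints_xobject = b"Do" in tokens
    shows_text = any(tok in TEXT_SHOWING for tok in tokens)
    return paints_xobject and not shows_text


scanned_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q"
typeset_page = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
hybrid_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q BT 3 Tr /F1 12 Tf 72 720 Td (Hello) Tj ET"

results = [looks_image_only(p) for p in (scanned_page, typeset_page, hybrid_page)]
```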

DjVu and Scan-Optimized Containers

DjVu is a scan-optimized document format that separates page content into layered components (background, foreground strokes, masks) to achieve high compression ratios. DjVu files may include OCR text, but this is not intrinsic to the format. When DjVu content is converted to PDF without OCR, only the rasterized appearance survives; metadata indicating DjVu provenance reflects origin, not semantic capability.

OCR and Hybrid PDFs

OCR as Text Layer Injection

Optical character recognition augments image-only PDFs by injecting an invisible text layer aligned to the underlying images. OCR pipelines perform page segmentation, glyph recognition, sequence modeling, and embedding of recognized text as positioned PDF text objects. The original images remain authoritative for rendering.

An OCRed PDF is therefore a hybrid artifact: visual fidelity derives from raster data, while searchability derives from an inferred symbolic overlay.
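The embedding step can be sketched as generating a content-stream fragment that places each recognized word in text rendering mode 3, which neither fills nor strokes the glyphs. The sketch assumes OCR output as `(text, left, top, right, bottom)` pixel boxes with a top-left origin, ASCII-only text, and a font resource named `/F1` that exists on the page. Using the bottom of the box as the baseline and the box height as the font size are simplifications, and the name `invisible_text_ops` is hypothetical.

```python
def invisible_text_ops(words, image_height_px, dpi, font="/F1"):
    """Build PDF text operators that overlay invisible words on a page image."""
    scale = 72.0 / dpi            # pixels to PDF points (72 points per inch)
    ops = ["BT", "3 Tr"]          # rendering mode 3: invisible text
    for text, left, top, right, bottom in words:
        x = left * scale
        y = (image_height_px - bottom) * scale   # flip to a bottom-left origin
        size = (bottom - top) * scale            # approximate font size from box height
        escaped = (text.replace("\\", "\\\\")
                       .replace("(", "\\(")
                       .replace(")", "\\)"))
        ops.append(f"{font} {size:.2f} Tf 1 0 0 1 {x:.2f} {y:.2f} Tm ({escaped}) Tj")
    ops.append("ET")
    return "\n".join(ops)


# One word recognized on a 300 dpi scan of a US Letter page (2550 x 3300 px).
words = [("Hello", 300, 150, 600, 210)]
layer = invisible_text_ops(words, image_height_px=3300, dpi=300)
```

The fragment would be appended to the page content after the image is painted, so the image continues to determine appearance while the overlay supplies selectable, searchable text.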

Mathematical and Technical Content

Mathematical notation encodes meaning spatially rather than linearly. Superscripts, subscripts, fractions, and alignment form a two-dimensional syntax that cannot be reconstructed from glyph recognition alone. General OCR systems therefore treat displayed equations as images while indexing surrounding prose and labels.

Equation numbers and references are typically searchable even when the equations themselves are not extractable.
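A toy example shows why linear glyph recognition is not enough. The glyphs below are hypothetical `(char, x, y, size)` tuples for the expression x² + y₁. The `with_scripts` rule is an assumed heuristic that treats the first glyph as the baseline reference and handles only single-level superscripts and subscripts; fractions, radicals, and matrices would need a real layout grammar.

```python
def flatten(glyphs):
    """What plain OCR sees: glyphs read left to right, positions discarded."""
    return "".join(ch for ch, x, y, size in sorted(glyphs, key=lambda g: g[1]))


def with_scripts(glyphs):
    """Toy layout rule: smaller glyphs above or below the baseline become scripts."""
    ordered = sorted(glyphs, key=lambda g: g[1])
    base_y, base_size = ordered[0][2], ordered[0][3]   # assume first glyph is on the baseline
    out = []
    for ch, x, y, size in ordered:
        if size < base_size and y > base_y:
            out.append("^{" + ch + "}")
        elif size < base_size and y < base_y:
            out.append("_{" + ch + "}")
        else:
            out.append(ch)
    return "".join(out)


glyphs = [("x", 0, 0, 10), ("2", 6, 4, 7), ("+", 12, 0, 10),
          ("y", 20, 0, 10), ("1", 26, -2, 7)]
linear = flatten(glyphs)
structured = with_scripts(glyphs)
```

Every glyph is recognized correctly in both cases; only the second output preserves the meaning, and it does so by interpreting geometry rather than symbols.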

Limits of Semantic Recovery

Limits of Math OCR

Recovering mathematics requires both symbol recognition and layout grammar inference. While constrained systems exist for narrow domains, general-purpose math OCR remains unreliable at scale. Production systems therefore prioritize partial semantic recovery over full symbolic reconstruction.

Document AI and Language Models

Transformer-based vision models are strong at symbol recognition, but OCR is fundamentally a geometry-preserving extraction problem, not a classification problem. PDF text layers require deterministic bounding boxes, baselines, reading order, and Unicode mappings anchored in page space. Vision transformers tend to optimize for semantic plausibility rather than strict spatial invariants, producing approximate geometry and probabilistic ordering. These properties are incompatible with the guarantees required of a PDF text layer, which is why OCR remains a distinct, geometry-first stage and language models are typically applied downstream.