A disk block (or sector) is the smallest unit of storage that a disk can read or write. Disks don’t access individual bytes directly but operate in fixed-size blocks, typically 512 bytes or 4 KB in size.

  1. Reading and Writing:
    • Data on disk is organized into blocks, and every read or write involves transferring a full block between disk and memory.
    • If you modify a single byte, the entire block containing that byte must be read into memory, updated, and written back.
  2. Alignment:
    • Disk-based data structures, like trees, are organized so that nodes fit neatly into blocks, minimizing wasted space and unnecessary I/O.

Writing a Tree to Disk

When working with a tree (e.g., a binary tree or B-Tree), you’re dealing with nodes, keys, and pointers. These must be serialized into a format suitable for storage on disk.

Serialization of a Tree Node

To store a tree node on disk:

  1. Keys and Values:
    • Serialize the keys and values in the node into a contiguous binary format.
  2. Pointers:
    • Replace in-memory pointers (e.g., parent.left or parent.right) with disk offsets (byte positions within the file).

Example

Consider a binary tree node with:

  • Key: 42.
  • Left Pointer: Points to a child node.
  • Right Pointer: Points to another child node.

In memory:

Node = [42, left=0x7ffd1234, right=0x7ffd5678]

On disk:

[Key: 42][Left Offset: 512][Right Offset: 1024]

Here, 512 and 1024 are byte offsets within the file where the child nodes are stored.

Updating a Tree

When you add a new node:

  1. Write the new node to the next free block in the file.
  2. Update the parent node’s pointer to reference the new block (e.g., set left=512).
  3. Write the updated parent node back to disk.

Disk Writing and fseek

In low-level file operations (e.g., using fopen, fwrite, and fseek), you control where in a file data is written.

How Disk Writing Works

  1. Open the File:
    • Use fopen to open the file where the tree is stored.
  2. Seek to the Position:
    • Use fseek to move the file cursor to the desired byte offset.
    • For example, to write to the block at offset 512:
      fseek(file, 512, SEEK_SET);
  3. Write the Data:
    • Serialize the node into a binary format and write it:
      fwrite(buffer, sizeof(buffer), 1, file);
  4. Close the File:
    • Use fclose to flush the data and release resources.

Example of Writing a Node

If you’re writing a node with:

  • Key: 42.
  • Left Offset: 512.
  • Right Offset: 1024.

You might serialize it into a buffer like this:

unsigned char buffer[12]; // 4 bytes for key, 4 for left, 4 for right
memcpy(buffer, &key, 4);
memcpy(buffer + 4, &left_offset, 4);
memcpy(buffer + 8, &right_offset, 4);

Then write the buffer:

fseek(file, 0, SEEK_SET); // Write to the root node position
fwrite(buffer, sizeof(buffer), 1, file);

Challenges with Disk Pointers

  1. Fixed Structure:

    • Once a tree node is written, its disk offset is fixed. If you need to move the node (e.g., during a tree reorganization), you must update all pointers referencing it.
  2. Fragmentation:

    • Over time, free space on disk may become fragmented, making it harder to write contiguous blocks.
  3. Crash Recovery:

    • If the system crashes mid-update, you could end up with partially written nodes or invalid pointers. Systems use techniques like Write-Ahead Logs (WAL) to ensure durability.

Example: A Simple File-Based Tree

Here’s how a binary tree might look when stored in a file:

Block OffsetContents
0Root Node: Key=42, Left=512, Right=1024
512Left Child: Key=17, Left=1536, Right=2048
1024Right Child: Key=84, Left=NULL, Right=NULL
1536Left-Left Child: Key=9, Left=NULL, Right=NULL

The tree is serialized into blocks, with each block representing a node. Pointers between nodes are stored as byte offsets within the file.


Summary

  • Disk Blocks: The smallest unit of I/O, typically 4 KB.
  • Serialization: Tree nodes are converted into a binary format, with pointers replaced by file offsets.
  • fseek and fwrite: Control where data is written in the file, ensuring nodes are stored at predictable offsets.
  • Efficiency: By aligning nodes with disk blocks, disk I/O is minimized, and the tree remains structured and accessible.

Let me know if you’d like to explore this further or if an example in a specific programming language would help!