External merge sort

Two-ways

  • General (K-way) merge sorting

At the beginning, all pages are sorted and written to disk (1 page sorted run). Then they are combined in larger runs by merging 2 1-page sorted run into a 2-page sorted run. 2 pages sorted run can be merged into 4 pages sorted runs and so on

The complexity for sorting N pages with an external merge sort using three buffers is

Important

The in the formula represents the initial sorting we do on all pages independently, why the second factor the merge of sorted run whose length is longer than one

General

In the case where buffer are available, the algorithm need fewer passes to merge all the data on disk

Merging two-pages sorted runs

The database will load in the buffer pool a page from each run, and start merge them. When one of the two pages gets to the end, the following pages will be loaded etc.

2- Phase aggregation: Partition + ReHash

In the first phase of 2-phase aggregation we use an hash function to put all tuples with the same grouping key in the same file. When we load that file to build an hash table for grouping as well we want to use a different hash function or all tuples would get in the same bucket!