External merge sort
Two-ways
- General (K-way) merge sorting
At the beginning, all pages are sorted and written to disk (1 page sorted run). Then they are combined in larger runs by merging 2 1-page sorted run into a 2-page sorted run. 2 pages sorted run can be merged into 4 pages sorted runs and so on
The complexity for sorting N pages with an external merge sort using three buffers is
Important
The in the formula represents the initial sorting we do on all pages independently, why the second factor the merge of sorted run whose length is longer than one
General
In the case where buffer are available, the algorithm need fewer passes to merge all the data on disk
Merging two-pages sorted runs
The database will load in the buffer pool a page from each run, and start merge them. When one of the two pages gets to the end, the following pages will be loaded etc.
2- Phase aggregation: Partition + ReHash
In the first phase of 2-phase aggregation we use an hash function to put all tuples with the same grouping key in the same file. When we load that file to build an hash table for grouping as well we want to use a different hash function or all tuples would get in the same bucket!