Query optimization is the most complex part of databases. Until System R, most people believed that human would write query plans better than DBMS. DBMS typically use two strategies to optimize query plans: heuristics (rules) and cost-based search

Planning

Two types of plans exists: logical and physical. Both can be optimized and there does not always exist a one-to-one mapping (multiple physical plans might exists, since they deal with the physical format of the data)

Logical plan optimizations

Typical optimizations involve:

  • Performs filters as early as possible (predicate pushdown)
  • Reorder predicates to apply the most selective first
  • Breaking up complex predicates and pushing it down (split conjuctive predicates)
  • Remove impossible or unnecessary predicates
  • Merge predicates WHERE val BETWEEN 1 and 100 OR val between 50 and 150
  • Reorder joins
  • Remove unnecessary joins (self join with equality)
  • Rewrite correlated subqueries

Cost estimation

Based on CPU, Disk, Memory and Network. Since it’s too slow to enumerate all valid plans the DBMS maintains statistics about tables in the internal catalog.

Tip

Different systems maintain statistics in different ways, and most systems update them in the background when significant updates happen to the tables

A type of statistics in that DBMS typically keep is the number of distinct occurrences for a certain value in an attribute. This is useful to derive selection cardinality which is useful to estimate the selectivity of simple predicates, while for more complex predicates we need additional assumptions

Tip

Since the selectivity of a predicate is equal to the probability of that predicate, we can apply the rules of probability to estimate the selectivity of a negation query (WHERE val != 'foo')

Tip

When predicates are complex, if we assume predicates are independent we can use probability rules and compute the joint probability by multiplying probabilities of single predicate

Since data is often skewed, it is tricky to make assumptions, so often databases stores histograms to avoid storing every single value. Histogram can be equi-width or equi-depth and can be used to generate sketches to approximate statistics of a dataset

Additionally, databases can use sampling to apply predicates to a smaller copy of the table with a similar distributions

Plan enumeration

After performing rule-based rewriting the database will enumerate different plans and estimate their cost. For single-relation, it typically only a matter of chosing the best access method (i.e. binary search, index scans, etc). No cost analysis is done there since the OLTP queries are Sargable(a term created in IBM to indicate that an index will lead the best plan)

For multi-relation plans, the number of alternative plans grow rapidly, so we need to restrict the search space to find the optimal plan in a short time:

  • Bottom-up planning is applied by Postgres, IBM System R, mySQL, DB2
  • Top-down planning is applied by SQL Server, Greenplum, CockroachDb, Vulcano

System R - Bottom up

System R applies static rule then uses dynamic programming:

  • break the query into blocks and generate logical operators
  • map each logical operators to a set of physical operators
  • Iteratively construct a left-deep tree that minimize the amount of work done

Left deep tree

A left-deep tree is a binary tree where each internal node represents a join operation, and the right child of each join is always a base relation (a table), while the left child can be either another join node or a base relation

Vulcano - Top down

Starts with a logical plan of what we want the query to be, perform a branch and bound search by converting logical operators into physical operators (branching), while keep track of global best plan during search and abandoning current path is worst than pre-existing best (bound)