Query optimization is the most complex part of databases. Until System R, most people believed that human would write query plans better than DBMS. DBMS typically use two strategies to optimize query plans: heuristics (rules) and cost-based search
Planning
Two types of plans exists: logical and physical. Both can be optimized and there does not always exist a one-to-one mapping (multiple physical plans might exists, since they deal with the physical format of the data)
Logical plan optimizations
Typical optimizations involve:
- Performs filters as early as possible (predicate pushdown)
- Reorder predicates to apply the most selective first
- Breaking up complex predicates and pushing it down (split conjuctive predicates)
- Remove impossible or unnecessary predicates
- Merge predicates
WHERE val BETWEEN 1 and 100 OR val between 50 and 150 - Reorder joins
- Remove unnecessary joins (self join with equality)
- Rewrite correlated subqueries
Cost estimation
Based on CPU, Disk, Memory and Network. Since it’s too slow to enumerate all valid plans the DBMS maintains statistics about tables in the internal catalog.
Tip
Different systems maintain statistics in different ways, and most systems update them in the background when significant updates happen to the tables
A type of statistics in that DBMS typically keep is the number of distinct occurrences for a certain value in an attribute. This is useful to derive selection cardinality which is useful to estimate the selectivity of simple predicates, while for more complex predicates we need additional assumptions
Tip
Since the selectivity of a predicate is equal to the probability of that predicate, we can apply the rules of probability to estimate the selectivity of a negation query (
WHERE val != 'foo')
Tip
When predicates are complex, if we assume predicates are independent we can use probability rules and compute the joint probability by multiplying probabilities of single predicate
Since data is often skewed, it is tricky to make assumptions, so often databases stores histograms to avoid storing every single value. Histogram can be equi-width or equi-depth and can be used to generate sketches to approximate statistics of a dataset
Additionally, databases can use sampling to apply predicates to a smaller copy of the table with a similar distributions
Plan enumeration
After performing rule-based rewriting the database will enumerate different plans and estimate their cost. For single-relation, it typically only a matter of chosing the best access method (i.e. binary search, index scans, etc). No cost analysis is done there since the OLTP queries are Sargable(a term created in IBM to indicate that an index will lead the best plan)
For multi-relation plans, the number of alternative plans grow rapidly, so we need to restrict the search space to find the optimal plan in a short time:
- Bottom-up planning is applied by Postgres, IBM System R, mySQL, DB2
- Top-down planning is applied by SQL Server, Greenplum, CockroachDb, Vulcano
System R - Bottom up
System R applies static rule then uses dynamic programming:
- break the query into blocks and generate logical operators
- map each logical operators to a set of physical operators
- Iteratively construct a left-deep tree that minimize the amount of work done
Left deep tree
A left-deep tree is a binary tree where each internal node represents a join operation, and the right child of each join is always a base relation (a table), while the left child can be either another join node or a base relation
Vulcano - Top down
Starts with a logical plan of what we want the query to be, perform a branch and bound search by converting logical operators into physical operators (branching), while keep track of global best plan during search and abandoning current path is worst than pre-existing best (bound)