Access Path Selection in a Relational Database Management System – Selinger et al. 1979
This is part 1 of a 7 part series on (database) ‘Techniques Everyone Should Know.’
System R was a very influential Relational Database Management System (RDBMS) built at the IBM San Jose Research Laboratory starting in 1975. This paper introduces the idea of a query optimizer, built as part of System R, that plans the most efficient way to retrieve the data requested by a SQL query. As well as giving insight into how a query optimizer may be constructed, the paper also quietly introduced the important result that a declarative query language can be supported with no loss of performance compared to the more common procedural query language approaches of the day. In fact, once you’ve understood what a good query optimizer can do, you’ll never want to write your own procedural data retrieval functions again! It’s more work, and the result won’t be as good…
More work on validation of the optimizer cost formulas needs to be done, but we can conclude from this preliminary work that database management systems can support non-procedural query languages with performance comparable to those supporting the current more procedural languages.
The four phases of (SQL) statement processing in System R are:
- Code Generation
We’re concerned here with the optimization phase, and in particular with access path selection. An access path is a way of accessing the tuples of a relation – there is always the possibility of a full (data) segment scan, but there may also be one or more indices.
… the OPTIMIZER performs access path selection. It first determines the evaluation order among the query blocks in the statement. Then for each query block, the relations in the FROM list are processed. If there is more than one relation in a block, permutations of the join order and of the method of joining are evaluated. The access paths that minimize the total cost for the block are chose from a tree of alternate path choices. This minimum cost solution is represented by a structural modification of the parse tree. The result is an execution plan in the Access Specification Language (ASL).
Scans and Access Paths
A scan returns a tuple at a time along a given access path. NEXT is called to retrieve the next tuple. For a segment scan, System R looks at all segments which contain tuples from any relation and returns those tuples belonging to the given relations. (Segments in System R may contain tuples from multiple relations). All the non-empty pages of a segment will be touched, however each page is touched only once. In contrast, when an entire relation is examined via an index scan, each page of the index is touched only once, but a data page may be examined more than once if it has two tuples on it which are not ‘close’ in the index ordering.
If the tuples are inserted into segment pages in the index ordering, and if this physical proximity corresponding to index key value is maintained, we say that the index is clustered. A clustered index has the property that not only each index page, but also each data page containing a tuple from that relation will be touched only once in a scan on that index.
Index scans also have the advantage that they need not scan the entire relation (for example, starting and stopping key values may have been specified in the query).
A query may specify predicates that are applied to a tuple to determine whether or not it is to be included in the result set. In System R the set of predicates are known as search arguments (SARGS) and these are expressed as a boolean expression in disjunctive normal form. Each individual predicate is of the form ‘column comparison-operator value’.
Single Relation Queries
Selinger et al. start with the case of a query involving only one relation, and then later build on this to show how queries involving multiple relations (joins) can be optimized.
The optimizer looks at four key pieces of information:
- the predicates in the query
- the access paths available on the relations referenced in the query
- statistics that are maintained on relations and access paths
- the desired output ordering (if any) – i.e. an ORDER BY or GROUP BY clause
A cost prediction is made for each access plan, which is based on the amount of expected I/O (page fetches), and CPU utilization (denoted as ‘RSI calls’ in the paper, which is the predicted number of tuples returned from the underlying storage system). A weighting factor W enables adjustment of the cost balance between I/O and CPU:
cost = costio + W.costcpu
where costio = #PAGE FETCHES and costcpu = #RSI CALLS.
The WHERE clause is a conjunction of boolean factors. An index matching a boolean factor will be an efficient way to satisfy it…. “we say that a predicate or set of predicates matches an index access path when the predicates are valid search arguments and the columns mentioned are an initial substring of the set of columns of the index key.” For each boolean factor in the predicate list, the optimizer assigns a selectivity factor, F, which very roughly corresponds to the expected fraction of tuples which will satisfy the predicate. The selectivity factors are derived from heuristics and the statistics maintained for relations and indices.
For a relation T, System R maintains the following statistics:
- TuplesT – the cardinality of the relation
- PagesT – the number of pages in the segment that hold tuples of relation T
- PageRatioT – the fraction of data pages in the segment that hold tuples of relation T. (PageRatioT = PagesT/#pages in segment)
And for an index I :
- KeysI – the number of distinct keys in the index
- IPagesI – the number of pages in index I
Some example heuristics follow, for the complete set see Table 1 in the full paper.
- For a
column = valueboolean factor, let F = 1 / Keyscolumn-index if there is an index on the column, and 1/10 otherwise.
- For a
column1 = column2boolean factor, let F = 1/max(Keyscolumn1-index, Keyscolumn2-index) if there is an index on both columns, 1/Keyscolumn1-index is there is only an index on the first column, and 1/10 otherwise.
- For a
column > valueboolean factor – or any other open-ended comparison, if the column is an arithmetic type then let F = (high-key-value – value) / (high-key-value – low-key-value), otherwise let F = 1/3.
Given the selectivity factor F for each boolean factor, we can compute the number of expected RSI calls (the measure for costCPU), RSICARD as the product of the relation cardinalities TuplesT for every relation in the FROM list times the selectivity factors of the search argument boolean factors.
For a query with a single relation, we can now compute a cost estimate for each access path :
- For a unique index access path matching an equal predicate, then let cost = 1+ 1 + W
- For a clustered index I access path matching one or more boolean factors, then let cost = F(preds).(KeysI + Pages) + W.RSICARD
- For a non-clustered index I access path matching one or more boolean factors, then let cost = F(preds).(KeysI + Tuples) + W.RSICARD (use Pages in place of Tuples if everything fits in memory)
- For a clustered index I access path not matching any boolean factors, let F = (KeysI + Pages) + W.RSICARD
- For a non-clustered index I acces path not maching any boolean factors, let F = (KeysI + Tuples) + W.RSICARD (use Pages in place of Tuples if everything fits in memory)
- For a segment scan access path, let F = Tuples/PageRatio + W.RSICARD
The best access path is not necessarily the one with the lowest access cost though. To determine that, we also need to considered the desired output order (ORDER BY, GROUP BY).
Using an index access path or sorting tuples produces tuples in the index value or sort key order. We say that a tuple order is an interesting order if that order is one specified by the query block’s GROUP BY or ORDER BY clauses.
In the presence or ordering requirements, the cheapest access plan is the lower of the cheapest access path that produces tuples in an interesting order (i.e., the desired order), and the cheapest access path that produces tuples in some other order + the cost of sorting the tuples . We can estimate the cost of sorting based on the expected query cardinality (the product of Tuples and the selectivity factors of the query block’s boolean factors).
Given a way to estimate the cost for queries involving a single relation, we can extend it to estimate costs for different approaches to joins. We can first consider a join of two relations, and then generalise to N-way joins which can be visualized as a sequence of 2-way joins.
Selinger et al. consider nested-loop based joins, and merging-scan based joins. In a nested loop join the outer relation is scanned, and for each tuple retrieved from the outer relation the inner relation is scanned to retrieve all ofthe tuples which satisfy the join predicate. In a merging-scan the outer and inner relations are first sorted in join column order (so columns of equi-join predicates also define ‘interesting’ orders) so that they can subsequently be more efficiently merged.
The key question here is in what order relations shoud be joined (for a 2-way join, which is the inner, and which the outer relation), with n relations in a FROM list, there are n! permutations to consider.
A heuristic is used to reduce the join order permutations which are considered. When possible, the search is reduced by consideration only of join orders which have join predicates relating the inner relation to the other relations already participating in the join.
To find the optimal plan for join operations, a tree of possible solutions is constructed.
The search tree is constructed by iteration on the number of relations joined so far. First, the best way is found to access each single relation for each interesting tuple ordering and for the unordered case. Next, the best way of joining any relation to these is found, subject to the heuristics for join order. This produces solutions for joining pairs of relations. Then the best way to join sets of three relations is found by consideration of all sets of two relations and joining in each third relation permitted by the join order heuristic. For each plan to join a set of relations, the order of the composite result is kept in the tree. This allows consideration of a marge scan join which would not require sorting the composite. After the complete solutions (all of the relations joined together) have been found, the optimizer chooses the cheapest solution which gives the required order, if any was specified.
There is a worked example in the paper of joining Employee, Job title, and Department tables that helps to make this process clearer.
The cost of a join is computed based on the costs of scans on each of the relations (using the cost estimate formulas we saw earlier) and the cardinalities.
- Let C-outerpath1 be the cost of scanning the outer relation via path1
- Let N be the cardinality of the outer relation tuples satisfying the applicable predicates, where N = product of TuplesT for all T of the join so far * product of the selectivity factor for all applicable predicates.
- Let C-innerpath2 be the cost of scanning the inner relation, satisfying all applicable predicates
Then the cost of a nested-loop join is C-outerpath1 + N.C-innerpath2, and the cost of a merge scan join is the cost of actually doing the merge, plus the cost of sorting the outer or inner relations, if required. The cost of doing the merge is also C-outerpath1 + N.C-innerpath2.
It is interesting to observe that the cost formula for nested loop joins and the cost formula for merging scans are essentially the same. The reason that merging scans is sometimes better than nested loops is that the cost of the inner scan may be much less. After sorting, the inner relation is clustered on the join column which tends to minimize the number of pages fetched, and it is not necessary to scan the entire inner relation (looking for a match) for each tuple of the outer relation.
SELECT NAME FROM EMPLOYEE WHERE SALARY = (SELECT AVG(SALARY) FROM EMPLOYEE)
If the nested query does not reference a value obtained from a higher level query block then it needs to be evaluated only once. In this case, the optimizer will evaluate the subquery first.
If a single value is returned, it will be incorporated into the top-level query as if it had been part of the original query statement… if the subquery can return a set of values, thery are returned in a temporary list, an internal form which is more efficient than a relation but which can only be accessed sequentially.
If the nested query references a value from a higher-level query block then in principle it needs to be re-evaluated for each candidate tuple from the referenced higher-level block.
However, if the referenced relation is ordered on the referenced column, the re-evaluation can be made conditional, depending on a test of whether or not the current referenced value is the same as the previous candidate tuple.