Selection Operation in Query Processing

In the previous section, we understood that estimating the cost of a query plan should be done by measuring the total resource consumption.

In this section, we will understand how the selection operation is performed in the query execution plan.

Generally, the selection operation is performed by the file scan. File scans are the search algorithms that are used for locating and accessing the data. It is the lowest-level operator used in query processing.

Let's see how selection using a file scan is performed.

Selection using File scans and Indices

In RDBMS or relational database systems, the file scan reads a relation only if the whole relation is stored in one file only. When the selection operation is performed on a relation whose tuples are stored in one file, it uses the following algorithms:

Linear Search: In a linear search, the system scans each record to test whether satisfying the given selection condition. For accessing the first block of a file, it needs an initial seek. If the blocks in the file are not stored in contiguous order, then it needs some extra seeks. However, linear search is the slowest algorithm used for searching, but it is applicable in all types of cases. This algorithm does not care about the nature of selection, availability of indices, or the file sequence. But other algorithms are not applicable in all types of cases.

Selection Operation with Indexes

The index-based search algorithms are known as Index scans. Such index structures are known as access paths. These paths allow locating and accessing the data in the file. There are following algorithms that use the index in query processing:

Primary index, equality on a key: We use the index to retrieve a single record that satisfies the equality condition for making the selection. The equality comparison is performed on the key attribute carrying a primary key.
Primary index, equality on nonkey: The difference between equality on key and nonkey is that in this, we can fetch multiple records. We can fetch multiple records through a primary key when the selection criteria specify the equality comparison on a nonkey.
Secondary index, equality on key or nonkey: The selection that specifies an equality condition can use the secondary index. Using secondary index strategy, we can either retrieve a single record when equality is on key or multiple records when the equality condition is on nonkey. When retrieving a single record, the time cost is equal to the primary index. In the case of multiple records, they may reside on different blocks. This results in one I/O operation per fetched record, and each I/O operation requires a seek and a block transfer.

Selection Operations with Comparisons

For making any selection on the basis of a comparison in a relation, we can proceed it either by using the linear search or via indices in the following ways:

Primary index, comparison: When the selection condition given by the user is a comparison, then we use a primary ordered index, such as the primary B⁺-tree index. For example, when A attribute of a relation R compared with a given value v as A>v, then we use a primary index on A to directly retrieve the tuples. The file scan starts its search from the beginning till the end and outputs all those tuples that satisfy the given selection condition.
Secondary index, comparison: The secondary ordered index is used for satisfying the selection operation that involves <, >, ≤, or ≥ In this, the files scan searches the blocks of the lowest-level index.
(< ≤): In this case, it scans from the smallest value up to the given value v.
(>, ≥): In this case, it scans from the given value v up to the maximum value.
However, the use of the secondary index should be limited for selecting a few records. It is because such an index provides pointers to point each record, so users can easily fetch the record through the allocated pointers. Such retrieved records may require an I/O operation as records may be stored on different blocks of the file. So, if the number of fetched records is large, it becomes expensive with the secondary index.

Implementing Complex Selection Operations

Working on more complex selection involves three selection predicates known as Conjunction, Disjunction, and Negation.

Conjunction: A conjunctive selection is the selection having the form as:

σ _{θ1ꓥθ2ꓥ…ꓥθn} (r)

A conjunction is the intersection of all records that satisfies the above selection condition.

Disjunction: A disjunctive selection is the selection having the form as:

σ _{θ1ꓦθ2ꓦ…ꓦθn} (r)

A disjunction is the union of all records that satisfies the given selection condition θ_i.

Negation: The result of a selection σ_¬θ(r) is the set of tuples of given relation r where the selection condition evaluates to false. But nulls are not present, and this set is only the set of tuples in relation r that are not in σ_θ(r).

Using these discussed selection predicates, we can implement the selection operations by using the following algorithms:

Conjunctive selection using one index: In such type of selection operation implementation, we initially determine if any access path is available for an attribute. If found one, then algorithms based on the index will work better. Further completion of the selection operation is done by testing that each selected records satisfy the remaining simple conditions. The cost of the selected algorithm provides the cost of this algorithm.
Conjunctive selection via Composite index: A composite index is the one that is provided on multiple attributes. Such an index may be present for some conjunctive selections. If the given selection operation proves true on the equality condition on two or more attributes and a composite index is present on these combined attribute fields, then directly search the index. Such type of index evaluates the suitable index algorithms.
Conjunctive selection via the intersection of identifiers: This implementation involves record pointers or record identifiers. It uses indices with the record pointers on those fields which are involved in the individual selection condition. It scans each index for pointers to tuples satisfying the individual condition. Therefore, the intersection of all the retrieved pointers is the set of pointers to the tuples that satisfies the conjunctive condition. The algorithm uses these pointers to fetch the actual records. However, in absence of indices on each individual condition, it tests the retrieved records for the other remaining conditions.
Disjunctive selection by the union of identifiers: This algorithm scans those entire indexes for pointers to tuples that satisfy the individual condition. But only if access paths are available on all disjunctive selection conditions. Therefore, the union of all fetched records provides pointers sets to all those tuples which satisfy or prove the disjunctive condition. Further, it makes use of pointers for fetching the actual records. Somehow, if the access path is not present for anyone condition, we need to use a linear search to find those tuples that satisfy the condition. Thus, it is good to use a linear search for determining such tests.

Cost Estimation

Here, the overall cost of the algorithm is composed by adding the cost of individual index scans and cost of fetching the records in the intersection of the retrieved lists of pointers. We can minimize the cost by sorting the list of pointers and fetching the sorted records. So, we found the following two points for cost estimation:

We can fetch all selected records of the block using a single I/O operation because each pointer in the block appears together.
The disk-arm movement gets minimized as blocks are read in sorted order.

Cost Estimation Chart for various Selection algorithms

Here, b_r is the number of blocks in the file.

h_i denotes the height of the index

b is the number of blocks holding records with specified search key

n is the number of fetched records

Selection Algorithms	Cost	Why So?
Linear Search	t_s + b_r * t_T	It needs one initial seek with b_r block transfers.
Linear Search, Equality on Key	t_s + (b_r/2) * t_T	It is the average case where it needs only one record satisfying the condition. So as soon as it is found, the scan terminates.
Primary B⁺-tree index, Equality on Key	(h_i +1) * (t_r + t_s)	Each I/O operation needs one seek and one block transfer to fetch the record by traversing the height of the tree.
Primary B⁺-tree index, Equality on a Nonkey	h_i * (t_T + t_s) + b * t_T	It needs one seek for each level of the tree, and one seek for the first block.
Secondary B⁺-tree index, Equality on Key	(h_i + 1) * (t_r + t_s)	Each I/O operation needs one seek and one block transfer to fetch the record by traversing the height of the tree.
Secondary B⁺-tree index, Equality on Nonkey	(h_i + n) * (t_r + t_s)	It requires one seek per record because each record may be on a different block.
Primary B⁺-tree index, Comparison	h_i * (t_r + t_s) + b * t_T	It needs one seek for each level of the tree, and one seek for the first block.
Secondary B⁺-tree index, Comparison	(h_i + n) * (t_r + t_s)	It requires one seek per record because each record may be on a different block.