SQL Performance Explained by Markus Winand
Here, I have listed some interesting concepts from the book.
Chapter 1: Anatomy of an index
A lookup can be slow even with an index. One example is the index range scan, in which the database may read a large part of the index. If the database then has to access the table to look up every row found in the index, even an index-assisted lookup can become slow.
A lookup using an index includes the following 3 steps -
Traverse the index to search for the required key.
Traverse the leaf node chain to find all the matches.
Fetch the data from the table.
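The three steps above can be sketched in a few lines of Python. This is only a simplified model, not how any real database is implemented: a sorted list stands in for the leaf nodes of the index, a binary search stands in for the tree traversal, and the table, its columns and the data are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical table: row id -> (first_name, last_name). Illustrative data only.
table = {
    10: ("Ada", "WINAND"),
    11: ("Bob", "SMITH"),
    12: ("Cyd", "WINAND"),
    13: ("Dee", "ADAMS"),
}

# Simplified index on last_name: sorted (key, row_id) entries standing in for
# the leaf nodes. A real B-tree adds branch nodes on top of these entries.
index = sorted((last_name, row_id) for row_id, (_, last_name) in table.items())


def lookup(key):
    # Step 1: traverse the index to find the first entry that is >= key.
    pos = bisect_left(index, (key,))
    rows = []
    # Step 2: follow the leaf entries for as long as they match the key.
    while pos < len(index) and index[pos][0] == key:
        row_id = index[pos][1]
        # Step 3: fetch the row from the table (one table access per match).
        rows.append(table[row_id])
        pos += 1
    return rows


print(lookup("WINAND"))
```

Steps 2 and 3 run once per matching entry, which is why a lookup that matches many entries can be slow even though step 1 is fast.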
Chapter 2: The WHERE Clause
A cost-based optimizer uses statistics about the tables, columns and indexes to estimate the selectivity of the WHERE clause predicates. Some statistics at the column level include - range of the values, distribution of the data, cardinality of the column, number of NULL values, etc. Some statistics at the table level include the number of rows in the table and the size of the table in blocks.
The database does not understand the relation between a function and its result, such as UPPER(last_name). It cannot use an index on the column last_name to filter the records by UPPER(last_name). To handle this, we need to use a function-based index on UPPER(last_name). Only deterministic functions can be indexed. PostgreSQL and Oracle allow us to manually mark functions as deterministic (IMMUTABLE in PostgreSQL, DETERMINISTIC in Oracle). However, the database trusts this declaration, so the index will not work as intended if the marked function is not really deterministic.
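A minimal sketch of a function-based index, using SQLite through Python's sqlite3 module because it runs without a server. The employees table, the index names and the data are assumptions for illustration. SQLite calls this feature an index on an expression, and its plan output differs from the Oracle and PostgreSQL plans shown in the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO employees (first_name, last_name) VALUES (?, ?)",
    [("Ada", "Winand"), ("Bob", "Smith"), ("Cyd", "winand")],
)
# A plain index on the column does not help a filter on UPPER(last_name).
conn.execute("CREATE INDEX emp_name ON employees (last_name)")

QUERY = "SELECT first_name, last_name FROM employees WHERE UPPER(last_name) = ?"


def plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY, ("WINAND",))

# Index the expression itself so the optimizer can search by its result.
conn.execute("CREATE INDEX emp_up_name ON employees (UPPER(last_name))")
plan_after = plan(QUERY, ("WINAND",))

print(plan_before)  # expected to describe a full scan
print(plan_after)   # expected to mention emp_up_name
matches = sorted(conn.execute(QUERY, ("WINAND",)).fetchall())
print(matches)
```

The expression in the query has to match the indexed expression for the index to be considered.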
It’s always a good idea to check the optimizer’s estimates, because the statistics behind them can be out of date. It might be a good idea to manually update the statistics after creating or dropping an index. However, this opinion needs deeper exploration.
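As a rough illustration of updating and inspecting statistics manually, the sketch below uses SQLite, where the ANALYZE command stores its results in the sqlite_stat1 table. Other databases use different commands and catalog views, and the sales table and its data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT)")
regions = ("north", "south", "east", "west")
conn.executemany(
    "INSERT INTO sales (region) VALUES (?)",
    [(regions[i % 4],) for i in range(100)],
)
conn.execute("CREATE INDEX sales_region ON sales (region)")

# Gather statistics explicitly after the index has been created.
conn.execute("ANALYZE")

# In SQLite the stat column starts with the number of rows in the index,
# followed by the average number of rows per distinct key.
stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE idx = 'sales_region'"
).fetchall()
print(stats)
```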
Bind parameters help prevent SQL injection. They can also speed up query execution, because the database can generate and cache an execution plan for the generic query and reuse that plan every time it sees the same SQL text. However, the actual parameter values can sometimes affect which execution plan is best, for example when the data is unevenly distributed. In such cases, avoid bind parameters. Such cases are rare, so as a rule of thumb it seems okay to always use bind parameters.
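A small sketch of bind parameters with Python's sqlite3 module, which uses ? as the placeholder (other drivers use different placeholder styles). The table and the find_by_last_name helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO employees (last_name) VALUES (?)",
    [("Winand",), ("Smith",)],
)


def find_by_last_name(last_name):
    # The SQL text stays the same for every call; the value is passed
    # separately and is never parsed as SQL.
    return conn.execute(
        "SELECT employee_id, last_name FROM employees WHERE last_name = ?",
        (last_name,),
    ).fetchall()


print(find_by_last_name("Winand"))
# A classic injection attempt is treated as an ordinary string value.
print(find_by_last_name("x' OR '1'='1"))
```

Because the value is bound rather than concatenated into the statement, the injection attempt simply matches no rows.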
LIKE filters can only use the characters before the first wildcard for index traversal. The characters after the wildcard can only be used to filter the results from the scanned index range. It follows that a LIKE expression starting with a wildcard cannot narrow the scanned range at all, so the database has to search the entire table for matches.
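The effect of the wildcard position can be sketched with a toy model in Python. A sorted list stands in for the index, the prefix before the first wildcard decides where the scan starts and stops, and the full pattern is only applied as a filter on the scanned entries. The like_scan function and the data are hypothetical, and real databases implement this quite differently.

```python
import re
from bisect import bisect_left

# Sorted last names standing in for index leaf entries (illustrative data).
index = sorted(["SMITH", "WALKER", "WIELAND", "WINAND", "WINTER", "WOOD"])


def like_scan(pattern):
    # Access predicate: only the literal prefix before the first wildcard.
    wildcard_positions = [i for i, c in enumerate(pattern) if c in "%_"]
    cut = wildcard_positions[0] if wildcard_positions else len(pattern)
    prefix = pattern[:cut]
    start = bisect_left(index, prefix)

    # Filter predicate: the whole pattern, checked on every scanned entry.
    regex = re.compile(
        "".join(
            ".*" if c == "%" else "." if c == "_" else re.escape(c)
            for c in pattern
        )
    )

    scanned, matches = 0, []
    for entry in index[start:]:
        if not entry.startswith(prefix):
            break  # left the index range covered by the prefix
        scanned += 1
        if regex.fullmatch(entry):
            matches.append(entry)
    return scanned, matches


print(like_scan("WINA%"))  # narrow range
print(like_scan("WI%ND"))  # wider range, filtered afterwards
print(like_scan("%AND"))   # empty prefix: every entry is scanned
```

The first number in each result is the count of scanned entries, which grows as the wildcard moves towards the front of the pattern.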
Chapter 3: Performance and scalability
The performance of a query that uses an index can be determined by the scanned index range. To optimize such a query, try to make the scanned index range as small as possible.
There is no such thing as a broken index. The 2 main ingredients that make an index lookup slow are -
The table access
Scanning a wide index range
Chapter 4: The join operation
The database can use different join algorithms in different situations, such as nested loops join, hash join and sort-merge join. The optimizer selects the join algorithm implicitly, but we can create proper indexes to ensure that the selected algorithm runs efficiently.
To optimize a hash join or a sort-merge join, we can index the independent conditions (the predicates that refer to only one of the tables) so as to reduce the number of records that need to be joined from each table.
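To make the role of the independent conditions concrete, here is a simplified in-memory hash join in Python. It is a sketch of the general technique, not of any particular database engine, and the employees and sales tables, their columns and the hash_join function are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical tables as lists of dicts (illustrative data only).
employees = [
    {"employee_id": 1, "last_name": "WINAND", "subsidiary_id": 30},
    {"employee_id": 2, "last_name": "SMITH", "subsidiary_id": 30},
    {"employee_id": 3, "last_name": "WOOD", "subsidiary_id": 20},
]
sales = [
    {"sale_id": 100, "employee_id": 1, "eur_value": 50},
    {"sale_id": 101, "employee_id": 2, "eur_value": 900},
    {"sale_id": 102, "employee_id": 3, "eur_value": 700},
    {"sale_id": 103, "employee_id": 1, "eur_value": 650},
]


def hash_join(min_value, subsidiary_id):
    # Independent predicates: each one refers to a single table, so an index
    # on employees.subsidiary_id or sales.eur_value could serve it.
    build_side = [e for e in employees if e["subsidiary_id"] == subsidiary_id]
    probe_side = [s for s in sales if s["eur_value"] >= min_value]

    # Build phase: load the filtered employees into a hash table on the join key.
    buckets = defaultdict(list)
    for e in build_side:
        buckets[e["employee_id"]].append(e)

    # Probe phase: look up each filtered sale in the hash table.
    return [
        (e["last_name"], s["sale_id"])
        for s in probe_side
        for e in buckets.get(s["employee_id"], [])
    ]


print(hash_join(min_value=600, subsidiary_id=30))
```

The join column itself is only used for hashing here, so the filters applied before the build and probe phases are what keep the inputs small.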
Chapter 5: Clustering data
We can use an index to cluster the data. Clustered data improves query performance by enabling faster access to the data. One way to apply this concept is to add columns to the index that do not impact the index lookup but are used in filter predicates. This helps because the columns of the filter predicate can then be read from the index without a TABLE ACCESS.
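A small sketch of this idea with SQLite, assuming a hypothetical employees table. The last_name column is added to the index purely so that the LIKE filter can be evaluated from the index. SQLite reports this as a COVERING INDEX in its plan rather than as a missing TABLE ACCESS step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, subsidiary_id INTEGER, "
    "last_name TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO employees (subsidiary_id, last_name) VALUES (?, ?)",
    [(30, "WINAND"), (30, "SMITH"), (20, "MARTINA"), (30, "CORINA")],
)

QUERY = (
    "SELECT COUNT(*) FROM employees "
    "WHERE subsidiary_id = ? AND last_name LIKE '%INA%'"
)


def plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Index on the lookup column only: the filter on last_name needs the table.
conn.execute("CREATE INDEX emp_sub ON employees (subsidiary_id)")
plan_narrow = plan(QUERY, (30,))

# Add the filter column to the index so the filter runs on index entries.
conn.execute("DROP INDEX emp_sub")
conn.execute("CREATE INDEX emp_sub_name ON employees (subsidiary_id, last_name)")
plan_wide = plan(QUERY, (30,))

print(plan_narrow)
print(plan_wide)
count = conn.execute(QUERY, (30,)).fetchone()[0]
print(count)
```

The leading wildcard means last_name cannot narrow the scanned range, yet having it in the index still avoids visiting the table.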
Chapter 6: Sorting and Grouping
To ensure fast execution of SQL queries, make sure the setup allows the operations to be executed in a pipelined manner, i.e., the first result is returned before the entire input has been read.
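One way to see whether an ORDER BY can be pipelined is to check the execution plan for an explicit sort step. The sketch below uses SQLite, which reports such a step as a temporary B-tree. The sales table, the index name and the data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_id INTEGER PRIMARY KEY, sale_date TEXT, eur_value INTEGER)"
)
conn.executemany(
    "INSERT INTO sales (sale_date, eur_value) VALUES (?, ?)",
    [("2024-01-%02d" % (i % 28 + 1), i) for i in range(200)],
)

QUERY = "SELECT sale_id, sale_date, eur_value FROM sales ORDER BY sale_date LIMIT 10"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without a suitable index, all rows are read and sorted before the first
# result can be returned.
plan_sorted = plan(QUERY)

# An index that delivers rows in the requested order makes the sort step
# unnecessary, so the first rows can be returned right away.
conn.execute("CREATE INDEX sales_date ON sales (sale_date)")
plan_pipelined = plan(QUERY)

print(plan_sorted)
print(plan_pipelined)
```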
Chapter 7: Partial Results
The following row values syntax allows us to compare multiple values. This syntax is most commonly seen in INSERT statements -
SELECT * FROM table_a WHERE (col1, col2) < (val1, val2) ORDER BY col1 DESC, col2 DESC;
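A runnable sketch of the same row values comparison, used to fetch the next page of results after the last row already seen. It assumes a SQLite version that supports row values (3.15 or later). The sales table, its columns and the next_page helper are hypothetical stand-ins for table_a, col1 and col2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_date TEXT, sale_id INTEGER, PRIMARY KEY (sale_date, sale_id))"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2024-01-03", 1),
        ("2024-01-03", 2),
        ("2024-01-02", 7),
        ("2024-01-02", 8),
        ("2024-01-01", 4),
    ],
)


def next_page(last_date, last_id, page_size=2):
    # (sale_date, sale_id) < (?, ?) compares the pair as one value: first by
    # sale_date, then by sale_id when the dates are equal.
    return conn.execute(
        "SELECT sale_date, sale_id FROM sales "
        "WHERE (sale_date, sale_id) < (?, ?) "
        "ORDER BY sale_date DESC, sale_id DESC LIMIT ?",
        (last_date, last_id, page_size),
    ).fetchall()


page_two = next_page("2024-01-03", 1)
page_three = next_page(*page_two[-1])
print(page_two)
print(page_three)
```

Passing the last row of one page as the starting point for the next keeps the comparison and the ORDER BY aligned on the same columns.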