How to write efficient queries in MySQL
Learn how to write efficient queries in MySQL
Table of contents
- Introduction to queries
- EXPLAIN overview
- EXPLAIN access types
- EXPLAIN ANALYZE
- Index obfuscation
- Redundant and approximate conditions
- Select only what we need
- Limiting rows
- An overview of joins
- Indexing joins
- Subqueries
- Common table expressions (CTEs)
- Recursive CTEs
- Unions
- Window functions
- Sorting and limiting
- Sorting with indexes
- Sorting with composite indexes
- Counting results
- Dealing with NULLs
- References
Introduction to queries
We are going to talk about actually accessing the data / querying the data efficiently.
EXPLAIN
overview
We’ll try to understand the output of the EXPLAIN
command.
id
- Shows the ID of the row in the output. Can beNULL
.select_type
- Possible values areSIMPLE, UNION, PRIMARY
.table
- The table we’re accessing.partitions
- Out of the scope of this course.type
- Discussed in the next section.possible_keys
- The keys that can be used in the given query.key
- The key that the database actually uses to resolve the query.key_len
- The length of the key.ref
- The value that is being compared to the index. e.g.const
for constant, function,column
if we are doing a join, etc. For instance consider a query like this -select * from people where first_name = “aaron”
. Here the value being compared is aconst
.rows
- The best estimate for the number of rows that the database will have to access in the primary index.filtered
- The query does the index work. Then it goes back to the primary index to get the actual row. This filtered column shows an estimate for the percentage of rows that the engine thinks, it will be keeping (after the index work).e.g. Consider this query -
select * from people where first_name = “aaron” and bio like “%error%“;
The filtered value shows the percentage of the rows that will be kept using this part of the query -
bio like "%error%";
, i.e. after the index work.
Extra
- Shows extra information. e.g.Using where
shows that the database will have to do some extra filtering after the index work is done.
EXPLAIN
access types
We’ll go from the best to worst types that we can see.
const
- Unique lookup - Indicates that there will only be 1 row in the result.ref
- Index assisted direct lookup - Indicates that there is an index that will be used.fulltext
- Fulltext index will be used.range
- A range scan will be done. Here during the index lookup, the engine traverses down and then does a horizontal range scan.index
- The entire index will be scanned to look for what we need. It’s better than scanning the entire table as the index table will have lesser number of columns. So it will be faster.ALL
- The entire table will be scanned.
EXPLAIN ANALYZE
We have different formats in which we can get the EXPLAIN
output.
explain format=tree query;
- This format shows the output in a tree with the cost. So we can see the most expensive part of the query. We read this tree inside out - we start at the most nested detail and then work our way out.explain format=json query;
- This shows a machine readable json with a lot of additional data.explain analyze query;
- It is the tree format with just 1 difference. It runs the query! It also shows some additional data.
Index obfuscation
Let’s see index obfuscation and how to avoid it.
Consider this query on the film table. It has an index on the length
column. Length is the length of the film in minutes. The index would also be in minutes.
We want to find the movies shorter than 2 hours.
explain select * from film where length / 60 < 2;
This causes index obfuscation and the index on the length
column is not used. To fix this, we can alter the query to be like this -
explain select * from film where length < 120;
This allows the index to be used.
But if the mathematical operation was different per row, then it would not have been possible to un-obfuscate the index.
Redundant and approximate conditions
A redundant conditions are conditions that do not change the results and they cannot do so logically.
e.g. select * from people where id < 5 and id < 100
.
Consider this query. It we have an index on due_date
, it would be obfuscated because we are using due_time
. The mathematical operation is different for every row, so we cannot move it to the other side of the equation.
select * from todos where addtime(due_date, due_time) between now() and now() + interval 1 day;
We can set it up to use index like this -
select *
from todos
where addtime(due_date, due_time) between now() and now() + interval 1 day
and
due_date between current_date and current_date + interval 1 day;
However, the result in both the situations is the same. So the second query shows a case of using a redundant condition where the condition is useful but not needed.
Another such query is this -
select *
from items
where price * tax < 100
and price < 100;
Next, we’ll discuss some generic querying principles.
Select only what we need
Select only the columns that we need.
One thing to note is that if we’re using a framework like ActiveRecord, then the columns that have not been selected will always return NULL
. If these missing columns are being used in some background logic, then it will work in an unexpected manner.
Another note is that we can mark the columns as invisible
at the database level. These columns are not returned unless requested explicitly.
Limiting rows
Use index to reduce the rows that are looked up.
Use where to filter out the rows from the looked up rows.
Use
LIMIT
andOFFSET
for pagination. AlwaysORDER
onLIMIT
. Otherwise MySQL will decide how to order the rows.Use aggregate functions to get the aggregate value instead of querying all the rows and then doing the math in memory.
An overview of joins
Here, we saw the different types of joins.
Indexing joins
Here we’ll discuss the importance of indexing joins.
Consider a case where we have 3 tables - actors, films, film_actors
.
If we run explain on the following query, we’ll see a separate key that is in use for each of the 3 tables. The ref
column shows the value with which the index value is being compared to.
select
film.title, actor.name
from
films
left join film_actors on film_actors.film_id = films.id
left join actors on film_actors.actor_id = actors.id
where films.id <= 10;
With INNER JOIN
, MySQL gets to decide the order in which the tables are queried. By default the order in the above query would be like this - films, film_actors, actors
.
If there is an index on the table on the join column, MySQL uses indexes to join the tables. Otherwise, for every row in the left table, it scans the entire right table to find the rows that match the join criteria.
Subqueries
Initial query
select
distinct customer.first_name, customer.last_name
from
customer
inner join payment on customer.id = payment.customer_id
where
payment.amount > 7.99;
The same query rewritten using subqueries.
select
customer.first_name, customer.last_name
from
customer
where id in (
select customer_id from payment where amount > 7.99
);
These subqueries are enhanced using optimization techniques, such as “semi-join” / “anti-join”, under the hood by MySQL.
Common table expressions (CTEs)
CTEs allow us to make really huge complicated queries. e.g. allow us to reference a subquery in subsequent queries such that the subquery is run only once.
with spend_last_6 as (
select
customer_id,.
sum(amount) as total_spend
from
payment
inner join customer on customer.id = payment.customer_id
where
store_id = 1
and payment_date > CURRENT_DATE - INTERVAL 6 MONTH
group by
customer_id
);
select
*
from
spend_last_6
where
total_spend > (select avg(total_spend) from spend_last_6);
Recursive CTEs
It is a CTE that references itself. It has 2 parts - the initial condition and the recursive condition.
with recursive numbers as (
select 1 as n
union all
select n+1 from numbers where n < 10
);
select * from numbers;
Consider a self-referencing table with these columns. The records in this table form a tree structure as the parent record can also have its own parent record and so on.
id, name, parent_id
Recursive CTEs can be used in this case -
with recursive all_categories as (
select
id,
name,
0 as depth,
CAST(id as CHAR(255)) as path
from categories where parent_id is null
union all
select
categories.id,
categories.name,
depth+1,
CONCAT(path, "->", categories.id)
from all_categories inner join categories on all_categories.id = categories.parent_id
);
select * from all_categories;
The output table shows the path to each record in this form - 1→3→12→11
. The depth of the record with id 11
will be 3
and the depth of the record with id 1
will be 0
.
Unions
Unions, like joins, combine the data from different tables.
Unlike joins, they increase the number of rows. Joins combine the columns.
union
eliminates duplicates. It is slower since the database needs to sort and compare rows to identify duplicates. It also requires more memory. union all
includes all the rows even if they are duplicates.
select 1 as n, "a" as letter
union
select 2, "b"
union
select 2, "b";
select 1 as n, "a" as letter
union all
select 2, "b"
union all
select 2, "b";
Window functions
Window functions are a feature that allow us to perform calculation across a set of rows that are related to the current row without collapsing the rows into a single result (like aggregate functions do).
They provide us a way to calculate something over a window of rows while keeping the individual rows in the result.
We can choose the window (subset of rows to operate on) using the OVER
clause.
select
customer_id, rental_id, amount, payment_date,
row_number() over(partition by customer_id order by payment_date asc) as num,
first_value(payment_date) over(partition by customer_id order by payment_date asc) as first_rental_date,
from
payment
order by customer_id, payment_date asc;
We can define windows externally for ease of use. These are called named windows.
select
customer_id, rental_id, amount, payment_date,
row_number() over(window_asc) as num,
first_value(payment_date) over(window_asc) as first_rental_date,
from
payment
window
window_asc as (partition by customer_id order by payment_date asc)
order by customer_id, payment_date asc;
Sorting and limiting
Ordering the rows is not free. But we can optimize it using indexes. We order using ORDER BY
.
Consider this query. If the ordered column is not deterministic enough, ie. it has duplicate values, then different rows might be returned in every query. In such a case, we’ll need order by more columns till the result becomes deterministic.
select
id, birthday
from
people
order by
birthday asc,
id
limit 100 offset 200;
Sorting with indexes
MySQL has 2 ways of producing a sorted result set - sort the result or scan the index in order. Indexes are already ordered. Thus indexing makes sorting faster.
If the sorting is being done without indexes, then the explain
output of the query will show Using filesort
. This can happen in memory. In case of large number of rows, the database might even be writing to the disk while sorting the rows.
This is not the end of the world. But if we are seeing Using filesort
and the query is slow, then we might want to use an index.
Note - A column’s index can be used even if we have id
column in the order by clause since indexes for every column always also store the id
of the rows.
A new optimization in MySQL 8+ allows the database to read the index backward. Although it has some performance penalty. So if we’re always looking to scan in desc order, we might want to update the index accordingly.
Sorting with composite indexes
We can use composite indexes while ordering but we need to make sure that the columns are being accessed in the same order as the order in which they’re present in the index.
Even if in a 3 column composite index, we are ordering by only the 2nd column, we can use the index if we use the 1st column in the where
clause.
id
can always be used as discussed but it must appear in the end after the 3rd column. Otherwise, index will not be useable.
Also, we cannot order in ascending by the 1st column while ordering in descending by the 2nd column. The index can be read in only 1 direction. This is, however, possible if the index has been specified in this order.
Counting results
We can use COUNT
to count the rows. COUNT(*)
will count the number of rows. COUNT(column)
will count the number of distinct column
values.
We can conditionally count the rows as follows. The NULL
values are not counted.
count(if(dayofweek(rental_date) in (1,7), null, 1))
We can use similar conditionals while using other aggregate functions such as sum
.
Dealing with NULLs
NULL
is not equal to itself in MySQL.
If we want to write a condition where 2 NULLs
are considered equal, then we need to use the spaceship operator - col1 <=> col2
.
References
https://planetscale.com/learn/courses/mysql-for-developers/queries