How to write efficient queries in MySQL

Introduction to queries

We are going to talk about actually accessing the data / querying the data efficiently.

`EXPLAIN` overview

We’ll try to understand the output of the EXPLAIN command.

id - Shows the ID of the row in the output. Can be NULL.
select_type - Possible values are SIMPLE, UNION, PRIMARY.
table - The table we’re accessing.
partitions - Out of the scope of this course.
type - Discussed in the next section.
possible_keys - The keys that can be used in the given query.
key - The key that the database actually uses to resolve the query.
key_len - The length of the key.
ref - The value that is being compared to the index. e.g. const for constant, function, column if we are doing a join, etc. For instance consider a query like this - select * from people where first_name = “aaron”. Here the value being compared is a const.
rows - The best estimate for the number of rows that the database will have to access in the primary index.
filtered - The query does the index work. Then it goes back to the primary index to get the actual row. This filtered column shows an estimate for the percentage of rows that the engine thinks, it will be keeping (after the index work).
- e.g. Consider this query - select * from people where first_name = “aaron” and bio like “%error%“;
- The filtered value shows the percentage of the rows that will be kept using this part of the query - bio like "%error%";, i.e. after the index work.
Extra - Shows extra information. e.g. Using where shows that the database will have to do some extra filtering after the index work is done.

`EXPLAIN` access types

We’ll go from the best to worst types that we can see.

const - Unique lookup - Indicates that there will only be 1 row in the result.
ref - Index assisted direct lookup - Indicates that there is an index that will be used.
fulltext - Fulltext index will be used.
range - A range scan will be done. Here during the index lookup, the engine traverses down and then does a horizontal range scan.
index - The entire index will be scanned to look for what we need. It’s better than scanning the entire table as the index table will have lesser number of columns. So it will be faster.
ALL - The entire table will be scanned.

`EXPLAIN ANALYZE`

We have different formats in which we can get the EXPLAIN output.

explain format=tree query; - This format shows the output in a tree with the cost. So we can see the most expensive part of the query. We read this tree inside out - we start at the most nested detail and then work our way out.
explain format=json query; - This shows a machine readable json with a lot of additional data.
explain analyze query; - It is the tree format with just 1 difference. It runs the query! It also shows some additional data.

Index obfuscation

Let’s see index obfuscation and how to avoid it.

Consider this query on the film table. It has an index on the length column. Length is the length of the film in minutes. The index would also be in minutes.

We want to find the movies shorter than 2 hours.

explain select * from film where length / 60 < 2;

This causes index obfuscation and the index on the length column is not used. To fix this, we can alter the query to be like this -

explain select * from film where length < 120;

This allows the index to be used.

But if the mathematical operation was different per row, then it would not have been possible to un-obfuscate the index.

Redundant and approximate conditions

A redundant conditions are conditions that do not change the results and they cannot do so logically.

e.g. select * from people where id < 5 and id < 100.

Consider this query. It we have an index on due_date, it would be obfuscated because we are using due_time. The mathematical operation is different for every row, so we cannot move it to the other side of the equation.

select * from todos where addtime(due_date, due_time) between now() and now() + interval 1 day;

We can set it up to use index like this -

select *
from todos
where addtime(due_date, due_time) between now() and now() + interval 1 day
and
due_date between current_date and current_date + interval 1 day;

However, the result in both the situations is the same. So the second query shows a case of using a redundant condition where the condition is useful but not needed.

Another such query is this -

select *
from items
where price * tax < 100
and price < 100;

Next, we’ll discuss some generic querying principles.

Select only what we need

Select only the columns that we need.

One thing to note is that if we’re using a framework like ActiveRecord, then the columns that have not been selected will always return NULL. If these missing columns are being used in some background logic, then it will work in an unexpected manner.

Another note is that we can mark the columns as invisible at the database level. These columns are not returned unless requested explicitly.

Limiting rows

Use index to reduce the rows that are looked up.
Use where to filter out the rows from the looked up rows.
Use LIMIT and OFFSET for pagination. Always ORDER on LIMIT. Otherwise MySQL will decide how to order the rows.
Use aggregate functions to get the aggregate value instead of querying all the rows and then doing the math in memory.

An overview of joins

Here, we saw the different types of joins.

Indexing joins

Here we’ll discuss the importance of indexing joins.

Consider a case where we have 3 tables - actors, films, film_actors.

If we run explain on the following query, we’ll see a separate key that is in use for each of the 3 tables. The ref column shows the value with which the index value is being compared to.

select
  film.title, actor.name
from
  films
  left join film_actors on film_actors.film_id = films.id
  left join actors on film_actors.actor_id = actors.id
  where films.id <= 10;

With INNER JOIN, MySQL gets to decide the order in which the tables are queried. By default the order in the above query would be like this - films, film_actors, actors.

If there is an index on the table on the join column, MySQL uses indexes to join the tables. Otherwise, for every row in the left table, it scans the entire right table to find the rows that match the join criteria.

Subqueries

Initial query

select
  distinct customer.first_name, customer.last_name
from
  customer
  inner join payment on customer.id = payment.customer_id
where
  payment.amount > 7.99;

The same query rewritten using subqueries.

select
  customer.first_name, customer.last_name
from
  customer
where id in (
  select customer_id from payment where amount > 7.99
);

These subqueries are enhanced using optimization techniques, such as “semi-join” / “anti-join”, under the hood by MySQL.

Common table expressions (CTEs)

CTEs allow us to make really huge complicated queries. e.g. allow us to reference a subquery in subsequent queries such that the subquery is run only once.

with spend_last_6 as (
  select
    customer_id,.
    sum(amount) as total_spend
  from
    payment
    inner join customer on customer.id = payment.customer_id
  where
    store_id = 1
    and payment_date > CURRENT_DATE - INTERVAL 6 MONTH
  group by
    customer_id
);

select
  *
from
  spend_last_6
where
  total_spend > (select avg(total_spend) from spend_last_6);

Recursive CTEs

It is a CTE that references itself. It has 2 parts - the initial condition and the recursive condition.

with recursive numbers as (
  select 1 as n
  union all
  select n+1 from numbers where n < 10
);

select * from numbers;

Consider a self-referencing table with these columns. The records in this table form a tree structure as the parent record can also have its own parent record and so on.

id, name, parent_id

Recursive CTEs can be used in this case -

with recursive all_categories as (
  select
    id,
    name,
    0 as depth,
    CAST(id as CHAR(255)) as path
  from categories where parent_id is null
  union all
  select
    categories.id,
    categories.name,
    depth+1,
    CONCAT(path, "->", categories.id)
  from all_categories inner join categories on all_categories.id = categories.parent_id
);

select * from all_categories;

The output table shows the path to each record in this form - 1→3→12→11. The depth of the record with id 11 will be 3 and the depth of the record with id 1 will be 0.

Unions

Unions, like joins, combine the data from different tables.

Unlike joins, they increase the number of rows. Joins combine the columns.

union eliminates duplicates. It is slower since the database needs to sort and compare rows to identify duplicates. It also requires more memory. union all includes all the rows even if they are duplicates.

select 1 as n, "a" as letter
union
select 2, "b"
union
select 2, "b";

select 1 as n, "a" as letter
union all
select 2, "b"
union all
select 2, "b";

Window functions

Window functions are a feature that allow us to perform calculation across a set of rows that are related to the current row without collapsing the rows into a single result (like aggregate functions do).

They provide us a way to calculate something over a window of rows while keeping the individual rows in the result.

We can choose the window (subset of rows to operate on) using the OVER clause.

select
  customer_id, rental_id, amount, payment_date,
  row_number() over(partition by customer_id order by payment_date asc) as num,
  first_value(payment_date) over(partition by customer_id order by payment_date asc) as first_rental_date,
from
  payment
order by customer_id, payment_date asc;

We can define windows externally for ease of use. These are called named windows.

select
  customer_id, rental_id, amount, payment_date,
  row_number() over(window_asc) as num,
  first_value(payment_date) over(window_asc) as first_rental_date,
from
  payment
window
  window_asc as (partition by customer_id order by payment_date asc)
order by customer_id, payment_date asc;

Sorting and limiting

Ordering the rows is not free. But we can optimize it using indexes. We order using ORDER BY.

Consider this query. If the ordered column is not deterministic enough, ie. it has duplicate values, then different rows might be returned in every query. In such a case, we’ll need order by more columns till the result becomes deterministic.

select
  id, birthday
from
  people
order by
  birthday asc,
  id
limit 100 offset 200;

Sorting with indexes

MySQL has 2 ways of producing a sorted result set - sort the result or scan the index in order. Indexes are already ordered. Thus indexing makes sorting faster.

If the sorting is being done without indexes, then the explain output of the query will show Using filesort. This can happen in memory. In case of large number of rows, the database might even be writing to the disk while sorting the rows.

This is not the end of the world. But if we are seeing Using filesort and the query is slow, then we might want to use an index.

Note - A column’s index can be used even if we have id column in the order by clause since indexes for every column always also store the id of the rows.

A new optimization in MySQL 8+ allows the database to read the index backward. Although it has some performance penalty. So if we’re always looking to scan in desc order, we might want to update the index accordingly.

Sorting with composite indexes

We can use composite indexes while ordering but we need to make sure that the columns are being accessed in the same order as the order in which they’re present in the index.

Even if in a 3 column composite index, we are ordering by only the 2nd column, we can use the index if we use the 1st column in the where clause.

id can always be used as discussed but it must appear in the end after the 3rd column. Otherwise, index will not be useable.

Also, we cannot order in ascending by the 1st column while ordering in descending by the 2nd column. The index can be read in only 1 direction. This is, however, possible if the index has been specified in this order.

Counting results

We can use COUNT to count the rows. COUNT(*) will count the number of rows. COUNT(column) will count the number of distinct column values.

We can conditionally count the rows as follows. The NULL values are not counted.

count(if(dayofweek(rental_date) in (1,7), null, 1))

We can use similar conditionals while using other aggregate functions such as sum.

Dealing with NULLs

NULL is not equal to itself in MySQL.

If we want to write a condition where 2 NULLs are considered equal, then we need to use the spaceship operator - col1 <=> col2.

References

https://planetscale.com/learn/courses/mysql-for-developers/queries

How to write efficient queries in MySQL

Learn how to write efficient queries in MySQL

Table of contents

Introduction to queries

`EXPLAIN` overview

`EXPLAIN` access types

`EXPLAIN ANALYZE`

Index obfuscation

Redundant and approximate conditions

Select only what we need

Limiting rows

An overview of joins

Indexing joins

Subqueries

Common table expressions (CTEs)

Recursive CTEs

Unions

Window functions

Sorting and limiting

Sorting with indexes

Sorting with composite indexes

Counting results

Dealing with NULLs

References

How to write efficient queries in MySQL

Learn how to write efficient queries in MySQL

Table of contents

Introduction to queries

EXPLAIN overview

EXPLAIN access types

EXPLAIN ANALYZE

Index obfuscation

Redundant and approximate conditions

Select only what we need

Limiting rows

An overview of joins

Indexing joins

Subqueries

Common table expressions (CTEs)

Recursive CTEs

Unions

Window functions

Sorting and limiting

Sorting with indexes

Sorting with composite indexes

Counting results

Dealing with NULLs

References

`EXPLAIN` overview

`EXPLAIN` access types

`EXPLAIN ANALYZE`