Indexes in MySQL

Indexes in MySQL

Understand indexes in MySQL and learn when to use them.

What is an index?

An index is a -

  • A separate data structure - It is separate from the actual table.
  • A copy of (part of) our data - Whenever we add an index to any column or columns in a table, it copies a part of the data into the index table.
  • Points back to the row - To get the full data, the index row stores a pointer to the actual row in the table.

Rules

Some rules to keep in mind while creating indexes as given below. Create -

  • As many as we need - Indexing makes query faster. Add as many indexes as needed.
  • As few as we can get away with - If we can achieve what we want by creating 1 index, don't create 2.

In order to create correct indexes, we need to look at the queries.

  • Schema is driven by the data.
  • Indexes are driven by the queries.

B+ trees

It is a binary tree with a special feature. Each leaf node contains the pointer to the next leaf node such that we can traverse the entire index just by traversing the leaf nodes. The leaf nodes form a linked list.

Primary keys

Primary key is a column in a table with following properties -

  • Not nullable
  • Unique - A unique index is automatically created for the Primary key column.

There is only 1 primary key in a table. Every other key is a secondary key.

(A key is a column or a set of columns that can uniquely identify every row in the table.)

CREATE TABLE users (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255)
);

Note: The Primary key index is the table! The leaf nodes contain the actual data. This is different from the Secondary keys in which the leaf nodes contain the primary key. This primary key can then be used to query the actual data by traversing the primary index's B+ tree.

Note: We have to have a Primary key. If we don't add a primary key in our table, MySQL will secretly create a primary key or use a unique not nullable index if available.

Primary key data types

We should choose the data type that gives enough room to grow such as a BIGINT.

If we choose data types like UUID/GUID, we have following drawbacks -

  • It will make the secondary indexes bigger as they also store the Primary key.
  • With BIGINT, every new row goes at the end of the table. With UUID/GUID, the rows can be added in the middle which cause a performance overhead where the B tree needs to rebalance.

If we do want to choose such as a data type, choose an alternative version that is sorted such as ULID, KSUID, etc.

Where to add indexes

To recap - the query should drive the index.

  • Sometimes we'll need to rewrite the query to make an effective use of the indexes.
  • Sometimes we'll need to update the indexes to make the query faster.

Let's take a look at the following index and the queries that would benefit from that index -

Index on a column A.

The queries that would benefit from this index are -

  • Queries filtering records by a value on column A.
  • Queries filtering records by greater/less than a value on column A.
  • Queries ordering by column A.
  • Queries that group by column A.

Index selectively

Note that MySQL can only ever use 1 key in a query.

Consider the following case -

  1. We add single column indexes on 2 columns, A and B.
  2. We query the rows and use a WHERE clause to filter by both A and B - SELECT * from table WHERE A = valueA AND B = valueB;.
  3. The possible keys will be A and B. However, the key on column A is used.

Why would this might have happened? MySQL keeps track of the shape of the data such as the distribution of the data and it has an idea about the distinct values in columns A and B. This is also called the cardinality of the column. A has a higher cardinality, so MySQL decides to use the index on A and filter the found rows by the value of B.

Columns with less cardinality might not be a good choice to put the index on. In fact, MySQL might even decide not to use the index on a column if the selectivity of the index is very low.

Selectivity of column A

SELECT COUNT(distinct A) / COUNT(*) FROM table;

The higher the selectivity the better. The selectivity of the id column would be the highest (equal to 1).

We can see such data using the query - SHOW indexes FROM table;. However, the data here (such as cardinality) might be not be accurate since this just shows the best guess.

Prefix indexes

When we have long strings, it might be better to index a prefix of the string. The strings are filtered by the prefix and then the matched strings are scanned for the exact match.

We can index prefixes as follows -

ALTER TABLE people ADD INDEX(first_name(5));

Trick: To determine the length of the prefix to index, we can calculate the selectivity of the prefixes of different lengths, 1 to n, and compare them with the selectivity of the entire string. We can then select the prefix length that has the selectivity closest to the selectivity of the entire string.

SELECT
  COUNT(DISTINCT first_name) / COUNT(*) AS original,
  COUNT(DISTINCT LEFT(first_name, 1)) / COUNT(*) AS left1,
    COUNT(DISTINCT LEFT(first_name, 2)) / COUNT(*) AS left2,
    COUNT(DISTINCT LEFT(first_name, 3)) / COUNT(*) AS left3
FROM
  people;

Do note that we cannot use prefixed index to order or group data.

Composite indexes

These are indexes on multiple columns.

ALTER TABLE people ADD INDEX multi_col_index(first_name, last_name, birthday);

Now when we check the indexes -

SHOW INDEXES FROM people;

  • we get the multi_col_index. Here we see an additional data point about the index - Seq_in_index. This shows the order of the different columns in the index.

A multi-col index will only be used if the columns in the query are present in the order of the columns in the index. However, all columns don't need to appear in the query.

e.g. If the query filters by first_name and last_name but not by birthday, then also the index will be used. But if the query filters only by last_name, then the index will not be used as the 1st column is missing. This shows the 1st rule, Left-to-right, no skipping.

When we EXPLAIN the query, we can see the key_len column. This shows the length of the index that the database was able to use. A lower than expected key_len value indicates that the index is only partially utilized.

There are 2 rules for composite indexes -

  1. Left-to-right, no skipping - MySQL can only access the index in order, starting from the leftmost column and moving to the right. It cannot skip columns in the index.
  2. Stops at the first range - MySQL stops using the index after the first range condition is encountered.

Covering indexes

An index could be covering for 1 query and the same index could not be covering another query.

A covering index is an index that supplies everything that a particular query needs, e.g. whatever columns the query is selecting, filtering on, and ordering by.

How does it differ from the secondary index?

Let's consider when we make a query that uses a secondary index. Once the record is found in the secondary index, we get the primary key of the data and use it to find the record in the primary index.

When using a covering index, we do not go back to the primary index. All the information we need is in the covering index.

Functional indexes

Consider this query -

SELECT * FROM people WHERE YEAR(birthday) = 1989 LIMIT 10;

The YEAR function, obfuscates the index on the birthday column and it is not being used. Updating the query to this prevents the index from being obfuscated.

SELECT * FROM people WHERE birthday BETWEEN '1989-01-01' AND '1989-12-31';

But there is no good way to rewrite this query so as to un-obfuscate this column -

SELECT * FROM people WHERE MONTH(birthday) = 1 LIMIT 10;

In this case, we can create a functional index (function based index). These are available in MySQL 8+.

ALTER TABLE people ADD INDEX index_name((MONTH(birthday)));

Note that the additional set of parenthesis on the functional part are intentional.

Indexing JSON columns

Here we'll learn, how do we index JSON?

Consider a table with a json column with data like this -

{ email: "huck@example.com", name: "Huck Finn" }

We can add an index on the email in the following manner -

  1. Add a generated column that returns the email from the JSON.
  2. Index the generated column.
ALTER TABLE json_data ADD COLUMN email VARCHAR(255) GENERATED ALWAYS AS (json->>'$.email');
ALTER TABLE json_data ADD INDEX (email);

This index will be used even if we access the email value like this in the query - json->>'$.email' as MySQL is able to find out that there is a generated column using this value.

Indexing for wildcard searches

Consider this query - SELECT * FROM people WHERE email LIKE 'aaron%';

We can add an index for this query simply like this - ALTER TABLE people ADD INDEX (email);.

When the wildcard is at the end, MySQL is able to use the index up until it reaches the wildcard. It is not able to use the index when we search like this - %aaron%, or this - %aaron.

If we want to search by a substring, we can consider using a generated column like in functional indexes.

Fulltext indexes

When we want to search across multiple columns or find a few words in the middle of a column, then we look into fulltext searching. This feature allows us to simulate the functionality of a search engine using just the database.

We can add the index like this - ALTER TABLE people ADD FULLTEXT INDEX full_index(first_name, last_name, bio);

To utilize a FULLTEXT index, we write queries like this -

SELECT * FROM people WHERE MATCH(first_name, last_name, bio) AGAINST ("multiple words here");

The default is the NATURAL MODE. We also have a BOOLEAN MODE here -

SELECT * FROM people WHERE MATCH(first_name, last_name, bio) AGAINST ("+absolutely_include_this_word -excluded_word might_be_included" IN BOOLEAN MODE);

A really interesting use case is doing this to sort the results by relevancy score -

SELECT
  id, bio,
  MATCH(first_name, last_name, bio) AGAINST ("+absolutely_included_this_word -excluded_word might_be_included" IN BOOLEAN MODE) AS relevance
FROM people
SORT BY relevance;

Invisible indexes

We can make indexes invisible and monitor the queries. If nothing changes, we can remove the index.

ALTER TABLE people ALTER INDEX email INVISIBLE;
ALTER TABLE people ALTER INDEX email VISIBLE;

Duplicate indexes

Indexes that have overlapping leftmost prefixes are duplicates. If in such as a case, one of the indexes is, in its entirety, a leftmost prefix of another, then the shorter index can be removed.

Foreign keys

We can have a foreign key without a foreign key constraint. To add a foreign key constraint, the referencing column and the referenced column should be of the same data type, including signed/unsigned value.

Beyond a scale, ON DELETE CASCADE, might not be best thing to do as it might be cascading over millions of rows. Also, if a resource like a team is never meant to be deleted, it's better to add ON DELETE RESTRICT, so that prevent it from getting deleted by mistake.

References