Examples of optimizing queries

Here, we'll take a look at some examples of optimizing queries using indexes.

MD5

Problem - How can we use an MD5 column to get an indexed search on a column that we might not be able to index otherwise?

Note that big columns, such as TEXT, are stored off-page in a separate area, and the database has to do extra work to go fetch them. So we might choose to make such columns invisible, so that select * doesn't pull them in unless they're named explicitly.
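Here's a minimal sketch of that, assuming MySQL 8.0.23+ (which supports invisible columns) and the urls table introduced below:

-- select * will no longer return url; it must be named explicitly
alter table urls modify column url text invisible;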

Consider this table - id, url(text).

To index the url column, we can add a generated column based on the MD5 hash of the URL and index that instead. This generated column can be binary to facilitate a super fast lookup, since we don't need a character-by-character comparison.

alter table urls add column url_md5 binary(16) generated always as (unhex(md5(url)));
alter table urls add index (url_md5);

select
  *,
  url    -- named explicitly in case the url column has been made invisible
from
  urls
where
  url_md5 = unhex(md5('https://url.com'))
  and url = 'https://url.com';

The redundant condition makes sure that we get the correct result even in case of an MD5 collision.

MD5 on multiple columns

Consider a case where we want to enforce uniqueness using 4 columns in a table.

We can add an MD5 generated column that generates the md5 of the concatenated values of the 4 columns.

alter table addresses add column md5 binary(16) generated always as (unhex(md5(concat_ws('|', col1, col2, col3, col4))));
alter table addresses add unique index (md5);

This is an alternative to adding a multi-part unique index on the 4 columns.
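For comparison, the multi-part version (using the same placeholder column names) would look like this:

alter table addresses add unique index (col1, col2, col3, col4);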

Here are some pros of the multi-part unique index - human readable, no hash collisions.

Here are some pros of the MD5 generated binary index - compact index, faster lookups.

Here are some cons of the MD5 generated binary index - hash collisions are possible (a colliding row would be wrongly rejected by the unique index), and on-the-fly computation adds processing overhead.

Bitwise operations

Consider a table with a lot of columns that store boolean values.

In this case, we can use a JSON column.

  • High flexibility

  • High readability

  • High storage footprint.

Another alternative is to use a TINYINT column, which is 8 bits in size, so each bit can store one boolean flag. e.g. we can store the value 255 (binary 11111111) if all 8 flags have been turned on.

  • Low flexibility.

  • Low readability.

  • Low storage footprint.

To filter records by the bits that are turned on/off, we can use bitwise operators. e.g. where current_flags & 17 = 17 matches rows where both bit 1 and bit 16 are set (17 = 16 + 1).
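Here's a minimal sketch, assuming a hypothetical users table and made-up flag positions (bit 1 = dark_mode, bit 16 = beta_features):

-- 8 one-bit flags packed into a single byte
alter table users add column flags tinyint unsigned not null default 0;

-- turn on dark_mode (1) and beta_features (16): 1 | 16 = 17
update users set flags = flags | 17 where id = 1;

-- users with both dark_mode and beta_features enabled
select * from users where flags & 17 = 17;

-- users with dark_mode turned off
select * from users where flags & 1 = 0;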

Timestamps versus booleans

To store some data, we can use a timestamp column instead of a boolean. This gives us additional information: e.g. instead of a column is_deleted, we can have a column called deleted_at, which records not just whether the row was deleted but also when.

The timestamp column will be of at least 4 bytes, which is 4 times bigger than a 1 byte boolean column. However, we will be able to add an index on the timestamp column. An index on the boolean column will be pretty much useless due to low cardinality.
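A minimal sketch, assuming a hypothetical posts table:

-- null means "not deleted"; a non-null value records when the deletion happened
alter table posts add column deleted_at timestamp null default null;
alter table posts add index (deleted_at);

-- the boolean-style check
select * from posts where deleted_at is null;

-- the extra information a boolean couldn't give us
select * from posts where deleted_at >= now() - interval 7 day;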

Claiming rows

Scenario - The user uploads a file in the application. We put the file somewhere and add a record to the imports table. At some point, a worker comes along and processes the import. However, we need to make sure that each import is processed only once.

One way would be to use locks to ensure that no other worker is able to claim a row that is already in the process of being claimed.

If we don’t want to use locks on a busy table, we can use this alternative -

update
  imports
set
  owner = 32,    -- this worker's unique ID; 0 means unclaimed
  available = 0
where
  owner = 0
  and available = 1
limit 1;

The worker can then process the rows it owns, one by one, without the fear of some other worker claiming the same row. We can also add a processed flag to track whether a given row has been processed.
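Fetching the claimed work back is then a plain select (again assuming worker ID 32 and the processed flag mentioned above):

select * from imports where owner = 32 and processed = 0;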

Summary tables

It is also called a rollup table.

Scenario - We have a table with a lot of historical data. We want to roll it up in a smaller, more manageable summary table.

e.g. Consider a case where we have a table that stores the sales in a store.

  • We can create a summary table that stores just the monthly sales totals for the completed months.

  • While querying these summarized records, we can use a union to append a live aggregate from the original table for the current (still-changing) month, as sketched below.
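A minimal sketch, assuming hypothetical sales and sales_summary tables, with the summary refreshed by a scheduled job:

-- historical months come from the cheap rollup table...
select month, total_sales
from sales_summary
where month < date_format(now(), '%Y-%m-01')

union all

-- ...and the current month is aggregated live from the source table
select date_format(created_at, '%Y-%m-01') as month, sum(amount) as total_sales
from sales
where created_at >= date_format(now(), '%Y-%m-01')
group by month;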

Meta tables

Scenario - How do we break up a very wide table, moving the columns we don't use very often into a secondary table?

If we want the data from both tables, we have to use a join. This join is the overhead of using meta tables.

One trick here is to define a CTE that joins both tables and use that CTE as the default way to query. Even trickier is to use the original table's name for the CTE, shadowing it. This way we get the full record when the meta data is available; otherwise we get just the subset of columns from the original table. Whether this is good or bad is up for discussion.
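A minimal sketch, assuming a hypothetical people table whose rarely-used columns were split out into people_meta:

with people as (
  -- inside the body, people still resolves to the base table,
  -- since a non-recursive CTE cannot reference itself
  select *
  from people
    left join people_meta using (id)
)
select * from people where id = 1;  -- transparently gets the joined record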

A more straightforward approach is to simply use the meta table explicitly: join to it when its columns are needed, and leave it alone otherwise.

Offset limit pagination

We can use LIMIT and OFFSET to implement pagination. Make sure to order the records deterministically: if the sort column can have ties, add a unique tiebreaker such as the id, so rows don't shift between pages.

To implement pagination, if we want to display the total number of pages, we’ll need to issue a separate count query. This can be expensive in some cases.

Alternatively, we can just show the previous and next buttons. To check whether there is a next page, we can query page_size + 1 records. The existence of the additional record indicates that there is a next page.
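A minimal sketch, assuming a hypothetical people table and a page size of 20, fetching page 3:

select * from people
order by birthday, id  -- id as a tiebreaker keeps the order deterministic
limit 21               -- page size + 1: a 21st row means there is a next page
offset 40;             -- (page - 1) * page size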

Note - Digging into deeper pages can start to slow down with this approach. We can use the deferred join approach (covered below), which is much faster.

Note - If the data is active (rows being added/deleted), the records will shift while the user is traversing the pages. In this case, we can use a cursor-based approach.

Cursor based pagination

This does not refer to database cursors.

It handles some of the flaws of offset limit pagination. But it has some other flaws.

In this approach, we keep track of the last record the user saw. We pass this cursor back and forth within the request/response cycle, and subsequent queries fetch records onwards from the received cursor. The cursor can be the ID of the record. If multiple sorted columns make up the cursor, we can encode them into a base64 token.

Pros -

  • No shifting of records for the user, even if the table is active.

Cons -

  • We need to keep track of the cursor between requests.

  • We might have gaps in the IDs.

  • Not possible to directly jump to a page.

  • Most likely, we would be sorting by a column other than the ID. All the columns we order by must be included in the cursor, which makes the logic much more complicated.
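A minimal sketch, assuming a hypothetical people table paged by id:

-- first page
select * from people order by id limit 20;

-- the application remembers the last id it saw (say 1020) as the cursor
select * from people where id > 1020 order by id limit 20;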

Deferred joins

This technique fixes an issue with offset/limit based pagination, where the deeper we go, the longer it takes to query the records.

In this approach, we paginate using only columns that are part of a covering index, then inner join the resulting IDs back to the actual table to get the full rows.

select * from people
inner join (
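  -- assuming a covering index on (birthday, id): the subquery pages through the slim index without touching full rows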
  select id from people order by birthday, id limit 20 offset 2000000
) people2 using (id);

This works better because we are throwing away less work: the scan past the first 2,000,000 rows happens over the slim covering index instead of full rows, and only the final 20 rows are fetched in full.

Geographic searches

In this section, we search for all the points within a 1-mile radius of a given point.
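A minimal sketch, assuming MySQL 8+ and a hypothetical places table with lat and lng columns:

select *
from places
where st_distance_sphere(
    point(lng, lat),        -- each row's coordinates (longitude first)
    point(-97.74, 30.27)    -- the search center
  ) <= 1609;                -- 1 mile is roughly 1,609 meters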

References

https://planetscale.com/learn/courses/mysql-for-developers/examples