Steve Meyers has 15 experience as a MySQL DBA, web software architect, and system administrator, with an emphasis on scalability

How to identify an unnecessary index

+ No comment yet
Let's look at the index from the wp_posts table in a standard WordPress installation.

SHOW KEYS FROM wp_posts;
+----------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| wp_posts |          0 | PRIMARY          |            1 | ID           | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_name        |            1 | post_name    | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            1 | post_type    | A         |           6 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            2 | post_status  | A         |           9 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            3 | post_date    | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            4 | ID           | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_parent      |            1 | post_parent  | A         |         147 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_author      |            1 | post_author  | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | yarpp_title      |            1 | post_title   | NULL      |         590 |     NULL | NULL   |      | FULLTEXT   |         |               |
| wp_posts |          1 | yarpp_content    |            1 | post_content | NULL      |        1772 |     NULL | NULL   |      | FULLTEXT   |         |               |
+----------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
10 rows in set (0.00 sec)

How can we tell which indexes are good, and which are bad?

To begin with, let's look at the cardinality.  Cardinality is a measure of how many unique values there are for that column.  A column with very low cardinality is not very useful as an index, since it does not narrow down queries very much.

The post_type and post_status columns have a fairly low cardinality, so we'd ordinarily be concerned about them, but they're part of a multi-column index.  Since there are four columns to the index, their own low cardinality will probably be masked by the other columns.  As always, you know your own data better than I do, so you should know whether that applies in your circumstances.  What we're looking at is the cardinality of the combination of columns, so if the values of the columns are related to each other, you could still end up with low cardinality.

On the other hand, post_author has very low cardinality, and is a single-column index.  This column is potentially useful for blogs with many authors.  For most blogs, it ends up being more of a hindrance than a help, as it negatively affects write performance.

Going back to the type_status_date index, we noticed that the first two columns have low cardinality, but you'll notice that the third column's cardinality is equal to the primary key's cardinality.  This means that once we use the third column of the key (having necessarily already used the first two columns), we have narrowed it down to a single row.  This means that the fourth column in the index probably does not add any useful information.

We'll demonstrate this with a few queries.


EXPLAIN SELECT * FROM wp_posts WHERE post_type IN ('post', 'page') AND post_status='publish' AND post_date > '2013-01-01 00:00:00' ORDER BY ID;
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
| id   | select_type | table    | type  | possible_keys    | key              | key_len | ref  | rows | Extra                                 |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
|    1 | SIMPLE      | wp_posts | range | type_status_date | type_status_date | 132     | NULL |   34 | Using index condition; Using filesort |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
1 row in set (0.01 sec)



EXPLAIN SELECT * FROM wp_posts WHERE post_type='post' AND post_status='publish' AND post_date > '2013-01-01 00:00:00' ORDER BY ID ;
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
| id   | select_type | table    | type  | possible_keys    | key              | key_len | ref  | rows | Extra                                 |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
|    1 | SIMPLE      | wp_posts | range | type_status_date | type_status_date | 132     | NULL |   33 | Using index condition; Using filesort |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
1 row in set (0.00 sec)

We're specifying the post_type, post_status, and post_date in both queries.  You'll notice that the "rows" doesn't change much when we just select posts instead of posts and pages.  Even though the cardinality of post_type is a whopping 6, even that is a bit misleading, as almost all of the rows have type "post", making the real-world cardinality even lower than that.

If we specify an exact date and time, rather than a range, then we narrow it down to a single row.

EXPLAIN SELECT * FROM wp_posts WHERE post_type='post' AND post_status='publish' AND post_date = '2013-02-05 10:53:57' ORDER BY ID ;
+------+-------------+----------+------+------------------+------------------+---------+-------------------+------+------------------------------------+
| id   | select_type | table    | type | possible_keys    | key              | key_len | ref               | rows | Extra                              |
+------+-------------+----------+------+------------------+------------------+---------+-------------------+------+------------------------------------+
|    1 | SIMPLE      | wp_posts | ref  | type_status_date | type_status_date | 132     | const,const,const |    1 | Using index condition; Using where |
+------+-------------+----------+------+------------------+------------------+---------+-------------------+------+------------------------------------+
1 row in set (0.00 sec)

How often are we going to query the database for an exact date and time, though?  Generally, we'll either be specifying a date/time range, or ordering by the date/time.  In either case, we cannot use the portion of the index after post_date.

So does that mean the ID portion of the key is useless?  In all likelihood, yes.  I can't imagine any scenario where it would be useful.

As an aside, when using the InnoDB storage engine, secondary indexes merely point back to the primary key, so the ID would already be referenced in the index.

Post a Comment