Steve Meyers: November 2012

Tuesday, November 27, 2012

The evils of the "full table scan"

Those three words are guaranteed to make any DBA shudder. When you EXPLAIN a query, and it tells you that it is not using any index, it has to read (or scan) the entire table to find the row(s) you want. This may not be noticeable with 84 rows, but with 84 million it can take a while.

There are times when this may be unavoidable. Those are exceptions, and we'll talk about how to deal with them in a different post. For now, we'll discuss how to identify full table scans, and how to avoid them.

MariaDB [test]> EXPLAIN SELECT * FROM presentations WHERE room=403;
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | presentations | ALL  | NULL          | NULL | NULL    | NULL |   84 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+

Since the "key" is "NULL", we are unable to use a key for this query. This is a simple query, and it's easy to identify that adding a key to "room" will take care of the problem. What if we needed to determine if a room is double-booked?

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b ON b.id > a.id
> AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                          |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
|    1 | SIMPLE      | a     | ALL  | PRIMARY       | NULL | NULL    | NULL |   84 |                                                |
|    1 | SIMPLE      | b     | ALL  | PRIMARY       | NULL | NULL    | NULL |   84 | Range checked for each record (index map: 0x1) |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+

Now we have a real problem. Not only are we doing a full table scan, we're doing it twice! And that's just at first glance. When you understand how the database is processing the join, you realize that it needs to examine 84 * 84 rows. This is because for each row in the first table, we need to examine every row in the second table. To determine how costly a query is, you actually need to multiply all values in the "rows" column of the EXPLAIN output. In this case, it's not quite that bad because of the range check -- it will actually be about 84 * 42, if I remember my math correctly. MySQL isn't sure of the exact details, so it puts the full amount for the second table (84) and a note in the Extras.

Again, we're dealing with a small table; 3528 rows really isn't a huge deal. But you can see how as the table scales up, this query gets exponentially harder. There is a real world application of this principle. Many developers use test databases for their development, and then deploy their code to production. Test databases typically have very little data, so unless you EXPLAIN your queries, you'll probably never catch problems like this before your production site breaks.

Back to our bad query, the full table scan for table "a" is actually unavoidable. We really do want to examine every presentation to ensure it has not conflicts. When you EXPLAIN a JOIN, however, you don't want to ever see anything besides "1" for the "rows" after the first table. We need to make sure that the JOIN conditions for the 2nd table are well-indexed, to avoid having to look up more than one row in the second table per row in the first table. In this case, we want to add an index across "ts" and "room". Now, let's try that EXPLAIN again:

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b
> ON b.id > a.id AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+
| id   | select_type | table | type | possible_keys   | key     | key_len | ref                   | rows | Extra       |
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+
|    1 | SIMPLE      | a     | ALL  | PRIMARY,ts_room | NULL    | NULL    | NULL                  |   84 |             |
|    1 | SIMPLE      | b     | ref  | PRIMARY,ts_room | ts_room | 11      | test.a.ts,test.a.room |    9 | Using where |
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+

Unfortunately,we still don't have 1 for the 2nd table, but because of the nature of the query, there's not much we can do to help that. It's still a lot better than the original query. The 9 is actually an estimate based on the database's internal statistics; if we don't have any conflicts, it will actually be 1. We can change the index to a UNIQUE index, and we'll actually see the 1:

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b
> ON b.id > a.id AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+
| id   | select_type | table | type   | possible_keys   | key     | key_len | ref                   | rows | Extra       |
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+
|    1 | SIMPLE      | a     | ALL    | PRIMARY,ts_room | NULL    | NULL    | NULL                  |   84 |             |
|    1 | SIMPLE      | b     | eq_ref | PRIMARY,ts_room | ts_room | 11      | test.a.ts,test.a.room |    1 | Using where |
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+

Of course, that just negated the need for the query looking for conflicts.

Wednesday, November 14, 2012

Other kinds of indexes

+ No comment yet

B-Tree Index

As MySQL's default index type, a b-tree index is reasonably fast to insert new items into, and is fast at looking up individual items or even ranges. The index is stored as a branching tree, so looking up any one item is not an expensive operation. Each level of branching allows an order of magnitude more items. It's not always the best solution, but in some cases it may be the only one. B-tree indexes are supported on all standard storage engines.

Hash Index

Hash indexes are similar to associative arrays (or hashes) in many programming languages. They are very fast for looking up individual items, but cannot look up ranges. They are stored as a series of "buckets", and a hash algorithm is used for determining which bucket to put an item into. Good hash algorithms are very inexpensive, and spread input values evenly among the buckets. It is because they are spread out that range operations become impossible. Hash indexes are currently only supported for MEMORY and NDB (MySQL Cluster) storage engines.

Fulltext Index

Fulltext indexes are good for searching out individual words inside a long block of text. "Good" is a relative term; MySQL's fulltext indexes perform better than no index, but for decent performance on larger tables, you'll probably want to use a dedicated search engine such as Sphinx, Solr, Lucene, or other similar applications. I personally use Sphinx, because you can connect to it from MySQL using the SphinxSE storage engine. Fulltext indexes are only available in the MyISAM storage engine, which is another reason to look into other solutions.

Spatial Index

If you're working with spatial data, such as latitude and longitude, you might be able to get a speed boost with spatial indexes. Be careful with the spatial functions though; if you're not using MySQL 5.6 or MariaDB 5.5 (or higher), the functions that test relationships between polygons operate in "minimum bounding rectangle" mode. Spatial indexes are only available in the MyISAM storage engine, although spatial data can be stored in InnoDB. In order to protect your data, I would recommend storing your data in InnoDB, and creating a cache table in MyISAM for doing spatial lookups on.

Roll Your Own Index

Sometimes, it makes sense to create your own indexing system. I'm not talking about hacking one into the MySQL source code, but rather finding other ways to index your data. In a thread on the MySQL mailing list from 2001, I discussed the benefits of using a separately calculated hash column to index a column containing URLs. The crux of it is that you don't want the length of your indexes to be too long, because you want MySQL to be able to fit all of your indexes in memory. It's also more expensive to write longer indexes to disk.

At another company, we found that as fast as spatial indexes are, integer b-tree indexes are faster. We rounded our latitude and longitude and put them together to create a grid, and then we were able to just request the items in a list of grid squares. We also included the more precise fields in our where clause, in order to avoid getting extra points back from our query.

So if the default index types aren't working for you, try thinking outside the box. There may be a solution that nobody else has thought of before, that will increase your index speed substantially.

Wednesday, November 7, 2012

Redundant Indexes

+ No comment yet

One of the more egregious forms of "too many indexes" is redundant indexes. For example, let us consider the following table:

MariaDB [test]> desc presentations;
+---------------+-----------------------+------+-----+---------+----------------+
| Field         | Type                  | Null | Key | Default | Extra          |
+---------------+-----------------------+------+-----+---------+----------------+
| id            | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| ts            | datetime              | NO   | MUL | NULL    |                |
| room          | mediumint(8) unsigned | NO   |     | NULL    |                |
| ss_presid     | int(10) unsigned      | NO   |     | NULL    |                |
| first_name    | varchar(255)          | NO   |     | NULL    |                |
| last_name     | varchar(255)          | NO   |     | NULL    |                |
| title         | varchar(255)          | NO   |     | NULL    |                |
| len_min       | mediumint(8) unsigned | NO   |     | NULL    |                |
| votes         | int(10) unsigned      | NO   |     | NULL    |                |
| timesid       | int(10) unsigned      | NO   |     | NULL    |                |
| num_conflicts | int(10) unsigned      | NO   |     | NULL    |                |
+---------------+-----------------------+------+-----+---------+----------------+
11 rows in set (0.00 sec)

Notice that the "Key" column doesn't really give us a ton of information. We know that "ts" has multiple indexes, but nothing more than that. As an aside, MySQL uses the terms "key" and "index" interchangeably. To get better information about our keys or indexes, use the "SHOW KEYS FROM" query:

MariaDB [test]> SHOW KEYS FROM presentations;
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| presentations |          0 | PRIMARY  |            1 | id          | A         |          84 |     NULL | NULL   |      | BTREE      |         |               |
| presentations |          0 | ts_room  |            1 | ts          | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
| presentations |          0 | ts_room  |            2 | room        | A         |          84 |     NULL | NULL   |      | BTREE      |         |               |
| presentations |          1 | ts       |            1 | ts          | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
4 rows in set (0.00 sec)

You'll notice that this is much more informative than the DESC command, at least as far as indexes are concerned. The format takes some getting used to, though. Use the "Key_name" column to determine which key we're talking about, as indexes can cross multiple columns.

Knowing that, we can tell that there are three keys: "PRIMARY", "ts_room", and "ts". The "ts_room" index has two columns. The "Seq_in_index" column tells us which comes first, although the "SHOW KEYS FROM" command will always return them in the proper order.

We have the primary key ("PRIMARY") which covers just the "id" column. Primary keys are always unique indexes, which means that no two rows are allowed to have the same value for that key. The "ts_room" index spans the "ts" and "room" columns, and is also unique. Last, the "ts" index just covers the "ts" column, and is not unique.

So where is the redundant index? First, we must understand how indexes work. The left-most prefix of any index can be used by itself. If we have an index on a varchar column, a SELECT query asking for rows with a value LIKE 'prefix%' will use the index on that column; a similar query looking for rows with a value LIKE '%suffix' will not use the index, since the left-most part is unspecified.

In multi-column indexes, this property also applies. The columns in the index are processed in order, so a query on our presentations table with just the "room" specified would not be able to use the "ts_room" index. However, a query with just the "ts" specified COULD use the "ts_room" index, since the "ts" is the leftmost part of it.

So now we've identified the redundant index: "ts" will never be needed, since its duties can be handled by "ts_room". So, how do we get rid of it?

MariaDB [test]> ALTER TABLE presentations DROP INDEX ts;
Query OK, 84 rows affected (0.01 sec)              
Records: 84  Duplicates: 0  Warnings: 0

Subscribe to: Posts ( Atom )