Steve Meyers has 15 experience as a MySQL DBA, web software architect, and system administrator, with an emphasis on scalability

The evils of the "full table scan"

+ No comment yet
Those three words are guaranteed to make any DBA shudder.  When you EXPLAIN a query, and it tells you that it is not using any index, it has to read (or scan) the entire table to find the row(s) you want.  This may not be noticeable with 84 rows, but with 84 million it can take a while.

There are times when this may be unavoidable.  Those are exceptions, and we'll talk about how to deal with them in a different post.  For now, we'll discuss how to identify full table scans, and how to avoid them.

MariaDB [test]> EXPLAIN SELECT * FROM presentations WHERE room=403;
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | presentations | ALL  | NULL          | NULL | NULL    | NULL |   84 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+


Since the "key" is "NULL", we are unable to use a key for this query.  This is a simple query, and it's easy to identify that adding a key to "room" will take care of the problem.  What if we needed to determine if a room is double-booked?

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b ON b.id > a.id
> AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                          |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
|    1 | SIMPLE      | a     | ALL  | PRIMARY       | NULL | NULL    | NULL |   84 |                                                |
|    1 | SIMPLE      | b     | ALL  | PRIMARY       | NULL | NULL    | NULL |   84 | Range checked for each record (index map: 0x1) |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+


Now we have a real problem.  Not only are we doing a full table scan, we're doing it twice!  And that's just at first glance.  When you understand how the database is processing the join, you realize that it needs to examine 84 * 84 rows.  This is because for each row in the first table, we need to examine every row in the second table.  To determine how costly a query is, you actually need to multiply all values in the "rows" column of the EXPLAIN output.  In this case, it's not quite that bad because of the range check -- it will actually be about 84 * 42, if I remember my math correctly.  MySQL isn't sure of the exact details, so it puts the full amount for the second table (84) and a note in the Extras.

Again, we're dealing with a small table; 3528 rows really isn't a huge deal.  But you can see how as the table scales up, this query gets exponentially harder.  There is a real world application of this principle.  Many developers use test databases for their development, and then deploy their code to production.  Test databases typically have very little data, so unless you EXPLAIN your queries, you'll probably never catch problems like this before your production site breaks.

Back to our bad query, the full table scan for table "a" is actually unavoidable.  We really do want to examine every presentation to ensure it has not conflicts.  When you EXPLAIN a JOIN, however, you don't want to ever see anything besides "1" for the "rows" after the first table.  We need to make sure that the JOIN conditions for the 2nd table are well-indexed, to avoid having to look up more than one row in the second table per row in the first table.  In this case, we want to add an index across "ts" and "room".  Now, let's try that EXPLAIN again:

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b
> ON b.id > a.id AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+
| id   | select_type | table | type | possible_keys   | key     | key_len | ref                   | rows | Extra       |
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+
|    1 | SIMPLE      | a     | ALL  | PRIMARY,ts_room | NULL    | NULL    | NULL                  |   84 |             |
|    1 | SIMPLE      | b     | ref  | PRIMARY,ts_room | ts_room | 11      | test.a.ts,test.a.room |    9 | Using where |
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+

Unfortunately,we still don't have 1 for the 2nd table, but because of the nature of the query, there's not much we can do to help that.  It's still a lot better than the original query.  The 9 is actually an estimate based on the database's internal statistics; if we don't have any conflicts, it will actually be 1.  We can change the index to a UNIQUE index, and we'll actually see the 1:

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b
> ON b.id > a.id AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+
| id   | select_type | table | type   | possible_keys   | key     | key_len | ref                   | rows | Extra       |
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+
|    1 | SIMPLE      | a     | ALL    | PRIMARY,ts_room | NULL    | NULL    | NULL                  |   84 |             |
|    1 | SIMPLE      | b     | eq_ref | PRIMARY,ts_room | ts_room | 11      | test.a.ts,test.a.room |    1 | Using where |
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+

Of course, that just negated the need for the query looking for conflicts.

Post a Comment