Steve Meyers: 2012

Ernie Souhrada has a great writeup called The Optimization That (Often) Isn’t: Index Merge Intersection. It talks about how MySQL can perform an "index merge" in order to use more than one index.

For example, if you have an OR in your WHERE clause, and each of the conditions are indexable, MySQL can perform each index lookup separately, and then do a union on the results. Similarly, if you have an AND in your WHERE clause, and again each condition is indexable, MySQL can do an intersection of the results of the two indexes. MySQL will decide whether to use this optimization based on the statistics it keeps of your table's indexes.

The problem that Ernie identified happened because the customer had an index on a column with very little variance. Almost all of the rows had the same value, and the optimizer made an incorrect assumption when evaluating indexes.

For this reason, you should be very cautious about indexing columns with very little variance. It can make sense sometimes; perhaps almost all of the rows have a 1 in that column, but you need to find the few that have a 0. This is common when you have to process new rows, and then mark them as "processed". In this case, I will often use a separate queue table; this negates the possibility of the optimizer choosing to use a very unbalanced index on the main table.

Tuesday, December 11, 2012

How do JOINs work?

+ No comment yet

For beginning web developers, JOINs are a scary thing. Simple queries are easy enough, but once you add a second table to the mix, it becomes a lot more complex. In reality, JOINs aren't really all that hard to understand, but you do need to pay attention to what you're doing.

By default, JOINs in SQL are Cartesian; for each row in table A, you join it with every single row in table B, so the number of output rows equals the number of rows in A multiplied by the number of rows in B. This is generally not the desired outcome. In fact, the Drizzle project has broken SQL compatibility to make it so that Cartesian JOINs need to be specified explicitly. If no JOIN conditions are given, the query simply generates an error.

I remember a web developer showing me a query that he thought would work like a UNION. He had 20 "cache" tables that each contained several tens of thousands of rows of information. He wanted to get all of that information with a single query, rather than querying each table individually. This is how he wrote his query:

SELECT * FROM cache1, cache2, cache3, cache4, cache5, cache6, cache7, 
> cache8, cache9, cache10, cache11, cache12, cache13, cache14, cache15,
> cache16, cache17, cache18, cache19, cache20;

As you can imagine, the query never returned, and I think the database server eventually ran out of swap space.

Most web developers I've worked with tend to craft their JOIN queries with the condition in the WHERE clause. While this does (usually) work, it is not as easy for the next poor schlep reading your code to know how the JOIN works. It also is easier to mistakes when you do this, as you may forget to specify the condition for one of your JOINs.

JOIN conditions are given using the keywords ON or USING. The USING keyword lets you specify one or more columns that are identical across multiple tables, like so:

SELECT field1, field2 FROM table1 JOIN table2 USING (field3, field4);

The field(s) in the USING condition must have the same name in both tables. If this doesn't work with your schema, or if you have more complicated JOIN conditions, you should use the ON keyword:

SELECT field1, field2 FROM table1 JOIN table2 ON table1.field3=table2.field4;

When we see an ON or USING condition in a query, that tells us immediately that the JOIN is using that condition to filter the possible rows in the second table.

Ideally, as discussed in The evils of the "full table scan" previously, we should only find a single row in the second table for each row in the first table. Additionally, the column in the second table that we are using for filtering the JOIN will have an index on it. This will allow the query to use that index to find the matching rows, greatly speeding up the execution of our query.

If you can, it is also beneficial to eliminate as many rows as possible from the first table, so that fewer lookups happen on the second table. This is appropriately done in the WHERE clause.

By putting our JOIN conditions in the appropriate place, it is now easier for us to glance at our query and be able to tell how the tables connect together:

SELECT
 MONTH(STR_TO_DATE(t2.value, '%H:%i:%S %b %d, %Y ')) AS payment_month,
 COUNT(t1.transid) AS num,
 SUM(t3.value) AS total,
 SUM(t4.value) as shipping,
 SUM(t5.value) AS fee,
 SUM(t6.value) AS tax
FROM
 new_transactions t1
 JOIN new_transactions t2 ON t2.transid=t1.transid AND t2.field = 'payment_date'
 JOIN new_transactions t3 ON t3.transid=t1.transid AND t3.field = 'mc_gross'
 JOIN new_transactions t4 ON t4.transid=t1.transid AND t4.field = 'mc_shipping'
 JOIN new_transactions t5 ON t5.transid=t1.transid AND t5.field = 'mc_fee'
 JOIN new_transactions t6 ON t6.transid=t1.transid AND t6.field = 'tax'
WHERE
 t1.field = 'txn_type'
 AND t1.value = 'cart'

GROUP BY
 payment_month
WITH ROLLUP

A complex JOIN query, but we can easily glance at it and tell that all of the JOINs have appropriate conditions on them.

Tuesday, November 27, 2012

The evils of the "full table scan"

+ No comment yet

Those three words are guaranteed to make any DBA shudder. When you EXPLAIN a query, and it tells you that it is not using any index, it has to read (or scan) the entire table to find the row(s) you want. This may not be noticeable with 84 rows, but with 84 million it can take a while.

There are times when this may be unavoidable. Those are exceptions, and we'll talk about how to deal with them in a different post. For now, we'll discuss how to identify full table scans, and how to avoid them.

MariaDB [test]> EXPLAIN SELECT * FROM presentations WHERE room=403;
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | presentations | ALL  | NULL          | NULL | NULL    | NULL |   84 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+------+-------------+

Since the "key" is "NULL", we are unable to use a key for this query. This is a simple query, and it's easy to identify that adding a key to "room" will take care of the problem. What if we needed to determine if a room is double-booked?

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b ON b.id > a.id
> AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                          |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
|    1 | SIMPLE      | a     | ALL  | PRIMARY       | NULL | NULL    | NULL |   84 |                                                |
|    1 | SIMPLE      | b     | ALL  | PRIMARY       | NULL | NULL    | NULL |   84 | Range checked for each record (index map: 0x1) |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+

Now we have a real problem. Not only are we doing a full table scan, we're doing it twice! And that's just at first glance. When you understand how the database is processing the join, you realize that it needs to examine 84 * 84 rows. This is because for each row in the first table, we need to examine every row in the second table. To determine how costly a query is, you actually need to multiply all values in the "rows" column of the EXPLAIN output. In this case, it's not quite that bad because of the range check -- it will actually be about 84 * 42, if I remember my math correctly. MySQL isn't sure of the exact details, so it puts the full amount for the second table (84) and a note in the Extras.

Again, we're dealing with a small table; 3528 rows really isn't a huge deal. But you can see how as the table scales up, this query gets exponentially harder. There is a real world application of this principle. Many developers use test databases for their development, and then deploy their code to production. Test databases typically have very little data, so unless you EXPLAIN your queries, you'll probably never catch problems like this before your production site breaks.

Back to our bad query, the full table scan for table "a" is actually unavoidable. We really do want to examine every presentation to ensure it has not conflicts. When you EXPLAIN a JOIN, however, you don't want to ever see anything besides "1" for the "rows" after the first table. We need to make sure that the JOIN conditions for the 2nd table are well-indexed, to avoid having to look up more than one row in the second table per row in the first table. In this case, we want to add an index across "ts" and "room". Now, let's try that EXPLAIN again:

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b
> ON b.id > a.id AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+
| id   | select_type | table | type | possible_keys   | key     | key_len | ref                   | rows | Extra       |
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+
|    1 | SIMPLE      | a     | ALL  | PRIMARY,ts_room | NULL    | NULL    | NULL                  |   84 |             |
|    1 | SIMPLE      | b     | ref  | PRIMARY,ts_room | ts_room | 11      | test.a.ts,test.a.room |    9 | Using where |
+------+-------------+-------+------+-----------------+---------+---------+-----------------------+------+-------------+

Unfortunately,we still don't have 1 for the 2nd table, but because of the nature of the query, there's not much we can do to help that. It's still a lot better than the original query. The 9 is actually an estimate based on the database's internal statistics; if we don't have any conflicts, it will actually be 1. We can change the index to a UNIQUE index, and we'll actually see the 1:

MariaDB [test]> EXPLAIN SELECT a.id, b.id, a.ts, a.room FROM presentations a JOIN presentations b
> ON b.id > a.id AND b.ts=a.ts AND b.room=a.room;
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+
| id   | select_type | table | type   | possible_keys   | key     | key_len | ref                   | rows | Extra       |
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+
|    1 | SIMPLE      | a     | ALL    | PRIMARY,ts_room | NULL    | NULL    | NULL                  |   84 |             |
|    1 | SIMPLE      | b     | eq_ref | PRIMARY,ts_room | ts_room | 11      | test.a.ts,test.a.room |    1 | Using where |
+------+-------------+-------+--------+-----------------+---------+---------+-----------------------+------+-------------+

Of course, that just negated the need for the query looking for conflicts.

Wednesday, November 14, 2012

Other kinds of indexes

+ No comment yet

B-Tree Index

As MySQL's default index type, a b-tree index is reasonably fast to insert new items into, and is fast at looking up individual items or even ranges. The index is stored as a branching tree, so looking up any one item is not an expensive operation. Each level of branching allows an order of magnitude more items. It's not always the best solution, but in some cases it may be the only one. B-tree indexes are supported on all standard storage engines.

Hash Index

Hash indexes are similar to associative arrays (or hashes) in many programming languages. They are very fast for looking up individual items, but cannot look up ranges. They are stored as a series of "buckets", and a hash algorithm is used for determining which bucket to put an item into. Good hash algorithms are very inexpensive, and spread input values evenly among the buckets. It is because they are spread out that range operations become impossible. Hash indexes are currently only supported for MEMORY and NDB (MySQL Cluster) storage engines.

Fulltext Index

Fulltext indexes are good for searching out individual words inside a long block of text. "Good" is a relative term; MySQL's fulltext indexes perform better than no index, but for decent performance on larger tables, you'll probably want to use a dedicated search engine such as Sphinx, Solr, Lucene, or other similar applications. I personally use Sphinx, because you can connect to it from MySQL using the SphinxSE storage engine. Fulltext indexes are only available in the MyISAM storage engine, which is another reason to look into other solutions.

Spatial Index

If you're working with spatial data, such as latitude and longitude, you might be able to get a speed boost with spatial indexes. Be careful with the spatial functions though; if you're not using MySQL 5.6 or MariaDB 5.5 (or higher), the functions that test relationships between polygons operate in "minimum bounding rectangle" mode. Spatial indexes are only available in the MyISAM storage engine, although spatial data can be stored in InnoDB. In order to protect your data, I would recommend storing your data in InnoDB, and creating a cache table in MyISAM for doing spatial lookups on.

Roll Your Own Index

Sometimes, it makes sense to create your own indexing system. I'm not talking about hacking one into the MySQL source code, but rather finding other ways to index your data. In a thread on the MySQL mailing list from 2001, I discussed the benefits of using a separately calculated hash column to index a column containing URLs. The crux of it is that you don't want the length of your indexes to be too long, because you want MySQL to be able to fit all of your indexes in memory. It's also more expensive to write longer indexes to disk.

At another company, we found that as fast as spatial indexes are, integer b-tree indexes are faster. We rounded our latitude and longitude and put them together to create a grid, and then we were able to just request the items in a list of grid squares. We also included the more precise fields in our where clause, in order to avoid getting extra points back from our query.

So if the default index types aren't working for you, try thinking outside the box. There may be a solution that nobody else has thought of before, that will increase your index speed substantially.

Wednesday, November 7, 2012

Redundant Indexes

+ No comment yet

One of the more egregious forms of "too many indexes" is redundant indexes. For example, let us consider the following table:

MariaDB [test]> desc presentations;
+---------------+-----------------------+------+-----+---------+----------------+
| Field         | Type                  | Null | Key | Default | Extra          |
+---------------+-----------------------+------+-----+---------+----------------+
| id            | int(10) unsigned      | NO   | PRI | NULL    | auto_increment |
| ts            | datetime              | NO   | MUL | NULL    |                |
| room          | mediumint(8) unsigned | NO   |     | NULL    |                |
| ss_presid     | int(10) unsigned      | NO   |     | NULL    |                |
| first_name    | varchar(255)          | NO   |     | NULL    |                |
| last_name     | varchar(255)          | NO   |     | NULL    |                |
| title         | varchar(255)          | NO   |     | NULL    |                |
| len_min       | mediumint(8) unsigned | NO   |     | NULL    |                |
| votes         | int(10) unsigned      | NO   |     | NULL    |                |
| timesid       | int(10) unsigned      | NO   |     | NULL    |                |
| num_conflicts | int(10) unsigned      | NO   |     | NULL    |                |
+---------------+-----------------------+------+-----+---------+----------------+
11 rows in set (0.00 sec)

Notice that the "Key" column doesn't really give us a ton of information. We know that "ts" has multiple indexes, but nothing more than that. As an aside, MySQL uses the terms "key" and "index" interchangeably. To get better information about our keys or indexes, use the "SHOW KEYS FROM" query:

MariaDB [test]> SHOW KEYS FROM presentations;
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| presentations |          0 | PRIMARY  |            1 | id          | A         |          84 |     NULL | NULL   |      | BTREE      |         |               |
| presentations |          0 | ts_room  |            1 | ts          | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
| presentations |          0 | ts_room  |            2 | room        | A         |          84 |     NULL | NULL   |      | BTREE      |         |               |
| presentations |          1 | ts       |            1 | ts          | A         |        NULL |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
4 rows in set (0.00 sec)

You'll notice that this is much more informative than the DESC command, at least as far as indexes are concerned. The format takes some getting used to, though. Use the "Key_name" column to determine which key we're talking about, as indexes can cross multiple columns.

Knowing that, we can tell that there are three keys: "PRIMARY", "ts_room", and "ts". The "ts_room" index has two columns. The "Seq_in_index" column tells us which comes first, although the "SHOW KEYS FROM" command will always return them in the proper order.

We have the primary key ("PRIMARY") which covers just the "id" column. Primary keys are always unique indexes, which means that no two rows are allowed to have the same value for that key. The "ts_room" index spans the "ts" and "room" columns, and is also unique. Last, the "ts" index just covers the "ts" column, and is not unique.

So where is the redundant index? First, we must understand how indexes work. The left-most prefix of any index can be used by itself. If we have an index on a varchar column, a SELECT query asking for rows with a value LIKE 'prefix%' will use the index on that column; a similar query looking for rows with a value LIKE '%suffix' will not use the index, since the left-most part is unspecified.

In multi-column indexes, this property also applies. The columns in the index are processed in order, so a query on our presentations table with just the "room" specified would not be able to use the "ts_room" index. However, a query with just the "ts" specified COULD use the "ts_room" index, since the "ts" is the leftmost part of it.

So now we've identified the redundant index: "ts" will never be needed, since its duties can be handled by "ts_room". So, how do we get rid of it?

MariaDB [test]> ALTER TABLE presentations DROP INDEX ts;
Query OK, 84 rows affected (0.01 sec)              
Records: 84  Duplicates: 0  Warnings: 0

Wednesday, October 24, 2012

When too many indexes can be a bad thing

+ No comment yet

Many web developers I've worked with believe that the solution to every database slowness problem is to add an index. This is due to an incomplete understanding of how databases use indexes. The explanation is fairly simple, so this will be a short post.

Any time you insert a new row into a table, or delete a row, all indexes will need to be updated. If you update a row, only the affected indexes will need to be updated -- in some cases, that may still be all of them. Thus, while too few indexes can slow down your read queries, too many indexes can slow down your write queries. Depending on the workload for that table, this may not be an issue. In fact, there are cases where it makes sense to have the space required for a table's indexes exceed the size of the table data.

In many cases, though, I've seen indexes that are unlikely to ever be used. I've also seen the same index added multiple times, probably by different developers. It's helpful to know what commands can be used to determine what indexes already exist, and how useful they are.

SHOW INDEXES FROM times;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| times |          0 | PRIMARY  |            1 | id          | A         |         120 |     NULL | NULL   |      | BTREE      |         |               |
| times |          1 | ts       |            1 | ts          | A         |          12 |     NULL | NULL   |      | BTREE      |         |               |
| times |          1 | room     |            1 | room        | A         |          10 |     NULL | NULL   |      | BTREE      |         |               |
| times |          1 | size     |            1 | size        | A         |           8 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

The SHOW INDEXES FROM command can be used to see what indexes already exist on a table. This will help you determine when to add new keys. You should also use the EXPLAIN command to determine if your index is being used. You will need a sample SELECT query. You can use the MariaDB site's Explain Analyzer to better understand the output of the EXPLAIN command.

If you are using a graphical SQL query tool, you'll need to consult the documentation for your tool to determine how to view indexes and explain your queries.

Wednesday, September 12, 2012

Intentionally lagging a MySQL slave

+ No comment yet

pt-slave-delay is another great utility in the Percona Toolkit. Percona recently published a quick guide to using it, including how it can be helpful when somebody screws up.

http://www.mysqlperformanceblog.com/2012/09/11/how-to-lag-a-slave-behind-to-avoid-a-disaster

Monday, September 3, 2012

Visualization tools for pt-query-digest

+ No comment yet

There's a great post over at the MySQL Performance Blog about visualization tools for pt-query-digest. If you haven't heard of the Percona Toolkit, you should get to know more about it. It is the successor to Maatkit and Aspersa, and is a great collection of utilities for MySQL database administrators. pt-query-digest can parse the slow query log, the general query log, the binlog, TCP dump files of the MySQL protocol, and even PostgreSQL, memcached, and HTTP packet dumps. It will aggregate the queries it finds into similar queries where possible, and has several report formats. It's a great tool for finding out what queries are really going to your database, so you can find and remove bottlenecks.

There are lots of other great tools in the Percona Toolkit, which I'll discuss in later posts.

Friday, August 24, 2012

The Explain Analyzer

+ No comment yet

If you're having trouble reading the output of the EXPLAIN command in MySQL, just copy and paste it into the Explain Analyzer hosted over at the MariaDB site. You then click on the various fields displayed, and it will explain what they mean.

For those who don't know, MariaDB is a fork of MySQL. The project is headed by Monty Widenius, the original creator of MySQL.

Thursday, August 23, 2012

Not Enough Indexes!

+ No comment yet

Every web developer worth his salt knows that if your database query is running too slowly, you probably need to add an index. How do indexes work, and why is that advice often true?

Let's look at a typical user table:

+--------------+----------------------+------+-----+---------------------+----------------+
| Field        | Type                 | Null | Key | Default             | Extra          |
+--------------+----------------------+------+-----+---------------------+----------------+
| userid       | smallint(5) unsigned | NO   | PRI | NULL                | auto_increment |
| username     | varchar(20)          | NO   | UNI |                     |                |
| password     | varchar(32)          | NO   |     |                     |                |
| firstname    | varchar(15)          | NO   |     |                     |                |
| lastname     | varchar(15)          | NO   |     |                     |                |
| email        | varchar(255)         | NO   |     |                     |                |
| perms        | varchar(255)         | NO   |     |                     |                |
| last_login   | datetime             | NO   |     | 0000-00-00 00:00:00 |                |
+--------------+----------------------+------+-----+---------------------+----------------+

We have a userid field that is an auto-increment field, and a username that has a unique index on it. But what if we want to look up a user by email address? There is no index on the email field, and so the database server will need to do a "full table scan" in order to find the row(s) you are looking for. If we EXPLAIN our query, we can see that the key being used is NULL, and the rows being examined are many.

EXPLAIN SELECT * FROM users WHERE email='demo@example.com';
+------+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+-------------+
|    1 | SIMPLE      | users | ALL  | NULL          | NULL | NULL    | NULL | 79824 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+-------------+

Whenever you do an EXPLAIN, the crucial columns you want to look at our "key" and "rows". Ideally, we always want the "rows" column to be 1, or as close to 1 as possible.

We can add an index to the email field like so:

ALTER TABLE users ADD INDEX idx_email (email);

Now, when we run the same EXPLAIN query as above, we get this result:

EXPLAIN SELECT * FROM users WHERE email='demo@example.com';
+------+-------------+-------+------+---------------+-----------+---------+-------+------+-----------------------+
| id   | select_type | table | type | possible_keys | key       | key_len | ref   | rows | Extra                 |
+------+-------------+-------+------+---------------+-----------+---------+-------+------+-----------------------+
|    1 | SIMPLE      | users | ref  | idx_email     | idx_email | 257     | const |    1 | Using index condition |
+------+-------------+-------+------+---------------+-----------+---------+-------+------+-----------------------+

As you can see, the query is now using the idx_email key, and the value of rows is 1, as we had hoped.

Do indexes always help? Let's look at a similar example, but this time we are looking for users on the hotmail.com domain.

EXPLAIN SELECT * FROM users WHERE email LIKE '%@hotmail.com';
+------+-------------+-------+------+---------------+------+---------+------+-------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+-------+-------------+
|    1 | SIMPLE      | users | ALL  | NULL          | NULL | NULL    | NULL | 79824 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+-------+-------------+

Despite the existence of an index on the email column, the query optimizer has chosen to do a full table scan. Your typical B-tree index operates from left to right. If you don't specify the left part of the field, the database server cannot use your index.

Subscribe to: Posts ( Atom )

Steve Meyers has 15 experience as a MySQL DBA, web software architect, and system administrator, with an emphasis on scalability

How your database can sometimes use two (or more) indexes

How do JOINs work?

The evils of the "full table scan"

Other kinds of indexes

B-Tree Index

Hash Index

Fulltext Index

Spatial Index

Roll Your Own Index

Redundant Indexes

When too many indexes can be a bad thing

Intentionally lagging a MySQL slave

Visualization tools for pt-query-digest

The Explain Analyzer

Not Enough Indexes!