Steve Meyers: December 2012

Ernie Souhrada has a great writeup called The Optimization That (Often) Isn’t: Index Merge Intersection. It talks about how MySQL can perform an "index merge" in order to use more than one index.

For example, if you have an OR in your WHERE clause, and each of the conditions are indexable, MySQL can perform each index lookup separately, and then do a union on the results. Similarly, if you have an AND in your WHERE clause, and again each condition is indexable, MySQL can do an intersection of the results of the two indexes. MySQL will decide whether to use this optimization based on the statistics it keeps of your table's indexes.

The problem that Ernie identified happened because the customer had an index on a column with very little variance. Almost all of the rows had the same value, and the optimizer made an incorrect assumption when evaluating indexes.

For this reason, you should be very cautious about indexing columns with very little variance. It can make sense sometimes; perhaps almost all of the rows have a 1 in that column, but you need to find the few that have a 0. This is common when you have to process new rows, and then mark them as "processed". In this case, I will often use a separate queue table; this negates the possibility of the optimizer choosing to use a very unbalanced index on the main table.

For beginning web developers, JOINs are a scary thing. Simple queries are easy enough, but once you add a second table to the mix, it becomes a lot more complex. In reality, JOINs aren't really all that hard to understand, but you do need to pay attention to what you're doing.

By default, JOINs in SQL are Cartesian; for each row in table A, you join it with every single row in table B, so the number of output rows equals the number of rows in A multiplied by the number of rows in B. This is generally not the desired outcome. In fact, the Drizzle project has broken SQL compatibility to make it so that Cartesian JOINs need to be specified explicitly. If no JOIN conditions are given, the query simply generates an error.

I remember a web developer showing me a query that he thought would work like a UNION. He had 20 "cache" tables that each contained several tens of thousands of rows of information. He wanted to get all of that information with a single query, rather than querying each table individually. This is how he wrote his query:

SELECT * FROM cache1, cache2, cache3, cache4, cache5, cache6, cache7, 
> cache8, cache9, cache10, cache11, cache12, cache13, cache14, cache15,
> cache16, cache17, cache18, cache19, cache20;

As you can imagine, the query never returned, and I think the database server eventually ran out of swap space.

Most web developers I've worked with tend to craft their JOIN queries with the condition in the WHERE clause. While this does (usually) work, it is not as easy for the next poor schlep reading your code to know how the JOIN works. It also is easier to mistakes when you do this, as you may forget to specify the condition for one of your JOINs.

JOIN conditions are given using the keywords ON or USING. The USING keyword lets you specify one or more columns that are identical across multiple tables, like so:

SELECT field1, field2 FROM table1 JOIN table2 USING (field3, field4);

The field(s) in the USING condition must have the same name in both tables. If this doesn't work with your schema, or if you have more complicated JOIN conditions, you should use the ON keyword:

SELECT field1, field2 FROM table1 JOIN table2 ON table1.field3=table2.field4;

When we see an ON or USING condition in a query, that tells us immediately that the JOIN is using that condition to filter the possible rows in the second table.

Ideally, as discussed in The evils of the "full table scan" previously, we should only find a single row in the second table for each row in the first table. Additionally, the column in the second table that we are using for filtering the JOIN will have an index on it. This will allow the query to use that index to find the matching rows, greatly speeding up the execution of our query.

If you can, it is also beneficial to eliminate as many rows as possible from the first table, so that fewer lookups happen on the second table. This is appropriately done in the WHERE clause.

By putting our JOIN conditions in the appropriate place, it is now easier for us to glance at our query and be able to tell how the tables connect together:

SELECT
 MONTH(STR_TO_DATE(t2.value, '%H:%i:%S %b %d, %Y ')) AS payment_month,
 COUNT(t1.transid) AS num,
 SUM(t3.value) AS total,
 SUM(t4.value) as shipping,
 SUM(t5.value) AS fee,
 SUM(t6.value) AS tax
FROM
 new_transactions t1
 JOIN new_transactions t2 ON t2.transid=t1.transid AND t2.field = 'payment_date'
 JOIN new_transactions t3 ON t3.transid=t1.transid AND t3.field = 'mc_gross'
 JOIN new_transactions t4 ON t4.transid=t1.transid AND t4.field = 'mc_shipping'
 JOIN new_transactions t5 ON t5.transid=t1.transid AND t5.field = 'mc_fee'
 JOIN new_transactions t6 ON t6.transid=t1.transid AND t6.field = 'tax'
WHERE
 t1.field = 'txn_type'
 AND t1.value = 'cart'

GROUP BY
 payment_month
WITH ROLLUP

A complex JOIN query, but we can easily glance at it and tell that all of the JOINs have appropriate conditions on them.

Steve Meyers has 15 experience as a MySQL DBA, web software architect, and system administrator, with an emphasis on scalability

How your database can sometimes use two (or more) indexes

How do JOINs work?