Steve Meyers has 15 experience as a MySQL DBA, web software architect, and system administrator, with an emphasis on scalability

Using XtraBackup to backup a remote machine

+ 2 comments
I love Percona's XtraBackup utility.  It's basically a GPL answer to Oracle's proprietary (and expensive) MySQL Enterprise Backup.  Percona claims that it has even more features that Enterprise Backup.  I do not have access to Oracle's product, though, so I cannot evaluate that claim.

I have my backups set up in a particular way, for convenience and security.  I have a backup machine that has ssh access to the backup user on other machines.  It has terrabytes of spare disk space, and runs a script that uses rsync and other utilities to take backups of my other servers.

Since my backup server is where all backup scripts are run, I have previously used mysqldump to take a backup of one of my slave servers.  This worked well, but was not ideal.  About once a week, mysqldump dies for no apparent reason during the dump, so I end up without a good backup for that day.  It's also slow, and locks the entire database while running.

I've been wanting to switch to XtraBackup for a while.  I've used it when setting up new slaves, and not only is it faster, it does not lock up the entire database while it runs in order to get a consistent backup.  The only problem is that it doesn't fit into my current backup strategy; it is designed to run on the same host as mysqld.

Today, I figured out how to get it working in streaming mode, controlled by my backup server.  It uses little to no disk space on the MySQL server, and streams the backups over to my backup server.  It required a little sudo magic, but beyond that it fit right into my existing backup procedure.

In order to do this, Percona's XtraBackup will need to be installed on both servers.  I had the Percona repository added to my server already, so I just ran:

# yum install percona-xtrabackup

The sudo trick was necessary because I am using ssh to log into my MySQL server as the backup user, which cannot successfully run XtraBackup.  I created the following file as /usr/local/bin/backup-mysql.sh:

#!/bin/bash

/usr/bin/innobackupex --slave-info --stream=xbstream /data/tmp

Then I needed to make sudo allow the backup user to run this script as root without a password.  I added the following line to my /etc/sudoers:

backup ALL=(root) NOPASSWD: /usr/local/bin/backup-mysql.sh

Alternatively, I could have just had the sudoers file contain /usr/bin/innobackupex, and then called that directly.  By keeping it one step removed, I've made it so that only a specific innobackupex command can be run, which lowers the security risk involved if any bugs were found in innobackupex in the future.

On my backup server, I added this to my backup script:

/usr/bin/ssh \
    backup@mysqlserver \
    "/usr/bin/sudo /usr/local/bin/backup-mysql.sh" \
    | xbstream -x -C /backup/daily/0/mysql

/usr/bin/innobackupex \
    --host=mysqlserver \
    --user=root \
    --password=rootpassword \
    --apply-log \
    /backup/daily/0/mysql

The username and password is used to detect the version of MySQL being used, so it can use the correct XtraBackup binary.  I could also have just told it which binary to use, but this is more forward-compatible.

MyISAM's "table lock" problem, and how InnoDB solves it

+ 2 comments
Most serious users of MySQL have moved their tables to InnoDB years ago.  For those who haven't, let's discuss why InnoDB is a more scalable solution than MyISAM.

MyISAM was designed to be very fast for read queries.  It does not handle higher loads of writes very well.  It also suffers a more serious flaw: it isn't crash-safe.  In other words, you better have frequent backups.

MyISAM tables have a read queue and a write queue.  Queries are placed into one of those two queues to be processed.  The write queue has higher priority than the read queue, but the table can only process one write query at a time.  Multiple read queries can occur at once, so the read queue will often be empty.



If a single query is added to the write queue, the read queue will block additional queries from starting.  All existing read queries will continue to run until they're finished, then any queries in the write queue will be processed.  Then, and only then, the read queue will flush its queries.

This queue system works very well for short queries.  As long as no queries run for a significant amount of time, things will generally perform well.  A long-running write query will lock the table the entire time it's running.  Additionally, a long-running read query will do the same, if a write query (even a short one) is waiting for it.

InnoDB avoids this table-locking problem by using Multi-Version Concurrency Control, or MVCC.  This is the same method used by many other enterprise database solutions, such as Oracle, MS SQL Server, and PostgreSQL.

As its name implies, MVCC allows multiple versions of the same table to exist in parallel.  Once a SELECT query begins, it will continue to see the table as it existed at the time the query (or transaction) began.  Another query can update a row in that table without affecting the results returned to the first query.



No longer will a long-running query take down your server!  There are still some concerns, however.  Even with MVCC, we still have to worry about multiple write queries overwriting each other, and making sure things are done in the correct order.  InnoDB uses row-level locking to ensure that multiple transactions don't write to the same row of your table.  It also has deadlock detection, which aborts queries if the server thinks two transactions are waiting for each other.  For most workloads, you won't notice those very often, but you should be aware of them.

Also of note is the server's isolation level.  Which isolation level you use depends entirely on what risks you're willing to take with your data.  They're described in the MySQL manual.

Slides from OpenWest 2013 presentations

+ No comment yet
Thanks to all those who attended my presentations at the 2013 OpenWest Conference!  Please take the time to leave feedback on Joind.in.

PHP Security: It doesn't have to be an oxymoron

Download the slides
Leave feedback


Database Optimization for Web Developers

Download the slides
Leave feedback

Query in a loop?

+ 2 comments
I ran across this gem recently on StackOverflow:

$queryxyzzy12=("SELECT * FROM visitorcookiesbrowsing ORDER by id ASC");
$resultxyzzy23=mysql_query($queryxyzzy12) or die(mysql_error());

while($ckxf = mysql_fetch_assoc($resultxyzzy23)){
    $querycrtx=("SELECT * FROM cart WHERE userkey='$ckxf[usercookie]' ORDER by datebegan  DESC");
    $resultcrtx=mysql_query($querycrtx) or die(mysql_error());
    ------ snip ------
}

Besides the interesting variable names used, it's also doing a query inside a loop, which is very inefficient. This would be better written as a single JOIN query:

SELECT 
  v.*, 
  c.* 
FROM 
  visitorcookiesbrowsing v 
  LEFT JOIN cart c ON c.userkey=v.usercookie 
ORDER BY 
  v.id ASC, 
  c.datebegan DESC

The details of JOIN vs. LEFT JOIN depend on the actual intent of the code, which wasn't apparent from the snippet.

If at all possible, never do a query inside a loop.  The overhead of doing multiple queries is generally far greater than any possible benefit.  If you're using an ORM framework, you need to be aware of how it is loading data from the database.  If you are instantiating objects in a loop, you may really be executing a query (or sometimes more!) for each object you instantiate.

This advice applies to update queries as well.  It's generally best to create an array of the values that need updating, and then update them all at once.  With MySQL, this is fairly easy for INSERT and DELETE queries.

UPDATE queries can be a bit trickier, since you can't easily specify different updates for each value.  I've solved this a few different ways, depending on circumstances.  If you know that the rows you are updating exist, then you can cheat and do an INSERT ... ON DUPLICATE KEY UPDATE.  If you aren't 100% sure the rows exist, then it may be best to create a temporary table, insert all of the update values into it (with a single query), and then do a multi-table UPDATE query.

If you're running MariaDB or Percona Server, you could also try performing your update over the HandlerSocket interface.  This would eliminate much of the overhead of running multiple queries, even though you are, in effect, running one query per update.

As always, the performance of these different ways of doing things depends greatly on your data, and what you're doing with it.  Always run tests to see how these different methods work with your data, server, and application.

Speaking at the OpenWest Conference

+ No comment yet
I'm presenting two talks at the OpenWest Conference next month.  The first talk, on May 2nd, will be about PHP security.  The second talk, given on May 3rd, will be about database optimization, geared towards web developers.  I'll discuss some of the same things that I have discussed on this blog.

We're going to have some great speakers at that conference.  On the PHP front, we'll have Rasmus Lerdorf giving a keynote as well as another presentation.  From the MySQL community, Mark Callaghan from Facebook will be giving another keynote, and we'll also have Colin Charles from MariaDB.

It should be an exciting conference!

Percona Server to ship with jemalloc

+ No comment yet
Joseph Scott pointed me to a little tidbit hidden in the latest Percona Server release notes: "Percona Server for MySQL will now be shipped with the libjemalloc library."  Percona published the results of some testing of various malloc libraries on their MySQL High Performance Blog last year, and it looks like this will have a very positive impact on performance.

I'm currently using MariaDB, so I'm hoping they pick up this change as well.

How to identify an unnecessary index

+ No comment yet
Let's look at the index from the wp_posts table in a standard WordPress installation.

SHOW KEYS FROM wp_posts;
+----------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| wp_posts |          0 | PRIMARY          |            1 | ID           | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_name        |            1 | post_name    | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            1 | post_type    | A         |           6 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            2 | post_status  | A         |           9 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            3 | post_date    | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | type_status_date |            4 | ID           | A         |        1772 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_parent      |            1 | post_parent  | A         |         147 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | post_author      |            1 | post_author  | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| wp_posts |          1 | yarpp_title      |            1 | post_title   | NULL      |         590 |     NULL | NULL   |      | FULLTEXT   |         |               |
| wp_posts |          1 | yarpp_content    |            1 | post_content | NULL      |        1772 |     NULL | NULL   |      | FULLTEXT   |         |               |
+----------+------------+------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
10 rows in set (0.00 sec)

How can we tell which indexes are good, and which are bad?

To begin with, let's look at the cardinality.  Cardinality is a measure of how many unique values there are for that column.  A column with very low cardinality is not very useful as an index, since it does not narrow down queries very much.

The post_type and post_status columns have a fairly low cardinality, so we'd ordinarily be concerned about them, but they're part of a multi-column index.  Since there are four columns to the index, their own low cardinality will probably be masked by the other columns.  As always, you know your own data better than I do, so you should know whether that applies in your circumstances.  What we're looking at is the cardinality of the combination of columns, so if the values of the columns are related to each other, you could still end up with low cardinality.

On the other hand, post_author has very low cardinality, and is a single-column index.  This column is potentially useful for blogs with many authors.  For most blogs, it ends up being more of a hindrance than a help, as it negatively affects write performance.

Going back to the type_status_date index, we noticed that the first two columns have low cardinality, but you'll notice that the third column's cardinality is equal to the primary key's cardinality.  This means that once we use the third column of the key (having necessarily already used the first two columns), we have narrowed it down to a single row.  This means that the fourth column in the index probably does not add any useful information.

We'll demonstrate this with a few queries.


EXPLAIN SELECT * FROM wp_posts WHERE post_type IN ('post', 'page') AND post_status='publish' AND post_date > '2013-01-01 00:00:00' ORDER BY ID;
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
| id   | select_type | table    | type  | possible_keys    | key              | key_len | ref  | rows | Extra                                 |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
|    1 | SIMPLE      | wp_posts | range | type_status_date | type_status_date | 132     | NULL |   34 | Using index condition; Using filesort |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
1 row in set (0.01 sec)



EXPLAIN SELECT * FROM wp_posts WHERE post_type='post' AND post_status='publish' AND post_date > '2013-01-01 00:00:00' ORDER BY ID ;
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
| id   | select_type | table    | type  | possible_keys    | key              | key_len | ref  | rows | Extra                                 |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
|    1 | SIMPLE      | wp_posts | range | type_status_date | type_status_date | 132     | NULL |   33 | Using index condition; Using filesort |
+------+-------------+----------+-------+------------------+------------------+---------+------+------+---------------------------------------+
1 row in set (0.00 sec)

We're specifying the post_type, post_status, and post_date in both queries.  You'll notice that the "rows" doesn't change much when we just select posts instead of posts and pages.  Even though the cardinality of post_type is a whopping 6, even that is a bit misleading, as almost all of the rows have type "post", making the real-world cardinality even lower than that.

If we specify an exact date and time, rather than a range, then we narrow it down to a single row.

EXPLAIN SELECT * FROM wp_posts WHERE post_type='post' AND post_status='publish' AND post_date = '2013-02-05 10:53:57' ORDER BY ID ;
+------+-------------+----------+------+------------------+------------------+---------+-------------------+------+------------------------------------+
| id   | select_type | table    | type | possible_keys    | key              | key_len | ref               | rows | Extra                              |
+------+-------------+----------+------+------------------+------------------+---------+-------------------+------+------------------------------------+
|    1 | SIMPLE      | wp_posts | ref  | type_status_date | type_status_date | 132     | const,const,const |    1 | Using index condition; Using where |
+------+-------------+----------+------+------------------+------------------+---------+-------------------+------+------------------------------------+
1 row in set (0.00 sec)

How often are we going to query the database for an exact date and time, though?  Generally, we'll either be specifying a date/time range, or ordering by the date/time.  In either case, we cannot use the portion of the index after post_date.

So does that mean the ID portion of the key is useless?  In all likelihood, yes.  I can't imagine any scenario where it would be useful.

As an aside, when using the InnoDB storage engine, secondary indexes merely point back to the primary key, so the ID would already be referenced in the index.

Interesting MySQL optimizer case

+ No comment yet
Jaime at the MySQL Performance Blog had an interesting post regarding some unexpected behavior from the MySQL optimizer.  Although this particular case probably doesn't affect most people, it does give some insight into how the optimizer works, and how subtle changes to a query can change performance.

Asynchronous MySQL queries in PHP

+ No comment yet
I was made aware today by a post on the MySQL Performance Blog that the mysqlnd driver for PHP has support for asynchronous PHP queries.  At the Midwest PHP Conference, I gave a talk on database optimization for web developers, and in a follow-up discussion with an attendee, I apparently misinformed him.

Midwest PHP Presentation Notes

+ No comment yet
Thanks for attending my presentation at the Midwest PHP conference!

Slides as PDF
Leave feedback on joind.in

Useful links:

If all else fails, cache it

+ No comment yet
Sometimes, despite all your efforts to optimize your queries, you have one that just takes a while.  It requires a full table scan, and there's just no getting around it.  In this case, your best option may be to look for ways to use a cache to speed up your query.  There are two kinds of caches that are most useful for these situations: a result cache, and a data cache.

A result cache is fairly simple to implement, and is most often used for reporting.  When the exact same query is run over and over again, then it might make sense to store the result in a cache table.  This is especially true when up-to-the-second accuracy isn't important.  Every company I've worked for has had marketing or other administrative reports that were important to the business, but took a long time to run.  The people viewing the reports only cared about results up to yesterday, so the report really only needed to be run once a day.  Running a 5-minute query once a day isn't really a big deal, especially if you can schedule it to run when your site isn't very busy.

A data cache is a little more complicated, and depends entirely on the nature of your data and the queries you run.  At each company I've worked for, we've had fairly large tables that were queried often in ways that defied indexing.  The specifics were very different in each case, but our solutions held to a common pattern.

First, we were able to significantly limit the data set based on a criteria generally involving what constituted an "active" record.  Second, we were able to de-normalize the data so that JOINs weren't necessary.  Third, we were sometimes able to shard the cached data, since the lookups would only happen across some subset of the data.  Last, we were often able to add strategic indexes that were not possible or feasible on the full data set.  This included geospatial indexes that are only possible with MyISAM tables, while our main data set used InnoDB.

Once you determine your data caching schema, the next big question is how often to cache it?  This is something that you'll need to determine based on (a) how long it takes to rebuild the cache, and (b) how fresh your data needs to be.  The first is a technical constraint, while the latter is a business constraint.  If you keep track of updates, deletes, and inserts to your main data set, then you may be able to just update your cache instead of completely rebuilding it.  I'd still recommend completely rebuilding it occasionally, since it is possible that some edge case was missed when determining what has been updated.

One trick that may help with maintaining your cached data is the RENAME TABLES command.  You can atomically rename multiple tables with a command such as "RENAME TABLES table to table_new, table_old to table".  This allows you to build (or update) a new cache table without impacting the current cache table.