Performance tuning

Karen pointed out recently that an important part of our submissions database was being really really slow. I had noticed, but thought it was just a fluke; when Karen mentioned it, I decided it was worth looking into.

So I spent a while on Saturday morning looking for and re-learning about the "explain" command in SQL (the standard database language), which lets you see some of the details of how the database is carrying out your queries.

In SQL, when you try to match items from two different tables of data, the database may have to test all possible combinations of rows from those two tables. And it turns out that in one particular common query we were running, there were 36,000 items in one table and 9,000 items in another table, so every time we ran that query to generate a particular list, the database was (if I understand this right) looking at about 320 million combinations of items from the two tables.

So I added something called an "index" (which took only about a second to do once I understood how to do it) and now each time we run that query it only has to look at the 9,000 items from one of the tables. Which means, more or less, that the query takes 1/36000 of the amount of time it used to take. Which means that the system is usable again; it really wasn't before.

So yay for pointing out slownesses, and yay for performance tuning!

Sadly, I still don't really understand this stuff. I tried to do something similar to speed up the slow loading of the front page of Mary Anne's journal; unfortunately, the command to create that index took three minutes to run (instead of about one second), and having that index in place actually caused Mary Anne's journal page to load slower than it used to. So I removed that index from Mary Anne's page. I did add another index that I think will speed up the page load a little bit; if you've been seeing slow load times for Mary Anne's journal, take a look and see if it's faster now, and let me know either way.

And if you're an expert on improving MySQL database performance, I'd love to get some pointers from you. Any recommended pages/sites/tutorials?

4 Responses to “Performance tuning”

  1. David Moles

    I don’t know MySQL very well, specifically, but if you wanted to shoot me the table-creation SQL and the query you’re trying to speed up, I’d be happy to take a look at it.

  2. Stephen Sample

    Oh, yes, explain is your friend. And if you get confused about some of the ranges it gives you, check out explain analyze. Plain-vanilla explain doesn’t actually run the query, so it’s just guessing about the run times.

    As to the rest, a few caveats: I’m not an expert either, and my experience is mostly on PostgreSQL rather than MySQL. However, the basics are pretty much identical on both, and it doesn’t sound like you need much beyond the basics.

    In general, indexing speeds select statements, but it slows updates and inserts, since the index has to be updated as well as the record proper.

    Indexing also increases the amount of space the table takes up on disk. But as you noted elsewhere, disk is cheap. Increasing the size of a table can slow performance if it pushes the working set over the amount of available memory, but that usually isn’t an issue.

    In general, you should have indexes on any fields that are used in a join to match records from two different tables. But you should pay attention to the types of data you’re working with.

    Indexing (and indeed, comparisons in general) is more efficient on fields with known types and sizes. So searching/indexing integers or fixed-length text fields is more efficient than the same operation on variable-length text fields.

    That also means that if you are using queries that match variable-length text fields from two tables, it may be worth seeing if you can add a numeric ID that you can use to do the matching with.

    Oh, and matching on booleans is really fast on its own, so indexing them is usually a waste of time and disk space.

    Take advantage of what you know about your data when planning your indexes: if you know that a field will always have unique values (for example, because it’s an auto-generated numeric ID code), you can use a unique index rather than a standard index. This saves the database from having to search the entire index for matches–as soon as it inds one match, it can bail.

    Speaking as a non-expert, though, resist the temptation to over-plan your queries. Instead, stick with the simplest query that does what you want. You can in theory improve performance by forcing the database to use a specific query plan, but unless you really know both your data and the way the query planner works, it’s not worth it. You’re almost certainly better off using a more generic query that the database can optimize to suit itself.

    For example, at one point I was doing a query across five tables, and I was explicitly grouping the request into two subqueries, where one two-table query was essentially being used to set a constraint on another three-table query. That was conceptually a clearer way to ask the question, but it wasn’t efficient. The query took nearly an hour to run.

    When I rewrote it to be a stupid flat five-table query, it was less human-readable, but the results came back in about 10 seconds. The query planner knew a lot more about how to get results from the database than I did.

    If any actual database administrator types contradict any of this, listen to them: I only do this stuff for lack of anyone better qualified to fob it off on. 😉

  3. Stephen Sample

    Oops. That post was quite long enough without posting it twice.

    I got a 500 Internal Server Error message in response to the first post and thought the insertion didn’t take.

    Jed, can you delete the second copy? Or either one, really…

  4. David Moles

    Ditto all that.

    Also, keep in mind that the most elegant (“normalized”) database model is not always the best for a particular application. Sometimes it can be worth having a little duplication of data if you often need to query things in ways that would cross lots of tables — especially in something like a blog, where inserts and updates are a lot less common than selects, and a little extra hit on inserts may be worth it for the savings on selects.


Join the Conversation