My sleep was rudely interrupted yesterday morning at 9am (yes, 9am). It was Brian calling to tell me that the new database server we setup for calendars had a load of about 22 — ridiculous. I threw on some clothes, pulled Kevin out of kindergarten orientation, and headed into the office to see what was up. This is what we all found out:
We have a reminder query that uses a subquery compared against an indexed field on the table. This never used to cause problems, but something changed from MySQL 4 to 5 that caused the query to take .08 seconds. When we found out how long it took, Brian literally said, “See, it’s only .08 seconds. That shouldn’t cause problems.” But it did…
We have thousands of users hitting this query around the same time. Combine that with all of the other normal traffic, and the server just can’t catch up. We discovered what was wrong by using EXPLAIN, and found out it was scanning about 2,000 rows on average. We switched the query to use JOINs instead and got the number of rows scanned down to about 15.
I’ve since decided to start a campaign to run EXPLAIN on every query we use to make sure things are running as efficiently as possible. You should do the same.
The biggest lesson we learned is that just because something seems to be fast by itself, doesn’t mean it performs well. My next post will tell you about my second crisis of yesterday that reinforces this lesson, unfortunately.