Our customers or prospective customers often ask us how we do a performance audit (it's our most popular service). I thought I should write a blog post that will both answer their question, so I can just reply "read all about it at this URL" and share our methodology with readers a little bit. This fits well with our philosophy of openness. It also shocks people sometimes -- "you're giving away the golden goose!" Not really. What you hire us for is our experience, not a recipe that anyone can follow.
A full performance audit is far more extensive than this article can cover, and might wander into Apache, networking config, caching layers, etc. Wherever the problem is, we'll track it down. I won't talk about that. That's not because I want to keep secrets from you. To the contrary, I'd love to share it all with you. But that's a huge job; it will take many pages, and I'm not going to write that much.
The kickoff call
There's actually a step before the performance audit begins. We call this a kickoff call. We get on the phone with the key technical staff on the client's side and discuss the application in general. If possible, we take a brief look at the server beforehand, so we can ask more intelligent questions and skip obvious things. This call is sometimes up to an hour long, if we're discussing a lot of things like how to build for massive scalability, how to do read-write splitting without breaking the user experience, how to take online non-blocking backups, or things like that.
But in general it can be a lot shorter than clients expect, because
Many people assume their application is difficult or somehow different, and that we need all kinds of schema diagrams and code listings, but the truth is people often have the same insights into particular problems, and therefore they try similar solutions. And besides, most applications have a lot of the same components. Tagging, friends, queues, click tracking, search, paginated displays -- we've seen these and dozens of other common patterns done a hundred different ways.
The kickoff call has some other important goals too. We want to know what your concerns are at this stage. Is it the current performance, future performance, scalability, ability to recover from disasters? We also want to know what the operating parameters are. Can we suggest upgrades, rewriting queries, or whatnot? Sometimes there are rigid constraints on what types of solutions we can propose, and we need to know ahead of time so we don't spend time on things you have no way to change. An example of this is a third-party product whose code you are unwilling or unable to change.
We also want to know the operating mode you expect us to work in. Maybe you want us to analyze and present recommendations for your consideration, and take a look-don't-touch approach. Or, maybe you just want us to fix things and tell you later what we did. Either is a completely valid approach. You tell us what you want, and that's what we'll do (of course, we will also tell you what we think is best for you, which is our job).
After the kickoff call, we write some notes, and then get down to the audit itself. We usually have two people on the kickoff call so that we don't have a single person knowing everything about you (only one is billable, though). The second person will be from the same team within Percona. However, the audit itself is generally a single person, with another person reading all the notes and generally keeping in touch with what the lead consultant does. We want two sets of eyes on things whenever possible. This also has value in case you decide to continue on to a longer relationship with Percona, which is reasonably common; in that case you will establish a solid relationship with the team, and group involvement early on is a good start to this.
What to look for during the audit
During the audit I look for outliers and anomalies in every aspect of the server's configuration and performance and contents. I'm looking for characteristics that are much bigger or smaller than normal, or things I don't usually see. Anything out of the ordinary or out of proportion. These things are often relatively easy to catch if you have enough experience. It's a little hard to describe without making it sound like it's some secret handshake, but it's really not magical. You're either familiar with what a server normally looks like, or you're not. It's just like a mechanic who listens to the car and says "I hear the valves knocking."
In other cases there may be things that are NOT easy to catch, and may require a lot of experience and investigation. The work we do is often not simple at all. It's impossible to predict. A customer may call and ask "how long to debug server lockups?" It could be ten minutes, or it could be a really major effort chasing down something like a bug in a hardware component, and you never know ahead of time.
In either case, the point to know here is that we can work with systems that are at any level of tuning, from completely untuned to a system that's already had a lot of expert attention. I use "tuning" in a generic way here -- we focus on far more than my.cnf files. In fact my.cnf files are usually one of the smallest levers we can pull for server optimization. Schema and query optimization can give much greater improvements, for example.
The audit itself
OK, so here's a very basic audit, whose scope I'll keep within just a MySQL server. The first step is to log into the machine via SSH and open up a text editor (gedit, notepad, what have you.) I want to keep this open and paste everything I see into it. If there's no record later that someone else can follow, I am not doing my job.
I would also note that the commands I'll show here are the lowest common denominator. We often use tools to gather a lot of the data for us and speed the process, when possible, but the core commands I'll show are what we can always fall back to if that's not possible.
Gathering information about the server
I start off with these commands:
The first command is invaluable for later reference. The others will show exactly what kind of server I'm looking at, assuming I'm on a GNU/Linux box (I have to adapt to whatever environment I'm in -- obviously a FreeBSD box is different). For example,
PLAIN TEXT CODE:From this, I can see a couple of important things. One thing is that I'm looking at a 64-bit OS, which is important to know. I also know I'm on RHEL and I know the kernel -- knowing this may tell us a lot of useful things about the system.
Next I continue looking around the server. Holding with the assumption of GNU/Linux, I keep working to get a picture of the OS and hardware.
PLAIN TEXT CODE:What I see here is that this is an 8-core machine, two quad-core processors at 2GHz. I also know some things about the CPU, such as the model number and the cache size, etc. What else is important is the 'lm' flag, which tells me this is a 64-bit CPU. From this and what I saw earlier I can see whether someone installed a 32-bit OS on 64-bit hardware, which is important if you want to allocate a lot of memory to mysqld.
Next I look at a couple of all-in-one commands. For example, 'top -n 1' and 'uptime'. That shows me what processes are running, but it also shows me load average, memory sizes, and a bunch of other stuff. In this case I'm on a machine with 32GB of memory and there is very little load. I see that mysqld is using 10g of memory with 17g virtual size, and there's practically nothing else running (this is a dedicated master server). The machine has a lot of swap, but basically none is used. This isn't all that relevant, as you'll see later, but in some cases it might be.
I continue my investigation into the machine itself with these commands:
These commands show me what filesystems are mounted, how much space they have on them, and what kind of I/O system is under the box in general. dmesg is useful for several reasons. For one thing, it has a lot of information on the box's hardware, including RAID controllers, installed memory, and so on in the boot output. In addition, at the tail it has information on core dumps and so on. It's possible that it has filled up with stuff and the boot output has been lost, but in many cases there is a lot of information here that's hard to get elsewhere.
If I see an LSI MegaRAID card, which is pretty common, I can check whether MegaCli is installed, and use it to get information about the physical drives installed; then I can find out exactly what they are, both in terms of size and spindle speed. I need to know how many physical spindles are under the box, and how fast they are. I can also get information about the all-important battery-backed write cache, how it's configured and what its status is. I can see the drive status, too. Sometimes a failed drive is present but no one has noticed it yet! (By the way, we can also help you install and configure monitoring and graphing systems, if you don't have any. These will help catch such problems.) If MegaCLI isn't installed there may be other ways to get this information, too.
The last three commands are for giving me information about LVM devices, volume groups, and logical volumes. At this point, unless I've seen something that I want to investigate further, I have a pretty good idea what hardware and operating system I'm working with.
System performance
Next I investigate what the system is doing, in terms of performance. There are two key commands for this on most operating systems (there are others, such as mpstat or dstat, that I may also elect to use in some cases):
If you don't know how to read these, there's a full explanation in our book, High Performance MySQL 2nd Edition (it's not in the first edition). From this I get an idea where any possible problems might be: is the disk saturated? Is the machine swapping actively? (This is more important than whether it's using swap.) There are a lot of combinations of possible things you can see here, so I will not try to explain it all. Basically you need to know everything there is to know about these two commands and their output.
Based on what I've seen so far, I may look at other things, such as ifconfig (which shows good stuff like dropped packets).
Starting to look at MySQL
Assuming I don't see anything noteworthy here, I'll move on to the MySQL server. I mean, let's just assume the physical hardware and the machine setup is fine, and the client has asked me to figure out why the website is slow; assume also I've checked out Apache and network config, etc, and I don't see anything wrong there. So at this point, it looks like MySQL is probably a good place to look for performance problems. Everything I've done up till now is just due diligence; I've gotten my bearings on the server as a whole, and now I'm ready to see what's up with the database.
The first step is to find out what MySQL instances are on the machine. It doesn't do to just assume there's a single instance in /var/lib/mysql and it's reading from /etc/my.cnf. I've seen a lot of cases where there's a /var/lib/mysql and an /etc/my.cnf, and the server is installed in /customsoftware/mysql with a different my.cnf file. The way to find this out is to look at ps:
PLAIN TEXT CODE:In this case, yes it's all default. It's listening on the usual port, etc etc. Nothing unusual. Next I check what the MySQL version is, and I check (with the 'file' command) whether /usr/libexec/mysqld is a 64-bit build, if I'm on a 64-bit OS.
Examining MySQL configuration
Then I'll look at my.cnf quickly for anything weird:
PLAIN TEXT CODE:My goal here is to look at the my.cnf without seeing all the comments, which a) are often a small novel, b) often say what the sysadmin thinks s/he's doing, which might not be what is really being done. I want to look at it without polluting my brain with that. I'll just scan this for anything odd. In contrast to what you might think, spending a lot of time here isn't usually a good idea, because what's left out is often more important than what's included. For example, the absence of skip-name-resolve is easy to miss, as is the absence of innodb_log_file_size.
I won't give you a full rundown over every option in the file. But again, you learn quickly if something strange is here.
This whole time I've been copying the output of the commands and saving them in a text file. I'll attach this text file to the issue in our CRM system so I (or someone else) can look at it later, see what I saw, and follow my line of reasoning about any suggestions I make.
The next thing to do is grab 'mysqladmin variables' and put it in the text file. I don't spend a lot of time looking at this. For one thing, all the values are in big units, so it's hard to read. If the InnoDB buffer pool is set to 22GB, it's much easier to see 22GB than 23622320128, especially since that number is buried next to a lot of other long numbers. They make my eyes hurt, and there are better ways to do this (I will run mysqlreport to get friendly numbers I can read easily). But what I will do is scan the output for something strange that might be caused by a syntax error in the my.cnf file. For an example, take a look at this bug report (not a bug).
Looking at MySQL status
Now we're starting to get into the really interesting bits. Alas, these are the bits where there is the most variability, so I'll have to be even less detailed and a little vague. The next command is this:
PLAIN TEXT CODE:I let that run for at least two iterations. The first iteration is the current values since the server was booted; the second and subsequent are incremental differences. I'll usually capture two of these. It looks like this:
PLAIN TEXT CODE:Then I format the second set of values up beside the first set for ease of reference (an easy task with an automated too, or a quick Vim command if I'm doing this the manual way.) The result looks like this:
PLAIN TEXT CODE:This output is very useful when writing the report to the client. Now I go through it line by line looking for things that look wrong. When I find something, I compare it against the server's variables or whatever other data I need, and write something instructional to the client about it.
Preparing to report findings
I should mention here a bit about how I write the report to the client. I copy and paste the interesting snippets from what I've been gathering, save it in a text file, and explain what I'm seeing to the client. Anything I see that looks wrong should be explained in detail, because my goal is to teach the client, not just to awe them with my knowledge and get "I could never understand, O Almighty Percona, here's your money" in reply. If the client doesn't understand what I'm doing, I've done something wrong. Clients need to understand so that they know they're getting good service. A Percona consultant never says "just trust me" unless the client really insists "I don't want to know."
That's also why I'll comment on things that seem good. Clients need to know that I'm being thorough. Once upon a time, a Percona consultant was called back in to do some further work about 6 months after Peter helped a client. Now, if Peter worked with the client 6 months ago, you have big shoes to fill, and getting further gains will probably not be easy. So this consultant looked through things, spent about 2 hours, and came back to the client with "there's not a lot of performance left to gain here. You can do X, Y and Z, and that's about all I see. Beyond that, we need to talk about making your overall architecture more scalable." And the client said "that's all? Didn't you do any work? Peter's report 6 months ago was a small encyclopedia!" The mistake here was that the consultant didn't show all the things he'd checked, which was extensive indeed. After a quick explanation the customer understood that the consultant had really done a lot of work, but it's better to show the client the full extent of the investigations in the first place, and not just show "here's the 3 bad things I found in 2 hours." We also make sure that we refer back to the previous issue's findings, so that we don't duplicate efforts from the earlier optimization.
Back to the output of mysqladmin. When I send the report to the client, it typically looks something like this. First, I explain the meaning of the output, so it's clear what I'm showing them. Then, I go through it line by line, like this:
+-----------------------------------+----------------------+----------------------+ | Variable_name | Value | Value | +-----------------------------------+----------------------+----------------------+ | Aborted_clients | 205174 | 6 | Here we can see that something is disconnecting ungracefully (without closing the connection properly). This is happening 6 times every ten seconds. We need to figure out what is happening because it can indicate something else is wrong. This is not a big deal for performance, it's just something to figure out. ... snip ... | Com_admin_commands | 255868807 | 11893 | Why is something running an admin command constantly? This is generally a Ping command, which is useless. Something is pinging the server about 1200 times per second. Often this is from some ORM system, such as Class::DBI, which does $dbh->ping before each and every query. Why is Ping useless? Well, it is intended to tell you whether the server is alive. The trouble is, if it succeeds, it tells you nothing. An instant later, when the ORM system actually runs the query, the server could have died. The ORM should just be running the query, and if the query fails, then you know from the error code that the server is dead. So a successful Ping is meaningless. On the other hand, pinging can be a serious performance problem. Not really for the server, since it's a cheap query. It adds hardly any load. But for the application, it's another network round trip. It is a "garbage" query that just adds latency and makes the overall time to complete an action longer. If your application runs very short queries, this can be a non-trivial part of the app's response time.And so on, and so on. I'll often refer to other things; for example, when I'm examining the status values for Created_tmp_tables and Created_disk_tmp_tables, I'll paste in
PLAIN TEXT CODE:If these are not the same size, as often happens, I'll explain that the minimum of the two is used to determine when an in-memory temporary table will be converted to an on-disk table, which is an expensive operation.
In this way, I work through the status and the variables, and explain to the client what I can deduce about the server from them.
I would point out that I don't spend a lot of time on settings and status. It is not where the greatest gains are to be found. The real gold mine is yet to come. After you've done it a few times, you can go through the settings and status pretty quickly.
Analyzing MySQL's workload and data
After I'm done with that, I'll grab a few snapshots of SHOW FULL PROCESSLIST, and look for odd values in this. For example, if I see a lot of queries in odd statuses, like "statistics", I know something weird is going on, and I look deeper. I also look at the queries themselves. If the client has long-running queries, it can be pretty easy to catch. Maybe they're long-running because they're in Locked status, which might be an indication that it's time to convert MyISAM tables to InnoDB. (But then again, it might not be.) Or maybe the client is doing queries like "... WHERE client IN (SELECT id FROM clients)" which is a really bad query plan.
After this I'll look at SHOW INNODB STATUS\G and see if there's anything worth commenting on there. Depending on the workload, there might be a lot of substance here. There's also a very thorough section on this in our book.
Next I look at the data in the server and see what I think about it. If the server is not heavily loaded, I may even do some INFORMATION_SCHEMA queries to help me find the biggest tables and so forth. If the server is heavily loaded or has a lot of data, touching the INFORMATION_SCHEMA can take a barely-running site and bring it down, so I do this with great care! If it's not appropriate I may run 'du' on the data files, or simply skip this step. This information can help me determine what kinds of things I ought to look for later during query analysis, and also might show me things like huge tables that should be archived, tables that I should be aware of if I see them in queries, or what have you.
I will also look at the error log. The error log shows all sorts of interesting things. Basically, anything but silence in the error log is interesting and needs to be investigated. You might find that there are InnoDB tables whose .frm files exist but have been dropped from InnoDB, for example. (The INFORMATION_SCHEMA query will surely trigger errors about this.)
There are a bunch of other things I'll look at, including checking for default users in the mysql.* tables, running mk-duplicate-key-checker to find redundant indexes, and so on. As with settings and status, these set the stage for bigger gains to come later, so I don't spend all that much time here, although there is occasionally something really bad that can be fixed and bring big gains.
Schema, query, and index optimization
Analyzing and optimizing a server's table and index structures, and the queries that run against them, is by far the most important thing to concentrate on. All the other work is a prerequisite to understanding the context within which these things operate. For example, if the workload is a star schema data warehouse, it is important to know a lot about the I/O subsystem. And by now I know that I should concentrate in certain areas -- there may be hundreds of tables, but at this point I should have identified a handful of tables that really matter. That's why I save this part of the analysis for last.
You cannot consider any one of these factors (schema, queries, indexing) in isolation, because they are tightly bound together, and tweaking one will often have effects on the others. Each decision is a consideration of the costs and benefits on the system as a whole.
The queries the system runs, and hence the slow query log, is one of the more fruitful ways to analyze the system. What I usually do is ask the client to enable it a day before the audit if possible, and set it to one second. That way there's at least one "cycle" of queries in it. Then I run the following:
PLAIN TEXT CODE:By default, this outputs the top 10 most expensive queries, in total execution time. By the way, the slow query logs in the stock MySQL server are extremely limited, and for serious analysis it's basically mandatory to use a server that has the Percona patches for microsecond logging and additional information in the slow query log output. The stock MySQL server's limitation of one-second granularity makes it hide problem queries that are faster than one second (which in a high-performance system is virtually every query.) If it's not possible to use a patched binary, we can use MySQL Proxy, packet sniffing, or other techniques to get more information than the slow log gives us; but by far the richest and most efficient source of analysis information is our custom server, which we have designed exactly for the purpose of giving us as much insight as possible into the server's true workload. At the moment there is no other technique that approaches the amount of information a Percona build can give you.
Having found the desired information about the queries by any means necessary, I find out which are the most likely to give the greatest gains. I use a dual approach for this: I look for both the queries which cause the greatest load on the server (in aggregate) and which cause the greatest latency for the application. Again, it's a process of finding outliers by some criteria. Both types of queries matter. Some may be extremely fast, but run very often; others may be run seldom but take a long time to finish. And it's important to focus on ones that can give the greatest improvement for the customer. I can often tell at a glance whether a query is going to be possible to improve a lot, and a simple mental calculation can then tell me how much total gain I can get from it. Queries are not just "bad" or "good" -- it's a question of where they fall on the scale.
Now I analyze these one at a time. For each query, I run EXPLAIN if possible (rewriting non-SELECT queries if needed, and using careful judgment for queries with subqueries in the FROM clause, which will actually execute the inner subquery!), and examine the query plan. This is where you need to really know how to write queries and how EXPLAIN works. At Percona, we have peer training constantly on our internal IRC channel and mailing list. We share all sorts of dirty tricks and neat ideas with each other. Nothing is off the table. Queries can be rewritten. Indexes and data types and table structures can be changed. Queries can be broken into pieces, combined, or even eliminated entirely (we may suggest caching, or tell the customer to evaluate Sphinx, or something like that.)
In the common case, though, a query simply needs a rewrite or a new index or something. In this case we show the query and the information mysqlsla outputs about it (execution times, etc), the EXPLAIN plan, and the desired modifications to the query or the table. We explain how to interpret what we're showing, and why the proposed modifications are better. If the client approves it, we may make a copy of the table, make the proposed modifications to it, and show the difference afterwards. Or we might have a test server to run on. It varies widely; some clients have a test server, some don't; some have a formal QA process; some don't. It is very customer-specific, and we work with what we have.
Sending the report to the client
After the slow query log analysis is done, I format all my analysis for sending, add in things I may have noticed along the way (comments on backups, for example) read over it once again and make sure I didn't miss anything important or write something in a confusing way, and then send it to the client. Then I usually call to discuss, or just confirm that it was received and wait for a reply. In most cases the client has a lot of work to do -- sometimes weeks of application changes. Afterwards there's usually more to be gained by doing another pass through.
We may also want to have another call with the customer and talk about changes that can or should be made at a much higher level. If we see that the application's overall architecture needs to be changed, that's something to discuss. Customers usually want us to validate their application's overall scalability and whether it'll grow to meet their demands for some period of time. We also frequently bring the discussion to topics such as monitoring, alerting, backups, caching, reverse proxies, and high availability solutions, all of which we can frequently help customers set up much more efficiently and with better results, due to our knowledge of what works in the real world and where there's quicksand to be avoided.
The changes Percona's performance audits typically suggest can be really significant, and can completely change the performance profile and workload of a server, so it's often worth another iteration, and sometimes even more after that. A really thorough audit can take up to ten hours! The initial round usually takes less than two, however. The ten-hour cases are usually for really complex applications, or apps that have a lot of stored procedures, which are more difficult to analyze because you have to pull out the statements and analyze them individually. The goal is always to develop X-Ray Vision and see through the symptoms to the underlying causes in every aspect of application performance.
Customer interaction
In all cases, the customer's guidance is our first rule. The customer determines the depth and complexity of the analysis we do. A lot of times, customers will tell us up front to limit the work to some number of hours, such as 3 hours. Knowing this ahead of time gives us a framework within which we have to deliver the maximum value to the customer. If we have less time, we fix the "sore thumbs" and make notes of places where more effort is likely to pay off, and mention those to the customer. Or we economize by giving the customer easy choices so they can decide which things to have us fix, which to fix themselves, and which to just ignore.
The above process may look like a static recipe that you can just mechanically execute, and in some cases that's truly what we do. However, we don't just go through the motions. The "recipe" above is only a framework within which we work. Knowing the rules of a performance audit, and why they are the way they are, means you can know when and how to break them for the client's benefit. It's like knowing openings in chess -- the great chess players study openings exhaustively, but they don't robotically play them in tournaments.
For example, although it may sound as though I go work for a while and then emerge with a report, it's not that way a lot of times. Flexibility and exercising judgment about when it's appropriate to interact with the customer are a hallmark of our approach. If we need to be interactive and keep in touch with the client early and often, we do that. A lot of times I'll start out with a quick look and have a second brief call with the customer, or send a quick email and ask for feedback on the direction I'm going; or just get on an instant messaging chat and tell the customer "here's the top thing you can start working on while I keep investigating." We make our process as interactive and flexible as the situation calls for. Early, frequent feedback lets us make small adjustments to the direction we're going, communicate to the client how much time we're spending, and shortcut effort if it's not useful, e.g. skipping analysis for a query the client's developers are in the process of removing from the application completely.
That's all! I hope this has been an informative and fun trip through the world of a typical Percona performance audit, although as you've seen there really isn't any such thing. Post your questions in the comments, and I'll try to answer them as time permits.
Entry posted by Baron Schwartz | 4 comments
INFORMATION_SCHEMA, in particular by favorite TABLES table is not only helpful to understand tables you have on the system, but I have also found it to be very helpful as a scripting language for variety of database administration tasks. It can be more straightforward compared to using shell or Perl when the operation is database specific.
For example if you would like to MySQLDump only Innodb table in one file per database you can do the following:
PLAIN TEXT SQL:As you can see we're just getting the set of commands to run. How to make it easily runable ? Well just use INTO OUTFILE to create very simple shell script:
PLAIN TEXT SQL:In other case I needed to restore Innodb tables from mysqldump because of corrupted Innodb tablespace - to do this I had to clean all .frm files which correspond to innodb tables (as well as ibdata1 and innodb log files). With shell you could so it by looking for .frm files which do not have corresponding .MYI files.... but this will also get all MEMORY tables which we want to leave in tact. Using similar approach we can do:
PLAIN TEXT SQL:Do not get me wrong, this is far from replacement for shell in all cases but it is rather handy for some niche tasks, in particular which involve unix commands driven by MySQL meta data.
Entry posted by peter | 5 comments
Peter wrote a post a while ago about choosing a good InnoDB log file size. Not to pick on Peter, but the post actually kind of talks about a lot of things and then doesn't tell you how to choose a good log file size! So I thought I'd clarify it a little.
The basic point is that your log file needs to be big enough to let InnoDB optimize its I/O, but not so big that recovery takes a long time. That much Peter covered really well. But how do you choose that size? I'll show you a rule of thumb that works pretty well.
In most cases, when people give you a formula for choosing a configuration setting, you should look at it with skepticism. But in this case you can calculate a reasonable value, believe it or not. Run these queries at your server's peak usage time:
PLAIN TEXT SQL:Notice the log sequence number. That's the total number of bytes written to the transaction log. So, now you can see how many MB have been written to the log in one minute. (The technique I showed here works on all versions of MySQL. In 5.0 and newer, you can just watch Innodb_os_log_written from SHOW GLOBAL STATUS, too.)
PLAIN TEXT SQL:As a rough rule of thumb, you can make the log big enough that it can hold at most an hour or so of logs. That's generally plenty of data for InnoDB to work with; an hour's worth is more than enough so that it can reorder the writes to use sequential I/O during the flushing and checkpointing process. At this rate, this server could use about 110 MB of logs, total. Round it up to 128 for good measure. Since there are two log files by default, divide that in half, and now you can set
PLAIN TEXT CODE:Does that look surprisingly small? It might. I commonly see log file sizes in the gigabyte ranges. But that's generally a mistake. The server I used for the measurements above is a big one doing a lot of work, not a toy. Log file sizes can't be left at the default 5MB for any real workload, but they often don't need to be as big as you might think, either.
If this rule-of-thumb calculation ends up showing you that your log file size ought to be many gigabytes, well, you have a more active write workload. Perhaps you're inserting a lot of big rows or something. In this case you might want to make the log smaller so you don't end up with GB of logs. But also realize this: the recovery time depends not only on the total log file size, but the number of entries in it. If you're writing huge entries to the log, fewer log entries will fit into a given log file size, which will generally make recovery faster than you might expect with a big log.
However, most of the time when I run this calculation, I end up finding that the log file size needs to be a lot smaller than it's configured to be. In part that's because InnoDB's log entries are very compact. The other reason is that the common advice to size the logs as a fraction of the buffer pool size is just wrong.
One final note: huge buffer pools or really unusual workloads may require bigger (or smaller!) log sizes. This is where formulas break down and judgment and experience are needed. But this "rule of thumb" is generally a good sane place to start.
Entry posted by Baron Schwartz | 12 comments
After some discussions on the OpenSQLCamp 2008 conference we decided to move our development to Launchpad, to be in stream with other MySQL related projects.
We published our patches there https://code.launchpad.net/percona-patches, it is supposed to be main repository for the patches.
We advise to use Launchpad bug system to report bugs and also for feature requests.
Entry posted by Vadim | No comment
We made new binaries for MySQL 5.0.67 build 7 which include patches we recently announced.
The -percona release includes:
PLAIN TEXT CODE:and -percona-highperf release additionaly includes
PLAIN TEXT CODE:You can download RPMs for RedHat / CentOS 4.x and 5.x for x86_64, binaries, sources and patches there
Entry posted by Vadim | 24 comments
Do you know that there are two limits about dirty (modified but not flushed to disk) blocks of InnoDB buffer pool? One is the limit of "amount". The other is the limit of "age".
-- limit of "amount" --
As you know, buffer pool of InnoDB works as write-back cache of its datafiles. If the buffer pool is filled by dirty blocks, InnoDB cannot allocate new blocks without flushing the dirty blocks and the performance would get worse. This is the limit of dirty block "amount". We can avoid this limit by setting 'innodb_max_dirty_pages_pct' smaller or setting the larger buffer pool size. We might be never at a loss about the limit.
The another limit we should understand is limit of dirty block "age".
-- limit of "age" --
As you know again, because InnoDB write the modifies of datafile to transaction log file synchronously, InnoDB is allowed to treat its buffer pool as write-back cache. The transaction log from the last checkpoint assures the latest-committed consistent data. So, the transaction log must contain the all transaction log from the last checkpoint.
Basically, InnoDB makes the checkpoint passively (it is called "fuzzy checkpoint"). InnoDB makes the point of time when the oldest dirty block occurred as the new checkpoint.
In short,
the dirty block older than the oldest transaction log is never allowed to exist.
This is the limit of dirty block "age".
The max age is determined by the total size of the transaction log files, because InnoDB uses the transaction log circulately. But, the larger transaction log file may not solve the problem of dirty block "age"...
The main essence of the "age" limit problem might be clustered distribution of "age". If there are huge number of similar aged dirty blocks and their age nears the max age, InnoDB flushes the dirty blocks with its best, but the oldest "age" of dirty blocks might not change and might reach the critical limit. Then InnoDB will pause until all of the old dirty blocks flushed...
I will show you the examples.
The workload is TPC-C like and the graphs show the change of throughput, write IO and checkpoint age with time.
<Normal>
The first is the case of normal (not patched) InnoDB.
The intense write IO occur and InnoDB stalls when the checkpoint age reaches the red broken line. After that, InnoDB stalls periodically.
The steady flushing may be short, and there may be clustered distributed "age". The clustered "age" may causes the clustered flushing and the next clustered "age"...
<innodb_max_dirty_pages_pct(native parameter) & innodb_io_capacity(patched parameter)>
The next graph is the case when I tried to restrain the checkpoint age growing by controlling steady flushing with only innodb_max_dirty_pages_pct and innodb_io_capacity.
There are no sudden stalls, but many write IO and regression of the average throughput. The optimum setting may be very difficult by this method, because innodb_max_dirty_pages_pct and innodb_io_capacity are independent to the checkpoint age.
<innodb_adaptive_checkpoint(new patched parameter) & innodb_io_capacity>
The last is the case of using innodb_adaptive_checkpoint instead of innodb_max_dirty_pages_pct.
During the checkpoint age excesses the light-blue broken line, the steady flushing occurs. If the checkpoint age reaches the next orange line, the flushing becomes more strong. This is the effect of innodb_adaptive_checkpoint. And its strength can be controlled by innodb_io_capacity. (The optimum setting may not be difficult as the 2nd case.)
At first, the checkpoint age touch the red line once and the regression occurs. However after that, the periodical regression waves seem to be attenuated quickly. The clustered "age" may be flatted. There are no the regression of the average throughput.
innodb_adaptive_checkpoint will solve or soften such problems of dirty block "age" limit
with easier setting.
In addition, innodb_io_patches.patch also adds
"max checkpoint age", "modified age" and "checkpoint age"
to SHOW INNODB STATUS and we can check them easily.
Entry posted by Yasufumi | 8 comments
As you likely have seen Sun has posted the new SpecJAppServer Results More information from Tom Daly can be found here These results are quite interesting for me as I worked on some of the previous SpecJAppServer Benchmarks several years ago while being employed by MySQL.
These are great results, plus they can be relevant to a lot of us because commodity x86 based hardware was used for the test. So it is not just about Sun it is about OpenSource hardware on Commodity Hardware.
As usually with results from such benchmarks there is no direct comparison available. The configuration Tom compares results to are not OpenSource and on the hardware of the different class so it is really hard to see what really caused the difference. It would be very interesting to see for example results on the same hardware just running PostgreSQL instead of MySQL or having Sun box replaced with comparable Dell or HP box. Unfortunately such benchmarks are more about marketing then fair technical comparison and we, technical people, just can assume the team had tried to get the most the given configuration and look for tuning ideas or try to read between the lines.
For example we can see for J2EE the system with 8 cores was used while for database 4 cores was enough. Does it mean the J2EE system got saturated before the database ? Or does it mean MySQL did not scale well to 8 cores on this benchmark ?
Now lets look at Java Settings:
JDBC Pool (for EJBs): max-pool-size=100 steady-pool-size=50
cachePrepStmts=true prepStmtCacheSize=512 alwaysSendSetIsolation=false
useLocalSessionState=true useServerPreparedStmts=false useLocalSessionState=true useReadAheadInput=false
elideSetAutoCommit=true useUsageAdvisor=false useReadAheadInput=false
useUnbufferedInput=false cacheServerConfiguration=true
This gives you some hints what you can try for JDBC configuration. Some caching options are must (otherwise JDBC will have a lot of extra calls checking various server changes which typically never change). We can also see Server prepared statements were disabled for the run (same as we had few years ago). Prepared statements generally should have helped this benchmark because it has a lot of same queries an Prepared Statements can be more efficient on the server side but I guess something is not as well optimized with them as it could be which makes it better to disable them.
The MySQL Settings are probably what is the most interesting:
MySQL 5.0 Tuning in /etc/my.cnf (included in FDA)
[mysqld]
sql-mode = IGNORE_SPACE
transaction-isolation = READ-COMMITTED
max_allowed_packet = 1M
max_connections=300
max_connect_errors=100
table_cache = 6000
read_rnd_buffer_size = 2M
sort_buffer_size = 32k
thread_cache = 16
query_cache_size = 0M
thread_concurrency = 8
max_heap_table_size=200M
log-output = FILE
log-slow-queries=/tmp/mysql-slow.log
long_query_time = 1
innodb_data_home_dir = /data/mysql/var
innodb_data_file_path = ibdata1:10000M:autoextend
innodb_file_per_table
innodb_log_group_home_dir = /log/mysql/var/
innodb_checksums = 0
innodb_doublewrite = 0
innodb_buffer_pool_size = 5000m
innodb_additional_mem_pool_size = 20M
innodb_log_file_size = 1600M
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 300
innodb_thread_concurrency = 0
innodb_sync_spin_loops = 40
innodb_locks_unsafe_for_binlog = 1
innodb_max_dirty_pages_pct=15
innodb_support_xa=0
innodb_flush_method = O_DIRECT
First the options are somethat benchmark optimized - for example doublewrite is disabled; checksums are disabled; binary log is disabled - this is not what you would probably run in production but as I guess benchmark does not require these properties and so they are tuned away.... though I bet the over vendors do the same thing of tuning database options for benchmark rather than running production reasonable options.
It is interesting though slow query log remained enabled - probably it caused so little overhead because there were no slow queries it just was left in tact.
Other important settings:
transaction-isolation = READ-COMMITTED - this is indeed good setting for many workloads so unless you have repeatable reads requirement consider using this.
sort_buffer_size = 32k - It is interested how much this was really investigated. Indeed this is much smaller than default. If you have queries which only do small sorts it is indeed better to have smaller sort buffer because allocating it will likely be faster.
query_cache_size = 0M - query cache disabled. Not a big surprise though as this is default value I believe there was attempt to enable it and it did not make things better.
max_heap_table_size=200M - I'm not sure why is it set. Are any explicit MEMORY tables use in the benchmarks ? As otherwise you also need to boost tmp_table_size to deal with implicit memory tables.
innodb_file_per_table - so I assume storing each table in its own file worked better for this benchmark. Depending on workload it can be a bit slower or faster but it is surely better for operations.
innodb_log_file_size = 1600M - another Benchmark optimization. Such large logs typically will cause too long recovery time in case of crash so people settle on smaller log for a little bit lless of performance.
innodb_thread_concurrency = 0 Good! So Innodb performed best in this workload without restricting number of threads inside Innodb Kernel. Again this is rather workload specific.
innodb_sync_spin_loops = 40 - This tuning option is rarely used so I'm curious how much speed benefit did it really provide in this case.
innodb_max_dirty_pages_pct=15 - Another interesting one. So the it was better to restrict amount of dirty pages Innodb can have to get better performance. It probably was done to deal with "dips" which can affect peak response times a lot. The fuzzy checkpointing could be done a lot better in Innodb (and we have patches for this)
Operating System Notes from database server:
UFS Options for log and data
noatime, nologging
mysqld moved into the FX scheduling class via priocntl -s -c FX mysql-pid
"nologging" looks a bit scary here but again potential fsck on power failure is probably not the problem for benchmark specs. But what really surises me here is why not ZFS ? The ZFS MySQL success story is all over across Sun's blogs, so why benchmarks are still run on UFS ? It would be really cool to know how ZFS would score here. And it is really OK for it to be a little bit slower - it has many nice features which are worth a bit of performance overhead.
My best wishes to the Tom Daly and his team in further benchmark results improvements. I guess we'll see more of these to come.
Entry posted by peter | No comment
As you may have recently seen there are some articles about scaling MySQL one 256-way system.
I though wow did they really make it work, considering how many bottlenecks remain in MySQL.
What article really tells us ?
First the number 256 - this is not number of Cores... this is number of hardware threads which is not exactly the same thing. Each T2 Plus CPU has 8 cores, which with 8 threads each giving 64 threads per chip or 256 threads all together.
Now what is about MySQL scaling to use these 32 cores with 256 threads ? Especially with the goal of "Do it with minimal tuning i.e as close as possible as out-of-the-box" ? Do we simply start MySQL server and change couple of defaults to make it work ?
No! To get really good performance we have to setup 32! MySQL instances on this box (approximately one per core).
Comparing single instance performance (4350 statements/sec) to the total performance archived with 32 instances (79334 statements/sec) we can see the single instance gets 1/18 performance archiveable on the system which is of course extremely poor number.
The scale out and ability to scale by using as many MySQL instances as you like is a very good application architecture but unless you have everything absolutely automated in the Google way I do not think you would enjoy running 32 MySQL instances on the box for the single application just to get decent speed.
Now I should point back to my old post about T2000 Perfrormance with MySQL The Matt Ingenthron commented “Sun” is not, to my knowledge, “aggressively pushing T2000 as Scalable MySQL Platforms”. - Well what this one would be ? Of course it is not T2000 any more and the CPU performance got a bit better but as I understand single thread performance is still many time slower than one of recent Xeons which makes it really tough call for MySQL which can't perform many operations in parallel.
Now you might got a feeling 79000 queries is a lot ? It is hard to tell without knowing what the queries are but if you just want to look at the queries - check this out - these are a year and a half old benchmarks on the single MySQL instance showing over 50.000 queries on 8 cores (and close to 40.000 queries/sec on 4 slow AMD cores).
Do not get me wrong. The multi-threading architecture Sun has is great for many applications... however it is NOT great for MySQL unless you really really do not care about single thread performance (REPLICATION, ALTER TABLE etc) and willing to run insane number of MySQL instances per server.
Entry posted by peter | 13 comments
After some pause we are going to announce bunch of patches we made and ported for last period.
Ported patches (ported from Google V2 patch):
Our patches (thanks Yasufumi!)
We are preparing our binaries for RedHat / CentOS which will include some subset of patches, and also I think OurDelta will make builds available for wide range of platform.
Entry posted by Vadim | 2 comments
For normal Innodb "hot" backups we use LVM or other snapshot based technologies with pretty good success. However having incremental backups remain the problem.
First why do you need incremental backups at all ? Why not just take the full backups daily. The answer is space - if you want to keep several generations to be able to restore to, having huge amount of full copies of large database is not efficient. Especially if it only changes couple of percents per day.
The solution MySQL offers - using binary log works in theory but it is not overly useful in practice because it may take way too long to catch up using binary log. Even if you have very light updates and can execute updates for a full day within an hour it will take over 24 hours to cover month worth of binary logs... and quite typically you would have much higher update traffic.
Another solution is rdiff which is a great general purpose tool. Though you can do much better with Innodb in Particular.
The Innodb pages have great deal of information helpful for their incremental backup in their internal. There is basically page version allowing to quickly check if the page is newer. There is page checksum and finally there is an offset of page (where it should be in the data file) stored in the page.
Using this data it should be easy to implement very efficient and yet simple for Incremental backup for Innodb.
In a way similar to rdiff the tool could both update the backup and store the rollback changes or if dealing with read-only compressed backup create the roll-forward recovery log, which also can be easily compressed.
What tool would need to do is to go through the pages for each Innodb file and simply write all the new pages to the separate file. Because pages already have position information in them there is no need to have complex "diff" meta data.
For recovery we can simply read this new pages file and put the pages back to their original places.
Of course this means .frm files and Innodb logs and MyISAM system tables need to be copied fully but they typically do not have any considerable portion of Innodb database
Entry posted by peter | 7 comments
I had an interesting tuning case few days ago. The system serving high traffic using Innodb tables would be stalling every so often causing even very simple queries both reads and writes taking long time to complete, with progress almost paused (dropping from thousands to tens of queries per second).
On the surface the problem looked simple - in the processlist every so often you would see a lot of queries, mostly selects taking 10+ seconds while at the same time there was no significant iowait, neither high CPU usage. Closer examination showed there were hundreds of queries stuck in the innodb queue, with innodb_thread_concurrency set to 8
Happily enough innodb_thread_concurrency is the variable which can be set online so it is easy to try a few different values and see what works best. In this case we decided to try removing restriction on runnable queries all together by setting it to 0.
The change made things different. Now there was no more trivial selects taking a lot of time but there were a lot of write queries taking a lot of time being... many of which SHOW INNODB STATUS showed as waiting on famous Innodb AUTO-INC lock.
That would be an easy excuse to blame everything on this lock, suggest upgrading to MySQL 5.1 and call it a day. Learning more about the system I however found out there are only some 20 inserts per second going to such table (which should not be the big deal even if they are serialized) plus it is not only inserts to table with auto increment were stalling but also inserts to the tables which did not have auto-increment columns and general updates. The stall was not caused by row level locks too.
Taking a closer look at Innodb flush activity showed this is checkpoint related problem. Unfortunately Innodb fuzzy checkpointing algorithm is not very smart causing stalls waiting for large portion of buffer pool is flushed before queries can proceed. As this is a common problem we have the patch for it.
What is very interesting in this case how true issues can really hide because of layers of false problems. The queries we first saw in the processlist were SELECTs but this is just because most queries in the application are selects and as soon as the thread concurrency slots were busy it is mostly selects which were stuck waiting.
The next symptom was AUTO-INC lock which again was a false one - because single insert slows down a lot of inserts pile up waiting on the AUTO-INC lock. Though this is not the main problem the main problem lives below - why the original insert is taking so long letting hundreds of AUTO-INC insert to pile up.
This was also very interesting case about optimal innodb_thread_concurrency. I continue reading (and hearing at the conferences) conflicting recommendations - some suggest to set 0 to avoid queue blocks, other suggest some low value to reduce internal contention. In reality though this option best setting is highly application dependent and you should not blindly use somebody elses advice on it. Try different values and see what works best for you. As I mentioned in the start of this post changing innodb_thread_concurrency is happily online operation.
Entry posted by peter | 7 comments
We have finally found the time to update out presentation pages - we've added the presentation on