MySQL Connections
07 March 2012

We've been having connection issues with our MySQL servers at work. Today I see a lot of connections that have been open for 1000+ seconds. Here's a quick way to find them:

    mysqladmin processlist | grep [databasename] | awk '{if($12>1000) print $2, $6, $12}'

You want to grep for the database name so you don't inadvertently catch and kill system processes. You could also put the database-name check in the awk condition itself for more accuracy.
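To see how the field numbers line up, here's the same awk filter run over a couple of fabricated processlist rows (the real command needs a live server; mydb and the hosts are made up). With awk's default whitespace splitting, the pipe characters land in the odd-numbered fields, so $2 is the Id, $6 the Host, $8 the db, and $12 the Time:

```shell
# Fake `mysqladmin processlist` rows; a real run would pipe from mysqladmin.
sample='| 51 | app | 10.0.0.5:44292 | mydb | Sleep | 1200 | | |
| 52 | app | 10.0.0.6:44293 | mydb | Query | 3 | | SELECT 1 |'

# $8 is the db column, so the grep can be folded into the awk condition.
stale=$(echo "$sample" | awk '$8 == "mydb" && $12 > 1000 {print $2, $6, $12}')
echo "$stale"   # prints: 51 10.0.0.5:44292 1200
```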

Then you can take that list and kill the MySQL process IDs:

    mysqladmin kill [id1],[id2],[id3]...

If you're really adventurous, you could pipe the IDs straight to mysqladmin kill, but that's a little dangerous.
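As a sketch of that adventurous pipeline (the rows below are fabricated again, and the actual kill is left commented out), you can build mysqladmin's comma-separated ID list with paste:

```shell
sample='| 51 | app | 10.0.0.5:44292 | mydb | Sleep | 1200 | | |
| 52 | app | 10.0.0.6:44293 | mydb | Sleep | 4100 | | |
| 53 | app | 10.0.0.7:44294 | mydb | Query | 3 | | SELECT 1 |'

# IDs of connections on mydb open longer than 1000 seconds, comma-joined.
ids=$(echo "$sample" | awk '$8 == "mydb" && $12 > 1000 {print $2}' | paste -s -d, -)
echo "$ids"   # prints: 51,52

# The dangerous one-liner would then finish with:
#   mysqladmin kill "$ids"
```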

Monitoring MySQL Queries
17 February 2012

I have the challenge of tracking down some seemingly random issues with MySQL connection spikes and subsequently aborted connections. MySQL Enterprise Monitor is already in place and no other indicators correspond with the connection issues. The plan is to keep a record of all queries against the server for a period of time so that when the issue happens next, we have a history of activity before and throughout the trouble period.

The traffic to the database is captured via a replicated switch port to another host so as not to impose any additional load on the server. This host runs tcpdump to capture the packets and dump them to a file. Later on, when an issue needs investigating, the Maatkit toolkit (specifically mk-query-digest) will be used to reconstruct and report on the queries. Then I can look for any anomalies.

The amount of traffic going to the database is enormous - enough to generate over half a gigabyte of packet dumps in one minute. I wrote a simple script to create a new dump file every x minutes.


    # PIDFILE, LOGPATH, and CAPTIME are defined elsewhere, e.g.:
    # PIDFILE=/var/run/capture.pid; LOGPATH=/data/capture/logs; CAPTIME=300

    echo $$ > $PIDFILE

    while true; do
        tcpdump -i eth1 port 3306 -s 65535 -x -n -q -tttt \
            > $LOGPATH/packets-$(date +%y%m%d-%H%M%S).log &
        PID=$!
        sleep $CAPTIME
        kill -HUP $PID
    done

I then have an hourly cron job that comes through and deletes any log files over two hours old:

    find /data/capture/logs -mmin +120 -exec rm {} \;
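The expression can be tried safely in a scratch directory first - the back-dated timestamp below stands in for a file that's actually more than two hours old:

```shell
dir=$(mktemp -d)
touch "$dir/packets-old.log" "$dir/packets-new.log"
# Back-date one file well past the 120-minute cutoff.
touch -t 202001010000 "$dir/packets-old.log"

find "$dir" -type f -mmin +120 -exec rm {} \;

ls "$dir"   # prints: packets-new.log
```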

This will give me the history I need. However, I'm fully expecting the parsing of these logs to be extremely tedious. Here's to hoping I find something.

I recently had an issue where one of my MySQL slaves repeatedly errored on replication due to key collisions. The replication type was row-based, which is much more strict than statement-based. In fact, if it had been statement-based, a lot of these errors wouldn't have presented themselves and the slave would have continued on happily, becoming more and more inconsistent.

Because of the huge dataset and the speed of recovery required, I did not want to rebuild the entire database. I wanted to restore only the couple tables that were causing issues.

What I wanted to do was:

  1. stop replication on a good slave and the problem slave at the same point
  2. dump the table (from the good slave, obviously)
  3. drop and import the dumped tables on the problem slave
  4. restart replication

Fortunately this is achievable by stopping the slave to be fixed and then, a minute or so later, the good slave; capture the good slave's binlog file and executed position from SHOW SLAVE STATUS, and then issue the following on the slave being fixed:

    SELECT MASTER_POS_WAIT('binlog.000001', 123456);

At this point, both databases should have had their slave processes halted at the same execution point, and the dump and restore outlined above can be done. This reduced what would have been a 5-hour database copy to 15 minutes. Hopefully this will save someone else some time too.
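Spelled out as a checklist (hostnames are implied, and the binlog coordinates are the placeholder ones from above), the sequence looks roughly like this. MASTER_POS_WAIT blocks until the slave's SQL thread has executed past the given coordinates, so a STOP SLAVE issued right after it returns halts very close to that point:

```shell
# Printed as a plan only; each statement would actually be run in a
# mysql client session on the host named in the comment.
plan=$(cat <<'EOF'
-- on the slave being fixed:
STOP SLAVE;

-- a minute or so later, on the good slave:
STOP SLAVE;
SHOW SLAVE STATUS\G  -- note the binlog file and executed position

-- back on the slave being fixed, catch up to that point, then halt:
START SLAVE;
SELECT MASTER_POS_WAIT('binlog.000001', 123456);
STOP SLAVE;

-- dump the tables from the good slave, import on the fixed slave,
-- then START SLAVE on both and let them catch up
EOF
)
echo "$plan"
```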

Cloning a MySQL Slave
09 November 2011

Cloning a MySQL slave is typically very straightforward.

  1. Execute STOP SLAVE on the donor slave and capture the slave status information
  2. Stop MySQL on the donor
  3. Copy the database files from the donor to the new slave
  4. Start MySQL on the new slave
  5. Execute the CHANGE MASTER statement to start the new slave's replication process
  6. Start MySQL on the donor and allow replication to catch up

Simple, right? It is, if you don't run into the scenario I managed to hit. SHOW SLAVE STATUS gives you a lot of information like this:

                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.002644
          Read_Master_Log_Pos: 1015419943
               Relay_Log_File: relay-log.000257
                Relay_Log_Pos: 68175060
        Relay_Master_Log_File: binlog.002643
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 887594041
              Relay_Log_Space: 2448352803

This information is then matched into a CHANGE MASTER statement. The most important values here are MASTER_LOG_FILE and MASTER_LOG_POS, which tell the slave thread where to start replicating from. This will be the point where the donor slave was stopped before the copy.

Now, you'd think the master log file would correspond with the Master_Log_File value from SHOW SLAVE STATUS. It doesn't. You want Relay_Master_Log_File. Most often these two files are the same - if replication isn't lagging - but even on a slave that is not behind, there is a small chance you'll catch it just as the IO thread is fetching the next binlog file. Master_Log_File tracks what the IO thread has downloaded, while Relay_Master_Log_File tracks what the SQL thread has actually executed.

If you use the log file you think is correct, you'll be setting your replication to start too far ahead. Or you may get lucky like me: the position won't exist in the binlog you set, and you'll get a replication IO error - in my case, 1236 "Exceeded max_allowed_packet", instructing me to raise max_allowed_packet on the master, which is misleading in this situation.

The CHANGE MASTER statement for the above slave status should therefore be:

    CHANGE MASTER TO MASTER_LOG_FILE='binlog.002643', MASTER_LOG_POS=887594041;

The log file is the value from Relay_Master_Log_File and the log position is the value from Exec_Master_Log_Pos.
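As a guard against grabbing the wrong field by eye, the statement can be generated from a saved SHOW SLAVE STATUS dump with a little awk. This is a sketch only - the sample below is trimmed to the two lines that matter:

```shell
# Trimmed `SHOW SLAVE STATUS\G` output; a real dump has many more lines.
status='Relay_Master_Log_File: binlog.002643
Exec_Master_Log_Pos: 887594041'

# Split each line on the colon; $2 is the value.
file=$(echo "$status" | awk -F': *' '/Relay_Master_Log_File/ {print $2}')
pos=$(echo "$status" | awk -F': *' '/Exec_Master_Log_Pos/ {print $2}')

stmt="CHANGE MASTER TO MASTER_LOG_FILE='$file', MASTER_LOG_POS=$pos;"
echo "$stmt"
```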

Hopefully this will keep you from bludgeoning your forehead on your desk as I nearly did.