Friday, November 9, 2012

MySQL monitoring, backups and recovery

Being a MySQL DBA is more than just being able to understand the explain command to optimize a query.  The best optimized queries in the world are worthless with the database is down.

Some of the big responsibilities with being a MySQL DBA include monitoring, backup and recovery. This is a high level overview of those topics.

Monitoring:

A single monitor is just not good enough. If you are running a Nagios server with the MySQL plugins that is a great start. This  should not be your only solution. The first thing that should be added is another remote Nagios server to monitor the 1st master Nagios  monitor. After all if the monitor goes down how do you know?

When running a revenue generating MySQL Server it is easy to suggest using the MySQL Enterprise Monitor.  I have used the MySQL Enterprise Monitor as an additional resource for monitoring. It helps you spot trends as well as offer suggestions for tuning.

Monitor your logs. In order to be able to recover, you need to understand what happened. Monitoring log is more than just the error log. You need to watch the slow_query_log, error_log as well as OS logs. Knowing how to gather everything from network stats to memory or raid status is key.

Monitor your reports. Put in place events || etls || checks to validate reporting numbers. Every application is different so this is a custom solution per application but be able to get valid notifications when key reporting numbers fall under thresholds. This could be an early warning sign to other issues.

Monitor your backup logs and sizes. If you fail to watch the sizes of your backups as well as the logged output then I wish you the best of luck. Something like an upgrade could change your backup procedure. While you consider it to still be running just fine, maybe they error out now? A simple cron job that executes a script to record file sizes and checks for errors can put the output into a the database. Then another script can be used for reporting on that file size output. You will then have all the trends of file sizes to make positive thresholds.

Take out the human factor.  If all the monitoring notifications go to one person, then that is a single point of failure itself. Make sure fine tune notifications to remove false positives and alert more than one person. Text messages to smart phones are not enough. People are allowed to sleep and it is very easy to sleep through a txt message. I enjoy the old school pager because it simply works. If you can take advantage of asterisks then use it. A phone call wakes people up more than just a txt message.

This quick overview highlights that monitoring is more than just a single entity. It is a vast process that should never be taken for granted.

Backups:
MySQL backups are the process that everyone knows is a good idea. The problem is how many people actually do it effectively. Meaning that they work, tested, and do not harm a live revenue generating site.

mysqldump  is an effective tool for backups when it is not a live system. Stopping a slave and doing a mysqldump then restarting a slave is a very safe method for your system. The problem is that to many applications or entities either take to long to do a dump and rarely tested when it is done.

mysqldbexport is another option over mysqldump because it allows different output options.

Replication is not a backup solution. It is a step in the right direction for higher availability and you can use replication for rolling historical data with time delayed replication but you still need to have the data as a backup to be safe. If using MySQL dump or the enterprise options, yes you can use them off a slave easier just replication is not a backup solution.

Take advantage of the mysqlbinlog and use it to backup your binlogs. The binlogs are the life of your MySQL server, take care of them.

The best option in my opinion is MySQL Enterprise Backup. Nothing against Percona XtraBackup but since I already encourage the use of enterprise monitor I typically would support and stay with enterprise backup since I have access to it.  Regardless of your choice, the point is simple, take advantage of on-line backups. What happens when you have a serious replication issue and your slaves are useless? Do you really want to do a mysqldump off the master?  They also offer incremental and compression to help with your backup needs.

Recovery:

This is often the piece of the process that tested in theory but less in reality. It is often not realistic for some companies to be able to restore their backups often. The hardware required to hold the database is not cheap and can be used for other things than just backup restores. You do need to test your backups though. Maybe it is just once a quarter and you can afford to test a slave with a restored backup.  Maybe you have a virtual environment and you can test a backup into a VM just to see if it works?  Execute some reporting on this database to verify numbers. Take advantage of the different database checksum options available in the community.

Whatever it is something is better than nothing.



This again is just a overview of the concepts about monitoring, backup and recovery. Please use the links provided to find out more information.

More community and MySQL resources and options: