Backups with rdiff-backup

Kevin Fenzi

Rdiff-backup is a backup application that combines the best features of a mirror and an incremental backup in one easy-to-use, bandwidth-efficient application.

The first time rdiff-backup is run, it copies the entire contents of a directory tree to another target directory tree. After that initial run, it only copies changes that were made since the initial backup. Those changes are stored in a special directory, making it easy to examine and restore them. The target directory tree always contains the exact contents as of the last rdiff-backup run, making it easy to just restore the "last copy".

The primary benefits of rdiff-backup over using rsync, or similar systems, are that rdiff-backup fully supports ACLs and tightly integrates the retention of incremental change information into the backup and recovery process.

Some features of rdiff-backup include:

  • Written in Python
  • Uses the librsync library, which implements the rsync algorithm, to efficiently store and manipulate incremental changes to files
  • Can operate locally, or via ssh to a remote machine
  • Preserves permissions, ACLs, links, device files, user and group ownership, and modification times
  • Can run unattended to provide regular backups
  • Stores incremental changes in a compressed rsync format for space efficiency
  • After the initial backup, only changes are transmitted, making it very bandwidth friendly
  • Keeps good statistics on how many changes have taken place

Installing

Many operating systems have existing add-on packages for rdiff-backup. It's available via CentOS and Fedora Extras, Debian, Gentoo, and PLD Linux distributions using their native package management systems. It's also available from FreeBSD ports, NetBSD packages, and Fink on Mac OS X.

If you aren't running one of those systems, you can install rdiff-backup from source. You will need at least Python 2.2, librsync 0.9.7 or later, and optionally the pylibacl and pyxattr modules for ACL support.

Backups and How They Are Stored

A simple local backup of a directory can be done with the command:

rdiff-backup directoryA backup_of_directoryA

This will make an initial backup of all the contents of the source directory and place them in "backup_of_directoryA". After this initial run of rdiff-backup, "directoryA" and "backup_of_directoryA" are almost identical. The only difference is that in "backup_of_directoryA" you will find a sub-directory called "rdiff-backup-data".

Rdiff-backup stores all of its files in this directory. All the other files in the tree are an exact copy of the files from directoryA. They have the same time-stamp, permissions, ownership, data, and ACLs. If you need to restore the last backed up copy of a file or directory, you can just use normal command-line tools to cp/scp/rsync that data from under the "backup_of_directoryA" tree.
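As a minimal sketch of restoring the latest copy with ordinary tools, the helper below copies one path out of the mirror tree while refusing to touch rdiff-backup's own bookkeeping directory. The function name and argument order are illustrative and not part of rdiff-backup, and plain cp may not carry ACLs across.

```shell
#!/bin/sh
# restore_latest MIRROR RELPATH DEST
# Copy the most recent backed-up version of RELPATH out of the mirror tree.
# (Hypothetical helper; the name and arguments are assumptions of this sketch.)
restore_latest() {
    mirror=$1 relpath=$2 dest=$3
    case "$relpath" in
        rdiff-backup-data|rdiff-backup-data/*)
            echo "refusing to copy rdiff-backup's own bookkeeping files" >&2
            return 1 ;;
    esac
    # -p keeps mode, timestamps and (when permitted) ownership; -R recurses
    # so the same call works for a single file or a whole directory.
    cp -pR "$mirror/$relpath" "$dest"
}
```

For example, restore_latest backup_of_directoryA notes/todo.txt /tmp/todo.txt would copy the last backed-up version of a hypothetical notes/todo.txt into /tmp.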

Let's take a look at what is in that rdiff-backup-data directory under the top of a backup directory:

backup.log
chars_to_quote
current_mirror.2006-02-15T10:56:05-07:00.data
error_log.2006-02-15T10:56:05-07:00.data.gz
file_statistics.2006-02-15T10:56:05-07:00.data.gz
increments/
mirror_metadata.2006-02-15T10:56:05-07:00.snapshot.gz
session_statistics.2006-02-15T10:56:05-07:00.data

"backup.log" is a file that contains any errors or problems that might have occurred with your backup. It can also contain statistics about your backup (i.e., how many files were backed up, how much raw data, etc.). This file will be appended to, so you will always be able to look back and see which backups had issues and what the old stats were.

"chars_to_quote" is a file that contains a list of any characters that are considered "special" on your backup volume. This would allow you to use a different file system type to store your backups. For example, you might have a Linux system with an ext3 file system that is backing up to a vfat-formatted USB key device. Note that unless you have a very odd file system, rdiff-backup automatically detects what it needs to quote, and you shouldn't have to modify this file at all.

"current_mirror.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data" keeps track of the PID of the rdiff-backup process that's currently mirroring your data, and also the time of the current backup (i.e., what's in the backup_of_directoryA/ tree).

"error_log.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data.gz" is a gzipped file with any errors from a particular rdiff-backup run.

"file_statistics.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data.gz" is a gzipped file that has statistics about a particular backup. This file will list if a file was changed from the last backup, the size of the file, the size of the incremental change, and the size that was copied to the backup.

The increments directory is where rdiff-backup saves all the previous backup data. This directory is set up in a tree like the main backup directory, except there are additional files that rdiff-backup uses to store incremental backup data. Files and directories have full timestamps in their names so you can see which backup they are from. Files that were present in old backups, but were removed, are marked with a ".missing" suffix. Files that have changed are stored in the format "filename.YYYY-MM-DDTHH:MM:SS-TIMEZONE.diff.gz", which is a gzipped rsync diff file.

"mirror_metadata.YYYY-MM-DDTHH:MM:SS-TIMEZONE.snapshot.gz" is a gzip-compressed file that contains metadata for files in the backup. This includes the file type (directory, symlink, device file, etc.), modification times, uid, username, gid, groupname, permissions, and any other items that are part of a file's metadata.

"session_statistics.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data" contains the overall stats on this backup. This is a high-level summary with number of files copied, size of backup, etc.

So, now if we run another backup the same way and look in the rdiff-backup-data directory, we see:

backup.log
chars_to_quote
current_mirror.2006-02-15T11:15:01-07:00.data
error_log.2006-02-15T10:56:05-07:00.data.gz
error_log.2006-02-15T11:15:01-07:00.data.gz
file_statistics.2006-02-15T10:56:05-07:00.data.gz
file_statistics.2006-02-15T11:15:01-07:00.data.gz
increments/
mirror_metadata.2006-02-15T10:56:05-07:00.snapshot.gz
mirror_metadata.2006-02-15T11:15:01-07:00.snapshot.gz
session_statistics.2006-02-15T10:56:05-07:00.data
session_statistics.2006-02-15T11:15:01-07:00.data

Note that the new backup is about 20 minutes after the first. We see there is only one "current_mirror" file, now pointing to the second backup as being the most current. You can run backups as often as your disk space and bandwidth allow.
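Because the timestamps are embedded in the file names, a script can read the state of a backup without parsing any file contents. The following is a rough sketch that relies only on the naming shown in the listings above; the function names are illustrative, and other rdiff-backup versions may name or compress these files differently.

```shell
#!/bin/sh
# last_backup_time DATA_DIR
# Print the timestamp carried in the name of the current_mirror marker.
last_backup_time() {
    for f in "$1"/current_mirror.*.data; do
        [ -e "$f" ] || continue        # glob matched nothing
        name=${f##*/}                  # strip the directory part
        name=${name#current_mirror.}   # strip the fixed prefix
        echo "${name%.data}"           # strip the suffix, leaving the timestamp
    done
}

# count_sessions DATA_DIR
# Count backup sessions by counting session_statistics files.
count_sessions() {
    n=0
    for f in "$1"/session_statistics.*.data; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}
```

Run against the second listing above, last_backup_time would print 2006-02-15T11:15:01-07:00 and count_sessions would print 2.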

Remote Backups

Now that I've shown how easy it is to do local backups to another directory, let's look at remote backups. Rdiff-backup can use ssh as an underlying transport to send backups to a remote directory. You will need to make sure rdiff-backup is installed on both the client machine and the machine to which you are sending your backups.

Also note that you will need the same version of rdiff-backup on both ends, as the protocol sometimes changes between versions. This requires that you keep the versions of rdiff-backup on both systems closely synchronized.
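One way to catch a version mismatch before a backup fails is to compare the version strings on both ends. The sketch below assumes that rdiff-backup --version prints a single version line on both machines and that ssh access to the backup host already works; the function names are illustrative, and strict string equality is a deliberately conservative test.

```shell
#!/bin/sh
# same_version LOCAL REMOTE
# Succeed only if both version strings are non-empty and identical.
same_version() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# check_remote_version USER@HOST
# Compare the local rdiff-backup version with the one on the backup host.
check_remote_version() {
    host=$1
    local_v=$(rdiff-backup --version) || return 2
    remote_v=$(ssh "$host" rdiff-backup --version) || return 2
    if same_version "$local_v" "$remote_v"; then
        echo "OK: both ends report $local_v"
    else
        echo "MISMATCH: local '$local_v', remote '$remote_v'" >&2
        return 1
    fi
}
```

Calling check_remote_version user@backupmachine at the top of a backup script lets the script stop early instead of failing partway through a transfer.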

Once rdiff-backup is installed on both machines, you can simply do:

rdiff-backup localdir/ user@backupmachine::/backups/backup_of_local_dir/

Rdiff-backup will prompt you for your password or use your ssh key and invoke rdiff-backup in a server mode on the remote machine. Note that you may need to be root on the remote machine to write device files or ACLs.

You can also set up rdiff-backup for unattended nightly backups. You will need to set up your backup user with an ssh key with no password, but restricted by IP address and command. There is a detailed description about how to do this on the rdiff-backup Web site.
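To give a feel for what such a restricted key looks like, the sketch below prints an authorized_keys line that ties a key to one client address and one command. It assumes the --server and --restrict options described in the rdiff-backup man page; the function name, the address 192.0.2.10, and the key text are placeholders, and the instructions on the rdiff-backup Web site remain the authoritative reference.

```shell
#!/bin/sh
# authorized_keys_entry CLIENT_IP BACKUP_DIR PUBLIC_KEY
# Print one authorized_keys line that only allows rdiff-backup in server
# mode, confined to BACKUP_DIR, and only from CLIENT_IP.
# (Hypothetical helper; verify the options against your installed version.)
authorized_keys_entry() {
    client_ip=$1 backup_dir=$2 pubkey=$3
    printf 'command="rdiff-backup --server --restrict %s",from="%s",no-port-forwarding,no-X11-forwarding,no-pty %s\n' \
        "$backup_dir" "$client_ip" "$pubkey"
}
```

The resulting line would go into ~/.ssh/authorized_keys for the backup user on the backup machine, after which a cron job on the client can run the usual rdiff-backup command without a password prompt.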

Restoring Data from Backups

Performing backups is fun and easy with rdiff-backup, but at some point you will probably need to restore data from those backups. Rdiff-backup makes this pretty easy as well. Let's look at a few common restore cases:

1. You need to restore the last backup of a file/directory:

This is very easy with rdiff-backup. You simply go to your backup directory and use standard tools to cp/scp/rsync the file or directory you want out. Since the main backup directory is always a copy of the data as of the last backup, you can just go and get it. Note that using non-ACL-aware tools might mean that you lose ACL information when copying the file back. If you need to keep your ACL information, see one of the methods below.

2. You need to restore a file/directory from last week's backup:

You can use rdiff-backup to get this file/directory from the backup machine and restore it to /tmp on your local machine:

rdiff-backup -r 7D user@backupmachine::/remote-dir/file /tmp/file

This will get the file in /remote-dir/file as it was in the backup from 7 days ago. If there is no backup from exactly 7 days ago, it will use the most recent backup before that time. You can specify all sorts of time/date combinations. See the rdiff-backup man page for more details.
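A small wrapper can make time-based restores less error-prone by printing the command before anything is transferred. This is only a sketch: the function name and the DRY_RUN variable are inventions of this example, and the check for an existing destination is a precaution of the wrapper itself.

```shell
#!/bin/sh
# restore_as_of WHEN SOURCE DEST
# Restore SOURCE as it was at time WHEN (for example 7D) into DEST.
# Set DRY_RUN=1 to print the command instead of running it.
restore_as_of() {
    when=$1 src=$2 dest=$3
    if [ -e "$dest" ]; then
        echo "destination $dest already exists; choose a new path" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo rdiff-backup -r "$when" "$src" "$dest"
    else
        rdiff-backup -r "$when" "$src" "$dest"
    fi
}
```

For example, DRY_RUN=1 restore_as_of 7D user@backupmachine::/remote-dir/file /tmp/file prints the command from the text, and dropping DRY_RUN=1 runs it.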

3. You aren't entirely sure which version of a particular file you need, so you want to look at the various backup diffs and decide.

You can examine the rdiff-backup .diff.gz files for a particular file and when you find the one you want to restore, you can tell rdiff-backup that you want that particular diff. Do something like:

rdiff-backup user@backupmachine::/remote-dir/rdiff-backup-data/increments/file.YYYY-MM-DDTHH:MM:SS-TIMEZONE.diff.gz /tmp/file

You can also get rdiff-backup to list when increments were made:

rdiff-backup --list-increments user@backupmachine::/remote-dir/

More Tips and Examples

Initial backups of large amounts of data take quite a while, because all of your data must be copied over to the backup device. You should make sure to perform initial backups over fast links if possible.

If you are doing backups over a slow/flaky link, you might consider doing an rdiff-backup to another local directory first and then doing something like an rsync to the remote machine from there. This gets around a problem of the rdiff-backup never being able to finish.

If you are using rdiff-backup to back up an entire Linux machine, you may want to add: --exclude /selinux --exclude /sys --exclude /proc to prevent rdiff-backup from complaining about these virtual file systems.

You will want to remove old backups when you need to free up space on your backup media. You can use the --remove-older-than option to do so. This option accepts the same time formats as the other rdiff-backup options (e.g., 1D for 1 day, 1W for 1 week, 1Y for 1 year, and 1Y2W10D for anything older than 1 year, 2 weeks, and 10 days). Note that rdiff-backup is very space efficient with incremental backups, so you can monitor and adjust how many backups you keep.
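To see how a compound interval such as 1Y2W10D adds up, the sketch below converts an interval into days. It assumes, as the man page's description of time formats is understood here, that a month always counts as 30 days and a year as 365 days; it handles only the D, W, M, and Y units, and the function name is illustrative.

```shell
#!/bin/sh
# interval_days SPEC
# Convert an interval such as 1Y2W10D into a number of days.
# Assumes fixed lengths: W = 7 days, M = 30 days, Y = 365 days.
interval_days() {
    rest=$1 total=0
    while [ -n "$rest" ]; do
        num=${rest%%[!0-9]*}         # leading run of digits
        rest=${rest#"$num"}          # drop those digits
        unit=${rest%"${rest#?}"}     # first character after the digits
        rest=${rest#?}               # drop the unit character
        [ -n "$num" ] || { echo "bad interval: $1" >&2; return 1; }
        case "$unit" in
            D) mult=1 ;;
            W) mult=7 ;;
            M) mult=30 ;;
            Y) mult=365 ;;
            *) echo "bad interval: $1" >&2; return 1 ;;
        esac
        total=$((total + num * mult))
    done
    echo "$total"
}
```

Under those assumptions, interval_days 1Y2W10D prints 389 (365 + 14 + 10), so --remove-older-than 1Y2W10D would drop increments more than 389 days old.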

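You can use the --exclude and --include options to exclude or include files by type or name. For example, if you don't want to back up any of your ISO images or MP3 files, you could add --exclude '**.mp3' --exclude '**.iso' to your backup command. The ** wildcard matches across "/", so these patterns apply at any depth under the source directory, and the quotes keep the shell from expanding the patterns before rdiff-backup sees them.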
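The exclusions discussed in this section can be collected in one wrapper for whole-machine backups. The sketch below always excludes the three virtual file systems named earlier and adds any extra patterns the caller supplies; the function name and the DRY_RUN variable are assumptions of this example, and a destination that lives on the same machine would need its own --exclude.

```shell
#!/bin/sh
# system_backup DEST [PATTERN...]
# Back up / to DEST, skipping virtual file systems plus any extra patterns.
# Set DRY_RUN=1 to print the command instead of running it.
system_backup() {
    dest=$1; shift
    n=$#
    # Rotate through the caller's patterns, prefixing each with --exclude.
    while [ "$n" -gt 0 ]; do
        set -- "$@" --exclude "$1"
        shift
        n=$((n - 1))
    done
    set -- --exclude /selinux --exclude /sys --exclude /proc "$@" / "$dest"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo rdiff-backup "$@"
    else
        rdiff-backup "$@"
    fi
}
```

For example, DRY_RUN=1 system_backup user@backupmachine::/backups/host '**.mp3' '**.iso' prints the full command line so it can be checked before the first real run.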

Rdiff-backup vs. Other Backup Software

Rdiff-backup is oriented toward online disk-based backups, so it doesn't really compete with solutions that back up to tape. Because of the cost of tapes and tape drives compared to disks, it makes a lot of sense to use disks for backup.

There are several other solutions that are similar to rdiff-backup. For a long time, I used a simple backup script that used rsync to back up to a remote server. This method works fine, but incremental backups are not done. If you need a backup of a file from before it was modified, you are out of luck. Rdiff-backup wins here because of its easy ability to store as many incremental backups as you have space for.

Slightly more complex scripts with rsync can use the "hard-link" option to "cp" to build a space-efficient copy of the rsync backup directory after the rsync runs. However, creating these "incremental" copies can take a very long time. Also, if you have large files that have small additions made to them on a daily basis, like many database and log files, this rsync mechanism stores a full copy of the file every day. Rdiff-backup stores only the exact set of changes within the file from day to day.
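For comparison, the hard-link scheme described above can be sketched in a few lines. This assumes GNU cp, whose -l option creates hard links instead of copying data; the directory names "current" and "snapshot.DATE" and the function name are hypothetical.

```shell
#!/bin/sh
# rotate_snapshot BACKUP_ROOT [STAMP]
# Hard-link the freshly synced mirror into a dated snapshot directory.
# Unchanged files then share disk blocks between snapshots.
rotate_snapshot() {
    root=$1
    stamp=${2:-$(date +%Y-%m-%d)}
    # -a preserves attributes, -l links files rather than copying them.
    cp -al "$root/current" "$root/snapshot.$stamp"
}

# A nightly job would first refresh the mirror and then snapshot it:
#   rsync -a --delete /home/ /backups/current/
#   rotate_snapshot /backups
```

When a later rsync run replaces a changed file in "current", the older snapshots keep the previous version, but the new version is stored in full, which is the cost for large, slowly growing files that the text points out.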

Dirvish is a Perl-based solution that has quite a lot in common with rdiff-backup. Dirvish seems to be shipped only with Debian, though, so it isn't as easily available as rdiff-backup. It also requires a number of Perl modules to be installed. You must also set up its "Banks" and "Vaults" to do backups; whereas rdiff-backup creates the files it needs on the fly. Rdiff-backup seems to be easier to use and install than Dirvish.

There is a branch of rdiff-backup called "duplicity". This package works very much like rdiff-backup, but it uses gnupg to encrypt all backups on the remote server. This provides more security in your backups and allows you to restrict access to the data to those who have the gnupg passphrase for the backup.

Possible Pitfalls

Rdiff-backup is a great backup solution, but like any application it has some drawbacks:

  • You will need enough space on your backup media for your entire dataset, plus space for incrementals. This isn't really that big of a deal in these days of large inexpensive disks, but it's worth mentioning.
  • The initial sync of large datasets takes a very long time. My laptop with about 45 GB of data took about 15 hours to complete its initial sync over wireless. A fast network is recommended for initial backups. On the other hand, incremental backups are very fast. My laptop takes about 25 minutes to complete an incremental backup now.
  • If a backup fails due to network or other issues, the next backup will throw out the incomplete backup and start over. (See the tips section above for a workaround for that.)
  • Currently, SELinux contexts are not saved/restored properly by rdiff-backup. I did some investigation and testing, and it doesn't seem to properly detect the contexts. This could be an issue for sites that use SELinux heavily. In the meantime, you should check the SELinux contexts after any restore and manually reset them to proper values or use the "relabel" option to have SELinux re-label all files.

Conclusion

Rdiff-backup is an easy-to-install, easy-to-use backup solution that does everything that most backup solutions need to do. Try it out today.

References

Dirvish -- http://www.dirvish.com/

Duplicity -- http://www.nongnu.org/duplicity/

Librsync -- http://librsync.sourceforge.net/

Pylibacl -- http://pylibacl.sourceforge.net/

Pyxattr -- http://pyxattr.sourceforge.net/

Python -- http://www.python.org/

Rdiff-backup Web page -- http://rdiff-backup.org/

Kevin Fenzi, co-author of the Linux Security HOWTO, is a senior member of tummy.com's team. He has been working as a Unix and Linux Systems Administrator for more than 15 years. His passion is security and all of the steps needed to ensure that systems are kept safe.