Backups with rdiff-backup
Kevin Fenzi
Rdiff-backup is a backup application that takes the best features
of a mirror and an incremental backup and combines them into one,
easy-to-use, bandwidth-efficient application.
The first time rdiff-backup is run, it copies the entire contents
of a directory tree to another target directory tree. After that
initial run, it copies only the changes made since the previous
backup. Those changes are stored in a special directory, making
it easy to examine and restore them. The target directory tree always
contains the exact contents as of the last rdiff-backup run, making
it easy to just restore the "last copy".
The primary benefits of rdiff-backup over using rsync, or similar
systems, are that rdiff-backup fully supports ACLs and tightly integrates
the retention of incremental change information into the backup
and recovery process.
Some features of rdiff-backup include:
- Written in Python
- Uses the librsync library, an implementation of the rsync algorithm,
to efficiently store and manipulate incremental changes to files
- Can operate locally, or via ssh to a remote machine
- Preserves permissions, ACLs, links, device files, user and group
ownership, and modification times
- Can run unattended to provide regular backups
- Stores incremental changes in a compressed rsync format for
space efficiency
- After the initial backup, only changes are transmitted, making
it very bandwidth friendly
- Keeps good statistics on how many changes have taken place
Installing
Many operating systems have existing add-on packages for rdiff-backup.
It's available via CentOS and Fedora Extras, Debian, Gentoo, and
PLD Linux distributions using their native package management systems.
It's also available from FreeBSD ports, NetBSD packages, and Fink
on Mac OS X.
If you aren't running one of those systems, you can install rdiff-backup
from source. You will need at least Python 2.2, librsync 0.9.7 or
later, and optionally the pylibacl and pyxattr modules for ACL support.
Backups and How They Are Stored
A simple local backup of a directory can be done with the command:
rdiff-backup directoryA backup_of_directoryA
This will make an initial backup of all the contents of the source
directory and place them in "backup_of_directoryA". After this initial
run of rdiff-backup, "directoryA" and "backup_of_directoryA" are almost
identical. The only difference is that in "backup_of_directoryA"
you will find a sub-directory called "rdiff-backup-data".
Rdiff-backup stores all of its own metadata and increment files in
this directory. All the
other files in the tree are an exact copy of the files from directoryA.
They have the same time-stamp, permissions, ownership, data, and
ACLs. If you need to restore the last backed up copy of a file or
directory, you can just use normal command-line tools to cp/scp/rsync
that data from under the "backup_of_directoryA" tree.
Let's take a look at what is in that rdiff-backup-data directory
under the top of a backup directory:
backup.log
chars_to_quote
current_mirror.2006-02-15T10:56:05-07:00.data
error_log.2006-02-15T10:56:05-07:00.data.gz
file_statistics.2006-02-15T10:56:05-07:00.data.gz
increments/
mirror_metadata.2006-02-15T10:56:05-07:00.snapshot.gz
session_statistics.2006-02-15T10:56:05-07:00.data
"backup.log" is a file that contains any errors or problems
that might have occurred with your backup. It can also contain statistics
about your backup (e.g., how many files were backed up, how much raw
data, etc.). This file will be appended to, so you will always be
able to look back and see which backups had issues and what the old
stats were.
"chars_to_quote" is a file that contains a list of any characters
that are considered "special" on your backup volume. This would
allow you to use a different file system type to store your backups.
For example, you might have a Linux system with an ext3 file system
that is backing up to a vfat-formatted USB key device. Note that
unless you have a very odd file system, rdiff-backup automatically
detects what it needs to quote, and you shouldn't have to modify
this file at all.
"current_mirror.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data" keeps track
of the PID of the rdiff-backup process that's currently mirroring
your data, and also the time of the current backup (i.e., what's
in the backup_of_directoryA/ tree).
"error_log.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data.gz" is a gzipped
file with any errors from a particular rdiff-backup run.
"file_statistics.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data.gz" is a gzipped
file that has statistics about a particular backup. This file will
list if a file was changed from the last backup, the size of the
file, the size of the incremental change, and the size that was
copied to the backup.
The increments directory is where rdiff-backup saves all the previous
backup data. This directory is set up in a tree like the main backup
directory, except that it holds the additional files rdiff-backup uses
to store incremental backup data. Files and directories have full
timestamps in their names so you can see which backup they are from.
An increment ending in .missing records that the file did not exist
at the time of that backup. Files that have changed are stored in the
format "filename.YYYY-MM-DDTHH:MM:SS-TIMEZONE.diff.gz", which is
a gzipped rsync-style diff file.
"mirror_metadata.YYYY-MM-DDTHH:MM:SS-TIMEZONE.snapshot.gz" file
is gzip-compressed and contains metadata for files in the
backup. This includes the file type (directory, symlink,
device file, etc.), modification time, uid, username, gid, groupname,
permissions, and any other items that are part of a file's metadata.
"session_statistics.YYYY-MM-DDTHH:MM:SS-TIMEZONE.data" contains
the overall stats on this backup. This is a high-level summary with
number of files copied, size of backup, etc.
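Because each increment file carries its backup time in its name
("filename.YYYY-MM-DDTHH:MM:SS-TIMEZONE.diff.gz"), the timestamp can be
pulled out with plain shell parameter expansion. The following is a
minimal sketch and not part of rdiff-backup itself: the function name
increment_timestamp is hypothetical, and it assumes names made up of the
original file name, a timestamp, an increment type, and an optional .gz
suffix.

```shell
# Hypothetical helper: print the timestamp embedded in an increment name
# such as "file.2006-02-15T10:56:05-07:00.diff.gz".
increment_timestamp() {
    name=${1##*/}      # drop any leading directory components
    name=${name%.gz}   # drop the compression suffix, if present
    name=${name%.*}    # drop the increment type (for example, diff or missing)
    # What remains is "<original name>.<timestamp>". The timestamp contains
    # no dots, so it is everything after the last remaining dot.
    printf '%s\n' "${name##*.}"
}
```

For example, running increment_timestamp on each name under increments/
gives a quick list of the backup times that touched a particular file.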
So, now if we run another backup the same way and look in the
rdiff-backup-data directory, we see:
backup.log
chars_to_quote
current_mirror.2006-02-15T11:15:01-07:00.data
error_log.2006-02-15T10:56:05-07:00.data.gz
error_log.2006-02-15T11:15:01-07:00.data.gz
file_statistics.2006-02-15T10:56:05-07:00.data.gz
file_statistics.2006-02-15T11:15:01-07:00.data.gz
increments/
mirror_metadata.2006-02-15T10:56:05-07:00.snapshot.gz
mirror_metadata.2006-02-15T11:15:01-07:00.snapshot.gz
session_statistics.2006-02-15T10:56:05-07:00.data
session_statistics.2006-02-15T11:15:01-07:00.data
Note that the new backup is about 20 minutes after the first. We see
there is only one "current_mirror" file, now pointing to the second
backup as being the most current. You can run backups as often as
your disk space and bandwidth allow.
Remote Backups
Now that I've shown how easy it is to do local backups to another
directory, let's look at remote backups. Rdiff-backup can use ssh
as an underlying transport to send backups to a remote directory.
You will need to make sure rdiff-backup is installed on both the
client machine and the machine to which you are sending your backups.
Also note that you will need the same version of rdiff-backup
on both ends, because the protocol sometimes changes between versions,
so keep the two installations closely synchronized.
Once rdiff-backup is installed on both machines, you can simply
do:
rdiff-backup localdir/ user@backupmachine::/backups/backup_of_local_dir/
Rdiff-backup will prompt you for your password or use your ssh key
and invoke rdiff-backup in a server mode on the remote machine. Note
that you may need to be root on the remote machine to write device
files or ACLs.
You can also set up rdiff-backup for unattended nightly backups.
You will need to set up your backup user with an ssh key that has no
passphrase but is restricted by IP address and command. There is a detailed
description about how to do this on the rdiff-backup Web site.
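As a rough sketch of what such a restricted key can look like: OpenSSH
lets an authorized_keys entry carry from= and command= options, and
rdiff-backup has a --server mode for the remote end. The helper below
only prints a candidate line. The function name, the example address,
and the choice of extra no-* options are assumptions for illustration,
so check them against the instructions on the rdiff-backup Web site
before relying on them.

```shell
# Hypothetical helper: print an authorized_keys line that lets one key
# run only "rdiff-backup --server", and only from one client address.
restricted_key_line() {
    client_ip=$1    # address of the machine being backed up (assumed)
    public_key=$2   # contents of the passphrase-less public key file
    printf 'from="%s",command="rdiff-backup --server",no-pty,no-port-forwarding,no-X11-forwarding %s\n' \
        "$client_ip" "$public_key"
}
```

The printed line would go into the backup user's ~/.ssh/authorized_keys
on the backup machine. The client can then run its usual rdiff-backup
command from cron without a password prompt.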
Restoring Data from Backups
Performing backups is fun and easy with rdiff-backup, but you
will probably reach a point where you need to restore data from
those backups. Rdiff-backup makes this pretty easy as well. Let's
look at a few common restore cases:
1. You need to restore the last backup of a file/directory:
This is very easy with rdiff-backup. You simply go to your backup
directory and use standard tools to cp/scp/rsync the file or directory
you want out. Since the main backup directory is always a copy of
the data as of the last backup, you can just go and get it. Note
that using non-ACL-aware tools might mean that you lose ACL information
when copying the file back. If you need to keep your ACL information,
see one of the methods below.
2. You need to restore a file/directory from last week's backup:
You can use rdiff-backup to get this file/directory from the backup
machine and restore it to /tmp on your local machine:
rdiff-backup -r 7D user@backupmachine::/remote-dir/file /tmp/file
This will get /remote-dir/file as it was 7 days ago. If no backup
was made exactly 7 days ago, rdiff-backup uses the most recent backup
before that time. You can specify all sorts of time/date
combinations. See the rdiff-backup man page for more details.
3. You aren't entirely sure which version of a particular file
you need, so you want to look at the various backup diffs and decide.
You can examine the rdiff-backup .diff.gz files for a particular
file and when you find the one you want to restore, you can tell
rdiff-backup that you want that particular diff. Do something like:
rdiff-backup \
user@backupmachine::/remote-dir/rdiff-backup-data/increments/file.YYYY-MM-DDTHH:MM:SS-TIMEZONE.diff.gz \
/tmp/file
You can also get rdiff-backup to list when increments were made:
rdiff-backup --list-increments user@backupmachine::/remote-dir/
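Restore case 2 above can be wrapped in a small function so that the
time specification is the only thing that changes. This is a sketch
under assumptions: restore_as_of and the RDIFF_BACKUP variable are
hypothetical names (the variable exists only so the command can be
previewed with echo), and the paths are the placeholders used above.
The time formats mentioned in the comments are among those described
in the rdiff-backup man page.

```shell
# Hypothetical wrapper around "rdiff-backup -r <time> <source> <destination>".
# Set RDIFF_BACKUP=echo to preview the arguments instead of running it.
restore_as_of() {
    when=$1     # e.g. 7D (7 days ago), 2006-02-15, or now
    src=$2      # file or directory inside the backup tree
    target=$3   # where to write the restored copy
    "${RDIFF_BACKUP:-rdiff-backup}" -r "$when" "$src" "$target"
}
```

A call such as
restore_as_of 7D user@backupmachine::/remote-dir/file /tmp/file
then reproduces the command shown in case 2.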
More Tips and Examples
Initial backups of large amounts of data take quite a while, because
all of your data must be copied over to the backup device. You should
make sure to perform initial backups over fast links if possible.
If you are doing backups over a slow/flaky link, you might consider
doing an rdiff-backup to another local directory first and then
doing something like an rsync to the remote machine from there.
This gets around a problem of the rdiff-backup never being able
to finish.
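The two-stage approach for slow or flaky links might look like the
following. It is only a sketch: staged_backup, the RDIFF_BACKUP and
RSYNC variables, and all paths are hypothetical, and the rsync flags
(-a to preserve attributes, --delete to drop files that no longer exist
locally, --partial to keep partially transferred files) are one
reasonable choice rather than a recommendation from the rdiff-backup
documentation.

```shell
# Hypothetical two-stage backup: rdiff-backup to a local staging
# directory, then rsync that directory to the remote machine.
# Set RDIFF_BACKUP=echo and RSYNC=echo to preview the arguments.
staged_backup() {
    src=$1       # directory to back up
    staging=$2   # local rdiff-backup target
    remote=$3    # rsync destination, e.g. user@backupmachine:/backups/dir/
    # Only sync to the remote machine if the local backup succeeded.
    "${RDIFF_BACKUP:-rdiff-backup}" "$src" "$staging" &&
        "${RSYNC:-rsync}" -a --delete --partial "$staging/" "$remote"
}
```

Note that rsync destinations use a single colon after the host name,
while rdiff-backup uses two. If the rsync step is interrupted, it can
simply be rerun, because the local rdiff-backup run has already finished.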
If you are using rdiff-backup to back up an entire Linux machine,
you may want to add: --exclude /selinux --exclude /sys --exclude
/proc to prevent rdiff-backup from complaining about these virtual
file systems.
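Put together, a whole-machine backup with those exclusions could be
sketched as below. The function name backup_whole_machine, the
RDIFF_BACKUP variable, and the destination path in the comment are
hypothetical. Only the three --exclude options come from the text above.

```shell
# Hypothetical wrapper: back up the root file system, skipping the
# virtual file systems named above.
# Set RDIFF_BACKUP=echo to preview the arguments instead of running it.
backup_whole_machine() {
    dest=$1   # e.g. user@backupmachine::/backups/myhost/ (assumed path)
    "${RDIFF_BACKUP:-rdiff-backup}" \
        --exclude /proc --exclude /sys --exclude /selinux \
        / "$dest"
}
```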
You will want to remove old backups when you need to free up space
on your backup media. You can use the --remove-older-than
option to do so. This option takes the same time specifications as
the other rdiff-backup options (e.g., 1D for 1 day,
1W for 1 week, 1Y for 1 year, and 1Y2W10D for
anything older than 1 year, 2 weeks, and 10 days). Note that rdiff-backup
is very space efficient with incremental backups, so you can monitor
and adjust how many backups you keep.
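A pruning step could be sketched as follows. The function name
prune_backups and the RDIFF_BACKUP variable are hypothetical. The
sketch also passes --force, on the assumption that your rdiff-backup
version, like the ones I am familiar with, declines to remove more than
one increment in a single run without it.

```shell
# Hypothetical wrapper around "rdiff-backup --remove-older-than".
# Set RDIFF_BACKUP=echo to preview the arguments instead of running it.
prune_backups() {
    age=$1          # e.g. 4W, or 1Y2W10D
    backup_dir=$2   # top of the backup tree (local path or user@host::/path)
    # --force allows several increments to be removed in one run.
    "${RDIFF_BACKUP:-rdiff-backup}" --remove-older-than "$age" --force "$backup_dir"
}
```

Previewing with RDIFF_BACKUP=echo first is a cheap way to double-check
the age and the directory before any increments are deleted.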
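You can use the --exclude and --include options
to exclude or include different file types or names. For example,
if you don't want to back up any of your iso images or mp3 files,
you could add --exclude '**.mp3' --exclude '**.iso' to your backup
command. The quotes keep the shell from expanding the patterns, and
rdiff-backup's ** wildcard matches any path, including the slashes.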
Rdiff-backup vs. Other Backup Software
Rdiff-backup is oriented toward online disk-based backups, so
it doesn't really compete with solutions that back up to tape. Because
of the cost of tapes and tape drives compared to disks, it makes
a lot of sense to use disks for backup.
There are several other solutions that are similar to rdiff-backup.
For a long time, I used a simple backup script that used rsync to
back up to a remote server. This method works fine, but incremental
backups are not done. If you need a backup of a file from before
it was modified, you are out of luck. Rdiff-backup wins here because
of its easy ability to store as many incremental backups as you
have space for.
Slightly more complex scripts with rsync can use the "hard-link"
option to "cp" to build a space-efficient copy of the rsync backup
directory after the rsync runs. However, creating these "incremental"
copies can take a very long time. Also, if you have large files
that have small additions made to them on a daily basis, like many
database and log files, this rsync mechanism stores a full copy
of the file every day. Rdiff-backup stores only the exact set of
changes within the file from day to day.
Dirvish is a Perl-based solution that has quite a lot in common
with rdiff-backup. Dirvish seems to be shipped only with Debian,
though, so it isn't as easily available as rdiff-backup. It also
requires a number of Perl modules to be installed. You must also
set up its "Banks" and "Vaults" to do backups; whereas rdiff-backup
creates the files it needs on the fly. Rdiff-backup seems to be
easier to use and install than Dirvish.
There is a project related to rdiff-backup called "duplicity". This package
works very much like rdiff-backup, but it uses gnupg to encrypt
all backups on the remote server. This provides more security in
your backups and allows you to restrict access to the data to those
who have the gnupg passphrase for the backup.
Possible Pitfalls
Rdiff-backup is a great backup solution, but like any application
it has some drawbacks:
- You will need enough space on your backup media for your entire
dataset, plus space for incrementals. This isn't really that big
of a deal in these days of large inexpensive disks, but it's worth
mentioning.
- The initial sync of large datasets takes a very long time.
My laptop with about 45 GB of data took about 15 hours to complete
its initial sync over wireless. A fast network is recommended
for initial backups. On the other hand, incremental backups are
very fast. My laptop takes about 25 minutes to complete an incremental
backup now.
- If a backup fails due to network or other issues, the next
backup will throw out the incomplete backup and start over. (See
the tips section above for a workaround for that.)
- Currently, SELinux contexts are not saved/restored properly
by rdiff-backup. I did some investigation and testing, and it
doesn't seem to properly detect the contexts. This could be an
issue for sites that use SELinux heavily. In the meantime, you
should check the SELinux contexts after any restore and manually
reset them to proper values or use the "relabel" option to have
SELinux re-label all files.
Conclusion
Rdiff-backup is an easy-to-install, easy-to-use backup solution
that does everything that most backup solutions need to do. Try
it out today.
References
Dirvish -- http://www.dirvish.com/
Duplicity -- http://www.nongnu.org/duplicity/
Librsync -- http://librsync.sourceforge.net/
Pylibacl -- http://pylibacl.sourceforge.net/
Pyxattr -- http://pyxattr.sourceforge.net/
Python -- http://www.python.org/
Rdiff-backup Web page -- http://rdiff-backup.org/
Kevin Fenzi, co-author of the Linux Security HOWTO, is a senior
member of tummy.com's team. He has been working as a Unix and Linux
Systems Administrator for more than 15 years. His passion is security
and all of the steps needed to ensure that systems are kept safe.