In an earlier post I talked generally about my backup procedure for large amounts of data. In the post I discussed using RsyncX to back up staff Work drives over a network, as well as my own personal Work drive data, to a spare hard drive. Today I'd like to get a bit more specific.
I do not use, nor do I recommend the version of rsync that ships with Mac OS X 10.4. I've found it, in my own personal tests, to be extremely unreliable, and unreliability is the last thing you want in a backup program. Instead I use — and have been using without issue for years now — RsyncX. RsyncX is a GUI wrapper for a custom-built version of the rsync command that's made to properly deal with HFS+ resource forks. So the first thing you need to do is get RsyncX, which you can do here. To install RsyncX, simply run the installer. This will place the resource-fork-aware version of rsync in /usr/local/bin/. If all you want to do is run rsync from the RsyncX GUI, then you're done, but if you want to run it non-interactively from the command-line — which ultimately we do — you should put the newly installed rsync command in the standard location, which is /usr/bin/.¹ Before you do this, it's always a good idea to make a backup of the OS X version. So:
Ah! Much better! Okay. We're ready to roll with local backups.²
Creating local backups with rsync is pretty straightforward. The RsyncX version of the command acts almost exactly like the standard *NIX version, except that it has an option to preserve HFS+ resource forks. This option must be provided if you're interested in preserving said resource forks. Let's take a look at a simple rsync command:
This command will backup the contents of the Work volume to another volume called Backup. The -a flag stands for "archive" and will simply backup everything that's changed while leaving files that may have been deleted from the source. It's usually what you want. The -vv flag specifies "verbosity" and will print what rsync is doing to standard output. The level of verbosity is variable, so "-v" will give you only basic information, "-vvvv" will give you everything it can. I like "-vv." That's just the right amount of info for me. The next two entries are the source and target directories, Work and Backup. The --eahfs flag is used to tell rsync that you want to preserve resource forks. It only exists in the RsyncX version. Finally, pay close attention to the trailing slash in your source and target paths. The source path contains a trailing slash — meaning we want the command to act on the drive's contents, not the drive itself — whereas the target path contains no trailing slash. Without the trailing slash on the source, a folder called "Work" will be created inside the WorkBackup drive. This trailing slash behavior is standard in *NIX, but it's important to be aware of when writing rsync commands.
That's pretty much it for simple local backups. There are numerous other options to choose from, and you can find out about them by reading the rsync man page.
One of the great things about rsync is its ability to perform operations over a network. This is a big reason I use it at work to back up staff machines. The rsync command can perform network backups over a variety of protocols, most notably SSH. It also can reduce the network traffic these backups require by only copying the changes to files, rather than whole changed files, as well as using compression for network data transfers.
The version of rsync used by the host machine and the client machine must match exactly. So before we proceed, copy rsync to its default location on your client machine. You may want to back up the Mac OS X version on your client as well. If you have root on both machines you can do this remotely on the command line:
Backing up over the network isn't too much different or harder than backing up locally. There are just a few more flags you need to supply. But the basic idea is the same. Here's an example:
This is pretty similar to our local command. The -a flag is still there, and we've added the -z flag as well, which specifies to use compression for the data (to ease network traffic). We now also have an -e flag which tells rsync that we're running over a network, and an SSH option that specifies the protocol to use for this network connection. Next we have the source, as usual, but this time our source is a computer on our network, which we specify just like we would with any SSH connection — hostname:/Path/To/Volume. Finally, we have the --eahfs flag for preserving resource forks. The easiest thing to do here is to run this as root (either directly or with sudo), which will allow you to sync data owned by users other than yourself.
Unattended Network Backups
Running backups over the network can also be
completely automated and can run transparently in the background even on systems where no user is logged in to the Mac OS X GUI. Doing this over SSH, of course, requires an SSH connection that does not interactively prompt for a password. This can be accomplished by establishing authorized key pairs between host and client. The best resource I've found for learning how to do this is Mike Bombich's page on the subject. He does a better job explaining it than I ever could, so I'll just direct you there for setting up SSH authentication keys. Incidentally, that article is written with rsync in mind, so there are lots of good rsync resources there as well. Go read it now, if you haven't already. Then come back here and I'll tell you what I do.
I'd like to note, at this point, that enabling SSH authentication keys, root accounts and unattended SSH access is a minor security risk. Bombich discusses this on his page to some extent, and I want to reiterate it here. Suffice to say, I would only use this procedure on a trusted, firewalled (or at least NATed) network. Please bear this in mind if you proceed with the following steps. If you're uncomfortable with any of this, or don't fully understand the implications, skip it and stick with local backups, or just run rsync over the network by hand and provide passwords as needed. But this is what I do on our network. It works, and it's not terribly insecure.
Okay, once you have authentication keys set up, you should be able to log into your client machine from your server, as root, without being prompted for a password. If you can't, reread the Bombich article and try again until you get it working. Otherwise, unattended backups will fail. Got it? Great!
I enable the root account on both the host and client systems, which can be done with the NetInfo Manger application in /Applications/Utilities/. I do this because I'm backing up data that is not owned by my admin account, and using root gives me the unfettered access I need. Depending on your situation, this may or may not be necessary. For the following steps, though, it will simplify things immensely if you are root:
Now, as root, we can run our rsync command, minus the verbosity, since we'll be doing this unattended, and if the keys are set up properly, we should never be prompted for a password:
This command can be run either directly from cron on a periodic basis, or it can be placed in a cron-run script. For instance, I have a script that pipes verbose output to a log of all rsync activity for each staff machine I back up. This is handy to check for errors and whatnot, every so often, or if there's ever a problem. Also, my rsync commands are getting a bit unwieldy (as they tend to do) for direct inclusion in a crontab, so having the scripts keeps my crontab clean and readable. Here's a variant, for instance, that directs the output of rsync to a text file, and that uses an exclude flag to prevent certain folders from being backed up:
This exclusion flag will prevent backup of anything called "Archive" on the top level of mac01's Work drive. Exclusion in rsync is relative to the source directory being synced. For instance, if I wanted to exclude a folder called "Do Not Backup" inside the "Archive" folder on mac01's Work drive, my rsync command would look like this:
The above uses of rsync, as I mentioned before, will not delete files from the target that have been deleted from the source. They will only propagate changes that have occurred on the existing files, but will leave deleted files alone. They are semi-non-destuctive in this way, and this is often useful and desirable. Eventually, though, rsync backups will begin to consume a great deal of space, and after a while you may begin to run out. My solution to this is to periodically mirror my sources and targets, which can be easily accomplished with the --delete option. This option will delete any file from the target not found on the source. It does this after all other syncing is complete, so it's fairly safe to use, but it will require enough drive space to do a full sync before it does its thing. Here's our network command from above, only this time using the --delete flag:
Typically, I run the straight rsync command every other day or so (though I could probably get away with running it daily). I create the mirror at the end of each month to clear space. I back up about a half dozen machines this way, all from two simple shell scripts (daily and weekly) called by cron.
I realize that this is not a perfect backup solution. But it's pretty good for our needs, given what we can afford. And so far it hasn't failed me yet in four years. That's not a bad track record. Ideally, we'd have more drives and we'd stagger backups in such a way that we always had at least a few days backup available for retrieval. We'd also probably have some sort of backup to a more archival medium, like tape, for more permanent or semi-permanent backups. We'd also probably keep a copy of all this in some offsite, fireproof lock box. I know, I know. But we don't. And we won't. And thank god, 'cause what a pain in the
ass that must be. It'd be a full time job all its own, and not a very fun one. What this solution does offer is a cheap, decent, short-term backup procedure for emergency recovery of catastrophic data loss. Hard drive fails? No trouble. We've got you covered.
Hopefully, though, this all becomes a thing of the past when Leopard's Time Machine debuts. Won't that be the shit?
1. According to the RsyncX documentation, you should not need to do this, because the RsyncX installer changes the command path to its custom location. But if you'll be running the command over the network or as root, you'll either have to change that command path for the root account and on every client, or network backups will fail. It's much easier to simply put the modified version in the default location on each machine.