On Backups

Let me just say up front: historically, I've been terrible about backing up my data. But I'm working on it.

As far as backups go, I've tried a lot of things. I am responsible for backups of staff data at work, and that is where the bulk of my trials have occurred. For my personal data I've always just archived things to CD or DVD as my drive got full, or as certain projects wrapped up, but I've never had any sort of emergency backup in case of something like a drive failure or other catastrophe. Both at home and at work, though, the main problem I've faced has been the ever-expanding amount of data I need to back up. Combined staff data typically takes up a few hundred gigabytes of disk space. And at home my Work partition (I store all user data on a partition separate from the System) currently uses 111 GB. This does not even take into account the multiple FireWire drives attached to my system at any given time. All told, we're talking several hundred gigabytes of data on my home system alone. I don't know what "the best" way is to back all this up, but I think I have a pretty good solution both at home and at work.

The Olden Days
Back in the day, staff backups were performed with Retrospect to a SCSI DAT drive. This was in the OS 9 days. The tapes each held about 60 GB, if memory serves, and this worked fine for a while. But with the high price of tapes, a limited tape budget, and ever-increasing storage needs, the Retrospect-to-tape route quickly became outmoded for me. It became a very common occurrence for me to come in on any given morning only to find that Retrospect had not completed a backup and was requesting additional tapes. Tapes which I did not have, nor could I afford to buy. Retrieving data from these tapes was also not always easy, and certainly never fast. And each year these problems grew worse as drive capacities increased, staff data grew, and tape capacities for our $3000 tape drive remained the same. The tape solution just didn't scale.

Enter Mac OS X
When Mac OS X arrived on the scene, I immediately recognized the opportunity — and the need — to revise our staff backup system. First off, Retrospect support was incredibly weak for OS X in those early days. Second, even when it did get better, there continued to be many software and kernel extension problems. Third, SCSI — which most tape drives continue to use to this day — was on the way out, annoying as hell, and barely supported in OS X. Fourth, the tape capacity issue remained. On the other hand, from what I was reading, Mac OS X's UNIX underpinnings would provide what sounded like a free alternative, at least on the software side: rsync. My two-pronged revision of our backup system consisted of replacing Retrospect with rsync and replacing tape drives with ever-cheaper, ever-larger hard drives.

RsyncX
The only problem with the UNIX rsync was that it famously failed to handle HFS+ resource forks (as did, incidentally, Retrospect at the outset). This situation was quickly remedied by the open source community with the wonderful RsyncX. RsyncX is a GUI wrapper around a version of rsync that is identical in most respects to the original UNIX version except that it is capable of handling resource forks. Once I discovered RsyncX, I was off to the races, and I haven't found anything to date, including the Tiger version of rsync, that does what I want better.

My Process
These days I do regular, weekly staff backups using RsyncX over SSH to a FireWire drive. For my personal data, I RsyncX locally to a spare drive. This is the most economical and reliable data backup solution I've found, and it's far more scalable than tape or optical media. It's also been effective. I've been able to recover data on numerous occasions for various staff members.
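For anyone who wants something concrete, here is a rough sketch of what an rsync-over-SSH run might look like, written as a small Python wrapper. It is not my actual script. The host names, the `backup` user, the paths, and the decision to pull from each machine to the backup drive are all assumptions for illustration, and it uses only the stock rsync options `-a` and `-e ssh`. Whatever option your rsync build needs for resource forks is deliberately left out, since that differs between builds.

```python
import subprocess

# Hypothetical staff machines and paths; substitute your own.
STAFF_HOSTS = ["alice-mac.example.org", "bob-mac.example.org"]
REMOTE_PATH = "/Users/"
BACKUP_ROOT = "/Volumes/StaffBackup"


def ssh_pull_command(host, remote_path=REMOTE_PATH,
                     backup_root=BACKUP_ROOT, user="backup"):
    """Build an rsync command that pulls remote_path from host over SSH."""
    source = f"{user}@{host}:{remote_path}"
    target = f"{backup_root}/{host}/"
    # -a recurses and preserves permissions, times, ownership and symlinks;
    # -e ssh tells rsync to use SSH as the transport.
    return ["rsync", "-a", "-e", "ssh", source, target]


def back_up_staff(hosts=STAFF_HOSTS, dry_run=True):
    """Print each command, or actually run it when dry_run is False."""
    for host in hosts:
        cmd = ssh_pull_command(host)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises an error if rsync exits with a failure code.
            subprocess.run(cmd, check=True)
```

Calling `back_up_staff()` with the default `dry_run=True` only prints the commands, which is a safe way to check the paths before letting it touch any data. Each machine gets its own folder on the backup drive so that one person's files never overwrite another's.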

My system is not perfect, but here's what I do: Every day I use RsyncX to perform an incremental backup to an external hard drive. Incremental backups only copy the changes from source to target (so they're very fast), but any data that has been deleted from the source since the last backup remains on the target. So each day, all new files are added to the backup, and any changes to files are propagated to said backup, but any files I've deleted will remain backed up. Just in case. Eventually, as I'm sure you've guessed, the data on my backup drive will start to get very large. So, at the end of each month (or as needed) I perform a mirror backup, which deletes on the target any file not found on the source, essentially creating an exact duplicate of the source. This is all run via shell scripts and automated with cron. Finally, every few months or so (okay, more like every year), I back up data that I want as part of my permanent archive (completed projects, email and whatnot) to optical media. I catalog this permanent archive using the excellent CDFinder.
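The difference between the two kinds of backup comes down to a single rsync option. Here is a minimal sketch of the idea in Python rather than my actual shell scripts. The paths are made up, and it uses stock rsync's `-a` and `--delete` options only, leaving out any resource-fork option since that depends on which rsync build you have.

```python
import subprocess


def rsync_command(source, target, mirror=False):
    """Build an rsync command for an incremental or a mirror backup."""
    cmd = ["rsync", "-a"]
    if mirror:
        # --delete removes files from the target that no longer exist on
        # the source, turning the target into an exact duplicate.
        cmd.append("--delete")
    # A trailing slash on the source means "copy the contents of this
    # directory" rather than the directory itself.
    cmd += [source.rstrip("/") + "/", target]
    return cmd


def run_backup(source, target, mirror=False):
    """Run the backup; raises an error if rsync reports a failure."""
    subprocess.run(rsync_command(source, target, mirror), check=True)


# Hypothetical paths, for illustration only.
daily = rsync_command("/Volumes/Work", "/Volumes/Backup/Work")
monthly = rsync_command("/Volumes/Work", "/Volumes/Backup/Work", mirror=True)
```

Without `--delete`, rsync copies new and changed files but never removes anything from the target, which is exactly the "just in case" behavior of the daily run. For scheduling, a crontab entry with the time fields `0 2 * * *` would fire the daily run at 2 a.m., and `0 3 1 * *` would fire the mirror at 3 a.m. on the first of each month, which is close enough to "the end of each month" for this purpose.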

Almost Perfect
There are some obvious holes in this system, though: What if I need to revert to a previous version of a file? What if I need a deleted file and I've just performed the mirror backup? Yes. I've thought about all of this. Ideally this would be addressed by having a third hard drive and staggering backups between the two backup drives. A scenario like this would allow me to always have a few weeks' worth of previous versions of my data, while still allowing me to keep current backups as well. Alas, while I have the plan, I don't have the drives. Maybe someday. But for now this setup works fine for most of my needs and protects me and the staff against the most catastrophic of situations.
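Since I don't have the drives, I haven't built this, but one way the staggering could work is to alternate the two backup drives week by week, so that whichever drive isn't being written to still holds the previous week's state. A rough Python sketch of just the drive-picking part follows. The mount points are hypothetical, and weekly alternation is only one possible reading of "staggering".

```python
from datetime import date, timedelta

# Hypothetical mount points for the two backup drives.
BACKUP_DRIVES = ["/Volumes/BackupA", "/Volumes/BackupB"]


def drive_for(day, drives=BACKUP_DRIVES):
    """Pick a backup drive by alternating on whole weeks (Monday to Sunday)."""
    # toordinal() counts days starting at 1 for 0001-01-01, which is a
    # Monday, so (ordinal - 1) // 7 is a week number that ticks over
    # every Monday and never resets at a year boundary.
    week_number = (day.toordinal() - 1) // 7
    return drives[week_number % len(drives)]


# Show which drive would be used on three consecutive Mondays.
start = date(2024, 1, 1)
for offset in (0, 7, 14):
    day = start + timedelta(days=offset)
    print(day, drive_for(day))
```

A backup script would call `drive_for(date.today())` and use the result as its target. Counting weeks from a fixed starting point, rather than using the week-of-year number, avoids hitting the same drive twice in a row when the year rolls over.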

Consider Your Needs
Still, when devising a backup scheme, it's important to understand exactly what you need backups to do. Each situation presents a unique problem and has a unique set of requirements. Do you need a permanent, historical archive that's always available? Or do you simply need short-term emergency backup? Do you need versioning? What data needs to be backed up and what doesn't? For my needs, previous versions are less important; emergency backups are critical. You also need to consider how much data you have and what medium is most appropriate for storing it, with an eye towards the future. In my case I have a lot of data, and I always will. Hard drives are the most economical way for me to store my large backups (as data needs grow, so too do drive capacities), and they are also the most future-proof. In a few years we may not be using DVDs anymore, or tapes. But drives will be around in some form or another for the foreseeable future, and they'll continue to get bigger and bigger. And since I'm not so much worried about having a permanent archive of my backup data (except in the case of data archived to optical media), I can continually and easily upgrade my storage by either purchasing new drives every so often, or by adding additional drives as needed. And transferring the data to these new drives will be faster than it would be with any other medium (tape and optical media are slow). This system scales. And while it may be less reliable over the long term than optical or tape, it's plenty reliable for our needs and easily upgradeable in the future.

Lately everyone seems to be talking about backup solutions. Mark Pilgrim recently wrote an intriguing post asking how to archive vast amounts of data over the next 50 years. I don't think there's an easy answer there, and my solution would not help him one bit. But it did inspire me to share my thoughts on the matter of backups, and my own personal system. It's certainly not the be-all-end-all of backup systems, and if others have thoughts on this complex and important topic, feel free to post them in the comments. I'd be curious to hear.