Tiger Lab Migration Part 11: Panasas Crashes and Caches

The last time we visited this topic, I thought we were done. Well, turns out I was wrong.

Things are, for the most part, working well now. Finally. We're running Tiger and we've managed to iron out the bulk of the problems. There is one issue which has persisted, however: the home account RAID.

To refresh, our network RAID, which is responsible for housing all our users' home account data, is made by a company called Panasas. Near as I can figure, we've got some experimental model, 'cause boy does it crash a lot. Which is not what you want in a home account server, by any means. After upgrading the Panasas OS awhile back, the crashing had stopped. But it was only temporary. Lately the crashing is back with a vengeance. Like every couple of days it goes down. And when it goes down, it goes down hard. Like physical-reset hard. Like pull-the-director-blade-and-wait hard. Like sit-and-wait-for-the-RAID-to-rebuild hard.

Again: Not what you want in a home account server.

So we've built a new one. Actually, we've swapped our backup and home account servers. See, awhile back we decided it would be prudent to have a backup of the home account server. Just something quick 'n' dirty. Just in case. This was built on a reasonably fast custom box with a RAID controller and a bunch of drives. It's a cheap solution, but it does what we need it to, and it does it well. And now we're using it as the new home account server. So far it's been completely stable. No crashes in a week-and-a-half. Keep in mind, this is a $3000 machine, not a $10,000 network RAID. It's not that fast, but it's fast enough. And it's stable. By god it's stable.

And that's what you want in a home account server.

Moving to the new server (which, by the way, is a simple Linux box running Fedora Core 4) has afforded us the opportunity to change, or actually revert, the way login happens on the Macs. In the latter half of last semester, we were using a login hook that called mount_nfs because of problems with how Mac OS X 10.4.2 handled our Panasas setup, which creates a separate volume (read: filesystem) for each user account. Since we're now just using a standard Linux box to share our home accounts, which are now just folders, we have the luxury of reverting to the original method of mounting user accounts that we used last year under Mac OS X 10.3. That is, the home account server is mounted at boot time in the appropriate directory using automount, rather than with mount_nfs at login.

Switching back was pretty simple: Disable the login hook (by deleting root's com.apple.loginwindow.plist file), place a Startup Item that calls the new server in /Library/StartupItems, reboot and you're done, right? Well, not quite. There's one last thing you need to do before you can proceed. Seems that, even after doing all of the above, the login hook was still running. I could delete the scripts it called, but it would still run. Know why? This will blow your mind. Cache.

Yup. It turns out (and who would have ever suspected this, as it's so incredibly stupid) that login hooks get cached somewhere in /Library/Caches and will continue to run until these caches are deleted. I'm sorry, but I just have to take a minute and say, that is fucked up. Why would such a thing need to be cached? I mean maybe there's a minimal speed boost from doing this. The problem is that now you have a system-level behavior that's in cache, and these caches are fairly persistent. They don't seem to reset. And they don't seem to update. This is like if your browser only used cached pages to show you websites, and never compared the cache to files on the server. You'd never be able to see anything but stale data without going and clearing the browser cache. At least in a browser ('cause let's face it, this does happen from time to time, though not very often) there is always some mechanism for clearing caches: a button, a menu item, a preference. In Mac OS X there is no such beast. In fact, the only way to delete caches in Mac OS X is to go to one or all of the various Cache folders and delete them by hand. Which is what I did, and which is what finally stopped the login scripts from running.
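For anyone who'd rather not do the by-hand part by hand, here's a rough sketch of the same cleanup in Python. To be clear, this is not what I ran, and it rests on some assumptions: a modern Python 3 interpreter (which a Tiger-era Mac doesn't ship with), and a function name, clear_cache_dir, that I made up for illustration. I never pinned down exactly which file under /Library/Caches held the hook, so the sketch just empties whatever folder you point it at. It defaults to a dry run that only lists what it would remove.

```python
import shutil
from pathlib import Path


def clear_cache_dir(cache_root, dry_run=True):
    """Delete every entry directly inside cache_root, keeping the folder itself.

    Hypothetical helper for illustration. Returns the list of paths that
    were removed (or, in a dry run, that would have been removed).
    """
    root = Path(cache_root)
    if not root.is_dir():
        return []

    removed = []
    for entry in sorted(root.iterdir()):
        removed.append(entry)
        if dry_run:
            continue  # only report, touch nothing
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real directory: remove it and its contents
        else:
            entry.unlink()  # plain file or symlink
    return removed
```

Calling clear_cache_dir("/Library/Caches") lists the candidates; passing dry_run=False (as root) actually deletes them. A reboot afterward is probably wise, since that's what I did after clearing things manually.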

If this isn't clear evidence that Mac OS X needs much better cache management, I don't know what is.
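When a hook keeps running after you think you've killed it, it helps to rule out the preference file before blaming the cache. Here's a minimal sketch of that check, again in modern Python 3 and again not something I actually used. The function names are hypothetical, and the path I'd point it at (/var/root/Library/Preferences/com.apple.loginwindow.plist) is my assumption about where root's copy lives. It also shows a gentler alternative to deleting the whole file: dropping just the LoginHook key.

```python
import plistlib
from pathlib import Path


def read_login_hook(plist_path):
    """Return the LoginHook value stored in a loginwindow plist, or None."""
    path = Path(plist_path)
    if not path.is_file():
        return None
    with path.open("rb") as fh:
        prefs = plistlib.load(fh)  # handles both XML and binary plists
    return prefs.get("LoginHook")


def remove_login_hook(plist_path):
    """Drop only the LoginHook key, leaving any other settings in place.

    Returns True if a key was removed, False if there was nothing to remove.
    """
    path = Path(plist_path)
    if not path.is_file():
        return False
    with path.open("rb") as fh:
        prefs = plistlib.load(fh)
    if "LoginHook" not in prefs:
        return False
    del prefs["LoginHook"]
    with path.open("wb") as fh:
        plistlib.dump(prefs, fh)  # note: rewrites the file as an XML plist
    return True
```

If read_login_hook comes back None and the hook still fires at login, the preference file isn't the culprit, which is exactly the situation I was in.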

In any case, we're now not only happily running Tiger in the lab, but we've effectively switched over to a new home account server as well. So far, so good. Knock wood and all that. Between the home account problems, the Tiger migration, and getting ready for our server migration, this has been one of the busiest semesters ever. Though I keep this site anonymous because I write about work, I just want to give a nod to all the people who've helped me with all of the above. I certainly have not been doing all of this alone (thank god) and they've been doing kick-ass work. And, though I can't mention them by name, I really appreciate them for it. At the dawn of a new semester, we've finally worked out all of our long-standing problems and can get down to more forward-looking projects.

So ends (I hope) the Tiger Lab Migration.