We recently installed in our machine room a brand-spankin' new RAID for hosting network home accounts. We bought this RAID as a replacement for our aging and horrendously unreliable Panasas RAID. The Panasas was a disaster for almost the entire three-year span of its lease. It used a proprietary operating system based on some flavor of *NIX (which one I can't recall right at this moment), but with all sorts of variations from a typical *NIX install that made using it as a home account server far more difficult than it ever should have been. To be fair, it was never really intended for such a use, but was rather created as a file server cluster for Linux workstations that can be easily managed directly from a web browser, as opposed to the command line. It was built for speed, not stability, and it was completely the wrong product for us. (And for the record, I had nothing to do with its purchase, in case you're wondering.)
What the Panasas was, however, was instructive. For three years we lived under the shadow of its constant crashing, the near-weekly TCP dumps and help requests to the company, and angry users fed up with a system that occasionally caused them to lose data, and frequently caused their machines to lock up for the duration of a Panasas reboot, which could take up to twenty minutes. It was not fun, but I learned a lot from it, and it enabled me to make some very serious decisions.
My recent promotion to Senior Systems Administrator came just prior to the end of our Panasas lease term. This put me in the position of both purchasing a new home account server, and of deciding the fate of networked home accounts in the lab.
If I'd learned anything from the experience with the Panasas it was this: A home account server must be, above all else, stable. Every computer that relies on centralized storage for home account serving is completely and utterly dependent on that server. If that server goes down, your lab, in essence, goes down. When this starts happening a lot, people begin to lose faith in a lot of things. First and foremost, they lose faith in the server and stop using it, making your big, expensive network RAID a big, expensive waste of money. Secondly, they lose faith in the system you've set up, which makes sense because it doesn't work reliably, and they stop using it, favoring instead whatever contingency plan you've set up for the times when the server goes down. In our case, we set up a local user account for people to log into when the home account server was down. Things got so bad for a while that people began to log in using this local account more than they would their home accounts, thus negating all our efforts at centralizing home account data storage. Lastly, people begin to lose faith in your abilities as a systems administrator and lab manager. Your reputation suffers, and that makes it harder to get things done — even improvements. So, stability. Centralization of a key resource is risky, in that if that resource fails, everything else fails with it. Stability of crucial, centralized storage was key if any kind of network home account scenario was going to work.
The other thing I began to assess was the whole idea of networked home accounts themselves. I don't know how many labs use networked home accounts. I suspect there are quite a few, but there are also probably a lot of labs that don't. I know I've read about a lot of places that prefer local accounts that are not customized and that revert to some default state at every log in/out. Though I personally really like the convenience of customized network home accounts that follow you from computer to computer throughout a facility, it certainly provides a fair amount of hassle and risk. When it works it's great, but when it doesn't work, it's really bad. So I really began to question the whole idea. Is this something we really needed or wanted to continue to provide?
My ultimate decision was intimately linked to the stability of the home account server. From everything I've seen, networked home accounts can and do work extremely well when the centralized storage on which they reside is stable and reliable. And there is value to this. I talked to people in the lab. By and large, from what I could glean from my very rudimentary and unscientific conversations with users, people really like having network home accounts when they work properly. When given the choice between a generic local account or their personalized network account, even after all the headaches, they still ultimately prefer the networked account. So it behooves us to really try to make it work and work well. And, again, everything I saw told me that what this really required, more than anything else, was a good, solid, robust and reliable home account server.
So, that's what we tried our best to get. The new unit is built and configured by a company called Western Scientific, which was recommended to me by a friend. It's called the Fusion SA. It's a 24-bay storage server running Fedora Core 5 Linux. We've populated 16 of the bays with 500GB drives and configured them at RAID level 5, giving us, when all is said and done, about 7TB of networked storage with room to grow in the additional bays should we ever want to do so. The unit also features a quad-port GigE PCI-X card whose ports we can trunk together for speedy network access. It's big and it's fast. But what's most important is its stability.
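For the curious, here is a back-of-the-envelope sketch of where a figure like "about 7TB" can come from. It assumes things the setup above doesn't spell out: that all 16 drives sit in a single RAID 5 set with no hot spares, that "500GB" is the vendor's decimal gigabytes, and that filesystem overhead is ignored. The function name and constants are made up for illustration.

```python
def raid5_usable_bytes(drive_count: int, drive_bytes: int) -> int:
    """Usable capacity of one RAID 5 set: one drive's worth of space goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_bytes


# Assumption: vendor-style decimal gigabytes (1 GB = 10**9 bytes).
DRIVE_BYTES = 500 * 10**9

# Assumption: a single 16-drive RAID 5 set, no hot spares.
usable = raid5_usable_bytes(16, DRIVE_BYTES)

usable_tb = usable / 10**12   # decimal terabytes, as drive vendors count
usable_tib = usable / 2**40   # binary tebibytes, as many OS tools report

print(f"{usable_tb:.2f} TB decimal, {usable_tib:.2f} TiB binary")
```

Under those assumptions the raw usable space works out to 7.5 decimal terabytes, or a bit under 7 when counted in binary units, which squares with the rough figure above once formatting overhead is taken off the top.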
Our new RAID came a little later than we'd hoped, so we weren't able to test it before going live with it. Ideally, we would have gotten the unit mid-summer and tested it in the lab while maintaining our previous system as a fall-back. What happened instead was that we got the unit in about the second week of the semester, and outside circumstances eventually necessitated switching to the new RAID sans testing. It was a little scary. Here we were in the third week of school switching over to a brand new but largely untested home account server. It was at this point in time that I decided, if this thing didn't work — if it wasn't stable and reliable — networked home accounts would become a thing of the past.
So with a little bit of fancy footwork we made the ol' switcheroo, and it went so smoothly our users barely noticed anything had happened. Installing the unit was really a simple matter of getting it in the rack, and then configuring the network settings and the RAID. This was exceptionally quick and easy, thanks in large measure to the fact that Western Scientific configured the OS for us at the factory, and also to the fact that they tested the unit for defects prior to shipping it to us. In fact, our unit was late because they had discovered a flaw in the original unit they had planned to ship. Perfect! If that's the case, I'm glad it was late. This is exactly what we want from a company that provides us with our crucial home account storage. If the server itself was as reliable as the company was diligent, we most likely had a winner on our hands. So, how has it been?
It's been several weeks now, and the new home account server has been up, without fail or issue, the entire time. So far our new home account server has been extremely stable (so much so that I almost forget about it, until, of course, I walk past our server room and stop to dreamily look upon its bright blue drive activity lights dutifully flickering away without pause). And if it stays that way, user confidence should return to the lab and to the whole idea of networked home accounts in fairly short order. In fact, it seems like it already has to a great extent. I couldn't be happier. And the users?... Well, they don't even notice the difference. That's the cruel irony of this business: When things break, you never hear the end of it, but when things work properly, you don't hear a peep. You can almost gauge the success or failure of a system by how much you hear about it from users. It's the ultimate in "no news is good news." The quieter the better.
And 'round these parts of late, it's been pin-drop quiet.