Should Labs be Dual-Boot?

With today's official arrival of Windows-on-Mac, the question has already been posed, both online and to me personally. All day long people have been emailing and stopping by my office to talk about this development. It's truly incredible. And, of course, one person even suggested, as I knew someone would, that we purchase all Macs for the lab and make everything dual-boot Mac OS X and Windows. This is not the first time something like this has been proposed. In fact, just recently students were clamoring for dual-boot Linux/Windows systems (though what they really wanted was more Windows and fewer Linux systems). This issue just keeps coming up, and at this point I'm starting to wonder whether it might not be such a terrible idea after all, and how best to implement such a thing.

My arguments in the past were essentially threefold: 1) Users who are less technically inclined would be put off by dual-boot systems, and we're really trying to make things simpler around here, so complicating the boot process was maybe not the way to go; 2) Building two systems for every computer means twice the work; 3) Maintaining and troubleshooting such systems becomes dramatically more difficult. Let's address these point by point.

Inexperienced User Issues
I worry that new or inexperienced users would find the dual-boot process intimidating. While I still think this holds true to some extent, I must admit the bootloader provided by Apple looks astoundingly easy. Downright appealing in fact. I would expect nothing less from Apple. I think this minimizes, but does not negate, the difficulty factor for new or less experienced users. But I'm also on a big push here in this lab to simplify the way things work in general (in the grand Apple tradition). Currently everything from logging in to file sharing is just too complicated. And needlessly so. Making all our systems dual-boot would only add to the confusion. So if we ever added this wrinkle, I'd want to vastly simplify what we currently have first.

Twice the OSes Equals Twice the Work
Any computer that is dual-boot must have both Windows and Mac operating systems installed on it. Each OS must also have a full suite of applications installed. Thus, building such a system would require twice as much work as its single-boot counterpart. If there were a way to build one dual-boot system and then clone it to our other computers it would greatly simplify things. But something tells me that it won't be that easy. I have a feeling the Mac will not be able to clone the Windows partition, making rollouts an absolute nightmare. If it turns out not to be so bad, that will go a long way toward recommending the dual-boot approach. But we shall see.

Maintenance and Troubleshooting Issues
A dual-boot system is not only twice as much work to build, it's also twice as much work to maintain. Any update that needs to be applied, be it system or software, must be done from the appropriate OS. So if a system is booted in Windows, and I have Mac updates to perform, I have to physically walk over to that system and boot it into the Mac OS before I can perform the update. Any system or hardware problem the computer might have must also go through additional new layers of troubleshooting. And if the fix requires any sort of wiping of the hard drive, restoration of the affected system will be twice as difficult as it would on a single-boot machine. Not to mention the fact that, although this probably wouldn't be the case, it is possible that a dual-boot system, because of its complexity, would be less stable than a single-boot computer.
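
For what it's worth, the Mac half of that update chore can at least be done from the command line (locally or over ssh), which softens the walking-around problem a little. A minimal sketch, and it only helps when the machine happens to be booted into Mac OS X at that moment:

# Install all available Apple updates from the command line;
# run locally or over ssh, but only while booted into Mac OS X
sudo softwareupdate -i -a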

Pros and Cons
So I ask, should the lab be dual-boot? After thinking about the above points I would answer no. At least not our lab, and at least not yet. There are situations in which multi-OS machines are desirable, but there are also situations in which dedicated systems are preferable. In our lab we have a pretty good balance of dedicated Windows and Mac systems. We have enough of each so that every user gets ample time in his or her operating system of choice. Systems are continually in use, yet no one is fighting for resources. We don't need more Windows machines, and we don't need more Macs. So the only advantage I can see to a dual-boot lab is the convenience of choosing your OS from any computer. The potential drawbacks are many, however: greater confusion, system instability and significantly more work for our systems staff, to name a few.

I can see situations in which dual-boot machines would be a huge boon. Specifically, if you have a limited computer budget, an Intel Mac now gives you twice the bang for your hardware buck: every such computer can essentially be two computers. But our computer budget is not limited, and there may still be good arguments for using non-Mac hardware for a lot of the things we do here. So I don't see lab-wide dual-boot systems for our lab anytime in the near future.

But then again, you never know.

UPDATE:
Looks like I'm not alone. Other lab administrators chime in. Here's a choice quote from rambleon.org's Jay Young that sums up exactly how I feel about all this from a sysadmin point of view:

"It’s hard enough for any one of us to keep one Operating System maintained, let alone two. Especially when you have to restart the whole system to get to the other one."

Right on, brother!

External Network Unification Part 1: Research and Development

"Systems Boy! Where are you?"

I realize posting has been lean, lo these past couple weeks. This seems to be a trend in the Mac-blog world. I think some of this has to do with the recent dearth of interesting Mac-related news. In my case, however, it's also due to a shocking lack of time brought on by my latest project.

You may or may not remember, or for that matter even care, about my ongoing Three Platforms, One Server series which deals with my efforts to unify network user authentication across Mac, Windows and Linux systems in my lab. Well, that project is all but done, except for the implementation, which we won't really get to do until the Summer, when students are sparse, time is plentiful, and love is in the air. (Sorry, but if you manage a student lab, you'll probably understand how I might romanticize the Summer months a bit.) Anyway, we've got our master plan for user authentication on our internal network pretty much down, so I've turned my attention to the external network, which is what I've been sweatily working on for the last two weeks.

Our external network (which, for the record, has only recently come under my purview) is made up of a number of servers and web apps to which our users have varying degrees of access. Currently it includes:

  1. A mail server
  2. A web host and file server
  3. A QuickTime Streaming Server
  4. A community site built on the Mambo CMS
  5. An online computer reservations system

Beyond these five systems, additional online resources are being proposed. The problem with the way all this works right now is that, as with our internal network, each of these servers and web apps relies on its own separate user database for authentication. This is bad for a number of reasons:

  1. Every new user has to be created on five different systems, which is far more time-consuming and error-prone than it should be
  2. Users cannot easily change their passwords across all systems
  3. The system is not in any way scalable because adding new web apps means adding new databases, which compounds the above problems
  4. Users often find this Byzantine system confusing and difficult to use, so they use it less and get less out of it

The goal here, obviously, is to unify our user database and thereby greatly improve the operation, maintenance, usability and scalability of this system. There are a number of roadblocks and issues here that don't exist on the internal network:

  1. There are many more servers to unify
  2. Some of the web apps we use are MySQL/PHP implementations, which is technology I don't currently know well at all
  3. Security is a much bigger concern
  4. No one on staff, myself included (although I'm getting there), has a thorough, global understanding of how this should be implemented; meanwhile, these servers, databases and web apps are maintained and operated by many different people, each with a different level of understanding of the problem
  5. All of these systems have been built piecemeal over the years by several different people, many of whom are no longer around, so we also don't completely understand quite how things are working now

All of these issues have led me down the path upon which I currently find myself. First and foremost, an overarching plan was needed. What I've decided on, so far, is this:

  1. The user database should be an LDAP server running on some form of BSD, which should be able to host user info for all our servers without too much trouble
  2. The web apps can employ whatever database system we want, so long as that system can get user information from LDAP; right now we're still thinking along the lines of MySQL and PHP, but really it doesn't matter as long as it can consult LDAP
  3. Non-user data (computer or equipment records, for instance) can be held in MySQL (or other) databases; our LDAP server need only be responsible for user data

That's the general plan. An LDAP server for hosting user data, and a set of web apps that rely on MySQL (or other) databases for web app-specific data, with the stipulation that these web apps must be able to use LDAP authentication. This, to me, sounds like it should scale quite well: Want to add a new web app? Fine. You can either add to the current MySQL database or, if necessary, build another database, so long as it gets its user data from LDAP, so that user data is never duplicated and always stays consistent. It's important to remember that the real Holy Grail here is the LDAP connection. If we can crack that nut (and we have, to some extent) we're halfway home.
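
To make that a little more concrete, here's roughly what a single user would look like in the LDAP database: one canonical record that every server and web app consults for authentication. This is purely an illustrative sketch; the object classes, attributes and directory layout below are placeholders, not anything we've actually settled on:

# Hypothetical user entry (LDIF); base DN and attribute values are made up
dn: uid=jdoe,ou=people,dc=example,dc=edu
objectClass: inetOrgPerson
objectClass: posixAccount
uid: jdoe
cn: Jane Doe
sn: Doe
uidNumber: 5001
gidNumber: 500
homeDirectory: /home/jdoe
# userPassword would hold a hashed value, e.g. an {SSHA} hash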

This plan is a good first step toward figuring out what we need to do in order to move forward with this in any kind of meaningful way. As I mentioned, one of the hurdles here is the fact that this whole thing involves a number of different staff members with various talents and skill sets, so I now at least have a clear, if general, map that I can give them, as well as a fairly clear picture in my mind of how this will ultimately be implemented. Coming up with a plan involved talking to a number of people, and trying out a bunch of things. Once I'd gathered enough information about who knew what and how I might best proceed, I started with what I knew, experimenting with a Mac OS X Server and some web apps I downloaded from the 'net. But I quickly realized that this wasn't going to cut it. If I'm going to essentially be the manager for this project, it's incumbent upon me to have a much better understanding of the underlying technologies, in particular: MySQL, PHP, Apache and BSD, none of which I'd had any experience with before two weeks ago.

So, to better understand the server technology behind all this, I've gone and built a FreeBSD server. On it I've installed MySQL, PHP and OpenLDAP. I've configured it as a web server running a MySQL database with a PHP-based front-end, a web app called MRBS. It took me a week, but I got it running, and I learned an incredible amount. I have not set up the LDAP database on that machine as yet, however. Learning LDAP will be a project unto itself, I suspect. To speed up the process of better understanding MySQL and PHP (forgoing learning LDAP for the time being), I also installed MRBS on a Tiger Server with a bunch of LDAP users in the Open Directory database. MRBS is capable of authenticating to LDAP, and there's a lovely article at AFP548 that was immensely helpful in getting me started. After much trial and error I was able to get it to work. I now have a web application that stores its data in a MySQL database, accessed via PHP, but gets its user data from the LDAP database on the Tiger Server. I have a working model, and this is invaluable. For one, it gives me something concrete to show the other systems admins, something they can use as a foundation for this project, and a general guide for how things should be set up. For two, it gives us a good idea of how this all works, and something we can learn from when modifying our own code. A sort of Rosetta stone, if you will. And, finally, it proves that this whole undertaking is, indeed, quite possible.
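
For anyone trying the same thing, the quickest sanity check I know of before pointing MRBS at the server is to query the Tiger Server's Open Directory LDAP database directly with ldapsearch. A sketch only; the hostname and search base below are made up, and yours will depend on how the Open Directory master was configured:

# Anonymous query against the Tiger Server's LDAP database;
# if this returns the user record, MRBS has something to authenticate against
ldapsearch -x -H ldap://tigerserver.example.edu \
  -b "cn=users,dc=tigerserver,dc=example,dc=edu" "uid=testuser" uid cn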

So far, key things I've learned are:

  1. MySQL is a database (well, I knew that, but now I really know it)
  2. PHP is a scripting/programming language that can be used to access databases
  3. MySQL is not capable of accessing external authentication databases (like LDAP)
  4. PHP, however, does feature direct calls to LDAP, and can be used to authenticate to LDAP servers
  5. PHP will be the bridge between our MySQL-driven web apps and our LDAP user database (a quick command-line sanity check of points 4 and 5 follows below)
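
Here's that sanity check. Neither command is anything fancy, and the hostname, bind DN and password are all placeholders, but together they confirm the two things points 4 and 5 depend on: that this PHP build actually includes the ldap extension, and that the kind of authenticated bind a PHP web app would perform succeeds against the directory:

# Is the ldap extension compiled into this PHP?
php -m | grep -i ldap

# Simulate the authenticated bind a PHP app would make with ldap_bind();
# the search only succeeds if the password (-w) is accepted
ldapsearch -x -H ldap://tigerserver.example.edu \
  -D "uid=testuser,cn=users,dc=tigerserver,dc=example,dc=edu" \
  -w 'testpassword' \
  -b "cn=users,dc=tigerserver,dc=example,dc=edu" "uid=testuser" uid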

So that is, if you've been wondering, what I've been doing and thinking about and working on for the past two weeks. Whew! It's been a lot of challenging but rewarding work.

This is actually a much bigger, much harder project than our internal network unification. First, I'm dealing with technologies with which I'm largely unfamiliar and about which I must educate myself. Second, there are concerns — security in particular — which are much more important to consider on an external network. Third, there are a great many more databases and servers that need to be unified. Fourth, scalability is a huge issue, so the planning must be spot on. And finally, this is a team effort. I can't do this all myself. So a lot of coordination among a number of our admins is required. In addition to being a big technical challenge for me personally, this is a managerial challenge as well. So far it's going really well, and I'm very lucky to have the support of my superiors as well as excellent co-systems administrators to work with. This project will take some time. But I really think it will ultimately be a worthwhile endeavor that makes life better for our student body, faculty, systems admins and administrative staff alike.

Re-Binding to a Mac Server

Last semester we had a lot of problems in the lab. They were mainly due to two things: the migration to Tiger and our home account server. Our Tiger problems have largely gone away with the latest releases, we've replaced our home account server with another machine, and, aside from a minor hiccup here and there, things seem to have quieted down. The Macs are running well, and there hasn't been a server disconnect in some time. It's damn nice.

There has been one fairly minor lingering problem, however. For some reason our workstation Macs occasionally and randomly lose their connection to our authentication server — our Mac Server. When this happens, the most notable and problematic symptom is users' inability to log in. Any attempt at login is greeted with the login screen shuffle. You know, that thing where the login window shakes violently at the failed login attempt. This behavior is an indication that the system does not recognize either the user name or the password supplied, which makes sense, because when the binding to the authentication server is broken, for all intents and purposes, the user no longer exists on that system.

I've looked long and hard to find a reason for, and a solution to, this problem. I have yet to discover what causes the systems to become unbound from the server (though at this point I'm starting to suspect some DNS funkiness, or anomalies in my LDAP database, as the root cause). There is no pattern to it, and there is nothing helpful in the logs. If the unbinding happens at boot, there's only a message that the machine was unable to bind to the server, nothing about why; if it happens while the machine is running, which it sometimes does, there's nothing at all. It's a mystery. And until recently, the only fix I could come up with was to log on to the unbound machine and reset the server in the Directory Access application. Part of my research involved looking for a command-line way to do this so that I wouldn't have to log in and use the GUI every time this happened, as it happens fairly often, and the GUI method is slow and cumbersome, especially when you want to get that machine back online ASAP.

It took me a while, but I have found the magic command, at a site called MacHacks. Boy is it simple. You just have to restart DirectoryService:

sudo killall DirectoryService

This forces the computer to reload all the services listed in the Directory Access app, and rebind to any servers that have been set up for authentication. I've added the command to the crontab and set it to run every 30 minutes. That should alleviate the last of our lab problems.
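
For anyone who wants to do the same, the relevant line in root's crontab looks something like this (a sketch; adjust the schedule to taste):

# Edit root's crontab with: sudo crontab -e
# Restart DirectoryService every 30 minutes
*/30 * * * * /usr/bin/killall DirectoryService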

Hopefully the rest of the semester will go as smoothly as the past two weeks have. I could use a little less systems-related drama right now.

Tiger Lab Migration Part 11: Panasas Crashes and Caches

The last time we visited this topic, I thought we were done. Well, turns out I was wrong.

Things are, for the most part, working well now. Finally. We're running Tiger and we've managed to iron out the bulk of the problems. There is one issue which has persisted, however: the home account RAID.

To refresh, our network RAID, which is responsible for housing all our users' home account data, is made by a company called Panasas. Near as I can figure, we've got some experimental model, 'cause boy does it crash a lot. Which is not what you want in a home account server, by any means. After upgrading the Panasas OS a while back, the crashing stopped. But it was only temporary. Lately the crashing is back with a vengeance. Like every couple of days it goes down. And when it goes down, it goes down hard. Like physical-reset hard. Like pull-the-director-blade-and-wait hard. Like sit-and-wait-for-the-RAID-to-rebuild hard.

Again: Not what you want in a home account server.

So we've built a new one. Actually, we've swapped our backup and home account servers. See, a while back we decided it would be prudent to have a backup of the home account server. Just something quick 'n' dirty. Just in case. This was built on a reasonably fast custom box with a RAID controller and a bunch of drives. It's a cheap solution, but it does what we need it to, and it does it well. And now we're using it as the new home account server. So far it's been completely stable. No crashes in a week-and-a-half. Keep in mind, this is a $3,000 machine, not a $10,000 network RAID. It's not that fast, but it's fast enough. And it's stable. By god it's stable.

And that's what you want in a home account server.

Moving to the new server — which, by the way, is a simple Linux box running Fedora Core 4 — has afforded us the opportunity to change — or, actually, revert — the way login happens on the Macs. In the latter half of last semester, we were using a login hook that called mount_nfs because of problems with how Mac OS X 10.4.2 handled our Panasas setup, which creates a separate volume (read: filesystem) for each user account. Since we're now just using a standard Linux box to share our home accounts, which are now just folders, we have the luxury of reverting to the original method of mounting user accounts that we used last year under Mac OS X 10.3. That is, the home account server is mounted at boot time in the appropriate directory using automount, rather than with mount_nfs at login. Switching back was pretty simple: Disable the login hook (by deleting root's com.apple.loginwindow.plist file), place a Startup Item that calls the new server in /Library/StartupItems, reboot and you're done, right? Well, not quite. There's one last thing you need to do before you can proceed. Seems that, even after doing all of the above, the login hook was still running. I could delete the scripts it called, but it would still run. Know why? This will blow your mind. Cache.

Yup. It turns out — and who would have ever suspected this, as it's so incredibly stupid — login hooks get cached somewhere in /Library/Caches and will continue to run until these caches are deleted. I'm sorry, but I just have to take a minute and say, that is fucked up. Why would such a thing need to be cached? I mean maybe there's a minimal speed boost from doing this. The problem is that now you have a system-level behavior that's in cache, and these caches are fairly persistent. They don't seem to reset. And they don't seem to update. This is like if your browser only used cached pages to show you websites, and never compared the cache to files on the server. You'd never be able to see anything but stale data without going and clearing the browser cache. At least in a browser — 'cause let's face it, this does happen from time to time (but not very often) — there is always some mechanism for clearing caches — a button, a menu item, a preference. In Mac OS X there is no such beast. In fact, the only way to delete caches in Mac OS X is to go to one or all of the various Cache folders and delete them by hand. Which is what I did, and which is what finally stopped the login scripts from running.
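
In case anyone else gets bitten by this, here's roughly what the cleanup looked like on my end. Treat it as a sketch rather than gospel: the paths are the standard ones, and wiping /Library/Caches wholesale is admittedly the blunt-instrument approach:

# What I did: delete root's loginwindow prefs, where the login hook was set
sudo rm /var/root/Library/Preferences/com.apple.loginwindow.plist

# A more surgical alternative should be to remove just the LoginHook key
sudo defaults delete com.apple.loginwindow LoginHook

# And the part that finally did the trick: clear the system-level caches and reboot
sudo rm -rf /Library/Caches/*
sudo reboot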

If this isn't clear evidence that Mac OS X needs some much better cache management, I don't know what is.

In any case, we're now not only happily running Tiger in the lab, but we've effectively switched over to a new home account server as well. So far, so good. Knock wood and all that. Between the home account problems, the Tiger migration, and getting ready for our server migration, this has been one of the busiest semesters ever. Though I keep this site anonymous because I write about work, I just want to give a nod to all the people who've helped me with all of the above. I certainly have not been doing all of this alone (thank god) and they've been doing kick-ass work. And, though I can't mention them by name, I really appreciate them for it. At the dawn of a new semester, we've finally worked out all of our long-standing problems and can get down to more forward-looking projects.

So ends (I hope) the Tiger Lab Migration.

Three Platforms, One Server Part 4: Redundancy

One of the major hurdles in our server unification project, mentioned in Part 1 of this series, is that of redundancy. In the old paradigm, each platform's users were hosted by a separate server. Mac users authenticated to a Mac Server, Windows users to a Windows Server, and Linux users to an NIS server. While this is exactly what we're trying to avoid by hosting all users on a single server, it does have one advantage over this new approach: built-in redundancy. That is, if one of our authentication servers fails, only the users on the platform hosted by said server are affected. For example, if our Windows Server fails, Windows users cannot log in, but Mac users and Linux users can. In our new model, where all authentication for all platforms is hosted by a single server, if that server fails, no user can log in anywhere.

Servers are made to handle lots of different tasks and to keep running and doing their jobs under extreme conditions. To a certain extent, that is the very nature of being a server. To serve. Hence the name. So servers need to be, and tend to be, very robust. Nevertheless, they do go down from time to time. That's just life. But in the world of organizations that absolutely must have constant, 24-hour, 'round-the-clock uptime, this unavoidable fact of life is simply unacceptable. Fortunately for me I do not inhabit such a world. But, also fortunately for me, this notion of constant uptime has provided solutions to the problem of servers crashing. And while I probably won't lose my job if a server crashes periodically, and no one is going to lose millions of dollars from the downtime, no SysAdmin likes it when he has to tell his users to go home for the night while he rebuilds the server. It just sucks. So we all do our best to keep key systems like servers available as much as possible. It's just part of the deal.

So how are we going to do this? Well, one of the reasons I decided to use a Mac for this project is that it has built-in server replication for load balancing and, yes, failover. We're not too concerned with the load balancing; failover is what we're after. Failover relies on a backup database, a replica of the primary, that takes over if the primary fails. Mac Server has this built in, and from what I read, it should be fairly easy to set up. Which is exactly what we're about to do.

The first thing we need is our primary server. This is the main server. The one that gets used 99% of the time (hopefully). We have this (or at least a test version of it) built already as discussed in Part 1. What we need next is what is called the replica. The replica is another Mac OS X Server machine that is set to be an "Open Directory Replica," rather than an "Open Directory Master."

So I've built a plain old, vanilla, Mac Server, and set it initially to be a Standalone Server. I've given it an IP address, and done the requisite OS and security upgrades. (Oy! What a pain!) In the Server Admin application, I set the new server to be an "Open Directory Replica." I'll be asked for some information here. Mainly, I'll need to tell this replica what master server to replicate. Specifically I'm asked to provide the following at the outset:

IP address of Open Directory master:

Root password on Open Directory master:

Domain administrator's short name on master:

Domain administrator's password on master:

(The domain administrator, by the way, is the account used to administer the LDAP database on the master.)

Once I fill in these fields I'll get a progress bar, and then, once the replica is established, I'm basically done. There are a few settings I can tweak. For instance, I can set up secure communications between the master and the replica with SSL. But for my purposes, this would be overkill. I'm pretty much going with the out-of-the-box experience at this point. So for setup, that should be it. Setting up a replica is pretty easy stuff.
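
Before breaking anything on purpose, it's worth confirming that the replica is actually answering LDAP queries on its own. Something along these lines, pointed at the replica's address, should return the same user records the master does (the hostname and search base here are placeholders):

# Query the replica directly; it should serve the same records as the master
ldapsearch -x -H ldap://replica.example.edu \
  -b "cn=users,dc=master,dc=example,dc=edu" "uid=testuser" uid cn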

[Screenshot: Establishing the Replica: Could it Be Any Easier?]

Now here comes the fun part: testing. What happens if our primary server goes offline? Will the replica take over authentication services? Really? I'd like to be sure. What I'm going to do now is test the behavior of the Master/Replica pair to make sure it acts as intended. The best way I know to do this is to simulate a real-world crash. So I am binding one of my clients to my Master server, with the Replica in place. Then I'm going to pull the plug. In theory, users should still be able to log in to the bound client. Let's try it...

Bang! It works! I'm a bit surprised; last time I tried it, years ago, it (or I) failed. This time, though, it worked. We bound a client to the Master, our mother-ship server. Authentication worked as expected. (We knew we were bound to the new server because the passwords are different.) And then we killed it. We killed the master and logged out. There was some beachballing at logout. But after a few minutes (like two or three, not a long wait at all) we were able to complete logout, and then log right back in as though nothing had happened. I tell you, it was a thing of beauty.
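
If you want something a little more rigorous than "it let me log in," a couple of quick checks from the bound client will tell you whether the directory is still answering with the master down. A sketch only; the node name is whatever your LDAPv3 entry is called in Directory Access:

# Does a network user still resolve with the master offline?
id testuser

# Read the user record straight from the configured LDAPv3 node
# (list your node names first with: dscl localhost -list /LDAPv3)
dscl localhost -read /LDAPv3/master.example.edu/Users/testuser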

So let's briefly recap where we've been and what's left to do.

Where we've been:

  • We've built our Mama Server. Our authentication server for the entire lab.
  • We've figured out how to migrate our users to Mama, and how to handle the required password change.
  • We've solved the inherent problems with Windows clients and figured out a few solutions for handling them involving quotas and various roaming profile locations.
  • We've built and tested the operation of the Open Directory Replica, and it is good.

What's left to do:

  • Well, honestly, not a whole Hell of a lot.
  • The next step, really, is real-world testing. We have a basic model of how our servers and clients should be configured, and it's basically working. To really test this, we'll need to take some actual clients from the lab and set them up to use the new system.
  • Stress testing (i.e. seeing if we can break the system, how it holds up under load, etc.) would also be good, and might be something to start on a bit over Winter break, and definitely in the Summer. To do this, we'll need to set up several client systems, and get users (guinea pigs) to do some real work on them all at the same time.
  • Once stress testing is done, if all is well, I'm pretty sure we can go ahead and implement the change. I can't foresee any other problems.

So I'm at a stopping point. There's not much else I can do until the break, at which point I'll be sure and post my test results.

Hope to see you then!