One of the major hurdles in our server unification project, mentioned in Part 1 of this series, is redundancy. In the old paradigm, each platform's users were hosted by a separate server: Mac users authenticated to a Mac Server, Windows users to a Windows Server, and Linux users to an NIS server. That separation is exactly what we're trying to eliminate by hosting all users on a single server, but it does have one advantage over the new approach: built-in redundancy. That is, if one of our authentication servers fails, only the users on the platform hosted by that server are affected. For example, if our Windows Server fails, Windows users cannot log in, but Mac users and Linux users can. In our new model, where authentication for all platforms is hosted by a single server, if that server fails, no user can log in anywhere.
Servers are made to handle lots of different tasks and to keep running and doing their jobs under extreme conditions. To a certain extent, that is the very nature of being a server. To serve. Hence the name. So servers need to be, and tend to be, very robust. Nevertheless, they do go down from time to time. That's just life. But in the world of organizations that absolutely must have constant, 24-hour, 'round-the-clock uptime, this unavoidable fact of life is simply unacceptable. Fortunately for me, I do not inhabit such a world. But, also fortunately for me, this demand for constant uptime has produced solutions to the problem of servers crashing. And while I probably won't lose my job if a server crashes periodically, and no one is going to lose millions of dollars from the downtime, no SysAdmin likes having to tell his users to go home for the night while he rebuilds the server. It just sucks. So we all do our best to keep key systems like servers available as much as possible. It's just part of the deal.
So how are we going to do this? Well, one of the reasons I decided to use a Mac for this project is that it has built-in server replication for load balancing and, yes, failover. We're not too concerned with the load balancing; failover is what we're after. Failover, in this context, means keeping a backup copy of the primary directory database on a second server, a replica that takes over if the primary fails. Mac Server has this built in, and from what I've read, it should be fairly easy to set up. Which is exactly what we're about to do.
The first thing we need is our primary server. This is the main server, the one that gets used 99% of the time (hopefully). We have this (or at least a test version of it) built already, as discussed in Part 1. What we need next is what's called the replica. The replica is another Mac OS X Server machine that is set to be an "Open Directory Replica" rather than an "Open Directory Master."
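By the way, you don't have to take Server Admin's word for which role a given box is playing. Assuming my memory of the command-line tools is right (the Open Directory service goes by "dirserv" in serveradmin, I believe), something like this, run on each server, should report whether it thinks it's a master, a replica, or a standalone server:

```
# Quick check of the Open Directory service from the terminal.
# "dirserv" is the service name as I remember it -- verify with `serveradmin list`.
sudo serveradmin status dirserv

# A fuller readout, which should include the server's role
# (master, replica, or standalone) among other details.
sudo serveradmin fullstatus dirserv
```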
So I've built a plain old, vanilla Mac Server and set it initially to be a Standalone Server. I've given it an IP address and done the requisite OS and security upgrades. (Oy! What a pain!) In the Server Admin application, I then set the new server to be an "Open Directory Replica." I'm asked for some information here. Mainly, I need to tell this replica which master server to replicate. Specifically, I'm asked to provide the following at the outset:
IP address of Open Directory master:
Root password on Open Directory master:
Domain administrator's short name on master:
Domain administrator's password on master:
(The domain administrator, by the way, is the account used to administer the LDAP database on the master.)
Once I fill in these fields I get a progress bar, and then, once the replica is established, I'm basically done. There are a few settings I can tweak. For instance, I can set up secure communications between the two servers with SSL. But for my purposes, that would be overkill. I'm pretty much going with the out-of-the-box experience at this point. So for setup, that should be it. Setting up a replica is pretty easy stuff.
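(For the terminal-inclined: I believe the same replica setup can be done from the command line on the replica-to-be with slapconfig, though I've only done it through Server Admin, so treat the syntax below as an educated guess and check the man page first. The hostname and admin name are placeholders.)

```
# On the replica-to-be: configure this machine as an Open Directory
# replica of the given master. Syntax from memory -- see `man slapconfig`.
sudo slapconfig -createreplica odmaster.example.edu diradmin

# It should prompt for the master's root and directory administrator
# passwords -- the same information Server Admin asks for.
```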
Now here comes the fun part: testing. What happens if our primary server goes offline? Will the replica take over authentication services? Really? I'd like to be sure. What I'm going to do now is test the behavior of the Master/Replica pair to make sure it acts as intended. The best way I know to do this is to simulate a real-world crash. So I'm binding one of my clients to the Master server, with the Replica in place. Then I'm going to pull the plug. In theory, users should still be able to log in to the bound client. Let's try it...
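(For reference, the bind itself doesn't require a GUI either, and it's worth sanity-checking it before yanking any power cords. The hostname and test account below are placeholders; on the test client, something along these lines does the job:)

```
# Add the Open Directory master to the client's LDAP search path
sudo dsconfigldap -v -a odmaster.example.edu

# Sanity check: can the client resolve a network account at all?
id testuser

# Confirm the account is coming from the directory, not a local user,
# by reading it through the search path
dscl /Search -read /Users/testuser AuthenticationAuthority
```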
Bang! It works! I'm a bit surprised; last time I tried this, years ago, it (or I) failed. This time, though, it worked. We bound a client to the Master, our mother-ship server. Authentication worked as expected. (We knew we were bound to the new server because the passwords are different.) And then we killed it. We killed the Master and logged out. There was some beachballing at logout, but after a few minutes (two or three, not a long wait at all) we were able to complete the logout, and then log right back in as though nothing had happened. I tell you, it was a thing of beauty.
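(If a successful login isn't proof enough, you can poke at things from the client while the master is still unplugged. Hostnames and the search base here are placeholders; the point is just to confirm that the replica is the machine answering:)

```
# The master should be unreachable...
ping -c 3 odmaster.example.edu

# ...but the replica should still answer LDAP queries directly
ldapsearch -x -H ldap://odreplica.example.edu -b "dc=example,dc=edu" "uid=testuser"

# And account lookups through the directory should still resolve
id testuser
```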
So let's briefly recap where we've been and what's left to do.
Where we've been:
- We've built our Mama Server. Our authentication server for the entire lab.
- We've figured out how to migrate our users to Mama, and how to handle the required password change.
- We've worked through the inherent problems with Windows clients and figured out a few solutions for handling them, involving quotas and various roaming profile locations.
- We've built and tested the operation of the Open Directory Replica, and it is good.
What's left to do:
- Well, honestly, not a whole Hell of a lot.
- The next step, really, is real-world testing. We have a basic model of how our servers and clients should be configured, and it's basically working. To really test this, we'll need to take some actual clients from the lab and set them up to use the new system.
- Stress testing (i.e. seeing if we can break the system, how it holds up under load, etc.) would also be good, and might be something to do a bit over Winter break, and definitely over the Summer. To do this, we'll need to set up several client systems and get users (guinea pigs) to do some real work on them all at the same time. (I've sketched the sort of scripted abuse I have in mind just after this list.)
- Once stress testing is done, if all is well, I'm pretty sure we can go ahead and implement the change. I can't foresee any other problems.
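As for the stress testing mentioned above, here's the sort of thing I have in mind, sketched as a crude shell loop. It just hammers the directory with batches of concurrent lookups; the hostname, search base, and account names are all placeholders, and the real test will of course involve actual humans logging into actual machines:

```
#!/bin/bash
# Crude load generator: fire batches of concurrent LDAP lookups at the
# Open Directory master. All names below are placeholders.
SERVER="odmaster.example.edu"
BASE="dc=example,dc=edu"
USERS="testuser1 testuser2 testuser3 testuser4 testuser5"

for ((round=1; round<=100; round++)); do
    for u in $USERS; do
        # Run each lookup in the background so they pile up concurrently
        ldapsearch -x -H "ldap://$SERVER" -b "$BASE" "uid=$u" > /dev/null 2>&1 &
    done
    wait    # let this batch finish before starting the next
    echo "Completed round $round"
done
```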
So I'm at a stopping point. There's not much else I can do until the break, at which point I'll be sure to post my test results.
Hope to see you then!