Three Platforms, One Server Part 9: Replica Problems

This post will be a short one. Promise. We're finishing up this project, and so far it looks like it's going to be a success. We're just adding the last little bits and finishing touches right now, but we've been putting our internal authentication server through its paces for the past couple of months, and all seems well.

In a few places along the travails of this series I mentioned the need for a fail over in the event our master authentication server goes belly up. The rationale is that, with all our eggs in the one server basket, if our master goes down, no one can log in — not on Windows, not on Mac, not on Linux. Fortunately, Mac OS X Server allows for what's known as a replica. A replica is a read-only copy of your Open Directory data hosted by another server. But it's a bit more than that. The replica also provides what's known as fail over. That is, if the master server goes down, the replica knows to take over and start serving authentication to the clients. The replica, in effect, becomes the master in the event of the master's absence, until said master returns to service, at which point the replica gracefully hands control back to the master. You can actually have multiple replicas for fail over, and for redundancy in the event of slow or separate networks. Brilliant! I've wanted one for a long time. And now I have it.

Setting up a replica is easy as pie: Set the server's "role" to "Replica", point it at its master, authenticate and you're off to the races. That is, if you've set everything just perfectly on both your master server and the replica. That's the thing about Mac OS X Server, and always has been: if you set it up wrong initially, you're in for a potential world of hurt. As I rediscovered last week.

After building my replica, I next went on the requisite testing spree. The test involved physically pulling the network cable from the master server, and then observing the behavior of the clients. Initially, the replica seemed to do a fine job of picking up right where the master left off. After about two minutes' time, clients could log right back in. But after about five minutes' time, nearly every client on our network beachballed indefinitely, and any attempt to log in would hang the machine. This hanging was so sudden and so thorough, it actually froze my machine mid-cube-effect as I attempted to fast-user-switch to the login window. Some fail over! Not cool.

Fast User Switch: Frozen in Time

So I've spent a great deal of time figuring out the solution. The logs were no help. Google turned out to be a wash. Manuals? Pfft! For the first time in quite a while, it was an Apple Knowledge Base Article that offered the fix. The article was written for people who were having problems getting a 10.4.2 replica to remain a replica. Apparently there was an issue that involved these servers reverting back to "Standalone" roles after being switched from "Replica" to "Standalone" and back again. Though this was not my problem, nor did any of the symptoms reflect what I was seeing, I finally decided to try the recommendations in the article as they seemed fairly universally applicable, and as I was desperate and had tried everything else. The article essentially details methods for cleaning out every part of the master database that references the replica, and then re-promoting the replica to a clean master. Honestly, my master database looked pretty clean to me, but there was one bit of advice that I was not aware of, and it's my suspicion that this was what did the trick for me: The OD Administrator of the replica's database can NOT have the same UID or short name as that of the master. The article recommends creating a separate OD Admin account on the master, and using this separate account when binding the replica to the master. Honestly, I had no fucking idea. Would've been nice if this had been more explicitly mentioned in the manuals. Believe me, I've read the section entitled "Setting Up an Open Directory Replica" numerous times by now. It's not there. Fortunately, it's in that article, and now I know (and knowing's half the battle).

And now you know.
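For the record, the rule from that article boils down to a simple comparison. Here's a minimal sketch of it in Python. The `DirectoryAdmin` type, the `admin_conflicts` function, and the account names and UIDs are all made up for illustration; this is not anything Apple ships, just the rule written down as a check you could run against values you've looked up yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryAdmin:
    """Hypothetical record for a directory administrator account."""
    short_name: str
    uid: int


def admin_conflicts(master_admin, replica_admin):
    """List the attributes the two admin accounts share.

    Per the Knowledge Base advice, neither the short name nor the UID
    should match, so an empty list is the result you want.
    """
    conflicts = []
    if master_admin.short_name == replica_admin.short_name:
        conflicts.append("short name")
    if master_admin.uid == replica_admin.uid:
        conflicts.append("UID")
    return conflicts


# Example values only; substitute whatever your own servers use.
master = DirectoryAdmin(short_name="diradmin", uid=1000)
replica = DirectoryAdmin(short_name="diradmin", uid=1000)
binder = DirectoryAdmin(short_name="replicadmin", uid=1050)

print(admin_conflicts(master, replica))  # both attributes clash
print(admin_conflicts(master, binder))   # no clash
```

The second account stands in for the separate OD Admin the article recommends creating on the master for binding.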

So, in a couple weeks we go live with our unified internal network. I'll let you know how it goes.

Actually, my replica seems to still not be working, at least not very reliably. What happens is that about two minutes after the plug is pulled on the master, the replica picks up. At this point, clients (both Mac and Windows) can successfully log in. Shortly thereafter, maybe four minutes later, we start having problems: the Macs can't log in, or only some can log in; Windows machines log in intermittently (one time it works, the next time it doesn't, it works after a reboot, then it doesn't); and, perhaps strangest of all, network connections to the Macs (ARD, ssh, anything but ping) become all but impossible, hanging at the attempts. This is totally a guess, but it seems to me like the clients are having serious trouble binding to the replica. They keep attempting to do so, with some initial or intermittent success, and in their attempts network connections get locked up and the machines bog down. It's almost like the replica server is saying, "Yes, you can bind to me," and then changing its mind and saying, "Wait, no you can't. Never mind. Screw you." Again, I'm only guessing. There is nothing clear cut in the logs, and I can't find anything in Apple's Discussion forums or Knowledge Base that specifically addresses my problem. I only pray that it isn't a problem with my master server, but the master works perfectly, and it seems to me that a replica of a perfectly working master should work perfectly. The current replica is running on a Mac mini with limited RAM, and a 10/100 BT NIC, and I want to rule out potential problems that might be caused by the hardware as well as the software setup. So my next step will be to build a new replica from scratch on a G5. I'll let you know if that solves the problem.
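Since the logs aren't telling me anything, one way to at least pin down the timeline is to poll a few TCP ports on a timer while the master is unplugged and write down what answers when. Below is a rough sketch of that idea in Python. It is not what I actually ran during these tests. The hostnames are placeholders, and the choice of ports (389 for LDAP on the servers, 22 for ssh on a test client) is my assumption about what's worth watching. A probe that times out counts as down, which is how a hanging connection would show up.

```python
import socket
import time

# Placeholder hosts and ports (assumptions); substitute your own.
TARGETS = [
    ("master.example.com", 389),   # LDAP on the master
    ("replica.example.com", 389),  # LDAP on the replica
    ("client.example.com", 22),    # ssh on a test client
]


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups.
        return False


def snapshot(targets, probe_fn=probe):
    """Probe every target once and map "host:port" to its result."""
    return {f"{host}:{port}": probe_fn(host, port) for host, port in targets}


def format_row(elapsed, results):
    """Render one log line: seconds since start, then each target's state."""
    cells = ", ".join(
        f"{name}={'up' if ok else 'DOWN'}" for name, ok in results.items()
    )
    return f"{elapsed:6.0f}s  {cells}"


def watch(targets, interval=15.0, rounds=40, probe_fn=probe):
    """Print a row every `interval` seconds, `rounds` times; return the rows."""
    start = time.monotonic()
    rows = []
    for i in range(rounds):
        row = format_row(time.monotonic() - start, snapshot(targets, probe_fn))
        print(row)
        rows.append(row)
        if i < rounds - 1:
            time.sleep(interval)
    return rows
```

Start `watch(TARGETS)` just before pulling the cable and let it run for ten minutes or so. The point where the replica's port starts answering and the point where the client's ssh port starts timing out should both show up as timestamps, which beats squinting at a beachball and guessing.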

Another thing I absolutely should mention: For Windows replication, the replica server must be set up as a Backup Domain Controller (BDC). This is done in the Server Admin application in the Windows section. It's fairly straightforward to set up, so I won't go into detail, but just for the record, it's important to be aware of this, and I wasn't until recently, so I mention it here for completeness' sake.

Having this replica isn't absolutely critical to our plan. That is, we can go forward with this plan without the replica. But having a working replica will provide an important safety net that I'd really like to have working as the semester begins. There's no good way to test it while the semester is in session. So I'm working hard to get it up and running in the final week of the summer.

So much for this post being short. More to come.

I built the new server today. From scratch. On a G5. No joy. I honestly don't know what the problem could be. I can only guess that something either broke with the latest 10.4.7 update, or that there's something slightly off with my master server and it's causing problems on the replica. But it's weird, because if I bind directly to the replica using Directory Access it works perfectly, which leads me to suspect a problem on the client. But it affects Windows machines as well, so that doesn't quite figure either. I hate to admit it, but I'm stumped. And, unfortunately, I don't have time to worry about it right now. I will revisit this issue at a later date, when I get some time. When I do, I'll probably post a new entry with the solution, that is, if I'm able to find a solution. I hate this kind of thing. Is anyone else having a similar problem? I feel like I'm going nuts. And I can't believe I've spent so much time on something that should be really, really easy. Fuck. What a bummer.

I've pretty much given up on this for now. No time. And no good time to test, what with the students coming back next week. But today I noticed something strange, and it occurred to me it might be related to my replica problems. Today, when trying to make an AFP connection to the master server from a client using a simple "Connect to Server..." I got a Kerberos prompt that refused my admin credentials. Hmmm... Kerberos problems... On the master... Not good... So who knows? I may be rebuilding the master server at some point. But not now. Oh, Lordy, not now.