External Network Unification Part 4: The CMS Goes Live

NOTE: This is the latest article in the External Network Unification project series. It was actually penned, and was meant to be posted, several weeks ago, but somehow got lost in the shuffle. In any case, it's still relevant, and rather than rewrite it to account for the time lapse, I present it here in its original form, with a follow-up at the end.
-systemsboy

Last Thursday, August 10th, 2006, marked a milestone in the External Network Unification project: we've migrated our CMS to Joomla and are using external authentication for the site. Though it was accomplished somewhat differently than I had anticipated, accomplished it was, nonetheless, and boy, are we happy. Here's the scoop.

Last time I mentioned I'd built a test site — a copy of our CMS on a different machine — and had some success, and that the next step was to build a test site on the web server itself and test the LDAP Hack on the live server, authenticating to a real live, non-Mac OS X LDAP server. Which is what I did.

Building the Joomla port on the web server was about as easy as it was on the test server. I just followed the same set of steps and was done in no time. Easy. And this time I didn't have to worry about recreating any of the MySQL databases since, on the web server, they were already in place as we want them and were working perfectly. So the live Joomla port was exceedingly simple.

LDAP, on the other hand, is not. I've been spoiled by Mac OS X's presentation of LDAP in its server software. Apple has done a fantastic job of simplifying what, I recently discovered, is a very complicated, and at times almost primitive, database system. Red Hat has also made ambitious forays into the LDAP server arena, and I look forward to trying out their offerings. This time out my LDAP server was built by another staff systems admin. He did a great job in a short space of time on what I can only imagine was, at times, a trying chore. The LDAP server he built, though, worked and was, by all standards, quite secure. Maybe too secure.

When I tried to authenticate our Joomla CMS port with the LDAP Hack, nothing I did worked. And I tried everything. Our LDAP server does everything over TLS for security, and requires all transactions to be encrypted, and I'm guessing the LDAP Hack we were using for the CMS just couldn't handle that. In some configurations, login information was actually printed directly to the browser window. Not cool!
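
For what it's worth, you can verify that kind of TLS-only behavior from the shell. A sketch, with the hostname and base DN invented, and assuming the server uses StartTLS on the standard port:

# plain, unencrypted query: a TLS-only server should refuse this
ldapsearch -x -H ldap://ldap.example.com -b "dc=example,dc=com" "(uid=testuser)"
# same query with -ZZ, which demands a successful StartTLS first: should work
ldapsearch -x -ZZ -H ldap://ldap.example.com -b "dc=example,dc=com" "(uid=testuser)"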

Near the point of giving up, I thought I'd just try some other stuff while I had this port on hand. The LDAP Hack can authenticate via two other sources, actually: IMAP and POP. Got a mail server? The LDAP Hack can authenticate to it just like your mail client does. I figured it was worth a shot, so I tried it. And it worked! Perfectly! And this gave me ideas.
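
If you're curious what the hack is doing under the hood, you can perform the same sort of IMAP login by hand. A minimal sketch; the hostname and credentials are made up, and I'm assuming IMAP over SSL on port 993:

openssl s_client -connect mail.example.com:993 -quiet
# then, at the prompt, type an IMAP login:
a1 LOGIN testuser testpass
# a reply beginning "a1 OK" means the server accepted the credentials
a2 LOGOUT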

The more I thought about it, the more I realized that our LDAP solution is nowhere near ready for prime time. I still believe LDAP will ultimately be the way to go for our user databases. But for now, what we want to do with it is just too complicated. The mere act of user creation on the LDAP server, as it's built now anyway, will require some kind of scripting solution. I also now realize that we will most likely need a custom schema for the LDAP server, as it will be hosting authentication and user info for a variety of other servers. For instance, we have a Quicktime Streaming Server, and home accounts reside in a specific directory on that machine. But on our mail server, the home account location is different. This, if I'm thinking about it correctly, will need to be handled by some sort of custom LDAP schema that can supply variable data with regard to home account locations based on the machine that is connecting to it. There are other problems too, ones that are so abstract to me right now I can't even begin to think about writing about them. Suffice it to say, with about two and a half solid weeks before school starts, and a whole list of other projects that must get done in that time frame, I just know we won't have time to build and test specialized LDAP schemas. To do this right, we need more time.
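
To make the problem concrete: a stock schema gives each user exactly one home directory value, so a query like the following (names invented for illustration) returns the same path no matter which server asks:

ldapsearch -x -ZZ -H ldap://ldap.example.com -b "dc=example,dc=com" "(uid=jdoe)" homeDirectory
# returns something like:
#   homeDirectory: /home/jdoe
# which is right for one of our servers and wrong for the others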

By the same token, I'm still stuck — fixated, even — on the idea of reducing as many of the authentication servers and databases, and thus a good deal of the confusion, as I possibly can. Authenticating to our mail server may just be the ticket, if only temporarily.

The mail server, it turns out, already hosts authentication for a couple other servers. And it can — and is now — hosting authentication for our CMS. That leaves only two other systems independently hosting user data on the external network: the reservations system (running on its own MySQL user database) and the Quicktime Streaming server, which hosts local Netinfo accounts. Reservations is a foregone conclusion for now. It's a custom system, and we won't have time to change it before the semester starts. (Though it occurs to me that it might be possible for Reservations to piggyback on the CMS and use the CMS's MySQL database for authentication — which of course now uses the mail server to build itself — rather than the separate MySQL database it currently uses. But this will take some effort.) But if I can get the Quicktime Streaming Server to authenticate to the mail server — and I'm pretty hopeful here — I can reduce the number of authentication systems by one more. This would effectively reduce by more than half the total number of authentication systems (both internal ones — which are now all hosted by a Mac OS X server — and external ones) currently in use.

Right now — as of Thursday, August 10th, 2006 — we've gone live with the new CMS, and that brings us from eight authentication systems down to four. That's half what we had. That's awesome. If I can get it down to three, I'll be pleased as punch. If I can get it down to two, I'll feel like a superhero. So in the next couple weeks I'll be looking at authenticating our Quicktime server via NIS. I've never done it, but I think it's possible, either through the NIS plugin in Directory Access, or by using a cron-activated shell script. But if not, we're still in better shape than we were.
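
I haven't tried the cron route yet, but the idea would be something like this. Purely a sketch, assuming the mail server speaks NIS and the Quicktime server is bound to its NIS domain:

# nightly cron job (run as root): pull the NIS passwd map and merge it
# into the local NetInfo database
ypcat passwd | niload -m passwd .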

Presenting the new system to the users this year should be far simpler than it's ever been, and new user creation should be a cakewalk compared to years past. And hopefully by next year we can make it even simpler.

FOLLOW-UP:
It's been several weeks since I wrote this article, and I'm happy to report that all is well with our Joomla port and the hack that allows us to use our mail server for authentication. It's been running fine, and has given us no problems whatsoever. With the start of the semester slamming us like a sumo wrestler on crack, I have not had a chance to test any other servers against alternative authentication methods. There's been way too much going on, from heat waves to air conditioning and power failures. It's been a madhouse around here, I tell ya. A madhouse! So for now, this project is on hold until we can get some free time. Hopefully we can pick up with it again when things settle, but that may not be until next summer. In any case, the system we have now is worlds better than what we had just a few short months ago. And presenting it to the users was clearer than it's ever been. I have to say, I'm pretty pleased with how it's turning out.

Three Platforms, One Server Part 12: AFP Network Home Accounts

I hit another minor but infuriating snag in my plan to unify the network, though this one was all Apple. It's another case of Apple making serious changes to the way you're supposed to set up your server and clients and never really trumpeting much about it. Seems everything I used to do with my server and clients is done either slightly — or in some cases radically — differently in Tiger than it was in Panther. I must admit, I never checked the manuals on this, but something as simple as setting up AFP networked home accounts has become a much more complex process in Tiger than it ever was in Panther, and it took me quite a while to figure out what I had to do to make it work like it did in the Panther glory days.

Now, a reminder: we don't really use AFP networked home accounts for most users. Our users' home accounts live on an NFS server — a separate machine from our authentication server — which is auto-mounted on each client at boot. The only value the authentication server stores for most users' home accounts is where those home accounts can be found on the client machines, which in our case is /home. So I haven't had to worry too much about AFP network home accounts. Until last week.
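
You can see exactly what the server stores for a given user with dscl. A sketch, with an invented username:

dscl /LDAPv3/127.0.0.1 -read /Users/jdoe NFSHomeDirectory
# on our setup this just returns the local mount point:
#   NFSHomeDirectory: /home/jdoe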

There is one exception to the above standard. Certain Macromedia products do not function properly, for some reason, when the user's home account is located on our NFS server. In particular, Flash is unable to read the publishing templates, effectively disabling HTML publishing from the app. This has been a long-term problem and has affected every version of Flash since we moved to our NFS home account server almost three years ago. Our solution has been to create a special user — the FlashUser — whose home account lives in an AFP network home account. When people need to work with Flash, they are encouraged to use this FlashUser account so that they can use the publishing features. This is inconvenient, but it works and we're used to it, so we'll keep doing it until we find a better solution. Unfortunately, when I built my master authentication server (actually, when I rebuilt it) I forgot to add the FlashUser. The teacher of the Flash class eventually came to my office and asked about the account, and I told him it should just take a minute or two to get it set up. Boy, was I wrong.

The FlashUser account was always a simple AFP network user account. The user's account information and actual home account data were both stored on the same server, the data shared via an AFP share point and configured in Workgroup Manager's "Sharing" pane as an auto-mounted home folder. According to this article, guest access must be enabled on the AFP share for it to auto-mount. Well, that's new in 10.4, and I had no idea. And beyond that, I think it sucks. Why should guest access be enabled on a home account share point? This didn't used to be the case, and it seems way less secure to me in that, by doing this, you open the entire AFP home account share to unauthenticated access. Bad. Why in the hell would they change this? Not only is it less secure, but it breaks with tradition in ways that could (and in my case did) cause serious headaches for admins setting up AFP network home accounts.

Fortunately, after many hours of trial and error, I discovered a way to accomplish the same thing without enabling guest access on the share. It is possible, but it's quite a pain. Nevertheless, it's what I did.

Guest access can be forgone if trusted directory binding is set up between the client and the server. (This still makes no sense to me. You either have to go totally insecure, or set up an insanely secure system. Seems like we could skip the trusted binding thing if you'd just let us set up the shares sans guest access like we used to do.) Trusted binding is a bit of a pain to set up in that, as far as I know at this point, the only way to do it is to go to every machine, log in, and run the Directory Access application. Apple really, really, really needs to give us admins some command-line tools for controlling DA. The current set is paltry at best, though I do need to look and see if there is one for setting up client-server binding. (There might be, in fact: it might be called dsconfigldap, though that tool cannot be used for setting authentication sources, for some ungodly reason, and I have yet to try it for the purposes of directory binding.) But before you set up your clients, you must be sure "Enable Directory Binding" is checked on your server in the Server Admin application. By default it's not. And, at least in my case, after enabling directory binding, I had to restart my server. Fun stuff. Also, I'm fairly certain this requires a properly functioning Kerberos setup on the server, so if you're just starting out, be sure to get Kerberos running right.
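
Since I haven't actually tried dsconfigldap for this, treat the following as pure speculation: a sketch of what an authenticated (trusted) bind from a client might look like, with every name invented:

sudo dsconfigldap -f -a server.example.com -c clientname -u diradmin -p 'dirpassword'
# -f forces an authenticated bind, -a adds the server config, -c names the
# computer record, and -u/-p are a directory administrator's credentials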


[Image: Directory Access: Enable Directory Binding]

Next you need to go to your client machines one by one and bind them to the server. If you've already joined your clients to a server for authentication, you can simply navigate to the server configuration in DA, click "Edit..." to edit it, and you'll be presented with a "Bind..." button that will walk you through the process. If you haven't yet joined a client you will be asked to set up trusted directory binding when you do. From there you need to enter the hostname of the client machine, the name of a directory administrator on your server, and that user's password. In my case, I needed to also reboot the client machine. Like I said, kind of a pain.


[Image: Directory Access: Configure LDAP]

[Image: Directory Access: Edit Server Config]

[Image: Directory Access: Hit "Bind..." to Set Up Trusted Directory Binding]

But that's it. You're done. (Well, 25 computers later, that is.) I now have my FlashUser set up again, and auto-mounting its home account in the usual way (none of that "Guest Access" shit), albeit with considerably more effort than it took in Panther. This is, in my opinion, just one more reason to hate Tiger, which I generally do, as long-term readers are well aware. It's another case in which Tiger has gotten harder and more complicated to use and set up, with no apparent advantage, or at least no immediate one.

I can only hope this is going somewhere good, because from my perspective the basic things I've come to rely on in both Mac OS X Server and Client (can you say "search by name?") have gotten harder to do in Tiger. Significantly harder. And that's too bad, 'cause that's just not what I'm looking for from Apple products. If I wanted something difficult to use, I'd switch to Linux. (I'd never switch to Windows.) And if Apple software continues along its current trajectory — the Tiger trend towards a more difficult OS — and Linux software continues along its current trend towards easier software, you may see more Linux switchers than ever. Apple's draw is ease-of-use. The more they move away from that idea in the implementation of their software, the less appealing they become, to me personally, as a computing platform, and the less distinguishable they become from their competitors.

But for now, I'm sticking by my favorite computer platform. Panther was an amazing OS — it truly "just worked". It's kind of sad that Tiger has been the OS that's made me start considering, however peripherally, other options. Here's hoping they do better with Leopard.

BTW, here are a few more links of interest regarding the topic of AFP network home accounts in Tiger. I don't have time to read them right now, but they may prove interesting at some point.

Creating a Home Directory for a Local User at a Server
Creating a Custom Home Directory
Automatically Mounting Share Points for Clients
Setting Up an Automountable AFP Share Point for Home Directories

UPDATE:
A fellow SysAdmin and blogger, Nigel, from mind the explanatory gap (one of my systems faves) has some beefs with this article, and shares some of his own experiences which are quite contrary to what I've reported. Just to be clear, this article reflects my own experiences, and there's a bit more to my own scenario than I shared in the article, mainly because I didn't think the details were important. They may be, and I discuss it at greater length in my response to Nigel's comment. Thanks, Nigel, for keeping me on my toes on this one.

I'm not really sure what's going on here, but if I get to the bottom of it, I'll certainly report about it, either here or in a follow up post. But please read the comments for the full story, as there's really a lot missing in the article, and things start to get clearer in the comments.

Three Platforms, One Server Part 11: From the BDC to the lookupd

Well, I did not have time to test my replica on Windows clients. I did, however, set up my BDC (Backup Domain Controller) on said replica and re-bind my Windows clients to the new master server once the replica was in place. Oddly, after doing so, Windows logins got a bit sketchy: sometimes they worked, sometimes not. I just figured it was a fluke and would go away. (Yes, that's my version of the scientific method. But that's the sucky thing about intermittent problems. Very hard to track, or even — as in this case — be sure they exist.)

Anyway, today the Windows login flakiness persisted, and was starting to really be a problem. So I investigated. A good friend and colleague recommended I check Windows' Event Viewer (which I did not know about — hey, I'm still new at Windows — until today). There I saw messages that referenced problems connecting to the replica, which shouldn't have been happening. Thinking this might have something to do with the login problems, I turned off Windows services on the replica. Sure enough, Windows logins immediately began working perfectly.

I had only two Windows services running on the replica: the BDC, which is supposed to provide domain controller failover should the PDC (the Primary Domain Controller on the master server) cease to function; and Windows file sharing, which hosts a backup of our Windows roaming profile drive. I'm not sure which service caused the problem, as I simply stopped all services. So when I get a chance I will test each service individually and see which is the culprit. Hopefully it's the file sharing, because if we can at least keep the BDC running, we have some authentication failover: in the event of a master failure, users would still be able to log in, though their roaming profiles would be temporarily unavailable. If it's the BDC causing problems, then we effectively have no failover for Windows systems, which would blow big, shiny, green chunks. If that's the case, I give up. Yup, you heard me. I give up. With no clues, failing patience, a serious lack of time, and no good time to test it, I'd pretty much be giving up on the BDC, at least until I got some better info or until the winter break. Or both. For all I know, this is a bug in Tiger Server.
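
For the record, the stopping and starting can at least be scripted. A sketch, assuming Tiger Server's serveradmin tool, which lumps the domain controller and Windows file sharing together under the smb service:

sudo serveradmin stop smb      # stop all Windows services on the replica
sudo serveradmin status smb    # confirm they're down
sudo serveradmin start smb     # bring them back when testing is done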

On the plus side, I was able to observe some good behavior for a change on my Mac clients. In the last article I'd mentioned that it's the clients that are responsible for keeping track of the master and replica servers, that they get this info from the master when they bind to it, and that this info was probably refreshed automatically from time to time. Well, this does indeed seem to be the case. Mac clients do periodically pull new replica info from the master, as evidenced by the presence of the replica in the DSLDAPv3PlugInConfig.plist file where once none existed, and on machines I'd not rebound. Nice. Guess I won't be needing to rebind the Mac clients after all. For those interested in theories, I believe this gets taken care of by lookupd. If I understand things correctly, lookupd manages directory services in Mac OS X, particularly the caches for those services. Mac OS X caches everything, and in Mac OS X, even Directory Service settings are cached. DNS. NetInfo. SMB, BSD, NIS. All cached. Most of these caches — like DNS, for example — have pretty short life spans. But some services don't need refreshing so often. Things like Open Directory services stay in cache a bit longer. There's even a way to check and set the time-to-live for various service caches, but I'm not quite there yet. But I believe it's lookupd that grabbed the new settings from the server, or at least expired the cache that tells the client to go get those settings. In any case, there's a lookupd command I've found increasingly handy when you've just made changes to certain settings and they're not yet active on your system:

sudo lookupd -flushcache

This will, obviously, flush the lookupd cache and refresh certain settings. For instance, DNS changes sometimes take a bit of time to become active. This command will take care of that lag. My favorite use, though, is on my server. See, when I create a new OD user, I use the command dscl. Apparently, using the Workgroup Manager GUI will flush the cache, and the user will be instantly recognized by the system. Smart. But if, like me, you use a script that calls dscl to add or remove an OD user (remember, OD users are managed by a directory service, as are local NetInfo users, for that matter), the system won't become aware of said user until the cache is flushed. I used to add a new user, id them, and sometimes get nothing for the first few minutes. Or freakier, delete a user and still be able to id them in the shell. Until I found out more about lookupd I thought I was going crazy. Now I just know to include the above command in my AddUser and DeleteUser scripts. Nice. Nicer still to know I'm not losing my mind. At least not in the case of adding or removing users.
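
For the curious, here's roughly what that looks like. This is a sketch, not my actual script; the username, IDs, and credentials are all placeholders:

# create the OD user record via dscl, authenticating as the directory admin
sudo dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe
sudo dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe UniqueID 1042
sudo dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe NFSHomeDirectory /home/jdoe
# flush the cache so the new account is visible right away
sudo lookupd -flushcache
id jdoe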

Anyway, when I get 'round to my final Windows services tests, I will post an update.

God, I'm sick of this replica shit.

Three Platforms, One Server Part 10: The Saga Continues

Last week I was having all manner of problems setting up a replica of my master authentication server. After a great deal of effort I believe I have solved these problems, though I won't know for sure without further testing, which I will be unable to perform anytime soon due to the beginning of the Fall semester, which is Tuesday.

Getting this working has been a saga of nearly epic proportions. I'd long suspected that my replica problems were due to latent Kerberos problems on the master, which occurred in the initial stages of its installation and setup, and I still believe this is indeed the case. Loath as I was to rebuild a seemingly perfectly working master, I was considering trying it when circumstances beyond my control, and I believe completely unrelated to the problems I was having, forced my hand. Towards the end of the week the master server committed suicide.

I was doing some work on the server when I noticed that the system drive was inexplicably full. Odd, considering that this is a 20GB partition, and the system, apps, and my files should have been using no more than 6GB. Where had all that space gone? Some quick investigating led me to the folder /var/spool/atprintd, which was hogging about 10-15GB of disk space. (I'm recalling all this from memory, so details might not be perfectly accurate.) A surely related symptom was that, whenever I went to set print preferences for computers in Workgroup Manager's "Computers" list, I'd get a spinning wheel, but never a graphic or access to the control. Didn't think much of it 'til I discovered this print queue spooling dangerously out of control. Still, no big, I figured, so I deleted the spool files, deleted the queue (I wasn't even using it), and rebooted the machine. Problem solved, right? Wrong.
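
(For anyone playing along at home, the quick investigating was nothing fancier than walking the filesystem with du, something like:)

sudo du -sk /var/spool/* | sort -n
# the biggest number at the bottom of the list is your disk hog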

After the reboot, the master server formerly known as "Open Directory Master" listed its role as "Connected to a Directory System" in Server Admin's Open Directory settings pane. Huh? Exactly what Open Directory system was it connected to? Itself, apparently, as Directory Access showed it as being connected to 127.0.0.1, which is what it should be set to if it's a master. For some reason the server had somehow relinquished its role as master and handed it over to none other than itself. Talk about schizophrenic. Now, I have no idea — nor time to figure out at this point — how all this role stuff is managed down in the guts of the system. In fact, with the semester looming and all the replica problems I'd been having, I took this as a sign that maybe it was time to rebuild the server. And that's what I did.

Mac OS X Server, for all its relative ease of use, sure is amazingly finicky about a great many things when it comes to installation and initial setup. And once it's set up wrong, it seems the only way to reset all its various databases and services is to wipe it clean and start from scratch. There are settings (like the IP address) that get so ingrained into every database and config file that it's nearly impossible to find every instance and change it. Kerberos, the LDAP database, and the UNIX flat files all need to accurately know things like the IP address and the Fully Qualified Domain Name of the server before you start any services. And DNS had better be working perfectly somewhere on the network your OS X Server is on, and it had better have a perfect record for your server, or you're going to be screwed. Maybe not so screwed that you can't fix the problem, but probably screwed enough that rebuilding your server from scratch is the preferable option. I believe what caused all my problems — and I haven't proven this yet, but I believe it to be so — was a dot. That's right, a dot. One, single, stupid little dot I forgot to enter in my reverse DNS table. You know the one. You've surely forgotten it yourself at some point. The entry should look like:
34.1 IN PTR server.systemsboy.domain.com.

But instead you wrote:
34.1 IN PTR server.systemsboy.domain.com

Oops! See the difference? No dot. No period at the end of the sentence. Kids, punctuation counts in DNS. And DNS counts on Mac OS X Server. Big time.
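
Tiger Server actually ships with a sanity check for exactly this sort of thing, and I wish I'd leaned on it sooner. Run it before you promote anything:

sudo changeip -checkhostname
# compares the server's idea of its own hostname against what DNS says,
# and tells you whether or not the two agree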

So now you're fucked and you don't even realize it. You go and you build your server, and things seem fine, and then all of a sudden the simplest of things — building your replica — suddenly don't work, and you've no idea why. There are no error messages, no warning dialogs. Things just don't work.

There were some other facts that I learned over the course of this saga as regards the finickiness of Mac OS X Server. In Tiger Server 10.4+, if you want to host a replica, Kerberos has to be running properly. The Apple folks will tell you that once you set your server to "Open Directory Master," Kerberos will just set itself up and start automagically, which it will, as long as your DNS is set up right. Otherwise it will not do anything, and, as per the manual, you can start it up later, by hand, in Server Admin. But if you've already gone and done a bunch of other Kerberos-dependent stuff on your server — stuff like adding users, or creating a replica — and you then try to Kerberize your server, you're most likely quite fucked. And there's no easy way out. Your best bet is to rebuild the server from scratch. The alternative is to scour the numerous databases and config files in an attempt to correct the problem. But you'll never be sure it's right. So you rebuild the server.
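
A few quick ways to tell whether Kerberos is actually alive on the master before you build anything on top of it (the admin name here is a placeholder):

ps axww | grep krb5kdc | grep -v grep    # the KDC process should be running
kinit diradmin                           # should hand you a ticket without complaint
klist                                    # and the ticket should show up here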

I rebuilt the server.

It wasn't so bad. I'm getting really good at batch importing users into Open Directory (which will be the subject of a later post). The worst part was that the Windows machines all had to be rejoined to the new server by hand. Otherwise, with a proper DNS configuration — and, just to be on the safe side, with the master and the replica added to the /etc/hosts table — rebuilding the server was as easy as it's supposed to be. And now I have a server in which I have a great deal more confidence. Kerberos, in particular, seems to be working properly now. The weird error messages and inability to authenticate via Kerberos using the OD admin user are finally gone.

If I may make an aside to wax bitchy here for one moment: it seems to me that Apple needs to make a much bigger deal of all these dependencies. Proper server operation — especially Kerberos configuration — is heavily reliant on proper DNS configuration. I've known that for a while, mostly from forums and admin sites rather than from Apple's server documentation. But also, replication is dependent upon proper Kerberos configuration, which I never knew until I scoured the forums. I don't even use Kerberos, so I never worried about it being configured properly. And back in Panther you could easily replicate without it. That's changed, and it would be nice if someone hollered about it quite loudly in the manuals.

Also, it would be really great to have a way to put everything back to spec. FQDN assignment is done automatically by the server, and you no longer have the option to do it by hand. This happens quite early in the boot process and is completely dependent upon proper DNS. But if you screw up your DNS, there appears to be no safe way to go back and redo the FQDN assignation. Everything gets hosed by one little mistake. I'd like a button that resets everything on the server — Kerberos, FQDN, IP address and OD database — back to the settings just after a clean install, and then brings you to the Setup Assistant to redo all the settings. At least then you wouldn't have to reinstall everything. (You do realize, of course, that there is no "Archive and Install" feature for Mac OS X Server. So any reinstall requires the multi-megabyte software updates, and the re-creation of any users, shares, computers and the like. Which is a big, fat, nasty pain.)

Okay, so the burning question: What about the replica? Did the rebuild yield a usable replica? The answer is not pretty: I think so.

Once I had my master server rebuilt (and cloned for easy restore!) I tried the replica. Replication seemed to go fine, but then, it did before too, so that wasn't necessarily reassuring. Once the replica was built I pulled the plug on the master. This time, things were even worse than before. No authentication was able to take place on the clients. They just didn't see the replica. But why?

One of the things I learned through all of this is that it's the client that maintains the list of replicas. It gets this list from the master OD server when it binds to said server — i.e., when you open Directory Access and set it up — and it probably also refreshes the settings periodically, perhaps when the mcx cache is refreshed, but I'm not really sure. You can see all these settings in one particular preference file, called DSLDAPv3PlugInConfig.plist, located in /Library/Preferences/DirectoryService. One of the things you'll see in this file is the IP address of your master server, and if there are any replicas, they will be listed as well. Checking this file on my clients showed that they had not gotten the replica info from the new server. They had no idea about the replica. So I plugged the master back in, and re-bound a client machine. It got the replica info. I unplugged the master again, and after a few minutes, my client could again authenticate to the replica. It worked! Finally!
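
You can peek at that file without trawling through raw XML. A sketch (note that defaults wants the path without the .plist extension):

sudo defaults read /Library/Preferences/DirectoryService/DSLDAPv3PlugInConfig
# look for your master's address, and for the replica list alongside it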

Now all of this occurred on Friday at around 9PM. I'd worked about 52 hours by that point. Testing the replica on a larger scale, and on Windows in particular, would have required effort I was no longer capable of nor willing to exert — i.e. un-joining Windows systems from the master, removing them from the computer list, rejoining them, pulling the plug, testing the behavior, and the like, all of which would have taken at least another hour or so to do properly. And whether it worked or not, I'd still have to go rebind every machine one by one the following week, just to be safe. So I quit. And that's why, when you ask me, "Did the replica work?" I can only answer, "I think so."

To qualify: I think we have a winner. It seems to work on the Macs, and I think it will work on Windows (though how well it works on Windows, and whether it works like we hope it does, is anyone's guess). The previous difficulties seem to have been caused by Kerberos-related problems on the master OD server, but I haven't thoroughly tested it yet. And if it doesn't work, then that's that. We'll probably give up on the whole replica idea for the semester, since the only way to test it is to effectively bring down the network, and I'm not so willing to pull plugs when school is in session. So next week I will test a Windows machine if I can find a good time to do it. Either way, I will be rebinding all Macs and Windows machines to the new server in the hopes that replication is indeed working. But if there's no time to test it, we may not know for sure until the master server dies. Sure. It's a little scary. But I've seen worse.

One last thing: I did have the strangest problem with my latest master/replica setup during testing. After pulling the plug on the master, and then plugging it back in, I was unable to authenticate to the master as the directory admin user from WGM, Server Admin or ARD. I was able to ssh in, but I could not reboot the machine from the shell because sudo from both the directory admin account and the local admin account would not accept the root password. I was totally locked out. The only thing I could do was hard reboot the server by holding the power button on the machine. Fortunately, after doing so, the server began behaving normally again, and root access was again right with the universe. Sure did make me nervous though.

Anyway, thus ends, essentially, the saga of our replica, at least for now, as well as, for the most part, the saga of Three Platforms, One Server. We're now in the final stage of the project: going live. Class begins on Tuesday, and we'll get to see how our master authentication server holds up under load. If disaster ensues, expect more posts on the subject. If all goes well — knock wood with me, people — I will post a final round-up article and this series will be concluded. Also, if I get a chance to more thoroughly test my new replica I will post the results either here or in another article.

Okay, everyone. Have a great school year!

Three Platforms, One Server Part 9: Replica Problems

This post will be a short one. Promise. We're finishing up this project, and so far it looks like it's going to be a success. We're just adding the last little bits and finishing touches right now, but we've been putting our internal authentication server through its paces for the past couple of months, and all seems well.

In a few places along the travails of this series I mentioned the need for failover in the event our master authentication server goes belly up. The rationale is that, with all our eggs in the one server basket, if our master goes down, no one can log in — not on Windows, not on Mac, not on Linux. Fortunately, Mac OS X Server allows for what's known as a replica. A replica is a read-only copy of your Open Directory data hosted by another server. But it's a bit more than that. The replica also provides failover. That is, if the master server goes down, the replica knows to take over and start serving authentication to the clients. The replica, in effect, becomes the master in the event of the master's absence, until said master returns to service, at which point the replica gracefully hands control back to the master. You can actually have multiple replicas for failover, and for redundancy in the event of slow or separate networks. Brilliant! I've wanted one for a long time. And now I have it.

Setting up a replica is easy as pie: set the server's "role" to "Replica," point it at its master, authenticate, and you're off to the races. That is, if you've set everything just perfectly on both your master server and the replica. That's the thing about Mac OS X Server, and always has been: if you set it up wrong initially, you're in for a potential world of hurt. As I rediscovered last week.
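
(I did all this in Server Admin, but I believe there's a command-line route as well, via slapconfig. Consider this unverified on my part, with an invented hostname:)

sudo slapconfig -createreplica master.example.com diradmin
# run on the would-be replica; it should prompt for the directory admin's password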

After building my replica, I next went on the requisite testing spree. The test involved physically pulling the network cable from the master server, and then observing the behavior of the clients. Initially, the replica seemed to do a fine job of picking up right where the master left off. After about two minutes' time, clients could log right back in. But after about five minutes' time, nearly every client on our network beachballed indefinitely, and any attempt to log in would hang the machine. This hanging was so sudden and so thorough, it actually froze my machine mid-cube-effect as I attempted to fast-user-switch to the login window. Some failover! Not cool.


[Image: Fast User Switch: Frozen in Time]

So I've spent a great deal of time figuring out the solution. The logs were no help. Google turned out to be a wash. Manuals? Pfft! For the first time in quite a while, it was an Apple Knowledge Base article that offered the fix. The article was written for people who were having problems getting a 10.4.2 replica to remain a replica. Apparently there was an issue that involved these servers reverting back to "Standalone" roles after being switched from "Replica" to "Standalone" and back again. Though this was not my problem, nor did any of the symptoms reflect what I was seeing, I finally decided to try the recommendations in the article, as they seemed fairly universally applicable, and as I was desperate and had tried everything else.

The article essentially details methods for cleaning out every part of the master database that references the replica, and then re-promoting the replica to a clean master. Honestly, my master database looked pretty clean to me, but there was one bit of advice that I was not aware of, and it's my suspicion that this was what did the trick for me: the OD administrator of the replica's database can NOT have the same UID or short name as that of the master. The article recommends creating a separate OD admin account on the master, and using this separate account when binding the replica to the master. Honestly, I had no fucking idea. Would've been nice if this had been more explicitly mentioned in the manuals. Believe me, I've read the section entitled "Setting Up an Open Directory Replica" numerous times by now. It's not there. Fortunately, it's in that article, and now I know (and knowing's half the battle).

And now you know.

So, in a couple weeks we go live with our unified internal network. I'll let you know how it goes.

UPDATE 1:
Actually, my replica seems to still not be working, at least not very reliably. What happens is that about two minutes after the plug is pulled on the master, the replica picks up. At this point, clients — both Mac and Windows — can successfully log in. Shortly thereafter — maybe four minutes later — we start having problems: the Macs can't log in, or only some can log in; Windows machines log in intermittently — one time it works, the next time it doesn't, it works after a reboot, then it doesn't; and, perhaps strangest of all, network connections to the Macs — ARD, ssh, anything but ping — become all but impossible, hanging at the attempts. This is totally a guess, but it seems to me like the clients are having serious trouble binding to the replica. They keep attempting to do so, with some initial or intermittent success, and in their attempts network connections get locked up and the machines bog down. It's almost like the replica server is saying, "Yes, you can bind to me," and then changing its mind and saying, "Wait, no you can't. Never mind. Screw you." Again, I'm only guessing. There is nothing clear-cut in the logs, and I can't find anything in Apple's Discussion forums or Knowledge Base that specifically addresses my problem.

I only pray that it isn't a problem with my master server, but the master works perfectly, and it seems to me that a replica of a perfectly working master should work perfectly. The current replica is running on a Mac mini with limited RAM and a 10/100BT NIC, and I want to rule out potential problems that might be caused by the hardware as well as the software setup. So my next step will be to build a new replica from scratch on a G5. I'll let you know if that solves the problem.

Another thing I absolutely should mention: For Windows replication, the replica server must be set up as a Backup Domain Controller (BDC). This is done in the Server Admin application in the Windows section. It's fairly straightforward to set up, so I won't go into detail, but just for the record, it's important to be aware of this, and I wasn't until recently, so I mention it here for completeness' sake.
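
If you'd rather check this from the shell than click through Server Admin, the Windows service settings are visible via serveradmin. The exact key name for the domain controller role escapes me, so treat this as a starting point only:

sudo serveradmin settings smb
# scan the output for the domain controller role setting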

Having this replica isn't absolutely critical to our plan. That is, we can go forward with this plan without the replica. But having a working replica will provide an important safety net that I'd really like to have working as the semester begins. There's no good way to test it while the semester is in session. So I'm working hard to get it up and running in the final week of the summer.

So much for this post being short. More to come.

UPDATE 2:
I built the new server today. From scratch. On a G5. No joy. I honestly don't know what the problem could be. I can only guess that something either broke with the latest 10.4.7 update, or that there's something slightly off with my master server and it's causing problems on the replica. But it's weird, because if I bind directly to the replica using Directory Access it works perfectly, which leads me to suspect a problem on the client. But it affects Windows machines as well, so that doesn't quite figure either. I hate to admit it, but I'm stumped. And, unfortunately, I don't have time to worry about it right now. I will revisit this issue at a later date, when I get some time. When I do, I'll probably post a new entry with the solution. That is, if I'm able to find a solution. I hate this kind of thing. Is anyone else having a similar problem? I feel like I'm going nuts. And I can't believe I've spent so much time on something that should be really, really easy. Fuck. What a bummer.

UPDATE 3:
I've pretty much given up on this for now. No time. And no good time to test, what with the students coming back next week. But today I noticed something strange, and it occurred to me it might be related to my replica problems. Today, when trying to make an AFP connection to the master server from a client using a simple "Connect to Server..." I got a Kerberos prompt that refused my admin credentials. Hmmm... Kerberos problems... On the master... Not good... So who knows? I may be rebuilding the master server at some point. But not now. Oh, Lordy, not now.