Three Platforms, One Server Part 11: From the BDC to the lookupd

Well, I did not have time to test my replica on Windows clients. I did, however, set up my BDC (Backup Domain Controller) on said replica and re-bind my Windows clients to the new master server once the replica was in place. Oddly, after doing so, Windows logins got a bit sketchy: sometimes they worked, sometimes not. I just figured it was a fluke and would go away. (Yes, that's my version of the scientific method. But that's the sucky thing about intermittent problems: they're very hard to track, or even, as in this case, to be sure they exist.)

Anyway, today the Windows login flakiness persisted, and was starting to really be a problem. So I investigated. A good friend and colleague recommended I check the Windows Event Viewer (which I did not know about until today; hey, I'm still new at Windows). There I saw messages that referenced problems connecting to the replica, which shouldn't have been happening. Thinking this might have something to do with the login problems, I turned off Windows services on the replica. Sure enough, Windows logins immediately began working perfectly.

I had only two Windows services running on the replica: the BDC, which is supposed to provide domain controller failover should the PDC (the Primary Domain Controller on the master server) cease to function; and Windows file sharing, which hosts a backup of our Windows roaming profile drive. I'm not sure which service caused the problem, as I simply stopped them both. So when I get a chance I will test each service individually and see which is the culprit. Hopefully it's the file sharing, because if we can at least keep the BDC running, we have some authentication failover: in the event of a master failure, users would still be able to log in, though their roaming profiles would be temporarily unavailable. If it's the BDC causing problems, then we effectively have no failover for Windows systems, which would blow big, shiny, green chunks.

If that's the case, I give up. Yup, you heard me. I give up. With no clues, failing patience, a serious lack of time, and no good time to test it, I'd pretty much be giving up on the BDC, at least until I got some better info or until the winter break. Or both. For all I know, this is a bug in Tiger Server.

On the plus side, I was able to observe some good behavior for a change on my Mac clients. In the last article I'd mentioned that it's the clients that are responsible for keeping track of the master and replica servers, that they get this info from the master when they bind to it, and that this info was probably refreshed automatically from time to time. Well, this does indeed seem to be the case. Mac clients do periodically pull new replica info from the master, as evidenced by the presence of the replica in the DSLDAPv3PlugInConfig.plist file where once none existed, and on machines I'd not rebound. Nice. Guess I won't be needing to rebind the Mac clients after all.

For those interested in theories, I believe this gets taken care of by lookupd. If I understand things correctly, lookupd manages directory services in Mac OS X, particularly the caches for those services. Mac OS X caches everything, even Directory Service settings. DNS. NetInfo. SMB, BSD, NIS. All cached. Most of these caches (DNS, for example) have pretty short life spans. But some services don't need refreshing so often. Things like Open Directory services stay in cache for a bit longer. There's even a way to check and set the time-to-live for various service caches, but I'm not quite there yet. But I believe it's lookupd that grabbed the new settings from the server, or at least expired the cache that tells the client to go get those settings. In any case, there's a lookupd command I've found increasingly handy if you've just made changes to certain settings and they're not yet active on your system:

sudo lookupd -flushcache

This will, obviously, flush the lookupd cache and refresh certain settings. For instance, DNS changes sometimes take a bit of time to become active. This command will take care of that lag. My favorite use, though, is on my server. See, when I create a new OD user, I use the command dscl. Apparently, using the Workgroup Manager GUI will flush the cache, and the user will be instantly recognized by the system. Smart. But if, like me, you use a script that calls dscl to add or remove an OD user (remember, OD users are managed by a directory service, as are local NetInfo users, for that matter), the system won't become aware of said user until the cache is flushed. I used to add a new user, id them, and sometimes get nothing for the first few minutes. Or, freakier, delete a user and still be able to id them in the shell. Until I found out more about lookupd I thought I was going crazy. Now I just know to include the above command in my AddUser and DeleteUser scripts. Nice. Nicer still to know I'm not losing my mind. At least not in the case of adding or removing users.
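
For the curious, here's roughly what that looks like in script form. This is only a sketch, not my actual AddUser and DeleteUser scripts: the node path (/LDAPv3/127.0.0.1), the admin name (diradmin), and the DRY_RUN switch are all assumptions made for illustration, and a real user record needs more attributes (UID, group, home, shell, password) than the bare -create shown here. It also assumes you run it as root, since flushing the cache needs that.

```shell
#!/bin/sh
# Sketch only: create or delete an OD user record with dscl, then flush
# the lookupd cache so the change shows up right away.
# ASSUMPTIONS (hypothetical, adjust for your setup): the node path,
# the directory admin name, and the DRY_RUN switch.

OD_NODE="${OD_NODE:-/LDAPv3/127.0.0.1}"
OD_ADMIN="${OD_ADMIN:-diradmin}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a bare user record, then flush the cache.
add_od_user() {
    short_name="$1"
    if [ -z "$short_name" ]; then
        echo "usage: add_od_user shortname" >&2
        return 1
    fi
    # -u names the directory admin, -p prompts for that admin's password.
    run dscl -u "$OD_ADMIN" -p "$OD_NODE" -create "/Users/$short_name" || return 1
    run lookupd -flushcache
}

# Delete a user record, then flush the cache.
delete_od_user() {
    short_name="$1"
    if [ -z "$short_name" ]; then
        echo "usage: delete_od_user shortname" >&2
        return 1
    fi
    run dscl -u "$OD_ADMIN" -p "$OD_NODE" -delete "/Users/$short_name" || return 1
    run lookupd -flushcache
}
```

Setting DRY_RUN=1 before calling add_od_user or delete_od_user just prints the dscl and lookupd commands instead of running them, which is a cheap way to sanity-check things before pointing the script at a live directory.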

Anyway, when I get 'round to my final Windows services tests, I will post an update.

God, I'm sick of this replica shit.