Three Platforms, One Server Part 12: AFP Network Home Accounts

I hit another minor but infuriating snag in my plan to unify the network, though this one was all Apple. It's another case of Apple making serious changes to the way you're supposed to set up your server and clients without ever really trumpeting much about it. Seems everything I used to do with my server and clients is done either slightly — or in some cases radically — differently in Tiger than it was in Panther. I must admit, I never checked the manuals on this, but something as simple as setting up AFP networked home accounts has become a much more complex process in Tiger than it ever was in Panther, and it took me quite a while to figure out what I had to do to make it work like it did in the Panther glory days.

Now, to remind, we don't really use AFP networked home accounts for most users. Our users' home accounts live on an NFS server — a separate machine from our authentication server — which is auto-mounted on each client at boot. The only value the authentication server stores for most users' home accounts is where those home accounts can be found on the client machines, which in our case is /home. So I haven't had to worry too much about AFP network home accounts. Until last week.

There is one exception to the above standard. Certain Macromedia products do not function properly when the user's home account is located on our NFS server, for some reason. In particular, Flash is unable to read the publishing templates, effectively disabling HTML publishing from the app. This has been a long-term problem and has affected every version of Flash since we moved to our NFS home account server almost three years ago. Our solution has been to create a special user — the FlashUser — whose home account lives on an AFP network home account share. When people need to work with Flash, they are encouraged to use this FlashUser account so that they can use the publishing features. This is inconvenient, but it works and we're used to it, so we keep doing it until we find a better solution. Unfortunately, when I built my master authentication server (actually, when I rebuilt it) I forgot to add the FlashUser. The teacher of the Flash class eventually came to my office and asked about the account, and I told him it should take just a minute or two to get it set up. Boy, was I wrong.

The FlashUser account was always a simple AFP network user account. The user's account information and actual home account data were both stored on the same server, the data shared via an AFP share point and configured in Workgroup Manager's "Sharing" pane as an auto-mounted home folder. According to this article, guest access must be enabled on the AFP share for it to auto-mount. Well, that's new in 10.4, and I had no idea. And beyond that, I think it sucks. Why should guest access be enabled on a home account share point? This didn't used to be the case, and it seems way less secure to me, in that by doing this you open the entire AFP home account share to unauthenticated access. Bad. Why in the hell would they change this? Not only is it less secure, but it breaks with tradition in ways that could (and in my case did) cause serious headaches for admins setting up AFP network home accounts.

Fortunately, after many hours of trial and error, I discovered a way to accomplish the same thing without enabling guest access on the share. It is possible, but it's quite a pain. Nevertheless, it's what I did.

Guest access can be forgone if trusted directory binding is set up between the client and the server (which still makes no sense to me. You either have to go totally insecure or set up an insanely secure system. Seems like we could skip the trusted binding thing if you'd just let us set up the shares sans guest access like we used to.) Trusted binding is a bit of a pain to set up in that, as far as I know at this point, the only way to do it is to go to every machine, log in, and run the Directory Access application. Apple really, really, really needs to give us admins some command-line tools for controlling DA. The current set is paltry at best, though I do need to look and see if there is one for setting up client-server binding (there might be; in fact, it might be called dsconfigldap, though this tool cannot be used for setting authentication sources, for some ungodly reason, and I have yet to try it for the purposes of directory binding). But before you set up your clients, you must be sure "Enable Directory Binding" is checked on your server in the Server Admin application. By default it's not. And, at least in my case, after enabling directory binding I had to restart my server. Fun stuff. Also, I'm fairly certain this requires a properly functioning Kerberos setup on the server, so if you're just starting out, be sure to get Kerberos running right.
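For the record, here's roughly what I'd expect a trusted bind via dsconfigldap to look like. I haven't tried this myself, so consider it a sketch rather than gospel: the server hostname, computer name, and credentials are placeholders, and the flags are from my reading of the man page (-a adds the LDAP server config, -f forces authenticated — that is, trusted — binding, -c sets the client's computer record name, and -u/-p are a directory admin's credentials).

sudo dsconfigldap -v -f -a server.systemsboy.domain.com -c client01 -u diradmin -p 'dirAdminPassword'

If that works as advertised, it could presumably be wrapped in a loop and pushed out over ssh or ARD, which would beat walking to 25 machines.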


Directory Access: Enable Directory Binding

Next you need to go to your client machines one by one and bind them to the server. If you've already joined your clients to a server for authentication, you can simply navigate to the server configuration in DA, click "Edit...", and you'll be presented with a "Bind..." button that will walk you through the process. If you haven't yet joined a client, you will be asked to set up trusted directory binding when you do. From there you need to enter the hostname of the client machine, the name of a directory administrator on your server, and that user's password. In my case, I also needed to reboot the client machine. Like I said, kind of a pain.


Directory Access: Configure LDAP

Directory Access: Edit Server Config

Directory Access: Hit "Bind..." to Set Up Trusted Directory Binding


But that's it. You're done. (Well, 25 computers later, that is.) I now have my FlashUser set up again, and auto-mounting its home account in the usual way (none of that "Guest Access" shit), albeit with considerably more effort than it took in Panther. This is, in my opinion, just one more reason to hate Tiger, which I generally do, as long-term readers are well aware. It's another case in which Tiger has gotten harder and more complicated to use and set up, with no apparent advantage, or at least no immediate one.

I can only hope this is going somewhere good, because from my perspective the basic things I've come to rely on in both Mac OS X Server and Client (can you say "search by name?") have gotten harder to do in Tiger. Significantly harder. And that's too bad, 'cause that's just not what I'm looking for from Apple products. If I wanted something difficult to use, I'd switch to Linux. (I'd never switch to Windows.) And if Apple software continues along its current trajectory — the Tiger trend towards a more difficult OS — and Linux software continues along its current trend towards easier software, you may see more Linux switchers than ever. Apple's draw is ease-of-use. The more they move away from that idea in the implementation of their software, the less appealing they become, to me personally, as a computing platform, and the less distinguishable they become from their competitors.

But for now, I'm sticking by my favorite computer platform. Panther was an amazing OS — it truly "just worked". It's kind of sad that Tiger has been the OS that's made me start considering, however peripherally, other options. Here's hoping they do better with Leopard.

BTW, here are a couple more links of interest regarding the topic of AFP network home accounts in Tiger. I don't have time to read them right now, but they may prove interesting at some point.

Creating a Home Directory for a Local User at a Server
Creating a Custom Home Directory
Automatically Mounting Share Points for Clients
Setting Up an Automountable AFP Share Point for Home Directories

UPDATE:
A fellow SysAdmin and blogger, Nigel, from mind the explanatory gap (one of my systems faves) has some beefs with this article, and shares some of his own experiences which are quite contrary to what I've reported. Just to be clear, this article reflects my own experiences, and there's a bit more to my own scenario than I shared in the article, mainly because I didn't think the details were important. They may be, and I discuss it at greater length in my response to Nigel's comment. Thanks, Nigel, for keeping me on my toes on this one.

I'm not really sure what's going on here, but if I get to the bottom of it, I'll certainly report about it, either here or in a follow up post. But please read the comments for the full story, as there's really a lot missing in the article, and things start to get clearer in the comments.

Three Platforms, One Server Part 11: From the BDC to the lookupd

Well, I did not have time to test my replica on Windows clients. I did, however, set up my BDC (Backup Domain Controller) on said replica and re-bind my Windows clients to the new master server once the replica was in place. Oddly, after doing so, Windows logins got a bit sketchy: sometimes they worked, sometimes not. I just figured it was a fluke and would go away. (Yes, that's my version of the scientific method. But that's the suck thing about intermittent problems. Very hard to track, or even — as in this case — be sure they exist.)

Anyway, today the Windows login flakiness persisted, and was starting to really be a problem. So I investigated. A good friend and colleague recommended I check Windows' Event Viewer (which I did not know about — hey, I'm still new at Windows — until today). There I saw messages that referenced problems connecting to the replica, which shouldn't have been happening. Thinking this might have something to do with the login problems, I turned off Windows services on the replica. Sure enough, Windows logins immediately began working perfectly.

I had only two Windows services running on the replica: the BDC, which is supposed to provide domain controller failover should the PDC (the Primary Domain Controller on the master server) cease to function; and Windows file sharing, which hosts a backup of our Windows roaming profile drive. I'm not sure which service caused the problem, as I simply stopped all services. So when I get a chance I will test each service individually and see which is the culprit. Hopefully it's the file sharing, because if we can at least keep the BDC running, we have some authentication failover: in the event of a master failure, users would still be able to log in, though their roaming profiles would be temporarily unavailable. If it's the BDC causing problems, then we effectively have no failover for Windows systems, which would blow big, shiny, green chunks. If that's the case, I give up. Yup, you heard me. I give up. With no clues, failing patience, a serious lack of time, and no good time to test, I'd pretty much be giving up on the BDC, at least until I got some better info or until the winter break. Or both. For all I know, this is a bug in Tiger Server.
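When I do get around to those tests, the serveradmin tool should let me toggle things from the shell instead of clicking around Server Admin. A minimal sketch, assuming the Windows services still live under the "smb" module in Tiger (individual pieces like the domain controller role versus plain file sharing are settings within that module rather than separate services):

sudo serveradmin status smb     # is the Windows service running?
sudo serveradmin stop smb       # stop all Windows services on this box
sudo serveradmin start smb      # bring them back up
sudo serveradmin settings smb   # dump the SMB settings, domain controller role included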

On the plus side, I was able to observe some good behavior for a change on my Mac clients. In the last article I'd mentioned that it's the clients that are responsible for keeping track of the master and replica servers, that they get this info from the master when they bind to it, and that this info was probably refreshed automatically from time to time. Well, this does indeed seem to be the case. Mac clients do periodically pull new replica info from the master, as evidenced by the presence of the replica in the DSLDAPv3PlugInConfig.plist file where once none existed, and on machines I'd not rebound. Nice. Guess I won't be needing to rebind the Mac clients after all. For those interested in theories, I believe this gets taken care of by lookupd. If I understand things correctly, lookupd manages directory services in Mac OS X, particularly the caches for those services. Mac OS X caches everything, and in Mac OS X, even Directory Service settings are cached. DNS. NetInfo. SMB, BSD, NIS. All cached. Most of these caches — like DNS, for example — have pretty short life spans. But some services don't need refreshing so often. Things like Open Directory services stay in cache for a bit longer. There's even a way to check and set the time-to-live for various service caches, but I'm not quite there yet. But I believe it's lookupd that grabbed the new settings from the server, or at least expired the cache that tells the client to go get those settings. In any case, there's a lookupd command I've found increasingly handy if you've just made changes to certain settings and they're not yet active on your system:

sudo lookupd -flushcache

This will, obviously, flush the lookupd cache, and refresh certain settings. For instance, DNS changes sometimes take a bit of time to become active. This command will take care of that lag. My favorite use, though, is on my server. See, when I create a new OD user, I use the command dscl. Apparently, using the Workgroup Manager GUI will flush the cache, and the user will be instantly recognized by the system. Smart. But if, like me, you use a script that calls dscl to add or remove an OD user (remember, OD users are managed by a directory service, as are local NetInfo users, for that matter), the system won't become aware of said user until the cache is flushed. I used to add a new user, run id on them, and sometimes get nothing for the first few minutes. Or freakier, delete a user and still be able to id them in the shell. Until I found out more about lookupd I thought I was going crazy. Now I just know to include the above command in my AddUser and DeleteUser scripts. Nice. Nicer still to know I'm not losing my mind. At least not in the case of adding or removing users.
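For the curious, here's a stripped-down sketch of what such an AddUser script boils down to. The username, UID, passwords, and attribute list are all placeholders and bare minimums, so adjust for your own setup:

#!/bin/sh
# Minimal sketch: create an Open Directory user with dscl, set a starter
# password, then flush lookupd so the system sees the account immediately.
# All names, IDs, and passwords here are placeholders.
U="jdoe"
dscl -u diradmin -P 'dirAdminPass' /LDAPv3/127.0.0.1 -create /Users/"$U"
dscl -u diradmin -P 'dirAdminPass' /LDAPv3/127.0.0.1 -create /Users/"$U" UniqueID 1501
dscl -u diradmin -P 'dirAdminPass' /LDAPv3/127.0.0.1 -create /Users/"$U" PrimaryGroupID 20
dscl -u diradmin -P 'dirAdminPass' /LDAPv3/127.0.0.1 -create /Users/"$U" NFSHomeDirectory /home/"$U"
dscl -u diradmin -P 'dirAdminPass' /LDAPv3/127.0.0.1 -create /Users/"$U" UserShell /bin/bash
dscl -u diradmin -P 'dirAdminPass' /LDAPv3/127.0.0.1 -passwd /Users/"$U" 'changeme'
sudo lookupd -flushcache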

Anyway, when I get 'round to my final Windows services tests, I will post an update.

God, I'm sick of this replica shit.

Three Platforms, One Server Part 10: The Saga Continues

Last week I was having all manner of problems setting up a replica of my master authentication server. After a significant amount of effort I believe I have solved these problems, though I won't know for sure without further testing, which I will be unable to perform anytime soon due to the beginning of the Fall semester, which starts Tuesday.

Getting this working has been a saga of nearly epic proportions. I'd long suspected that my replica problems were due to latent Kerberos problems on the master, which occurred in the initial stages of its installation and setup, and I still believe this is indeed the case. Loathe as I was to rebuild a seemingly perfectly working master, I was considering trying it when circumstances beyond my control, and I believe completely unrelated to the problems I was having, forced my hand. Towards the end of the week the master server committed suicide.

I was doing some work on the server when I noticed that the system drive was inexplicably full. Odd, considering that this is a 20GB partition, and the system, apps, and my files should have been using no more than 6GB. Where had all that space gone? Some quick investigating led me to the folder /var/spool/atprintd, which was hogging about 10-15GB of disk space. (I'm recalling all this from memory, so details might not be perfectly accurate.) A surely related symptom of all of this was that, whenever I went to set print preferences for computers in Workgroup Manager's "Computers" list, I'd get a spinning wheel, but never a graphic or access to the control. Didn't think much of it 'til I discovered this print queue spooling dangerously out of control. Still, no big deal, I figured, so I deleted the spool files, deleted the queue (I wasn't even using it) and rebooted the machine. Problem solved, right? Wrong.

After reboot, the master server formerly known as "Open Directory Master" listed its role as "Connected to a Directory System" in Server Admin's Open Directory settings pane. Huh? Exactly what Open Directory System was it connected to? Itself, apparently, as Directory Access showed it as being connected to 127.0.0.1, which is what it should be set to if it's a master. For some reason the server had somehow relinquished its role as master and handed it over to none other than itself. Talk about schizophrenic. Now, I have no idea — nor time to figure out at this point — how all this role stuff is managed down in the guts of the system. In fact, with the semester looming and all the replica problems I'd been having, I took this as a sign that maybe it was time to rebuild the server. And that's what I did.

Mac OS X Server, for all its relative ease-of-use, sure is amazingly finicky about a great many things when it comes to installation and initial setup. And once it's set up wrong, it seems the only way to reset all its various databases and services is to wipe it clean and start from scratch. There are settings (like the IP address) that get so ingrained into every database and config file that it's nearly impossible to find every instance and change it. Kerberos, the LDAP database, and UNIX flat files all need to accurately know about things like the IP address and the Fully Qualified Domain Name of the server before you start any services. And DNS had better be working perfectly somewhere on the network your OS X Server is on, and it had better have a perfect record for your server, or you're going to just be screwed. Maybe not so screwed that you can't fix the problem, but probably screwed enough that rebuilding your server from scratch is the preferable option. I believe what caused all my problems — and I haven't proven this yet, but I believe it to be so — was a dot. That's right, a dot. One, single, stupid little dot I forgot to enter in my reverse DNS table. You know the one. You've surely forgotten it yourself at some point. The entry should look like:
34.1 IN PTR server.systemsboy.domain.com.

But instead you wrote:
34.1 IN PTR server.systemsboy.domain.com

Ooops! See the difference? No dot. No period at the end of the sentence. Kids, punctuation counts in DNS. And DNS counts on Mac OS X Server. Big time.

So now you're fucked and you don't even realize it. You go and you build your server, and things seem fine, and then all of a sudden the simplest of things — building your replica — suddenly don't work, and you've no idea why. There are no error messages, no warning dialogs. Things just don't work.
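For what it's worth, two quick sanity checks would have caught my missing dot. Neither is anything fancy: dig the reverse record by hand, and let the server's own changeip tool grade your DNS setup. Both tools ship with a stock Tiger Server install; the IP below is made up, so substitute your server's real address:

dig -x 192.168.1.34 +short       # should print the FQDN with a trailing dot: server.systemsboy.domain.com.
sudo changeip -checkhostname     # reports whether DNS and the server's hostname agree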

There were some other facts that I learned about over the course of this saga as regards the finickiness of Mac OS X Server. In Tiger Server 10.4+, if you want to host a replica, Kerberos has to be running properly. The Apple folks will tell you that once you set your server to "Open Directory Master," Kerberos will just set itself up and start automagically, which it will, as long as your DNS is set up right. Otherwise it will not do anything, and, as per the manual, you can start it up later, by hand, in Server Admin. But if you've already gone and done a bunch of other Kerberos-dependent stuff on your server — stuff like adding users, or creating a replica — and you then try to Kerberize your server, you're most likely quite fucked. And there's no easy way out. Your best bet is to rebuild the server from scratch. The alternative is to scour the numerous databases and config files in an attempt to correct the problem. But you'll never be sure it's right. So you rebuild the server.

I rebuilt the server.

It wasn't so bad. I'm getting really good at batch importing users into Open Directory (which will be the subject of a later post). The worst part was that the Windows machines all had to be rejoined to the new server by hand. Otherwise, with a proper DNS configuration — and, just to be on the safe side, with the master and the replica added to the /etc/hosts table — rebuilding the server was as easy as it's supposed to be. And now I have a server in which I have a great deal more confidence. Kerberos, in particular, seems to be working properly now. The weird error messages and inability to authenticate via Kerberos using the OD admin user are finally gone.
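In case it's not obvious what I mean by adding the master and the replica to the hosts table, it's just a couple of lines in /etc/hosts. The addresses and names here are made up, naturally:

# /etc/hosts: belt-and-suspenders entries for the OD master and replica
192.168.1.34    server.systemsboy.domain.com    server
192.168.1.35    replica.systemsboy.domain.com   replica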

If I may make an aside to wax bitchy here for one moment: It seems to me that Apple needs to make a much bigger deal of all these dependencies. Proper server operation — especially Kerberos configuration — is heavily reliant on proper DNS configuration. I've known that for a while, mostly from forums and admin sites rather than from Apple's server documentation. But also, replication is dependent upon proper Kerberos configuration, which I never knew until I scoured the forums. I don't even use Kerberos, so I never worried about it being configured properly. And back in Panther you could easily replicate without it. That's changed and it would be nice if someone hollered about it quite loudly in the manuals. Also, it would be really great to have a way to put everything back to spec. FQDN assignment is done automatically by the server, and you no longer have the option to do it by hand. This happens quite early in the boot process and is completely dependent upon proper DNS. But if you screw up your DNS, there appears to be no safe way to go back and redo the FQDN assignation. Everything gets hosed by one little mistake. I'd like a button that resets everything on the server — Kerberos, FQDN, IP address and OD Database — back to the settings just after a clean install, and then brings you to the Setup Assistant to redo all the settings. At least then you wouldn't have to reinstall everything. (You do realize, of course, that there is no "Archive and Install" feature for Mac OS X Server. So any install requires the multi-megabyte software updates, and the reinstallation of any users, shares, computers and the like. Which is a big, fat, nasty pain.)
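To be fair, Tiger Server does ship a changeip tool that's supposed to propagate an IP or hostname change through the various databases. Whether it catches every last place the old values are ingrained is another question, and the invocation below is only a sketch (check the man page before trusting it): the directory node argument is the LDAP node on an OD master, and the old/new values are placeholders.

sudo changeip /LDAPv3/127.0.0.1 192.168.1.34 192.168.1.34 oldname.domain.com server.systemsboy.domain.com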

Okay, so the burning question: What about the replica? Did the rebuild yield a usable replica? The answer is not pretty: I think so.

Once I had my master server rebuilt (and cloned for easy restore!) I tried the replica. Replication seemed to go fine, but then, it did before too, so that wasn't necessarily reassuring. Once the replica was built I pulled the plug on the master. This time, things were even worse than before. No authentication was able to take place on the clients. They just didn't see the replica. But why?

One of the things I learned through all of this is that it's the client that maintains the list of replicas. It gets this list from the master OD server when it binds to said server — i.e. when you open Directory Access and set it up — and it probably also refreshes the settings periodically, perhaps when the MCX cache is refreshed, but I'm not really sure. You can see all these settings in one particular preference file, called DSLDAPv3PlugInConfig.plist, located in /Library/Preferences/DirectoryService. One of the things you'll see in this file is the IP address of your master server, and if there are any replicas, they will be listed as well. Checking this file on my clients showed that they had not gotten the replica info from the new server. They had no idea about the replica. So I plugged the master back in, and re-bound a client machine. It got the replica info. I unplugged the master again, and after a few minutes, my client could again authenticate to the replica. It worked! Finally!
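If you want to eyeball that file on a client without digging through raw XML, something like this should work. The grep pattern is just my guess at the relevant keys, so reading the whole thing is the safer bet:

sudo defaults read /Library/Preferences/DirectoryService/DSLDAPv3PlugInConfig
sudo grep -i -A1 -E 'server|replica' /Library/Preferences/DirectoryService/DSLDAPv3PlugInConfig.plist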

Now all of this occurred on Friday at around 9PM. I'd worked about 52 hours by that point. Testing the replica on a larger scale, and on Windows in particular, would have required effort I was no longer capable of nor willing to exert — i.e. un-joining Windows systems from the master, removing them from the computer list, rejoining them, pulling the plug, testing the behavior, and the like, all of which would have taken at least another hour or so to do properly. And whether it worked or not, I'd still have to go rebind every machine one by one the following week, just to be safe. So I quit. And that's why, when you ask me, "Did the replica work?" I can only answer, "I think so."

To qualify: I think we have a winner. It seems to work on the Macs, and I think it will work on Windows (though how well it works on Windows, and whether it works like we hope it does, is anyone's guess). The previous difficulties seem to have been caused by Kerberos-related problems on the master OD server, but I haven't thoroughly tested it yet. And if it doesn't work, then that's that. We'll probably give up on the whole replica idea for the semester, since the only way to test it is to effectively bring down the network, and I'm not so willing to pull plugs when school is in session. So next week I will test a Windows machine if I can find a good time to do it. Either way, I will be rebinding all Macs and Windows machines to the new server in the hopes that replication is indeed working. But if there's no time to test it, we may not know for sure until the master server dies. Sure. It's a little scary. But I've seen worse.

One last thing: I did have the strangest problem with my latest master/replica setup during testing. After pulling the plug on the master, and then plugging it back in, I was unable to authenticate to the master as the directory admin user from WGM, Server Admin or ARD. I was able to ssh in, but I could not reboot the machine from the shell because sudo from both the directory admin account and the local admin account would not accept the root password. I was totally locked out. The only thing I could do was hard reboot the server by holding the power button on the machine. Fortunately, after doing so, the server began behaving normally again, and root access was again right with the universe. Sure did make me nervous though.

Anyway, thus ends, essentially, the saga of our replica, at least for now, as well as, for the most part, the saga of Three Platforms, One Server. We're now in the final stage of the project: going live. Class begins on Tuesday, and we'll get to see how our master authentication server holds up under load. If disaster ensues, expect more posts on the subject. If all goes well — knock wood with me, people — I will post a final round-up article and this series will be concluded. Also, if I get a chance to more thoroughly test my new replica I will post the results either here or in another article.

Okay, everyone. Have a great school year!

Hey! My Box.net-Shared iCal Calendars Stopped Working

Just a follow-up to a recent, popular post. I recently noticed that my iCal calendars — the ones I share via my Box.net account — stopped publishing, displaying a warning badge over the broadcast icon.


WTF: My Calendar Share Stopped Working!

So I tried seeing if I could still connect to Box.net via the web. Yup. Good to go. Next I checked to see if I could still connect via the Finder and WebDAV. Nope. No go. Just sits and spins. There's the problem: No WebDAV, no calendar share. (I. Can't seem. To stop. Talking in. Short. Truncated. Sentences.)


Connecting to https://www.box.net: Or Not

After having no luck Googling a solution, I decided I'd try to figure things out myself. And after some poking around I found the problem. And the solution. The problem is the "s". See it? The one after the "http"? There in the "Connecting To Server" dialog. There you go. That's the culprit. That "s" means you're attempting to connect using a variation on the http protocol (called https, if you can believe it) that transmits over a different port (port 443) than standard http (port 80), and that uses an additional encryption layer for security. Seems Box.net has stopped using that protocol for WebDAV communications, and is, at least for now, using standard http. Removing the "s" from my calendar shares fixed them right up.

Best way I know to do that is to select the published calendar, choose "Change Location..." from the "Calendar" pull-down menu, and in the field marked "Base URL:" change the "https" in the URL to "http". Hit publish, and everything should be right as rain.
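And if you'd rather check the WebDAV endpoints from the command line than wait on the Finder, curl makes quick work of it. The /dav path is my guess at Box.net's WebDAV URL, so adjust as needed:

curl -sI --connect-timeout 10 http://www.box.net/dav | head -n 1    # plain http: should return an HTTP status line
curl -sI --connect-timeout 10 https://www.box.net/dav | head -n 1   # https: if this errors out or times out, there's your answer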


Fixed: Ahh! That's Better!

I'm not sure why the good folks at Box.net decided to change the connection protocol for WebDAV, nor why they failed to inform anyone (as far as I could tell, anyway). WebDAV support is a beta feature at Box.net, apparently, so I suppose we should expect some changes from time to time. Either way I'm sure glad they haven't pulled the service altogether. Hard to get too mad when the price is so nice.

UPDATE:
About five minutes after posting this article I got a comment from someone named Aaron who appears to work at Box.net. Aaron wrote:

"Sorry for the scare. Dav should be back to normal in the next few hours."

That was last night. I'm still having some weirdness, but I have to admit to being far too tired to really do any serious investigating. Thus far, I'm unable to connect to Box.net with the Finder using https or http. Neither seems to work. Oddly, publishing via iCal using http does work, but still not with https. Strange. Not a big deal. Just strange. That's about all I have the energy to try. Mainly I wanted to just point out that the Box.net folks seem to really be committed to the whole WebDAV thing, and that's great. And they appear to be listening, which is also great. Thanks, Box.net folks. And thanks, Aaron.

Now off to bed with me.

UPDATE 2:
Not sure when this started working properly, but publishing calendars via the https protocol is functioning normally again. Yay!
(Updated Sept. 4, 2006, 6:30 PM)


Three Platforms, One Server Part 9: Replica Problems

This post will be a short one. Promise. We're finishing up this project, and so far it looks like it's going to be a success. We're just adding the last little bits and finishing touches right now, but we've been putting our internal authentication server through its paces for the past couple of months, and all seems well.

In a few places along the travails of this series I mentioned the need for a fail over in the event our master authentication server goes belly up. The rationale is that, with all our eggs in the one server basket, if our master goes down, no one can log in — not on Windows, not on Mac, not on Linux. Fortunately, Mac OS X Server allows for what's known as a replica. A replica is a read-only copy of your Open Directory data hosted by another server. But it's a bit more than that. The replica also provides what's known as fail over. That is, if the master server goes down, the replica knows to take over and start serving authentication to the clients. The replica, in effect, becomes the master in the event of the master's absence, until said master returns to service, at which point the replica gracefully hands control back to the master. You can actually have multiple replicas for fail over, and for redundancy in the event of slow or separate networks. Brilliant! I've wanted one for a long time. And now I have it.

Setting up a replica is easy as pie: set the server's "role" to "Replica," point it at its master, authenticate, and you're off to the races. That is, if you've set everything just perfectly on both your master server and the replica. That's the thing about Mac OS X Server, and always has been: if you set it up wrong initially, you're in for a potential world of hurt. As I rediscovered last week.
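There's supposedly also a command-line route via slapconfig, which I haven't tried and can't vouch for. The line below is my best guess at the invocation, so check the man page before trusting it; the hostname and admin name are placeholders:

sudo slapconfig -createreplica server.systemsboy.domain.com diradmin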

After building my replica, I next went on the requisite testing spree. The test involved physically pulling the network cable from the master server, and then observing the behavior of the clients. Initially, the replica seemed to do a fine job of picking up right where the master left off. After about two minutes' time, clients could log right back in. But after about five minutes' time, nearly every client on our network beachballed indefinitely, and any attempt to log in would hang the machine. This hanging was so sudden and so thorough, it actually froze my machine mid-cube-effect as I attempted to fast-user-switch to the login window. Some failover! Not cool.


Fast User Switch: Frozen in Time

So I've spent a great deal of time figuring out the solution. The logs were no help. Google turned out to be a wash. Manuals? Pfft! For the first time in quite a while, it was an Apple Knowledge Base Article that offered the fix. The article was written for people who were having problems getting a 10.4.2 replica to remain a replica. Apparently there was an issue that involved these servers reverting back to "Standalone" roles after being switched from "Replica" to "Standalone" and back again. Though this was not my problem, nor did any of the symptoms reflect what I was seeing, I finally decided to try the recommendations in the article as they seemed fairly universally applicable, and as I was desperate and had tried everything else. The article essentially details methods for cleaning out every part of the master database that references the replica, and then re-promoting the replica to a clean master. Honestly, my master database looked pretty clean to me, but there was one bit of advice that I was not aware of, and it's my suspicion that this was what did the trick for me: The OD Administrator of the replica's database can NOT have the same UID or short name as that of the master. The article recommends creating a separate OD Admin account on the master, and using this separate account when binding the replica to the master. Honestly, I had no fucking idea. Would've been nice if this had been more explicitly mentioned in the manuals. Believe me, I've read the section entitled "Setting Up an Open Directory Replica" numerous times by now. It's not there. Fortunately, it's in that article, and now I know (and knowing's half the battle).

And now you know.
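If you want to double-check that your two directory admins really are distinct before binding a replica, dscl can read the relevant attributes straight from the LDAP node. The account names here are hypothetical:

dscl /LDAPv3/127.0.0.1 -read /Users/diradmin UniqueID RecordName
dscl /LDAPv3/127.0.0.1 -read /Users/replicaadmin UniqueID RecordName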

So, in a couple weeks we go live with our unified internal network. I'll let you know how it goes.

UPDATE 1:
Actually, my replica seems to still not be working, at least not very reliably. What happens is that about two minutes after the plug is pulled on the master, the replica picks up. At this point, clients — both Mac and Windows — can successfully log in. Shortly thereafter — maybe four minutes later — we start having problems: the Macs can't log in, or only some can log in; Windows machines log in intermittently — one time it works, the next time it doesn't, it works after a reboot, then it doesn't; and, perhaps strangest of all, network connections to the Macs — ARD, ssh, anything but ping — become all but impossible, hanging at the attempts. This is totally a guess, but it seems to me like the clients are having serious trouble binding to the replica. They keep attempting to do so, with some initial or intermittent success, and in their attempts network connections get locked up and the machines bog down. It's almost like the replica server is saying, "Yes, you can bind to me," and then changing its mind and saying, "Wait, no you can't. Never mind. Screw you." Again, I'm only guessing. There is nothing clear cut in the logs, and I can't find anything in Apple's Discussion forums or Knowledge Base that specifically addresses my problem. I only pray that it isn't a problem with my master server, but the master works perfectly, and it seems to me that a replica of a perfectly working master should work perfectly. The current replica is running on a Mac mini with limited RAM and a 10/100BT NIC, and I want to rule out potential problems that might be caused by the hardware as well as the software setup. So my next step will be to build a new replica from scratch on a G5. I'll let you know if that solves the problem.

Another thing I absolutely should mention: For Windows replication, the replica server must be set up as a Backup Domain Controller (BDC). This is done in the Server Admin application in the Windows section. It's fairly straightforward to set up, so I won't go into detail, but just for the record, it's important to be aware of this, and I wasn't until recently, so I mention it here for completeness' sake.

Having this replica isn't absolutely critical to our plan. That is, we can go forward with this plan without the replica. But having a working replica will provide an important safety net that I'd really like to have working as the semester begins. There's no good way to test it while the semester is in session. So I'm working hard to get it up and running in the final week of the summer.

So much for this post being short. More to come.

UPDATE 2:
I built the new server today. From scratch. On a G5. No joy. I honestly don't know what the problem could be. I can only guess that something either broke with the latest 10.4.7 update, or that there's something slightly off with my master server and it's causing problems on the replica. But it's weird, because if I bind directly to the replica using Directory Access it works perfectly, which leads me to suspect a problem on the client. But it affects Windows machines as well, so that doesn't quite figure either. I hate to admit it, but I'm stumped. And, unfortunately, I don't have time to worry about it right now. I will revisit this issue at a later date, when I get some time. When I do, I'll probably post a new entry with the solution, that is, if I'm able to find a solution. I hate this kind of thing. Is anyone else having a similar problem? I feel like I'm going nuts. And I can't believe I've spent so much time on something that should be really, really easy. Fuck. What a bummer.

UPDATE 3:
I've pretty much given up on this for now. No time. And no good time to test, what with the students coming back next week. But today I noticed something strange, and it occurred to me it might be related to my replica problems. Today, when trying to make an AFP connection to the master server from a client using a simple "Connect to Server..." I got a Kerberos prompt that refused my admin credentials. Hmmm... Kerberos problems... On the master... Not good... So who knows? I may be rebuilding the master server at some point. But not now. Oh, Lordy, not now.