Tiger Lab Migration Part 8: Home Account Woes

Boy have we got troubles. Sure I tested everything. Sure I did everything I could to make sure this upgrade went smoothly. But does that ever matter? No. It doesn't.

Everything tested out fine. Tiger was able, from day one, to mount our NFS home account server -- a network RAID known as the Panasas. It was able to read files, to write files. It seemed fine. But we'd had our fair share of troubles with the Panasas, even in the beginning, even in Panther. Certain Macromedia products -- actually, Flash MX 2004, to be specific -- had certain functional disabilities. In particular, when any user was logged onto the Mac as a network user whose home account was located on the Panasas, he would find himself unable to publish HTML previews from within the Flash MX 2004 application. The error message was something along the lines that Flash was unable to read the HTML templates, which live in the user's home account. Oddly, on first launch, Flash could write the templates, but for some reason, it could not read them for the preview. If you have any Flash knowledge whatsoever (which I don't -- I was informed by student experts) this renders Flash essentially useless. The problem did not, however, exhibit itself with network home accounts mounted on Apple shares, neither via AFP nor NFS. So the workaround was to create a special account for Flash developers, which they could log into when doing Flash work, and which was mounted directly on the MacServer and shared via AFP. Hence was born our FlashDev account, which has worked fine all year.

Jump ahead to the present: we've upgraded to Tiger. It seems to have gone fine. Suddenly, certain apps begin acting strangely. Illustrator can't save files. Word also has trouble saving files, and behaves erratically, complaining of "permissions problems" and the inability to write temp files to a "network disk." Final Cut, too, begins acting flakey. Shit. It's like one of these sci-fi-horror flicks where the brain transfer experiment seem to have gone perfectly, but then, gradually, the patient begins acting... Different somehow...

And then all Hell breaks loose.

Tracking down all the problems, I finally came to the conclusion that a number of applications are unable to read their preference files in Tiger when those files reside on the Panasas. This perhaps causes a chain reaction that makes these apps unable to properly write or save files: FCP writes 0 byte files; Illustrator CS just can't save your document. Oddly, Illustrator CS2 doesn't seem to suffer from this problem, and all these apps (except, of course, Flash) seem to work properly from within the Panther environment.

Seems to me like a nasty reaction between Tiger and the Panasas home account server.

I've done some testing, and I plan to do more. So far, I've tried booting into a Panther client and found that the problem is greatly reduced in Panther (though it always existed to some extent, particularly, again, with Flash) and exacerbated in Tiger. I've also tried resharing the Panasas NFS export via AFP from the MacServer. I was fairly sure this would provide a reasonable, if temporary, workaround, but it didn't. I want to try it again, just to be sure, but if resharing via AFP does not provide a cure, it would indicate that the fault lies somewhere with the Panasas. The other possible culprit is Tiger's implementation of NFS. To test this, I am building a more generic (non-Apple, non-Panasas, non-proprietary) NFS server to test from, on a Linux install on my new Dell laptop.

So, best laid plans, and all that. What a frickin' mess.

I'll keep you posted as I learn more.

I built a Linux install for the express purpose of testing networked Mac home accounts on a non-proprietary (non-Mac, non-Panasas) NFS mount. When running a Mac home account from this mount, the Final Cut problem disappeared, and the application acted normally. The same was true for Microsoft Word. Illustrator CS still behaved badly, though, as did Flash MX 2004. It's clear now that these problems result both from Tiger's implementation of NFS as well as Panasas's. So, the breakdown goes something like this:

  • The Flash problem is probably Flash/NFS-specific (though possibly Mac-specific as well)
  • The Illustrator problem is Tiger/NFS-specific
  • The Final Cut and Word problems are Tiger/Panasas-specific

None of this makes my job any easier. It kind of sucks to realize that there is no magic bullet, because that means that there probably is no single fix. Still, it's useful to know.

The issue with Final Cut is not a matter of the application being unable to read files; it's a matter of it being unable to write them. Essentially, FCP cannot write a preference file. Launching with no preference file, FCP creates a 0 byte preference file in the user's home account. Also, if you try to save the FCP project file to the NFS share, it will be a 0 byte file and unopenable. Saving the FCP project file to a local volume results in a perfectly usable FCP project file. The complete inability to write files seems to be unique to Final Cut. Other applications (Illustrator, Flash) seem to have trouble reading files, but writing them seems to work fine (which is why I assumed this was the case in FCP). Word (now updated to 11.2) also has trouble writing files, but only when saving an open, modified document. That is, if you create a new document in Word, and hit "Save," it saves just fine. If you keep the document open, make changes, then hit "Save" again, it complains that it can't write to the shared disk. Hit "Save" again, and Word will create a new document and use the first string of characters in the document as the name. "Save As..." however, works as expected.

It occurs to me (through the suggestion of other, smarter folks) that maybe this is not completely an NFS problem, particularly in the case of the apps that have problems writing files (Word and FCP). Those apps work just fine when we mount our homes on a generic NFS export. So it's possible that the write problems we're having are filesystem-related, not NFS-related. This theory holds a lot of water considering the fact that the Panasas uses its own, proprietary file system.

I've been in contact with the Panasas folks, and I'm working with them in trying to determine the problem. So far, I've sent them tcpdump results from both the Panasas and the Mac while FCP tries to save a file. I'm told that there are a "massive number of 'file not found' type errors" from the dump. That can't be good. I've also sent them a few file listings from the home account of the user in question. Meanwhile, I'm wondering if the next Tiger update will do anything to correct these issues, or if upgrading the Panasas shelf would help, while at the same time, trying to come up with a reasonable workaround, in case Panasas is unable to reslove the issue. The only thing I can think of is to move the Mac home accounts to another server. Which would suck hard. Two steps backward.

Still in constant contact with Panasas. I'm sending them log files now, and planning to do a ktrace, which will be a new experience for me. Also, I'm seeing this issue occur with other applications now. In particular, Macromedia Director MX 2004 is unable to write files to the Panasas mount. This is particularly odd because under Panther that application had a completely different problem: it just wouldn't launch. So I'm currently dealing with three seperate incident reports: one for the problem noted here, one for a blade crash that happened a couple of weeks ago, and one for the original Flash problem from our Panther days. It's a pain. Fortunately, the Panasas people are very nice, professio nal, and helpful.

So, this does seem to be a filesystem issue at its core. Sort of. After much work with the Panasas guy, we've narrowed down the problem considerably.

Mac applications write temporary files -- files that are in use but that haven't yet been saved -- to various locations depending on the app. Final Cut writes its temporary files to a folder called .TemporaryItems at the top of whatever volume the user is working from. So in our model, the top volume is /home, and the sboy home account is inside /home. The whole thing looks like this: /home/sboy. Final Cut wants to write its temp files in /home/.TemporaryItems. And it is able to do this. If I look in /home/.TemporaryItems, I find all the files I've been unable to save to my user account, perfectly intact. So, temporary files go in /home/.TemporaryItems, and saved files in /home/sboy. Now, here's the grind: The Panasas setup requires that each home account be a seperate volume. So, /home is one big volume, and /sboy is another volume inside /home. This means that when Final Cut tries to move the temporary files from the .TemporaryItems folder, which is located on the volume /home, to the sboy user account, which is another volume at /home/sboy, it is moving the files across filesystems. And we get an error.

Interestingly, moving files across filesystems when Panasas is mounted on a Linux box produces the same error, but Linux does something to compensate for this problem, and is then able to save the file. But the error still gets produced. It's just that Mac OS X 10.4.2 (this did not happen in Panther) does not have the mechanism to correct for this error. It gives up and leaves the file in its original location: the /home/.TemporaryItems directory.

Also, this explains why everything works fine when I mount the user's home account on my Linux NFS export. In that case, everything is on the same volume, and there is no moving of files across filesystems. It's got nothing to do with NFS at all.

I'm curious to know why this doesn't happen in Panther. Is it possible that Panther has the same error correction function we see on Linux? What about earlier versions of 10.4? Do they have this mechanism? Will 10.4.3 have it? I surely hope so. I'll be looking into who I should contact at Apple about all of this.

Incidentally, the Panasas guys have figured all this out by examining ktrace and tcpdump files made while the program ran and the error occured. I did not figure any of this out on my own. All I did was run the commands and upload the files. Still, it's a nifty bit of sleuthing. I'm learing a lot.

So, after mulling this over for a while, a rather obvious workaround occured to me. Currently, we mount the entire Panasas volume on our systems. This volume includes all home accounts within it, as well as the .TemporaryItems folder at its top level. Each user's home account on Panasas is a separate volume, and so is the top-level volume that contains .TemporaryItems, so saving files must cross filesystems (from the .TemporaryItems folder on /home to the user's folder). The workaround is to mount each user's home account individually. Doing so causes FCP to treat the user's home account as the active volume, and to use the .TemporaryItems folder in said home account. In this scenario, no filesystem boundaries are crossed, and no error occurs. I tried it, and it works. FCP (and Word, incidentally) can now both save files to the user's home account normally, and application preference stick. Final Cut is again usable.


I'm testing this on a couple machines before I go and announce it to the community and put it on all the Macs, just to make sure it works properly. But it's looking good. I'm glad to know that the past two weeks I've spent on this haven't been in vain.