Infrastructure

There are a bunch of legacy issues at my new job. Many of them, I believe (I'm not completely sure, I'm still pretty new after all), stem from the once heavy use of IRIX and its peculiarities. We are only just reaching a point at which we can do away, once and for all, with that platform. It's a very complex ecosystem we're dealing with here. Complex, and in some spots delicate. But what surprises me more than a little is that, despite the fact that my new job is at one of the most respected institutions in the country — and one of the more advanced computer labs as well — I find myself facing many of the same issues I tackled in my last position. Those challenges have a great deal to do with creating a simpler, more elegant, more efficient computing environment. And, though the user base has changed dramatically — we're dealing with a much more technically sophisticated group of professionals here, not students — and the technological and financial resources are much more vast, the basic goals remain the same, as do many of the steps to accomplishing those goals. And the one thing those steps all have in common, at least at this stage of the game, is infrastructure.

What makes infrastructure so crucial? And why is it so often overlooked?

Infrastructure is key to creating efficient, and therefore more productive, work environments, whether you're in art, banking, science, you name it. If your tools are easy and efficient to use you can work faster and make fewer mistakes. This allows you to accomplish more in a shorter period of time. Productivity goes up. Simple. Infrastructure is a lot like a kitchen. If it's laid out intelligently and intuitively you can cook marvels; if it's not you get burned.

Infrastructure, for our intents and purposes, is the back-end of the computing environment. Everything you do between computers — that is, every interaction that takes place between your computer and another computer, be it file server, authentication server, web server, what have you — relies on some sort of infrastructure. I'm referring to network infrastructure here, to be sure, but also to the processes for adding and accessing computer resources in any given facility. How, for instance, do we roll out new workstations? Or update to the latest operating system?

Typically, it is Systems Administrators — the very people who know (or should know) the importance of infrastructure — who tend to work between computers the most. We of all people should know how important a solid infrastructure is for even the simple act of basic troubleshooting: if your infrastructure is solid and predictable, the paths you have to troubleshoot are fewer and simpler, making your job easier and making you better at it at the same time. Yet infrastructure, time and again, is left to stagnate for a variety of reasons.

I'd like to enumerate a few of those reasons, at least the ones I suspect factor most strongly:

  1. Infrastructure is difficult. Infrastructure planning, like, say, interface design, is complicated, and implementing it successfully often requires numerous iterations, coordinated effort and a willingness to change.
  2. Infrastructure requires coordination. Infrastructure changes often require re-educating the user base and creating a collective understanding, as well as clear policies on how things are supposed to work.
  3. Infrastructure is not sexy (to most people). The benefits of infrastructure reorganization are often not immediately apparent, or even immediately beneficial for that matter. You might not see the benefits until long after a reorganization.
  4. Infrastructure can be expensive. If an infrastructure requires a major overhaul, the cost can be high. Couple that with less-than-immediate benefits and you often meet with a great deal of resistance from the money people, who feel they'd be better served buying new workstations than a faster switch.
  5. Change is scary. You know it is.

I've been extraordinarily lucky in that I've been able to learn about infrastructure in a sheltered environment — that of the Education sector — that allowed me copious downtime (unheard of elsewhere) and a forgiving user base. (Students! Pfft!) I'm still pretty lucky in that A) I'm working somewhere where people, for the most part, get it; and B) I have some time, i.e. I'm not just being brought in as a consultant. This last bit is really fortunate because it affords me both ample opportunity to gain an understanding of the environment I'm trying to change and the time in which to change it. This is not to say that this sort of thing can't be done in consulting. But it's certainly a much harder sell, and one I'm glad I don't really have to make to such a degree.

Still, with all that, I've got my work cut out for me.

When I first arrived on the scene, The New Lab was using (actually, still is to a large extent) NIS for user authentication. Now this is something I know a bit about, and I can tell you (if you even remember what NIS is anymore) NIS is very, very passé. And for good reason. NIS is like the Web 1.0 of user authentication: it uses flat files rather than databases and is extremely cumbersome and inflexible. Moreover, it is not well-suited to cross-platform operation. It is completely end-of-life and obsolete. To continue to invest in NIS is silly. So one of my first duties was to build an Open Directory server, which relies on numerous databases, each suited to authentication for a given platform. The OD server will be both easier to use (creating users is a breeze) and more capable than any NIS server could ever hope to be (by allowing cross-platform integration down the line, if desired). But until now, for some reason, no one had done this. Partly, maybe, it's just inertia: NIS works fine enough, it's not that big a problem. And maybe it's partly happening now because this is something I just happen to know a lot about and can make it happen quickly and effectively. Because of my background, I also see it as a huge problem: by slowing down the user creation process, you're hindering productivity. And not just physical productivity, but mental productivity. If I have to spend twenty minutes creating a user, not only have I wasted that time on something trivial, but I've expended far too much mental energy for a task that should be simple. And this makes it more difficult to get back to Work That Matters. Again, the beauty of being on staff is that I have time to introduce this gradually. To gradually switch machines over to the new server. To gradually get the user base used to the slightly new way of doing things before we move on to the next item up for bid.
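Just to give a sense of what I mean by "a breeze," here's a rough sketch of creating a user on the OD server from the command line. The node, directory admin credentials and user values below are made up, not our real ones:

# Create a user record on the Open Directory server's LDAP node (all values are placeholders)
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe RealName "Jane Doe"
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe UniqueID 1050
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe PrimaryGroupID 20
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe NFSHomeDirectory /Users/jdoe
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe UserShell /bin/bash
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -passwd /Users/jdoe 'initialpassword'

Compare that to hand-editing NIS source files and pushing the maps out, and the productivity argument kind of makes itself.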

So far, so good.

I've talked to my co-workers as well, and they're all primed to make changes happen. That's really good. We're talking about all kinds of things: re-mapping the network, asset management with DHCP, redoing the scheduling system, and others I can't even think of right now. User authentication took years at my old job. It was, in many ways, a much more complex network than this new one (we don't manage an external network, thank God). But this place has its own set of complexities and challenges, and though the authentication server is basically done, there are a whole host of things I could see happening in the realm of infrastructure. And they're all right there... See them? Just over the horizon.

Should be fun.

There are a few basic things I like to keep in mind when preparing for and making major infrastructure changes. These are the types of problems I look for that give me purpose:

  1. Repeat offenders. What problems crop up again and again on the user side? What questions get asked over and over? These are indicators that something is functioning sub-optimally, or that a process could be more intuitive.
  2. Personal frustration. What parts of my job are frustrating or infuriating to me? These are usually indicative of a problem, as I tend to get frustrated with things that don't work well. Either that or I need more coffee.
  3. Redundant errors. Is there a process that tends to yield mistakes on a regular basis? If so, it could probably use some automation or clarification at some point. Sometimes all you need is a clear policy or workflow.
  4. Long-term problems. Is there something that everyone complains about, but that just "never gets fixed"? Betcha ten bucks it's an infrastructure problem.
  5. The workflow. How do people in the facility currently work? What's the pipeline for getting things done? Are they spending their time working on tech issues when they should be working on production? How could this be easier?

There are probably more, but these are the general things I'm thinking about when considering infrastructure changes. And the better I can understand the people and the technology in a facility the more informed my decisions can be with regards to those changes.

Finally, there are some basic notions I keep in mind when proceeding with infrastructure changes:

  1. Simplify. The simpler solution is almost always best, both for the admin and the user. Building a simple solution to a problem is often exceedingly difficult and, I might point out, not necessarily simple on the back-end. But a simple workflow is an efficient one, and simplicity is usually my prime directive.
  2. Centralize. It's important to know when to centralize. Not everything benefits from centralization, obviously. If it did we'd all be using terminals. Or web apps. For everything. But properly centralizing the right resources can have a dramatic effect on the productivity of a facility.
  3. Distribute. Some resources should be distributed rather than (or in addition to being) centralized. Some things will need redundancy and failover, particularly resources that are crucial to the operation of the facility.
  4. Educate. Change doesn't work if no one knows about it. It's important to explain to users what's changing and also why. Though I've been met with resistance even to changes that would make a user's job easier (this is typical), making users aware of what's changing and why is the first step in getting them to see the light.

It's true that infrastructure changes can be a bit of a drag. They are difficult. They're hard to justify. They piss people off. But in the end they make everything work better. And as SysAdmins — who are probably more intimate with a facility's resources than anyone — we stand to gain as much as our users do (if not more!). And they stand to gain quite a bit. It's totally win-win.

Default Shell Hell

There's a common occurrence in the world of systems administration. Once I describe it you'll probably all nod your heads knowingly and go, "Yeah, that happens to me all the time." It happened to me recently, in fact.

I was attempting to set a Linux system to authenticate via a freshly-built LDAP server — something I've done many, many times — and it just wasn't working. I could authenticate and log in fine via the shell, but no matter what I tried, whenever I would attempt to log in to Gnome, I'd get an error message saying that my session was ended after less than 10 seconds, that maybe my home account was wonky or I was out of disk space, and that I could read some error messages about the problem in a log called .xsession-errors in my home account.

Of course, certain that my home account was fine and that I had plenty of disk space, the first thing I checked was the .xsession-errors log, which yielded little useful information, and what little it did yield led me on a complete and utter wild goose chase. From everything I could glean from this rather sparse log, there seemed to be a problem with Gnome or X11 not recognizing the user. I showed the error to some UNIX-savvy co-workers, one of whom demonstrated that, when booting into run-level 3, logging in and then starting X, login worked fine, which seemed to confirm my hypothesis. So began several days of research into Linux run-levels, Gnome, X11, PAM, the Name Service Switch (NSS) and LDAP authentication on Linux. All of which was exceptionally informative, but which, of course, failed to yield a positive result.

The final, desperate measure was to scour every forum I could, and try every possible fix therein. And, lo and behold, there, at the bottom of some obscure post on some unknown Linux forum (okay, maybe not that unknown), was my answer: set the default shell. Could it be so simple?

But wait, wasn't the default shell set on my server already?

I checked my server, and sure enough, because of a typo in my Record Descriptor header, the default shell had not been set for my users. Seems X11/Gnome needs this to be explicitly specified in an LDAP environment, because in said environment it is (for some reason that remains beyond me) unable to read the system default.

Setting the default shell for users on my LDAP server (yes, it is a Mac OS X Server) did the trick, and I can now log in normally to Linux over LDAP.
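If you ever want to check for the same problem, it's a quick read from the server side. This is just a sketch, with an example node and user name (yours will differ):

# Read the user's shell from the LDAP node; an empty or missing UserShell is the red flag
dscl /LDAPv3/127.0.0.1 -read /Users/jdoe UserShell

# Set it explicitly if it's missing (directory admin credentials and values are placeholders)
dscl -u diradmin -P 'dirpassword' /LDAPv3/127.0.0.1 -create /Users/jdoe UserShell /bin/bash

On the Linux side this shows up as the loginShell attribute, which is apparently what the graphical login is looking for.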

So, after days of researching a problem, the solution all boiled down to one, dumb, overlooked setting on my server, a fact I found referenced only at the bottom of some strange and obscure internet forum. Sound familiar? What, pray tell then, should we call this phenomenon? We really need a term for it. Or perhaps an axiom? Maybe a law or a razor or a constant. Something like:

"For every seemingly complex OS problem there is almost always an astoundingly simple solution which can usually be found at the bottom of one of the more obscure internet forums."

A corollary of which might go something like:

"Always check the bottoms of forums first."

We'll call it Systems Boy's Razor. Yeah, that should do nicely.

If anyone has any better suggestions here, I'm always open. Feel free to let 'em rip in the comments. Otherwise, check your default shells, people. Or at least make sure you have them set.

NetBoot Part 5

So far this NetBoot/NetInstall thing is working out a thousand times better than I ever thought it would. I wish I'd done this years ago. Not only does it save time, it also reduces errors. This is often one of the most overlooked features of automating a process: the less human interaction in the process, the fewer mistakes can be made. I have only to compare the set of instructions I gave to last year's crew for building a new system to the instructions for using the new NetInstall system to see evidence of this truism. The list of human actions to take — and, thus, potentially screw up — is significantly shorter using the new process. And that's a beautiful thing.

At this point I've converted about half the staff to Leopard with the NetInstall system, and for the most part it's been quick and painless for both me and them. Contrast with years past, where upgrading staff computers — which are both the most customized, and the most important to preserve the data of — has been fraught with tension and minor hiccups. This year I almost feel like I've forgotten something, it's been so easy. But staff would surely let me know if there were problems. (I'm so knocking wood right now.)

I've also had an opportunity to test building multiple machines simultaneously. Yesterday I built five Macs at the same time, and, amazingly, all five built in about the same time it takes to build one — about half an hour. I'm astounded. We should be able to build our new lab workstations this summer in a day. And still have time for a long lunch. And for the most part I'll be able to offload that job to my assistants.

As I finish up the system, I've realized some things. First of all, it sort of reminds me of software development — or at least what I imagine software development to be like — because I'm building little tiny components that all add up to a big giant working whole. Also, as I write components, I find myself able to reuse them, or repurpose them for certain, specific scenarios. So, in a sense, the more I build, the easier the building becomes, as I imagine is true in software development. Organization is also key. I find myself with two repositories: one contains the "build versions" — all the resources needed to build the packages — and one contains the finished products — the packages themselves — organized into something resembling the physical organization (packages for staff computers in one area, packages for workstations in another, for instance). It's shockingly fascinating to work on something like this, something that's built from tiny building blocks and that relies very heavily on good organization. I'm really enjoying it so far, and I'm a little sad that the groundwork is built and it's nearly done. There's just something fundamentally satisfying about building a solid infrastructure. I guess that's just something I innately like about my job.

The next step in this process, as I've alluded to, will be to do a major build, i.e. our new batch of workstations when they come in the summer, and an update of all our existing computers — all in all, about 40 machines. Between now and then there are sure to be some updates, so I'll probably update my base config before we do the rest of the lab. And then will come the fun. I will report back with all the juicy details when that happens, in what will probably be the final installment of this series.

See you in summertime!

NetBoot Part 4

So this is going great. I have a really solid Base OS Install, and a whole buttload of packages now. Packages that set everything from network settings to custom and specialized users. I can build a typical system in about 45 minutes, and I can do most of the building from my office (or any other computer in the lab that has ARD installed).

I'm also getting fairly adept at making packages. A good many of my packages are just scripts that make settings to the system, so I'm getting pretty handy with bash and quite intimate with dscl. But, perhaps most importantly, I'm learning how to make all sorts of settings in Leopard via the command-line that I never knew how to do.
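To give a flavor of the sorts of settings I mean, here are a few representative one-liners. These aren't pulled from my actual packages, just the kind of thing that ends up in them (the names, addresses and service names are examples):

# Set the machine's network names (values are examples)
sudo scutil --set ComputerName "Lab-Workstation-01"
sudo scutil --set HostName "lab-workstation-01"

# Point the primary network service at our DNS servers
sudo networksetup -setdnsservers "Built-in Ethernet" 10.0.1.1 10.0.1.2

# Set the time zone and network time server
sudo systemsetup -settimezone "America/New_York"
sudo systemsetup -setnetworktimeserver time.apple.com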

The toughest one so far has been file sharing. In our lab we share all our Work partitions to the entire internal network over AFP and SMB. In the past we used SharePoints to modify the NetInfo database to do so, but this functionality has all been moved over to Directory Services. To complicate matters, SAMBA no longer relies simply on standard SMB configuration files in standard locations, and the starting and stopping of the SMB daemon is handled completely by launchd. So figuring this all out has been a headache. But I think I've got it!

Setting Up AFP

Our first step in this process is setting up the share point for AFP (Apple Filing Protocol) sharing. This wasn't terribly difficult to figure out, especially now that I've been using Directory Services to create new users. To create an AFP share in Leopard, you use dscl. Once you grok the syntax of dscl it's fairly easy to use. It basically goes like this:

command node -action Data/Source value

The "Data Source" is the thing you're actually operating on. I like to think of it as a plist entry in the database — like a hierarchically structured file — which it basically is, or sometimes I envision the old-style NetInfo structures. To get the needed values for my new share, I used dscl to look at a test share I'd created in the Sharing Preferences:

dscl . -read SharePoints/TEST

The output looked like this:

dsAttrTypeNative:afp_guestaccess: 1
dsAttrTypeNative:afp_name: TEST
dsAttrTypeNative:afp_shared: 1
dsAttrTypeNative:directory_path: /Volumes/TEST
dsAttrTypeNative:ftp_name: TEST
dsAttrTypeNative:sharepoint_group_id: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXX
dsAttrTypeNative:smb_createmask: 644
dsAttrTypeNative:smb_directorymask: 755
dsAttrTypeNative:smb_guestaccess: 1
dsAttrTypeNative:smb_name: TEST
dsAttrTypeNative:smb_shared: 1
AppleMetaNodeLocation: /Local/Default
RecordName: TEST
RecordType: dsRecTypeStandard:SharePoints

Okay. So I needed to use dscl to create a record in the SharePoints data source with all these values. Fortunately, the "sharepoint_group_id" is not required for the share to work, because I'm not yet sure how to generate that number. But create the share with all the other values and you should be okay:

sudo dscl . -create SharePoints/my-share
sudo dscl . -create SharePoints/my-share afp_guestaccess 1
sudo dscl . -create SharePoints/my-share afp_name My-Share
sudo dscl . -create SharePoints/my-share afp_shared 1
sudo dscl . -create SharePoints/my-share directory_path /Volumes/HardDrive
sudo dscl . -create SharePoints/my-share ftp_name my-share
sudo dscl . -create SharePoints/my-share smb_createmask 644
sudo dscl . -create SharePoints/my-share smb_directorymask 755
sudo dscl . -create SharePoints/my-share smb_guestaccess 1
sudo dscl . -create SharePoints/my-share smb_name my-share
sudo dscl . -create SharePoints/my-share smb_shared 1

This series of commands will create a share called "My-Share" out of the drive called "HardDrive."
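If you want to double-check your work, you can read the new record back the same way I read the TEST share above:

# Confirm the new share point record exists with the expected attributes
dscl . -read SharePoints/my-share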

After modifying the Directory Services database, it's always smart to restart it:

sudo killall DirectoryService

And we need to make sure AFP is running by starting the daemon and reloading the associated Launch Daemons:

sudo AppleFileServer
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.AppleFileServer.plist
sudo launchctl load -F /System/Library/LaunchDaemons/com.apple.AppleFileServer.plist

Not the easiest process, but not too bad. SMB was much tougher to figure out.

Setting Up SMB

Setting up SMB works similarly, but everything is in a completely different and not-necessarily-standard place. To wit, Leopard has two different smb.conf files: one that's auto-generated (and which you should not touch) in /var/db, and one in the standard /etc location. Fortunately, it turned out, I didn't have to modify either of these. But still, it led to some confusion. The way SMB is managed in Leopard is rather roundabout and interdependent. Information about SMB shares is stored in flat files — one per share — in /var/samba/shares. So, to create our "my-share" share, we need a file named for the share (but all lower-case):

sudo touch /var/samba/shares/my-share

And in that file we need some basic SMB info to describe the share:

#VERSION 3
path=/Volumes/HardDrive
comment=HardDrive
usershare_acl=S-1-1-0:F
guest ok=yes
directory mask=755
create mask=644

Next — and this was the tough part to figure out — we need to modify one, single, very important preference file that basically informs launchd that SMB should now be running:

sudo defaults write /Library/Preferences/SystemConfiguration/com.apple.smb.server "EnabledServices" '(disk)'

This command modifies the file com.apple.smb.server.plist in our /Library/Preferences/SystemConfiguration folder. That file is watched by launchd such that when it is modified thusly, launchd knows to start and run the smbd daemon in the appropriate fashion. Still, for good measure, I like to reload the LaunchDaemon for the SMB server by hand. Don't need to, but it's a nice idea:

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.smb.server.preferences.plist
sudo launchctl load -F /System/Library/LaunchDaemons/com.apple.smb.server.preferences.plist

That's pretty much it! There are a few oddities: For one, the new share will not initially appear in the Sharing Preferences pane, nor will the Finder show it as a Shared Folder when you open the window.

Shared Folder: This Won't Show Without a Reboot


But the share will be active, and all will be right with the world after a simple reboot. (Isn't it always!) Also, if you haven't done it already, you may have to set permissions on your share using chmod in order for anyone to see it.
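If you do need to set permissions, a single chmod usually does it. The path and mode here are just examples; use whatever is appropriate for your volumes:

# Make the shared volume readable and traversable by everyone so clients can see its contents
sudo chmod -R 755 /Volumes/HardDrive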

I was kind of surprised at how hard it was to set up file sharing via the command-line. But I'm glad I stuck with it and figured it out. It's good knowledge to have.

Hopefully someone else will find it useful as well.

NetBoot Part 3

I've become quite the package whiz, if I do say so myself. Actually, I'm probably doing something ass-backwards, but still, I wanted to share some of my working methods as they seem to be, well... Um... Working...

One of the things I'm doing is using packages to run shell scripts that make computer settings (like network settings and user-creation) rather than actually installing files.

PackageMaker: I Prefer the 10.4 Version of Packages

This can be done in PackageMaker by taking some creative liberties with preflight and/or postflight scripts. The only hitch is that PackageMaker insists that you install at least some files onto the target system.

PackageMaker: Installing Scripts to /tmp

So the majority of my packages contain only a single script. That script first gets installed to /tmp, thus fulfilling PackageMaker's "must install files" directive.

PackageMaker: A Postflight Script

The script then runs as a postflight script, and the last line of the script deletes the instance of the script in /tmp, just for good measure.

Shell Script: Removing the Script from /tmp
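For those reading along without the screenshots, the whole arrangement amounts to something like the following. The particular setting it makes is a made-up example, not one of my actual scripts:

#!/bin/bash
# Hypothetical postflight script: the package installs this file to /tmp, then Installer runs it.
# It applies a setting, then removes its own /tmp copy as its last act.

# Example setting: point DNS at our servers (service name and addresses are placeholders)
networksetup -setdnsservers "Built-in Ethernet" 10.0.1.1 10.0.1.2

# Clean up the copy of this script that the package installed to /tmp
rm -f /tmp/set-network-settings.sh

exit 0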

It could be argued that there's no reason to create packages from scripts, that you could just as easily run the scripts directly via ARD, but packages offer a couple of advantages. For one, packages leave receipts, so it's easy to check and see if something's been set on a computer. For two, packages are easy to deal with; assistants and other SysAdmins know how they work and can easily understand how to use them. Need to change a machine's settings? Don't run a script. Hell, don't even bother opening System Preferences. Just open and run a package. What could be easier (and less error-prone, I might add)? From an ease-of-use perspective, packages have a huge advantage. And ease-of-use adds efficiency. Which is why I not-so-suddenly find myself in the enviable position of being able to build systems in about half the time (or less!) it used to take. That's a huge improvement!

Using this method (and sound DNS) I've been able to write packages that configure network settings, create computer-specific users, set custom disk and file permissions, set up autofs, bind to our authentication server and set up SSH for password-less login.
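As one small example of the SSH piece, a package script can drop a pre-generated public key into the right spot, which is about all password-less login requires. The account and file names below are hypothetical:

# Install a pre-generated public key for the admin account (names and paths are examples)
mkdir -p /Users/labadmin/.ssh
cat /tmp/admin_key.pub >> /Users/labadmin/.ssh/authorized_keys
chmod 700 /Users/labadmin/.ssh
chmod 600 /Users/labadmin/.ssh/authorized_keys
chown -R labadmin /Users/labadmin/.ssh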

Next on the list: File Sharing!

Should be fun.