Brahma: The Duke Physics Department's Beowulf Compute Cluster

This is the official home page for the Duke University Physics Department's Brahma Beowulf Project. Please feel free to explore this website; there are a number of things on the site itself that may be of use or interest to anyone curious about beowulf-style cluster computing.

This site is maintained by rgb. It and all works linked thereupon authored by Robert G. Brown are Copyright 2003 (or as indicated in the document) and made available through a modified Open Publication License unless superseded by another license directly associated with the document. (Current site version 2.2-1)



Photo Tour of Brahma


The Brahma cluster is really the union of many clusters of varying age and speed, owned by a number of research groups in the physics department. It is housed in a recently renovated "cluster/server room" in the bowels of the Duke Physics building. The dimensions of this room are roughly 5 meters by 15 meters -- a long, somewhat narrow room. The room has various pipes passing through it against the wall and a somewhat low (and crowded) ceiling, but the floor is solid concrete (good for bolting two-post racks in). The room is soundproof, it can be securely locked, and it can be kept very cold without bothering anyone.

________
Seth Vidal (the sethbot), godparent of yum, key component of a scalable cluster in his own person

Contents

________
The whole server room, as seen looking down the center.

Looking down the center of the room, on the left from front to back are the qcd cluster and the few remaining department servers in tower cases (only the shelf post visible), the main department server rack, the nano cluster racks, the rama cluster and switch racks, and the CHAMP racks. From the right rear back toward the front are the GENIE cluster (ISDS), the ganesh cluster (shelving on the right), the brahma 2 cluster (brown Dell Poweredge towers on the right), and the brahma 3 cluster (only the backs visible, camera case on top).

Also visible as detail are the large overhead AC delivery and return ducting, powerpoles everywhere, one of two power panels on the left, and a rolling workbench.

Back to top

________
Brahma 2 and Brahma 3.

Seven Dell dual 933 MHz PIII's and twelve (remaining) Dell dual 400 MHz PIII's are what remains of the Intel equipment grant cluster; these were the third and second generations of brahma systems, respectively. Both clusters are still in fairly heavy use -- the 933's as part of CHAMP and the nuclear theory project, and the older 400's for Monte Carlo and the occasional student project.

Back to top

________
Ganesh.

Ganesh is fifteen 1300 MHz Athlons in tower units on heavy duty steel shelving from Home Depot. I didn't really get a good picture of it, though, because five nodes are broken and out of the rack, and the workbench was tied into some of the nodes for a private experiment and was in the way as well.

Anyway, ganesh is the white units on the left, more of brahma 2 is in the (incredibly expensive) wire rack in the middle behind the workbench, and brahma 3 is just visible on the right. As an interesting detail, note the pipes running down the wall within 3 feet of the floor. We had asked for these to be removed during the renovation, but it proved to be impossible.

Back to top

________
CHAMP

CHAMP is 23 Dual Athlon 1800+'s (originally Tyan 2460, which proved to be a miserable piece of crap as a motherboard, now remounted with Tyan 2466's). These are the "c-nodes", c00-c22. Also in the rack are four CHAMP P4 "p-nodes" p00-p03, and a dual Xeon from Penguin Computing purchased as a test node for possible future cluster expansion.

As noted elsewhere, the CHAMP project also uses the 14 P3 CPUs in Brahma 3 to bring its total processor count to 64. Well, 66. Well, more like 70, if you include various desktops that can also be used, and QCD belongs to one of the CHAMP theorists...

This is one reason that we're fairly easygoing about cluster boundaries and sharing. What matters is getting our work done, not who "owns" what.

Interesting details: Note the gaps in the racks where we left out nodes -- useful (for the time being, at least) for holding a small monitor or laptop. The powerpoles are clearly visible, as is the ceiling-mount wiring tray overhead. The primary cluster switches are just visible at the top left. This is nothing fancy -- we bought two HP 4000M's, snitched the cards and power supply from one and put them in the other to make a dual power "HP 8000M". Odd, but it is cheaper this way.

Back to top

________
Rama

Rama is the rack closest to the wall in the middle, next to the rack holding the cluster switch(es). Actually, rama has a gigabit switch at the top as well, although it doesn't really need it yet. Rama is 16 dual Athlon 1900+'s on Tyan 2466 motherboards. Even the 2466 is not problem free, but if you get the right BIOS flash on it, it seems to work all right. The 1.6 GHz Athlons prove especially effective at running Brown and Ciftan's Monte Carlo code, performing as well as a 2.4 GHz P4. Go figure.

One of the room's two power panels is visible next to rama. Nano is the pair of racks on the left, and CHAMP is again visible as the two racks on the right.

Back to top

________
Nano

Nano is 32 Dual Athlon 1900+'s, purchased shortly after rama but by the Nanoscale Physics group. In a lot of ways it is our "prettiest" cluster: it was purchased all at once, installed so that it uniformly fills two racks (with the useful gaps for holding pliers, shown), and has had relatively little difficulty, at least compared to CHAMP and rama, so it has remained racked and neat.

Note that we installed nano facing rama to permit easy access to the two power panels and to work around a mid-wall "pillar in brick". The department server rack and department server shelf (which also holds qcd) are visible behind and to the left.

Back to top

________
QCD

QCD (like the p-nodes) is a "mini" cluster (qcd1-qcd4) of Alphas. For a little while these were the fastest nodes in Brahma, but they were expensive, hot, and (by the time one factors in all the extra work required to run an alpha distribution as well as an x86 distribution) really expensive. At this point we have more stringent rules on acceptable hardware configurations as human time is really our most expensive and "scarcest" resource.

This picture also gives a decent view of the room's air conditioner heat exchanger/blower, which is roughly the size of my whole office and sounds like a 747 in flight. The overhead ductwork is also visible -- it is just high enough that nobody quite bonks their head in the room, not even Icon (who is quite tall).

QCD is the four "identical" towers in the middle shelf. The other towers are the remnant towers still in service as department servers. We are transitioning over to rackmount only as we replace aging equipment, both to conserve space and to increase our server hardware quality and decrease downtime and maintenance hassle. Basically, the physics LAN has the highest ratio of systems to administrators (as far as I know) in the entire University, and our two administrators both do lots of other things. We have to be almost brutally efficient and scalable in order for them to stay sane and happy.

Back to top

________
Genie (ISDS)

Genie is a cluster belonging to the Duke Institute for Statistics and Decision Sciences (ISDS) that is housed in our cluster space. It has lots of dual Athlon nodes -- for details visit the dbug site on the left and click on clusters, cluster summary. I'm putting its picture in even though it isn't technically part of Brahma because it is in this space. Who knows, one day we might even share resources like this across departmental boundaries the way we currently share within departmental boundaries, however difficult that is to imagine.

Note the sad pile of dead Athlons from ganesh on the right, waiting to be fixed. I suspect, but cannot prove, that they are casualties of the miswiring of our server room, which originally shared a single neutral across each set of three circuit phases in the power poles. This is a really bad idea when all three circuits provide electricity to lots of switching power supplies at close to each circuit's capacity: the harmonic currents drawn by switching supplies do not cancel in a shared neutral the way balanced linear loads would, so the neutral can end up carrying more current than any single phase. All 16 nodes ran perfectly for a year in our old (properly wired) server room but died of various "causes" within a few months of the move into the new (improperly wired) server room. Correlation isn't causality, but you gotta wonder...
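
For the curious, the short version of why this is so bad: switching power supplies draw current rich in odd "triplen" harmonics (3rd, 9th, ...), and unlike the fundamental currents of balanced linear loads, these do not cancel in a shared neutral; they add. A back-of-the-envelope sketch in python follows, with made-up illustrative numbers rather than measurements from our power poles.

    # Neutral current in a single neutral shared by three phases of nonlinear load.
    # ASSUMPTIONS (illustrative only): 16 A RMS per phase, with 60% of that current
    # in triplen harmonics, a plausible ballpark for uncorrected PC power supplies.

    PHASE_CURRENT_A = 16.0     # assumed RMS load current on each of the three circuits
    TRIPLEN_FRACTION = 0.6     # assumed fraction of each phase current that is triplen

    neutral_current_a = 3 * TRIPLEN_FRACTION * PHASE_CURRENT_A   # triplens add, not cancel
    print(f"~{neutral_current_a:.0f} A flowing in the one shared neutral")
    # Roughly 29 A through a conductor sized on the assumption it would carry almost nothing.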

Back to top

Room Infrastructure Details

The pictures below explore various features of the room's infrastructure in more detail, primarily as a guide to would-be cluster engineers. Even though your own environment may be very different (how many people will be building a cluster room in a renovated nuclear accelerator staging and storage room, formerly occupied by giant mutant cockroaches drawn to the water slowly dripping into its large lead sink?), it can still be useful to see at least one way to set things up.

As you can see from the pictures above, Brahma is nothing if not eclectic -- shelf mounted single and dual towers, rackmount duals and singles, a mix of speeds, architectures, and form factors, with at least three different CPU manufacturers represented (all running linux, thank goodness). So even if we do each thing only a little bit, we do a lot of different things, and you might find a picture or remark in here that can help you.

________
Campus Backbone Switches

It is probably best to "follow the network". We therefore begin with a picture of the primary network drop in the room. This is where fiber feeds terminate and serve various high speed switches connecting the room with both the main department backbone (mostly terminated in a different closet on the other side of the building) and the campus backbone. Duke has lots of fiber, run with considerable foresight during a major campus wiring project back in the 90's, which now presents us with the possibility of, e.g., connecting our cluster room to other campus cluster rooms with high bandwidth fiber to form a general campus grid.

This little mini-rack is tucked in behind the door in an alcove that seemed "made for it". It has its own UPS (not shown). Its UPS has even failed. What fun!

Back to top

________
Ceiling Mount Cabling Tray

Leaving the switch, we observe a thick bundle of cat 5 cables neatly collected in a cabling tray suspended from the ceiling. This tray splits (just barely shown) into separate trays serving the two halves of the room on either side of the huge central AC duct.

Back to top

________
Wall Distribution Point

Here is where at least one small sheaf of wires terminates -- in a wall mounted patch block. This particular one serves the department servers, which are hooked directly into the main distribution switches in the room to keep them "close" latency-wise to the main department LAN.

A couple of amusing details -- tennis balls protect human backs and delicate hardware from the gas and air pipes (yes, these too were to have been capped in the ceiling and removed from the wall during the renovation). Note also the thermometer on top of the department server rack/stack -- we are somewhat obsessed with room temperature, with a variety of automated and non-automated sensors and thermometers all over the room.
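
The "automated" part of that doesn't have to be anything fancy; a small polling script that nags the admins when the room warms up goes a long way. Here is a purely hypothetical sketch in python -- the sensor file, threshold, and mail addresses are invented for illustration and are not our actual setup.

    # Hypothetical temperature watchdog.  Assumes some sensor daemon appends one
    # Fahrenheit reading per line to SENSOR_FILE; everything here is illustrative.
    import smtplib, time
    from email.message import EmailMessage

    SENSOR_FILE = "/var/log/roomtemp"    # hypothetical sensor log
    THRESHOLD_F = 75.0                   # complain above this temperature
    CHECK_EVERY_S = 60

    def latest_reading():
        with open(SENSOR_FILE) as f:
            return float(f.readlines()[-1])

    def send_alert(temp_f):
        msg = EmailMessage()
        msg["Subject"] = f"Server room is at {temp_f:.1f} F"
        msg["From"] = "roomtemp@example.edu"        # made-up addresses
        msg["To"] = "cluster-admins@example.edu"
        msg.set_content("Check the AC before the thermal kill switch does it for you.")
        with smtplib.SMTP("localhost") as s:
            s.send_message(msg)

    while True:
        if (temp := latest_reading()) > THRESHOLD_F:
            send_alert(temp)
        time.sleep(CHECK_EVERY_S)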

Note also the Raritan Master Console KVM beneath the monitor. I got this originally for the brahma cluster, but it is actually a lot more useful for the servers (since one almost never needs to log in to cluster nodes from a console, but rather FREQUENTLY needs to log in to servers from the console).

Back to top

________
Thermal Kill Switch

As further evidence of our obsession, observe this thermal kill switch for the room power. This is basically our last line of defense against meltdown -- if the room temperature exceeds a fairly high cutoff this switch will kill all power to everything in the room. We came dangerously close to triggering it a couple of times this winter when facilities shut down the chiller because it was, well, Winter!

After some vigorous discussion, the chiller was turned back on, and after a few more near disasters it was even pulled out of the automated system that KEPT shutting it down. This was good -- in our original meetings on room specifications it was clearly and unequivocally indicated that the room would require air conditioning 24x7, all year long. Even if it IS cold outside.

Note as well another thermometer. This seems like a good place to really know the temperature...and a bad place for a "hot spot".

Back to top

________
Rackmount Patch Panel (front)

Other wiring runs down to the end of the room, but most of the clusters are on self-contained networks -- flat with the department LAN but on 192.168 blocks and on their own switches. Here is a typical patch panel arrangement at the top of a rack pair. All cluster nodes in the rack(s) are wired up through a special bracket between the two racks and patched into the patch panel.

Back to top

________
Rackmount Patch Panel (back)

In the back, to avoid an ugly mess and LOTS of unbundled cables, special L-connectors carry cable bundles up onto the overhead cable tray. The thick sheaf of yellow cat 5 wires coming up from the node NICs and going over the top to the front of the panel is visible in the space between the racks.

Note the silvered insulation on the overhead ductwork that delivers cold air. This was "urgently" added when the ductwork started to sweat and drip onto the -- floor. Fortunately.

Back to top

________
Nano Patch Panel (back)

Here's another pair of patch panels, again showing the neat cabling. Here we were smarter, and put the neat L-connectors facing front and the patch panel itself facing the rear, where all the NICs and their wiring are. Note also the clever loops through the wiring tray that take tension off the wires at the top and take up their slack.

Back to top

________
Main Cluster Switch

So where do all those cable bundles go? Here, to the patch panels next to the switch. This isn't really chaotic -- it is fairly neat and easy to service. One set of wires comes up directly from the neighboring rama cluster. The bundles, collected with velcro cable ties, come from the patch panels connected to the tops of other racks and go into the switch. Nuttin' to it. On the left side of the switch (under all of those cables) one cable runs from a gigabit port to a gig port on the main room switch (where the department servers also connect) making the nodes two switch hops away from the department server(s) with a gigabit pipe in between.

Note, however, that the switch itself is humble 100BT. Most of the department computing so far is embarrassingly parallel, "grid" like computing, and 100BT, NIS, and NFS are more than adequate to support it. There is a 24 port gigabit switch at the top of the rama cluster, just visible on the left, waiting for some code to be completed that might actually need it. The rama cluster has gigabit NICs, and in the future most cluster nodes will likely have gigabit NICs as they increasingly become "standard" on server-class motherboards.
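
A quick oversubscription estimate shows why this is good enough. The node count below is an assumed round number for illustration, not an exact census of the racks:

    # Worst-case oversubscription of the single gigabit uplink described above.
    # ASSUMPTION: 48 nodes with 100BT NICs behind the cluster switch.

    NODES = 48
    NODE_LINK_MBPS = 100.0
    UPLINK_MBPS = 1000.0

    oversubscription = (NODES * NODE_LINK_MBPS) / UPLINK_MBPS
    print(f"{oversubscription:.1f}:1 oversubscription on the uplink")
    # Roughly 5:1 here -- harmless for embarrassingly parallel jobs that mostly
    # touch NFS at job start and finish; it would hurt tightly coupled code,
    # which is what the gigabit switch is waiting for.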

Back to top

________
Kill-a-Watt

OK, the name is a bit of an awful pun (and the picture is a bit blurry), but the device is fabulously useful. This is a truly inexpensive ($20-$50, depending on where you find it and how you buy it) inline combination power meter, voltmeter, ammeter, and more. It measures, for example, VA (RMS voltage x RMS current), Watts (the time average of instantaneous voltage x instantaneous current), and Power Factor (the ratio of the second to the first). It can keep a running total of the average power drawn by the line it monitors, just like the meter the power company reads.

I cannot convey how important it is in cluster engineering to be able to measure system draw. The node pictured (dual AMD Athlon 1900+MP on a Tyan 2466 motherboard) draws (as shown) 169 watts idle. Loaded, it draws about 230 watts. With its power factor, instead of pulling an RMS current around 2 Amps loaded, it is closer to 2.6. All of this significantly affects how many nodes one can safely put on a single 20 Amp circuit, as well as expected heat production and AC capacity estimates.
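
If you want to turn those measurements into circuit-loading and cooling estimates, the arithmetic is simple enough to script. Here is a minimal sketch in python; the power factor and line voltage are assumed values chosen only to be roughly consistent with the numbers quoted above, not additional measurements.

    # Back-of-the-envelope circuit loading from a measured node draw.
    # ASSUMPTIONS: 120 V nominal line voltage, power factor ~0.75, and the common
    # practice of loading a breaker to only 80% of its rating for continuous draw.

    LINE_VOLTAGE = 120.0      # volts, assumed nominal
    WATTS_PER_NODE = 230.0    # loaded draw per node, from the reading described above
    POWER_FACTOR = 0.75       # assumed, roughly consistent with ~2.6 A at 230 W
    CIRCUIT_AMPS = 20.0       # breaker rating
    DERATE = 0.80             # fraction of the breaker rating used for continuous load

    va_per_node = WATTS_PER_NODE / POWER_FACTOR     # apparent power per node
    amps_per_node = va_per_node / LINE_VOLTAGE      # RMS current per node
    nodes_per_circuit = int((CIRCUIT_AMPS * DERATE) // amps_per_node)

    btu_per_hour_per_node = WATTS_PER_NODE * 3.412  # heat load per node, for AC sizing

    print(f"{amps_per_node:.2f} A per node, {nodes_per_circuit} nodes per 20 A circuit")
    print(f"{btu_per_hour_per_node:.0f} BTU/hr of heat per loaded node")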

Back to top

________
Room Heat Exchanger/Blower

Let's return to Air Conditioning for a moment. Here is a close-up of the room's blower unit. Warped perspective limits appreciation here (as in so many places:-) but the unit is about two meters, cubed. It is LOUD and runs 100% of the time. One can hardly hear the room's four hundred plus cooling fans over it.

This unit is not the room's "air conditioner". It is only half of it. The other half is the "chiller", which is located far away (God knows where) on campus. The chiller delivers cold water through insulated pipes...

Back to top

________
Chilled Water Pipes

...like these. Very much like these. In fact, these are they. One pipe carries cold water in, the other pipe carries still cold, but warmer water out, presumably carrying all the heat released by all the nodes in each passing second back to the chiller itself where it can be forcibly removed and dumped as waste into the outside air.
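
If you are curious how much water it takes to carry that heat, the estimate is one line of physics: the heat removed equals the mass flow times the specific heat of water times the temperature rise between the two pipes. A sketch with assumed round numbers follows; the 50 kW heat load and 6 C rise are illustrative guesses, not measured figures for this room.

    # Chilled-water flow needed to carry a given heat load:
    #   heat_load = flow * c_p * delta_T    (for water, c_p ~ 4186 J/(kg*K))
    # ASSUMPTIONS: 50 kW total node heat load, 6 C supply/return difference.

    HEAT_LOAD_W = 50000.0     # assumed total heat load, watts
    DELTA_T_C = 6.0           # assumed temperature rise of the water, C
    C_P = 4186.0              # specific heat of water, J/(kg*K)

    flow_kg_per_s = HEAT_LOAD_W / (C_P * DELTA_T_C)
    flow_l_per_min = flow_kg_per_s * 60.0           # 1 kg of water is about 1 liter
    flow_gal_per_min = flow_l_per_min / 3.785

    print(f"{flow_kg_per_s:.1f} kg/s, roughly {flow_gal_per_min:.0f} gallons per minute")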

Back to top

________
Cold Air Protective Gear, Human

When things work just right, the ambient room air stays down well below 70F. We actually prefer it to stay at or below 60F, and have achieved a fairly stable and directed air balance in which the air is that cold primarily right in front of the nodes, about 65F in the mixed central room air, and about 75F right behind the nodes on its way up to the ceiling and the return air ducts.

This makes the room air anywhere from mild to chilly to downright cold, as the actual air coming out of the blower is maybe 50-55F and yes, blowing. If you happen to be working right in front of a pile of nodes with a 50F high speed wind directed straight at your head, it is a real threat to your health.

Sometimes one remembers to dress warmly for the server room. Sometimes it is winter, and one just is dressed warmly. Other times it is summer, hellishly hot outside, and one is wearing loud Hawaiian shirts, stylish shorts, and sports sandals, one is. At least if one happens to be me. At those times having a jacket on permanent reserve in the server room is a potential lifesaver.

This is a genuine NASA jacket, contributed by Icon.

Back to top

________
Workbench

Wait, wasn't this the "ganesh" picture somewhere above? Yes, but here it is presented as a picture of the workbench, another essential cluster amenity.

Note the cables, the monitor and keyboard, the wheels, the adequate working room. In other pictures, you'll notice that the room also contains a couple of nice, high work chairs. It has its own rechargeable electric tools, regular hand tools, packs of cable ties, and boxes of cables of various sorts. One can derack a node, do maintenance on it, and pop it back in in a matter of minutes, at least if its problem can be solved at all in a matter of minutes.

We're planning to build an even neater "monitor cart" out of one of the small-footprint rolling workstation units, a small flatpanel monitor, a UPS (so it can be rolled around with or without a power cord that needs to be plugged in), maybe a small KVM. That will let us fix this workbench up even better, as we won't need it to hold the room's node monitor as it sometimes does now.

Back to top

Miscellaneous Pictures

A few miscellaneous pictures follow that don't fit particularly smoothly into the narrative above but still support or portray elements of room infrastructure.

________
Rack Base Detail

This picture shows some of the detail of how the two-post racks are affixed to the floor: with four big, heavy bolts set deep into the underlying concrete. The loaded racks are heavy and, with front-mounted cases (no side rails or midpoint mounts), carry a significant torque. Having a loaded rack fall over would be bad at best and disastrous otherwise.

Note also that the floor is already laid out for future racks to be installed as they are needed. In fact, there is a two post rack, unassembled, leaning against the back wall (visible in the whole room photo at the very top). We haven't yet used half the capacity of the room, although we will have to start retiring some of the space-consuming shelfmount clusters when we've filled the racks we have and about four or five more that there is already room laid out for.

Back to top

________
Rack Closeup (CHAMP)

Here is another detail shot of racked nodes in the CHAMP cluster. Each 2U case has its own key and is screwed into the rack posts with four heavy-duty rack screws. Note also the central cabling channel between the two racks.

Most of these nodes were built by Intrex (a local computer vendor), but our large orders "stressed" Intrex a bit and the quality control on the node assembly has not been what we might have hoped for (although Intrex has been very good about standing behind their work). This isn't really their fault -- we order very specific configurations that they may not have built before, and alas they are one of a near infinity of local vendors trapped by Microsoft's ruinous OEM agreements, so they have little direct experience with linux. A lonely 1U Penguin Computing node appears on the rack as a prototype that may prove more reliably built and linux-tested, although also more expensive.

The picture also illustrates two valuable suggestions, especially for a DHCP/PXE shop. One thing Intrex did do for us is to record the NIC address of each node as they burned it in and label the node accordingly. This makes it easy to put the NIC address into suitable tables on the DHCP/PXE server and establish a node's "identity". Once established, we use our own label maker and label the node by name. On a good day, we'll even get a rack stacked up in order by name and IP number, although the CHAMP cluster had so much trouble initially with the Tyan 2460 motherboards that the nodes are still largely out of order.
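
If you keep that table of names and NIC addresses in machine-readable form, generating the DHCP configuration from it is a few lines of script. A minimal sketch in python follows; the hostnames, MAC and IP addresses, and output filename are made-up examples rather than our actual configuration.

    # Turn a node table (hostname, MAC, IP) into ISC dhcpd "host" stanzas for PXE.
    # All of the entries and the output filename below are invented examples.

    nodes = [
        ("c00", "00:e0:81:aa:bb:01", "192.168.1.100"),
        ("c01", "00:e0:81:aa:bb:02", "192.168.1.101"),
    ]

    with open("dhcpd-hosts.conf", "w") as out:
        for name, mac, ip in nodes:
            out.write(f"host {name} {{\n")
            out.write(f"  hardware ethernet {mac};\n")
            out.write(f"  fixed-address {ip};\n")
            out.write('  filename "pxelinux.0";\n')
            out.write("}\n\n")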

Back to top

________
Rack Back (nano)

This shows the (remarkably neat) wiring at the back of the nano nodes. Seth wishes that rama were as neat. However, I'm an intrinsically sloppy human and "neat enough" rules the day.

Back to top

________
Shelf Back (qcd/server shelf)

For comparison, here is the back of the QCD/department server shelf. It is still reasonably neat and cable-tied up, but not as neat as a rack. Note the clever "mini-shelf" in the middle, holding cable, cable ties, and some tools. Note also that this once again shows good old heavy duty steel shelving from Home Depot. This shelving costs between $60 (18"x36"x72") and $100 (24"x48"x72") and will literally hold a ton or more. The holes in the uprights are ideal for fastening e.g. power strips to, and the particle board shelves can easily be drilled or screwed into.

Back to top

________
Shelving Again

Here is another view of -- the shelving. Two shelves, ten good-sized towers. Note the ubiquitous tennis balls, this time protecting the kidneys and arms and heads of systems administrators from the sharp corners on the support uprights.

Note well that the wire shelving holding the Brahma 2 cluster cost $500 back when I foolishly thought there was special virtue to shelving built "for" computers.

Sigh.

Back to top

________
Department Server Rack

This little cutie houses the department's primary servers. Four posts with sliding rails and wheels! Top shelf for monitor and KVM! Servers with nifty blinking blue lights! RAIDs! All under "right now" service contracts, all in a single compact package that could even be wheeled out of the building in the event of a fire, assuming that anybody is crazy enough to risk their life to rescue -- data.

It looks really cool with the lights out. In fact, the whole server room looks cool with the lights out, as each node has a little green state LED on the front panel. I've taken some lights-out pictures, and eventually I may post them just for fun.

Back to top


Seth, Looking Grumpy


________
Grumpy Seth

This is Seth K. Vidal, Sysadmin Extraordinaire, Looking Grumpy.

Why is he looking grumpy? A cynic would say because he is a Grump. A more astute cynic would say because he is being photographed for these chronicles (and he hates being photographed, especially by me in one of my "bug Seth" moods). However, it could also be something on his screen (so devoutly studied) that makes him Grumpy.

Maybe yum has a bug (parenthood is never easy). Maybe the temperature in the server room is rising. Maybe physical facilities is planning to shut down the power in the building next week and just thought to tell him about it. Or maybe all of these things are, by strange chance, true all at the same time.

Or maybe he is looking...

Back to top

________
DULUG

...at DULUG, which happens to "grace" his office.

This is the man that ofttimes suggests that I clean up rama's wiring;-)

(Fortunately, no photographs are available of my office. A photographer couldn't get through the door.)

This is why we require highly scalable cluster design (for all its form factor eclecticism). We have to keep Seth from being Grumpy.

Back to top

Icon, Looking Happy


________
Happy Icon

This is Icon Riabitsev, looking pretty happy.

Why is he looking happy? Hard to say, but Icon pretty much always looks happy, or at least has a gentle little buddha-smile. This is a fortunate thing, as Icon is a big guy.

Icon is a very bright guy and a true linux/web/cluster systems guru. Because of the usual visa problems, Icon will be looking for work in a country more open to immigrants by 2005. If you think you might want to hire him, give him a shout.

Let's see, we have Grumpy, we have Happy, we have Doc (that would probably be me, rgb, photo regretfully unavailable) and the cluster room is definitely deep within the earth. There's a metaphor in here somewhere, I'm struggling to find it...

Back to top


This page is maintained by Robert G. Brown: rgb@phy.duke.edu