The Duke Physics Department is not new to cluster computing, nor did its cluster computing efforts begin with the beowulf in 1995. The department has run a Unix-based TCP/IP client/server network of commodity workstations since roughly the mid-80's. Originally this network was built around a PDP-11, a Sun 110M server with many serial ports and a serial port terminal, a handful of Sun 3's, a SGI Iris, and a Sun 386i.
A number of these systems were being used to perform computations in granular flow (Robert Behringer), quantum optics and condensed matter physics (Robert G. Brown and Mikael Ciftan) and of course there were numerous computations run on specific topics on individual systems over the years (we are a physics department, after all:-).
Over almost a ten year period, workstations (mostly Sun Sparcstations, but with SGI Iris's, a SGI 220S, and various PC's augmenting the mix) were added to the department's original thinwire 10Base2 network, and the group of Berndt Mueller in particular began using the network of systems for high performance computing in nuclear physics. SunOS did an excellent job of multitasking, so it was very natural to start to spread numerical jobs around and run them "in parallel" in the background on a workstation where some user was logged in and working interactively (but not computationally) in the foreground.
In about 1993, Brown learned about PVM -- one of the original parallel libraries designed to turn a cluster of networked workstations in to a "virtual supercomputer". Inside six months, his computations were running on top of PVM, using virtually all the systems in the physics department network as a compute cluster. In this way the jobs rapidly accumulated supercomputer-level quantities of time and permitted him to publish very precise Monte Carlo results without the tremendous expense of a real supercomputer, using what amounted to "free" computing (otherwise wasted idle cycles on desktop workstations).
This worked so well that by 1994 Brown was using all of the physics department and the math department's architecturally similar Sun LAN as well, over thirty two Sparc CPUs all total, in a single ongoing computation, and looking for more "free" compute power.
This was found in 1995 in form of new Academic Computing clusters set up by Duke all over campus for student use. These were (mostly) Sparc 5 systems running Solaris 2.2 and 2.3, which had recently replaced the venerable BSD-based SunOS on all newly sold Sparc systems. Brown had account access on the systems already by virtue of being on the faculty; the systems (like all workstations used as interactive desktops) were 90% idle 90% of the time, and it looked like an ideal opportunity to extend the computation out over 100 or more systems delivering on the order of GFLOPs of total compute power.
Brown sought permission to run his (Monte Carlo) code in the background on these systems. Permission was granted on an experimental basis, and he promptly did so. For a bit more than six months, the "OnSpin3d" computation was spread out over as many as 150 systems at once in acpub, in the physics department, and in the math department.
Many GFLOP-years of cycles (depending on what you count as a "GFLOP") were accumulated on his embarrassingly parallel computation (which was no longer being run on top of PVM to avoid stressing the campus backbone network, but rather was being run more like a Grid application with remote, scripted job submission and occasional expect/tcl/rsh grazing of accumulated results). All this work went into a lovely Physical Review paper.
Alas, it turned out that althought the idea of recovering cycles like this was conceptually sound and worked perfectly under BSD-based SunOS, SysV-based Solaris had a different, badly broken, scheduler. A process that could run on a Sparcstation 1 in the background "invisible" to the interactive user of the Sparcstation console would gradually de facto boost its priority on a ten or more times faster Sparcstation 5 to where it would ultimately render the interactive console unusably slow. In its early years and revisions, Solaris was known not-so-fondly as "Slowaris" among computer cognoscenti, largely for this reason.
Obviously, this was unacceptable, and Brown's "grand experiment" with turning an entire University WAN into a (for the time) enormous compute cluster ended. At the time there seemed to be little reason to "publish" the experiment for public accolade, but using published linpack results of roughly 10 MFLOPS for the Sparc 5, the aggregate peak power of this "Grid" style cluster in application to OnSpin3d was in the ballpark of 12-15 GLOPs, enough to have easily put it in the top 100 on the top 500 list for 1995 (recognizing, of course, that the aggregate linpack results for that list were based on a slightly different benchmark and hence not directly comparable).
Even so, the GFLOP-years of computation accumulated during the six months that OnSpin3d ran on the cluster were sufficient to allow a lovely paper to be written, and provided direct inspiration to Brown to build a dedicated cluster computer in the physics department to continue his work. He faced two problems. Sparcs (up to then the architecture of choice for the price/performance senstive, especially at Duke, which had excellent pricing from Sun) were proving too expensive to buy in large quantities. At the same time Solaris had proven to be a spectacular failure as a potential cluster operating system, and although the physics department had clung to SunOS 4.x for that very reason, it was increasingly difficult to resist Sun's pressure to "upgrade" to Solaris on new hardware.
Brown had already encountered linux and was experimenting with floppy-based Slackware and SLS on his home PC's, but the 133 MHz Pentium and AMD processors of the day, although "decent" performers in terms of the interactive desktop, were not stellar numerical performers compared to the 10 MFLOP or better Sparcs. In 1995, however, Intel had released its new Pentium Pro (P6 architecture) chips.
These chips promised several very signficant improvements over the Pentium. Their floating point performance was much improved and actually compared well to the Sparcs, including the new Ultra Sparcs, in absolute terms while significantly beating them in price/performance. Also, they were architected to work in a dual processor configuration. At the time, the chassis, hard disk, and other supporting hardware that comprised a "system" were a significant fraction of its total cost. Packing two processors in one chassis represented a further improvement in the cost/benefit ratio for CPU bound code like OnSpin3d.
Brown did not have tremendously great resources to spend on computer hardware, and there were limits on what could reasonably be sought from the agency funding his research. This meant that building a cluster with 32 or more nodes was out of the question, as it would have cost well over $100,000 at the time. The design of a linux-based Pentium Pro cluster to run OnSpin3d was obvious, it was a matter of matching the scale of the design to the funding. Support was obtained from the Army Research Office to purchase four dual processor 200 MHz Pentium Pro systems, with the University providing a brand new switched 100BT ethernet for the entire department as part of a campus wide networking project.
Shortly after the system was designed and its funding sought, a visitor to the department from Drexel gave a short presentation on cluster computing and for the first time we learned of the linux-based "beowulf" built the previous year, with similar clusters duplicated at Drexel and a few other institutions (the ones at the top of the "clusters" list on the beowulf.org website). We were not alone! This came as a welcome surprise, and Brown soon joined the beowulf mailing list and made lots of new friends, all heavily involved in building, designing, using and even studying COTS high-performance compute clusters.
The systems funded by the ARO were ordered and delivered in mid-1996, with a few additional systems purchased with DOE and NSF funds by individual groups in the department over the next year. The 2.0.0 linux kernel, the first production kernel that would support SMP operation on the Pentium Pro, had literally just been released in June and was promptly installed. A minor glitch with the vendor (the systems were ordered with TWO processors, but shipped with only one, making it remarkably difficult to get the right process scaling:-) was resolved satisfactorily.
Some interesting glitches in system operation caused Brown to make his first somewhat desperate foray into the new world of "kernel hacking", and gave him a deep insight into the benefits of running an open source operating system on compute nodes where one could try to fix things instead of wait "forever" for Sun to e.g. fix their scheduler. Nevertheless, by mid-September these initial problems were resolved, OnSpin3d was running on all four linux-based SMP systems (three as "cluster nodes", one as a server and desktop console). Eight CPUs, 1600 aggregate MHz, and a reasonable fraction of a GFLOP at a cost of less than $20,000.
Brown's desktop had been named "ganesh" for years as Brown grew up in India (a tradition that continues to this very day:-), and a motif of Hindu gods as names for Unix systems had been in use in the department for many years. The new systems were therefore christened Brahma, Shiva, Ganesh, and Arjuna, with the cluster itself named Brahma in honor of the "B" in "Beowulf" and ganesh the "head node" on Brown's desktop.
Linux 2.0.0 soon proved to have a much better multitasking scheduler than Solaris (as did the earlier 0.9+ versions he had played with before), comparable in efficiency to the SunOS scheduler of days past. OnSpin3d ran happily in the background of desktop systems with no visible impact on interactive performance, so Brown rapidly pushed most of his nodes out to desktops within the physics department, which was somewhat starved at the time for desktop workstations. Graduate offices and faculty desks were graced with the original systems, and new systems were given out to anyone willing to buy a monitor, keyboard and mouse (not essential to their cluster functionality and hence not purchased with grant money). Brahma was architected as a "distributed parallel supercomputer" on the original PVM model, which could freely mix dedicated and desktop systems into a single organized model applied to a single (fairly coarse grained) parallel problem.
More and more systems were added to Brahma, and eventually systems started to be shelved up on our relatively small server room. At any given time over half of the systems were out on desktops, but all the systems were typically at a load average of 1.0 or slightly higher, running OnSpin3d and other programs continuously in the background while permitting the interactive user to browse the burgeoning Web, read and send mail, run Mathematica, and do many other physics or instructionally related tasks. Few networks in the world squeezed more value out of its systems than our physics department, which rapidly converted from all other operating systems to almost exclusive use of Linux as its overwhelming cost/benefit advantage became apparent.
All things pass in time, however, and clusters are no exception. Moore's law continued its inexorable advance, and PII's and the PIII's appeared. The first two major "all at once" additions of cluster nodes (the Intel equipment grant systems indicated as Brahma 2 and (the Intel equipment grant systems indicated as Brahma 3) continued to be named Brahma, and a node-naming scheme of b1, b2, b3... was introduced to differentiate "dedicated" compute nodes from nodes on people's desktops. This scheme failed to indicate processor generation or node architecture, however, and users had to keep track of where the 400 MHz CPUs ended and the older 933 MHz systems began by hand. The use of desktop systems as compute nodes began to gradually decline as well, not because it was obtrusive but because it simply became too difficult to know what was available and free.
Beginning with the ganesh cluster (g01-g15 and ganesh itself) all new "Brahma" clusters have subsequently been given unique "names" of their own that are associated with their ownership, application, and numbering scheme. This, together with xmlsysd and wulfstat, have made it very simple indeed to track the clusters for utilization, load, availability, and so forth.
At the time of this writing (April, 2003) Brahma has grown from its original GFLOP or so of aggregate power to somewhere in the ballpark of 50-100 GFLOPs, with over a hundred nodes and two hundred CPUs. Its power keeps growing, as well, as by now the power of COTS cluster computing is no secret, and Duke has gone from a couple of clusters in 1996 to literally more clusters and sub-clusters than can be counted in 2003. The newly renovated cluster/server room the Duke physics building houses both our own brahma-related cluster(s) and the 96 CPU (Dual Athlon 1800+) "genie" cluster owned and operated by the Institute for Statistics and Decision Sciences at Duke, for example.
Commodity computer based network cluster computing was not invented by any single individual or group (although, as noted above, Jack Dongarra and the inventors of PVM would have one of the most valid claims based on the fact that this software package was one of the first explicitly designed to support this very thing). It has come into its present form by means of the collective effort of hundreds of individuals and groups in institutions and universities all over the world. Since the mid-90's, the organizing "brain" of that effort has without question been the beowulf mailing list, organized by Thomas Sterling and Donald Becker as an outgrowth of their "beowulf" project at NASA-Goddard.
Duke University and the Duke University Physics Department via the Brahma project is proud to have been one of the first major cluster computing sites, and one of the major contributors to the development and promotion of linux and cluster computing on the beowulf list for many years.