
A Model for Cluster Computing at Duke

The existing model for cluster computing at Duke is one of many locally centralized, generally autonomous cluster computing operations. This model works, and it works for certain very good reasons. Well-designed clusters, located in facilities that provide adequate infrastructure such as physical space, power, cooling capacity, and networking, scale extremely well in their system management requirements. That is, barring hardware failure, a cluster node should require labor on the order of an hour a year or even less to install, update, and operate, a tiny fraction of a full-time equivalent (FTE). In a department that already has a competent systems manager or systems management group, it is often possible to install and operate a cluster using opportunity-cost labor provided by the local manager as just another aspect of managing the departmental LAN.

This is a particularly efficient solution, as the LAN manager already provides most of the core services required by the cluster (e.g. account management, disk and backup services, software installation and management services, and security) for the departmental groups utilizing the cluster resource. These services can be extended to the cluster nodes for essentially zero marginal cost, making the labor cost for installing and maintaining the nodes the only cost that scales with the size of the cluster, and this cost scales in a particularly predictable way.
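The scaling argument above can be made concrete with a small sketch. It takes the rough figure of about one hour of labor per node per year from the text; the function name, the 2000-hour working year used to convert hours into FTE, and the example cluster sizes are illustrative assumptions, not figures from this paper.

```python
# Illustrative sketch only: linear labor model for a locally managed cluster.
HOURS_PER_NODE_PER_YEAR = 1.0  # "on the order of an hour a year" per node (from the text)
HOURS_PER_FTE_YEAR = 2000.0    # ASSUMPTION: working hours in one FTE-year


def annual_labor_fte(nodes: int, hours_per_node: float = HOURS_PER_NODE_PER_YEAR) -> float:
    """Return the estimated node-management labor per year, in FTE.

    Core LAN services (accounts, backups, security) are treated as already
    paid for, so only the per-node term grows with the size of the cluster.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * hours_per_node / HOURS_PER_FTE_YEAR


if __name__ == "__main__":
    # Hypothetical cluster sizes, chosen only to show the linear trend.
    for n in (16, 64, 256):
        print(f"{n:4d} nodes -> {annual_labor_fte(n):.3f} FTE/year")
```

Under these assumptions the labor term stays a small fraction of one FTE even for a few hundred nodes, which is the sense in which the cost "scales in a particularly predictable way".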

This model is also efficient for a second reason. Since there are many clusters on campus, each engineered according to the needs of its local users and being perpetually built and rebuilt as new moneys become available, there is an evolutionary optimization that naturally occurs as new ideas are tried out, good ideas and bad ideas are discovered in small-scale experiments, and these ideas and experiences are shared across campus. This model works well in the rapidly changing world of computer and networking hardware, where ``revolutionary'' changes occur every year and are an accepted part of doing business.

This should be compared to the likely efficiency of a monolithic model in which all cluster computer operations on campus were organized and managed by a single, centralized authority. Bad ideas would be costly on an institutional scale instead of a departmental or group scale; good ideas would have to diffuse into the institution from other institutions; change would necessarily proceed at a much slower rate. Worst of all, the cluster managers would likely become increasingly dissociated from their client base and increasingly narrow in their support of the wide range of user environments likely to be familiar to the cluster users. Accountability and flexibility would be lost.

These negative elements associated with monolithic models can all be observed now in those existing computer operations at Duke that are heavily centralized, especially in the realms of mainframe computing and in the generally homogeneous academic computing clusters [1]. Those of us who have been associated in some way with computing on campus over decades recall well the days of the Triangle Universities Computation Center (TUCC) and its campus equivalent (DUCC), and the inefficiencies that actively drove the primary computer users on campus to abandon this model altogether in favor of organization at the departmental scale.

For all of these reasons, the model proposed herein for improved institutional support of cluster computing remains a model that is centralized locally, at the departmental level where that makes sense and in a number of distributed cluster sites where it does not make sense. It avoids the creation of any sort of monolithic centralized cluster facility that might become the Duke Supercomputing Center (DSC) to mirror the North Carolina Supercomputing Center (NCSC) as DUCC once mirrored TUCC. It relies on institutional organization and coordination enabled by technology to achieve the desired support at the institutional scale while retaining the flexibility and cost efficiency of the localized management model.

The primary features of the proposed model are thus:

  1. Mostly decentralized clusters, in a number of "cluster facilities" in reasonable physical proximity to their users, where those users themselves tend to be clustered, e.g. physics, math, computer science, chemistry, engineering, and other science and engineering milieus with long-term needs for High Performance Computing (HPC). This simply recognizes that the existing model is fundamentally sound and should not be radically changed.

  2. As an extension of the model, one or more cluster facilities (both existing ones and new ones) can successfully house clusters belonging to otherwise isolated groups that don't need to be in immediate proximity to their clusters. Again, as the examples of Math and ISDS show, this is a viable model, but it needs to be promoted by Duke at the institutional level where a cost-benefit analysis or a lack of local infrastructure makes it appropriate. There are two possible models for managing these remote clusters. Both are likely to make sense for different kinds of clusters and cluster owners.

  3. One is the ``owner-managed'' model, where the cluster is remotely sited but still managed by a departmental LAN manager of the department to which the owning group belongs. This is the only remote management model possible and in use (by Math and ISDS) at this time. It is obviously successful, for obvious reasons (it retains most of the zero-marginal-cost advantages associated with local cluster administration).

    There are some additional cost penalties, however. The cost of physically managing and installing the nodes is considerably higher than with strictly local nodes, as it takes a relatively long time for the departmental manager to travel away from their primary departmental LAN over to the cluster site to perform those maintenance and installation duties that require physical presence. During this time offsite, their management of their departmental LAN is obviously somewhat less responsive. Similarly, they are necessarily less responsive to the needs of the cluster owners when those needs require a trip offsite to where the cluster is physically located. At a guess, offsite management by the systems manager of the owning group is roughly twice as costly per node as onsite management by the local systems manager of the owning group.

  4. An additional model proposed for the management of these offsite clusters is that they be managed by ``the university''. This alternative model is one that we wish to architect and implement for a variety of reasons. Some research groups that might wish to operate clusters are in departments that lack the human infrastructure to support an offsite cluster, or the departmental LAN infrastructure to be able to realize any sort of economy of scale if they did. In addition, groups may find advantages in the resource sharing that is enabled if they locate their cluster under a common, university-level administrative umbrella with several other architecturally similar clusters. The construction of a suitable university management model for offsite clusters is a primary focus of this white paper, although that should not be construed as any sort of abandonment of the local management model (onsite or offsite) where it makes the most sense.

  5. The existing local management model is not without flaws. Local managers at some sites have in the past been relatively untrained graduate students or postdocs, who have sometimes proven spectacularly incompetent or untrustworthy. Even when the work is done by competent and professional local managers, and where the considerable advantages associated with zero-marginal-cost extension of the existing LAN services are obtained, the labor cost associated with running one or more onsite or offsite clusters is not necessarily either trivial or acceptable in any given departmental environment.

    Running a cluster in addition to a LAN involves tradeoffs that affect productivity in many ways, the most obvious one being that in many cases an administrator must choose to do one or the other, performing a sort of task prioritization or triage as needs for services and support emerge. If the LAN manager is relatively underutilized, this is not generally a problem. If they are already heavily burdened, the extra work can easily overburden them and result in a reduction in the quality of services.

    Also, these local systems administrators are (generally) well-trained in LAN administration but may lack expertise germane to cluster management per se (where it differs). The construction of a university-level mechanism to better support and to better train onsite and offsite local managers is also a primary focus of the model proposed in this white paper.

  6. In order to accomplish these goals of providing clusters that are fully managed by the University (offsite as far as the cluster owners are concerned), providing operational support to both onsite and offsite local managers, and providing improved training for local managers, the University will clearly need some sort of centralized cluster organization. This organization can improve productivity and efficiency at the institutional level in many much-needed ways. For example, in addition to the above, it can also help manage: cluster siting and the building or remodeling of facilities as needed; cluster tracking and inventory; grant-writing support; cluster architecture and standards; personnel support (both centralized and owner/local); application support; information coordination and dissemination; cluster integration both on campus and off (at NCSC, for example); and the management of the university-managed clusters.
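As a rough illustration of the cost penalty estimated for the owner-managed offsite model in the list above, the following sketch compares yearly node-management labor for the same cluster sited onsite and offsite. The one-hour-per-node baseline and the factor of two are the paper's own rough guesses; the function names and the 64-node example are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: onsite vs. owner-managed offsite labor per year.
ONSITE_HOURS_PER_NODE = 1.0  # "on the order of an hour a year" per node (from the text)
OFFSITE_MULTIPLIER = 2.0     # "roughly twice as costly per node" (the text's guess)


def annual_labor_hours(nodes: int, offsite: bool = False) -> float:
    """Estimated yearly node-management labor in hours for one cluster."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    factor = OFFSITE_MULTIPLIER if offsite else 1.0
    return nodes * ONSITE_HOURS_PER_NODE * factor


def offsite_penalty_hours(nodes: int) -> float:
    """Extra yearly hours incurred by siting the same cluster offsite."""
    return annual_labor_hours(nodes, offsite=True) - annual_labor_hours(nodes)


if __name__ == "__main__":
    n = 64  # hypothetical cluster size
    print(f"onsite:  {annual_labor_hours(n):.0f} h/year")
    print(f"offsite: {annual_labor_hours(n, offsite=True):.0f} h/year")
    print(f"penalty: {offsite_penalty_hours(n):.0f} h/year")
```

With a fixed multiplier the penalty grows linearly with node count, so the sketch says nothing about the responsiveness costs described above, which are harder to quantify.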

This, then, is an outline for a campus cluster support model that is fleshed out in more detail below. In it, clusters will continue to be both managed and physically located locally where it makes obvious sense to do so, as this results in by far the greatest economies of scale. Nevertheless, a University-level cluster computing operation will be proposed that will remain at least partly delocalized itself, and which will be responsible for providing a variety of levels and kinds of support to groups operating or hoping to operate clusters for many purposes throughout the University.

Robert G. Brown 2003-06-02