
Cluster Siting

In this section we will discuss the advantages of cluster decentralization (or rather, centralization at a department-local level) in more detail, doing a cost-benefit analysis (CBA) of local management-local siting, local management-remote siting, and remote management-remote siting for a variety of typical cluster environments. The numbers presented in this CBA are a ``best guess'' sort of approximation and should be refined with actual numbers where available.

It is difficult to discuss cluster computing at any scale in completely general terms. On the beowulf list, ``your mileage may vary'' (YMMV) and ``it depends on what you are doing'' are the standard warning and answer to nearly any complex question. A cluster that works optimally (in the CBA sense) for one computation won't work at all for a different computation. For that reason, we need to differentiate clusters, and cluster problems, at a very early point in the discussion into two very generic classes:

Silly as this distinction may be, it is a crucial one. Problems and clusters that fit in the former group must, for all practical purposes, be engineered and operated on a per-problem, per-cluster basis by the group that uses the cluster. At this point in time the University simply cannot provide meaningful support for this sort of cluster computing at the institutional level. As time passes and the cluster support described in this document is (hopefully successfully) implemented, that may change. At this time, however, it would be a capital mistake for the University to even consider anything but a local management model for this sort of cluster.

It is at least possible to describe some fairly ``generic problems'' that fit the latter description, and to describe a ``standard cluster'' architecture that should do just fine to solve them. Remote, centralized cluster management makes the most sense when the cluster has a very ``vanilla'' architecture that will work successfully on a wide range of relatively simple cluster problems. We will therefore focus most of our attention on problems of this sort.

To make the discussion concrete, let us consider an ``embarrassingly parallel'' application such as a Monte Carlo computation consisting of many fully independent sub-computations. We will presume that only a small amount of data is required to initiate a sub-computation, which runs for a long time on a single CPU and then returns a small amount of data that represents the result. Such a computation runs efficiently in parallel on any number of processors, requires little in the way of network speed or local storage, and doesn't globally fail if a single node goes down in the middle of its sub-computation.
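A minimal sketch of such a computation, for illustration only: the choice of problem (estimating pi by random sampling) and the names `run_sub_computation` and `estimate_pi` are assumptions made here, not part of the original discussion. Each task takes a tiny input (a seed and a sample count), runs independently, and returns a single integer.

```python
import random


def run_sub_computation(seed, n_samples):
    """One fully independent sub-computation: small input, long run, small output."""
    rng = random.Random(seed)  # private generator, so tasks share no state
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks, samples_per_task, executor=None):
    """Farm out n_tasks sub-computations and combine their small results.

    executor is any object with a map() method shaped like the built-in map;
    with executor=None the tasks simply run one after another.
    """
    seeds = range(n_tasks)
    sizes = [samples_per_task] * n_tasks
    mapper = executor.map if executor is not None else map
    total_hits = sum(mapper(run_sub_computation, seeds, sizes))
    return 4.0 * total_hits / (n_tasks * samples_per_task)
```

Passing a `concurrent.futures.ProcessPoolExecutor` as `executor` spreads the same tasks over the CPUs of one machine; on a cluster the role of the executor would be played by whatever job distribution mechanism is in use. Because no task depends on any other, a task lost to a node failure can simply be rerun with the same seed.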

In addition, we will consider a more challenging but still fundamentally simple problem such as a ``coarse grained'' lattice decomposition of some sort. Each node works on a part of some large space (lattice). To advance the computation, many of the nodes have to exchange results with other nodes before they can proceed, and if a single node goes down in mid-computation the entire computation dies and must be started over from the beginning. However, each node still does a lot of computation for a little bit of communication, and the computation can thus be scaled up to many nodes with a very generic network architecture. Also, the computation has no particularly special requirements in terms of local storage or memory and can easily fit on a fairly standard node design. However, it does generate a fairly large set of results, output continuously throughout the computation.
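The structure of such a decomposition can be sketched on a toy problem. The example below is an assumption chosen for illustration (a one-dimensional periodic lattice with a three-point averaging update, and hypothetical function names); it runs in a single process and only mimics the split between a communication phase and a local computation phase.

```python
def step_serial(lattice):
    """Reference update on the whole lattice with periodic boundaries."""
    n = len(lattice)
    return [(lattice[i - 1] + lattice[i] + lattice[(i + 1) % n]) / 3.0
            for i in range(n)]


def split(lattice, n_nodes):
    """Give each hypothetical node one contiguous block of the lattice."""
    size = len(lattice) // n_nodes  # assumes n_nodes divides len(lattice)
    return [lattice[k * size:(k + 1) * size] for k in range(n_nodes)]


def step_decomposed(blocks):
    """One step: exchange boundary cells, then update each block locally."""
    n_nodes = len(blocks)
    # Communication phase: each node needs one cell from each neighbour.
    halos = [(blocks[k - 1][-1], blocks[(k + 1) % n_nodes][0])
             for k in range(n_nodes)]
    # Computation phase: purely local work on the block plus its two halo cells.
    new_blocks = []
    for block, (left, right) in zip(blocks, halos):
        padded = [left] + block + [right]
        new_blocks.append([(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
                           for i in range(1, len(padded) - 1)])
    return new_blocks
```

In this toy, a block of m cells performs m updates per step but exchanges only two cells, so the ratio of computation to communication grows with the block size; that is the sense in which the decomposition is coarse grained. Every block also needs fresh boundary values from its neighbours on every step, which is why the loss of one node stalls the whole computation.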

Both of these computations will run efficiently on a very generic architecture. Let us now analyze the costs of the different ways of siting the hardware and managing it.

A cluster supercomputer of any design is at heart a client/server LAN. Some of the costs of installing and managing a LAN scale with the number of servers. Others are fixed costs that don't scale at all. Still others scale with the number of clients, or the number of users. As is the case with any such LAN, primary costs for LAN construction, maintenance, and administration include items such as:

These are all services that must be provided and costs that must be paid for any LAN, including the specialized LANs we call compute clusters or beowulfs.
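The way these costs scale can be written down as a simple linear model. The sketch below is an assumption made for illustration: the class name, the linear form, and treating the compute nodes as the LAN's clients are choices made here, and any figures fed into it are placeholders to be replaced with actual numbers where available.

```python
from dataclasses import dataclass


@dataclass
class LanCostModel:
    """Hypothetical linear cost model; all rates are in the same currency unit."""
    fixed: float        # costs that do not scale at all
    per_server: float   # costs that scale with the number of servers
    per_client: float   # costs that scale with the number of clients (nodes)
    per_user: float     # costs that scale with the number of users

    def total_cost(self, servers, clients, users):
        return (self.fixed
                + self.per_server * servers
                + self.per_client * clients
                + self.per_user * users)

    def cost_per_client(self, servers, clients, users):
        """Total cost spread over the clients, to show how fixed costs amortize."""
        return self.total_cost(servers, clients, users) / clients
```

With placeholder rates, doubling the number of clients while holding servers and users constant lowers `cost_per_client`, because the fixed and per-server terms are spread over more nodes. That amortization is what the comparison of siting and management schemes turns on.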

In addition, there are certain physical infrastructure costs associated with a LAN that must be tallied. These are not human or management costs (detailed above) but are nonetheless far from negligible.

With these costs in hand at least by name, we are finally in a position to consider and compare the various location/management schemes.

Robert G. Brown 2003-06-02