Next: The (De)Centralized Management Group Up: A Model for Cluster Previous: User Support   Contents

Physical Infrastructure

As detailed above, the recurring physical infrastructure cost (power and cooling) of maintaining a cluster node at any on-campus site can reasonably be estimated at $1 per watt of steady-state power consumption per year. This estimate does not include the cost of renovating a cluster site, the cost of racks or wiring trays at the site, or any sort of ``rent'' on the physical space provided at a particular site for a cluster node, as these are all one-time capital infrastructure costs.

It is difficult to know what to do about these costs as far as cost recovery is concerned. The recurring physical infrastructure cost for operating a node is not trivial at roughly $100 per CPU per year, but neither is it trivial for operating a desktop workstation within any department on campus, or for providing power and cooling for any piece of experimental apparatus in a lab. In some cases (such as powering the TUNL particle accelerators or the FEL) those costs are tremendous and must clearly be borne by a funding agency as a line item. In most other cases, they are considered to be part of the infrastructure already paid for in the indirect cost portion of a grant. To the best of my knowledge, the University doesn't put separate electrical meters on each lab space throughout campus and attempt to backcharge the researchers in each space for the power and cooling they happen to consume.

Similar considerations seem to hold for renovations to space required to site clusters within departments or elsewhere on campus. In many cases the University provides space renovations as part of the startup package offered to attract faculty hoping to build some sort of experimental program, or renovates space to meet the changing needs of established programs as they grow and alter their focus. Large programs or programs requiring new construction, however, might well fund the construction or renovation out of grant money. Again, it seems reasonable to assume that indirect costs already charged to most grants suffice to cover at least a ``reasonable'' amount of space and renovation effort if it is required to support a funded project.

Clusters seem to be right in the middle. For many researchers, they are just another piece of essential equipment. Grants to theorists, mathematicians, computer scientists, and statisticians in particular have paid indirect costs at basically the same rate as experimentalists for years while generally requiring little more than an office and access to the library to support their research. As times change and clusters become an integral tool for researchers in these fields, it is not at all unreasonable for them to expect the same sort of infrastructure support for their essential equipment as has always been given routinely to the experimentalists.

On the other hand, a 128-CPU cluster can cost on the order of $13,000 per year to power and cool. Its space requirements are not huge (a few square meters of floor and rackspace), but the capital cost of renovating that space to achieve adequate power and cooling density in any given location may not be small (on the order of $50,000 to $150,000, amortized over perhaps ten years and split among hundreds of nodes). The economics and cost-benefit analysis (CBA) of indirect cost recovery from the grants that will presumably have cluster nodes housed in the facility over the course of many years are not as easy to determine, and a compelling case would likely have to be made before at least some of the many granting agencies would be comfortable supporting it as a line item in all but a few special cases.
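The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the $1 per watt per year rate comes from the text, the 100 watts per CPU is inferred from the quoted $100 per CPU per year, and the node count of 200 is a hypothetical stand-in for ``hundreds of nodes''.

```python
# Rate taken from the text: $1 per watt of steady-state power per year.
DOLLARS_PER_WATT_YEAR = 1.0
# Assumption: about 100 W per CPU, inferred from "$100 per CPU per year".
WATTS_PER_CPU = 100.0


def annual_power_cooling_cost(n_cpus, watts_per_cpu=WATTS_PER_CPU,
                              rate=DOLLARS_PER_WATT_YEAR):
    """Recurring power and cooling cost, in dollars per year."""
    return n_cpus * watts_per_cpu * rate


def amortized_renovation_per_node(renovation_cost, years, n_nodes):
    """One-time renovation cost spread evenly over years and nodes,
    in dollars per node per year."""
    return renovation_cost / (years * n_nodes)


if __name__ == "__main__":
    # 128 CPUs at the assumed 100 W each: on the order of $13,000 per year.
    print(annual_power_cooling_cost(128))
    # Renovation range from the text, amortized over ten years and a
    # hypothetical 200 nodes.
    for cost in (50_000, 150_000):
        print(cost, amortized_renovation_per_node(cost, years=10, n_nodes=200))
```

Under these assumptions the amortized renovation share per node is small compared with the recurring power and cooling cost per node, which is one reason the two are treated separately here.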

It is therefore suggested that no attempt be made to recover this cost at this time, at least on a routine basis. In the case of smaller clusters (5 to 10 kilowatts), attempting recovery would indeed be unreasonable, and the cost should be considered to be paid out of the indirect costs on the supported research done with the cluster nodes thus housed. In the case of larger clusters (up to perhaps 25 kilowatts) it could be argued either way, but not convincingly without hard data that can only be collected deliberately, over time, by looking at the indirect cost balance of actual clusters operating in situ.

For clusters larger than 25 kilowatts (more than 256 CPUs, to use another measure), the capital outlay from the granting agency is already expected to be in the hundreds of thousands of dollars. At this scale, granting agencies have to expect to pay various additional charges, because operating such a cluster, even within a pre-existing LAN with zero-marginal-cost extension of the LAN workspace, is likely to require a significant fraction of an FTE manager. Although some of these costs may be recovered for centrally run clusters using the integrated node approach above, within a departmental LAN they are not usually charged, even though the additional administrative burden can easily push the local administrator(s) across an FTE capacity boundary.
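The size thresholds discussed above can be summarized as a small Python sketch. The boundaries (10 and 25 kilowatts) are taken from the text; the function names, the tier descriptions, and the 100 watts per CPU conversion (inferred from the equivalence of 25 kilowatts and roughly 256 CPUs) are illustrative assumptions, and clusters below 5 kilowatts are simply grouped with the small tier.

```python
SMALL_MAX_KW = 10.0     # "smaller clusters (5 to 10 kilowatts)"
LARGE_MAX_KW = 25.0     # "larger clusters (up to perhaps 25 kilowatts)"
WATTS_PER_CPU = 100.0   # assumption inferred from 25 kW ~ 256 CPUs


def cluster_kilowatts(n_cpus, watts_per_cpu=WATTS_PER_CPU):
    """Steady-state power draw of a cluster, in kilowatts."""
    return n_cpus * watts_per_cpu / 1000.0


def recovery_tier(kilowatts):
    """Map a cluster's power draw onto the three tiers discussed in the text."""
    if kilowatts <= SMALL_MAX_KW:
        return "small: treat as paid out of indirect costs"
    if kilowatts <= LARGE_MAX_KW:
        return "intermediate: arguable either way; collect data first"
    return "very large: agencies already expect additional charges"


if __name__ == "__main__":
    for cpus in (64, 128, 512):
        kw = cluster_kilowatts(cpus)
        print(cpus, kw, recovery_tier(kw))
```

The tiers are deliberately coarse; the text itself notes that the middle range cannot be settled without data from clusters operating in situ.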

Finally, indirect costs are not generally charged on capital equipment. This can create an imbalance precisely where a low-overhead research project (one that funds only one or two salaries and the cluster) is concerned. Overhead on the salaries may not cover the actual cost of operating the cluster if the cluster is large, and there are no indirect costs charged on the cluster hardware.

For this reason it would not be unreasonable to try to recover recurring operational costs such as power and cooling at fair market value (exclusive, for the most part, of the cost of any renovations required) for very large clusters, where this imbalance is likely to occur and where granting agencies are likely to recognize it and tolerate the additional expense. However, it would be advisable to proceed here on a case-by-case basis, taking into account the kind of grant being sought, the amount of cluster node equipment being purchased, and the actual amount of indirect cost monies the grant will generate.
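The imbalance described above can be made concrete with a rough Python sketch. Only the $1 per watt per year rate comes from the text; the salary total, the indirect cost rate, and the cluster size in the example are entirely hypothetical inputs chosen to show the comparison, not figures for any real grant. The sketch also ignores everything else that indirect costs are meant to cover.

```python
def indirect_cost_shortfall(salaries, indirect_rate, cluster_watts,
                            dollars_per_watt_year=1.0):
    """Yearly power and cooling cost minus the overhead a grant generates.

    Indirect costs are assumed to be charged on salaries only, not on the
    cluster hardware (capital equipment). A positive result means the
    overhead does not cover the recurring cost of running the cluster.
    """
    overhead = salaries * indirect_rate
    operating_cost = cluster_watts * dollars_per_watt_year
    return operating_cost - overhead


if __name__ == "__main__":
    # Hypothetical low-overhead project: $40,000 in salaries at an assumed
    # 50% indirect rate, running a 30 kW cluster.
    print(indirect_cost_shortfall(40_000, 0.5, 30_000))
```

A comparison of this kind, run with the actual figures for a given proposal, is the sort of input a case-by-case decision would need.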

Robert G. Brown 2003-06-02