Commodity cluster computing at Duke is finally coming of age. The early pioneers have well-established clusters, which have proven to be very beneficial and cost-effective ways to support a variety of research initiatives. However, the old clusters have grown, and new clusters have proliferated, to the point where a number of infrastructure problems have begun to emerge.
Recognizing that clusters are likely to be the primary vehicle for high performance computing in the University environment for at least three to five years, the University is exploring models for providing the requisite infrastructure support. Its goal is to provide cost-effective support and growth pathways for both new and existing cluster-based research projects, in keeping with its primary mission of fostering education and research.
The existing organization of clusters on campus is driven purely by practical issues. Clusters are typically located in close physical proximity to (within the departments of) the groups that built and operate them. They are most often managed by departmental systems managers, sometimes augmented by cluster-experienced members of the research groups. There are notable exceptions to this rule of physical proximity, usually associated with a group that lacks adequate local facilities for its cluster. In these cases the cluster is most often located in a facility ``belonging to'' other groups also doing cluster computing.
For example, multiple clusters in physics, in the computer science department, and in the engineering school are physically located on the premises of these departments and schools and are run by systems managers working for them. However, a cluster belonging to a group in the math department shares space over in computer science, and a cluster belonging to a group in ISDS shares space with the physics clusters.
This white paper describes a model for University cluster support that recognizes the considerable benefits of this mostly-local physical and administrative organization while extending it. The extensions should yield clear benefits of scale, provide much better support to the local administrators who now do most of the actual work of cluster support without adequate training, and extend the advantages of cluster computing more consistently to groups that lack the local infrastructure to run their clusters at the departmental or group level.
The success of this model will strongly depend on its ability to meet a highly variable set of needs in a way that is perceived to be fair by the high performance computing community on campus. The University's research is supported by grants from many agencies covering many fields of endeavor, with many distinct standards for what can and cannot be funded in terms of computing support. The cost recovery associated with the support model will need to be as flexible and as variable as these many sources of support.
An essential feature of the model is adaptability. In addition to coping with a highly variable cost-recovery terrain (where some granting agencies prefer to fund systems including all support, while others presume support to be already paid for in the indirect cost portion of the grant), the model will need to cope with the ever-changing landscape of computer and networking hardware, and with cluster users ranging from tyro to expert. Any fixed and inflexible model will fail in time as the needs of the community evolve to the point where they are no longer being met.
The model presented is thus to be viewed as no more than a beginning framework for meeting the needs of the community. It is expected that it will be reviewed and revised as often as necessary to achieve its stated goals as the needs of the community change and as its design features are tested in application. The model is at heart an evolutionary one, in which new ideas can be tested and then accepted or rejected according to whether they prove viable.