In the previous two cases the particular model of the parallel program being run didn't much matter. In either case, the program was being run ``locally'' and so the user's standard NFS filespace and/or any special project space was likely mounted and immediately available on the user's own workstation and the cluster node alike. Backup of any critical data was already arranged within the preexisting backup paradigm of the department. Even things like visualization were likely transparently supported on the LAN workstations and the cluster nodes alike. There were essentially no additional costs associated with the management and secure utilization of multiple accounts or transport of data across LAN boundaries.
When the cluster in question is remotely managed by a centralized University entity, this is no longer true. Cluster access and data transport will necessarily cross LAN administrative boundaries, and boundaries of trust. Cluster users will require a cluster-local LAN infrastructure to support their cluster computation: cluster-local accounts, authentication, security, and fileservices. They will also require mechanisms for accessing the cluster and transporting their data to and from their department-local LANs. Furthermore, since the remotely managed cluster may well be a shared entity with access split among many groups, the cluster itself will likely need a variety of cluster management tools to be installed that facilitate e.g. remote monitoring of jobs, batch job submission, job level accounting, and much more.
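Batch job submission in particular implies a scheduler front end of some kind. As a purely illustrative sketch (the job name, queue, resource request, and the mc_run binary below are assumptions, not site specifics), a job handed to a PBS-style batch system might look like:

```shell
#!/bin/sh
# Hypothetical PBS-style batch script. The job name, queue, resource
# request, and the mc_run binary are illustrative assumptions only.
#PBS -N mc_run
#PBS -l nodes=4,walltime=12:00:00
#PBS -q cluster

# Run from the directory the job was submitted from, writing output
# into the user's (presumably shared) filespace.
cd "$PBS_O_WORKDIR"
./mc_run > mc_run.out 2>&1
```

Such a script would be submitted with qsub and watched with qstat; the point is that these tools, their configuration, and the accounting records they generate all have to be installed and maintained by the cluster's managers, not by the departmental LAN staff.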
In essence, a core management LAN structure has to be created for the cluster. The cluster manager will need to manage accounts, security, and users, and coordinate those structures with those of the various departmental LAN managers whose LANs host the cluster users' primary accounts and the workstations through which the cluster will likely be accessed. Finally, the cluster will require still more tools to be installed to support remote cluster access and monitoring, and making those tools effective will require more user support and training.
At best, a remotely managed, remotely sited cluster will thus require substantially more FTE management than will a locally managed cluster, however it is sited. It thus behooves us to consider ways that we can minimize this additional expense and recover a reasonable fraction of the cost-efficiency of managing a cluster within the cluster owner/user's LAN.
The obvious approach is to piggyback the LAN management aspects of the cluster on top of an existing LAN that already offers services to the entire University community. This approach permits us to realize substantial economies of scale. Although these are not as great as the economies that exist for LAN-local management (since the cluster will, willy-nilly, not be directly integrated into the owner/user's LAN), such an approach permits the University itself to gain important dual benefits, described below.
The unique solution thus appears to be to fully integrate the cluster with the existing Academic Computing group (acpub). Any member of the University community can already get an account within acpub, and for a variety of reasons most faculty, staff, students, and even many postdocs have already done so. Acpub provides at least one mechanism for institution-wide authenticated disk access (AFS on top of Kerberos) that is already, for the most part, supported in the departmental LANs likely to host the owner/users of centrally managed cluster facilities. One presumes that any additional LAN resources required by the cluster (e.g. local server disk pools, backup mechanisms, additional cluster management tools) could be scalably provided within their existing LAN support framework, minimizing cost and maximizing integrability.
This does create certain organizational issues that must be dealt with. The staff that runs the public cluster(s) cannot just ``be'' the acpub staff, as the acpub staff likely lacks core expertise in cluster construction and management (the same problem that exists for the locally managed clusters out in the departmental LANs). Also, at this point acpub is not primarily based on linux, and although their staff is far from incompetent in linux, neither are they the campus's primary experts. They are thus unlikely to be able to scalably extend their existing staff and services to clusters without augmenting their staff with one or more cluster and linux experts. Those experts would need the freedom to support the clusters ``semi-autonomously'' - integrating with and drawing upon the scalably extensible services acpub can provide for account management, access, authentication, and possibly disk services, while not being forced to strictly conform to the acpub workstation model.
One advantage of providing centralized cluster management within the acpub hierarchy is that it will provide acpub with a straightforward route for migrating from student clusters based on proprietary hardware and operating systems (e.g. Sun and Solaris, Wintel) to linux clusters. This is a desirable migration for many, many reasons (tremendous direct cost savings in software, greater security, the ultimate degree in system installation and management scaling, and open standards for a variety of document and data protocols that facilitate e.g. long term archival storage and retrieval of critical data). Acpub will virtually ``inherit'' the ability to build and manage scalable linux workstation clusters from the ability to build and manage scalable linux compute nodes; as noted above, compute nodes are just a specialized variant of a workstation from the point of view of installation and management.
It should be pointed out that this model for cluster support already works for at least the embarrassingly parallel class of tasks described above. The author of this document ran embarrassingly parallel Monte Carlo computations on the entire acpub collection of workstation clusters for close to a year some years ago (with the permission and support of the acpub staff), getting a phenomenal amount of research computing done before weaknesses in Solaris forced this experiment to be terminated. This extremely simple model of account+workstation or node access would likely fail for distributing true coarse grained parallel tasks in a multiuser environment, but a simple extension of it very likely would work, and that is what is proposed below.
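The account+node access model is simple enough to sketch concretely. The fragment below is a local stand-in, not the author's actual setup: it runs four independent ``slices'' of a toy Monte Carlo estimate of pi as concurrent processes and then combines their partial results - exactly the pattern used across the acpub workstations, where each slice would have been dispatched to a different machine (e.g. via ssh) rather than backgrounded locally.

```shell
#!/bin/sh
# Local sketch of the embarrassingly parallel model: independent slices,
# no communication, results combined afterward. Backgrounded processes
# stand in for cluster nodes here so the sketch is self-contained.
rm -f slice.*.out
for seed in 1 2 3 4; do
  awk -v s="$seed" 'BEGIN {
    srand(s); n = 10000; hits = 0
    for (i = 0; i < n; i++) {
      x = rand(); y = rand()
      if (x*x + y*y < 1) hits++
    }
    print hits, n        # partial result for this slice
  }' > "slice.$seed.out" &
done
wait   # all slices complete; now reduce the partial results
pi=$(awk '{ h += $1; n += $2 } END { printf "%.2f", 4 * h / n }' slice.*.out)
echo "pi ~ $pi"
rm -f slice.*.out
```

Note that the only shared state is the filespace the partial results land in - which is precisely why a uniformly mounted, authenticated filesystem like AFS makes this model nearly free to operate.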
This document does not attempt to suggest the details of how cluster and linux management might be integrated with the existing acpub staff and group; only that this is by far the most desirable way to proceed as (once a modest cost penalty for the initial startup is paid) it maximally leverages existing modes for delivering University level compute services to University personnel in all venues.
It will have the temerity to suggest that this be done delicately, in a way that carefully avoids crippling either the existing acpub staff or the cluster staff that would be working with them, and with full respect for the FTE capacity boundaries that undoubtedly already exist within acpub. As in, don't try to make acpub simply absorb the additional burden of cluster support with their existing staff. It will also have the temerity to suggest that it be done in a way that incorporates the existing methodology for providing core linux installation and support services outside of acpub (as they are provided now, e.g. via the dulug site), with a ``virtual staff'' of linux experts and cluster experts who are not ``in'' acpub per se but are still charged with providing support and training services, crossing boundaries of administrative control.
A major point of the model proposed for cluster management on campus is that it remain as decentralized as possible, with support mechanisms that cross boundaries of administrative control, even where it attempts to provide a basis for centralized management and access. It centralizes only where one can see an immediate and clear cost-benefit advantage, for example the zero-marginal-cost extension of existing LAN management structures.
This is not at all paradoxical - linux based cluster management and support currently spans the entire globe, with linux and cluster specific development, instruction, and training coming from an international community. This model for support provides the greatest possible basis for experimentation, evolutionary optimization, and the rapid dissemination of knowledge about both the best and the worst solutions throughout the institution, and it encourages an open consensus model for technology engineering that ensures that the broad needs of the community are continuously met. If you like, it keeps the customers of any given service in close contact with the service providers, as the two groups are mixed.
In the next section we will discuss in some detail the economies of scale associated with a truly distributed support model. That section will also be quite specific in its suggestions for how to integrate a ``centralized'' cluster facility into the existing linux and cluster community.