CLUSTERS
The concept of computer clusters has existed in many guises over
many years. Cluster sophistication varies greatly, from established
proprietary OpenVMS clusters, noted in part for their maturity and
high degree of fault tolerance, to “do-it-yourself” high availability
clusters based on commodity hardware components. Clustering, in its
simplest form, joins two or more disparate systems in such a way as
to pool available system resources (CPU, memory, I/O, etc.), provide
a scalable architecture for evolving capacity management, or build in
redundancy to achieve a higher level of availability and reduce
unplanned downtime. Clusters are commonly categorised into three
types, although the boundaries between them are not always clear-cut:
- High Availability (HA) - HA typically provides a fail-safe
environment through redundancies in hardware, software and middleware.
- High Performance Computing (HPC) - HPC typically runs large-scale
parallel applications that aggregate processing power, memory, or
I/O capacity.
- Logical Compute Farms (LCF) - LCFs typically consist of many
identical compute nodes whose numbers can vary with demand over time
and to which jobs are allocated through load balancing.
High Availability
Great competitive demands are being placed on corporate IT resources,
as most research, product development and mission critical business
applications rely heavily on the availability of computational
resources, project data and business databases. Failure of IT systems
can quickly cascade into an operational failure across an entire
business. Moreover, servers and applications are expected to be
available 24 hours a day, seven days a week, with no room for downtime.
All high availability solutions rely on some amount of built-in
redundancy. At the simplest level this redundancy might involve
replication of business critical data. At the other extreme, high
availability involves complete duplication of the solution stack at a
physically separate location. Many other high availability solutions
lie between these two extremes and involve redundancies in hardware,
software, storage and network components. High availability clusters
usually involve two or more systems that are connected via a common
interconnect (or heartbeat), share a common storage subsystem and
have equal access to available resources. In the event of a component
failure, a high availability cluster fails over the defective
component, or a subset of the solution stack, onto one or more
alternative systems.
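The heartbeat-and-failover behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `fail_over` function, the node and service names, and the five-second timeout are all hypothetical and do not come from any particular clustering product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    last_heartbeat: float                      # time the last heartbeat arrived (seconds)
    services: list = field(default_factory=list)


def fail_over(nodes, now, timeout):
    """Move services off nodes whose heartbeat is older than `timeout` seconds.

    Each orphaned service is placed on the surviving node that currently
    runs the fewest services. Returns the list of nodes judged to have failed.
    """
    dead = [n for n in nodes if now - n.last_heartbeat > timeout]
    alive = [n for n in nodes if now - n.last_heartbeat <= timeout]
    if dead and not alive:
        raise RuntimeError("no surviving node to fail over to")
    for node in dead:
        while node.services:
            service = node.services.pop(0)
            target = min(alive, key=lambda n: len(n.services))
            target.services.append(service)
    return dead


# Hypothetical two-node cluster: node-a has been silent for 10 seconds.
nodes = [
    Node("node-a", last_heartbeat=100.0, services=["database"]),
    Node("node-b", last_heartbeat=109.0, services=["web"]),
]
dead = fail_over(nodes, now=110.0, timeout=5.0)
```

In this sketch the surviving node can take over any service, which mirrors the assumption in the text that all systems share storage and have equal access to resources.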
High Performance Computing
For the last three decades, supercomputer design has focused on
expensive, specially designed vector computing and massively parallel
symmetric multiprocessor computing platforms. Recently there has been
a shift toward parallel cluster computing that uses commodity
“off-the-shelf” components connected together by a high speed internal
network. HPC clusters are typically deployed for parallel computing,
to aggregate more processing power or effective memory for solving
problems within the scientific or research and development arenas.
The trend to deploy HPC clusters is clearly demonstrated by the number
of clusters in the “TOP500 Supercomputer Sites” list of the 500 most
powerful supercomputers in the world, where clustered systems are the
most common high performance computer architecture.
HPC clusters are typically made up of a large number of compute nodes
connected through a high speed cluster interconnect. Clusters
numbering several hundred compute nodes are not uncommon. There are
two dominant architectures in parallel computing: shared-memory
systems and distributed-memory systems:
- Shared Memory Systems (SMS) - SMS provide symmetric multiprocessors
with a common shared memory address space. Parallel computing
takes place through the use of shared data structures or application
threads.
- Distributed Memory Systems (DMS) - DMS comprise disparate
compute nodes that do not share memory directly and exchange data
only through message passing semantics in software.
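The difference between the two programming styles can be illustrated with a small Python sketch. It is a simulation only: threads within one process stand in for processors and compute nodes, and a `queue.Queue` stands in for the message passing layer that a real distributed-memory cluster would provide across machines. All names are hypothetical.

```python
import queue
import threading

data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]        # split the work four ways

# Shared-memory style: every worker updates one shared data structure,
# and a lock protects the update.
shared_total = {"sum": 0}
lock = threading.Lock()


def shared_worker(chunk):
    partial = sum(chunk)
    with lock:
        shared_total["sum"] += partial


threads = [threading.Thread(target=shared_worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Message-passing style: each worker keeps its data private and sends
# only its result as a message; the receiver combines the messages.
mailbox = queue.Queue()


def message_worker(chunk):
    mailbox.put(sum(chunk))


threads = [threading.Thread(target=message_worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

message_total = sum(mailbox.get() for _ in chunks)
```

Both styles arrive at the same total. The first relies on a common address space, while the second needs only a channel between otherwise independent workers.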
HPC clusters are typically distributed memory systems that may use
SMP systems as building blocks. Very large application data sets may
need to be distributed across the HPC cluster, and every compute node
must interact with the others to move data between processors.
Transferring large blocks of contiguous memory requires a
high-bandwidth interconnect, whereas small message packets require a
low-latency interconnect to accelerate parallel program execution.
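A common first-order way to see why message size matters is to model transfer time as a fixed latency plus the message size divided by the bandwidth. The sketch below applies that model to two made-up interconnects; the latency and bandwidth figures are illustrative assumptions, not measurements of any real hardware.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: fixed start-up latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical interconnects (illustrative numbers only).
high_bandwidth = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 1e9}
low_latency = {"latency_s": 5e-6, "bandwidth_bytes_per_s": 1e8}

small_message = 64                 # bytes
large_block = 100_000_000          # bytes

small_on_high_bw = transfer_time(small_message, **high_bandwidth)
small_on_low_lat = transfer_time(small_message, **low_latency)
large_on_high_bw = transfer_time(large_block, **high_bandwidth)
large_on_low_lat = transfer_time(large_block, **low_latency)
```

Under these assumed figures the small message completes sooner on the low-latency link, while the large block completes sooner on the high-bandwidth link.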
Logical Compute Farm
Logical Compute Farms (LCF) provide a single interface to a loosely
coupled set of commodity compute nodes whose number can dynamically
increase or decrease in response to application demand. Common
examples of Logical Compute Farms include Web server farms and
Internet messaging services. Jobs are typically allocated to LCFs
through batch queues or server load balancing network switches. This
kind of cluster also provides significant and transparent redundancy
through the horizontal scalability of compute nodes, and so has many
attributes of high availability clusters. In other respects, the
aggregated computational power of an LCF gives it many attributes of
high performance compute clusters.
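As a rough illustration of a single interface in front of a node pool that grows and shrinks, the following Python sketch allocates jobs in round-robin order. Round-robin is just one simple load balancing policy chosen for the example; the `ComputeFarm` class and the node names are hypothetical.

```python
class ComputeFarm:
    """Single entry point that spreads jobs over a changing set of nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._counter = 0          # number of jobs submitted so far

    def add_node(self, name):
        self.nodes.append(name)

    def remove_node(self, name):
        self.nodes.remove(name)

    def submit(self, job):
        """Return the node chosen for `job`, in round-robin order."""
        if not self.nodes:
            raise RuntimeError("no compute nodes available")
        node = self.nodes[self._counter % len(self.nodes)]
        self._counter += 1
        return node


farm = ComputeFarm(["web-1", "web-2"])
before = [farm.submit(f"job-{i}") for i in range(4)]

# The pool changes: one node joins and one leaves. Callers keep using
# the same submit() interface and never see the change.
farm.add_node("web-3")
farm.remove_node("web-1")
after = [farm.submit(f"job-{i}") for i in range(4, 6)]
```

Because callers only ever see `submit`, removing a failed node is transparent to them, which is the redundancy property the text attributes to LCFs.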