Obviously, not all applications are suitable for distributed computing. The closer an application gets to running in real time, the less appropriate it is. Even processing tasks that normally take an hour or two may not derive much benefit if communications among the distributed systems and the constantly changing availability of processing clients become a bottleneck.
Instead, you should think in terms of tasks that take hours, days, weeks, or months. Generally, the most appropriate applications, according to Entropia, consist of "loosely coupled, non-sequential tasks in batch processes with a high compute-to-data ratio." A high compute-to-data ratio goes hand in hand with a high compute-to-communications ratio, since you don't want to bog down the network by sending large amounts of data to each client, though in some cases you can do so during off hours. Programs with large databases that can be easily partitioned for distribution are very appropriate.
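As a rough, back-of-the-envelope illustration of that compute-to-data trade-off, the sketch below weighs a work unit's compute time against the time needed to ship its input to a client; the function name, the link speed, the per-task overhead, and the 10x threshold are assumptions made up for this example, not figures from any vendor.

```python
# Break-even sketch: is a work unit compute-heavy enough to be worth
# shipping to a remote client? All numbers here are illustrative.

def worth_distributing(data_mb, compute_seconds,
                       link_mbps=1.0, overhead_seconds=30.0):
    """Return True if remote compute time comfortably outweighs the cost
    of moving the input data plus per-task scheduling overhead."""
    transfer_seconds = data_mb * 8 / link_mbps   # time to push the input
    return compute_seconds > 10 * (transfer_seconds + overhead_seconds)

# An hour of crunching on 5 MB of input is a good fit...
print(worth_distributing(data_mb=5, compute_seconds=3600))    # True
# ...while a minute of CPU on 500 MB of input is not.
print(worth_distributing(data_mb=500, compute_seconds=60))    # False
```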
Clearly, any application whose individual tasks need access to huge data sets is better suited to larger systems than to individual PCs. If terabytes of data are involved, a supercomputer makes sense, since communications can take place across the system's very high-speed backplane without bogging down the network. Clusters of servers and other dedicated systems are more appropriate for somewhat less data-intensive applications. For a distributed application using numerous PCs, the required data should fit very comfortably in each PC's memory, with plenty of room to spare.
Taking this further, United Devices recommends that the application be able to fully exploit "coarse-grained parallelism," meaning it should be possible to partition the application into independent tasks or processes that can be computed concurrently. For most solutions there should be no need for communication between the tasks except at task boundaries, though Data Synapse allows some interprocess communication. The tasks and their small blocks of data should be such that they can be processed effectively on a modern PC and report results that, when combined with other PCs' results, produce coherent output. And the individual tasks should be small enough to produce a result on these systems within a few hours to a few days.
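A minimal sketch of what such coarse-grained parallelism looks like, using Python's standard concurrent.futures process pool as a stand-in for a distributed platform's scheduler; the work_unit function, the chunking scheme, and the final summation are illustrative assumptions rather than any vendor's API.

```python
# Coarse-grained parallelism sketch: independent tasks, no communication
# between them, and results combined only at the task boundary.
# ProcessPoolExecutor stands in for a distributed platform's scheduler.
from concurrent.futures import ProcessPoolExecutor

def work_unit(chunk):
    """Hypothetical CPU-heavy task: runs entirely on one client and
    touches only its own small block of data."""
    return sum(x * x for x in chunk)

def split(data, n_tasks):
    """Partition the input into independent, roughly equal chunks."""
    size = max(1, len(data) // n_tasks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = split(data, n_tasks=8)
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(work_unit, chunks))
    # The only communication happens here, when partial results are combined.
    print(sum(partial_results))
```

The key property is the one described above: each task runs to completion on its own block of data, and nothing crosses between tasks until the results are merged.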
Types of Distributed Computing Applications
Beyond the very popular poster child, SETI@home, the following scenarios illustrate other types of application tasks that can be set up to take advantage of distributed computing.
- A query search against a huge database that can be split across lots of desktops, with the submitted query running concurrently against each fragment on each desktop (see the query sketch after this list).
- Complex modeling and simulation techniques that increase the accuracy of results by increasing the number of random trials are also appropriate, as trials can be run concurrently on many desktops and combined to achieve greater statistical significance; this is a common method in various types of financial risk analysis (see the Monte Carlo sketch after this list).
- Exhaustive search techniques that require sifting through a huge number of candidates to find solutions to a problem also make sense. Drug screening is a prime example.
- Many of today's vendors, particularly Entropia and United Devices, are aiming squarely at the life sciences market, which has a sudden need for massive computing power. As a result of the sequencing of the human genome, the number of identifiable biological targets for new drugs is expected to increase from about 500 to about 10,000. Pharmaceutical firms have repositories of millions of different molecules and compounds, some of which may have characteristics that make them suitable for inhibiting newly discovered proteins. Matching all these "ligands" to their appropriate targets is an ideal task for distributed computing, and the sooner it's done, the greater the benefit. A related application is the recent trend of designing new types of drugs entirely on computers.
- Complex financial modeling, weather forecasting, and geophysical exploration are on these vendors' radar, as are car-crash and other complex simulations.
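As a sketch of the query-search scenario from the list above, the same query is scattered across database fragments and the matches are gathered at the end; the fragment contents, the substring match rule, and the thread pool standing in for remote desktops are all illustrative assumptions.

```python
# Scatter/gather query sketch: one query runs concurrently against each
# database fragment, and the partial hit lists are merged at the end.
# ThreadPoolExecutor stands in for a fleet of remote desktops.
from concurrent.futures import ThreadPoolExecutor

FRAGMENTS = [                                   # illustrative fragments
    ["alanine", "arginine", "asparagine"],
    ["cysteine", "glutamine", "glycine"],
    ["histidine", "isoleucine", "leucine"],
]

def search_fragment(fragment, query):
    """Run the query against a single fragment, as one desktop would."""
    return [record for record in fragment if query in record]

def distributed_query(query):
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda frag: search_fragment(frag, query), FRAGMENTS)
    return [record for partial in partials for record in partial]  # gather

print(distributed_query("gl"))                  # ['glutamine', 'glycine']
```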
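And as a sketch of the random-trials (Monte Carlo) scenario, each simulated client runs its own independent batch of trials and the batches are pooled to tighten the estimate; estimating pi is only a stand-in for the financial risk models mentioned above, and the batch count and sizes are arbitrary.

```python
# Monte Carlo sketch: every "client" runs an independent batch of random
# trials; pooling the batches sharpens the estimate. Estimating pi stands
# in for the risk models mentioned above.
import random
from concurrent.futures import ProcessPoolExecutor

def run_trials(args):
    seed, n = args
    rng = random.Random(seed)        # each client gets its own random stream
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
    return hits, n

if __name__ == "__main__":
    batches = [(seed, 200_000) for seed in range(16)]   # 16 independent batches
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_trials, batches))
    hits = sum(h for h, _ in results)
    total = sum(n for _, n in results)
    print("pi is roughly", 4 * hits / total)    # more batches, tighter estimate
```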
To enhance their public relations efforts and demonstrate
the effectiveness of their platforms, most of the distributed computing
vendors have set up philanthropic computing projects that recruit CPU
cycles across the Internet. Parabon's Compute Against Cancer harnesses an army of systems to track patient responses to chemotherapy, while Entropia's FightAIDS@Home project evaluates prospective targets for drug discovery.
And of course, the SETI@home project has attracted millions of PCs to
work on analyzing data from the Arecibo radio telescope for signatures
that indicate extraterrestrial intelligence. There are also higher-end grid projects, including those run by the US National Science Foundation and NASA, as well as the European Data Grid, the Particle Physics Data Grid, the Network for Earthquake Engineering Simulation Grid, and the Grid Physics Network, all of which aim to aid their research communities. And IBM has announced that it
will help to create a life sciences grid in North Carolina to be used
for genomic research.
Security and Standards Challenges
The major challenges come with increasing scale. As soon as you move outside a corporate firewall, security and standardization challenges become quite significant. Most of today's vendors specialize in applications that stop at the corporate firewall, though Avaki, in particular, is staking out the global grid territory. Beyond spanning firewalls with a single platform lies the challenge of spanning multiple firewalls and platforms, which means standards.
Most of the current platforms offer strong encryption such as Triple DES. The application packages sent to PCs are digitally signed to make sure a rogue application does not infiltrate a system; Avaki, for example, comes with its own PKI (public key infrastructure). Identical application packages are also typically sent to multiple PCs and the results compared; any result that differs from the rest becomes suspect. Even with encryption, data can still be snooped while a process is running in the client's memory, so most platforms create application data chunks small enough that snooping them is unlikely to yield useful information. Avaki claims that it integrates easily with different existing security infrastructures and can facilitate communications among them, but this is obviously a challenge for global distributed computing.
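The redundancy check described above, in which identical work units go to several PCs and any disagreeing result is flagged, can be as simple as a majority vote; the sketch below assumes results come back as comparable strings, and the client names and function are hypothetical.

```python
# Redundancy-check sketch: the same work unit is sent to several clients,
# and any result that disagrees with the majority is flagged as suspect.
from collections import Counter

def vet_results(results_by_client):
    """results_by_client maps client id -> returned result. Returns the
    majority result and the list of clients whose results are suspect."""
    tally = Counter(results_by_client.values())
    accepted, _ = tally.most_common(1)[0]
    suspects = [cid for cid, result in results_by_client.items()
                if result != accepted]
    return accepted, suspects

# Illustrative run: "pc-7" returns a result that doesn't match the others.
accepted, suspects = vet_results({"pc-1": "a91f", "pc-4": "a91f", "pc-7": "ffff"})
print(accepted, suspects)                        # a91f ['pc-7']
```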
Working out standards for communications among platforms
is part of the typical chaos that occurs early in any relatively new technology.
In the generalized peer-to-peer realm lies the Peer-to-Peer Working Group,
started by Intel, which is looking to devise standards for communications
among many different types of peer-to-peer platforms, including those
that are used for edge services and collaboration.
The Global Grid Forum is a collection of about 200 companies
looking to devise grid computing standards. Then you have vendor-specific
efforts such as Sun's Open Source JXTA platform, which provides a collection
of protocols and services that allows peers to advertise themselves to
and communicate with each other securely. JXTA has a lot in common with
JINI, but is not Java-specific (though the first version is Java-based).
Intel recently released its own peer-to-peer middleware, the Intel Peer-to-Peer Accelerator Kit for Microsoft .NET, which is also designed for discovery and is built on the .NET platform.
For grid projects there's the Globus Toolkit (www.globus.org), developed by a group at Argonne National Laboratory and another team at the University of Southern California's Information Sciences Institute. It bills itself as an open source infrastructure that provides most of the security, resource discovery, data access, and management services needed to construct grid applications. A large number of today's grid applications are based on the Globus Toolkit. IBM is offering a version of the Globus Toolkit for its servers running Linux and AIX. And Entropia has announced that it intends to integrate its platform with Globus as an early step toward communications among platforms and applications.
For today, however, the specific promise of distributed computing lies mostly in harnessing the system resources that sit within the firewall. It will take years before the systems on the Net share compute resources as effortlessly as they share information.
Finally, we realize that distributed computing technologies and research cover a vast amount of territory, and we've only scratched the surface so far. Please join our forums to discuss your experiences with distributed computing, and feel free to suggest technologies you'd like to see us explore in more detail.