Distributed computing Part 2 - Distributed Computing Application Characteristics  

(Post 11/04/2006) Obviously not all applications are suitable for distributed computing. The closer an application gets to running in real time, the less appropriate it is. Even processing tasks that normally take an hour or two may not derive much benefit if the communications among distributed systems and the constantly changing availability of processing clients become a bottleneck.

Instead, you should think in terms of tasks that take hours, days, weeks, and months. Generally the most appropriate applications, according to Entropia, consist of "loosely coupled, non-sequential tasks in batch processes with a high compute-to-data ratio." The high compute-to-data ratio goes hand-in-hand with a high compute-to-communications ratio, as you don't want to bog down the network by sending large amounts of data to each client, though in some cases you can do so during off hours. Programs with large databases that can be easily split up for distribution are very appropriate.
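As a rough illustration of the compute-to-communications ratio, the following Python sketch divides the time a client spends computing a work unit by the time needed to ship its data. The function name, the work-unit sizes, and the link speed are all hypothetical values chosen for the example, not figures from Entropia or any other vendor.

```python
def compute_to_comm_ratio(compute_seconds, payload_bytes, link_bytes_per_second):
    """Return compute time divided by transfer time for one work unit."""
    transfer_seconds = payload_bytes / link_bytes_per_second
    return compute_seconds / transfer_seconds


# Hypothetical work unit: 4 hours of CPU time, 2 MB of input, 1 MB/s effective link.
batch_ratio = compute_to_comm_ratio(4 * 3600, 2_000_000, 1_000_000)

# Hypothetical near-real-time task: 10 seconds of CPU time, 100 MB of input, same link.
chatty_ratio = compute_to_comm_ratio(10, 100_000_000, 1_000_000)

print(batch_ratio)   # 7200.0
print(chatty_ratio)  # 0.1
```

A ratio in the thousands suggests the network is nearly idle while clients compute; a ratio below 1 means the client would spend longer receiving the data than processing it, which is the bottleneck case described above.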

Clearly, any application with individual tasks that need access to huge data sets will be more appropriate for larger systems than individual PCs. If terabytes of data are involved, a supercomputer makes sense as communications can take place across the system's very high speed backplane without bogging down the network. Server and other dedicated system clusters will be more appropriate for other slightly less data intensive applications. For a distributed application using numerous PCs, the required data should fit very comfortably in the PC's memory, with lots of room to spare.

Taking this further, United Devices recommends that the application should have the capability to fully exploit "coarse-grained parallelism," meaning it should be possible to partition the application into independent tasks or processes that can be computed concurrently. For most solutions there should not be any need for communication between the tasks except at task boundaries, though Data Synapse allows some interprocess communications. The tasks and small blocks of data should be such that they can be processed effectively on a modern PC and report results that, when combined with other PCs' results, produce coherent output. And the individual tasks should be small enough to produce a result on these systems within a few hours to a few days.
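A minimal Python sketch of coarse-grained parallelism follows. It assumes a toy problem (counting primes in a range) and uses a local thread pool purely as a stand-in for a pool of PCs; the function names are made up for illustration and do not reflect any vendor's API. The structure is what matters here: partition the job into independent work units, run them with no communication between them, and combine the results only at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def count_primes(bounds):
    """One independent work unit: count the primes in [lo, hi)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        # Trial division; n is prime if no divisor up to sqrt(n) divides it.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def partition(lo, hi, chunks):
    """Split [lo, hi) into contiguous, non-overlapping ranges."""
    step = max(1, (hi - lo + chunks - 1) // chunks)
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def run(lo, hi, chunks=4):
    """Farm the work units out, then combine the partial counts."""
    work_units = partition(lo, hi, chunks)
    # The thread pool stands in for remote clients (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        partial_counts = list(pool.map(count_primes, work_units))
    return sum(partial_counts)


print(run(0, 100))  # 25
```

Because no work unit depends on another, the same partitioning would work whether the units run on threads, processes, or separate machines; only the dispatch step changes.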

Types of Distributed Computing Applications

Beyond the very popular poster child SETI@home application, the following scenarios are examples of other types of application tasks that can be set up to take advantage of distributed computing.

  • A query search against a huge database that can be split across lots of desktops, with the submitted query running concurrently against each fragment on each desktop.
  • Complex modeling and simulation techniques that increase the accuracy of results by increasing the number of random trials would also be appropriate, as trials could be run concurrently on many desktops, and combined to achieve greater statistical significance (this is a common method used in various types of financial risk analysis).
  • Exhaustive search techniques that require searching through a huge number of results to find solutions to a problem also make sense. Drug screening is a prime example.
  • Many of today's vendors, particularly Entropia and United Devices, are aiming squarely at the life sciences market, which has a sudden need for massive computing power. As a result of sequencing the human genome, the number of identifiable biological targets for today's drugs is expected to increase from about 500 to about 10,000. Pharmaceutical firms have repositories of millions of different molecules and compounds, some of which may have characteristics that make them appropriate for inhibiting newly found proteins. The process of matching all these "ligands" to their appropriate targets is an ideal task for distributed computing, and the quicker it's done, the quicker and greater the benefits will be. Another related application is the recent trend of generating new types of drugs solely on computers.
  • Complex financial modeling, weather forecasting, and geophysical exploration are on the radar screens of these vendors, as well as car crash and other complex simulations.
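The random-trials scenario in the list above can be sketched in a few lines of Python. This is an illustrative toy (estimating pi by random sampling), not a financial risk model; the seeds, batch sizes, and function names are assumptions made for the example. Each call to run_trials represents a work unit that could run on a different desktop, and only a pair of counts travels back to be combined.

```python
import random


def run_trials(seed, trials):
    """One work unit: run `trials` random trials and return (hits, trials)."""
    rng = random.Random(seed)  # a distinct seed per work unit keeps batches independent
    hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point landed inside the quarter circle
            hits += 1
    return hits, trials


def combine(results):
    """Merge the per-unit counts into a single estimate of pi."""
    total_hits = sum(hits for hits, _ in results)
    total_trials = sum(trials for _, trials in results)
    return 4.0 * total_hits / total_trials


# Eight hypothetical desktops, 50,000 trials each.
results = [run_trials(seed, 50_000) for seed in range(8)]
print(combine(results))
```

Adding more desktops adds more trials to the combined total, which is what tightens the statistical error of the estimate.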

To enhance their public relations efforts and demonstrate the effectiveness of their platforms, most of the distributed computing vendors have set up philanthropic computing projects that recruit CPU cycles across the Internet. Parabon's Compute-Against-Cancer harnesses an army of systems to track patient responses to chemotherapy, while Entropia's FightAidsAtHome project evaluates prospective targets for drug discovery. And of course, the SETI@home project has attracted millions of PCs to work on analyzing data from the Arecibo radio telescope for signatures that indicate extraterrestrial intelligence. There are also higher end grid projects, including those run by the US National Science Foundation and NASA, as well as the European Data Grid, Particle Physics Data Grid, the Network for Earthquake Simulation Grid, and Grid Physics Network, that plan to aid their research communities. And IBM has announced that it will help to create a life sciences grid in North Carolina to be used for genomic research.

Security and Standards Challenges

The major challenges come with increasing scale. As soon as you move outside of a corporate firewall, security and standardization challenges become quite significant. Most of today's vendors currently specialize in applications that stop at the corporate firewall, though Avaki, in particular, is staking out the global grid territory. Beyond spanning firewalls with a single platform lies the challenge of spanning multiple firewalls and platforms, which means standards.

Most of the current platforms offer high-level encryption such as Triple DES. The application packages that are sent to PCs are digitally signed to make sure a rogue application does not infiltrate a system; Avaki, for example, comes with its own PKI (public key infrastructure). Identical application packages are typically sent to multiple PCs and the results of each are compared. Any set of results that differs from the rest becomes a security suspect. Even with encryption, data can still be snooped while the process is running in the client's memory, so most platforms create application data chunks so small that snooping them is unlikely to provide useful information. Avaki claims that it integrates easily with different existing security infrastructures and can facilitate the communications among them, but this is obviously a challenge for global distributed computing.
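The redundancy check described above can be sketched as a simple majority vote. This Python example is an assumption about how such a comparison might look, not the method of any particular platform; the client names and the function name are hypothetical, and results are assumed to be hashable values.

```python
from collections import Counter


def check_redundant_results(results_by_client):
    """Return (accepted_result, suspect_clients) using a strict majority vote.

    `results_by_client` maps a client name to the result it reported for
    the same work unit. If no result has a strict majority, nothing is
    accepted and every client is treated as suspect.
    """
    if not results_by_client:
        return None, []
    counts = Counter(results_by_client.values())
    accepted, votes = counts.most_common(1)[0]
    if votes <= len(results_by_client) / 2:
        return None, sorted(results_by_client)
    suspects = sorted(
        client for client, result in results_by_client.items() if result != accepted
    )
    return accepted, suspects


# Three hypothetical PCs received the identical work unit; one disagrees.
reported = {"pc-a": 42, "pc-b": 42, "pc-c": 17}
print(check_redundant_results(reported))  # (42, ['pc-c'])
```

A disagreeing client is not necessarily malicious (faulty hardware produces the same symptom), so a platform would presumably reissue the work unit before drawing conclusions.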

Working out standards for communications among platforms is part of the typical chaos that occurs early in any relatively new technology. In the generalized peer-to-peer realm lies the Peer-to-Peer Working Group, started by Intel, which is looking to devise standards for communications among many different types of peer-to-peer platforms, including those that are used for edge services and collaboration.

The Global Grid Forum is a collection of about 200 companies looking to devise grid computing standards. Then you have vendor-specific efforts such as Sun's open source JXTA platform, which provides a collection of protocols and services that allows peers to advertise themselves to and communicate with each other securely. JXTA has a lot in common with JINI, but is not Java specific (though the first version is Java based).

Intel recently released its own peer-to-peer middleware, the Intel Peer-to-Peer Accelerator Kit for Microsoft .Net, also designed for discovery, and based on the Microsoft .Net platform.

For Grid projects there's the Globus Toolkit (www.globus.org), developed by a group at the Argonne National Laboratory and another team at the University of Southern California's Information Sciences Institute. It bills itself as an open source infrastructure that provides most of the security, resource discovery, data access, and management services needed to construct Grid applications. A large number of today's grid applications are based on the Globus Toolkit. IBM is offering a version of the Globus Toolkit for its servers running on Linux and AIX. And Entropia has announced that it intends to integrate its platform with Globus, as an early start towards communications among platforms and applications.

For today, however, the specific promise of distributed computing lies mostly in harnessing the system resources that lie within the firewall. It will take years before the systems on the Net will be sharing compute resources as effortlessly as they can share information.

Finally, we realize that distributed computing technologies and research cover a vast amount of territory, and we've only touched the surface so far. Please join our forums and discuss some of your experiences with distributed computing, and feel free to make suggestions of technologies you'd like to see us explore in more detail.

