Scatter/Gather with the gather (aggregation) phase also distributed.

| Current Project Version | |
| Project Maturity | |
| Project License | Apache License 2.0 |
| Compatible GigaSpaces XAP Version | 6.0 and greater |
| Project Captain | Peter Coates |
Project white paper (PDF)
Features and Capabilities
This is a simple framework for applying the scatter/gather (a.k.a. map-reduce) pattern across multiple GigaSpaces clusters. It is implemented in terms of GigaSpaces, but can easily be adapted to any JavaSpaces cluster. Its purpose is to facilitate scatter/gather processing when geographically dispersed clusters are in use. The key is to separate result data from knowledge of a result's existence and location. This lets a master task scatter itself across expensive links without necessarily becoming a bottleneck when dealing with the (possibly recursive) results.
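The separation of result data from knowledge of its existence and location can be sketched as follows. This is only an illustrative sketch, not the framework's actual API: `ResultHandle`, `completeSubtask`, and the simulated remote store are hypothetical names invented here to show the idea of writing back a small "handle" entry while the bulk result stays in the remote cluster.

```java
import java.util.HashMap;
import java.util.Map;

/** A small entry the worker writes back; the bulk result stays in the remote cluster. */
class ResultHandle {
    final String taskId;      // which subtask produced the result
    final String clusterUrl;  // where the full result can be fetched, if ever needed
    final long resultSize;    // lets the master decide whether fetching is worthwhile

    ResultHandle(String taskId, String clusterUrl, long resultSize) {
        this.taskId = taskId;
        this.clusterUrl = clusterUrl;
        this.resultSize = resultSize;
    }
}

public class HandleDemo {
    // Stands in for the remote cluster's space: full results stay here.
    static final Map<String, byte[]> remoteStore = new HashMap<>();

    /** Worker side: keep the bulk result local, return only a lightweight handle. */
    static ResultHandle completeSubtask(String taskId, byte[] bulkResult, String clusterUrl) {
        remoteStore.put(taskId, bulkResult);
        return new ResultHandle(taskId, clusterUrl, bulkResult.length);
    }

    public static void main(String[] args) {
        ResultHandle h = completeSubtask("subtask-1", new byte[1024], "jini://clusterA");
        // The master sees only a few dozen bytes of metadata, not the kilobyte payload.
        System.out.println(h.taskId + " @ " + h.clusterUrl + " (" + h.resultSize + " bytes)");
    }
}
```

The master can then decide, per handle, whether moving the full result across the expensive link is ever necessary.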
From the abstract of the accompanying paper (download)
The familiar scatter/gather design pattern, also known as map-reduce, is used to spread a large task among many CPUs in a cluster. In this design pattern, a central process breaks a task into many subtasks and scatters them among the participating CPUs, which then return the results for subsequent aggregation by the original process.
This technique is effective, but the gather step is a potential bottleneck: while the processing is distributed, the gathering ordinarily is not.
If a task is to be scattered among geographically dispersed clusters, network latency and bandwidth constraints between distant clusters can make the aggregation phase of ordinary scatter/gather even more problematic. This is especially true if the processing is recursive and the intermediate results are large.
This project is a simple framework supporting a variation on scatter/gather that alleviates these problems by (a) distributing the gather phase, (b) minimizing the amount of result data that must be pulled back across remote links, and (c) providing access to the information needed to direct the interactions of remote clusters without incurring the cost of moving all the data to a central location.
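Points (a) and (b) can be illustrated with a minimal sketch: each cluster aggregates its own workers' results locally over cheap LAN links, so only one small summary per cluster ever crosses the expensive WAN link. The cluster and worker structure below is simulated in-process; the names are assumptions for illustration, not part of the framework.

```java
public class DistributedGatherDemo {

    /** Local gather: runs inside one cluster, over cheap intra-cluster links. */
    static long gatherLocally(int[] workerResults) {
        long sum = 0;
        for (int r : workerResults) sum += r;
        return sum;
    }

    public static void main(String[] args) {
        // Three simulated clusters, each holding its own workers' raw results.
        int[][] clusters = { {1, 2, 3}, {4, 5}, {6, 7, 8, 9} };

        long total = 0;
        for (int[] cluster : clusters) {
            long localSummary = gatherLocally(cluster); // aggregated inside the cluster
            total += localSummary;                      // only the summary crosses the WAN
        }
        System.out.println(total);
    }
}
```

With nine raw results but only three cluster summaries shipped back, the master's gather work and WAN traffic scale with the number of clusters rather than the number of workers.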
The framework consists of a set of classes that can be extended to override a few critical methods controlling how a task is broken into subtasks, how results are processed, how aggregation is performed, and so on.
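That extension style amounts to the template-method pattern, which can be sketched as below. The class and method names (`ScatterGatherTask`, `split`, `process`, `aggregate`) are assumptions chosen for illustration; they are not the framework's actual API.

```java
import java.util.ArrayList;
import java.util.List;

abstract class ScatterGatherTask<T, R> {
    /** Break the task into subtasks; override per application. */
    protected abstract List<T> split(T task);

    /** Process one subtask (in the real framework this runs on a remote cluster). */
    protected abstract R process(T subtask);

    /** Combine intermediate results; override to control aggregation. */
    protected abstract R aggregate(List<R> results);

    /** The invariant scatter/gather skeleton the framework would supply. */
    public final R run(T task) {
        List<R> results = new ArrayList<>();
        for (T sub : split(task)) {
            results.add(process(sub));
        }
        return aggregate(results);
    }
}

/** Example subclass: sums an inclusive integer range by splitting it in half. */
class RangeSum extends ScatterGatherTask<int[], Long> {
    protected List<int[]> split(int[] range) {
        int mid = (range[0] + range[1]) / 2;
        return List.of(new int[]{range[0], mid}, new int[]{mid + 1, range[1]});
    }

    protected Long process(int[] range) {
        long s = 0;
        for (int i = range[0]; i <= range[1]; i++) s += i;
        return s;
    }

    protected Long aggregate(List<Long> results) {
        long total = 0;
        for (long r : results) total += r;
        return total;
    }
}

public class TemplateDemo {
    public static void main(String[] args) {
        System.out.println(new RangeSum().run(new int[]{1, 100})); // 5050
    }
}
```

The skeleton (`run`) stays fixed while applications supply only the three overridden methods, which matches the extension model the paragraph above describes.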