- Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line).
- map() produces one or more intermediate values along with an output key from the input.
Reduce
- After the map phase is over, all the intermediate values for a given output key are combined together into a list.
- reduce() combines those intermediate values into one or more final values for that same output key.
- (in practice, usually only one final value per key).
[Source from: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html]
Distribution and Reliability
MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a check to ensure that there are not parallel conflicting threads running. When files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).
The reduce operations operate much the same way. Because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or in the same rack as the node holding the data being operated on. This property is desirable as it conserves bandwidth across the backbone network of the datacenter.Implementations are not necessarily highly-available. For example, in Hadoop the NameNode is a single point of failure for the distributed filesystem and if the JobTracker fails, all outstanding work is lost.
[Source from: http://en.wikipedia.org/wiki/MapReduce]
No comments:
Post a Comment