MapReduce: Simplified Data Processing on Large Clusters

• sending code to servers
• moving data from Map to Reduce
• recovering from failures

Limit

• no interaction or state
• no multi-stage pipeline
• no real-time or streaming processing

Bottleneck

• network
• root switch

Minimize network use

• input is read from local disk (via GFS)
• Map workers write to local disk
• Reduce workers read directly from Map workers
• intermediate data partitioned into files holding many keys

• Master hands out new tasks to workers who finish previous tasks

fault tolerance

• re-run the failed Maps and Reduces
• Map and Reduce must be pure deterministic functions

worker crash recovery

• Map
• Master tells other workers to run those lost tasks
• omit if Reduce workers already fetched the intermediate data
• Reduce
• Master re-start worker's unfinished tasks

other failures/problems

• Master gives 2 workers the same Map() task: tell Reduce workers about only one of them
• Master gives 2 workers the same Reduce() task: GFS handle this
• a single worker is very slow: Master starts a 2nd copy of the last few tasks

conclusion

- not the most efficient or flexible
+ scales well
+ easy to program


### Procedure

1. Splits input files into M pieces of typically 16 MB ~ 64 MB per piece.
2. Master picks idle workers and assigns each one a map task or a reduce task.
3. Map worker parses the input data and passes each key/value pair to the user-defined Map function. Results are buffered in memory.
4. Periodically write the buffered pairs to local disk, partitioned into R regions. Pass the locations back to the master.
5. Reduce worker reads all the data through RPC and sorts it by the intermediate keys. An external sort is used when the intermediate data is too large to fit in memory.
6. Reduce worker passes the keys and the corresponding set of intermediate values to the user-defined Reduce function. Results are appended to a final output file.
7. After all map and reduce tasks have been completed, master wakes up the user program.

### Master Data Structure

enum State {
idle,
in_progress,
completed,
}

state: State,
localtion: String,
size: u32,
}


### Fault Tolerance

worker failure

• master pings every worker periodically
• if no response, marks the worker as falied
• any map tasks completed by the failed worker are reset (because the results are stored locally)
• any map tasks or reduce tasks in progress on the failed worker are reset
• notify the reduce worker

master failure

• checkpoints for recovery

### Locality

The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the coressponding input data.

• scheduling: O(M+R)
• states: O(M*R)
• each piece of input data size: 16 MB to 64 MB
• M = 200,000
• R = 5,000
• workers = 2,000

When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.

### Partitioning Function

hash(func(key)) mod R

### Ordering Guarantees

Within a given partition, the intermediate key/value pairs are processed in increasing key order.

### Combiner Function

(Optional) Combiner function defined by user will partial merging the data after Map before sending to Reduce worker.

Sometimes this can significantly speeds up the MapReduce operations.

### Input and Output Types

Default reader function: the key is the offset in the file and the value is the contents of the line.