MapReduce

MapReduce is a programming framework developed by Google. It was originally created to solve the problem of building web search indexes.

To understand and serve their customers better, organisations need to quickly analyse the huge amounts of data that customers and audiences generate through online activity and digital footprints. The data can be rich text, relational (RDBMS) records, graphs and so on, and it must be processed into meaningful insights to bring business value to the organisation.

MapReduce is used to process highly distributable problems across huge datasets. It runs on either a cluster or a grid of machines, and it can process data stored in a filesystem (unstructured) or in a database (structured).

There are two fundamental pieces of a MapReduce query:

1. “Map” step: The master node takes the input, divides it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the solution back to its master node.

2. “Reduce” step: The master node then collects the solutions to all the sub-problems and combines them to form the output, which is the answer to the problem it was originally trying to solve.
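
The two steps above can be sketched in a few lines of Python. This is only an illustration on a single machine, assuming a word-count problem; the function names (split_input, map_chunk, reduce_counts, word_count) are made up for this example and do not come from any MapReduce library.

```python
from collections import Counter
from functools import reduce


def split_input(lines, num_chunks):
    """Master: divide the input into smaller sub-problems."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Worker: solve one sub-problem (count the words in its chunk)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Master: combine the partial solutions into the final answer."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(lines, num_chunks=3):
    chunks = split_input(lines, num_chunks)
    # In a real deployment each chunk would go to a different worker node.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(partials)


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(docs))
```

Here the "workers" run one after another in the same process; the point is only that each chunk is handled independently and the partial results are merged at the end.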

Thus, the important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the issue of “data too large to fit onto a single machine”. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled as long as the input data is still available.

The MapReduce framework underpins much of today’s big data processing. Implementations can be found in Hadoop and in MPP and NoSQL databases such as Vertica and MongoDB.