The title of the question explains what my question is about.
I have been reading through multiple texts and answers, where I came across this line:
Through use of the combiner and by taking advantage of the ability to
preserve state across multiple inputs, it is often possible to
substantially reduce both the number and size of key-value pairs that
need to be shuffled from the mappers to the reducers.
I am not able to understand this concept. An elaborate answer and explanation with an example would be really helpful. How do I develop an intuition for such concepts?
If you already feel comfortable with the reducer concept, the combiner concept will be easy. A combiner can be seen as a mini-reducer that runs during the map phase. What do I mean by that? Let's see an example: suppose you are solving the classic word-count problem. You know that for every word, a key-value pair is emitted by the mapper. The reducer then takes these key-value pairs as input and summarizes them. Suppose a mapper collects key-value pairs like:
<key1,1>, <key2,1>, <key1,1>, <key3,1>, <key1,1>
If you are not using a combiner, these 5 key-value pairs will be sent to the reducer. But using a combiner, we can perform a pre-reduce on the mapper side, so the output of the mapper will be:
<key1,3>, <key2,1>, <key3,1>
In this simple example, by using a combiner you reduced the total number of key-value pairs from 5 to 3, which means less network traffic and better performance in the shuffle phase.
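The idea can be sketched without Hadoop at all. Here is a minimal plain-Python simulation (the function names `mapper` and `combiner` are my own, not the Hadoop API) showing how a combiner shrinks the mapper's output before the shuffle:

```python
from collections import Counter

def mapper(line):
    # Emit one <word, 1> pair per word, like the word-count mapper.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Pre-reduce locally: sum the counts for each key on the map side,
    # so fewer pairs travel across the network during the shuffle.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

mapped = mapper("key1 key2 key1 key3 key1")
combined = combiner(mapped)

print(len(mapped))     # 5 pairs without a combiner
print(len(combined))   # 3 pairs after combining
print(dict(combined))  # {'key1': 3, 'key2': 1, 'key3': 1}
```

In real Hadoop, the combiner is often the same class as the reducer (set via `job.setCombinerClass`), and because the framework may apply it zero, one, or several times, the operation must be associative and commutative, which summation is.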