转载自:http://yangguan.org/mapreduce-patterns-algorithms-and-use-cases/翻译自:http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/在这篇文章里总结了几种网上或者论文中常见的MapReduce模式和算法,并系统化的解释了这些技术的不同
转载自:http://yangguan.org/mapreduce-patterns-algorithms-and-use-cases/ 翻译自:http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/ 在这篇文章里总结了几种网上或者论文中常见的MapReduce模式和算法,并系统化的解释了这些技术的不同之处。所有描述性的文字和代码都使用了标准hadoop的MapReduce模型,包括Mappers, Reduces, Combiners, Partitioners,和 sorting。如下图所示。class Mapper method Map(docid id, doc d) for all term t in doc d do Emit(term t, count 1) class Reducer method Reduce(term t, counts [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum)
这种方法的缺点显而易见,Mapper提交了太多无意义的计数。它完全可以通过先对每个文档中的词进行计数从而减少传递给Reducer的数据量:
class Mapper method Map(docid id, doc d) H = new AssociativeArray for all term t in doc d do H{t} = H{t} + 1 for all term t in H do Emit(term t, count H{t})如果要累计计数的的不只是单个文档中的内容,还包括了一个Mapper节点处理的所有文档,那就要用到Combiner了:
class Mapper method Map(docid id, doc d) for all term t in doc d do Emit(term t, count 1) class Combiner method Combine(term t, [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum) class Reducer method Reduce(term t, counts [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum)
class Mapper method Map(id n, object N) Emit(id n, object N) for all id m in N.OutgoingRelations do Emit(id m, message getMessage(N)) class Reducer method Reduce(id m, [s1, s2,...]) M = null messages = [] for all s in [s1, s2,...] do if IsObject(s) then M = s else // s is a message messages.add(s) M.State = calculateState(messages) Emit(id m, item M)一个节点的状态可以迅速的沿着网络传全网,那些被感染了的节点又去感染它们的邻居,整个过程就像下面的图示一样:
class N State in {True = 2, False = 1, null = 0}, initialized 1 or 2 for end-of-line categories, 0 otherwise method getMessage(object N) return N.State method calculateState(state s, data [d1, d2,...]) return max( [d1, d2,...] )
class N State is distance, initialized 0 for source node, INFINITY for all other nodes method getMessage(N) return N.State + 1 method calculateState(state s, data [d1, d2,...]) min( [d1, d2,...] )
class N State is PageRank method getMessage(object N) return N.State / N.OutgoingRelations.size() method calculateState(state s, data [d1, d2,...]) return ( sum([d1, d2,...]) )要指出的是上面用一个数值来作为评分实际上是一种简化,在实际情况下,我们需要在Mapper端来进行聚合计算得出这个值。下面的代码片段展示了这个改变后的逻辑 (针对于 PageRank 算法):
class Mapper method Initialize H = new AssociativeArray method Map(id n, object N) p = N.PageRank / N.OutgoingRelations.size() Emit(id n, object N) for all id m in N.OutgoingRelations do H{m} = H{m} + p method Close for all id n in H do Emit(id n, value H{n}) class Reducer method Reduce(id m, [s1, s2,...]) M = null p = 0 for all s in [s1, s2,...] do if IsObject(s) then M = s else p = p + s M.PageRank = p Emit(id m, item M)
Record 1: F=1, G={a, b} Record 2: F=2, G={a, d, e} Record 3: F=1, G={b} Record 4: F=3, G={a, b} Result: a -> 3 // F=1, F=2, F=3 b -> 2 // F=1, F=3 d -> 1 // F=2 e -> 1 // F=2解决方案 I: 第一种方法是分两个阶段来解决这个问题。第一阶段在Mapper中使用F和G组成一个复合值对,然后在Reducer中输出每个值对,目的是为了保证F值的唯一性。在第二阶段,再将值对按照G值来分组计算每组中的条目数。 第一阶段:
class Mapper method Map(null, record [value f, categories [g1, g2,...]]) for all category g in [g1, g2,...] Emit(record [g, f], count 1) class Reducer method Reduce(record [g, f], counts [n1, n2, ...]) Emit(record [g, f], null )第二阶段:
class Mapper method Map(record [f, g], null) Emit(value g, count 1) class Reducer method Reduce(value g, counts [n1, n2,...]) Emit(value g, sum( [n1, n2,...] ) )解决方案 II: 第二种方法只需要一次MapReduce 即可实现,但扩展性不强。算法很简单-Mapper 输出值和分类,在Reducer里为每个值对应的分类去重然后给每个所属的分类计数加1,最后再在Reducer结束后将所有计数加和。这种方法适用于只有有限个分类,而且拥有相同F值的记录不是很多的情况。例如网络日志处理和用户分类,用户的总数很多,但是每个用户的事件是有限的,以此分类得到的类别也是有限的。值得一提的是在这种模式下可以在数据传输到Reducer之前使用Combiner来去除分类的重复值。
class Mapper method Map(null, record [value f, categories [g1, g2,...] ) for all category g in [g1, g2,...] Emit(value f, category g) class Reducer method Initialize H = new AssociativeArray : category -> count method Reduce(value f, categories [g1, g2,...]) [g1', g2',..] = ExcludeDuplicates( [g1, g2,..] ) for all category g in [g1', g2',...] H{g} = H{g} + 1 method Close for all category g in H do Emit(category g, count H{g})
class Mapper method Map(null, items [i1, i2,...] ) for all item i in [i1, i2,...] for all item j in [i1, i2,...] Emit(pair [i j], count 1) class Reducer method Reduce(pair [i j], counts [c1, c2,...]) s = sum([c1, c2,...]) Emit(pair[i j], count s)Stripes Approach(条方法?不知道这个名字怎么理解) 第二种方法是将数据按照pair中的第一项来分组,并维护一个关联数组,数组中存储的是所有关联项的计数。The second approach is to group data by the first item in pair and maintain an associative array (“stripe”) where counters for all adjacent items are accumulated. Reducer receives all stripes for leading item i, merges them, and emits the same result as in the Pairs approach.
class Mapper method Map(null, items [i1, i2,...] ) for all item i in [i1, i2,...] H = new AssociativeArray : item -> counter for all item j in [i1, i2,...] H{j} = H{j} + 1 Emit(item i, stripe H) class Reducer method Reduce(item i, stripes [H1, H2,...]) H = new AssociativeArray : item -> counter H = merge-sum( [H1, H2,...] ) for all item j in H.keys() Emit(pair [i j], H{j})
class Mapper method Map(rowkey key, tuple t) if t satisfies the predicate Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) tuple g = project(t) // extract required fields to tuple g Emit(tuple g, null) class Reducer method Reduce(tuple t, array n) // n is an array of nulls Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, null) class Reducer method Reduce(tuple t, array n) // n is an array of one or two nulls Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, null) class Reducer method Reduce(tuple t, array n) // n is an array of one or two nulls if n.size() = 2 Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S' class Reducer method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R' 'S'], or ['S', 'R'] if n.size() = 1 and n[1] = 'R' Emit(tuple t, null)
class Mapper method Map(null, tuple [value GroupBy, value AggregateBy, value ...]) Emit(value GroupBy, value AggregateBy) class Reducer method Reduce(value GroupBy, [v1, v2,...]) Emit(value GroupBy, aggregate( [v1, v2,...] ) ) // aggregate() : sum(), max(),...
class Mapper method Map(null, tuple [join_key k, value v1, value v2,...]) Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2, ...] ] ) class Reducer method Reduce(join_key k, tagged_tuples [t1, t2,...]) H = new AssociativeArray : set_name -> values for all tagged_tuple t in [t1, t2,...] // separate values into 2 arrays H{t.tag}.add(t.values) for all values r in H{'R'} // produce a cross-join of the two arrays for all values l in H{'L'} Emit(null, [k r l] )
class Mapper method Initialize H = new AssociativeArray : join_key -> tuple from R R = loadR() for all [ join_key k, tuple [r1, r2,...] ] in R H{k} = H{k}.append( [r1, r2,...] ) method Map(join_key k, tuple l) for all tuple r in H{k} Emit(null, tuple [k r l] )
原文地址:[转载]MapReduce的模式、算法和用例, 感谢原作者分享。