
Hive: the multi-distinct headache


A production query, simplified, looks like this:

    select dt,
           count(distinct c1),
           count(distinct case when c2 > 0 and c1 = 0 then c1 end),
           count(distinct case when c2 > 0 and c1 > 0 then c1 end)
      from t
     where dt between '20131108' and '20131110'
     group by dt;

The EXPLAIN output for this query (the // comments mark the two trouble spots):

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME t))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL c1))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0)) (TOK_TABLE_OR_COL c1)))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0)) (TOK_TABLE_OR_COL c1))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE (TOK_TABLE_OR_COL dt) '20131108' '20131110')) (TOK_GROUPBY (TOK_TABLE_OR_COL dt))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t
          TableScan
            alias: t
            Filter Operator
              predicate:
                  expr: dt BETWEEN '20131108' AND '20131110'
                  type: boolean
              // The Select Operator performs the projection.
              Select Operator
                expressions:
                      expr: dt
                      type: string
                      expr: c1
                      type: int
                      expr: c2
                      type: int
                outputColumnNames: dt, c1, c2
                // Partial aggregation on the map side. Pitfall 1: with N DISTINCT
                // expressions (N = 3 here) and M rows processed by a mapper, the output
                // of this step is N*M rows, because the map aggregates on (dt, keys[i]).
                // So filter out as much data as possible on the map side.
                Group By Operator
                  aggregations:
                        expr: count(DISTINCT c1)
                        expr: count(DISTINCT CASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END)
                        expr: count(DISTINCT CASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END)
                  bucketGroup: false
                  keys:
                        expr: dt
                        type: string
                        expr: c1
                        type: int
                        expr: CASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END
                        type: int
                        expr: CASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END
                        type: int
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                  // Pitfall 2: the Reduce Sink hashes rows by the partition columns
                  // (only _col0 = dt here), so for a table partitioned by date, all of a
                  // single day's data, inflated N times, is shipped to one reducer,
                  // which causes long-tail tasks or an OOM error.
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: int
                          expr: _col2
                          type: int
                          expr: _col3
                          type: int
                    sort order: ++++
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col4
                          type: bigint
                          expr: _col5
                          type: bigint
                          expr: _col6
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col1:0._col0)
                expr: count(DISTINCT KEY._col1:1._col0)
                expr: count(DISTINCT KEY._col1:2._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
                  expr: _col3
                  type: bigint
            outputColumnNames: _col0, _col1, _col2, _col3
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
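The surviving text does not say exactly how the query was changed, but the stage layout shown next is what you would typically see after splitting each count(distinct ...) into its own aggregation and joining the partial results on dt. The SQL below is only a sketch of that idea; the aliases uv_all, uv_zero and uv_pos are invented for the example, and the article's actual rewrite may differ.

    -- Sketch of an assumed rewrite (not quoted from the article): each DISTINCT
    -- runs in its own subquery, so no single Group By Operator has to expand every
    -- input row N times, and the three aggregations become independent jobs.
    select a.dt, a.uv_all, b.uv_zero, c.uv_pos
      from (select dt, count(distinct c1) as uv_all
              from t
             where dt between '20131108' and '20131110'
             group by dt) a
      left outer join
           (select dt, count(distinct c1) as uv_zero
              from t
             where dt between '20131108' and '20131110'
               and c2 > 0 and c1 = 0
             group by dt) b
        on a.dt = b.dt
      left outer join
           (select dt, count(distinct c1) as uv_pos
              from t
             where dt between '20131108' and '20131110'
               and c2 > 0 and c1 > 0
             group by dt) c
        on a.dt = c.dt;

The left outer joins keep a row for a dt even when one of the filtered subqueries produces nothing for that day; to match the original output exactly you would also wrap b.uv_zero and c.uv_pos in nvl(..., 0).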


Checking the new execution plan (non-essential parts omitted):

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
  Stage-3 is a root stage
  Stage-4 is a root stage
  Stage-0 is a root stage
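Stage-1, Stage-3 and Stage-4 are now independent root stages, so Hive can schedule them concurrently. The settings below are standard Hive parameters rather than something stated in the article; treat them as an optional, assumed follow-up:

    -- Assumed follow-up, not from the original article: let independent stages
    -- of the same query run in parallel instead of one after another.
    set hive.exec.parallel=true;
    set hive.exec.parallel.thread.number=8;  -- cap on how many stages run at once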


This article comes from the "MIKE老毕的WIKI" blog; please keep this attribution when reposting.
