Home >Database >Mysql Tutorial >Why Does Adding a `WHERE master_id IS NOT NULL` Clause Increase the `COUNT(*)` in Hive?

Why Does Adding a `WHERE master_id IS NOT NULL` Clause Increase the `COUNT(*)` in Hive?

Linda Hamilton
Linda HamiltonOriginal
2025-01-12 06:27:45318browse

Why Does Adding a `WHERE master_id IS NOT NULL` Clause Increase the `COUNT(*)` in Hive?

Hive count exception: WHERE clause causes count to increase

During Hive data exploration, a strange phenomenon was observed. When counting using the simple statement select count(*) as c from mytable, the result is 1,129,563. However, adding the filter select count(*) as c from mytable where master_id is not null unexpectedly increased the count to 1,134,041.

Further investigation revealed that the master_id column always contained non-null values. This leads to a question: How to explain that the implementation of the null value exclusion condition actually increases the number of rows?

Possible explanation: Hive statistics

The answer lies in understanding the impact of Hive statistics. By default, Hive utilizes statistics to optimize queries and improve performance. When executing a select count(*) as c from mytable query (without any filtering), Hive may rely on stored statistics to estimate the count. However, these statistics are not always accurate or up-to-date.

In this case, the statistics may indicate that there are many rows in the master_id column that contain null values. When adding the filter master_id is not null, Hive re-evaluated the statistics and recognized that most rows contained non-null values. This results in a more accurate count, eliminating noticeable discrepancies.

Methods to solve the problem

To get an accurate count, especially when working with tables that have been significantly changed or have not been analyzed recently, it is recommended:

  • Disable statistics: Set hive.compute.query.using.stats=false to prevent Hive from using statistics and force a full table scan.
  • Collect statistics: Use the ANALYZE TABLE command to manually update table statistics and ensure their accuracy.
  • Enable automatic statistics collection: Set hive.stats.autogather=true to automatically collect statistics during data operations such as INSERT OVERWRITE.

The above is the detailed content of Why Does Adding a `WHERE master_id IS NOT NULL` Clause Increase the `COUNT(*)` in Hive?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn