Home >Database >Mysql Tutorial >Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?

Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?

DDD
DDDOriginal
2025-01-12 10:55:44384browse

Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?

*Hive's `COUNT()` Yields Unexpected Results After Filtering**

Using Hive's COUNT(*) function can sometimes produce surprising outcomes. Consider a table, "mytable," with a column "master_id."

An initial count query reveals 1,129,563 rows:

<code class="language-sql">SELECT COUNT(*) AS c FROM mytable;</code>

However, filtering for non-null "master_id" values unexpectedly increases the count to 1,134,041:

<code class="language-sql">SELECT COUNT(*) AS c FROM mytable WHERE master_id IS NOT NULL;</code>

This is puzzling, especially since a query checking for null "master_id" values returns zero:

<code class="language-sql">SELECT COUNT(*) AS c FROM mytable WHERE master_id IS NULL;</code>

The Reason Behind the Discrepancy

This inconsistency stems from Hive's reliance on table statistics for query optimization. The initial COUNT(*) query, lacking a WHERE clause, might use estimated statistics instead of a full table scan. Adding the WHERE clause forces a complete table scan, leading to a precise, and higher, row count.

How to Obtain Accurate Counts

To guarantee accurate results, consider these solutions:

  1. Disable Statistics-Based Optimization: Prevent Hive from using statistics for query optimization:

    <code class="language-sql">SET hive.compute.query.using.stats=false;</code>
  2. Manually Update Table Statistics: Use the ANALYZE TABLE command to refresh the table's statistics:

    <code class="language-sql">ANALYZE TABLE mytable COMPUTE STATISTICS;</code>
  3. Enable Automatic Statistics Gathering: Configure Hive to automatically update statistics during INSERT OVERWRITE operations:

    <code class="language-sql">SET hive.stats.autogather=true;</code>

By implementing one of these methods, you can ensure your COUNT(*) queries consistently return accurate row counts.

The above is the detailed content of Why Does My Hive `COUNT(*)` Increase After Filtering for Non-Null Values?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn