The selected HIVE () non null counter returns a higher value than select count ()

Question

The selected HIVE () non null counter returns a higher value than select count ()

I am currently doing data exploration using Hive and cannot explain the following behavior. Let's say I have a table (named mytable) with a master_id field.

When I count the number of rows, I get

select count(*) as c from mytable c 1129563

If I want to count the number of rows with an invalid master_id, I get a larger number

 select count(*) as c from mytable where master_id is not null c 1134041

In addition, master_id is never null.

 select count(*) as c from mytable where master_id is null c 0

I cannot explain how adding a where clause can ultimately increase the number of lines. Does anyone have a hint to explain this behavior?

thanks

+5

hive hiveql

z_eb Oct 7 '16 at 8:27

source share

1 answer

leftjoin · Accepted Answer · 2016-10-07T09:44:55+0000

Most likely, your request without using statistics due to this parameter is set:

 set hive.compute.query.using.stats=true;

Try to set it to false and execute again.

Alternatively, you can calculate statistics from a table. See SYNTAX ANALYZE TABLE

It is also possible to automatically collect statistics during INSERT OVERWRITE:

 set hive.stats.autogather=true;

The selected HIVE (*) non null counter returns a higher value than select count (*)

More articles:

The selected HIVE () non null counter returns a higher value than select count ()