The selected HIVE (*) non null counter returns a higher value than select count (*)

I am currently doing data exploration using Hive and cannot explain the following behavior. Let's say I have a table (named mytable) with a master_id field.

When I count the number of rows, I get

select count(*) as c from mytable c 1129563 

If I want to count the number of rows with an invalid master_id, I get a larger number

 select count(*) as c from mytable where master_id is not null c 1134041 

In addition, master_id is never null.

 select count(*) as c from mytable where master_id is null c 0 

I cannot explain how adding a where clause can ultimately increase the number of lines. Does anyone have a hint to explain this behavior?

thanks

+5
source share
1 answer

Most likely, your request without using statistics due to this parameter is set:

 set hive.compute.query.using.stats=true; 

Try to set it to false and execute again.

Alternatively, you can calculate statistics from a table. See SYNTAX ANALYZE TABLE

It is also possible to automatically collect statistics during INSERT OVERWRITE:

 set hive.stats.autogather=true; 
+5
source

Source: https://habr.com/ru/post/1257840/


All Articles