Is the LIMIT clause in HIVE really random?

The Hive documentation notes that the LIMIT clause returns rows chosen at random. I ran a SELECT with LIMIT 1 against a table with over 800,000 records, but it always returns the same record.

I am using the Shark distribution, and I wonder whether that has anything to do with this behavior. Any thoughts would be appreciated.

Thanks Visakh

+6
3 answers

Although the documentation states that it returns rows at random, this is not actually the case.

It returns "rows chosen at random" only in the sense that they come back in whatever order they appear in the database when there is no where / order by clause. So the result is not random (or randomly chosen) as you might think; it is just that the order in which the rows are returned cannot be determined in advance.

As soon as you add order by x DESC limit 5, it will return the last 5 rows of whatever you are selecting from.

To get rows chosen at random, you will need to use something like: order by rand() LIMIT 1

However, this can hurt performance if your indexes are not set up properly. I usually take the min / max of the identifier in the table, generate a random number between them, and then select the matching records (in your case just one record), which is usually faster than having the database do the work, especially on a large dataset.
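To make the contrast concrete, here is a minimal sketch. The table name events and the numeric column x are hypothetical, chosen only for illustration:

```sql
-- No ORDER BY: Hive is free to return whichever row it reads first.
-- On an unchanged table this is often the same row on every run,
-- but nothing guarantees which row that is.
SELECT * FROM events LIMIT 1;

-- With ORDER BY: the result is well defined (apart from ties on x):
-- the 5 rows with the largest values of x.
SELECT * FROM events ORDER BY x DESC LIMIT 5;
```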
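The min / max approach could look roughly like the sketch below. It assumes a hypothetical table events with a numeric id column; the names events, id, and target_id are illustrative and not from the original question. If the ids have gaps, rows that follow a gap are picked more often, so treat this as approximate.

```sql
-- Pick one random id between min(id) and max(id), then fetch the
-- first row at or above it. rand() is evaluated once, on the
-- single-row bounds subquery.
SELECT e.*
FROM events e
CROSS JOIN (
  SELECT CAST(floor(min_id + rand() * (max_id - min_id + 1)) AS BIGINT) AS target_id
  FROM (SELECT min(id) AS min_id, max(id) AS max_id FROM events) b
) r
WHERE e.id >= r.target_id   -- ">=" tolerates gaps in the id sequence
ORDER BY e.id
LIMIT 1;
```

Depending on configuration, Hive's strict mode may reject the cross join; whether this actually beats order by rand() on a given cluster is worth measuring rather than assuming.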

+5

To be safe, you want to use:

select * from table
distribute by rand()
sort by rand()
limit 10000;

+8

The documentation may have been updated since this question was originally published in 2014, but as of December 2017 it reads: "The following query returns 5 arbitrary customers."

In this case, "arbitrary" means that the selection method is either non-deterministic or simply not documented. In other words, you should not rely on it as a way to obtain a particular subset of records (for example, for sampling). Use the Limit clause without an Order By clause only when you care about expediency and want a small result set as quickly as possible (for example, for QA purposes). Otherwise, use Order By, Cluster By, or Distribute By / Sort By, as appropriate.
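As a rough illustration of those alternatives, the sketch below reuses the customers table from the documentation's example; the column customer_id is a hypothetical name added for illustration.

```sql
-- Quick peek for QA: fast, but which 5 rows come back is unspecified.
SELECT * FROM customers LIMIT 5;

-- Total ordering across the whole result set.
SELECT * FROM customers ORDER BY customer_id LIMIT 5;

-- CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same
-- columns: rows are routed to reducers by customer_id and sorted
-- within each reducer, not globally.
SELECT * FROM customers CLUSTER BY customer_id;
SELECT * FROM customers DISTRIBUTE BY customer_id SORT BY customer_id;
```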

+2

Source: https://habr.com/ru/post/969683/

