SQL - 5% random sampling by group

I have a table with approximately 10 million rows and 4 columns, without a primary key. The data in columns 2, 3 and 4 (X2, X3 and X4) fall into 50 groups identified by column X1.

To get a random 5% sample from the table, I have always used

SELECT TOP 5 PERCENT * FROM thistable ORDER BY NEWID() 

This returns about 500,000 rows. But when rows are chosen this way, some groups end up over- or under-represented in the sample relative to their original size.

This time, to get a better sample, I want to take a 5% sample from each of the 50 groups in column X1, so that I end up with a random 5% of the rows in each group (instead of 5% of the whole table).
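To make the uneven representation visible, a check along the following lines could be run. It is only a sketch: it reuses thistable and X1 from above, the names sampled, total_rows, sample_rows and pct_sampled are made up for illustration, and it assumes X1 contains no NULLs.

```sql
-- Draw one random 5% sample, then compare each group's size in the sample
-- with its size in the full table.
WITH sampled AS (
    SELECT TOP 5 PERCENT * FROM thistable ORDER BY NEWID()
)
SELECT t.X1,
       t.total_rows,
       COALESCE(s.sample_rows, 0) AS sample_rows,
       -- Percentage of this group's rows that landed in the sample
       100.0 * COALESCE(s.sample_rows, 0) / t.total_rows AS pct_sampled
FROM (SELECT X1, COUNT(*) AS total_rows FROM thistable GROUP BY X1) AS t
LEFT JOIN (SELECT X1, COUNT(*) AS sample_rows FROM sampled GROUP BY X1) AS s
       ON s.X1 = t.X1
ORDER BY t.X1;
```

If every group were sampled evenly, pct_sampled would be close to 5 for all 50 groups. With a plain table-wide sample, it drifts from group to group, and more so for small groups.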

How can I approach this problem? Thanks.

1 answer

You need to count the rows in each group and put the rows in random order. Fortunately, we can do this with a CTE (common table expression). Although a CTE is not strictly necessary, it helps break the solution into separate steps rather than nested subqueries.

I assume that you already have a column that groups the data, and that the value in this column is the same for all rows in a group. If so, something like this might work (change the column and table names to match your situation):

    WITH randomID AS (
        -- First assign a random ID to all rows. This will give us a random order.
        SELECT *, NEWID() AS random
        FROM sourceTable
    ),
    countGroups AS (
        -- Now we add row numbers for each group, so each group starts at 1. We order
        -- by the random column generated in the previous expression, so you should get
        -- different results on each execution.
        SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt
        FROM randomID
    )
    -- Now we get the data
    SELECT *
    FROM countGroups c1
    WHERE rowcnt <= (
        SELECT MAX(rowcnt) / 20
        FROM countGroups c2
        WHERE c1.groupcolumn = c2.groupcolumn
    )

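The two CTEs let you put the rows in random order and then number them within each group. The final SELECT is then quite simple: for each group, find out how many rows it has and return only 5% of them (total_row_count_in_group / 20). Note that this is integer division, so the result is rounded down and a group with fewer than 20 rows returns no rows at all.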
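As a possible variation, the group size can be computed with a windowed COUNT(*) in the same pass as the row numbers, which avoids the correlated subquery. This is a sketch rather than the method above: it keeps the placeholder names sourceTable and groupcolumn, the names numbered and groupsize are made up for illustration, and it deliberately rounds up with CEILING so that every non-empty group contributes at least one row.

```sql
WITH numbered AS (
    SELECT *,
           -- Random position of each row within its group
           ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY NEWID()) AS rowcnt,
           -- Total number of rows in the row's group
           COUNT(*) OVER (PARTITION BY groupcolumn) AS groupsize
    FROM sourceTable
)
SELECT *
FROM numbered
-- Keep the first 5% of each group, rounded up
WHERE rowcnt <= CEILING(groupsize * 0.05);
```

The result also contains the helper columns rowcnt and groupsize, so list the original columns explicitly in the final SELECT if those should not appear. To match the rounding-down behaviour of the query above, FLOOR could be used in place of CEILING.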


Source: https://habr.com/ru/post/958334/

