I have a table with approximately 10 million rows and 4 columns, and no primary key. The data in columns 2, 3, and 4 (x2, x3, and x4) is grouped into 50 groups, identified by column X1.
To get a random 5% sample from the table, I have always used:
SELECT TOP 5 PERCENT * FROM thistable ORDER BY NEWID()
This returns about 500,000 rows. However, when rows are chosen this way, some groups end up over- or under-represented in the sample relative to their original size.
This time, to get a more representative sample, I want to take a 5% sample from each of the 50 groups in column X1, so that I end up with a random 5% of the rows in each group rather than 5% of the whole table.
How can I approach this problem? Thanks.
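To make the goal concrete, here is a rough, untested sketch of the kind of per-group query I have in mind. It assumes SQL Server 2005 or later (for window functions) and that the columns are literally named x1 through x4. The names ranked, rn, and group_size are just illustrative. It numbers the rows of each X1 group in random order and keeps the first 5% of each group, rounding up:

```sql
-- Sketch only: a random 5% of the rows within each x1 group.
-- Assumes columns x1..x4 on thistable; ranked, rn, group_size are made-up names.
WITH ranked AS (
    SELECT x1, x2, x3, x4,
           -- random ordering inside each group
           ROW_NUMBER() OVER (PARTITION BY x1 ORDER BY NEWID()) AS rn,
           -- number of rows in the group this row belongs to
           COUNT(*) OVER (PARTITION BY x1) AS group_size
    FROM thistable
)
SELECT x1, x2, x3, x4
FROM ranked
WHERE rn <= CEILING(group_size * 0.05);  -- keep the first 5% per group, rounded up
```

On 10 million rows the ORDER BY NEWID() sort is likely to be expensive, so this is meant to show the intended result rather than a tuned solution.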