Select n random rows, where the number taken for each value is proportional to that value's share of the total population

I have a table of 58 million customer records. Each customer has a market value (EN, US, FR, etc.).

I am trying to select a sample of 100 thousand records that contains customers from all markets. The share of each market in the sample should match its share in the actual table.

So, if customers from the UK make up 15% of the rows in the customer table, then the sample of 100 thousand should contain 15 thousand UK customers, and likewise for every other market.

Is there any way to do this?

2 answers

First, a simple random sample should already reflect the market sizes quite well. What you are asking for is a stratified sample.
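To make that first claim concrete, here is a minimal Python sketch (not part of the original answer). It assumes a hypothetical population of 1,000,000 row ids, of which the first 15% belong to one market, and measures that market's share in a simple random sample of 100,000. The function name and the id layout are illustrative assumptions only.

```python
import random

def srs_market_share(population_size, market_share, sample_size, seed=0):
    """Return the share of a simple random sample that falls into one market."""
    rng = random.Random(seed)
    # Hypothetical layout: ids below `market_rows` belong to the market of interest.
    market_rows = round(population_size * market_share)
    # Draw row ids without replacement, i.e. a simple random sample.
    sample = rng.sample(range(population_size), sample_size)
    hits = sum(1 for row_id in sample if row_id < market_rows)
    return hits / sample_size

share = srs_market_share(population_size=1_000_000, market_share=0.15,
                         sample_size=100_000)
print(f"Market share in the sample: {share:.4f}")
```

With a sample this large the share typically lands very close to 0.15, but it is not guaranteed to be exact, which is the reason to stratify when exact proportions matter.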

One way to get such a sample is to order the rows randomly within each market and assign a sequence number in each group. Then normalize the sequence number to a value between 0 and 1 by dividing it by the group size and, finally, order by the normalized value and select the top "n" rows:

    select top 100000 c.*
    from (select c.*,
                 row_number() over (partition by market
                                    order by rand(checksum(newid()))) as seqnum,
                 count(*) over (partition by market) as cnt
          from customers c
         ) c
    order by cast(seqnum as float) / cnt

It may be clearer what happens if you look at some data. Consider a sample of 5 from:

    id  market
     1  A
     2  B
     3  C
     4  D
     5  D
     6  D
     7  B
     8  A
     9  D
    10  C

The first step assigns a sequence number to each row within its market, in random order:

    id  market  seqnum
     1  A       1
     2  B       1
     3  C       1
     4  D       1
     5  D       2
     6  D       3
     7  B       2
     8  A       2
     9  D       4
    10  C       2

Then normalize these values by dividing each one by the number of rows in its market:

    id  market  seqnum  normalized
     1  A       1       0.50
     2  B       1       0.50
     3  C       1       0.50
     4  D       1       0.25
     5  D       2       0.50
     6  D       3       0.75
     7  B       2       1.00
     8  A       2       1.00
     9  D       4       1.00
    10  C       2       1.00

Now, if you take the top five rows by normalized value, you get rows 1 through 5 (one each from A, B and C, and two from D), which is a stratified sample.
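The same steps can be traced outside the database. The following is a minimal Python sketch of the normalized-rank idea, not a translation of the exact query; the function name `stratified_top_n` and the tuple layout `(id, market)` are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_top_n(rows, n, seed=None):
    """rows: list of (row_id, market). Return the n rows with the smallest
    normalized within-market rank."""
    rng = random.Random(seed)
    by_market = defaultdict(list)
    for row in rows:
        by_market[row[1]].append(row)

    ranked = []
    for members in by_market.values():
        rng.shuffle(members)              # random order within the market
        cnt = len(members)                # rows in this market
        for seqnum, row in enumerate(members, start=1):
            ranked.append((seqnum / cnt, row))   # normalized rank in (0, 1]

    ranked.sort(key=lambda pair: pair[0])  # smallest normalized rank first
    return [row for _, row in ranked[:n]]

rows = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "D"),
        (6, "D"), (7, "B"), (8, "A"), (9, "D"), (10, "C")]
sample = stratified_top_n(rows, 5, seed=1)
print(Counter(market for _, market in sample))
```

For this data the sample always holds one A, one B, one C and two D rows, whichever rows the shuffle happens to pick. In general, rows that tie on the normalized value at the cutoff are kept or dropped arbitrarily, so the per-market counts can be off by one.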


As Gordon Linoff pointed out, a large simple random sample will already give you a good statistical approximation of the original population.

To force the percentages in the sample to match those in the population, you can calculate all the necessary parameters yourself: the total population size, the size of each stratum (here, each country), and a random sequence number within each stratum.

    Declare @sampleSize INT
    Set @sampleSize = 100000

    With D AS (
      SELECT customerID
           , Country
           , Count(customerID) OVER (PARTITION BY Null) TotalData
           , Count(customerID) OVER (PARTITION BY Country) CountryData
           , Row_Number() OVER (PARTITION BY Country
                                ORDER BY rand(checksum(newid()))) ID
      FROM   customer
    )
    SELECT customerID
         , Country
    FROM   D
    WHERE  ID <= Round((Cast(CountryData as Float) / TotalData) * @sampleSize, 0)
    ORDER BY Country
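As a rough illustration of what this query computes, here is a small Python sketch of the same quota idea: work out each country's quota from its share of the population, then draw that many rows at random from the country. The function name `quota_sample` and the `(id, country)` tuples are hypothetical, and the rounding is written to mimic T-SQL's `ROUND`, which rounds halves away from zero.

```python
import math
import random
from collections import Counter, defaultdict

def quota_sample(rows, sample_size, seed=None):
    """rows: list of (row_id, country). Draw round(share * sample_size)
    random rows from each country."""
    rng = random.Random(seed)
    by_country = defaultdict(list)
    for row in rows:
        by_country[row[1]].append(row)

    total = len(rows)
    picked = []
    for country in sorted(by_country):
        members = by_country[country]
        # Python's round() rounds halves to even, so floor(x + 0.5) is used
        # to round halves up, as ROUND does for non-negative values.
        quota = math.floor(len(members) * sample_size / total + 0.5)
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked

rows = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "D"),
        (6, "D"), (7, "B"), (8, "A"), (9, "D"), (10, "C")]
sample = quota_sample(rows, 5, seed=1)
print(Counter(country for _, country in sample))
```

Here the quotas come out as whole numbers (1, 1, 1 and 2), so the sample has exactly five rows.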

SQLFiddle demo with fewer rows.

Keep in mind that the rounding in the WHERE clause can make the number of returned rows slightly smaller or slightly larger than requested; in the demo, for example, 9 rows are returned instead of 10.
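The drift is easy to reproduce with plain arithmetic. The Python sketch below uses made-up market sizes (not the data from the demo) to show per-market rounding returning 9 rows for a requested 10. It also shows one common way to hit the exact total, largest-remainder allocation, which is an addition here and not part of the answer above.

```python
import math

def rounded_quotas(counts, sample_size):
    """Round each market's quota independently, halves rounded up."""
    total = sum(counts.values())
    return {m: math.floor(c * sample_size / total + 0.5)
            for m, c in counts.items()}

def largest_remainder_quotas(counts, sample_size):
    """Floor every quota, then give the leftover rows to the markets
    with the largest fractional remainders."""
    total = sum(counts.values())
    quotas = {m: c * sample_size // total for m, c in counts.items()}
    leftover = sample_size - sum(quotas.values())
    # Integer remainders avoid floating-point comparison issues.
    by_remainder = sorted(counts,
                          key=lambda m: counts[m] * sample_size % total,
                          reverse=True)
    for market in by_remainder[:leftover]:
        quotas[market] += 1
    return quotas

counts = {"A": 13, "B": 13, "C": 13, "D": 61}   # hypothetical market sizes
print(sum(rounded_quotas(counts, 10).values()))            # 9
print(sum(largest_remainder_quotas(counts, 10).values()))  # 10
```

Each quota here is 1.3, 1.3, 1.3 and 6.1, so independent rounding drops all the fractions and loses one row; the largest-remainder variant hands that row back to one of the markets with the biggest fraction.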


Source: https://habr.com/ru/post/970037/

