Select a random sample from SQL Server quickly

I have a huge table of more than 10 million rows. I need to efficiently take a random sample of 5,000 rows. I have some constraints that reduce the total number of rows I am looking at to about 9 million.

I tried using ORDER BY NEWID(), but that query takes too long because it has to scan every row in the table.

Is there a faster way to do this?

+14
performance sql database sql-server random
Mar 16 '09 at 20:33
4 answers

If pseudo-random sampling is acceptable and you are on SQL Server 2005/2008, take a look at TABLESAMPLE. For instance, here is an example from SQL Server 2008 / AdventureWorks 2008 that samples by number of rows:

 USE AdventureWorks2008;
 GO

 SELECT FirstName, LastName
 FROM Person.Person TABLESAMPLE (100 ROWS)
 WHERE EmailPromotion = 2;

The catch is that TABLESAMPLE is not exactly random, because it samples whole physical data pages rather than individual rows. You may not get back exactly 5,000 rows unless you also limit the result with TOP. If you are on SQL Server 2000, you will either have to create a temporary table that matches on the primary key or fall back on the NEWID() method.
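
One way to combine TABLESAMPLE with TOP for the 5,000-row case in the question is sketched below. The table name dbo.BigTable and the Status filter are hypothetical stand-ins for the real table and constraints, and the 1 PERCENT figure is an assumption chosen so that the page sample (roughly 100,000 rows out of 10 million) comfortably exceeds 5,000 rows after filtering.

```sql
-- Hypothetical names: dbo.BigTable and Status are placeholders, not from the question.
-- TABLESAMPLE picks a random subset of data pages first (an approximate 1 percent),
-- then the WHERE clause filters the rows on those pages.
SELECT TOP (5000) *
FROM dbo.BigTable TABLESAMPLE (1 PERCENT)
WHERE Status = 1
-- NEWID() now only has to shuffle the small page sample, not the whole table.
ORDER BY NEWID();
```

Because the percentage is only approximate, the sample can occasionally come back with fewer than 5,000 rows; raising the percentage makes that less likely at the cost of a larger sort.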

+19
Mar 16 '09 at 20:46

Have you looked into using the TABLESAMPLE clause?

For example:

 select * from HumanResources.Department tablesample (5 percent) 
+8
Mar 16 '09 at 20:42

A solution specific to Microsoft SQL Server 2000 (instead of the slow NEWID() approach on large tables):

 SELECT * FROM Table1 WHERE (ABS(CAST( (BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10 

The SQL Server team at Microsoft realized that not being able to take random samples of rows easily was a common problem in SQL Server 2000, so the team addressed the problem in SQL Server 2005 by introducing the TABLESAMPLE clause. This clause selects a subset of rows by choosing random data pages and returning all of the rows on those pages. However, for those of us who still have products that run on SQL Server 2000 and need backward compatibility, or who need truly row-level randomness, the BINARY_CHECKSUM query is a very effective workaround.

An explanation can be found here: http://msdn.microsoft.com/en-us/library/cc441928.aspx
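
For row-level randomness on SQL Server 2005 and later, a related per-row filter can be sketched as follows. This is a variant, not the query above: it uses CHECKSUM(NEWID(), ...) so that each row gets its own random value. The names dbo.BigTable, Id, and Status are hypothetical, and the 0.001 threshold is an assumption that keeps about 9,000 of 9 million qualifying rows before TOP trims the result to 5,000.

```sql
-- Hypothetical names: dbo.BigTable, Id (key column) and Status (filter) are placeholders.
SELECT TOP (5000) *
FROM dbo.BigTable
WHERE Status = 1
  -- Map a per-row checksum of NEWID() to a float in [0, 1] and keep ~0.1% of rows.
  AND 0.001 >= CAST(CHECKSUM(NEWID(), Id) & 0x7fffffff AS float)
               / CAST(0x7fffffff AS int)
-- Shuffle only the few thousand surviving rows before taking the top 5,000.
ORDER BY NEWID();
```

This still reads every row once, but it avoids sorting the entire table, which is usually the expensive part of a plain ORDER BY NEWID().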

+6
Nov 23 '12 at 6:05

Yes, TABLESAMPLE is your friend (note that it is not random in the statistical sense of the word): TABLESAMPLE on MSDN

+4
Mar 16 '09 at 20:40