
Select n random rows from SQL Server table

sql · performance · randomness · sampling

by Alex Kataev · Dec 22, 2024
TLDR

For a rapid-fire n random records from a table, use the NEWID() function in conjunction with ORDER BY like so:

SELECT TOP (n) * FROM YourTable ORDER BY NEWID();

Replace n with your desired row count and YourTable with your actual table name to get a new, reordered result set each time the query executes. Simple and effective for small to mid-sized tables, it's an easy win when performance isn't a primary focus.

Performance nuances for larger datasets

When your database scales up, ORDER BY NEWID() might stumble. SQL Server has to generate a new GUID for every row and then sort the entire table by those values, an operation that is taxing in both time and resources. In situations like these, you are better off considering more performance-oriented alternatives.

Using the TABLESAMPLE clause with large volumes of data

For massive datasets that require random sampling, the TABLESAMPLE clause is an appealing option:

SELECT * FROM YourTable TABLESAMPLE (10 PERCENT) REPEATABLE (12345);

The TABLESAMPLE clause pulls its sample by selecting pages instead of rows, which gives a significant speed boost on larger tables. Take note, though: TABLESAMPLE is a bit rough around the edges. It may not return precisely 10% of the rows, especially when dealing with smaller datasets.

Fine-tuning results with advanced randomization techniques

Even distribution is key, especially when filtering by a certain percentage. You can leverage the ABS, CAST, and BINARY_CHECKSUM functions for this:

SELECT * FROM YourTable WHERE ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100 < 10;

This formula uses BINARY_CHECKSUM to give each row a pseudo-random value based on its content, multiplied by RAND() so the selection changes between executions while the distribution retains a roughly uniform spread across the required percentage.

Bias & randomness concerns in TABLESAMPLE

TABLESAMPLE might introduce an element of bias into row selection: its default behavior is to retrieve random pages of data, not rows. Rows that sit on the same page are picked or skipped together, so a completely uniform distribution of rows is not guaranteed.
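You can see the approximation for yourself: ask for a fixed row count and watch the actual count wobble between runs. A quick sketch (YourTable is a placeholder):

-- Requests roughly 1000 rows, but SQL Server converts this to a
-- percentage of 8 KB pages, so the actual count varies per run
-- and can even be 0 on small tables.
SELECT COUNT(*) AS sampled_rows
FROM YourTable TABLESAMPLE (1000 ROWS);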

Getting your hands dirty with RAND()

Including a seed value in RAND() produces the same sequence of numbers for a given seed within a session. Meanwhile, a bare RAND() is evaluated only once per query, so it returns the same value for every row. To generate a fresh random float per row, seed it with something row-specific, such as RAND(CAST(NEWID() AS binary(4))), which blends the randomness of NEWID() with the numerical output of RAND().
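To see the per-query and per-row behavior side by side, here's a small sketch; CHECKSUM(NEWID()) is another common way to produce a per-row int seed:

-- A bare RAND() is evaluated once per query: every row gets the same value.
SELECT TOP (5) RAND() AS same_for_all
FROM YourTable;

-- Seeding with a per-row NEWID() forces a fresh float for each row.
SELECT TOP (5) RAND(CHECKSUM(NEWID())) AS fresh_per_row
FROM YourTable;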

Battle testing your query performance

Use SQL Server's in-built tools to measure the efficiency of your query and how it stacks up against others:

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Your SQL query here. Feel free to insert witty comments.

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

This sequence will spit out runtime and I/O statistics, and help you make an informed decision on the most performant method. Did someone say data-driven?

Targeting an exact number of rows

If your requirement is to retrieve a precise number of rows, merging TABLESAMPLE with common table expressions (CTEs) or subqueries is ideal. This is especially useful when you need a proportional representation from groups within the table, as sketched below.
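A sketch of that idea: oversample cheaply with TABLESAMPLE, then trim the result to an exact count. The 1 PERCENT rate and the 100-row target are assumptions; pick a rate that reliably yields more rows than you need on your table:

WITH Sampled AS (
    -- Cheap page-level oversample; returns an approximate row count
    SELECT * FROM YourTable TABLESAMPLE (1 PERCENT)
)
-- Trim the small sample down to exactly 100 rows
SELECT TOP (100) *
FROM Sampled
ORDER BY NEWID();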

Hybrid approach with a temporary filter table

Build a derived table that pairs each primary key with a randomly assigned NEWID() value, limit it to your required count, and join it back to the original table.

SELECT YT.*
FROM YourTable AS YT
JOIN (
    SELECT TOP (n) [yourPk], NEWID() AS rnd
    FROM YourTable
    ORDER BY rnd
) AS TempTable
    ON YT.[yourPk] = TempTable.[yourPk];

This methodology selects a fixed number of rows and takes advantage of the intentional randomness NEWID() brings to the table, but it sorts only the narrow key column instead of the complete rows, which avoids much of the drag on performance that sorting the full table brings.
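If you'd rather materialize the filter (say, to reuse the same random pick across several statements), the same idea works with a real temporary table. A sketch; #RandomKeys, [yourPk], and the count of 100 are placeholders:

-- Grab 100 random primary keys into a temp table (narrow sort only)
SELECT TOP (100) [yourPk]
INTO #RandomKeys
FROM YourTable
ORDER BY NEWID();

-- Reuse the same random pick as often as needed
SELECT YT.*
FROM YourTable AS YT
JOIN #RandomKeys AS RK
    ON YT.[yourPk] = RK.[yourPk];

DROP TABLE #RandomKeys;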

Beware of potential pitfalls

Care for larger datasets

When running queries on large datasets, ORDER BY NEWID() is not your friend: it will tank your performance and response time. Always benchmark your query's performance on a dataset that closely mirrors your production data to keep frustratingly long runtimes at bay.

Ensuring randomness

When you're after a truly random sample, sidestep any unintentional patterns. For instance, when working with BINARY_CHECKSUM, include enough variance in the columns used for the expression to minimize repeatable patterns.
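One way to sidestep content-dependent patterns entirely is to seed the checksum with NEWID() instead of row data, so identical rows no longer share a fate. A sketch of that variant:

-- CHECKSUM(NEWID()) is independent of row content, so duplicate rows
-- are sampled independently; roughly 10% of rows pass the filter.
SELECT *
FROM YourTable
WHERE ABS(CHECKSUM(NEWID())) % 100 < 10;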

Scientific-grade randomness

If you need scientific-grade randomness and you're considering BINARY_CHECKSUM and RAND for a significant application, it's prudent to consult with a statistician to verify your methodology. Someone might just thank you for it!