Microsoft SQL Server provides a lot of statements to find what data is existed in A set but not in B set such as NOT IN, LEFT JOIN, EXCEPT and NOT EXISTS. The question is when we use these statements and which is the best performance in case we execute it on a large data set. In this article, I would like to exam how these statements perform and which is the best choice when I want to filter in data from two data set.

Sample data

For demonstration purpose, I try to use AdventureWorks2014 for querying list of customers without sales. Click here to download in case you do not have the database and download the script to create a large table. Now, we execute every statement 10 times and make sure that we clear CACHE after each execution. The table SalesOrderHeaderEnlarged has a non-clustered index on CustomerID column and Sales.Customer has the primary key on CustomerID

NOT EXISTS Testing

DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT CustomerID
FROM Sales.Customer c
WHERE NOT EXISTS (SELECT 1
FROM SalesOrderHeaderEnlarged s WHERE s.CustomerKey = c.CustomerKey)

GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO

Statistic and execution plan

(701 row(s) affected)
Table ‘SalesOrderHeaderEnlarged’. Scan count 19820, logical reads 72014, physical reads 160, read-ahead reads 2040, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Customer’. Scan count 1, logical reads 37, physical reads 1, read-ahead reads 35, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

 

 

NOT IN Testing

DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT CustomerID
FROM Sales.Customer c
WHERE CustomerID NOT IN (SELECT CustomerID
FROM SalesOrderHeaderEnlarged)

GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO

Statistic and execution plan

Table ‘SalesOrderHeaderEnlarged’. Scan count 19820, logical reads 71482, physical reads 171, read-ahead reads 1968, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Customer’. Scan count 1, logical reads 37, physical reads 1, read-ahead reads 35, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

LEFT JOIN Testing

DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT c.CustomerID
FROM Sales.Customer c
LEFT OUTER JOIN SalesOrderHeaderEnlarged s ON c.CustomerID = s.CustomerID
WHERE s.CustomerID IS NULL

GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO

Statistic and execution plan

Table ‘SalesOrderHeaderEnlarged’. Scan count 1, logical reads 3322, physical reads 3, read-ahead reads 3408, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Customer’. Scan count 1, logical reads 123, physical reads 1, read-ahead reads 121, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

EXCEPT Testing

DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;
GO
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT c.CustomerID
FROM Sales.Customer c
EXCEPT
SELECT CustomerID
FROM SalesOrderHeaderEnlarged

GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO

Statistic and execution plan

Table ‘SalesOrderHeaderEnlarged’. Scan count 19820, logical reads 71481, physical reads 171, read-ahead reads 1968, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Customer’. Scan count 1, logical reads 37, physical reads 1, read-ahead reads 35, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Be aware of cold cache in above testings, we also clear cache after every execution. What measures happen if warm cache? Beside, the large table has a non-clustered index on CustomerID so we try to measure on both cases of warm cache and no non-clustered index.

Comparison

I create a report for comparison of Average Execution Duration and I do not care how many logical reads and physical reads

Basically, the performance is not a significant different among NOT EXISTS, NOT IN , LEFT JOIN and EXCEPT but it seems NOT EXISTS is a good choice for most of cases in my view. In my experience, I have some common rules when I use these statements.

  1. NOT EXISTS is used when the right data set is large
  2. NOT IN is used when the right data set is quite small (<1000 records) or it is a list of hard-code values
  3. LEFT JOIN is used to when I need an additional columns from the right data set

Obviously, I recommend you to execute these statement on your own system to determine which is best one for you.

 

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s