INTERSECT/EXCEPT versus IN/NOT IN

by Allan Svelmøe Hansen May 21, 2008 19:59

SQL Server has two commands useful for finding similar rows and different rows between two tables.

These commands are EXCEPT and INTERSECT. They work by comparing the entire row of one query/table with another query/table.
EXCEPT then returns the rows which are in the first, but not the second table and INTERSECT returns the rows from the first which are also in the second table. But how do they perform?

In real life I find it is rare that I have to compare two tables or queries (or more) which possible have completely indentical data, but it does happen once in a while. However most of the time, these rows also have a unique primary key or otherwise index, so most of the time I use IN and NOT IN myself.
So I'll try EXCEPT/INTERSECT against IN/NOT IN and see how they fare.

I've made two identical tables, Table1 and Table1 both with an ID (int) and Value (nvarchar(10)), and filled them with some test data, ensuring a number of identical rows and a number of different rows using the following query:

DECLARE @I INT
SET 
@I 0

WHILE @I 10000 BEGIN
 IF 
@I <> 0
  
INSERT INTO Table1 VALUES (@IREPLICATE('A'10))
 
IF @I <> INSERT
  INTO 
Table2 VALUES (@IREPLICATE('A'10))
 
SET @I @I 1
END
 

I've not used any indexes et al, because it isn't needed. I am only after the comparable values.

Anyway, using what I know of common problems from work, I now want to find rows in Table1 which aren't in Table2.
Because I know I have "ID" which I assume right now is a unique key (I have no duplicates, because I know what data I've inserted), I'll only use an NOT IN on the ID column, like this:

SELECT *
FROM Table1
WHERE ID NOT IN (
 
SELECT ID FROM Table2
)
GO

SELECT FROM Table1
EXCEPT
SELECT 
FROM Table2
GO 
 

This performs as one might expect vastly superior for the NOT IN syntax with a ratio of 23% vs. 77%.
However, suppose that I also needed to make sure that Value was the same and that I thus couldn't rely only on the ID, making the following query:

SELECT *
FROM Table1
WHERE ID NOT IN (
 
SELECT ID FROM Table2
)
OR 
[Value] NOT IN (
 
SELECT [Value] FROM Table2
)
GO

SELECT FROM Table1
EXCEPT
SELECT 
FROM Table2
GO 
 

And now the result changes with the NOT IN syntax taking 78% and the EXCEPT only 22%.
So if you need more then one value to be different, then it starts to look like you can use the EXCEPT with performance gain versus the NOT IN syntax.
If you only need to compare on one, it is a waste of ressources to use EXCEPT – lest it is a one time query of course. But it if is something which needs to run "often", I'd think twice.

Conclusions on the results of INTERSECT are exactly (well, the numbers differ of course) the same.
One column to compare, and it is better to use IN, like this:

SELECT *
FROM Table1
WHERE ID IN (
 
SELECT ID FROM Table2
)
GO
 

Compared to

SELECT FROM Table1
INTERSECT
SELECT 
FROM Table2
GO
 

If needing more values, then INTERSECT is better.

Bookmark and Share DotnetKicks dotnetshoutout

NOT IN versus LEFT JOIN

by Allan Svelmøe Hansen May 18, 2008 08:36
As you may or may not know, it is possible to use LEFT JOIN (and thus RIGHT JOIN) to filter out tables, just as it is with the NOT IN.
However, how do they perform against each other?
Well, using similar data as created in the "In versus INNER JOIN" example, I made two simple queries to see:

SELECT *
FROM TestOne
WHERE ID1 NOT IN
(
 
SELECT ID1 FROM TestTwo
)
GO

SELECT T1.*
FROM TestOne T1
LEFT JOIN TestTwo T2 ON T1.ID1 T2.ID1
WHERE T2.ID1 IS NULL
GO
 

This gave the following execution plan:
(click to enlarge)
Estimated Execution Plan For NOT IN versus LEFT JOIN for first and second query
As can be seen the margin of difference - 46% versus 54% - is much smaller then with the IN syntax as linked above.
So I tried more complex queries where I just took the ones I made for the In versus INNER JOIN article, and modified them to be NOT IN and LEFT JOINS. Like this:


SELECT T1.*
FROM TestOne T1
WHERE T1.ID1 NOT IN (
  
SELECT ID1 FROM TestTwo T2
  
WHERE ID2 NOT IN (
   
SELECT ID2 FROM TestFour T4
   
WHERE ID4 NOT IN (
    
SELECT ID4 FROM TestFive T5
   
)
  )
 )
GO

SELECT T1.*
FROM TestOne T1
LEFT JOIN TestTwo T2 ON T1.ID1 T2.ID1
LEFT JOIN TestFour T4 ON T2.ID2 T4.ID2
LEFT JOIN TestFive T5 ON T4.ID4 T5.ID4
WHERE T2.ID1 IS NULL
AND 
T4.ID2 IS NULL
AND 
T5.ID4 IS NULL
GO 


This gave a queryplan where the LEFT JOIN actually performened better, not by much but still 52 versus 48%. (I've minimized the query plans because I was only interested in the numbers, and not in the execution plan itself.)
(click to enlarge)
Estimated Execution Plan For NOT IN versus LEFT JOIN for third and fourth query
So where the case was more in favore of using IN instead of INNER JOINs when filtering inclusive, then exclusive looks to be more evenly matched when it comes to NOT IN versus the LEFT JOIN syntax.
But as always, use a case by case decision, by looking at your own querys details. This is just an example and guideline numbers.
Bookmark and Share DotnetKicks dotnetshoutout

Powered by BlogEngine.NET 1.6.1.0
Theme by Mads Kristensen | Modified by Mooglegiant

About:
Allan Svelmøe Hansen

My real name is Allan Svelmøe Hansen.
I live in Denmark, where I work as a developer for hedal:kruse:brohus using SQL Server and the .NET framework since 2004.
My primary fields of expertise is back-end data integration, database design and optimization.


       View Allan Svelmøe Hansen's profile on LinkedIn     

Disclaimer

The opinions expressed herein are my own personal opinions and thoughts and does not represent my employers view in any way, nor are my results guaranteed for all situations.
Content is presented "as is", with no warranty.