INTERSECT/EXCEPT versus IN/NOT IN

by Allan Svelmøe Hansen May 21, 2008 19:59

SQL Server has two commands useful for finding similar rows and different rows between two tables.

These commands are EXCEPT and INTERSECT. They work by comparing the entire row of one query/table with another query/table.
EXCEPT then returns the rows which are in the first, but not the second table and INTERSECT returns the rows from the first which are also in the second table. But how do they perform?

In real life I find it is rare that I have to compare two tables or queries (or more) which possible have completely indentical data, but it does happen once in a while. However most of the time, these rows also have a unique primary key or otherwise index, so most of the time I use IN and NOT IN myself.
So I'll try EXCEPT/INTERSECT against IN/NOT IN and see how they fare.

I've made two identical tables, Table1 and Table1 both with an ID (int) and Value (nvarchar(10)), and filled them with some test data, ensuring a number of identical rows and a number of different rows using the following query:

DECLARE @I INT
SET 
@I 0

WHILE @I 10000 BEGIN
 IF 
@I <> 0
  
INSERT INTO Table1 VALUES (@IREPLICATE('A'10))
 
IF @I <> INSERT
  INTO 
Table2 VALUES (@IREPLICATE('A'10))
 
SET @I @I 1
END
 

I've not used any indexes et al, because it isn't needed. I am only after the comparable values.

Anyway, using what I know of common problems from work, I now want to find rows in Table1 which aren't in Table2.
Because I know I have "ID" which I assume right now is a unique key (I have no duplicates, because I know what data I've inserted), I'll only use an NOT IN on the ID column, like this:

SELECT *
FROM Table1
WHERE ID NOT IN (
 
SELECT ID FROM Table2
)
GO

SELECT FROM Table1
EXCEPT
SELECT 
FROM Table2
GO 
 

This performs as one might expect vastly superior for the NOT IN syntax with a ratio of 23% vs. 77%.
However, suppose that I also needed to make sure that Value was the same and that I thus couldn't rely only on the ID, making the following query:

SELECT *
FROM Table1
WHERE ID NOT IN (
 
SELECT ID FROM Table2
)
OR 
[Value] NOT IN (
 
SELECT [Value] FROM Table2
)
GO

SELECT FROM Table1
EXCEPT
SELECT 
FROM Table2
GO 
 

And now the result changes with the NOT IN syntax taking 78% and the EXCEPT only 22%.
So if you need more then one value to be different, then it starts to look like you can use the EXCEPT with performance gain versus the NOT IN syntax.
If you only need to compare on one, it is a waste of ressources to use EXCEPT – lest it is a one time query of course. But it if is something which needs to run "often", I'd think twice.

Conclusions on the results of INTERSECT are exactly (well, the numbers differ of course) the same.
One column to compare, and it is better to use IN, like this:

SELECT *
FROM Table1
WHERE ID IN (
 
SELECT ID FROM Table2
)
GO
 

Compared to

SELECT FROM Table1
INTERSECT
SELECT 
FROM Table2
GO
 

If needing more values, then INTERSECT is better.

Bookmark and Share DotnetKicks dotnetshoutout

IN versus INNER JOIN

by Allan Svelmøe Hansen May 11, 2008 19:22

I decided to do some testing on inner joins versus the in syntax, meaning what to use if I need to select the content from one table out, based on content in another table.
So what I did was to create a couple of tables:
CREATE TABLE TestOne
(
 
ID1 INT NOT NULL,
 
Value1 NVARCHAR(255) NOT NULL
)
GO

CREATE TABLE TestTwo
(
 
ID2 INT NOT NULL,
 
Value2 NVARCHAR(255) NOT NULL,
 
ID1 INT NOT NULL
)
GO 
 

I then populated them with some test data (1000 rows for the first and 10000 rows for the second of pointless data). Note that I have added no indexes or restraints of any kind as I wanted the testing to be as base as possible. So then I basically pulled data out from TestOne filtered by ID1 having to be in TestTwo.
Like this:

SELECT *
FROM TestOne T1
WHERE
T1.ID1 IN (SELECT ID1 FROM TestTwo)
GO

SELECT T1.*
FROM TestOne T1
INNER JOIN TestTwo T2 ON T1.ID1 T2.ID1
GO
 

Then I looked at the estimated query plans and was a bit stupefied by the result.

(click to enlarge)
Estimated Execution Plan For IN versus INNER JOIN for first and second query
There was a massive difference between the two, and this shows that it is actually better to just use the IN operator instead of INNER JOIN when filtering. Of course if you actually need the content of the filtering table then you need to join, but just filtering – then IN looks clearly superior.
This result was made even more visible when I added a third table (similar to the second one) and joined that into the equation.

Then I wanted to make some more complex queries, because it is rare that I write such simple queries at work. But of course now it gets more “thought out” and fictive now, and I should have done this with real life work data to get a better indication. But this’ll have to do for now :)
Again note that no indexes, keys, restraints or anything are made.

So I populated the database two more tables (Four and Five) and these additional tables and data very similar to how I populated the first three.

Then I made the following queries

SELECT T1.*
FROM TestOne T1
WHERE T1.ID1 IN (
  
SELECT ID1 FROM TestTwo T2
  
WHERE ID2 IN (
   
SELECT ID2 FROM TestFour T4
   
WHERE ID4 IN (
    
SELECT ID4 FROM TestFive T5
   
)
  )
 )
GO

SELECT T1.*
FROM TestOne T1
INNER JOIN TestTwo T2 ON T1.ID1 T2.ID1
INNER JOIN TestFour T4 ON T2.ID2 T4.ID2
INNER JOIN TestFive T5 ON T4.ID4 T5.ID4
GO   
 

What I select out is TestOne filtered by TestTwo which is filtered by TestFour which again is filtered by TestFive.
This gave a similar result as before, but with a smaller margin.
Now it was 41% versus 59% in favor of the IN syntax.
(click to enlarge)
Estimated Execution Plan For IN versus INNER JOIN for third query
(click to enlarge)
Estimated Execution Plan For IN versus INNER JOIN for fourth query
This means that we are now entering a more case-by-case area where one can’t say for sure which method is best. Also personally – I would say that the INNER JOIN syntax makes for a more readable query, but that of course is personal preference.
But also imagine that I suddenly had to get the information from TestFour as well, for example – that could make quite a nasty query if I tried to use the IN syntax, whereas I could simply change the SELECT in the latter query to SELECT T1.*, T4.* and then done.

Then I tried to add some keys and indexes, but that did not change the result.

So in conclusion, if making simple filtering, then the IN syntax looks vastly superior against the INNER JOIN syntax, but more complex queries and if you need the data from the filtering tables, makes the conclusion more gray and would need a case-by-case conclusion.

Another time, I’ll try to dabble in using LEFT and RIGHT JOINs as filtering against the NOT IN syntax.

 

Bookmark and Share DotnetKicks dotnetshoutout

Powered by BlogEngine.NET 1.6.1.0
Theme by Mads Kristensen | Modified by Mooglegiant

About:
Allan Svelmøe Hansen

My real name is Allan Svelmøe Hansen.
I live in Denmark, where I work as a developer for hedal:kruse:brohus using SQL Server and the .NET framework since 2004.
My primary fields of expertise is back-end data integration, database design and optimization.


       View Allan Svelmøe Hansen's profile on LinkedIn     

Disclaimer

The opinions expressed herein are my own personal opinions and thoughts and does not represent my employers view in any way, nor are my results guaranteed for all situations.
Content is presented "as is", with no warranty.