The MERGE syntax - syntax and performance

by Allan Svelmøe Hansen May 21, 2010 15:40

Normally when you have rows from one table you want to move over into the other, you'll have to run both an update query to get your existing rows updated with the new values and insert query to get new rows over to your target. That means you have to write two queries. With the merge, you only need one as it performs both the update and the insert.
So let's take a look at it.

First, I'll create 2 tables with dummy data; a tblSource and a tblTarget.
The scripts for creating the tables and data can be found here merge_setup_20100521.sql (3.58 kb), but basically it's just a target and source table with an ID, ValA, ValB and ValC with a clustered index on ID and some dummy data.
So to do the update/insert it would look like this:
UPDATE MyTarget SET 
   
MyTarget.ValA MySource.ValA

   
MyTarget.ValB MySource.ValB

   
MyTarget.ValC 
MySource.ValC 
FROM dbo.tblTargetInUp AS 
MyTarget  
INNER JOIN dbo.tblSourceInUp AS MySource ON MyTarget.ID 
MySource.ID 

INSERT INTO 
dbo.tblTargetInUp 
SELECT IDValAValB
ValC 
FROM dbo.tblSourceInUp AS 
MySource  
WHERE NOT EXISTS (SELECT FROM dbo.tblTargetInUp AS MyTarget WHERE MyTarget.ID MySource.ID
)  
 

If we look at the execution plan, I get an estimated cost of 9.1754 for the update and 2.48725 for the insert, meaning a combined cost of 11.66265 for 33.333 rows updated and 33.3334 rows inserted, into a target of 66.666 rows.

Now do remember, the estimated cost is just a number for how the query runs in my environment, it can't be taken as a direct number and transferred to another system - I'm only interested in the relative comparison with the merge. More data, more indexes, more variations will all affect the actual numbers. Also if looking at the execution plan, it is clear it is two queries we fire, meaning that all the overhead which goes into running one query will be doubled for this. But it was how you'd have to do merges in the past.

Now, with the merge syntax we can do it like this:
MERGE dbo.TblTargetMerge AS MyTarget USING 
(   
   
SELECT FROM dbo.TblSourceMerge AS 
TS 
AS 
MySource 
ON MyTarget.ID 
MySource.ID 
WHEN MATCHED 
THEN  
   UPDATE SET 
       
MyTarget.ValA MySource.ValA

       
MyTarget.ValB MySource.ValB

       
MyTarget.ValC 
MySource.ValC 
WHEN NOT MATCHED 
THEN 
   INSERT  
   
(IDValAValBValC

   
VALUES 
   
(MySource.IDMySource.ValAMySource.ValBMySource.ValC

  

Note that the syntax takes both the update and insert in the WHEN MATCHED and WHEN NOT MATCHED.
For the complete overview over the syntax, I'll refer you to the documentation by microsoft: MERGE (Transact-SQL).
But basically – you MERGE into a table using a source, and then define the ON clause (as you would a join), and then specify the WHEN MATCHED and the WHEN NOT MATCHED clauses.

One thing I'll expand on myself though, is the OUTPUT clause which can also be coupled on to the merge. I mentioned the OUTPUT clause myself recently.
The important thing is that you can couple the $action to the output clause and get information about whether you merged the data or you inserted the data, meaning whether the row was matched or not matched.
Like this:

MERGE dbo.TblTargetMerge AS MyTarget USING 
<...snipped FOR being brief...

OUTPUT $ACTION
rest_of_select_here 
;  
  

Nifty.
Anyways - once we've build this query, we can look at the execution plan, and here we clearly see it is handled as one query.
And in my case, the estimated cost is 8.407 for the same number of rows as above, meaning we’ve saved about 27.9% just by changing syntax.

Now, this structure I used to compare is very simple - with a simple matching of ID to ID and then just insert/overwrite everything so the actual result may vary (naturally), however the merge syntax does appear to be faster and with the added bonus of keeping the query combined into one syntax rather than divided into two different queries.
 
I must admit, I do like the merge syntax myself once I learned to read and write them.

Bookmark and Share DotnetKicks dotnetshoutout

Using INNER JOIN/SELF JOIN to allow for smaller indexes.

by Allan Svelmøe Hansen May 15, 2009 16:26

In databases, I often need foreign keys in my tables, because I’ll want to use them to select content out from my tables. However this can often result in either bad index utilization in the selection or making additional indexes based on the foreign key(s) and the content I need to select out.
This in turn can result in ending up with “many” indexes and sometimes many big indexes.

However a method can be to make smaller indexes and use an INNER JOIN to join into the table an extra time.

I’m going to show the pattern with a relative simple example to illustrate, because it is about the pattern more so then the specifics layout, content and size of the table.
It is just a pattern/technique to keep in mind and have in the SQL toolbox.
Suppose you have two tables of a similar pattern to this:

Tables for self join example
(click for larger size)

I’ve let the SQL Server create my clustered indexes based on the primary key, which means I have a clustered index over my primary key(s).
When having to look up rows in JoinOne based on the foreign key, it’ll often look like this:

SELECT 
FROM 
JoinOne T1
WHERE FKOne <VALUE>
  

Because I’ve added no other indexes to the tables, I’ll get an Index Scan or Table Scan when running the above query, which - as we - know is not a good thing most of the time.
This usually leads to the creating of a second index which indexes over my foreign key.
However if the table is large, and if I’ll need to extract many/most of the columns (like in my example with SELECT *), it can mean I’ll have to make either a complete index or one with many included columns, just ordered by my foreign key first.

If I make a small index with my foreign key and my primary key second, I can use this to join into my table again, and then after that use my primary key.
This is an index to illustrate:

Index for self join example 
(click for larger size)

Then I can make the following query:

SELECT T2.
FROM 
JoinOne T1
INNER JOIN JoinOne T2 
ON
 
T1.PKOne 
T2.PKOne
 
AND T1.PKTwo 
T2.PKTwo
 
AND T1.PKThree 
T2.PKThree
WHERE T1.FKOne <VALUE>
  

When looking at its execution plan, it’ll show two index seeks instead of my previous index scan. This way I have a small index as possible, but maintain seeks in my execution.
Of course there are some considerations one need to take with this pattern. Firstly – if the tables are “small”, then the scan in itself might be alright, or the overhead of keeping a complete index for the foreign key is neglible.

However the advantage of the pattern is that I can have smaller indexes on (very) large tables, which means less overhead when inserting/updating – but still have only index seeks in my selections execution plans.
It is a useful technique in my opinion – when used at the right times, which of course is on a case by case evaluation as always.  

Bookmark and Share DotnetKicks dotnetshoutout

Using RANK() and CTE

by Allan Svelmøe Hansen June 05, 2008 19:53
Despite having used SQL Server for quite a while I’ve not really had much practical usage for many of the new features. Lately however I’ve started to use Common Table Expressions (CTE) much more and I’m absolutely starting to love them. Another feature I am starting to love are the ranking functions, RANK().

Basically speaking CTEs are subqueries which you can reuse, so practically they remind me very much of temporary tables and similar techniques. But they not only provide vastly better clarity (in my view), but they can also perform better. So I’m going to be turning more attention towards CTEs and their usage and hopefully some performance checks in this and hopefully future blog posts.
RANK() is a function which enables you to – well rank – groups (aka partitions) of data, providing you with a powerful tool for solving problems. It is a good syntax to have in your problem solving toolbox.

The problem I was faced with at work was that we had some price data which we needed to clean up.
We had a number of products which had a number of prices. And out of these combinations we only wanted to keep the latest ones. So utilizing some of the new things in SQL Server 2005 we were able to solve it. For this problem we used RANK and CTE’s. Granted we could have solved it without using CTE’s but it was here I started seeing the benefits of using these common table expressions, and the solution turned out to be elegant thanks to these techniques.

We had a table which simplified consisted of a PriceID (identity), a price value and a product id foreign key.
Design of Pricetable


Because we only wanted to keep the “newest” Price we knew that we only wanted the highest PriceID from each “group” of Products.
Then I started to look at the GROUP BY clause on ProductID, but it left me with a problem of getting the highest PriceID for each group, so I could delete the rows from Price which weren’t selected.
I then remembered the RANK function, which I’d noticed a few times, and the pieces of the puzzle simply clicked together.

What I did was selecting the RANK on the partition of ProductID (basically, you group the RANKs on the partition). And because I wanted to keep the highest PriceID I ordered by that field descending. Then I also selected the ProductID and the PriceID in my query. Like this:

SELECT
 
ProductIDPriceID,
 
RANK() OVER (PARTITION BY ProductID ORDER BY PriceID DESCAS R
FROM Price 
    

Now this gave me a result akin to this (PriceID, ProductID, R)
Sample result from query

You’ll see that I have now got each ProductID, PriceID with a grouped rank in my resultset. And then it is merely a matter of selecting the rows which have a rank higher then 1.
For this I used the CTE syntax like this:

WITH priceCTE AS (
 
SELECT
  
ProductIDPriceID,
  
RANK() OVER (PARTITION BY ProductID ORDER BY PriceID DESCAS R
 
FROM Price    
)
SELECT FROM priceCTE WHERE 1
GO
 

As can be seen the CTE is declare using the WITH <NAME> AS syntax. Then the contens of the parenthises provides the subquery which you then can use after the paranthises.
Then it was just chancing the SELECT to a DELETE and I had achieved my goal. Simple and clean, and most of all – easy to read again in a month or two, incase I’m faced with a similar problem.

Then as always I was interested in how this performed against a temporary table variable so I simply made a dublicate of this query but instead of a CTE, I placed the subquery into a table variable, and ran the same.

DECLARE @T AS TABLE(ProductID INTPriceID INTINT)
INSERT INTO @T
SELECT
 
ProductIDPriceID,
 
RANK() OVER (PARTITION BY ProductID ORDER BY PriceID DESCAS R
FROM Price

SELECT FROM @T WHERE 1
GO
 

Then I took a look at the estimated query plans, and it listed the CTE one as using 47% of the batch and the table variable using 53%. So there is a performance benefit which I’ll look more into at a later point, although not major one in this case.
Bookmark and Share DotnetKicks dotnetshoutout

INTERSECT/EXCEPT versus IN/NOT IN

by Allan Svelmøe Hansen May 21, 2008 19:59

SQL Server has two commands useful for finding similar rows and different rows between two tables.

These commands are EXCEPT and INTERSECT. They work by comparing the entire row of one query/table with another query/table.
EXCEPT then returns the rows which are in the first, but not the second table and INTERSECT returns the rows from the first which are also in the second table. But how do they perform?

In real life I find it is rare that I have to compare two tables or queries (or more) which possible have completely indentical data, but it does happen once in a while. However most of the time, these rows also have a unique primary key or otherwise index, so most of the time I use IN and NOT IN myself.
So I'll try EXCEPT/INTERSECT against IN/NOT IN and see how they fare.

I've made two identical tables, Table1 and Table1 both with an ID (int) and Value (nvarchar(10)), and filled them with some test data, ensuring a number of identical rows and a number of different rows using the following query:

DECLARE @I INT
SET 
@I 0

WHILE @I 10000 BEGIN
 IF 
@I <> 0
  
INSERT INTO Table1 VALUES (@IREPLICATE('A'10))
 
IF @I <> INSERT
  INTO 
Table2 VALUES (@IREPLICATE('A'10))
 
SET @I @I 1
END
 

I've not used any indexes et al, because it isn't needed. I am only after the comparable values.

Anyway, using what I know of common problems from work, I now want to find rows in Table1 which aren't in Table2.
Because I know I have "ID" which I assume right now is a unique key (I have no duplicates, because I know what data I've inserted), I'll only use an NOT IN on the ID column, like this:

SELECT *
FROM Table1
WHERE ID NOT IN (
 
SELECT ID FROM Table2
)
GO

SELECT FROM Table1
EXCEPT
SELECT 
FROM Table2
GO 
 

This performs as one might expect vastly superior for the NOT IN syntax with a ratio of 23% vs. 77%.
However, suppose that I also needed to make sure that Value was the same and that I thus couldn't rely only on the ID, making the following query:

SELECT *
FROM Table1
WHERE ID NOT IN (
 
SELECT ID FROM Table2
)
OR 
[Value] NOT IN (
 
SELECT [Value] FROM Table2
)
GO

SELECT FROM Table1
EXCEPT
SELECT 
FROM Table2
GO 
 

And now the result changes with the NOT IN syntax taking 78% and the EXCEPT only 22%.
So if you need more then one value to be different, then it starts to look like you can use the EXCEPT with performance gain versus the NOT IN syntax.
If you only need to compare on one, it is a waste of ressources to use EXCEPT – lest it is a one time query of course. But it if is something which needs to run "often", I'd think twice.

Conclusions on the results of INTERSECT are exactly (well, the numbers differ of course) the same.
One column to compare, and it is better to use IN, like this:

SELECT *
FROM Table1
WHERE ID IN (
 
SELECT ID FROM Table2
)
GO
 

Compared to

SELECT FROM Table1
INTERSECT
SELECT 
FROM Table2
GO
 

If needing more values, then INTERSECT is better.

Bookmark and Share DotnetKicks dotnetshoutout

NOT IN versus LEFT JOIN

by Allan Svelmøe Hansen May 18, 2008 08:36
As you may or may not know, it is possible to use LEFT JOIN (and thus RIGHT JOIN) to filter out tables, just as it is with the NOT IN.
However, how do they perform against each other?
Well, using similar data as created in the "In versus INNER JOIN" example, I made two simple queries to see:

SELECT *
FROM TestOne
WHERE ID1 NOT IN
(
 
SELECT ID1 FROM TestTwo
)
GO

SELECT T1.*
FROM TestOne T1
LEFT JOIN TestTwo T2 ON T1.ID1 T2.ID1
WHERE T2.ID1 IS NULL
GO
 

This gave the following execution plan:
(click to enlarge)
Estimated Execution Plan For NOT IN versus LEFT JOIN for first and second query
As can be seen the margin of difference - 46% versus 54% - is much smaller then with the IN syntax as linked above.
So I tried more complex queries where I just took the ones I made for the In versus INNER JOIN article, and modified them to be NOT IN and LEFT JOINS. Like this:


SELECT T1.*
FROM TestOne T1
WHERE T1.ID1 NOT IN (
  
SELECT ID1 FROM TestTwo T2
  
WHERE ID2 NOT IN (
   
SELECT ID2 FROM TestFour T4
   
WHERE ID4 NOT IN (
    
SELECT ID4 FROM TestFive T5
   
)
  )
 )
GO

SELECT T1.*
FROM TestOne T1
LEFT JOIN TestTwo T2 ON T1.ID1 T2.ID1
LEFT JOIN TestFour T4 ON T2.ID2 T4.ID2
LEFT JOIN TestFive T5 ON T4.ID4 T5.ID4
WHERE T2.ID1 IS NULL
AND 
T4.ID2 IS NULL
AND 
T5.ID4 IS NULL
GO 


This gave a queryplan where the LEFT JOIN actually performened better, not by much but still 52 versus 48%. (I've minimized the query plans because I was only interested in the numbers, and not in the execution plan itself.)
(click to enlarge)
Estimated Execution Plan For NOT IN versus LEFT JOIN for third and fourth query
So where the case was more in favore of using IN instead of INNER JOINs when filtering inclusive, then exclusive looks to be more evenly matched when it comes to NOT IN versus the LEFT JOIN syntax.
But as always, use a case by case decision, by looking at your own querys details. This is just an example and guideline numbers.
Bookmark and Share DotnetKicks dotnetshoutout

Powered by BlogEngine.NET 1.6.1.0
Theme by Mads Kristensen | Modified by Mooglegiant

About:
Allan Svelmøe Hansen

My real name is Allan Svelmøe Hansen.
I live in Denmark, where I work as a developer for hedal:kruse:brohus using SQL Server and the .NET framework since 2004.
My primary fields of expertise is back-end data integration, database design and optimization.


       View Allan Svelmøe Hansen's profile on LinkedIn     

Disclaimer

The opinions expressed herein are my own personal opinions and thoughts and does not represent my employers view in any way, nor are my results guaranteed for all situations.
Content is presented "as is", with no warranty.