Using INNER JOIN/SELF JOIN to allow for smaller indexes.

by Allan Svelmøe Hansen May 15, 2009 16:26

In databases, I often need foreign keys in my tables, because I’ll want to use them to select content out from my tables. However this can often result in either bad index utilization in the selection or making additional indexes based on the foreign key(s) and the content I need to select out.
This in turn can result in ending up with “many” indexes and sometimes many big indexes.

However a method can be to make smaller indexes and use an INNER JOIN to join into the table an extra time.

I’m going to show the pattern with a relative simple example to illustrate, because it is about the pattern more so then the specifics layout, content and size of the table.
It is just a pattern/technique to keep in mind and have in the SQL toolbox.
Suppose you have two tables of a similar pattern to this:

Tables for self join example
(click for larger size)

I’ve let the SQL Server create my clustered indexes based on the primary key, which means I have a clustered index over my primary key(s).
When having to look up rows in JoinOne based on the foreign key, it’ll often look like this:

SELECT 
FROM 
JoinOne T1
WHERE FKOne <VALUE>
  

Because I’ve added no other indexes to the tables, I’ll get an Index Scan or Table Scan when running the above query, which - as we - know is not a good thing most of the time.
This usually leads to the creating of a second index which indexes over my foreign key.
However if the table is large, and if I’ll need to extract many/most of the columns (like in my example with SELECT *), it can mean I’ll have to make either a complete index or one with many included columns, just ordered by my foreign key first.

If I make a small index with my foreign key and my primary key second, I can use this to join into my table again, and then after that use my primary key.
This is an index to illustrate:

Index for self join example 
(click for larger size)

Then I can make the following query:

SELECT T2.
FROM 
JoinOne T1
INNER JOIN JoinOne T2 
ON
 
T1.PKOne 
T2.PKOne
 
AND T1.PKTwo 
T2.PKTwo
 
AND T1.PKThree 
T2.PKThree
WHERE T1.FKOne <VALUE>
  

When looking at its execution plan, it’ll show two index seeks instead of my previous index scan. This way I have a small index as possible, but maintain seeks in my execution.
Of course there are some considerations one need to take with this pattern. Firstly – if the tables are “small”, then the scan in itself might be alright, or the overhead of keeping a complete index for the foreign key is neglible.

However the advantage of the pattern is that I can have smaller indexes on (very) large tables, which means less overhead when inserting/updating – but still have only index seeks in my selections execution plans.
It is a useful technique in my opinion – when used at the right times, which of course is on a case by case evaluation as always.  

Bookmark and Share DotnetKicks dotnetshoutout

Partition removed by adding index

by Allan Svelmøe Hansen April 29, 2008 13:17

During some testing of partitioned tables to give myself more insight in the possibilities and effects (more numbers), I ran into something weird.
I created tables as I did in my earlier testing, however now when I added indexes to the tables, I found all performance gain was gone, and when I removed indexes, I noticed my tables were no longer partitioned.
Testing some more, backtracking my steps, I found out that it was the creation of indexes which caused my partitioned tables to become non-partitioned.

I have created the tables via SQL scripts but I created the indexes via SQL Server Management Studio (SSMS as it is usually abbreviated). This creation of indexes did not take into considerations that I had partitioned tables, because I found that adding an index changed the table to non-partitioned.

Then I started to read the documentation for create index SQL command (try first, then read the documentation …… isn’t that how it usually goes? Innocent ) and there I read that I could specify an partition scheme.

CREATE CLUSTERED INDEX <INDEX NAME>
ON TABLE  (<COLUMNS>)
ON <PARTITION SCHEME>

So while I did know that you could partition your indexes, I was not aware of the fact that apparently SSMS alters your table and removes the partitioning somehow.
Annoying, but at least now I know.

Bookmark and Share DotnetKicks dotnetshoutout

DateTime and Indexes.

by Allan Svelmøe Hansen January 28, 2008 07:30

Lately I’ve been reading some contradicting information about SQL indexes.
This time it was that DateTime fields in a database was prime candidates for usage as index keys, and that DateTime wasn’t prime candidates.

The reasoning behind the first statement looks to be that if you have a date and/or time stamp that you use in your data, you often filter on said data, thus making it advisable to use as index.
The reasoning behind the contrary statement that DateTime data is not advisable to use as index keys is that because a DateTime is stored with time down to milliseconds, and it is rare that you actually need that type of precesion in your queries, you will induce index scans more often by having the DateTime in your index.

Color me confused and call me Barney. So I decided to try and gain an understanding of these two contradicting statements to see whether they both were correct. Not that I think I can prove anything, but hey, I need to satisfy my own curiosity a bit.

So I decided to make a table:

My table for testing DateTime indexes

And filled it with testdata.

DECLARE @i INT
SET 
@i 0
DECLARE @d DateTime
DECLARE @d2 DateTime
WHILE @i 10000 BEGIN
 SET 
@d GETDATE()
 
SET @d DATEADD(dd@i 4@d)
 
SET @d2 CONVERT(VARCHAR(16), @d112)
 
INSERT INTO DateTimeTest
 
VALUES
 
('abcdefg' STR(@i) + 'h' STR(@i 4), @d@d2)
END 
 

You’ll notice that I have two fields with DateTime (one and two), and this is because I wanted to clean one of the dates for its time representation to see if that made any difference. (Basically, it didn’t :))

I then made 2 indexes (unique, non-clustered). IX1 was made containing DateTimeOne and id, and IX2 was made containing DateTimeTwo and id (remember, DateTimeTwo contains the same date as DateTimeOne, but the time representation is set to 00:00:000)

I then ran a series of queries to see what result I would get in the execution plan.
For example:

DECLARE @d DateTime
SET @d '2008-01-29'
SELECT id
FROM  DateTimeTest
WHERE DateTimeOne >= @d

SELECT id
FROM  DateTimeTest
WHERE DateTimeTwo >= @d 
 

This provides the same result for both queries, both seeking the index. I tried the same when selecting the DateTimeOne and DateTimeTwo values out in their respective queries and got the same result. 2 times Index seeks, and both queries with the same numbers in the execution plan.
 
So this had me starting to believe that the first case from the starting conundrum is true. However...what if you only needed rows from a specific day or similar information?
 
So I ran some queries along this type:

DECLARE @d DateTime
SET @d DATEADD(dd,2GETDATE())
SELECT id
FROM  DateTimeTest
WHERE DATEPART(ddDateTimeOne) < DATEPART(dd@d)

SELECT id
FROM  DateTimeTest
WHERE DATEPART(ddDateTimeTwo) < DATEPART(dd@d
 

And lo and behold – I got my elusive index scans. Naturally I would say, because now I could not use the indexes as they were build, but had to compute on them, making them suboptimal.

So it looks that there are truths in both statements, and it depends on how you use your data, but that is basically true for any index. To make the optimal indexes, one should always analyze the data.


Now that I was in the middle of all this testing, I started to wonder that since the internal representation of a datetime field is 2x4 bytes of integer, then – could that be used to speed up the queries.
So I made a third date field, but in this I placed the value of DateTimeOne cast as an integer.
I then ran the queries:

DECLARE @d DateTime
SET @d DATEADD(dd,2GETDATE())
SELECT id
FROM  DateTimeTest
WHERE DateTimeOne @d

SELECT id
FROM  DateTimeTest
WHERE DateTimeOneInt CAST(@d AS integer
 

To my surprise, this performed equally well for both queries, both showing the same numbers in the execution plan. It wasn’t until I had to select the DateTimeOneInt as a DateTime and thus had to convert in my SELECT statement that a small penalty was incurred. I also found that I did infact not really need to cast @d to an integer. I could just use it in the WHERE statement directly.

 

Bookmark and Share DotnetKicks dotnetshoutout

Learning about execution plans. (part 2)

by Allan Svelmøe Hansen January 18, 2008 11:15

Reflecting over the workings of indexes I learned the other day, I started to draw some conclusions. Some might be premature, but well – numbers seems to back it up.

Firstly, SQL Server 2005 automatically builds an index when you assign primary keys. If you do not do anything, it will by default be a clustered index, because primary keys usually are the prime candidate for clustered indexes, as those are the values you assumed would use to get data via. However, you can specify that an index should be non-clustered, but I’ve not found a way to make it non-unique (which is why you can’t have duplicate values in your PK), and I do not know whether that is at all possible.

Anyways, because of this, it starts to matter which order you create your columns in, in the table. That is because I’ve not yet found a way to change the order of columns in a primary key index, whereas in a normal created index, I can shift the order of the columns to my desire.

This means, that unless you analyze the data and thus know how you'll use it, you actually risk ending up with ineffective indexes, which can and will hurt your performance.
The example I showed in my part one article is a prime example of this.
If looking at the table definition, I had TestTwoID, TestTwoValue and TestOneID (the FK) in that order. In that example, I solved my inefficiency by making a custom clustered index (after I made the PK index non-clustered).
However, suppose for a moment that it had been real life data, then an analyzing of the data could have told me that I’d properly use the FK (TestOneID) as a means to look up data from TestTwo table. Then had I created the table with TestOneID as the first column, and made a combined PK of TestOneID and TestTwoID, I would have had an effective clustered index already, and wouldn’t have needed to make non-unique/new clustered as I did.

Now this might seem like a small thing, however remember that indexes takes up space. So by having just one proper clustered index, then if nothing else, you save space. Small “randomish” testing for me showed almost a doubling of space used for indexes in my test database when I had the PK non-clustered index and the custom clustered, compared to a proper clustered index. So if these findings are extrapolated to a working production database, it can be significant space.

As for the unique versus non-unique then as far as I’ve read, SQL Server internally only keeps unique indexes, so if you specify an index shouldn’t be unique, SQL Server will automatically append 4 bytes of data to the index to make sure it is unique. And seeing as 4 bytes matches an integer, then it looks to be that you yourself could enforce this uniqueness by having a unique integer value on your table. Following this logic, it could be thought out that if you know your table will not grow too large, then a smallint (2 bytes) should be able to save you some space. However, small random testing did not really show this to be true for me, so whether that is accurate or not – I can’t say with certaincy.

Also, because you also use extra space for keeping stuff in your table, then perhaps it goes to show that contrary to the school learning I received “back then”, that perhaps a primary key is not always the way to go. Primary keys should as far as I’m concerned only be used when needed, and then you can create the clustered indexes yourself.
But that is an entirely new discussion.

Bookmark and Share DotnetKicks dotnetshoutout

Learning about execution plans. (part 1)

by Allan Svelmøe Hansen January 15, 2008 16:09

I started to look more closely at the execution plans and how they worked and what I could use them for.
And while I’m still very much a novice within database optimizing and all that, I am learning, and one thing I’ve read so far is that scans are bad, seeks are good.
Scans means that SQL Server is – well scanning – all the datapages which either a table or an index consists off.  An index seek means that SQL Server could follow the index tree right down to the leaf level and pull the data from there…… at least that is my current understanding (remember, I’m  a novice)
Anyways, I made two tables to test a little.

(click to enlarge)
Database diagram for TestOne and TestTwo used in this example

TestOne has 10.001 rows of dummy data and TestTwo has 30.001 rows where generally each PK from TestOne is about 3 times in TestTwo.
The automatically created indexes are used first. This means a clustered index on the PK for TestOne and a clustered index on the PK for TestTwo.
Then I performed the very simple, yet realistic, query:

SELECT FROM TestOne T1
INNER JOIN TestTwo T2 ON T1. TestOneID T2.TestOneID
WHERE T1.TestOneID 4567  
 

(4567 is just a random number I used, could have used anything from 1 to 10.001
This gave the following estimated execution plan:

(click to enlarge)
Estimated Execution Plan 1

From that it is visible that the TestTwo gets an Index Scan which is because I’m looking for a value which isn’t directly visible from the clustered index on TestTwo, so SQL Server has to scan through the entire number of datapages to find the value(s) I’m looking for. On the other hand we can see that TestOne performs an Index Seek, which is because the key we are looking for is inside the Clustered Index, so SQL Server can look up the value down to the leaf node of the index.

Of interesting values can be said that the entire query has estimated cost of 0.56509, with the Clustered Index Seek (TestOne table) has an estimated CPU cost of 0.0001581 and estimated I/O cost of 0.003125.
TestTwo table, the Index Scan, has an estimated CPU cost of 0.0331581 and 0.514236 I/O cost.
(click to enlarge)
Estimated Cost for Index Scan on TestTwo
These numbers show clearly that it is the I/O cost of the Index Scan on TestTwo table which is the culprit and the main focus for optimizing this query.



Then I messed about a bit with the indexes on TestTwo, because that was the table which needed focus.
I changed the PK index to a non-clustered index, and created a new clustered Index for the columns of TestOneID and TestTwoID.
There isn’t much idea in putting TestTwoID into the clustered index, but I did it so I could enforce the unique constraint on the index myself without having SQL Server itself putting extra bytes upon each row. Space size it doesn’t matter much in my example, and it isn’t (wasn’t) my main focus anyway…

(click to enlarge)
Changed Index for TestTwo table

I then ran the exact same query once more and got this encouraging result

(click to enlarge)
Estimated Execution Plan 2
As can be seen the entire query now has estimated cost of 0.0065809, and my Index Scan has now become an Index Seek. (And gone from taking 97% of the total resources to 50%)

The other relevant numbers are I/O on my TestTwo seek, which now only cost an estimated 0.003125 and estimated CPU cost which is 0.0001603. Any way one can view it, those numbers are drastically lower then prior, and thus my index changes has had influences.

Of course there are costs associated with indexes, both in terms of size and the added overhead to updating/maintaining the indexes et al, so you can't just stuff all you want into an index. So whether or not altered indexes are beneficial in a specific situation, is something which must be analyzed on a case by case basis. However this example does go to show that there is major benefits in the lookup though, and indexes are very important (of course) when it comes to databases… hence why I’m interested in them myself :)

Bookmark and Share DotnetKicks dotnetshoutout

Powered by BlogEngine.NET 1.6.1.0
Theme by Mads Kristensen | Modified by Mooglegiant

About:
Allan Svelmøe Hansen

My real name is Allan Svelmøe Hansen.
I live in Denmark, where I work as a developer for hedal:kruse:brohus using SQL Server and the .NET framework since 2004.
My primary fields of expertise is back-end data integration, database design and optimization.


       View Allan Svelmøe Hansen's profile on LinkedIn     

Disclaimer

The opinions expressed herein are my own personal opinions and thoughts and does not represent my employers view in any way, nor are my results guaranteed for all situations.
Content is presented "as is", with no warranty.