Tuesday, July 28, 2009

The Real SQL Pages, A Beginner’s Guide To Indexing

Okay, so the title may be a little lame :-) , but I think it suits the basic analogy to describe indexing quite well. Today I was asked on the MSDN forums, how indexing really works, in SQL Server. I began type a lot of techno babble and then thought to myself, will the OP even understand what I am talking about? The answer is probably not. I deleted all my text and came up with a simple analogy that is easy for everyone to visualize. Essentially, indexing in SQL Server works like a phonebook. That’s right ladies and gentleman, a phonebook.

How is indexing like a phonebook? Imagine that you have a phonebook in front of you and you have a handful of bookmarks. If you want to find all the phone numbers where the person’s name begins with A, you will place a bookmark at the first page of that letter and on each subsequent page of that letter. If you want a more granular bookmark, you can create a composite key, that contains information about names, within the letter A. These other keys will provide density details that allow the optimizer to make better cardinality estimates, or in your case allow you to find the pages you are looking for in the phonebook more quickly. For example, you have a composite index on (A,Adam’ Haines). The first letter will allow you to quickly identify all letters that start with A and the second column will allow you to more quickly identify names that are equal to ‘Adam Haines’ . You cannot directly retrieve the phone number for ‘Adam Haines’ , without using the letter bookmark because ‘Adam Haines’ is contained within the letter A. Essentially, we have use one of our letter A bookmarks and then jump to ‘Adam Haines’. Think about it this way. If I asked you to find all the phone numbers where the person’s name started with the letter A, you would place bookmarks on each page where the name starts with the letter A. If I asked you to find all the phone numbers where the person name is ‘Adam Haines’, your bookmarks would be useless because you are looking for a specific name and not a letter. Without knowing the letter, we have to flip through or scan the pages to find the value. If you want to be able to search for a specific name, you have to place a bookmark on the page where the name exists. Let’s see this in action.

We will start by creating our table structure:

IF EXISTS(SELECT 1 FROM sys.tables WHERE NAME = 'PhoneBook')
BEGIN
    DROP TABLE dbo.PhoneBook;
END
GO
 
CREATE TABLE dbo.PhoneBook(
Letter CHAR(1),
Name VARCHAR(100),
PhoneNumber VARCHAR(12),
PRIMARY KEY (Letter,Name,PhoneNumber)
);
GO
 
INSERT INTO dbo.PhoneBook VALUES ('A','Adam''s DB Consulting','555-555-5555');
INSERT INTO dbo.PhoneBook VALUES ('A','Alex BI Consulting','555-555-1234');
GO

Now we can run a few simple queries. The first query will filter the data using a filter on the letter A. This query will yield an index seek because Letter is the first column of our index.

SELECT *
FROM dbo.[PhoneBook]
WHERE [Letter] = 'A' --Index seek

Results:

image

In the second query will use the same letter, but we will also include a specific name we want to find. The query plan is the same as the prior. On larger datasets, this query should be less costly because we are using a more granular predicate, so the optimizer should be able to find the row more quickly.

SELECT *
FROM dbo.[PhoneBook]
WHERE [Letter] = 'A' AND Name = 'Adam''s DB Consulting' --Index seek

image

In the third query we will try searching for just the specific name.

SELECT *
FROM dbo.[PhoneBook]
WHERE Name = 'Adam''s DB Consulting' --Clustered Index scan
GO

Results:

image

Wait what the heck happened? The optimizer scanned the index even though the Name column is included in the index key!!!!! The answer is quite simple. As I stated earlier, the only column that is “seekable” is the first column in the index key. The other columns are used for density vectors, which help with cardinality estimates. In no way do the other columns help satisfy a predicate, where the first column is not present. Think back to where we put that bookmark. We put the bookmark on the letter A and the names within A, but we did not include the letter, in our search. How do you know where that name is specifically, without knowing the letter? The reality is we cant, so we have to scan. This is why we have to create an index on the name column.

--Create an index on Name to get an index seek
CREATE UNIQUE NONCLUSTERED INDEX ncl_idx_PhoneBook_PhoneNumber ON dbo.[PhoneBook](Name);
GO

Now lets run the same query.

SELECT *
FROM dbo.[PhoneBook]
WHERE Name = 'Adam''s DB Consulting' --Nonclustered Index seek
GO

Results:

image

I hope this analogy has made things a little clearer. Now this by no means all there is to know about indexing, but it is a good fundamental look at how they work. Let me know if you guys have anything to add, or anything I may have missed.

Happy coding.

2 comments:

Brian Tkatch said...

Thanx Adam, i love Back to Basics blog posts.

Even if it is simple, it can always use review. And who know, maybe something simple was overlooked the first time.

Know to (fully) understand clustered vs non-clustered.

Kent Waldrop said...

Adam,

Good post. Your example of searching for name without the letter is the scenario in which an Oracle "Skip Scan" might apply.

Skip scan is useful when the leading edge column of the index is not explicitly involved in the filtering and has a low cardinality -- which would probably be the case here.

I am skeptical about how useful skip scans are, but they can be faster than an index "full scan".

Once every few years I might coerce a non-production query into simulating a "skip scan" by joining a domain table -- if one is available -- to the target table so that I am now including the first column of the index in the lookup.

I cannot think of anytime I have done this with production code.

Pure novelty I guess.

Kent Waldrop