A few months ago, a client I was working with had a dilemma with one of their search queries. A query that previously took 10-15 seconds to complete would now take at least 45 seconds and sometimes even longer.
This change occurred shortly after a release in which the distribution of data in the database changed dramatically. Instead of data being evenly distributed across all of the tables, a few tables and indexes now had single-value concentrations that accounted for 50-75% of the index.
To mimic this behavior, we’ll create a copy of the SalesOrderDetail table in AdventureWorks that is weighted towards a single ProductID. The following script accomplishes this:
IF OBJECT_ID('dbo.SalesOrderDetail') IS NOT NULL
    DROP TABLE dbo.SalesOrderDetail
GO

SELECT sad.*
INTO SalesOrderDetail
FROM Sales.SalesOrderDetail sad

INSERT INTO SalesOrderDetail
    (SalesOrderID
    ,CarrierTrackingNumber
    ,OrderQty
    ,ProductID
    ,SpecialOfferID
    ,UnitPrice
    ,UnitPriceDiscount
    ,LineTotal
    ,rowguid
    ,ModifiedDate)
SELECT sad.SalesOrderID
    ,sad.CarrierTrackingNumber
    ,sad.OrderQty
    ,sad.ProductID
    ,sad.SpecialOfferID
    ,sad.UnitPrice
    ,sad.UnitPriceDiscount
    ,sad.LineTotal
    ,sad.rowguid
    ,sad.ModifiedDate
FROM Sales.SalesOrderDetail sad
CROSS JOIN (SELECT TOP 20 * FROM Sales.SalesOrderHeader) soh
WHERE sad.ProductID = 870
GO

CREATE CLUSTERED INDEX tix_SalesOrderDetail_SalesOrderIDSalesOrderDetailID
ON SalesOrderDetail (SalesOrderID, SalesOrderDetailID)
GO

CREATE UNIQUE NONCLUSTERED INDEX tix_SalesOrderDetail_SalesOrderDetailID
ON SalesOrderDetail (SalesOrderDetailID)
GO

CREATE NONCLUSTERED INDEX tix_SalesOrderDetail_ProductID
ON SalesOrderDetail (ProductID)
GO
Also, create the following index:
CREATE INDEX IX_Product_NameProductSubcategoryID
ON Production.Product (Name)
INCLUDE (ProductSubcategoryID)
This index removes a Key Lookup that doesn’t affect the demonstration; it simply simplifies the plan a little to better highlight the issue.
Now that we have data to work with, let’s begin with a mock-up of the search query that the client was using. As you may have guessed, I am using the AdventureWorks database for this demonstration.
The query that was being executed was fairly simple. It joined a few tables together to gather some base data. In this case, we will be looking at products and the category that each product belongs to. We also want to include a count of the number of orders that include the product. For our “search” we are only going to return the product Water Bottle – 30 oz. This is the product that is taking up most of the table.
This query looks like the following:
SELECT ps.Name
    ,p.ProductID
    ,p.Name
    ,COUNT(*)
FROM Production.ProductCategory pc
INNER JOIN Production.ProductSubcategory ps ON pc.ProductCategoryID = ps.ProductCategoryID
INNER JOIN Production.Product p ON ps.ProductSubcategoryID = p.ProductSubcategoryID
LEFT OUTER JOIN SalesOrderDetail sad ON p.ProductID = sad.ProductID
WHERE p.Name = 'Water Bottle - 30 oz.'
GROUP BY ps.Name
    ,p.ProductID
    ,p.Name
ORDER BY p.Name
GO
This is a fairly common construct when querying. I’ve written queries just like this for years. If you take a look at the execution plan you’ll notice something that maybe you weren’t aware of.
The rows returned from SalesOrderDetail are joined to the Product and ProductCategory data before they are aggregated. In most cases this isn’t a big deal, but in some rare cases it is, and with the client I was working with that day it definitely was. In their database, it wasn’t 98K rows that were expanded out; it was millions of rows, and this was leading to some serious performance issues.
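One way to see the effect of that row expansion for yourself (a sketch of my own, not from the original post; it assumes AdventureWorks with the SalesOrderDetail copy created by the setup script above) is to run the search query with SET STATISTICS IO enabled and look at the logical reads against SalesOrderDetail:

```sql
-- Sketch: SET STATISTICS IO reports logical reads per table in the
-- Messages tab, which makes the pre-aggregation expansion of the
-- SalesOrderDetail rows easy to spot.
SET STATISTICS IO ON;

SELECT ps.Name
    ,p.ProductID
    ,p.Name
    ,COUNT(*)
FROM Production.ProductCategory pc
INNER JOIN Production.ProductSubcategory ps ON pc.ProductCategoryID = ps.ProductCategoryID
INNER JOIN Production.Product p ON ps.ProductSubcategoryID = p.ProductSubcategoryID
LEFT OUTER JOIN SalesOrderDetail sad ON p.ProductID = sad.ProductID
WHERE p.Name = 'Water Bottle - 30 oz.'
GROUP BY ps.Name
    ,p.ProductID
    ,p.Name
ORDER BY p.Name;

SET STATISTICS IO OFF;
```

Comparing these numbers before and after a rewrite gives a more concrete measure than eyeballing the plan shape alone.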
I often watch webcasts and one that I had watched around the time this issue came up was Rob Farley’s (Blog | @Rob_Farley) Designing for Simplification. In this webcast, Rob talks about the rules for having a join. Looking at the query plan above, the LEFT OUTER JOIN is duplicating rows. But the real purpose of that JOIN is to introduce a value.
We have a single column to aggregate, with a COUNT(). One of the ways that we can accomplish this would be to use a correlated sub-query. Through this, the data is aggregated before it is joined to the other data in the results. To accomplish this, the query would be re-written as such:
SELECT ps.Name
    ,p.ProductID
    ,p.Name
    ,(SELECT COUNT(*)
      FROM SalesOrderDetail sad
      WHERE p.ProductID = sad.ProductID)
FROM Production.ProductCategory pc
INNER JOIN Production.ProductSubcategory ps ON pc.ProductCategoryID = ps.ProductCategoryID
INNER JOIN Production.Product p ON ps.ProductSubcategoryID = p.ProductSubcategoryID
WHERE p.Name = 'Water Bottle - 30 oz.'
ORDER BY p.Name
The chief difference is that the query to determine the COUNT() is no longer brought in through a LEFT OUTER JOIN. Instead it is included in the SELECT statement as a sub-query. This small change does influence the execution plan:
As you can see above, the 98K rows are aggregated right away. They aren’t joined to the other tables, so that data is no longer duplicated unnecessarily. The GROUP BY clause on the outer query is no longer needed, and the performance is slightly improved.
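Another way to aggregate before the join (a sketch of an equivalent rewrite, not from the original post; the agg alias and OrderCount column name are my own) is a derived table that groups SalesOrderDetail by ProductID and is then joined in place of the original LEFT OUTER JOIN:

```sql
-- Sketch: pre-aggregate SalesOrderDetail in a derived table so only one
-- row per ProductID reaches the join. ISNULL covers products with no
-- orders, which the derived table will not return.
SELECT ps.Name
    ,p.ProductID
    ,p.Name
    ,ISNULL(agg.OrderCount, 0) AS OrderCount
FROM Production.ProductCategory pc
INNER JOIN Production.ProductSubcategory ps ON pc.ProductCategoryID = ps.ProductCategoryID
INNER JOIN Production.Product p ON ps.ProductSubcategoryID = p.ProductSubcategoryID
LEFT OUTER JOIN (
    SELECT ProductID
        ,COUNT(*) AS OrderCount
    FROM SalesOrderDetail
    GROUP BY ProductID
    ) agg ON p.ProductID = agg.ProductID
WHERE p.Name = 'Water Bottle - 30 oz.'
ORDER BY p.Name;
```

The optimizer will often produce a similar plan for this and the correlated sub-query form, but having both in your toolbox is useful when one shape estimates poorly.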
Unfortunately, this example of the correlated sub-query technique didn’t match the dramatic performance improvement that I witnessed at the client. You could, though, see that there was a difference in the execution plan, along with a slight difference in query cost. Given the right distribution of data and a “proper” table schema, this technique can provide significant benefits.
Another take-away from this post is that you need to try different techniques and watch your execution plans. Different ways of writing T-SQL will influence the results, and you may need to try several techniques to find the right one.
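When comparing candidates, one habit worth picking up (a simplified sketch of my own, not from the original post; the category joins are dropped here to keep it short) is to run both versions in a single batch with the actual execution plan enabled and SET STATISTICS TIME on, so the relative plan costs and CPU/elapsed times sit side by side:

```sql
-- Sketch: both candidates in one batch. The actual execution plan shows
-- each statement's cost relative to the batch; STATISTICS TIME adds CPU
-- and elapsed time per statement in the Messages tab.
SET STATISTICS TIME ON;

-- Candidate 1: LEFT OUTER JOIN with GROUP BY
SELECT p.ProductID
    ,p.Name
    ,COUNT(*)
FROM Production.Product p
LEFT OUTER JOIN SalesOrderDetail sad ON p.ProductID = sad.ProductID
WHERE p.Name = 'Water Bottle - 30 oz.'
GROUP BY p.ProductID
    ,p.Name;

-- Candidate 2: correlated sub-query
SELECT p.ProductID
    ,p.Name
    ,(SELECT COUNT(*)
      FROM SalesOrderDetail sad
      WHERE sad.ProductID = p.ProductID)
FROM Production.Product p
WHERE p.Name = 'Water Bottle - 30 oz.';

SET STATISTICS TIME OFF;
```

Relative cost percentages are estimates, so treat the timing and I/O numbers as the tie-breaker rather than the plan costs alone.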
I guess I wasn’t really paying attention when I wrote my Monday blog post since it covered Aggregates as well. If you want to read a bonus post on aggregates, check out XQuery for the Non-Expert – Aggregates.