当前位置: 动力学知识库 > 问答 > 编程问答 >

sql - Subqueries vs MAX aggregate function peformance

问题描述:

MAX aggregation vs Subqueries:

This seems to be a reoccurring issue in the queries that I have been writing recently, and I would like to discover which query style is:

  • Most efficient (time and resources)
  • More reliable and easier to maintain
  • Makes the most sense to use


More Information:

The queries I am writing tend to always pull from one base table and will also join to several other tables; however, the joined tables often have a vertical orientation where the foreign key is referenced several times with a unique "descriptor" and "response." (See: Table #MovieDescriptions for an example.)


Please use the following SQL queries as a test scenario:


 -- Drop temp tables if exist

IF OBJECT_ID('TempDB..#Movies','U') IS NOT NULL

DROP TABLE #Movies

IF OBJECT_ID('TempDB..#MovieDescriptions','U') IS NOT NULL

DROP TABLE #MovieDescriptions

-- Creating temp tables

CREATE TABLE #Movies

(

MovieID int IDENTITY(1,1),

MovieName varchar (100),

ReleaseYear datetime,

Director varchar (100)

)

CREATE TABLE #MovieDescriptions

(

MovieDescID int IDENTITY(1,1),

FK_MovieID varchar(100),

DescriptionType varchar(100),

DescriptionResponse varchar(100)

)

-- Inserting test data

INSERT INTO #Movies (MovieName, ReleaseYear, Director) VALUES ('Gone With the Wind', CONVERT(datetime,'12/15/1939'), 'Victor Fleming')

INSERT INTO #Movies (MovieName, ReleaseYear, Director) VALUES ('2001: A Space Odyssey', CONVERT(datetime,'01/01/1968'), 'Stanley Kubrick')

INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('1', 'Written By', 'Sideny Howard')

INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('1', 'Genre', 'Drama')

INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('1', 'Rating', 'G')

INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('2', 'Written By', 'Standley Kubrick')

INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('2', 'Genre', 'Sci-Fi')

INSERT INTO #MovieDescriptions (FK_MovieID, DescriptionType, DescriptionResponse) VALUES ('2', 'Rating', 'G')

-- Using subqueries

SELECT

MovieName,

ReleaseYear,

(SELECT DescriptionResponse

FROM #MovieDescriptions

WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Genre'

) AS Genre,

(SELECT DescriptionResponse

FROM #MovieDescriptions

WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Rating'

) AS Rating

FROM #Movies

-- Using aggregate functions

SELECT

MovieName,

ReleaseYear,

MAX(CASE WHEN md.DescriptionType = 'Genre' THEN DescriptionResponse END) AS Genre,

MAX(CASE WHEN md.DescriptionType = 'Rating' THEN DescriptionResponse END) AS Rating

FROM #Movies m

INNER JOIN #MovieDescriptions md

ON m.MovieID = md.FK_MovieID

GROUP BY MovieName, ReleaseYear

In addition, if there is a better way of selecting this data, that would be helpful as well.

网友答案:

There are many ways to get results from multiple tables like nested/sub queries, joins, aggrgate functions etc.

But Joins are way more efficient than subqueries when you need results from n number of tables.

It might not be observable for few tables. But when you work with 5 or 6 tables simultaneously, you will observer how joins perform.

网友答案:

Assuming a more normal setup where your tables are properly indexed, and where the foreign key relationship columns have matching data types (hint hint: they don't currently match, int vs. varchar), then you should always find that the your second query (joins + aggregation) outperforms the first (subqueries in your select clause). It may not be noticeable with small amounts of data. But the more your base table has data (#Movies), the more the difference will be obvious.

The reason is simple. In your first query:

SELECT
     MovieName,
     ReleaseYear,
     (SELECT DescriptionResponse
      FROM #MovieDescriptions
      WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Genre'
      ) AS Genre,
     (SELECT DescriptionResponse
      FROM #MovieDescriptions
      WHERE FK_MovieID = #Movies.MovieID AND DescriptionType = 'Rating'
      ) AS Rating
FROM #Movies

If #Movies contains 1000 rows, then SQL Server really has no choice but to perform one full table scan on #Movies, and, for each of the 1000 rows, it needs to perform 2 additional queries on #MovieDescriptions. In reality, you are executing a total of 2001 queries. Because your subqueries are in the SELECT clause, SQL Server has no choice but to execute the query in this way.

On the other hand, your 2nd query:

SELECT
     MovieName,
     ReleaseYear,
     MAX(CASE WHEN md.DescriptionType = 'Genre' THEN DescriptionResponse END) AS Genre,
     MAX(CASE WHEN md.DescriptionType = 'Rating' THEN DescriptionResponse END) AS Rating
FROM #Movies m
     INNER JOIN #MovieDescriptions md
     ON m.MovieID = md.FK_MovieID
GROUP BY MovieName, ReleaseYear

Because you are using a join here, this gives SQL Server the flexibility of figuring out the most efficient way of joining the data from #Movies and #MovieDescriptions. Depending on your indexes, filters, number of rows, etc. it may decide to do a hash join, maybe it will use nested loops, etc. The point is that SQL Server has more options, and can now figure out the best way of reducing the number data block reads from your 2 tables (and indexes).

EDIT: I should also add that the above assumes that you are fetching every row returned from the query. If the query returns thousands of rows, but you only fetch the first 10 rows, then, in certain cases, the first query could actually outperform the second. That's because the subqueries will only execute on the rows as they get selected or fetched. If you never fetch some of the rows, you may never incur the cost of executing the subqueries on those unfetched rows. Something to consider.

网友答案:

I like this version of your query better..

SELECT
     MovieName,
     ReleaseYear,
     MAX(CASE WHEN md.DescriptionType = 'Genre' THEN DescriptionResponse END) AS Genre,
     MAX(CASE WHEN md.DescriptionType = 'Rating' THEN DescriptionResponse END) AS Rating
 FROM #Movies m
     INNER JOIN #MovieDescriptions md
     ON m.MovieID = md.FK_MovieID
GROUP BY MovieName, ReleaseYear

But This involves a sort on movies tables and this sort cost is 63% of total cost and you are restricted in getting all columns..

I rewrote this version using Apply which doesn't have sort and you are not restricted in getting all columns..

select 
* from
#movies m
cross apply
(
select 
max(case when descriptiontype='genre' then descriptionresponse  end) as genre,
max(case when descriptiontype='rating' then descriptionresponse  end) as rating
from
#MovieDescriptions md
where  md.fk_movieid=m.movieid)b

execution plan of above version..

Now Both queries doesn't have indexes..

What are the indexes that are needed for First query to work and scale..

An additional index on MovieName, ReleaseYear to avoid sort and to tell SQLServer these are unique..since you already had (assuming) primary key on Movieid ,having one more index to avoid sort is ,some thing which i don't prefer and if you need more columns from first query,you will have to add it again to index..

For The second query to scale better,
you will need an index on FK_movieid which i assume you have it already..

PS:

I may have missed bunch of things,But i would go with my query ..

分享给朋友:
您可能感兴趣的文章:
随机阅读: