
faculty.business.wsu.edu… · Web view · 2016/10/16


Featherman’s SQL Server Intermediate Functions © (Introduction of local variables, more content on SQL Sub-queries, SQL window functions, SQL table variables)

By now you know a great deal of T-SQL. Good persistence! Here we solve some more problems. Once you have run many GROUP BY queries, you will discover that there are roadblocks to building the datasets you need. While array tables with sub-queries fix the data merging problem, there are still data calculation problems. This intermediate T-SQL document covers two particularly common ones. The first is that while you can now make row-based calculations (i.e., multiplying two column values from the same row, for every row in the resultset) and aggregated sub-totals and counts within windows of a dataset, we still have not found a way to calculate a column total for the entire dataset. We always partitioned the group totals, or each row value itself was the total. A common analytics need is to derive an overall value from the dataset, show a % of total column, and compare individual row values to that overall value. This document uses the % of total problem to illustrate the point and to solve a common problem.

This document shows four different ways to calculate a percent of total:

1. By using a sub-query,
2. By using a local variable,
3. By using an empty OVER () clause (no PARTITION BY), thereby using the entire resultset, and
4. By specifying a field name with PARTITION BY in the OVER () clause to calculate totals within sub-groups.

You don’t need four different ways to calculate a % of total; however, applying them to the same problem lets you compare the methodologies and, in the process, learn four powerful SQL programming techniques. Two powerful constructs are discussed further along the way: variables and window functions.

The second major problem tackled in this module is the need to run different WHERE filters when performing different calculations. Often the columns needed in a dataset are quite different and the calculations can be difficult to reproduce in Excel. The beauty of using a report writing tool such as SSRS is that you can build complex expressions that serve as the data source for a column in a report or a line on a chart. You can also drive this functionality into the stored procedure, to speed up the process of report generation. So the problem of different columns needing different filtering (the WHERE clause) and different calculations is presented and conquered.

While a great deal of SQL queries are provided that you can run, you will learn best by inventing your own problem and TSQL solution using the methodologies shown here.


Sub-Queries - You have learned that in many cases a sub-query can be used instead of a JOIN (and vice versa). In other cases, sub-queries allow you to set up an advanced WHERE clause for filtering. As shown in this document, a sub-query can also provide one or more values in the SELECT clause (we will use it to calculate totals that serve as the denominator of an expression). BE SURE to highlight the sub-query in SSMS and run it on its own, to ensure it returns the correct response.

Sub-queries are typically part of the WHERE clause, as follows: 1) WHERE column IN (subquery), 2) WHERE column <comparison> (subquery), 3) WHERE EXISTS (subquery).

In this document two sub-query examples are shown, and both return a single number (aka a scalar). In the first, a refresher example, an average is returned by the sub-query and used in the outer query’s subsequent processing. Two versions of this query are shown. In the second example a total is calculated and used in subsequent processing.

USE [AdventureWorksDW2012];
SELECT sc.[EnglishProductSubcategoryName] AS [Sub-Category], [ModelName], [ProductKey], [ProductAlternateKey], [EnglishProductName], [ListPrice]
FROM [dbo].[DimProduct] AS p
INNER JOIN [dbo].[DimProductSubcategory] AS sc ON p.[ProductSubcategoryKey] = sc.[ProductSubcategoryKey]
WHERE p.[ListPrice] >
    (SELECT AVG(p.[ListPrice])
     FROM [dbo].[DimProduct] AS p
     INNER JOIN [dbo].[DimProductSubcategory] AS sc ON sc.ProductSubcategoryKey = p.ProductSubcategoryKey
     WHERE sc.[EnglishProductSubcategoryName] IN ('Touring Bikes', 'Mountain Bikes', 'Road Bikes'))
ORDER BY [ListPrice] DESC

The Adventure Works company sells 125 different bicycles. This query lists the products whose list price is higher than the average for the bicycle sub-categories. If you run the sub-query alone you will see that the average for products in the three bicycle sub-categories is $1,524. There are 50 bicycles with above-average list prices.

Look at the WHERE clause. It contains its own query (called the inner query). Be sure to run the inner query by itself to see the result it returns. A prior version of this query did not include the inner join in the sub-query, and all 89 bikes were returned.

This is a filtering exercise. A similar topic is NTILE() where you categorize the rows for later filtering.
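As a minimal sketch of the NTILE() idea mentioned above, the query below sorts a handful of products into four price bands and keeps only the top band. The @Products table variable, its rows, and the [Price Band] column name are invented for illustration; they are not AdventureWorks data.

```sql
-- Hypothetical sample data (made-up products and prices, not from AdventureWorks)
DECLARE @Products TABLE ([Product] varchar(30), [Web Price] decimal(10,2));

INSERT INTO @Products ([Product], [Web Price])
VALUES ('Bike A', 3500.00), ('Bike B', 3400.00),
       ('Bike C', 2400.00), ('Bike D', 2300.00),
       ('Bike E', 1200.00), ('Bike F', 1100.00),
       ('Bike G',  800.00), ('Bike H',  550.00);

-- NTILE(4) numbers the rows 1 to 4 in equal-sized bands, most expensive first.
-- A window function cannot be used directly in WHERE, so the banding is done
-- in a derived table and the filter is applied in the outer query.
SELECT [Product], [Web Price], [Price Band]
FROM (SELECT [Product], [Web Price],
             NTILE(4) OVER (ORDER BY [Web Price] DESC) AS [Price Band]
      FROM @Products) AS banded
WHERE [Price Band] = 1
ORDER BY [Web Price] DESC;
```

With eight rows and four bands, each band holds two products, so the filter keeps the two most expensive ones. Compared with the average-based filter, NTILE() always returns a fixed share of the rows regardless of how the prices are distributed.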


USE [Featherman_Analytics];

SELECT [Sub Category], [Model], [ProductKey], [Part #], [Product], [Dealer Price], [Web Price]
FROM [dbo].[AW_Products_Flattened]
WHERE [Web Price] >
    (SELECT AVG([Web Price]) FROM [dbo].[AW_Products_Flattened]
     WHERE [Sub Category] IN ('Touring Bikes', 'Mountain Bikes', 'Road Bikes'))
ORDER BY [Web Price] DESC

Here the query is re-written using the flattened, denormalized [dbo].[AW_Products_Flattened] data table. Special care has been applied to re-naming columns (as you would when creating a view for an analyst or manager). Notice how much easier the SQL is to read!

In the next block of code, look at the denominator of the % of total field. Notice that the denominator is its own query! This is the sub-query, which must be enclosed in parentheses. Two different sub-queries are run here, but the concept is the same: each is a separate query that is not confined by the outer GROUP BY (it can even have a different GROUP BY of its own). The denominator can be calculated two different ways. Run the following two simple SELECT statements one at a time. The first finds a grand total from the sales table, the second the grand total from the salesTotals table. Next run the SELECT statements without the SUM function to get an idea of the data. The salesTotals table is an intermediate aggregated table; such tables are very helpful but can introduce errors when they are not updated as part of a regular ETL data management process.

USE [Featherman_Analytics]; SELECT SUM([Total_Sale]) FROM featherman.sales


USE [Featherman_Analytics]; SELECT SUM([Total]) FROM featherman.salesTotals

Here are the rows of the aggregated salesTotals table.

Now the denominators in the next two lines should make more sense: the first queries an aggregated table, the second queries the transaction table. Both are useful strategies, especially when dealing with millions of rows of data. Run the query below, then swap in the other line of code. The results are a little different, most likely because the salesTotals table is out of date whereas the sales table is up to date.

, FORMAT(SUM([Total_Sale]) / (SELECT SUM([Total]) FROM featherman.salesTotals), 'P2') AS [% of Total]
, FORMAT(SUM([Total_Sale]) / (SELECT SUM([Total_Sale]) FROM featherman.sales), 'P2') AS [% of Total]

USE [Featherman_Analytics];
SELECT c.[CustomerID], [CustomerName], COUNT([Sale_ID]) AS [# Sales], FORMAT(SUM([Total_Sale]), 'N0') AS [Total Sales]
--Because the next line uses aggregates (SUM(), COUNT()), the GROUP BY interprets it as: for each customer, divide the sum of the total sales by the count of the number of sales.
, FORMAT(SUM([Total_Sale]) / COUNT([Sale_ID]), 'N0') AS [Average Sale]
--Next divide the total sales for the customer by the total sales of the dataset.
, FORMAT(SUM([Total_Sale]) / (SELECT SUM([Total]) FROM featherman.salesTotals), 'P2') AS [% of Total]
FROM featherman.sales AS s
INNER JOIN [featherman].[Customers] AS c ON c.[CustomerID] = s.customerID
GROUP BY c.[CustomerID], [CustomerName]
--Sort on the numeric expression: [Average Sale] is FORMAT() text and would sort alphabetically.
ORDER BY SUM([Total_Sale]) / COUNT([Sale_ID]) DESC

Can you add a column that shows each customer's percentage of the total number of invoices? This would add an interesting insight to the analysis. Some customers may have a small % of transactions yet a large % of the $ amount, and vice versa.
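One possible answer to the exercise is sketched here. So that it runs anywhere, it uses a small made-up @Sales table variable in place of featherman.sales; the rows and the [% of # Sales] column name are assumptions for illustration only.

```sql
-- Hypothetical sample data standing in for featherman.sales
DECLARE @Sales TABLE (Sale_ID int, CustomerID int, Total_Sale decimal(10,2));

INSERT INTO @Sales (Sale_ID, CustomerID, Total_Sale)
VALUES (1, 1, 500.00), (2, 1, 250.00),
       (3, 2, 4000.00),
       (4, 3, 100.00), (5, 3, 150.00), (6, 3, 50.00);

SELECT CustomerID
     , COUNT(Sale_ID) AS [# Sales]
     -- share of revenue: customer total divided by the grand total sub-query
     , FORMAT(SUM(Total_Sale) / (SELECT SUM(Total_Sale) FROM @Sales), 'P2') AS [% of $ Total]
     -- share of invoices: 1.0 * forces decimal division, because
     -- COUNT / COUNT on its own is integer division
     , FORMAT(1.0 * COUNT(Sale_ID) / (SELECT COUNT(Sale_ID) FROM @Sales), 'P2') AS [% of # Sales]
FROM @Sales
GROUP BY CustomerID
ORDER BY SUM(Total_Sale) DESC;
```

In this sample, customer 2 has only one of the six invoices but most of the revenue, which is exactly the kind of contrast the new column is meant to expose.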


Using a local variable to change the formatting of the sub-query

USE [Featherman_Analytics];
DECLARE @AvgBicycleListPrice decimal(10,2)
SET @AvgBicycleListPrice = (SELECT AVG([Web Price]) FROM [dbo].[AW_Products_Flattened]
    WHERE [Sub Category] IN ('Touring Bikes', 'Mountain Bikes', 'Road Bikes'))
PRINT @AvgBicycleListPrice

SELECT [Sub Category], [Model], [ProductKey], [Part #], [Product], [Web Price]
FROM [dbo].[AW_Products_Flattened]
WHERE [Web Price] > @AvgBicycleListPrice
ORDER BY [Web Price] DESC

--the print statement above gives this output

As will be explained later, here a prior query is re-written using a local variable. This is another option that is useful when you need to calculate a scalar (return a single value from a query) that is used in subsequent processing.

You can run the top half of this query, down to the PRINT line, to see the value that is returned. The final results are the same as those of the earlier sub-query version.

While you may be thinking that it will always be easiest to use local variables created at the top of a query program, often this is not what you need. Rather, sub-queries are often used not as a filter as shown here, but to add new columns of calculated data for some object of interest (e.g., product ID#, employee ID#, etc.).

Local Variables

Here we create two decimal variables that each receive one number (the total $ amount of all sales, and a count of the number of sales). What, we’re programming? Yes: SQL is a programming language, and one of the most widely used in the world.


USE [Featherman_Analytics];
DECLARE @TotalSales AS decimal = (SELECT SUM(Total_Sale) FROM featherman.sales)
DECLARE @CountofSales AS decimal = (SELECT COUNT(*) FROM featherman.sales)

SELECT [CustomerName], FORMAT(SUM(Total_Sale), 'N0') AS [CustomerTotals]
--Since we are grouping on customer, the numerator is the total $ sales for one customer. The denominator is the total sales revenue of all invoices, which was pre-calculated in the 1st variable.
, FORMAT(SUM(Total_Sale) / @TotalSales, 'P2') AS [% of $Total]
--Since we are grouping on customer, the numerator is a count of the # sales for one customer. The denominator is the total count of all invoices, which was pre-calculated in the 2nd variable.
, FORMAT(COUNT(*) / @CountofSales, 'P2') AS [% of Total Units]
FROM featherman.sales AS s
INNER JOIN [featherman].[Customers] AS c ON c.CustomerID = s.CustomerID
--Note: it's a good idea to group by a primary key, not a name field (there could be duplicate names).
GROUP BY c.[CustomerID], [CustomerName]
--Sort on the numeric sum: [CustomerTotals] is FORMAT() text and would sort alphabetically.
ORDER BY SUM(Total_Sale) DESC

Shown is the shortcut way of creating the variable and setting its value in the same line of code. Because assignment lines of code (with an = sign) read from right to left, the SELECT statement is run first and the returned value is then assigned to the variable. Be sure to enclose the SELECT statement for the variable assignment in parentheses.

It is interesting that the calculated fields (SUM, AVG, COUNT) create totals for each of the GROUP BY categories, and this behavior continues even if your query fields become very complex, as in the % of total columns of this query.
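To make the shortcut declaration concrete, here is a small sketch that shows the long form (DECLARE, then SET) next to the one-line form, using a made-up @Sales table variable. It also illustrates one caveat worth knowing: in SQL Server, decimal with no precision or scale means decimal(18,0), so a value assigned to such a variable is rounded to a whole number. The variable names are hypothetical.

```sql
-- Hypothetical sample data
DECLARE @Sales TABLE (Total_Sale decimal(10,2));
INSERT INTO @Sales (Total_Sale) VALUES (10.25), (20.50), (30.75);

-- Long form: declare the variable first, then assign it with SET
DECLARE @TotalLong decimal(12,2);
SET @TotalLong = (SELECT SUM(Total_Sale) FROM @Sales);

-- Shortcut form: declare and assign in one statement
DECLARE @TotalShort decimal(12,2) = (SELECT SUM(Total_Sale) FROM @Sales);

-- No precision/scale given: this is decimal(18,0), so the cents are rounded away
DECLARE @TotalNoScale decimal = (SELECT SUM(Total_Sale) FROM @Sales);

SELECT @TotalLong    AS [SET form]
     , @TotalShort   AS [Shortcut form]
     , @TotalNoScale AS [decimal with no scale];
```

For a grand total the rounding is usually harmless, but for an average or a small ratio it can matter, so stating the precision and scale, for example decimal(12,2), is the safer habit.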


Here is an improved query that should give you some ideas. We create three decimal variables that each receive one number (the total $ amount of all sales, a count of the number of sales, and the average sale amount).

USE [Featherman_Analytics];

/* This is the easiest way to calculate and display a % of total for a dataset. Three local variables are created and values assigned. */
DECLARE @TotalSales AS decimal = (SELECT SUM(Total_Sale) FROM featherman.sales)
DECLARE @CountofSales AS decimal = (SELECT COUNT(*) FROM featherman.sales)
DECLARE @AvgTA AS decimal = (SELECT AVG(Total_Sale) FROM featherman.sales)
PRINT @TotalSales
PRINT @CountofSales
PRINT @AvgTA

SELECT [CustomerName], SUM(Total_Sale) AS [CustomerTotals], FORMAT(AVG(Total_Sale), 'C0') AS [CustomerAvgTA$]
/* These next few lines use the local variables in the calculations. Very useful! */
, FORMAT(AVG(Total_Sale) - @AvgTA, 'N0') AS [Compared to Overall Customer Avg.]
, FORMAT(AVG(Total_Sale) / @AvgTA, 'P2') AS [% of Overall Customer Avg.]
/* Since we are grouping on customer, the numerator is the total $ sales for one customer. The denominator is the total sales revenue of all invoices, which was pre-calculated in the 1st variable. */
, FORMAT(SUM(Total_Sale) / @TotalSales, 'P2') AS [Customer % of $Total]
, COUNT(*) AS [Number Orders]
/* Since we are grouping on customer, the numerator is a count of the # sales for one customer. The denominator is the total count of all invoices, which was pre-calculated in the 2nd variable. */
, FORMAT(COUNT(*) / @CountofSales, 'P2') AS [Cust% of Transactions]
FROM featherman.sales AS s
INNER JOIN [featherman].[Customers] AS c ON c.CustomerID = s.CustomerID


--Note: it's a good idea to group by a primary key, not a name field (there could be duplicate names).
GROUP BY c.[CustomerID], [CustomerName]
ORDER BY SUM(Total_Sale) DESC


Another example comparing sub-queries to local variables. Take a look at the dbo.DimEmployee table in the AdventureWorksDW2012 database. The Gender field shows F or M. Let's turn those into words.

USE [AdventureWorksDW2012];
SELECT CASE
    WHEN Gender = 'F' THEN 'Female'
    WHEN Gender = 'M' THEN 'Male' END AS [Gender]
, COUNT(Gender) AS [Gender Count]
, FORMAT(100 * COUNT(*) / (SELECT COUNT(*) FROM [dbo].[DimEmployee]
    WHERE Gender = 'M' OR Gender = 'F'), 'N2') AS [Percent]
FROM [dbo].[DimEmployee]
GROUP BY [Gender]

Here the denominator again is a sub-query with no GROUP BY. The total number of male and female employees is calculated with COUNT(*). You can also use COUNT(fieldname), which skips rows where that field is NULL.

The CASE expression in the SELECT lets us improve the formatting of the output.

Run the query several times, without the 100 * and without the FORMAT(..., 'N2'), to see the somewhat strange behavior of SQL: dividing one integer count by another is integer division, so the fractional part is discarded.
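The "strange behavior" is integer division, and it can be seen in isolation with two integer variables. This is a minimal sketch; the variable names and values are made up, and the column aliases are only labels.

```sql
-- Hypothetical counts: 3 of 8 employees
DECLARE @Female int = 3, @Total int = 8;

SELECT @Female / @Total         AS [int / int]          -- integer division: 0
     , 100 * @Female / @Total   AS [100 * int / int]    -- still integer: 37
     , 100.0 * @Female / @Total AS [100.0 * int / int]  -- decimal: 37.5 (shown with trailing zeros)
     , FORMAT(1.0 * @Female / @Total, 'P2') AS [as percent];
```

Multiplying by 100 first keeps the whole-number part of the percentage, but the decimals are still lost. Multiplying by 100.0 or 1.0 instead converts the calculation to decimal arithmetic, which is why later queries in this document add * 1.0 to the numerator.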

USE [AdventureWorksDW2012];
DECLARE @TotalEmployees AS int = (SELECT COUNT(*) FROM [dbo].[DimEmployee] WHERE Gender = 'M' OR Gender = 'F')

SELECT CASE
    WHEN Gender = 'F' THEN 'Female'
    WHEN Gender = 'M' THEN 'Male' END AS [Sex]
, COUNT(Gender) AS [# Employees]
, FORMAT(100 * COUNT(*) / @TotalEmployees, 'N2') AS [% of Employees]
FROM [dbo].[DimEmployee]
GROUP BY Gender

Here we again take the sub-query and place it into a variable at the top of the program. A local variable of integer data type is created, and its value is set to the count of all the records in the DimEmployee table that have a valid value for Gender.

Notice that in the DECLARE statement the SELECT query is enclosed in parentheses. The second half of this query uses the variable in the denominator of a formula. We multiply by 100 before dividing so that the integer division works with numbers greater than 1; otherwise every percentage would come out as 0.

Also, the WHERE clause in the DECLARE statement filters out NULL values for the Gender field. You could also use WHERE Gender IS NOT NULL.

Another Example of Local Variable and Sub Queries (Used together to solve a problem)

USE [Featherman_Analytics];
DECLARE @CustomerID int = 3

SELECT [CustomerID], COUNT([Sale_ID]) AS [Total #TA], SUM([Total_Sale]) AS [Total Revenue], FORMAT(AVG([Total_Sale]), 'N0') AS [Customers Average Invoice Amount]
FROM [featherman].[sales]
WHERE CustomerID = @CustomerID
GROUP BY [CustomerID]

This is a useful query – that can be turned into a stored procedure.
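As a rough sketch of what turning this query into a stored procedure could look like, the example below wraps it in a temporary procedure that reads a made-up #Sales temp table, so it runs without the Featherman_Analytics database. The names #Sales and #CustomerSummary are hypothetical. GO is the batch separator used by SSMS and sqlcmd, needed because CREATE PROCEDURE must be the first statement in its batch.

```sql
-- Hypothetical sample data in a temp table (stands in for featherman.sales)
CREATE TABLE #Sales (Sale_ID int, CustomerID int, Total_Sale decimal(10,2));

INSERT INTO #Sales (Sale_ID, CustomerID, Total_Sale)
VALUES (1, 3, 100.00), (2, 3, 150.00), (3, 3, 50.00), (4, 1, 500.00);
GO

-- A temporary procedure (# prefix) disappears when the session ends.
-- A permanent version would drop the # and read the real sales table.
CREATE PROCEDURE #CustomerSummary
    @CustomerID int
AS
BEGIN
    SELECT CustomerID
         , COUNT(Sale_ID)  AS [Total #TA]
         , SUM(Total_Sale) AS [Total Revenue]
         , FORMAT(AVG(Total_Sale), 'N0') AS [Customers Average Invoice Amount]
    FROM #Sales
    WHERE CustomerID = @CustomerID
    GROUP BY CustomerID;
END
GO

-- The local variable from the query becomes a parameter supplied by the caller
EXEC #CustomerSummary @CustomerID = 3;
```

The only structural change from the original query is that DECLARE @CustomerID int = 3 moves into the procedure's parameter list, so the caller decides which customer to summarize.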

--No inner join is used to add the customer's name, to save space.
USE [Featherman_Analytics];
DECLARE @CustomerID int = 3
SELECT [CustomerID]
,(SELECT SUM([Total_Sale]) FROM [featherman].[sales] WHERE [Paid] = 'True' AND CustomerID = @CustomerID) AS [$ Paid]
,(SELECT SUM([Total_Sale]) FROM [featherman].[sales] WHERE [Paid] = 'False' AND CustomerID = @CustomerID) AS [$ UnPaid]
, SUM([Total_Sale]) AS [Total Revenue]
,(SELECT COUNT([Sale_ID]) FROM [featherman].[sales] WHERE [Paid] = 'True' AND CustomerID = @CustomerID) AS [# Paid]
,(SELECT COUNT([Sale_ID]) FROM [featherman].[sales] WHERE [Paid] = 'False' AND CustomerID = @CustomerID) AS [# UnPaid]
, COUNT([Sale_ID]) AS [Total #TA]
FROM [featherman].[sales]
WHERE CustomerID = @CustomerID
GROUP BY [CustomerID]

This example uses four sub-queries to add more information. The sub-queries are needed because the WHERE clause of the outer query is not detailed enough. Each of the sub-queries has its own criteria (whether the invoice was paid or not). In a normal GROUP BY query all the aggregate fields use the same filter (the WHERE clause). Here the data needs are more detailed and varied, so you build the resultset one column at a time.

Notice how useful the local variable is, in that it can be used repeatedly. Note that each sub-query is enclosed in its own set of parentheses.

This query is a very useful addition to an INSERT SQL transaction. After you add a new sales order, you can call this parameterized query and show the updated total business with the customer.

But what if you prefer a report that shows the same analytics for each and every customer?


Table Variables Revisited – here we need to bring back table variables to solve the problem of having different calculations performed in different columns of the dataset. Note: you can provide similar functionality in Excel if you use DAX to build calculated fields.

USE [Featherman_Analytics];
DECLARE @CustomerMetrics1 TABLE ([CustomerID] int)
DECLARE @CustomerMetrics2 TABLE ([CustomerID] int, [# Paid TA] int)
DECLARE @CustomerMetrics3 TABLE ([CustomerID] int, [# UnPaid TA] int)
DECLARE @CustomerMetrics4 TABLE ([CustomerID] int, [Total #TA] int)
DECLARE @CustomerMetrics5 TABLE ([CustomerID] int, [Revenue Collected] decimal)
DECLARE @CustomerMetrics6 TABLE ([CustomerID] int, [Revenue Uncollected] decimal)
DECLARE @CustomerMetrics7 TABLE ([CustomerID] int, [Revenue Total] decimal)

--One row per customer; DISTINCT avoids one row per invoice.
INSERT INTO @CustomerMetrics1 ([CustomerID])
SELECT DISTINCT [CustomerID] FROM [featherman].[sales]

INSERT INTO @CustomerMetrics2 ([CustomerID], [# Paid TA])
SELECT [CustomerID], COUNT([Sale_ID]) FROM [featherman].[sales] WHERE [Paid] = 'True' GROUP BY [CustomerID]

INSERT INTO @CustomerMetrics3 ([CustomerID], [# UnPaid TA])
SELECT [CustomerID], COUNT([Sale_ID]) FROM [featherman].[sales] WHERE [Paid] = 'False' GROUP BY [CustomerID]

INSERT INTO @CustomerMetrics4 ([CustomerID], [Total #TA])
SELECT [CustomerID], COUNT([Sale_ID]) FROM [featherman].[sales] GROUP BY [CustomerID]

INSERT INTO @CustomerMetrics5 ([CustomerID], [Revenue Collected])
SELECT [CustomerID], SUM([Total_Sale]) FROM [featherman].[sales] WHERE [Paid] = 'True' GROUP BY [CustomerID]

INSERT INTO @CustomerMetrics6 ([CustomerID], [Revenue Uncollected])
SELECT [CustomerID], SUM([Total_Sale]) FROM [featherman].[sales] WHERE [Paid] = 'False' GROUP BY [CustomerID]

INSERT INTO @CustomerMetrics7 ([CustomerID], [Revenue Total])
SELECT [CustomerID], SUM([Total_Sale]) FROM [featherman].[sales] GROUP BY [CustomerID]

This query solves the problem. Is there an easier way to run this query?

This query creates and merges seven table variables, each with the CustomerID field in common. Most of the table variables are two columns wide and are filled by a GROUP BY query. Each column is a different metric with a different WHERE clause.

Because each column is very different (some are counts, some are sums, most have different criteria), you have to build the columns one at a time.

We insert selected data into each of the table variables 1 through 7.

SELECT DISTINCT CM1.[CustomerID], CM2.[# Paid TA], CM3.[# UnPaid TA], CM4.[Total #TA], CM5.[Revenue Collected], CM6.[Revenue Uncollected], CM7.[Revenue Total]

FROM @CustomerMetrics1 CM1, @CustomerMetrics2 CM2, @CustomerMetrics3 CM3, @CustomerMetrics4 CM4, @CustomerMetrics5 CM5, @CustomerMetrics6 CM6, @CustomerMetrics7 CM7

WHERE CM1.[CustomerID] = CM2.[CustomerID] AND CM1.[CustomerID] = CM3.[CustomerID] AND CM1.[CustomerID] = CM4.[CustomerID] AND CM1.[CustomerID] = CM5.[CustomerID] AND CM1.[CustomerID] = CM6.[CustomerID] AND CM1.[CustomerID] = CM7.[CustomerID]

Now we select the columns from the linked tables. The WHERE statement links all the virtual table variables together.
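One commonly used alternative, offered here only as a sketch and not as the method of this document, is conditional aggregation: a CASE expression inside each aggregate applies a different filter per column, so a single GROUP BY query can produce all of the metrics. The @Sales table variable and its rows are made up, and the Paid column is assumed to hold the text values 'True' and 'False'.

```sql
-- Hypothetical sample data standing in for featherman.sales
DECLARE @Sales TABLE (Sale_ID int, CustomerID int, Total_Sale decimal(10,2), Paid varchar(5));

INSERT INTO @Sales (Sale_ID, CustomerID, Total_Sale, Paid)
VALUES (1, 1, 500.00, 'True'),  (2, 1, 250.00, 'False'),
       (3, 2, 4000.00, 'True'),
       (4, 3, 100.00, 'False'), (5, 3, 150.00, 'True'), (6, 3, 50.00, 'True');

SELECT CustomerID
     -- CASE returns NULL when the condition fails, and COUNT skips NULLs
     , COUNT(CASE WHEN Paid = 'True'  THEN Sale_ID END) AS [# Paid TA]
     , COUNT(CASE WHEN Paid = 'False' THEN Sale_ID END) AS [# UnPaid TA]
     , COUNT(Sale_ID) AS [Total #TA]
     -- ELSE 0 makes the sum 0 instead of NULL when no rows match
     , SUM(CASE WHEN Paid = 'True'  THEN Total_Sale ELSE 0 END) AS [Revenue Collected]
     , SUM(CASE WHEN Paid = 'False' THEN Total_Sale ELSE 0 END) AS [Revenue Uncollected]
     , SUM(Total_Sale) AS [Revenue Total]
FROM @Sales
GROUP BY CustomerID
ORDER BY CustomerID;
```

One difference from the table-variable approach is worth noting: customer 2 in this sample has no unpaid invoices and still appears here with a zero, whereas linking the table variables with equality conditions in the WHERE clause drops any customer that is missing from one of them.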


Use of Non-Partitioned OVER() Statement

Now back to the % of total problem and different solutions. Run these two queries to see how the denominator works. The first query has an empty (non partitioned) OVER() statement so you receive the total for the entire table. The second query does partition the dataset so the totals are partitioned into different sections (called windows)

USE [AdventureWorksDW2012];
SELECT SUM(COUNT(Gender)) OVER () AS [Count]
FROM [dbo].[DimEmployee]

USE AdventureWorksDW2012;
SELECT Gender, SUM(COUNT(Gender)) OVER (PARTITION BY Gender) AS [Count]
FROM [dbo].[DimEmployee]
GROUP BY Gender

USE AdventureWorks2012;
SELECT Gender, COUNT(Gender) AS [Gender Count], FORMAT(100 * COUNT(*) / SUM(COUNT(Gender)) OVER (), 'N2') AS [Percent]
FROM [HumanResources].[Employee]
GROUP BY Gender

This methodology provides the same results as the prior methodologies which used sub-queries or variables. The denominator uses an OVER () statement to sum the count for the entire dataset.

This is the meaning of OVER (): the empty parentheses instruct the query engine to aggregate the specified column over the entire resultset. Elsewhere we will place PARTITION BY and a column name in the parentheses, and the query engine will perform the calculation for groups based on the specified column. For example, OVER (PARTITION BY productCategory) would perform calculations within each value of the product category column, i.e., calculate % of total within each product category (starting over for each new product category).

The count for each gender is calculated using the GROUP BY for all values of the Gender column.
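To see the two kinds of denominator side by side, here is a small sketch on made-up data. The @Employees table variable, its Department column, and the output column names are assumptions for illustration; they are not part of AdventureWorks.

```sql
-- Hypothetical sample data
DECLARE @Employees TABLE (EmployeeID int, Gender char(1), Department varchar(20));

INSERT INTO @Employees (EmployeeID, Gender, Department)
VALUES (1, 'F', 'Production'), (2, 'M', 'Production'), (3, 'M', 'Production'),
       (4, 'M', 'Production'), (5, 'F', 'Sales'),      (6, 'F', 'Sales'),
       (7, 'M', 'Sales'),      (8, 'F', 'Sales');

SELECT Department, Gender, COUNT(*) AS [Employees]
     -- empty OVER (): one window covering every grouped row (the grand total)
     , SUM(COUNT(*)) OVER () AS [All Employees]
     -- PARTITION BY: a separate window, and a separate total, per department
     , SUM(COUNT(*)) OVER (PARTITION BY Department) AS [Department Employees]
     , FORMAT(1.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 'P2') AS [% of All]
     , FORMAT(1.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY Department), 'P2') AS [% of Department]
FROM @Employees
GROUP BY Department, Gender
ORDER BY Department, Gender;
```

The inner COUNT(*) is evaluated per GROUP BY row, and the outer SUM(...) OVER (...) then adds those counts up across the window. The [All Employees] column repeats the same grand total on every row, while [Department Employees] changes only when the department changes.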


Here is a more realistic example using the OVER() statement

USE [Featherman_Analytics];
SELECT [Sub Category], [Model], SUM([OrderQuantity]) AS [Qty]
--The * 1.0 forces decimal division; dividing one integer quantity by another would discard the decimals.
, FORMAT(100 * SUM([OrderQuantity]) * 1.0 / SUM(SUM([OrderQuantity])) OVER (), 'N2') AS [% of Category]
, FORMAT(100 * SUM([OrderQuantity]) * 1.0 / SUM(SUM([OrderQuantity])) OVER (PARTITION BY [Sub Category]), 'N2') AS [% of Sub-Category]
FROM [dbo].[FactResellerSales] AS rs
INNER JOIN [dbo].[AW_Products_Flattened] AS p ON p.[ProductKey] = rs.[ProductKey]
WHERE [Category] = 'Bikes'
GROUP BY [Sub Category], [Model]
ORDER BY [Sub Category]

Notice we have filtered the data to show only the Bikes.

In the [% of Category] line of code, values for a new field are calculated: each model’s units sold as a proportion of the entire Bikes category. In the numerator, the sum of the order quantity field is calculated based on the GROUP BY values. Next the denominator is calculated, which is the sum of all the sums of the groups, aka the grand total. OVER () with empty parentheses performs the summing calculation over the entire dataset.

Next the calculation for each row of the new column is multiplied by 100 to aid formatting, and FORMAT(..., 'N2') is used to limit the number of decimal places.


The second calculated field that uses a % of total is [% of Sub-Category]. Its syntax differs from the [% of Category] line in a subtle but important way: the % of sub-category rather than category is calculated, using a windowing function. This is an example of further sub-dividing the dataset, so you can’t use an overall value; you need a total for each grouping. Notice that the partitioning field named in the OVER () clause must also appear in the GROUP BY clause (and usually in the SELECT list, so you can see which group each row belongs to).

Notice the specific PARTITION BY fieldname rather than the empty (): this is where the boundaries of the % of total are controlled and calculated.

You can list several field names after PARTITION BY in the OVER () clause to force different calculations. There are many more powerful T-SQL window functions, which will be covered in more depth in a different document.
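Here is a brief sketch of partitioning on more than one field. The window total restarts whenever any of the listed fields changes, which matters once the dataset holds more than one category. The @Sales table variable and its rows are invented for illustration.

```sql
-- Hypothetical sample data with two categories
DECLARE @Sales TABLE ([Category] varchar(20), [Sub Category] varchar(20), [Model] varchar(20), [Qty] int);

INSERT INTO @Sales ([Category], [Sub Category], [Model], [Qty])
VALUES ('Bikes', 'Road Bikes', 'Road-A', 30),
       ('Bikes', 'Road Bikes', 'Road-B', 10),
       ('Bikes', 'Mountain Bikes', 'Mountain-A', 20),
       ('Accessories', 'Helmets', 'Sport Helmet', 25),
       ('Accessories', 'Helmets', 'Kids Helmet', 15);

SELECT [Category], [Sub Category], [Model], SUM([Qty]) AS [Qty]
     -- one partition field: totals restart for each category
     , FORMAT(1.0 * SUM([Qty]) / SUM(SUM([Qty])) OVER (PARTITION BY [Category]), 'P2') AS [% of Category]
     -- two partition fields: totals restart for each category / sub-category pair
     , FORMAT(1.0 * SUM([Qty]) / SUM(SUM([Qty])) OVER (PARTITION BY [Category], [Sub Category]), 'P2') AS [% of Sub-Category]
FROM @Sales
GROUP BY [Category], [Sub Category], [Model]
ORDER BY [Category], [Sub Category], [Model];
```

Listing [Category] together with [Sub Category] guards against two categories that happen to share a sub-category name; with the fields named here, every partition field also appears in the GROUP BY.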


A more in-depth example of using OVER(Partition BY) to calculate % of total

This next resultset shows that you can calculate % of total in different ways. Please observe that three different % calculations are performed using different PARTITION BY specifications. Code is below.

USE [Featherman_Analytics];

SELECT [Category],[Sub Category], [Model], [Product], SUM([OrderQuantity]) AS [Qty Sold]

--Here we calculate the % of total based on the entire dataset - one entire product category. Because the data is filtered to a single category, this is the same as leaving the OVER clause blank, as in OVER ()
, FORMAT(100 * SUM([OrderQuantity]) * 1.0 / SUM(SUM([OrderQuantity])) OVER (PARTITION BY [Category]), 'N2') AS [% Category]
--Here we calculate the % of total based on one model (there are different size and color combinations for each bicycle model).
, FORMAT(100 * SUM([OrderQuantity]) * 1.0 / SUM(SUM([OrderQuantity])) OVER (PARTITION BY [Model]), 'N2') AS [% of Model]
--Here we calculate the % of total based on one sub-category (there are three sub-categories within the Bikes product category: Mountain Bikes, Road Bikes, and Touring Bikes).
, FORMAT(100 * SUM([OrderQuantity]) * 1.0 / SUM(SUM([OrderQuantity])) OVER (PARTITION BY [Sub Category]), 'N2') AS [% of Sub-Category]
FROM [dbo].[FactResellerSales] AS rs
INNER JOIN [dbo].[AW_Products_Flattened] AS p ON p.[ProductKey] = rs.[ProductKey]
WHERE [Category] = 'Bikes'
--Remember all the non-aggregate fields from the SELECT clause need to go into the GROUP BY statement.
GROUP BY [Category], [Sub Category], [Model], [Product]
ORDER BY [Category], [Sub Category], [Model], [Product]


This resultset showed that you can calculate % of total in different ways. Three different % calculations were performed using different PARTITION BY specifications:

a) % of entire dataset – here is % of product category which has been filtered to just bikes

b) % of sub-category – here there are 3 sub-categories, road bikes, touring bikes and mountain bikes

c) % of product model – there are many different size/color combinations for each model.

So can you hand-calculate these formulas in Excel being ever so careful to choose the correct cells? Yes but that would be laborious and error-prone!


Appendix

This is the query that creates the flattened products hierarchy. Notice the commented-out INTO line: when uncommented, it takes the columns specified and copies them into the named table. If you want the query to work, you need to have access to AdventureWorksDW2012 or have the three products tables in the hierarchy copied into a database that is accessible. Because the purpose of analytics is to quickly retrieve values and shape resultsets, it is fine to introduce duplicated values (to denormalize) into a data table. For example, the values in the category and sub-category columns are duplicated.

USE [AdventureWorksDW2012];

SELECT [ProductKey], [EnglishProductCategoryName] as [Category],[EnglishProductSubcategoryName] as [Sub Category],[ProductAlternateKey] as [Part #],[ModelName] as [Model], [EnglishProductName] as [Product], [Color], [StandardCost] as [Cost], [DealerPrice] as [Dealer Price], [ListPrice] as [Web Price]

--INTO [Featherman_Analytics].[dbo].[AW_Products_Flattened]

FROM [dbo].[DimProduct] AS p
INNER JOIN [dbo].[DimProductSubcategory] AS sc ON sc.[ProductSubcategoryKey] = p.[ProductSubcategoryKey]
INNER JOIN [dbo].[DimProductcategory] AS c ON c.[ProductcategoryKey] = sc.[ProductcategoryKey]
WHERE [FinishedGoodsFlag] = 1
ORDER BY [EnglishProductCategoryName], [EnglishProductSubcategoryName]

Notice that great care is taken to rename columns. If you refer back to the first two queries in this document, you can see how much easier a query is to write when you don’t have to produce inner joins.

In production, you would need to refresh the values in the newly created AW_Products_Flattened table on a regular schedule; otherwise queries that analyze product data from it may no longer be up to date (products could get added to or dropped from the product lines).

