
V 1.0

DBMAN3

Group By, Having
Cube, Rollup
OLTP vs OLAP
Data analysis

szabo.zsolt@nik.uni-obuda.hu


SELECT: displayed order of suffixes
1. INTO
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. UNION/MINUS
7. INTERSECT
8. ORDER BY


Grouping/Aggregate functions
• SUM - Sum
• AVG - Average
• MIN - Minimum
• MAX - Maximum
• COUNT - Number of non-null values (records)
• GROUP_CONCAT - Concatenated list of elements
• STDDEV - Standard deviation
• VARIANCE - Variance


Non-grouping usage
• select avg(sal) as Average from emp;
• select min(sal) from emp;
• select min(sal) from emp where sal>2000;
• select avg(distinct sal) as Average from emp;
• select count(sal) from emp;
• select count(comm) from emp where sal>2000;
• select comm from emp where sal>2000;
• select count(*) from emp where sal>2000;
• select avg(comm) from emp;
NULL values are not included!
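The NULL rule above can be checked with a small runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for Oracle/MySQL, and the four rows are made-up sample data in the style of the emp table (both are assumptions, not part of the slides).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer, comm integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [
        ("ALLEN", 1600, 300),
        ("WARD", 1250, 500),
        ("JONES", 2975, None),  # no commission: stored as NULL
        ("TURNER", 1500, 0),    # a commission of 0 is NOT NULL
    ],
)

# count(*) counts rows; count(comm) and avg(comm) skip the NULL row.
count_all, count_comm, avg_comm, avg_comm_zero = conn.execute(
    "select count(*), count(comm), avg(comm), avg(ifnull(comm, 0)) from emp"
).fetchone()
print(count_all, count_comm, avg_comm, avg_comm_zero)
```

Replacing NULL with 0 before averaging (ifnull here, nvl in Oracle) changes the divisor from 3 to 4, so the two averages differ.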


Grouping
• select distinct deptno from emp;
• select avg(sal) from emp where deptno=10;
• select avg(sal) from emp where deptno=20;
• select avg(sal) from emp where deptno=30;
• The same in a single query: select deptno, avg(sal) from emp group by deptno;


Grouping
IN THE SELECTION LIST (FIELD LIST) ONLY THE GROUPED FIELD(s) AND THE GROUPING FUNCTION(s) ARE ALLOWED!
(YES, IN MYSQL AS WELL!!!) (ONLY_FULL_GROUP_BY)

• select deptno, avg(sal) as Average, min(sal) as Minimum, count(*) as Num from emp group by deptno;
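A minimal runnable sketch of the query above, assuming SQLite (through Python's sqlite3 module) in place of Oracle/MySQL and five made-up sample rows. The select list contains only the grouped field and grouping functions, as the rule requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer, deptno integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [
        ("CLARK", 2450, 10), ("MILLER", 1300, 10),
        ("SMITH", 800, 20), ("ADAMS", 1100, 20),
        ("JAMES", 950, 30),
    ],
)

# One result row per department: average, minimum and head count.
per_dept = conn.execute(
    "select deptno, avg(sal) as Average, min(sal) as Minimum, count(*) as Num "
    "from emp group by deptno order by deptno"
).fetchall()
print(per_dept)
```

Note that SQLite itself does not enforce the rule: like MySQL without ONLY_FULL_GROUP_BY, it would accept a non-grouped field such as ename in the select list and return an arbitrary value for it.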


Grouping and suffixes
• select mgr, avg(sal) from emp group by mgr;
• select ifnull(mgr, "none") as boss, lpad(avg(sal), 15, '#') as "Averagesal" from emp group by mgr;
• HAVING vs. WHERE
• select mgr, avg(sal) from emp where ename like '%E%' group by mgr;
• select mgr, avg(sal) from emp where ename like '%E%' group by mgr having avg(sal)>1300;
• select mgr, avg(sal) as average from emp where ename like '%E%' group by mgr having avg(sal)>1300 order by average desc;
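The difference between WHERE and HAVING can be seen in a small sketch: WHERE filters records before grouping, HAVING filters the groups afterwards. The example assumes SQLite via Python's sqlite3 module and seven made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, mgr integer, sal integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [
        ("ALLEN", 7698, 1600), ("WARD", 7698, 1250), ("TURNER", 7698, 1500),
        ("JAMES", 7698, 950), ("JONES", 7839, 2975), ("BLAKE", 7839, 2850),
        ("MILLER", 7782, 1300),
    ],
)

# WHERE drops WARD (no 'E' in the name) before the averages are computed.
where_only = conn.execute(
    "select mgr, avg(sal) from emp where ename like '%E%' "
    "group by mgr order by mgr"
).fetchall()

# HAVING then drops whole groups whose average is not above 1300.
where_and_having = conn.execute(
    "select mgr, avg(sal) as average from emp where ename like '%E%' "
    "group by mgr having avg(sal) > 1300 order by average desc"
).fetchall()
print(where_only)
print(where_and_having)
```

The group of manager 7782 has an average of exactly 1300, so it survives the WHERE step but is removed by HAVING.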


More complex grouping queries
• select min(max(sal)), max(max(sal)), round(avg(max(sal))) from emp group by deptno;
  -- In Oracle this works, in MySQL: "Invalid use of group function"
• select min(sal+nvl(comm,0)), mod(empno,3) from emp group by mod(empno,3) having min(sal+nvl(comm,0)) > 800;
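Where nested grouping functions are rejected, one possible workaround is a sub-query: the inner query computes max(sal) per department, the outer one aggregates those values. This is a sketch under assumptions (SQLite via Python's sqlite3 module, six made-up rows, and the hypothetical names m and dept_max), not the method used on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer, deptno integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [
        ("KING", 5000, 10), ("MILLER", 1300, 10),
        ("FORD", 3000, 20), ("SMITH", 800, 20),
        ("BLAKE", 2850, 30), ("JAMES", 950, 30),
    ],
)

# Inner query: one maximum per department (5000, 3000, 2850).
# Outer query: minimum, maximum and rounded average of those maxima.
result = conn.execute(
    "select min(m), max(m), round(avg(m)) "
    "from (select max(sal) as m from emp group by deptno) dept_max"
).fetchone()
print(result)
```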


More complex grouping queries
• select distinct job, substr(job, 2, 1) from emp;
• select avg(sal) as average, substr(job, 2, 1) from emp group by substr(job, 2, 1);
• select ename, sal, round(sal/1000) from emp;
• select round(sal/1000) as SalCat, count(sal) as Num from emp group by round(sal/1000);
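Grouping by a computed expression can be tried with the salary-category query. The sketch assumes SQLite via Python's sqlite3 module and six made-up rows. One adjustment is needed there: SQLite divides two integers as integers, so the sketch writes sal / 1000.0 to get the fractional division the slide's query relies on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?)",
    [
        ("SMITH", 800), ("WARD", 1250), ("ALLEN", 1600),
        ("BLAKE", 2850), ("JONES", 2975), ("KING", 5000),
    ],
)

# The grouping key is the salary rounded to the nearest thousand.
categories = conn.execute(
    "select round(sal / 1000.0) as SalCat, count(sal) as Num "
    "from emp group by round(sal / 1000.0) order by SalCat"
).fetchall()
print(categories)
```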


More complex grouping queries (MySQL)
• select ename, round(datediff(curdate(), hiredate)/365.25) as diff from emp;
• select count(*), round(datediff(curdate(), hiredate)/365.25) as diff from emp group by round(datediff(curdate(), hiredate)/365.25);


More complex grouping queries (Oracle)
• select ename, hiredate, (to_char(sysdate, 'YYYY')-to_char(hiredate, 'YYYY')) as diff from emp;
• select count(*), (to_char(sysdate, 'YYYY')-to_char(hiredate, 'YYYY')) as diff from emp group by (to_char(sysdate, 'YYYY')-to_char(hiredate, 'YYYY'));
• OR: we could use months_between()


More complex grouping queries
• select distinct deptno, job from emp;
• select deptno, job, avg(sal), min(sal), max(sal) from emp group by deptno, job order by deptno, job;

Oracle "extras" (not available in every DBMS; MySQL, for example, only offers WITH ROLLUP):
– GROUP BY GROUPING SETS
– GROUP BY CUBE
– GROUP BY ROLLUP


GROUP BY
• Group by, Having: one-field use is "trivial", e.g. average salary for job or department
• Multiple fields: complex grouping, e.g. average salary for job AND department
• Still: only the grouped fields and the grouping functions are allowed in the selection list!!!


SELECT job, deptno, avg(sal) FROM emp GROUP BY job, deptno;

JOB           DEPTNO   AVG(SAL)
--------- ---------- ----------
CLERK             10       1300
MANAGER           10       2450
PRESIDENT         10       5000
ANALYST           20       3000
CLERK             20        950
MANAGER           20       2975
CLERK             30        950
MANAGER           30       2850
SALESMAN          30       1400


SELECT mgr, job, deptno, avg(sal) FROM emp GROUP BY job, deptno, mgr;

       MGR JOB           DEPTNO   AVG(SAL)
---------- --------- ---------- ----------
      7839 MANAGER           30       2850
      7839 MANAGER           10       2450
      7782 CLERK             10       1300
      7698 SALESMAN          30       1400
      7839 MANAGER           20       2975
      7902 CLERK             20        800
      7698 CLERK             30        950
           PRESIDENT         10       5000
      7566 ANALYST           20       3000
      7788 CLERK             20       1100


DISADVANTAGES OF A SINGLE GROUP BY
• Not flexible enough
• One grouping per query, thus multiple queries are needed even if the groupings are similar → slower
• Aim: one query, multiple groupings → GROUPING SETS
• SELECT job, deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (job, deptno) );


NVL – Type matching!
• SELECT nvl(mgr, 'Nope'), deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno) );
  -- type mismatch: mgr is a number, 'Nope' is a string
• SELECT nvl(to_char(mgr), 'Nope'), deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno) );
  -- both arguments are strings
• SELECT nvl(mgr, 0), deptno, avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno) );
  -- both arguments are numbers


GROUP BY GROUPING SETS
• We can define multiple groupings inside one query, sub-results can be cached
• E.g. performing an MGR, DEPTNO and a JOB, DEPTNO grouping in ONE query:
  SELECT nvl(mgr, 0), deptno, nvl(job, 'Nope'), avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job));
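To see what the two grouping sets produce, the query can be emulated as the UNION ALL of two plain GROUP BY queries, with NULL in the field that a grouping does not use. This is only a sketch of the semantics: it assumes SQLite via Python's sqlite3 module (SQLite has no GROUPING SETS) and five made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (mgr integer, deptno integer, job text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?, ?, ?)",
    [
        (7698, 30, "SALESMAN", 1600), (7698, 30, "SALESMAN", 1250),
        (7698, 30, "CLERK", 950), (7839, 30, "MANAGER", 2850),
        (7839, 10, "MANAGER", 2450),
    ],
)

# GROUPING SETS ((mgr, deptno), (deptno, job)) emulated branch by branch.
rows = conn.execute(
    "select mgr, deptno, null as job, avg(sal) from emp group by mgr, deptno "
    "union all "
    "select null, deptno, job, avg(sal) from emp group by deptno, job"
).fetchall()
for row in rows:
    print(row)
```

The emulation scans the table once per branch, which is exactly the repeated work that a single GROUPING SETS query gives the database a chance to avoid.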


GROUP BY GROUPING SETS

• SELECT nvl(mgr, 0), nvl(deptno,0), nvl(job, 'NO'), avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr));

• SELECT nvl(mgr, 0), nvl(deptno,0), nvl(job, 'NO'), avg(sal) FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), ());

Why do we have 0 for the mgr value ???


GROUPING
• Using the special GROUPING "grouping function" we can determine whether a given field is used for the grouping in a record
• Grouping function: allowed in the selection list
• Special: it can only work with a grouped field!


GROUPING
0 = TRUE ?
• With a simple single-field or multi-field GROUP BY it always returns 0
• SELECT job, avg(sal), grouping(job) FROM emp GROUP BY job;
• SELECT deptno, job, avg(sal), grouping(job) FROM emp GROUP BY job, deptno;
• With grouping sets: grouping = 0 means that the field is used in the grouping for that record, grouping = 1 means that it is not (and the field shows NULL)


GROUPING

• SELECT mgr, deptno, job, avg(sal), GROUPING(mgr) as GMGR, GROUPING(deptno) as GDEPTNO, GROUPING(job) as GJOB FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), ());
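The point of GROUPING is to tell a NULL that comes from the data (KING has no manager) apart from a NULL that marks an unused grouping field. The sketch below emulates GROUPING SETS ( (mgr), () ) with a UNION ALL and a hand-written 0/1 flag standing in for GROUPING(mgr). SQLite via Python's sqlite3 module and the three sample rows are assumptions; SQLite has no GROUPING function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, mgr integer, sal integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [("KING", None, 5000), ("CLARK", 7839, 2450), ("MILLER", 7782, 1300)],
)

# Branch 1 groups by mgr (flag 0), branch 2 is the grand total (flag 1).
rows = conn.execute(
    "select mgr, avg(sal), 0 as GMGR from emp group by mgr "
    "union all "
    "select null, avg(sal), 1 from emp"
).fetchall()

# Two rows show NULL in mgr; only the flag tells them apart.
null_rows = [r for r in rows if r[0] is None]
print(null_rows)
```

Without the flag, nvl(mgr, 0) would turn both of these rows into mgr = 0.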


GROUPING
• SELECT
    CASE WHEN GROUPING(mgr)=0 THEN mgr ELSE 0 END as MGR,
    CASE WHEN GROUPING(deptno)=0 THEN deptno ELSE 0 END as DEPTNO,
    CASE WHEN GROUPING(job)=0 THEN job ELSE 'NO' END as JOB,
    avg(sal)
  FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), ());


GROUPING_ID
• Unique identifier for each possible grouping column configuration
• SELECT mgr, deptno, job, avg(sal), GROUPING_ID(mgr, deptno, job) as GID FROM emp GROUP BY GROUPING SETS ( (mgr, deptno), (deptno, job), (mgr), ());
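GROUPING_ID packs the individual GROUPING flags into one number, with the first argument as the most significant bit. The following plain-Python sketch computes the expected GID for each grouping set of the query above; the function grouping_id is a hypothetical helper written for illustration, not a database API.

```python
def grouping_id(args, grouping_set):
    """Combine GROUPING() flags into one number (first argument = highest bit)."""
    value = 0
    for column in args:
        bit = 0 if column in grouping_set else 1  # 1 = field not used in this grouping
        value = value * 2 + bit
    return value


args = ("mgr", "deptno", "job")
grouping_sets = [("mgr", "deptno"), ("deptno", "job"), ("mgr",), ()]
gids = {gs: grouping_id(args, gs) for gs in grouping_sets}
for gs, gid in gids.items():
    print(gs, gid)
```

For example, in the (deptno, job) grouping only mgr is unused, which gives the bits 1 0 0 and therefore GID 4.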


GROUP BY GROUPING SETS: DRAWBACKS
• Too complicated, too long
• When do we need a query with three totally different grouping sets? What kind of caching can we do here?
• Usually, there are hierarchical relations between the grouping fields → more meaning, more caching → ROLLUP and CUBE
• GROUPING and GROUPING_ID can be used the same way


CUBE
• GROUP BY CUBE (a, b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( ))
• CUBE(field1, field2): the two fields have the same rank, all combinations are shown
• CUBE(job, deptno): in addition to the simple two-field grouping, we get the job averages, the department averages, and the total average
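The expansion of CUBE into grouping sets is simply the list of all subsets of the given fields. A short plain-Python sketch (the helper name cube is made up for illustration) enumerates them:

```python
from itertools import combinations


def cube(columns):
    """Return every subset of the columns, largest first, as CUBE would group."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


sets_abc = cube(("a", "b", "c"))
sets_job_deptno = cube(("job", "deptno"))
print(sets_abc)
print(sets_job_deptno)
```

n fields give 2^n grouping sets, so the three-field cube yields 8 groupings and the two-field cube yields 4.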


SELECT job, deptno, avg(sal) FROM emp GROUP BY CUBE(job, deptno);


ROLLUP
• GROUP BY ROLLUP (a, b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a), ( ))
• ROLLUP(field1, field2): the first field is hierarchically more important, we only take the combinations where it is used (plus the grand total)
• ROLLUP(job, deptno): in addition to the simple two-field grouping, we get the job averages and the total average


SELECT job, deptno, avg(sal) FROM emp GROUP BY ROLLUP(job, deptno);

JOB           DEPTNO   AVG(SAL)
--------- ---------- ----------
CLERK             10       1300
MANAGER           10       2450
PRESIDENT         10       5000
ANALYST           20       3000
CLERK             20        950
MANAGER           20       2975
CLERK             30        950
MANAGER           30       2850
SALESMAN          30       1400
ANALYST                    3000
CLERK                    1037.5
MANAGER              2758.33333
PRESIDENT                  5000
SALESMAN                   1400
                     2073.21429
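The result above can be reproduced without ROLLUP support by writing out the three groupings (job and deptno, job only, grand total) as a UNION ALL. The sketch assumes SQLite via Python's sqlite3 module, and the 14 rows are the values of the classic emp sample table, which is an assumption about the data behind the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, job text, deptno integer, sal integer)")
conn.executemany(
    "insert into emp values (?, ?, ?, ?)",
    [
        ("SMITH", "CLERK", 20, 800), ("ALLEN", "SALESMAN", 30, 1600),
        ("WARD", "SALESMAN", 30, 1250), ("JONES", "MANAGER", 20, 2975),
        ("MARTIN", "SALESMAN", 30, 1250), ("BLAKE", "MANAGER", 30, 2850),
        ("CLARK", "MANAGER", 10, 2450), ("SCOTT", "ANALYST", 20, 3000),
        ("KING", "PRESIDENT", 10, 5000), ("TURNER", "SALESMAN", 30, 1500),
        ("ADAMS", "CLERK", 20, 1100), ("JAMES", "CLERK", 30, 950),
        ("FORD", "ANALYST", 20, 3000), ("MILLER", "CLERK", 10, 1300),
    ],
)

# ROLLUP(job, deptno) = GROUPING SETS ((job, deptno), (job), ())
rows = conn.execute(
    "select job, deptno, avg(sal) from emp group by job, deptno "
    "union all "
    "select job, null, avg(sal) from emp group by job "
    "union all "
    "select null, null, avg(sal) from emp"
).fetchall()

grand_total = [r for r in rows if r[0] is None][0][2]
print(len(rows), grand_total)
```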


MIXTURE OF GROUPINGS
• GROUP BY a, CUBE (b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a, c), (a) )
• GROUP BY a, ROLLUP (b, c) = GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a) )


OLTP? OLAP?
• OLTP = Online Transaction Processing
• OLAP = Online Analytical Processing
• OLTP
  – product » price
  – invoice » amount
  – client » name
• OLAP
  – Product category × Region » Gross margin
  – Product × Warehouse » Inventory
  – Supplier × Time × Product » Return rate
  – Tables are usually a result of grouping!


OLTP vs OLAP

                OLTP                                 OLAP
Application     Operational: ERP, CRM, legacy apps   Management Information System, Decision Support System
Typical users   Staff                                Managers, Executives
Horizon         Weeks, Months                        Years
Refresh         Immediate                            Periodic
Data model      Entity-relationship                  Multi-dimensional
Schema          Normalized                           Star
Emphasis        Update                               Retrieval


Star data model?
• The supervisor that gave the most discounts?
• The quantity shipped on a particular date, month, year or quarter?
• In which zip code did product A sell the most?


OLAP rules
• Automated data transfer
  – Extract data from OLTP system(s)
  – Transform/standardize, if necessary
  – Import to OLAP database
  – Build cubes (GROUP BY!)
  – Produce reports
• Drilling
  – Drill down: region → city → district
  – Drill up: city → region → country
  – Drill across: north region → south region → west region


OLAP vs Group by
• Every dimension can be a result of a group by query
• Every data cube will be a result of group by queries
• One problem: missing/bad data points → we need trends and projections!


SELECT: order of suffixes
1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. UNION/MINUS
6. INTERSECT
7. ORDER BY
8. INTO


BASIC PROBLEMS
• Functions: in the selection list
• Order by, group by: always executed after functions, so we might need sub-queries
• ROWNUM s*cks (later...)
• Solution: special functions that can work together with the ordering / grouping of records


RANK FUNCTIONS
• SELECT ROW_NUMBER() OVER (ORDER BY ENAME ASC) AS RNUM, ENAME FROM EMP;
• Simple rank functions:
  RANK()          1, 2, 2, 4
  DENSE_RANK()    1, 2, 2, 3
  PERCENT_RANK()  percentage, [0..1]
• NO PARAMETERS!
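The three rank functions can be compared side by side on a list with one tie. The sketch assumes SQLite 3.25 or newer (which added window functions) through Python's sqlite3 module, and four made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?)",
    [("KING", 5000), ("FORD", 3000), ("SCOTT", 3000), ("JONES", 2975)],
)

# FORD and SCOTT tie on salary, which is where the functions differ.
rows = conn.execute(
    "select ename, sal, "
    "rank() over (order by sal desc), "
    "dense_rank() over (order by sal desc), "
    "percent_rank() over (order by sal desc) "
    "from emp order by sal desc, ename"
).fetchall()

ranks = [r[2] for r in rows]
dense_ranks = [r[3] for r in rows]
percent_ranks = [r[4] for r in rows]
print(ranks, dense_ranks, percent_ranks)
```

RANK leaves a gap after the tie, DENSE_RANK does not, and PERCENT_RANK is (rank - 1) / (number of rows - 1).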


LET'S TRY THOSE…
• SELECT ename, sal,
    RANK() OVER (ORDER BY sal desc)
  FROM emp;
• + DENSE_RANK(), PERCENT_RANK()


RANK WITHIN A GROUP
• SELECT deptno, ename, sal,
    RANK() OVER (PARTITION BY deptno ORDER BY sal) as RANG
  FROM emp;
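With PARTITION BY the ranking restarts in every department. A minimal sketch, assuming SQLite 3.25 or newer via Python's sqlite3 module and five made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (deptno integer, ename text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [
        (10, "MILLER", 1300), (10, "CLARK", 2450), (10, "KING", 5000),
        (20, "SMITH", 800), (20, "ADAMS", 1100),
    ],
)

# The rank is computed separately inside each deptno partition.
rows = conn.execute(
    "select deptno, ename, sal, "
    "rank() over (partition by deptno order by sal) as RANG "
    "from emp order by deptno, sal"
).fetchall()
for row in rows:
    print(row)
```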


RANK WITHIN A GROUP
• SELECT deptno, job, ename, sal,
    RANK() OVER (PARTITION BY deptno, job ORDER BY sal) as RANG
  FROM emp;
• + ORDER BY …


GROUPING FUNCTIONS WITH ANALYTICAL CLAUSES
• SELECT ename, sal, SUM(SAL) OVER (order by sal) as MySAL FROM emp;
• Ordered list!
• SELECT ename, sal, AVG(SAL) OVER (order by sal) as MySAL FROM emp;
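A grouping function with OVER (order by ...) becomes a running total over the ordered list. The sketch below assumes SQLite 3.25 or newer via Python's sqlite3 module and five made-up rows, two of them with the same salary to show how ties behave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?)",
    [("SMITH", 800), ("JAMES", 950), ("WARD", 1250), ("MARTIN", 1250), ("ALLEN", 1600)],
)

# Running total: each row sums every salary up to and including its own value.
rows = conn.execute(
    "select ename, sal, sum(sal) over (order by sal) as MySAL "
    "from emp order by sal, ename"
).fetchall()
for row in rows:
    print(row)
```

MARTIN and WARD get the same running total, because with only ORDER BY the default window extends to all rows that tie with the current one.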


GROUPING FUNCTIONS WITH ANALYTICAL CLAUSES
• SELECT deptno, ename, sal,
    SUM(SAL) OVER (partition by deptno order by ename) as MySum
  FROM emp
  ORDER BY deptno, ename;


GROUPING FUNCTIONS WITH ANALYTICAL CLAUSES
• alter session set nls_date_format='YYYY-MM-DD';
• select ename, hiredate, sal from emp order by hiredate;
• select ename, hiredate, sal, sum(sal) over (order by hiredate) as TOTAL from emp order by hiredate;
• select ename, hiredate, sal, sum(sal) over (partition by to_char(hiredate, 'YYYY') order by hiredate) as TOTAL from emp order by hiredate;


SUBSET (Sliding window)
• SELECT ename, sal,
    avg(SAL) OVER (order by sal rows between 1 preceding and 2 following) as MyAvg
  FROM emp;
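The sliding window can be traced by hand on a short list: each average covers the previous row, the current row and the next two rows, as far as they exist. The sketch assumes SQLite 3.25 or newer via Python's sqlite3 module and five made-up rows with distinct salaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?)",
    [("SMITH", 800), ("JAMES", 950), ("ADAMS", 1100), ("MILLER", 1300), ("TURNER", 1500)],
)

# Window = 1 row before, the current row, 2 rows after (cut off at the edges).
rows = conn.execute(
    "select ename, sal, "
    "avg(sal) over (order by sal rows between 1 preceding and 2 following) as MyAvg "
    "from emp order by sal"
).fetchall()
averages = [r[2] for r in rows]
print(averages)
```

The first row has no preceding row, so its average covers only 800, 950 and 1100; the last row has no following rows, so it averages 1300 and 1500.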


SUBSET (Sliding window)
• SELECT deptno, ename, sal,
    sum(SAL) OVER (partition by deptno order by sal rows between 0 preceding and 1 following) as MySum
  FROM emp;


SUBSET (Sliding window)
• We can use the RANGE keyword
• SELECT deptno, ename, sal,
    sum(SAL) OVER (order by sal range between current row and unbounded following) as MySum
  FROM emp;
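RANGE and ROWS differ when the ordering field has ties: RANGE treats all rows with the same value as one "current row", ROWS counts physical rows. The sketch below contrasts the two, assuming SQLite 3.25 or newer via Python's sqlite3 module and four made-up rows; the ROWS variant adds ename to the ordering only to make the order of the tied rows deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?)",
    [("SMITH", 800), ("WARD", 1250), ("MARTIN", 1250), ("ALLEN", 1600)],
)

rows = conn.execute(
    "select ename, sal, "
    "sum(sal) over (order by sal "
    "  range between current row and unbounded following) as range_sum, "
    "sum(sal) over (order by sal, ename "
    "  rows between current row and unbounded following) as rows_sum "
    "from emp order by sal, ename"
).fetchall()
for row in rows:
    print(row)
```

With RANGE both 1250 rows see the same sum (1250 + 1250 + 1600); with ROWS the second of them no longer includes the first.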


OTHER ANALYTICAL FUNCTIONS
• FIRST_VALUE(), LAST_VALUE()
• RATIO_TO_REPORT(): ratio compared to the sum value
  SELECT ename, sal, RATIO_TO_REPORT(sal) OVER () FROM emp ORDER BY sal desc;
• + PARTITION BY
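Where RATIO_TO_REPORT is not available, the same ratio can be written as the value divided by a windowed SUM over the whole result. The sketch assumes SQLite 3.25 or newer via Python's sqlite3 module (which has no RATIO_TO_REPORT) and three made-up rows; the column name ratio is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?)",
    [("KING", 5000), ("FORD", 3000), ("CLARK", 2000)],
)

# sum(sal) over () is the total of all rows; 1.0 forces fractional division.
rows = conn.execute(
    "select ename, sal, sal * 1.0 / sum(sal) over () as ratio "
    "from emp order by sal desc"
).fetchall()
ratios = [r[2] for r in rows]
print(ratios)
```

Adding PARTITION BY inside the OVER () clause would give the ratio within each partition instead of within the whole table.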
