Upload
leslie-small
View
220
Download
0
Embed Size (px)
Citation preview
Denormalizing Data with PROC SQL
Demoralizing Data withPROC SQL
Is that a real word?? Spell Check…
Grocery Store
What You HaveCUSTOMER_ID NAME
1 Ansel
2 Fiona
3 James
4 Kathy
5 Ying
6 Otto
7 Costas
8 Abdul
9 Enrico
10 Mitzu
ITEM_ID ITEM_NAME
1 eggs
2 milk
3 bread
4 chicken
5 beef
6 broccoli
7 carrots
8 apples
9 peaches
10 dog food
CUSTOMER_ID ITEM_ID
1 1
1 2
1 4
1 6
2 2
2 3
2 4
2 7
2 8
… …10 4
10 5
10 8
10 10
• Relational tables in normal form
• Purchase events in many-to-many relation
• Good for relational storage, bad for computing stats
• Data step: join tables using complex merge
What you wantCUSTOMER_ID NAME EGGS MILK BREAD … DOG_FOOD
1 Ansel 1 1 0 … 0
2 Fiona 1 1 1 … 1
3 James 0 1 0 … 0
4 Kathy 1 0 1 … 0
5 Ying 1 1 1 … 1
6 Otto 0 0 1 … 1
7 Costas 0 0 1 … 1
8 Abdul 0 1 0 … 0
9 Enrico 1 0 1 … 1
10 Mitzu 0 1 0 … 1
• Matrix shows who bought what• One row per customer• One column per item• Easy to compute stats
How do you get this?
A few SQL examplesto build up to a solution…
SQL Examples
What items did customer #1 buy?
select item_id from purchaseswhere customer_id = 1;
ITEM_ID---------- 1 2 4 6
What items did customer #1 buy?Join with grocery table to get item name
select P.item_id, G.item_name fromfrom purchases P, groceries Gwhere G.item_id = P.item_idand P.customer_id = 1;
ITEM_ID ITEM_NAME---------- --------- 1 eggs 2 milk 4 chicken 6 broccoli
How many customers bought eggs?
Use SQL aggregate function count().
select count(*)
from purchases P, groceries G
where P.item_id = G.item_id
and G.item_name = 'eggs'
COUNT(*)
--------
5
Did customer #1 buy eggs?
Restrict by customer, count() function returns 0 or 1, i.e., yes or no
select count(*)
from groceries G, purchases P
where P.item_id = G.item_id
and G.item_name = 'eggs'
and P.customer_id = 1;
COUNT(*)
--------
1
Did customer #10 buy eggs?
select count(*)
from groceries G, purchases P
where P.item_id = G.item_id
and G.item_name = 'eggs'
and P.customer_id = 10;
COUNT(*)
--------
0
Subqueries• In SQL, select clause can include a query that
returns a scalar value
select name, (select count(*) from purchases) num_purchasesfrom customers
NAME NUM_PURCHASES----------- -------------Ansel 58Fiona 58James 58Kathy 58Ying 58Otto 58Costas 58Abdul 58Enrico 58Mitzu 58
Correlated Subqueries• Relate inner and outer queries via alias
select name, (select count(*)
from purchaseswhere customer_id = C.customer_id) num_purchases
from customers C;
NAME NUM_PURCHASES----------- -------------Ansel 4Fiona 9James 6Kathy 3Ying 8Otto 7Costas 7Abdul 2Enrico 7Mitzu 5
Putting the pieces together
Joins to get data from multiple tables
Count() to get 0/1, yes/no
Correlated subqueries rotate rows to columns
Aliases to name columns
Final queryselect customer_id, name, (select count(*) from purchases P, groceries G where G.item_id = P.item_id and G.item_name = 'eggs' and P.customer_id = C.customer_id) eggs,
(select count(*) from purchases P, groceries G where G.item_id = P.item_id and G.item_name = 'milk' and P.customer_id = C.customer_id) milk,
(select count(*) from purchases P, groceries G where G.item_id = P.item_id and G.item_name = 'bread' and P.customer_id = C.customer_id) breadfrom customers C;
SAS Codeproc sql;
create table Work.Purchase_Matrix as select
customer_id,name,(select count(*) from purchases P, groceries G where G.item_id = P.item_id and G.item_name = 'eggs' and P.customer_id = C.customer_id) eggs,
(select count(*) from purchases P, groceries G where G.item_id = P.item_id and G.item_name = 'milk' and P.customer_id = C.customer_id) milk,
(select count(*) from purchases P, groceries G where G.item_id = P.item_id and G.item_name = 'bread' and P.customer_id = C.customer_id) bread
from customers C;
quit;
Final Dataset
CUSTOMER_ID NAME EGGS MILK BREAD
----------- ----------- ---------- ---------- ----------
1 Ansel 1 1 0
2 Fiona 1 1 1
3 James 0 1 0
4 Kathy 1 0 1
5 Ying 1 1 1
6 Otto 0 0 1
7 Costas 0 0 1
8 Abdul 0 1 0
9 Enrico 1 0 1
10 Mitzu 0 1 0
Resources
SQL may not always be the most appropriate choice for a given problem. This technique starts to get untenable as the number of columns needed in the output increases.
DATA Step vs. PROC SQL: What’s a neophyte to do?
http://www2.sas.com/proceedings/sugi29/269-29.pdf
Proc SQL versus The Data Step
http://www.nesug.org/proceedings/nesug06/hw/hw06.pdf
Questions?