Open World 2014 CON3450 Oracle Database 12c Row Pattern Matching - Beating the Best Pre-12c Solutions
Database 12c Row Pattern Matching
Beating the Best Pre-12c Solutions [CON3450]
Stew ASHTON
Oracle OpenWorld 2014
Photo Opportunity
• Presentation available on http://www.slideshare.net/stewashton/row-patternmatching12coow14
• For exact link:
– See @StewAshton on Twitter
– Or see http://stewashton.wordpress.com

Agenda
• Who am I?
• Pre-12c solutions compared to row pattern matching with MATCH_RECOGNIZE
– For all sizes of data
– Thinking in patterns
• Watch out for “catastrophic backtracking”
• Other things to keep in mind (time permitting)
OOW CON3450, Stew Ashton 3
Who am I?
• 33 years in IT
– Developer, Technical Sales Engineer, Technical Architect
– Aeronautics, IBM, Finance
– Mainframe, client-server, Web apps
• 25 years as an American in Paris
• 9 years using Oracle database
– Performance analysis
– Replace Java with SQL
• 2 years as internal “Oracle Development Expert”
1) “Fixed Difference”
• Identify and group rows with consecutive values
• My presentation: print slides to keep
• Math: subtract known consecutives
– If A-1 = B-2 then A = B-1
– Else A <> B-1
– Consecutive becomes equality, non-consecutive becomes inequality
• “Consecutive” = fixed difference of 1

PAGE: 1, 2, 3, 5, 6, 7, 10, 11, 12, 42
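To try the queries on this example, the sample PAGE values can be loaded into a small table. This is only a sketch: the slides never show how T was built, so the column type and the use of SYS.ODCINUMBERLIST as a convenient row source are assumptions. The second statement shows the fixed-difference idea directly: PAGE minus its row number stays constant within a run of consecutive pages.

```sql
-- Hypothetical setup: the slides do not show the DDL for T.
CREATE TABLE t (page NUMBER NOT NULL);

INSERT INTO t (page)
SELECT column_value
FROM TABLE(sys.odcinumberlist(1, 2, 3, 5, 6, 7, 10, 11, 12, 42));

COMMIT;

-- PAGE and ROW_NUMBER() both rise by 1 inside a consecutive run,
-- so their difference is the same for every row of that run.
SELECT page,
       ROW_NUMBER() OVER (ORDER BY page) AS rn,
       page - ROW_NUMBER() OVER (ORDER BY page) AS grp_id
FROM t
ORDER BY page;
```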
1) Pre-12c
select min(page) firstpage,
       max(page) lastpage,
       count(*) cnt
FROM (
  SELECT page,
         page - Row_Number() over(order by page) as grp_id
  FROM t
)
GROUP BY grp_id;
PAGE  [RN]  GRP_ID
1     1     0
2     2     0
3     3     0
5     4     1
6     5     1
7     6     1
10    7     3
11    8     3
12    9     3
42    10    32

FIRSTPAGE  LASTPAGE  CNT
1          3         3
5          7         3
10         12        3
42         42        1
Think “match a row pattern”
• PATTERN
– Uninterrupted series of input rows
– Described as a list of conditions (“regular expressions”)
PATTERN (A B*)
"A" : 1 row, "B" : 0 or more rows, as many as possible
• DEFINE each row condition
[A undefined = TRUE]
B AS page = PREV(page)+1
• Each series that matches the pattern is a “match”
– "A" and "B" identify the rows that meet their conditions
Input, Processing, Output
1. Define input
2. Order input
3. Process pattern
4. using defined conditions
5. Output: rows per match
6. Output: columns per row
7. Go where after match?

Clauses listed in the processing order above (for reading only; the syntax requires the order used in the second query):
SELECT *
FROM t
MATCH_RECOGNIZE (
  ORDER BY page
  PATTERN (A B*)
  DEFINE B AS page = PREV(page)+1
  ONE ROW PER MATCH
  MEASURES A.page firstpage,
    LAST(page) lastpage,
    COUNT(*) cnt
  AFTER MATCH SKIP PAST LAST ROW
);

The same query with the clauses in the order the syntax requires:
SELECT *
FROM t
MATCH_RECOGNIZE (
  ORDER BY page
  MEASURES A.page firstpage,
    LAST(page) lastpage,
    COUNT(*) cnt
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A B*)
  DEFINE B AS page = PREV(page)+1
);
1) Run_Stats comparison
For one million rows:
“Latches” are serialization devices: fewer means more scalable

Stat                      Pre 12c  Match_R  Pct
Latches                   4090     4079     100%
Elapsed Time              5.51     5.56     101%
CPU used by this session  5.5      5.55     101%
1) Execution Plans

Pre-12c:
Id  Operation               Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
0   SELECT STATEMENT              1               400K    00:00:01.83  1594
1    HASH GROUP BY                1       1000K   400K    00:00:01.83  1594     41M   5035K  40M (0)
2     VIEW                        1       1000K   1000K   00:00:12.69  1594
3      WINDOW SORT                1       1000K   1000K   00:00:03.46  1594     22M   1749K  20M (0)
4       TABLE ACCESS FULL   T     1       1000K   1000K   00:00:02.53  1594

MATCH_RECOGNIZE:
Id  Operation                                         Name  Starts  E-Rows  A-Rows  A-Time       Buffers  OMem  1Mem   Used-Mem
0   SELECT STATEMENT                                        1               400K    00:00:03.45  1594
1    VIEW                                                   1       1000K   400K    00:00:03.45  1594
2     MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO        1       1000K   400K    00:00:01.87  1594     22M   1749K  20M (0)
3      TABLE ACCESS FULL                              T     1       1000K   1000K   00:00:02.09  1594
2) “Start of Group”
• Identify group boundaries, often using LAG()
• 3 steps instead of 2:
1. For each row: if start of group, assign 1, else assign 0
2. Running total of 1s and 0s produces a group identifier
3. Group by the group identifier
2) Requirement
Merge contiguous date ranges in same group

GROUP_NAME  EFF_DATE          TERM_DATE
X           2014-01-01 00:00  2014-02-01 00:00
X           2014-03-01 00:00  2014-04-01 00:00
X           2014-04-01 00:00  2014-05-01 00:00
X           2014-06-01 00:00  2014-06-01 01:00
X           2014-06-01 01:00  2014-06-01 02:00
X           2014-06-01 02:00  2014-06-01 03:00
Y           2014-06-01 03:00  2014-06-01 04:00
Y           2014-06-01 04:00  2014-06-01 05:00
Y           2014-07-03 08:00  2014-09-29 17:00
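The date ranges above can be loaded for experimentation with a script like the one below. It is a sketch under assumptions: the slides do not show the DDL, and the column names START_TS and END_TS are taken from the queries in this section, on the assumption that they correspond to EFF_DATE and TERM_DATE in the listing. If a table named T already exists from the earlier example, it would need to be dropped or this one renamed.

```sql
-- Hypothetical DDL: column names follow the queries, not the listing headers.
CREATE TABLE t (
  group_name VARCHAR2(10),
  start_ts   DATE,
  end_ts     DATE
);

INSERT INTO t VALUES ('X', DATE '2014-01-01', DATE '2014-02-01');
INSERT INTO t VALUES ('X', DATE '2014-03-01', DATE '2014-04-01');
INSERT INTO t VALUES ('X', DATE '2014-04-01', DATE '2014-05-01');
INSERT INTO t VALUES ('X', TO_DATE('2014-06-01 00:00', 'YYYY-MM-DD HH24:MI'),
                           TO_DATE('2014-06-01 01:00', 'YYYY-MM-DD HH24:MI'));
INSERT INTO t VALUES ('X', TO_DATE('2014-06-01 01:00', 'YYYY-MM-DD HH24:MI'),
                           TO_DATE('2014-06-01 02:00', 'YYYY-MM-DD HH24:MI'));
INSERT INTO t VALUES ('X', TO_DATE('2014-06-01 02:00', 'YYYY-MM-DD HH24:MI'),
                           TO_DATE('2014-06-01 03:00', 'YYYY-MM-DD HH24:MI'));
INSERT INTO t VALUES ('Y', TO_DATE('2014-06-01 03:00', 'YYYY-MM-DD HH24:MI'),
                           TO_DATE('2014-06-01 04:00', 'YYYY-MM-DD HH24:MI'));
INSERT INTO t VALUES ('Y', TO_DATE('2014-06-01 04:00', 'YYYY-MM-DD HH24:MI'),
                           TO_DATE('2014-06-01 05:00', 'YYYY-MM-DD HH24:MI'));
INSERT INTO t VALUES ('Y', TO_DATE('2014-07-03 08:00', 'YYYY-MM-DD HH24:MI'),
                           TO_DATE('2014-09-29 17:00', 'YYYY-MM-DD HH24:MI'));

COMMIT;
```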
Steps 1 and 2 (start-of-group flag, then running total):

GROUP_NAME  START_TS     END_TS       GRP_START  GRP_ID
X           01-01 00:00  02-01 00:00  1          1
X           03-01 00:00  04-01 00:00  1          2
X           04-01 00:00  05-01 00:00  0          2
X           06-01 00:00  06-01 01:00  1          3
X           06-01 01:00  06-01 02:00  0          3
X           06-01 02:00  06-01 03:00  0          3
Y           06-01 03:00  06-01 04:00  1          1
Y           06-01 04:00  06-01 05:00  0          1
Y           07-03 08:00  09-29 17:00  1          2

Step 3 (group by the group identifier):

GROUP_NAME  START_TS     END_TS
X           01-01 00:00  02-01 00:00
X           03-01 00:00  05-01 00:00
X           06-01 00:00  06-01 03:00
Y           06-01 03:00  06-01 05:00
Y           07-03 08:00  09-29 17:00
with grp_starts as (
  select a.*,
    case when start_ts = lag(end_ts) over(
      partition by group_name order by start_ts
    ) then 0 else 1 end grp_start
  from t a
), grps as (
  select b.*,
    sum(grp_start) over(
      partition by group_name order by start_ts
    ) grp_id
  from grp_starts b
)
select group_name,
  min(start_ts) start_ts,
  max(end_ts) end_ts
from grps
group by group_name, grp_id;
2) Match_Recognize
SELECT * FROM t
MATCH_RECOGNIZE(
  PARTITION BY group_name
  ORDER BY start_ts
  MEASURES A.start_ts start_ts,
    end_ts end_ts,
    next(start_ts) - end_ts gap
  PATTERN(A B*)
  DEFINE B AS start_ts = prev(end_ts)
);

New this time:
• Added PARTITION BY
• MEASURES added gap using row outside the match!
• ONE ROW PER MATCH and SKIP PAST LAST ROW are the defaults

One solution replaces two methods: simple!
Which row do we mean?

Expression          DEFINE                        MEASURES, ALL ROWS…           MEASURES, ONE ROW…
start_ts            current row                   current row                   last row of match
FIRST(start_ts)     first row of match            first row of match            first row of match
LAST(end_ts)        current row                   current row                   last row of match
FINAL LAST(end_ts)  ORA-62509                     last row of match             last row of match
B.start_ts          most recent B row             most recent B row             last B row
PREV(), NEXT()      physical offset from referenced row (in all three cases)
COUNT(*)            from first to current row     from first to current row     all rows in match
COUNT(B.*)          B rows including current row  B rows including current row  all B rows
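The difference between running and final semantics is easiest to see side by side with ALL ROWS PER MATCH. The query below is a sketch against the date-range table from this section (columns group_name, start_ts, end_ts); the measure aliases are made up for illustration.

```sql
SELECT *
FROM t
MATCH_RECOGNIZE (
  PARTITION BY group_name
  ORDER BY start_ts
  MEASURES
    FIRST(start_ts)    AS first_start,  -- first row of the match on every row
    LAST(end_ts)       AS running_end,  -- running by default: the current row
    FINAL LAST(end_ts) AS final_end,    -- last row of the match on every row
    COUNT(*)           AS running_cnt,  -- rows from first to current
    FINAL COUNT(*)     AS final_cnt     -- all rows in the match
  ALL ROWS PER MATCH
  PATTERN (A B*)
  DEFINE B AS start_ts = PREV(end_ts)
);
```

With ONE ROW PER MATCH the running and final versions of each measure return the same value, because the only row evaluated is the last row of the match.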
2) Run_Stats comparison
For 500,000 rows:

Stat                      Pre 12c  Match_R  Pct
Latches                   10165    8066     79%
Elapsed Time              32.16    20.58    64%
CPU used by this session  31.94    19.67    62%

2) Execution Plans

Pre-12c:
Operation                 Used-Mem
SELECT STATEMENT
 HASH GROUP BY            20M (0)
  VIEW
   WINDOW BUFFER          32M (0)
    VIEW
     WINDOW SORT          27M (0)
      TABLE ACCESS FULL

MATCH_RECOGNIZE:
Operation                                         Used-Mem
SELECT STATEMENT
 VIEW
  MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO  27M (0)
   TABLE ACCESS FULL
2) Predicate pushing
Select * from <view> where group_name = 'X'

Operation                                         Name  A-Rows  Buffers
SELECT STATEMENT                                        3       4
 VIEW                                                   3       4
  MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO        3       4
   TABLE ACCESS BY INDEX ROWID BATCHED            T     6       4
    INDEX RANGE SCAN                              TI    6       3
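A setup that could produce a plan of this shape is sketched below. The view name MERGED_RANGES_V and the column list of index TI are assumptions; the slides only show the index name and the filter on group_name. Whether the optimizer actually pushes the predicate below the MATCH RECOGNIZE step should be confirmed from the plan on your own system.

```sql
-- Hypothetical index definition: only the name TI appears in the slides.
CREATE INDEX ti ON t (group_name, start_ts);

-- Hypothetical view wrapping the MATCH_RECOGNIZE query.
CREATE OR REPLACE VIEW merged_ranges_v AS
SELECT *
FROM t
MATCH_RECOGNIZE (
  PARTITION BY group_name
  ORDER BY start_ts
  MEASURES FIRST(start_ts) AS start_ts,
           LAST(end_ts)    AS end_ts
  PATTERN (A B*)
  DEFINE B AS start_ts = PREV(end_ts)
);

-- The filter is on the PARTITION BY column, so it can be applied
-- before pattern matching without changing any match.
SELECT /*+ gather_plan_statistics */ *
FROM merged_ranges_v
WHERE group_name = 'X';

-- Show the actual plan of the statement just executed.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'ALLSTATS LAST'));
```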
3) “Bin fitting”: fixed size
• Requirement
– Order by study_site
– Put in “bins” with size = 65,000 max

STUDY_SITE  CNT    STUDY_SITE  CNT
1001        3407   1026        137
1002        4323   1028        6005
1004        1623   1029        76
1008        1991   1031        4599
1011        885    1032        1989
1012        11597  1034        3427
1014        1989   1036        879
1015        5282   1038        6485
1017        2841   1039        3
1018        5183   1040        1105
1020        6176   1041        6460
1022        2784   1042        968
1023        25865  1044        471
1024        3734   1045        3360

FIRST_SITE  LAST_SITE  SUM_CNT
1001        1022       48081
1023        1044       62203
1045        1045       3360
SELECT s first_site, MAX(e) last_site, MAX(sm) sum_cnt
FROM (
  SELECT s, e, cnt, sm FROM t
  MODEL
    DIMENSION BY (row_number() over(order by study_site) rn)
    MEASURES (study_site s, study_site e, cnt, cnt sm)
    RULES (
      sm[rn > 1] = CASE
        WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
        THEN cnt[cv()]
        ELSE sm[cv() - 1] + cnt[cv()]
      END,
      s[rn > 1] = CASE
        WHEN sm[cv() - 1] + cnt[cv()] > 65000 OR cnt[cv()] > 65000
        THEN s[cv()]
        ELSE s[cv() - 1]
      END
    )
)
GROUP BY s;

• DIMENSION with row_number orders data and processing
• rn can be used like a subscript
• cv() means current row
• cv()-1 means previous row
SELECT * FROM t
MATCH_RECOGNIZE (
  ORDER BY study_site
  MEASURES FIRST(study_site) first_site,
    LAST(study_site) last_site,
    SUM(cnt) sum_cnt
  PATTERN (A+)
  DEFINE A AS SUM(cnt) <= 65000
);

New this time:
• PATTERN (A+) replaces (A B*), means 1 or more rows
• Why? In previous examples I used PREV(), which returns NULL on the first row.

One solution replaces 3 methods: simpler!
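To watch the bins fill up row by row, the same pattern can be run with ALL ROWS PER MATCH. This is a sketch against the table t of this section (columns study_site and cnt); the aliases bin_no and running_sum are made up for illustration.

```sql
SELECT *
FROM t
MATCH_RECOGNIZE (
  ORDER BY study_site
  MEASURES
    MATCH_NUMBER() AS bin_no,       -- one match per bin
    SUM(cnt)       AS running_sum   -- running total inside the bin
  ALL ROWS PER MATCH
  PATTERN (A+)
  DEFINE A AS SUM(cnt) <= 65000     -- the sum includes the row being tested
);
```

One point to keep in mind with this condition: a single row whose cnt exceeds 65000 can never satisfy A, so it belongs to no match and is not returned, whereas the MODEL rules place such a row in a bin of its own.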
3) Run_Stats comparison
For one million rows:

Stat                      Pre 12c  Match_R  Pct
Latches                   357448   4622     1%
Elapsed Time              32.85    2.9      9%
CPU used by this session  31.31    2.88     9%

3) Execution Plans

Pre-12c:
Id  Operation               Used-Mem
0   SELECT STATEMENT
1    HASH GROUP BY          7534K (0)
2     VIEW
3      SQL MODEL ORDERED    105M (0)
4       WINDOW SORT         27M (0)
5        TABLE ACCESS FULL

MATCH_RECOGNIZE:
Id  Operation                                         Used-Mem
0   SELECT STATEMENT
1    VIEW
2     MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO  27M (0)
3      TABLE ACCESS FULL
4) “Bin fitting”: fixed number
• Requirement
– Distribute values in 3 “bins” as equally as possible
• “Best fit decreasing”
– Sort values in decreasing order
– Put each value in least full bin

Name  Val  Val (sorted)  BIN1  BIN2  BIN3
1     1    10            10
2     2    9             10    9
3     3    8             10    9     8
4     4    7             10    9     15
5     5    6             10    15    15
6     6    5             15    15    15
7     7    4             19    15    15
8     8    3             19    18    15
9     9    2             19    18    17
10    10   1             19    18    18
4) Brilliant pre 12c solution
SELECT bin, Max (bin_value) bin_value
FROM (
  SELECT * FROM items
  MODEL
    DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn)
    MEASURES (
      item_name,
      item_value,
      Row_Number() OVER (ORDER BY item_value DESC) bin,
      item_value bin_value,
      Row_Number() OVER (ORDER BY item_value DESC) rn_m,
      0 min_bin,
      Count(*) OVER () - 3 - 1 n_iters
    )
    RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) (
      min_bin[1] = Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn<= 3],
      bin[ITERATION_NUMBER + 3 + 1] = min_bin[1],
      bin_value[min_bin[1]] = bin_value[CV()] + Nvl(item_value[ITERATION_NUMBER+4], 0)
    )
)
WHERE item_name IS NOT NULL
group by bin;
SELECT * from items
MATCH_RECOGNIZE (
  ORDER BY item_value desc
  MEASURES sum(bin1.item_value) bin1,
    sum(bin2.item_value) bin2,
    sum(bin3.item_value) bin3
  PATTERN ((bin1|bin2|bin3)+)
  DEFINE bin1 AS count(bin1.*) = 1
      OR sum(bin1.item_value)-bin1.item_value
         <= least( sum(bin2.item_value), sum(bin3.item_value) ),
    bin2 AS count(bin2.*) = 1
      OR sum(bin2.item_value)-bin2.item_value
         <= sum(bin3.item_value)
);

• ()+ = 1 or more of whatever is inside
• '|' = alternatives, “preferred in the order specified”
• Bin1 condition:
– No rows here yet,
– Or this bin least full
• Bin2 condition:
– No rows here yet, or
– This bin less full than 3
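To see which bin each item lands in, the same pattern can be run with ALL ROWS PER MATCH and CLASSIFIER(). The setup is a sketch under assumptions: the slides do not show how ITEMS was created, so this builds ten items named 1 to 10 whose value equals their name, like the sample values in the walk-through table.

```sql
-- Hypothetical setup: ten items, item_value equal to item_name.
CREATE TABLE items AS
SELECT LEVEL AS item_name, LEVEL AS item_value
FROM dual
CONNECT BY LEVEL <= 10;

SELECT item_name, item_value, bin
FROM items
MATCH_RECOGNIZE (
  ORDER BY item_value DESC
  MEASURES CLASSIFIER() AS bin   -- pattern variable the row was mapped to
  ALL ROWS PER MATCH
  PATTERN ((bin1|bin2|bin3)+)
  DEFINE bin1 AS COUNT(bin1.*) = 1
      OR SUM(bin1.item_value) - bin1.item_value
         <= LEAST(SUM(bin2.item_value), SUM(bin3.item_value)),
    bin2 AS COUNT(bin2.*) = 1
      OR SUM(bin2.item_value) - bin2.item_value
         <= SUM(bin3.item_value)
)
ORDER BY item_value DESC;
```

While bin2 or bin3 is still empty its SUM is NULL, so the comparison is not true and the row falls through to the next alternative. That is how the first three values end up in three different bins.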
4) Run_Stats comparison
For 10,000 rows:

Stat                      Pre 12c  Match_R  Pct
Latches                   3124     47       2%
Elapsed Time              28       0.02     0%
CPU used by this session  26.39    0.03     0%

4) Execution Plans

Pre-12c:
Id  Operation               Used-Mem
0   SELECT STATEMENT
1    HASH GROUP BY          817K (0)
2     VIEW
3      SQL MODEL ORDERED    1846K (0)
4       WINDOW SORT         424K (0)
5        TABLE ACCESS FULL

MATCH_RECOGNIZE:
Id  Operation                 Used-Mem
0   SELECT STATEMENT
1    VIEW
2     MATCH RECOGNIZE SORT    330K (0)
3      TABLE ACCESS FULL
Backtracking
• What happens when there is no match???
• “Greedy” quantifiers: * + {2,}
– are not that greedy
– Take all the rows they can, BUT give rows back if necessary, one at a time
• Regular expression engines will test all possible combinations to find a match
Repeating conditions
select 'match' from (
  select level n from dual connect by level <= 100
)
match_recognize(
  pattern(a b* c)
  define b as n > prev(n),
    c as n = 0
);
Runs in 0.005 secs

select 'match' from (
  select level n from dual connect by level <= 100
)
match_recognize(
  pattern(a b* b* b* c)
  define b as n > prev(n),
    c as n = 0
);
Runs in 5.4 secs
Imprecise Conditions
CREATE TABLE Ticker (
  SYMBOL VARCHAR2(10),
  tstamp DATE,
  price NUMBER
);

insert into ticker
select 'ACME', sysdate + level/24/60/60, 10000-level
from dual
connect by level <= 5000;

STRT left undefined, so it is true for every row:
SELECT * FROM Ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES FIRST(tstamp) AS start_tstamp,
    LAST(tstamp) AS end_tstamp
  AFTER MATCH SKIP TO LAST UP
  PATTERN (STRT DOWN+ UP+ DOWN+ UP+)
  DEFINE DOWN AS price < PREV(price),
    UP AS price > PREV(price)
);
Runs in 24 seconds
INMEMORY: 13 seconds

STRT defined precisely:
SELECT * FROM Ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES FIRST(tstamp) AS start_tstamp,
    LAST(tstamp) AS end_tstamp
  AFTER MATCH SKIP TO LAST UP
  PATTERN (STRT DOWN+ UP+ DOWN+ UP+)
  DEFINE DOWN AS price < PREV(price),
    UP AS price > PREV(price),
    STRT AS price >= nvl(PREV(PRICE),0)
);
Runs in 0.02 seconds
Keep in Mind
• Backtracking
– Precise conditions
– Test data with no matches
• To debug:
– Measures classifier() cl, match_number() mn
– All rows per match with unmatched rows
• No DISTINCT, no LISTAGG
• MEASURES columns must have aliases
• “Reluctant quantifier” is written with ?, which JDBC treats as a bind variable
• “Pattern variables” are range variables, not bind variables
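The debugging tip can be sketched as follows, reusing the PAGE table from the first example (if T was since recreated with other columns, substitute your own table and ordering column). The pattern is changed to (A B+) here, purely so that an isolated page matches nothing and shows up as an unmatched row.

```sql
SELECT *
FROM t
MATCH_RECOGNIZE (
  ORDER BY page
  MEASURES
    CLASSIFIER()   AS cl,  -- pattern variable the row was mapped to
    MATCH_NUMBER() AS mn   -- sequence number of the match
  ALL ROWS PER MATCH WITH UNMATCHED ROWS
  PATTERN (A B+)           -- at least two consecutive pages
  DEFINE B AS page = PREV(page) + 1
);
```

Rows that belong to no match come back with NULL in cl and mn, which makes gaps in the matching easy to spot.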
Output Row “shape”

Per Match  PARTITION BY  ORDER BY  MEASURES  Other input
ONE ROW    X             omitted   X         omitted
ALL ROWS   X             X         X         X
ORA-00918, anyone?