3. 2014 MapR Technologies. e-book available courtesy of MapR (also
at the MapR booth): http://bit.ly/1jQ9QuL. A New Look at Anomaly
Detection, by Ted Dunning and Ellen Friedman, June 2014 (published by
O'Reilly).
4. Practical Machine Learning series (O'Reilly): Machine learning is
becoming mainstream. We need pragmatic approaches that take into
account real-world business settings: time to value; limited
resources; availability of data; expertise and cost of the team to
develop and to maintain the system. Look for approaches with big
benefits for the effort expended.
6. Let's Start with Trouble: Monty Hall problem (oops, done). Three
doors, one with a fabulous prize. You pick one. Monty shows you that
one of the remaining doors is empty. You can switch at this point to
the other door or not. Should you switch?
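One way to settle the question without arguing about the math is to simulate the game. The sketch below is not from the talk; it assumes Monty always opens an empty door that you did not pick, and the function names `play` and `win_rate` are made up for illustration.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends on the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: Monty always opens a door that is neither the
    # contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, trials=100_000, seed=0):
    """Fraction of simulated rounds won with the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Under these assumptions the staying strategy should win about one round in three and the switching strategy about two in three.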
10. The Real Problem: Doing the math isn't too hard. Convincing
somebody you have the right answer is really hard.
11. Live Coding With REAL Chaos
12. Geo-coding
13. Geo-coding: Some databases have disk locality that follows key
locality. The primary key is totally ordered. Embedding a total
ordering of the points in a plane is possible, but it loses some
distance information. A line is not a square! We want to do proximity
searches. This gets harder in the polar regions for most codings.
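As an illustration of embedding a total ordering in the plane, here is a minimal sketch of a Z-order (Morton) key, which interleaves the bits of two grid coordinates. The slides do not say which coding the talk used, so the choice of Z-order, the names `interleave` and `morton_key`, and the 16-bit grid are all assumptions.

```python
def interleave(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one integer key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code

def morton_key(lat, lon, bits=16):
    """Map (lat, lon) in degrees onto a grid cell and return its key."""
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave(x, y, bits)

if __name__ == "__main__":
    # Neighbours inside one small block get nearby keys ...
    print(interleave(0, 0), interleave(1, 0), interleave(0, 1), interleave(1, 1))
    # ... but cells (3, 3) and (4, 3) are adjacent in the plane and far
    # apart in key order.
    print(interleave(3, 3), interleave(4, 3))
```

The last two keys show the lost distance information: points that touch in the plane can land far apart on the line, so a proximity search generally has to scan several key ranges. In this sketch the grid is uniform in degrees, so cells also shrink in east-west extent toward the poles.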
24. 4-bit sine wave (listen for artifacts as volume decreases). White
dithering (artifacts gone, we hear through the noise). Noise shaping
(noise is easier to hear through).
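The effect of white dithering can be sketched numerically. The code below is an illustration rather than the demo from the talk: it assumes a uniform quantizer with the step size of a 16-level (4-bit) quantizer over [-1, 1], no clipping, and uniform dither one step wide. The names `quantize` and `averaged_error` are made up here. It does not cover noise shaping.

```python
import math
import random

def quantize(x, bits=4):
    """Round x to the nearest multiple of the quantizer step (no clipping)."""
    step = 2.0 / 2 ** bits
    return round(x / step) * step

def averaged_error(amplitude, dither, bits=4, repeats=2000, samples=64, seed=0):
    """RMS error of one sine period after averaging many quantized passes."""
    rng = random.Random(seed)
    step = 2.0 / 2 ** bits
    total_sq = 0.0
    for n in range(samples):
        x = amplitude * math.sin(2 * math.pi * n / samples)
        acc = 0.0
        for _ in range(repeats):
            # White dither: independent uniform noise, one step wide.
            noise = rng.uniform(-step / 2, step / 2) if dither else 0.0
            acc += quantize(x + noise, bits)
        total_sq += (acc / repeats - x) ** 2
    return math.sqrt(total_sq / samples)

if __name__ == "__main__":
    # A quiet sine, smaller than half a quantizer step.
    print("no dither:", averaged_error(0.05, dither=False))
    print("dither:   ", averaged_error(0.05, dither=True))
```

Without dither, a signal this quiet rounds to silence, so no amount of averaging recovers it. With dither, each quantized sample is noisy but unbiased, and the average converges toward the original sine.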
26. The Shape of the Noise [Figure: axes labeled Noise and Frequency]
27. The Effect After Averaging [Figure: horizontal axis labeled Time]
28. Thompson Sampling
29. Learning in the Real World: In the real world we get to pick our
training examples. Do we try this restaurant or not? Learning has real
and opportunity costs. Not learning has real and opportunity costs as
well. Every sub-optimal choice we make incurs regret. We would like to
minimize this. But we can't quantify regret without incurring regret!
30. An Example: Pick one of five options: purple, blue, green, red,
yellow. Each has a random payoff. If you pick a bad option,
regret = mean(best) - mean(yours). The best known algorithm uses
randomization. Best = minimal regret + minimal code complexity.
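The five-option example can be sketched with Thompson sampling, the randomized method named in the section title. The slides do not show the algorithm's code, so this is a minimal version under stated assumptions: payoffs are win/lose (Bernoulli), each option starts with a uniform Beta(1, 1) prior, and the payoff rates in the usage example are invented.

```python
import random

def thompson(true_means, steps=5000, seed=0):
    """Run Bernoulli Thompson sampling; return (pulls per option, regret)."""
    rng = random.Random(seed)
    k = len(true_means)
    wins = [0] * k
    losses = [0] * k
    best = max(true_means)
    regret = 0.0
    for _ in range(steps):
        # Sample a plausible payoff rate for each option from its
        # Beta posterior, then play the option whose sample is largest.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(k)]
        choice = max(range(k), key=draws.__getitem__)
        if rng.random() < true_means[choice]:
            wins[choice] += 1
        else:
            losses[choice] += 1
        # Regret as defined on the slide: mean(best) - mean(yours).
        regret += best - true_means[choice]
    pulls = [wins[i] + losses[i] for i in range(k)]
    return pulls, regret

if __name__ == "__main__":
    # Hypothetical payoff rates for purple, blue, green, red, yellow.
    pulls, regret = thompson([0.1, 0.2, 0.3, 0.4, 0.6])
    print("pulls per option:", pulls)
    print("total regret:", round(regret, 1))
```

The randomization does the exploring: options with little data have wide posteriors and still get sampled occasionally, while options that look clearly worse are tried less and less often.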
31. Demo: The Algorithm
32. Synthetic Data
33. Can You See the Problem?
    select IR.ENC_KEY, IR.ENCOUNTER_, IR.ETYPE, IR.bill_type,
      IR.CONTR_, IR.SOURCE_CD, IR.sub_source_cd, IR.HP_CD, IR.LOB_CD,
      IR.FDO, IR.TDOS, IR.member_Nbr, IR.HIC_NBR,
      IR.MEMBER_SOURCE_CD, IR.HDR_ERRCD, IR.HDR_ERRDESC,
      IR.PROVIDER_NBR, IR.provider_type, IR.PROVIDER_SOURCE_CD,
      IR.cms_provider_type, IR.SPEC_CD, IR.SPEC_DESC, IR.rev_cd,
      IR.rev_cd_desc, IR.proc_cd, IR.diag_cd, IR.DIAG_CD_KEY,
      IR.DIAGNOSIS_KEY, IR.rec_state_cd, IR.rec_status_cd,
      IR.DG_ERRCD, IR.DG_ERRDESC
    FROM (SELECT distinct
      enc.encounter_key as ENC_KEY,
      enc.encounter_nbr as ENCOUNTER_,
      typ.encounter_type_cd as ETYPE,
      bt.bill_type,
      cnt.contract_nbr as CONTR_,
      ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD,
      enc.new_min_dt as FDOS,
      substr(enc.new_max_dt, 1, 10) as TDOS,
      enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD,
      eerr.error_cd as HDR_ERRCD,
      eerr.ERROR_DESC as HDR_ERRDESC,
      enc.PROVIDER_NBR, prv.provider_type, prv.PROVIDER_SOURCE_CD,
      diag.cms_provider_type,
      sp.specialty_cd as SPEC_CD,
      sp.specialty_desc as SPEC_DESC,
      svc.rev_cd, rev.rev_cd_desc, svc.proc_cd,
      dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY,
      st.rec_state_cd, sts.rec_status_cd,
      derr.error_cd as DG_ERRCD,
      derr.error_desc as DG_ERRDESC
    FROM oicpcuhg.ir_encounter enc
34.
    INNER JOIN oicpcuhg.ir_encountertype typ
      ON (typ.encounter_type_key = enc.encounter_type_key)
    LEFT OUTER JOIN oicpcuhg.ir_billtype bt
      ON (bt.bill_type_key = enc.bill_type_key)
    LEFT OUTER JOIN oicpcuhg.ir_contract cnt
      ON (cnt.contract_key = enc.contract_key)
    LEFT OUTER JOIN oicpcuhg.ir_datasource ds
      ON (ds.source_key = enc.data_source_key)
    LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob
      ON (lob.lob_key = enc.lob_key)
    INNER JOIN oicpcuhg.ir_member m
      ON (m.hp_cd = enc.hp_cd
          AND m.member_source_cd = enc.member_source_cd
          AND m.member_nbr = enc.member_nbr)
    LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
      ON (eerror.encounter_key = enc.encounter_key
          and eerror.active_flg = 'Y')
    LEFT OUTER JOIN oicpcuhg.ir_error eerr
      ON (eerr.error_key = eerror.error_key)
    LEFT OUTER JOIN oicpcuhg.ir_provider prv
      ON (prv.hp_cd = enc.hp_cd
          and prv.provider_source_cd = enc.provider_source_cd
          and prv.provider_nbr = enc.provider_nbr)
35.
    LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp
      ON (esp.encounter_key = enc.encounter_key)
    LEFT OUTER JOIN oicpcuhg.ir_specialty sp
      ON (sp.specialty_key = esp.specialty_key)
    LEFT OUTER JOIN oicpcuhg.ir_service svc
      ON (svc.encounter_key = enc.encounter_key)
    LEFT OUTER JOIN oicpcuhg.ir_revenue rev
      ON (rev.rev_cd = svc.rev_cd)
    LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag
      ON (diag.encounter_key = enc.encounter_key)
    INNER JOIN oicpcuhg.ir_diagcd dgcd
      ON (dgcd.diag_cd_key = diag.diag_cd_key)
    INNER JOIN oicpcuhg.ir_recordstate st
      ON (st.rec_state_key = diag.rec_state_key)
    INNER JOIN oicpcuhg.ir_recordstatus sts
      ON (sts.rec_status_key = diag.rec_status_key)
    LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
      ON (derror.diagnosis_key = diag.diagnosis_key
          and derror.active_flg = 'Y')
    LEFT OUTER JOIN oicpcuhg.ir_error derr
      ON (derr.error_key = derror.error_key)) IR
    INNER JOIN oicpcuhg.umr_req_inbound umr
      ON (trim(umr.member_nbr) = IR.member_Nbr
          AND trim(umr.hhc_from_ccyymmdd) = IR.TDOS
          AND trim(umr.sub_mcare_mbr) = IR.HIC_NBR
          AND trim(umr.diag1) = IR.diag_cd)
36. One Attack: The customer can't give you the data. They can't
trust you, by law. But they can probably summarize the data: how many
columns, what types, perhaps statistical summaries.
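A summary of that kind is enough to drive a simple fake-data generator. The sketch below is only an illustration: the summary format, the function name `fake_rows`, and every column and statistic in `SUMMARY` are hypothetical, and each column is drawn independently from a simple distribution.

```python
import random

# Hypothetical summary a customer might share: column names, types and
# simple statistics only, never the rows themselves.
SUMMARY = [
    {"name": "member_nbr", "type": "int", "min": 100000, "max": 999999},
    {"name": "paid_amount", "type": "float", "mean": 250.0, "sd": 80.0},
    {"name": "bill_type", "type": "category",
     "freq": {"A": 0.7, "B": 0.2, "C": 0.1}},
]

def fake_rows(summary, n, seed=0):
    """Generate n fake rows whose columns follow the summary statistics."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col in summary:
            if col["type"] == "int":
                row[col["name"]] = rng.randint(col["min"], col["max"])
            elif col["type"] == "float":
                row[col["name"]] = rng.gauss(col["mean"], col["sd"])
            elif col["type"] == "category":
                values = list(col["freq"])
                weights = list(col["freq"].values())
                row[col["name"]] = rng.choices(values, weights=weights)[0]
            else:
                raise ValueError("unknown column type: " + col["type"])
        rows.append(row)
    return rows

if __name__ == "__main__":
    for row in fake_rows(SUMMARY, 3):
        print(row)
```

Because the columns are independent here, the output cannot reproduce join keys or correlations between fields, so a bug that depends on those relationships would need a richer generator.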
37. Bug Replication Without Security Violation [Diagram with labels:
Customer, You, Data, Fake Data]
38. The Upshot: So random numbers are useful. But simple
distributions, not so much. How can YOU generate cool data?
39. e-book available courtesy of MapR: http://bit.ly/1jQ9QuL. A New
Look at Anomaly Detection, by Ted Dunning and Ellen Friedman, June
2014 (published by O'Reilly).
40. Last October: Time Series Databases, by Ted Dunning and Ellen
Friedman, Oct 2014 (published by O'Reilly).
41. Coming in February: Real World Hadoop, by Ted Dunning and Ellen
Friedman, Feb 2015 (published by O'Reilly).
42. Thank you for coming today!