Make Accumulated Data in Companies Eloquent
by SQL Statement Constructors
IEEE BigData 2017 (Boston), Dec. 11
Toshiyuki Shimono, Digital Garage, Inc.
Work Contributions
1. For exploring an unknown DB:
➢ Organized the milestones.
➢ Conceived the methods.
2. Proposed a beneficial software tool for “Big Data”.
➢ No other such tool seems to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16] and [Kumar, Alencar 16].
3. Reduced the labor needed to understand a DB.
➢ By shrinking it from months to a week.
➢ “Knowing” latently dominates a data analysis project.
A similar slide appears again at the end.
I. Background (6 slides)
Background
• Many organizations have accumulated their own business data in recent years.
• In practice, however, their DBs are rarely well designed, so their data is far from being fully utilized.
The real situation as of 2017
The accumulated data is so big and complex. Which part of it should be taken for analysis?
Which tables are needed?
Which columns are needed?
Where are the meaningful date/user columns?
How many bytes will be exported?
What do the tables/columns mean?
How can the dates/customers be narrowed down?
How is damage to the exported data detected?
And the database system is so old that a meaningful “pre-analysis” is very difficult!
How can data scientists utilize data?
So many tables and columns. Difficulties occur in:
➢ knowing the meanings,
➢ reading the documents,
➢ discussing with the clients.
What is a good way to utilize the data sleeping in the database?
▲ Many tables, each entailing many columns!
1. Understanding the DB
2. Building a new environment for analysis
3. Serendipitous discovery in business by some advanced analysis
➢ Are the data effective to analyze?
➢ What are the special/error values?
➢ How are columns connected across tables?
Preliminary Knowledge
• A database is a collection of tables.
• A table is like an MS Excel sheet, with rows (records) and columns (attributes).
• Many databases are handled by SQL, such as MySQL, Oracle, PostgreSQL, and SQL Server.
➢ Some columns are connected by sharing the same coding system.
➢ Then how can one determine all of the connected columns?
◆ One needs to see the values of each column, but how?
Retouched. https://commons.wikimedia.org/wiki/File:Data_model_in_ER.png
II. Introduction of the novel software (10 slides)
• Current software doesn't cover today's needs.
• New software is necessary.
• So I created it for SQL-type DBs.
Assumptions about the environment
■ CLI (Command Line Interface)
■ to produce SQL statements.
■ to store the data.
■ to process the data.
■ SQL-type DB.
▲ Command Line Interface (CLI) ▲ SQL client software
SQL statements are entered here.
The SQL output appears here.
Complement: SQL statements
create table T (x numeric, y varchar, z date) ← Make a table.
insert into T values (2, 'abc', '2017-12-11') ← Add a record.
select * from T ← Output all the records of T.
select count(*) from T ← The number of records in T.
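The four statements above can be tried without a database server. The following is a minimal sketch that runs them against an in-memory SQLite database through Python's standard `sqlite3` module; SQLite is an assumption made here for portability (the slides name MySQL, Oracle, PostgreSQL and SQL Server), and `run_complement_statements` is a hypothetical helper name.

```python
import sqlite3

def run_complement_statements():
    """Run the four statements from the slide against an in-memory SQLite DB."""
    con = sqlite3.connect(":memory:")
    # Make a table.
    con.execute("create table T (x numeric, y varchar, z date)")
    # Add a record.
    con.execute("insert into T values (2, 'abc', '2017-12-11')")
    # Output all the records of T.
    rows = con.execute("select * from T").fetchall()
    # The number of records in T.
    count = con.execute("select count(*) from T").fetchone()[0]
    con.close()
    return rows, count

rows, count = run_complement_statements()
print(rows)
print(count)
```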
To get a “listed” result by SQL:
1. Prepare the SQL statement:
2. The table returned by the SQL statement:
This output is very useful, but the SQL statement is too long to enter manually.
So an SQL statement generator is desirable!
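To illustrate why a generator helps, here is a small sketch that builds one long statement listing the record count of every table. It is not the author's tool: `table_lines_sql` is a hypothetical function, the table names `users` and `orders` are made up, and SQLite is assumed only so that the generated statement can be executed on the spot. Table names are interpolated as-is, so they are assumed to come from a trusted list.

```python
import sqlite3

def table_lines_sql(tables):
    """Build one statement that lists the record count of every given table."""
    parts = [
        f"select {i} as seq, '{name}' as table_name, count(*) as n from {name}"
        for i, name in enumerate(tables, start=1)
    ]
    # One "select ... count(*)" per table, glued together with "union all".
    return "\nunion all\n".join(parts) + "\norder by seq"

# Toy database (made-up tables) to execute the generated statement.
con = sqlite3.connect(":memory:")
con.execute("create table users (id integer)")
con.execute("create table orders (id integer)")
con.executemany("insert into users values (?)", [(1,), (2,), (3,)])
con.execute("insert into orders values (10)")

sql = table_lines_sql(["users", "orders"])
print(sql)
listed = con.execute(sql).fetchall()
print(listed)
```

With dozens of tables the generated text grows by one line pair per table, which is exactly the part that is tedious to type by hand.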
Time flies. Time information is helpful.
Process-time and date-time information are attached.
A combination of 20 SQL statements yielded by a single command (using an option).
SQL statements are yielded in the CLI environment.
20 commands for many functions
Each command receives either:
(1) table names, or
(2) column names with their table names.
Each command utilizes option switches:
--help : shows the online manual.
-a, -b, -c, ..., -z : various minor functions.
The program functions
Program name          Function of the produced SQL statement(s)
serverInfo            SQL DB system version information.
tableLines            Counting the records of each table.
tableColumns          Column information of all tables.
sampleRows            Random sampling of rows.
minMax                Taking the min/max of each column. Also taking the 4 values.
mostFreq/FewId        Taking the most/least frequent values of each column.
distinctCount         Counting the distinct values of each column.
hasChar/nullCount     Counting the values with a specific character, or the null values.
byteTable/byteCol     Computing or estimating the byte size of each table or column.
vennTwo               Calculating how sets of values overlap.
newTable              Creating a table with ease.
hashSum               Summing numerically mapped SHA-1 values to compare tables.
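As an illustration of the kind of statement a command like vennTwo might produce, the sketch below builds SQL that counts how the distinct values of two columns overlap (left only, both, right only). This is an assumption about the idea, not the author's implementation: `venn_two_sql`, the tables `customers` and `orders`, and their columns are all hypothetical, and SQLite is used only to run the result.

```python
import sqlite3

def venn_two_sql(t1, c1, t2, c2):
    """Build a statement counting distinct values in c1 only, in both, and in c2 only."""
    a = f"select {c1} as v from {t1} where {c1} is not null"
    b = f"select {c2} as v from {t2} where {c2} is not null"
    # "except" and "intersect" already work on distinct values.
    return (
        "select\n"
        f"  (select count(*) from ({a} except {b})) as only_left,\n"
        f"  (select count(*) from ({a} intersect {b})) as in_both,\n"
        f"  (select count(*) from ({b} except {a})) as only_right"
    )

# Made-up data: two columns that appear to share one country-code system.
con = sqlite3.connect(":memory:")
con.execute("create table customers (country text)")
con.execute("create table orders (country_code text)")
con.executemany("insert into customers values (?)", [("JP",), ("US",), ("ZZZ",)])
con.executemany("insert into orders values (?)", [("JP",), ("JP",), ("US",), ("FR",)])

overlap = con.execute(
    venn_two_sql("customers", "country", "orders", "country_code")
).fetchone()
print(overlap)
```

A large `in_both` count relative to the other two suggests that the two columns share the same coding system.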
SQL generators demo (PowerPoint animation)
GitHub page (program repository)
Find the web page: github.com/tulamili
Both English and Japanese skills are necessary to use it. Sorry!
To improve the UI on the CLI:
• I have been creating commands one by one with the policy of “using 2 English words for a program”.
• To keep the namespace of Unix/Linux command names clean, the UI should be altered.
• The next step would be the style of using a command argument to specify the function.
Skip this page unless there is enough time.
III. Tricky functions to see the values of a DB (11 slides)
What is the most concise way to see the values?
➢ Column names don't usually tell whether the values are:
➢ substance names (man, woman, Japan, USA, ...), or
➢ coded values (1, 2, JP, US, ...).
➢ The column relations across tables are not easy to see.
➢ Knowing the special/error values takes craftwork.
Seeing the concrete values is a must.
Idea to get 4 values from each column
What is a good, simple method to get some typical values from a column, whether its data type is text, number, date or whatever?
(1) Color the values whose first character is the minimum character.
(2) From the colored values, extract the minimum and the maximum.
(3) From the uncolored values, extract the minimum and the maximum.
(4) Those 4 values *would* tell the column characteristics well ☺
The Venn diagram and SQL statements
All the values from a column C of a table T are split into two sets.
Set 1: all the values whose first character is the minimum.
select C from T where left(C,1) = (select min(left(C,1)) from T)
Set 2: all the others.
select C from T where left(C,1) != (select min(left(C,1)) from T)
v11, v12: the minimum and the maximum of set 1.
select min(C), max(C) from T where left(C,1) = (select min(left(C,1)) from T)
v21, v22: the minimum and the maximum of set 2.
select min(C), max(C) from T where left(C,1) != (select min(left(C,1)) from T)
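The four-value idea can be sketched as a small generator that returns v11, v12, v21 and v22 in one row. This is an illustration under stated assumptions: `four_values_sql` is a hypothetical name, SQLite is used to run it, and because SQLite has no `left()` function the sketch substitutes `substr(C, 1, 1)`. The sample values are made up.

```python
import sqlite3

def four_values_sql(table, column):
    """Build a statement returning the 4 values (v11, v12, v21, v22) of a column."""
    first = f"substr({column}, 1, 1)"          # stands in for left(C,1)
    lowest = f"(select min({first}) from {table})"
    return (
        "select\n"
        f"  (select min({column}) from {table} where {first} = {lowest}) as v11,\n"
        f"  (select max({column}) from {table} where {first} = {lowest}) as v12,\n"
        f"  (select min({column}) from {table} where {first} != {lowest}) as v21,\n"
        f"  (select max({column}) from {table} where {first} != {lowest}) as v22"
    )

# Made-up text column mixing special negative codes with ordinary values.
con = sqlite3.connect(":memory:")
con.execute("create table T (C text)")
con.executemany(
    "insert into T values (?)",
    [("-9990",), ("-5",), ("120",), ("35",), ("980",)],
)

four = con.execute(four_values_sql("T", "C")).fetchone()
two = con.execute("select min(C), max(C) from T").fetchone()
print(four)  # ('-5', '-9990', '120', '980')
print(two)   # ('-5', '980')
```

In this toy column the plain min/max pair never shows "-9990", while the four values expose both special codes next to the ordinary range.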
Are the 4 values enough to see a column?
Skip this page unless there is enough time.
• 2 values (e.g. min/max) would not work ☹.
• 4 values can be misleading with small probability, but so far it actually works well, as shown later.
• How about 5, 6 or more values?
  • The min/max from a 3rd set can be added.
  • Indeed good for seeing various/lengthy text values ☺.
  • But it stops being simple, requiring complex SQL.
  • Much computation time, as I once tried ☹.
Applied to all the columns
[Figure: the 4 values of every column. Annotations mark the time dimension, user dimension and weight dimension columns, and distinguish non-numeric order from numeric order.]
What if only 2 values?
Skip this page unless there is enough time.
Complement: SQL statement.
select min(C), max(C) from T
This table only assures that there exist at least 2 distinct values for each original column. (3 or 4 instead of 2 is desirable.)
Figure annotations:
• The meaningful minimum 2014-07-07 is hidden.
• The meaningful minimum “-9990” is hidden.
• The meaningful minimum “-5” is hidden.
• The existence of “00000” is hidden.
• Only 1 country code can be seen due to the existence of the special value “ZZZ”.
Relations found by sharing the same codes
Special/anomalous values
Cf. random sampling
How about random sampling?
1. SQL statement building
2. The SQL output (part of the results)
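A random-sampling statement can be generated in the same spirit. The sketch below assumes SQLite, where `order by random() limit n` picks n rows at random; other SQL systems spell this differently, and `sample_rows_sql` is a hypothetical name rather than the author's sampleRows command.

```python
import sqlite3

def sample_rows_sql(table, n):
    """Build a statement that returns n randomly chosen rows (SQLite syntax)."""
    return f"select * from {table} order by random() limit {int(n)}"

# Made-up table with 100 rows to sample from.
con = sqlite3.connect(":memory:")
con.execute("create table T (id integer)")
con.executemany("insert into T values (?)", [(i,) for i in range(100)])

sample = con.execute(sample_rows_sql("T", 5)).fetchall()
print(sample)
```

Sorting the whole table by a random key is simple but reads every row, so on very large tables a cheaper sampling method may be preferable.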
Randomness helps to see column relevance
• “Age” and “marriage” are colored with a yellow background.
• Probably, 1 means married and 2 means unmarried.
Estimating how 2 tables differ
Assume you can access both “the running DB” and its exported data. Exporting may take a lot of time, so “data change over time” occurs. Then how can you estimate the number of differing records between the two? A method needs to be established.
I tried using the SHA-1 function to see the difference. The record counts differed by only 6 lines, but the tables actually differ in somewhere within [83, 441] records at 95% confidence.
You can regard each of the 3 values on a row as a Gaussian variable whose variance is determined by the number of records of T1, T2, and (T1 − T2) ∪ (T2 − T1), respectively. By changing the conditions, you get 12 repeated measurements. Then you can estimate the record numbers based on the estimate of the population variance, as shown in the lower part of the table.
Skip this page unless there is enough time.
Continued from the previous slide.
If you sum up N i.i.d. variables from a distribution with mean zero and variance one, then the sum obeys a distribution that is well approximated by the normal distribution with mean zero and variance N.
If you obtain such a sum K times by repetition, how can the number N be estimated backward? It can be estimated from the total of the squares of those K values, divided by the 2.5th and 97.5th percentile points of the chi-square distribution with K degrees of freedom.
An easy way to obtain a variable with mean zero and variance one is to transform the SHA-1 value into [-sqrt(3), sqrt(3)]; a uniform variable on that interval has variance one.
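The estimation described above can be sketched in a few lines. Everything named here is an assumption made for illustration: the helper names, the use of a salt string to stand for “changing the conditions”, the made-up record strings, and the Wilson-Hilferty approximation used for the chi-square percentile points (chosen so that only the standard library is needed). Records common to both tables cancel in the difference of the two sums, so that difference behaves like a sum over the differing records only.

```python
import hashlib
import math
from statistics import NormalDist

def sha1_to_unit_variance(text, salt=""):
    """Map a string into [-sqrt(3), sqrt(3)] via SHA-1 (mean 0, variance 1 if uniform)."""
    digest = hashlib.sha1((salt + text).encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64      # in [0, 1)
    return (2.0 * u - 1.0) * math.sqrt(3.0)

def hash_sum(records, salt=""):
    """Sum the mapped SHA-1 values of all records."""
    return sum(sha1_to_unit_variance(r, salt) for r in records)

def chi2_quantile(p, k):
    """Wilson-Hilferty approximation of the chi-square quantile (approximate, not exact)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * k)
    return k * (1.0 - a + z * math.sqrt(a)) ** 3

def estimate_n(sums, level=0.95):
    """Return (lower, point, upper) estimates of N from K repeated sums."""
    k = len(sums)
    total = sum(s * s for s in sums)
    tail = (1.0 - level) / 2.0
    return (total / chi2_quantile(1.0 - tail, k),
            total / k,
            total / chi2_quantile(tail, k))

# Made-up tables: 800 shared records, 200 only in t1, 200 only in t2.
t1 = [f"row-{i}" for i in range(0, 1000)]
t2 = [f"row-{i}" for i in range(200, 1200)]
salts = [f"cond-{j}" for j in range(12)]               # 12 repeated measurements
diffs = [hash_sum(t1, s) - hash_sum(t2, s) for s in salts]
lower, point, upper = estimate_n(diffs)
print(lower, point, upper)
```

The true number of differing records in this toy setup is 400; the printed interval is an estimate and is not guaranteed to contain it.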
Skip this page unless there is enough time.
IV. Summary (3 slides)
Steps to know a DB toward analysis:
1. Knowing the tables: row numbers, table comparison.
2. Knowing the columns (individually): taking the 4 values, random sampling.
3. Knowing the column connections (relations): determining all the columns sharing the same codes.
4. Knowing how (row-wise) special conditions occur: random sampling from the special lines.
Those above should be fulfilled before going beyond.
[Figure: workflow diagram with the labels: DB; SQL cmd generator; generated SQL cmd; extracted info (table info, column info, concrete values); findings before main-analysis (✓ value formats, ✓ special/err values, ✓ columns' relations, ➢ meanings); short-cutting operations yielding simpler table(s) by column selecting, time (date) narrowing and customer narrowing; ➢ visualization, ➢ math methods; business value by main-analysis; big discovery from data + big business values.]
An application example.
1. You may have a lot of tables.
2. You understand each column of them by:
• seeing some of the concrete values,
• seeing the special and anomalous values,
• determining all of the columns sharing the same codes.
3. Thereafter, you can:
1. narrow down to modest-sized tables,
2. easily handle the data for visualizations,
3. summarize the data you need into one table that can be handled by many mathematical methods.
Contributions (summary)
1. For exploring an unknown DB:
➢ Organized the milestones.
➢ Conceived the methods.
2. Proposed a beneficial software tool for Big Data.
➢ No other such tool seems to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16] and [Kumar, Alencar 16].
3. Reduced the labor needed to understand a DB.
➢ By shrinking it from months to only a week.
➢ “Knowing” latently dominates a data analysis project.
V. Extra Slides (9 slides)
We must “understand DB contents” before any analysis
Reasons:
1. Effectiveness check for the analysis purpose
2. Seeing typical/special/anomalous values
3. Handling relations among columns
4. Rebuilding another DB environment
We focus on DB understanding
1. Understanding the DB
2. Building a new environment for analysis
3. Business-related calculation (monthly sales, ...) and advanced analysis employing math-related methods
※ Note: Preprocessing exists everywhere, but we do not touch on it explicitly.
Reasons why we must understand the DB:
1. Effectiveness check for the analysis purpose
2. Seeing typical/special/anomalous values
3. Handling relations among columns
4. Rebuilding another DB environment
SquirrelSQL (since 2001)
Detail: Line Number Listing Example
The command “lineNumber” can yield various types of SQL statements by utilizing command options such as -n and -t.
To make the output properly usable, it is designed so that the SQL output contains: 1) a sequence number, 2) table (and column) names, 3) the process time in seconds, 4) the time of calculation.
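The four output fields listed above can be imitated with a short client-side sketch. Note the assumption: the author's command embeds these fields in the SQL output itself, whereas this hypothetical `count_lines` function measures the time in Python around each query; the table names are made up and SQLite is used only for the demonstration.

```python
import sqlite3
import time
from datetime import datetime

def count_lines(con, tables):
    """Count records per table, attaching sequence number, name, seconds and timestamp."""
    report = []
    for seq, name in enumerate(tables, start=1):
        started = time.perf_counter()
        n = con.execute(f"select count(*) from {name}").fetchone()[0]
        seconds = round(time.perf_counter() - started, 3)   # process time in seconds
        when = datetime.now().isoformat(timespec="seconds")  # time of calculation
        report.append((seq, name, n, seconds, when))
    return report

# Made-up tables for the demonstration.
con = sqlite3.connect(":memory:")
con.execute("create table users (id integer)")
con.execute("create table orders (id integer)")
con.executemany("insert into users values (?)", [(1,), (2,)])

report = count_lines(con, ["users", "orders"])
for line in report:
    print(line)
```

Keeping the sequence number and timestamps with every count makes it possible to tell later which result came from which table and how stale it is.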
✓ A DB has tables, which have columns, which have values.
➢ One needs to determine column connections across tables.
© 2013 Microsoft Corporation. All rights reserved.
◆ One needs to see the values of each column, but how?
How to see values in each column.
To output an “integrated” table by SQL:
1. A new command yields an SQL statement:
2. The table returned by querying with the SQL statement:
This output is very useful, but the SQL statement is too long to enter manually.
Thus an SQL statement generator is desirable!
Output by “newCmd < tables.txt”
Deciphering so many columns at once.
Are the 4 values enough to see a column?
Skip this page unless there is enough time.
• Only 2 values (e.g. min/max) would not work ☹.
• Only 4 values may possibly be misleading ☹.
• Aligning more than 4 values:
  • The min/max from some (the third) set can be added.
  • Indeed good for seeing various/lengthy text values ☺.
  • Much computation time, as I once tried ☹.
  • SQL may really need “second_min” and “second_max”.
• Misc.
  • Null value care is desirable.
  • The frequency number may be desirable.
  • The value length information is helpful.
Random sampling, also weighting
Remaining issues before building a new DB environment for analysis
Combinatorial explosion in calculation can occur when reducing the redundancy of columns/rows.
• Grasping all the redundant columns through knowing the relations inside a table:
1. The values of 2 or more columns are the same in every row.
2. The values of a column can be determined by other column values.
• Grasping all the redundant rows.
• How can one know the condition under which a column has a null, special, anomalous, or rare value, when the other column values seem to hold a clue?
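For a single pair of columns, the two redundancy relations above can be checked with short generated statements. The sketch below is only an illustration of the pairwise check: the function names, the table `t` and its columns are hypothetical, and SQLite's null-safe `is not` comparison is assumed. Checking every pair (or every combination) of columns this way is what leads to the combinatorial explosion mentioned above.

```python
import sqlite3

def same_values_sql(table, a, b):
    """Count rows where columns a and b differ (0 means they always agree)."""
    return f"select count(*) from {table} where {a} is not {b}"

def determined_by_sql(table, a, b):
    """Count values of a that map to more than one non-null value of b (0 means a determines b)."""
    return (
        f"select count(*) from (select {a} from {table} "
        f"group by {a} having count(distinct {b}) > 1)"
    )

# Made-up table: code_copy duplicates code; name determines code but not vice versa.
con = sqlite3.connect(":memory:")
con.execute("create table t (code text, code_copy text, name text)")
con.executemany(
    "insert into t values (?, ?, ?)",
    [("JP", "JP", "Japan"), ("US", "US", "USA"),
     ("JP", "JP", "Japan"), ("US", "US", "United States")],
)

mismatches = con.execute(same_values_sql("t", "code", "code_copy")).fetchone()[0]
code_to_name = con.execute(determined_by_sql("t", "code", "name")).fetchone()[0]
name_to_code = con.execute(determined_by_sql("t", "name", "code")).fetchone()[0]
print(mismatches, code_to_name, name_to_code)
```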