Introduction to Big Data

stg-tud.github.io/ctbd/2016/CTBD_03_bigdata_intro.pdf

Big Data – Philosophical perspective

What is more valuable, if you had to pick one?
• experience or intelligence?

• Traditional (computer) science: logic -> [intelligence]
  • understand the problem, build model/algorithm
  • answer question from implementation of model

• New science: statistics -> [experience]
  • collect data
  • answer question from data (what did others do?)

Questions and (some) answers

• Find a spouse?
• Should Adam bite into the apple?
• 1+1?
• Cure for cancer?
• How to treat a cough?
• Should I give Donald a loan?
• Premium for fire insurance?
• When should my son come home?
• Which book should I read next?
• Translate from German to English.

Questions and (some) answers

• Find a spouse? I do not want to know!
• Should Adam bite into the apple? If you believe...
• 1+1? Definition
• Cure for cancer? I do not know. Maybe.
• How to treat a cough? Yes. (Google Insight)
• Should I give Donald a loan? Yes. (e.g., Schufa)
• Premium for fire insurance? Yes. (e.g., … )
• When should my son come home? No! But...
• Which book should I read next? Yes. (Amazon)
• Translate from German to English. Yes. (Google Transl.)

Data Science

• New approach to doing science
  • Step 1: Collect data
  • Step 2: Generate hypotheses
  • Step 3: Validate hypotheses
  • Step 4: (Go to Step 1 or 2)

• Why is this a good approach?
  • Automated: no thinking, fewer errors

• Why is this a bad approach?
  • How to debug without a ground truth?

• More generally, interdisciplinary emerging field (see images)
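
The collect / generate / validate loop can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the data is synthetic (so, unlike in real data science, a ground truth exists), the three candidate hypotheses are written by hand rather than generated automatically, and all function names are hypothetical.

```python
import random


def collect_data(n, seed=0):
    """Step 1: collect observations. Assumption: synthetic data, y = 2x + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2 * x + rng.gauss(0, 0.1)))
    return data


def generate_hypotheses():
    """Step 2: candidate explanations (hand-written here, for illustration only)."""
    return {
        "y = x": lambda x: x,
        "y = 2x": lambda x: 2 * x,
        "y = x^2": lambda x: x * x,
    }


def validate(hypothesis, data):
    """Step 3: mean squared error of one hypothesis against the collected data."""
    return sum((hypothesis(x) - y) ** 2 for x, y in data) / len(data)


def best_hypothesis(data):
    """Score every hypothesis and return the name of the best one plus all scores."""
    scores = {name: validate(h, data) for name, h in generate_hypotheses().items()}
    return min(scores, key=scores.get), scores


data = collect_data(200)
name, scores = best_hypothesis(data)
print(name, scores)
```

Step 4 would wrap this in a loop: if no hypothesis scores well enough, collect more data or add hypotheses and try again.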

“Big” data - Pros & Cons

• Pros
  • tolerate errors
  • discover the long tail and corner cases – machine learning works much better

• Cons
  • More data, more error (e.g., semantic heterogeneity)
  • With enough data you can prove anything
  • still need humans to ask the right questions
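
One way to read "with enough data you can prove anything": if you test enough unrelated variables against a target, some of them will correlate with it by chance. The sketch below assumes purely random Gaussian features and a random target (nothing is actually related) and tracks the strongest absolute Pearson correlation seen so far. The function names and parameters are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def running_max_abs_correlation(n_samples, n_features, seed=0):
    """Entry k-1 holds the strongest |correlation| among the first k random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    running, best = [], 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
        running.append(best)
    return running


running = running_max_abs_correlation(n_samples=20, n_features=1000)
print("best of first 10 features:  ", round(running[9], 3))
print("best of all 1000 features:  ", round(running[-1], 3))
```

The strongest correlation can only grow as more features are examined, which is why a human still has to ask whether a discovered relationship makes sense.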

Big Data Success Story

• Google Translate
  • You collect snippets of translations
  • You match sentences to snippets
  • You continuously debug your system

• Why does it work?
  • There are tons of snippets on the Web
  • There is a ground truth that helps to debug system

Google Translate is based on something called "statistical machine translation". This means that they gather as much text as they can find that seems to be parallel between two languages, and then they crunch their data to find the likelihood that something in Language A corresponds to something in Language B. This method works to some extent for language pairs where a lot of more-or-less parallel data is available, for example English-Spanish. […] (quora.com)
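
The idea of "crunching parallel text" can be illustrated with a toy example. The sketch below is not Google's method: it swaps in a simple Dice co-occurrence score (an association measure, not a true likelihood) and translates word by word. The four-sentence German-English corpus and all function names are made up for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: a tiny hand-made parallel corpus (German, English).
PAIRS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
    ("ein haus", "a house"),
]


def dice_table(parallel_pairs):
    """Score each (source word, target word) pair by how often they co-occur."""
    src_count, tgt_count, cooc = Counter(), Counter(), Counter()
    for src, tgt in parallel_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)  # sentence pairs containing each source word
        tgt_count.update(tgt_words)  # sentence pairs containing each target word
        cooc.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in cooc.items()
    }


def translate_word(word, table):
    """Pick the target word with the highest score; keep unknown words as-is."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    if not candidates:
        return word
    return max(candidates, key=candidates.get)


def translate(sentence, table):
    """Word-by-word translation; ignores word order and grammar entirely."""
    return " ".join(translate_word(w, table) for w in sentence.split())


table = dice_table(PAIRS)
print(translate("ein haus", table))  # prints "a house"
```

With only four sentence pairs the scores are clean; on real Web data the same counting needs far more text and more robust models, which is the slide's point about needing "tons of snippets".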

Big Data – Business perspective

It is a new business model

• People pay with data, e.g. Facebook, Google, Twitter:
  • use service, give data
  • Google sells your data to advertisers
  • you pay advertisers indirectly

• 23andMe, Amazon:
  • pay for service + give data
  • sells data and
  • uses data to improve service

Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, June 2011

Big Data – Technical perspective

• You collect all data
  • the more the better -> statistical relevance,
  • keeping all is cheaper than deciding what to keep

• You decide independently what to do with data
  • run experiments on data when question arises

• Huge difference to traditional information systems
  • Design upfront what data to keep and why!!! (e.g., waterfall model of software engineering!)
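
"Collect everything, decide later" is often called schema-on-read: records are stored as they arrive, and structure is imposed only when a question comes up. A minimal sketch, assuming an in-memory append-only log of JSON lines; the class `RawEventLog`, the helper `count_by`, and the sample records are all hypothetical.

```python
import json


class RawEventLog:
    """Append-only store that keeps every record as-is, with no upfront schema."""

    def __init__(self):
        self._lines = []

    def append(self, record):
        # Store the raw record; no decision is made about which fields matter.
        self._lines.append(json.dumps(record))

    def scan(self):
        # Parse records only when someone actually reads them.
        for line in self._lines:
            yield json.loads(line)


def count_by(log, field):
    """A question asked later: how often does each value of `field` occur?"""
    counts = {}
    for record in log.scan():
        if field in record:  # tolerate records that lack the field
            key = record[field]
            counts[key] = counts.get(key, 0) + 1
    return counts


log = RawEventLog()
log.append({"user": "a", "page": "/home"})
log.append({"user": "b", "page": "/home", "browser": "firefox"})
log.append({"sensor": 7, "temp": 21.5})  # a completely different shape is fine

print(count_by(log, "page"))
print(count_by(log, "browser"))
```

A traditional information system would instead fix the table columns before the first record is written, and the sensor reading would have nowhere to go.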

Consequences

• Volume: data at rest
  • it is going to be a lot of data

• Velocity (Speed): data in motion
  • it is going to arrive fast

• Variety (Diversity): data in many formats
  • Different shapes (e.g., different versions, different sources)

• Veracity: data in doubt
  • do you know what you have?
