69
Social Data Science Data and Big data David Dreyer Lassen UCPH ECON August 12, 2016

Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

SocialDataScience

DataandBigdataDavidDreyerLassen

UCPHECONAugust12,2016

Page 2: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

InGodwe trust,allothers mustbringdata

W.EdwardsDewing

Differenttypesofdata 2

Page 3: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Today:1.Empirical design2.datagenerating process3.modesofcollection

standardvsbig data;examples4.strategic dataprovision

Page 4: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

roadmap

• Different datafordifferent questions• Theory andempirics,forecasting andhypothesis testing

• Effects ofcauses vs.Causes ofeffects• Datagenerating process• Modesofdatacollection – pros andcons• Strategicdatamanagementanddataproduction

Differenttypesofdata 4

Page 5: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Different datafordifferent questionsor

Different questions fordifferent data

Sometimes possible toseparatedatacollection processfromunderlying datagenerating process – andsometimes not

Fundamentaldifferencebetween what people doandwhat they say they do‘cheap talk’/‘putyour money where your mouth is’/honest/costly signaling

Differenttypesofdata 5

Page 6: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

roadmap

• Different datafordifferent questions• Theory andempirics,forecasting andhypothesis testing

• Effects ofcauses vs.Causes ofeffects• Datagenerating process• Modesofdatacollection – pros andcons• Strategicdatamanagementanddataproduction

Differenttypesofdata 6

Page 7: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

What isyour question,again?1. Researchquestion from

theory2. Idealempirical design3. Feasible empirical

design/collection4. Results5. Adjustment of

theory/question/design6. Newresults7. …

A. WhatdatadowehaveB. Whatquestioncanthey

answerC. ResearchquestionD. Results

Differenttypesofdata 7

Page 8: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Allmodelsare wrong –butsome are useful

Two key goals1. Forecasting:individual behavior,policy

consequences,voting,ChampionsLeague,grades…Datascience/machine learning (butalsomacroeconomics)

2. Hypothesis testing,derived fromtheory´Traditional’socialscience

Differenttypesofdata 8

GeorgeBox

Page 9: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

1. Forecasting• Example:Bankwants toforecast non-payment onloans(P_d:probability ofdefault)

• Couldn’t care less about theory• Rough”DataScience”: try topredict fromallavailabledata

• Suppose we findthat birth weight predicts default– Bankishappy,better fit (defer ethics etc)– Policy:does investing inpre-natal care reduce defaults?

• Inpractice: setofpredictors typically taken from(some)theory,even ifcasual

• Complications:ifcustomers knowthat P_d depends onbirth weight,would/should they disclose it?What ifloans only todisclosers?Would they tell thetruth?

Differenttypesofdata 9

Page 10: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

2.Hypothesis testing• Theory (rationalchoice,sociology,biology,common sense,…)posits effect ofXonYA. Selection/typetheory:Peoplewho are impatient

cannot defer immediate pleasures ->smoke anddrinkwhile pregnant ->givesbirth sooner.Ifimpatient parents ->impatient children (whether bynatureornurture),we haveanexplanation.

B. Biological theory:low birth weight affects braindevelopment andneurological wiring forpatience.

• If(A),little role forpolicy;also,both can be trueatsametime

• Howtodistinguish:exogenous shock tobirthweight,butethically tricky...

Differenttypesofdata 10

Page 11: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Goodhart’s law

• Mostpopular:“Whenameasurebecomesatarget,itceasestobeagoodmeasure.”

• What he wrote:“Anyobservedstatisticalregularitywilltendtocollapseoncepressureisplaceduponitforcontrolpurposes.”

Differenttypesofdata 11

Page 12: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

TargetsandMeasures

• You cannot be toldhow your bankconstructsyour P_d.Why?– Goodhart’s law:people will attempt tooutmaneuver measure

– (thought)example:spending onshoes goodindicator ofaccount overdraft ->shoe lovers willhaveothers buy forthem,ceases tobe agoodmeasure

Differenttypesofdata 12

Page 13: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

CaseofGoogleFlu

• GoogleFlu:websearches forFlu symptomspredicted actual flu cases

• By-product ofGoogle’s main service• Butfrom2010,notsowell:overestimatedactual flu cases,partly asresult ofautosuggestfeature,partly becausemodelwas overfitted(we’ll return tothat)

• Bestpredictor:number ofcasespast week

Differenttypesofdata 13

Page 14: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

roadmap

• Different datafordifferent questions• Theory andempirics,forecasting andhypothesis testing

• Effects ofcauses vs.Causes ofeffects• Datagenerating process• Modesofdatacollection – pros andcons• Strategicdatamanagementanddataproduction

Differenttypesofdata 14

Page 15: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Effects ofcausesvs.

Causes ofeffects

Different questions• Effects ofcauses:intervention,what iseffectofpolicyXonoutcomeY

• Causes ofeffects:Why does Zoccur?

Differenttypesofdata 15

Page 16: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Effects ofcauses(forwardcausal questions)

• Narrow questions,sometimes (butnotalways)policyinterventions– Effect oftax change onbehavior– Effect ofregulation onrisk taking– Effect ofschooling onearnings– Effect ofsmokingonlung cancerpropensity– Effect ofpublichealth onschooling inAfrica– …

• Often,butnotalways,amenabletotreatments/randomization/experimentation

Differenttypesofdata 16

Page 17: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Causes ofeffects(reverse causal inference)

• Much harder,butoften moreinteresting–Why dosome people smoke?–What are thecauses ofdemocratization?–Why dosome people pursue aPhD why othersdropoutafter primary school?

–Why didGreece (almost)gobankrupt?• Tensionswith”effects ofcauses”– search forcauses sometimes derided as‘partychatter’

Differenttypesofdata 17

Page 18: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

roadmap

• Different datafordifferent questions• Theory andempirics,forecasting andhypothesis testing

• Effects ofcauses vs.Causes ofeffects• Datagenerating process• Modesofdatacollection – pros andcons• Strategicdatamanagementanddataproduction

Differenttypesofdata 18

Page 19: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

What isthedatagenerating process?

Observational:endogenousdecisions,researcherpassivecollector ofdataRandomization:treatment-control(Some)exogeneity:policyinterventions,sometimeswithcomparisons,researcherssometimes involved

Important:moredatadoes notgivebetterresult/moreprecision if estimator isbiased

Differenttypesofdata 19

Datagenerating process

Page 20: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Randomizedexperiments• Distinguish– Labexperiments:traditionallycomputer-basedinecon,butalsoeyetracking/brainimages(fMRI)/physiological

– Surveyexperiments:assignsurveyrespondentstodifferentframes/treatments/primings,e.g.haveSocDems andLiberalssaysamethingandlookatsupport

– Fieldexperiments:experimentalcontrolintherealworld,e.g.bankschargingdifferentratestolearnaboutmobilityofcustomers;interventionsagainstteacherabsenteeisminIndia;…)

Different typesofdata 20

Page 21: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Randomizedexperiments

• Distinguish– Naturalexperiments(weatherinduced:effectsofpovertyonviolence,randomizationofnamesonelectionballots,…)

– Quasi-experiments(effectsofchangeinpolicy;effectoftaxreformontaxplanning;effectofimmigrantallocationoncrime)

• Throughout:exogenous(outsideoftheindividual)change

Differenttypesofdata 21

Page 22: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Randomizedexperiments

• Large,importantcurrentdebatein(development)economics

• CofE:whatareeffectsofpenaltiesonteachers’absenceinIndianvillageschools– evidencefromrandomizedexperiments

• Randomlyselectedteachersgetharshpenaltyforno-shows->differenceinabsenteeismcausaleffect ofpenalty

• (BroaderEofC Q:whyiseducationsectorinruralIndiasoinefficient?)

Differenttypesofdata 22

Page 23: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Randomizedexperiments

• Strongoninternalvalidity:fromrandomizationany effectonabsenteeismisfromharsherpenalties;goodfortestingtheory

• Weak(er)onexternalvalidity– wouldeffectbesimilarinAfrica?Wouldeffectfromlabworkoutsidelab?Why,whynot?

• (compare:medicineworksinsimilarwaysacrosslocations)

Differenttypesofdata 23

Page 24: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Randomizedexperiments

• Challenges– Limitstowhatcanbestudiedbyexperimentation( ethics;law;feasibility)

– Funding(fieldexperimentsexpensive,surveyexplessso)

– Oftenparticipationconstraint– voluntaryparticipants’gain>=0ornoincentive

– Subjectsleaveforvarious(systematic)reasons– Large-scalerandomizationcanbehardinfieldexperiments

Differenttypesofdata 24

Page 25: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Observationaldata

• Generatedwithoutexperimentalorexogenousintervention

• Typicallyrevealscorrelationsordescriptivepatternsthatcanbeinterestinginthemselves

Differenttypesofdata 25

Page 26: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:Inequality

Differenttypesofdata 26

Source:Piketty andSaez,Science2014,taxreturndata

Page 27: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Observationaldata

• Generatedwithoutexperimentalorexogenousintervention

• Typicallyrevealscorrelationsordescriptivepatternsthatcanbeinterestinginthemselves– Areinthemselvessilentaboutcausality– Theorymaybeprovidestructuretolearnaboutcausalmechanismunderstrongassumptions

– Mayconflatecorrelationandcausality

Differenttypesofdata 27

Page 28: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Observationaldata

• Exple:Doesbeinginprivateschoolsaffectgrades– Classic:CatholicschoolsandgradesinUS– Collectattendanceandgrades->runregression

• But:supposesomeparentsaremorefocusedonschoolingthanothers– Sendkidstoprivateschoolmore– Moreinvolvedinschool+homework

• Whatdohighergradesmeasure?– EffectofprivateschoolOReffectofinvolvedparents?

Differenttypesofdata 28

Page 29: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Observationaldata

• Whattodo?– Assignkids/parentsrandomlytoprivateschools?

• Morecomplicated–Waiting-listexperimentdesign:peoplewhosignuprevealthemselvesasschoolinterested,comparegradesbetweenthoseinprogramandonwaitinglist->muchnarrowerdesign

– Modeling(UScase):usefactthatCatholicsaremuchmorelikelytochooseCatholicschools

Differenttypesofdata 29

Page 30: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

roadmap

• Different datafordifferent questions• Theory andempirics,forecasting andhypothesis testing

• Effects ofcauses vs.Causes ofeffects• Datagenerating process• Modesofdatacollection – pros andcons• Strategicdatamanagementanddataproduction

Differenttypesofdata 30

Page 31: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Modesofdatacollection• (Ethnographic/participantobserver)• Survey– Interviewsurvey(inperson),phonesurvey,internetsurvey,…

• Administrativedata– Usedforadministrativepurposes– Somecountries:census,taxreturn– DK:CPR-registrybased

• (Primarycollection: texts,counting)• “Bigdata”:insocialsciencestypicallyaby-productofdigitalinformation

Differenttypesofdata 31

Page 32: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Modesofdatacollection

• Note:survey,admindata,bigdatacanallhaverandomized/exogenouselementsorbepurelyobservational

• OfteninLab/fieldexperiments:askaboutincome,educationetc – butmaybebiased

• Sometimes:combineexperimentaldatawithadminorbigdata(butrare)

Differenttypesofdata 32

Page 33: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Ethnographic

• Pros– Attempttounderstandsituationsfromparticipants’perspective

– Verydetailedobservations(e.g.dynamicsatameeting:whospeakswhen,wholistens,whonodsoffandflirtsetc)

• Cons– Verydifficulttogeneralize(ifeventhegoal)

– Typicallyverysmalln,notforstats

– Hardtoreproduce/replicate

Differenttypesofdata 33

Page 34: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Surveys• Pros

– Canbecheap– Elicitinfoonattitudes,

beliefs,expectations– Necessarywhennoother

meansexist– Combinewithopen-ended

info– Easilyanonymized (firms;

China)

• Cons– Canbeexpensive– Non-randomsamples,

sometimesverymuchso(paidsurveys)

– Cheaptalk– Diverseinterpretations

(e.g.1-10scales,Maasaiexample)

– Verydifferentquality:interviewvs.internet

– Notfullresearchercontrol:Interviewercompletions

Differenttypesofdata 34

Page 35: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Administrativedata

• Denmark,Norway,Sweden– Population-wide– Ex:Knowpopulation‘bypressingEnter’

• Mostothercountries:census(countingpeople),surveys,roughapproximations

– InDK,builtonCentralPersonRegistrynumber– Systemconstructedforsourcetaxationin1960s,nowusedasubiquitousidentifier

• WhydosomecountrieshaveCPR-likesystemsandsomenot?

BigDatainEconomics

Page 36: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Administrativedata

• Pros– Oftenfullpopulation– InDK:thirdpartyreported->noreportingbias,nosurveybias

– Verydetailed,nosurveyfatigue

– Oftenveryprecise,sinceusedforadminpurposes

• Cons– Nosoftdata(attitudes,expectations);canbelinkedtosurveys

– Privacyconcerns– Restrictedtowhatiscollectedforadminreasons,bothtypeandfrequency(e.g.annual)

BigDatainEconomics

Page 37: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Administrativedata

• LotsofworkinDanisheconutilizesregisterdata– Taxation– Education– Health– Financialdecisions– Labormarket

• Combinedwith– Personalitymeasures– Attitudes/politicalprefsfromsurveys

– Expectationsfromsurveys

– Biologicaldata(neuro-measures,genetics)

– Datafromexperiments

BigDatainEconomics

Page 38: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Viva la revolución?HarnessingtheDataRevolution

forGood

HumanDevelopmentReportOffice

Page 39: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Big data

BigDatainEconomics

Page 40: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

NoagreedupondefinitionwhatBigDatais

• LargeN?• Highfrequency/muchdetail?

• Manydifferentmeasurements?

• Basedonwhatpeopledo(‘honestsignals’)– ctr surveys– Notalwayshonest

• Differenttodifferentpeople/traditions

• ToAmericans,Danishadmin/registerdataisbigdata

BigDatainEconomics

Page 41: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

‘Bigdata’

• Pros– Oftenbasedonrealdecisions (asadmindata),butmoredetail,e.g.auctions

– Highfrequency (e.g.wifi),highgranularity->almost‘largeNethnographicdata’

– Sometimescheap/free

• Cons– Noestablishedprotocolforcollection

– Sometimesdubiousquality,selectionissues(bothknown/unknown)

– Start-upcosts– Evenmoreprivacyconcerns

– Corporategatekeepers->biasinaccess(Facebook,Google)

BigDatainEconomics

Page 42: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Characteristicsof‘bigdata’

• Structured(row/column-style)vs.unstructured(images/sound)

• Temporallyreferenced(date,time,frequency)• Geographicallyreferenced(wifi,bluetooth,Google)

• Personidentifiable(identifyvs.distinguishindividualsvs.notdistinguishindividuals)– Separatemedium(e.g.phone)fromowner

BigDatainEconomics

Page 43: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:SocialFabric

• Large-scale(N=1000)bigdataproject• HandedoutsmartphonestoDTUfreshmen• Collectedphone,SMS/text/email(notcontent),GPS,wifi,bluetooth data

• ->Where,when,withwhom• ->socialnetworks

BigDatainEconomics

Page 44: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Whyphonedata

• Phonesassociometers• Many/mostpeoplecarryphonewiththemallthetime

• WouldbeIMPOSSIBLEtohavepeoplereportindetailforevery10mineverydayforayear

• Forthisproject:tailoredsoftware,butrealizedthatmanyappscollectdetailedwifi-datawithouttelling

• Concern:take-upofphones

BigDatainEconomics

Page 45: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:SocialFabric

BigDatainEconomics

Phone locations0500hMondaymorning ->canpredictwherepeopleatgiventimewith85%accuracy

Page 46: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:SocialFabric

BigDatainEconomics

10minGPS wifi

Page 47: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:SocialFabric

BigDatainEconomics

Page 48: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:peereffectsineducationeconomics

• Studentsallocatedtostudyandsocialgroups,calledvectorgroups(randomly)

• Aretherepeereffects,i.e.arestudents’grades/healthbehavior/studybehavioraffectedbythegroup?

• Literature:sometimesyes,sometimesno;veryheterogeneous

• Why?Perhapsbeingallocatedtogroupisnot=toactuallymeeting/usinggroup

BigDatainEconomics

Page 49: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:peereffects• Thinkofallocationtogroupasintentiontotreat(similartoofferingtreatment)

• Interestingexample:Carrell etal,ECMA2013.Smallgroups,yespeereffects;largegroups:no/negativepeereffects– WHY?

• Usephonetomeasurefrequencyofgroupmembersbeingtogetherphysically,measuredbybluetooth

• Threeparts:(i)yestheyaremoretogether;(ii)moretogether=>workbettertogether;(iii)peereffects?

BigDatainEconomics

Page 50: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Broaderissue:Whomeets,andhowclosearethey?

• Again:usebluetooth signalstomeasuremeetings(duration,participants)

• Analyzes3.1mio meetingsovertwomonths• Someresults:– Women/womenpairs->closer– Facebookfriends->closer– Samestudy->closer– Differenceinbeauty->furtherapart– Oneoverweight,onenot->furtherapart

• Peoplewhostandvery(too)closetoothershavefewerfriends(!?)

BigDatainEconomics

Page 51: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Predictionvscausality

• Measureclassattendancefromphonedata(wifi/GPS/bluetooth)– Either:constructclustersatslotsknownasteachingtime;or:useadmininfoonclasslocationsandconstructGPSoverlays

• Facebookactivity

• Predictgrades

BigDatainEconomics

Page 52: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

BigDatainEconomics

Page 53: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

BigDatainEconomics

Page 54: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

BigDatainEconomics

Page 55: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Predictionvscausality

Attendance->grades/comprehension– Peoplewhoattendmorelearnmore– PeoplewhospendlesstimeonFacebookhavemoretimeforstudying

AND/OR

Grades/comprehension->attendance– Findcourseshard->stayathome,moretemptedbyFacebook

BigDatainEconomics

Page 56: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:CSS

BigDatainEconomics

HeatmapofpeoplewithmobiledevicesonCSS(anonymous)

Page 57: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:DavidonSaturday

BigDatainEconomics

Page 58: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:DavidsomeSaturday

BigDatainEconomicsFleamarket

Page 59: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:howtomeasureconsumerspending

• Economicallyimportant:– Indicatorofhealthofeconomy– Importantforunderstandingindividualresponsestopolicy

– d.o.toeconomicshocks– Importantforconsumerprices->inflation->adjustmentsofwagesandtransfers

– Indevelopingcountries:importantforestimatesofpoverty,inequality

BigDatainEconomics

Page 60: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:consumerspending• Traditionalmethods:

– Consumerexpendituresurveys(DK:forbrugsundersøgelsen)

– Diaryorscanner– Errors,selection

• EconomistswantedaccesstoindividualspendingdatafromDankort foralongtime– Noluck

• Recently,StatisticsDenmarkgotaccesstoCOOP-carddatatomeasureinflation– Tobemadepublicsoon,

prettygoodfitwithexistingmeasures(andmuchfaster)

– Niceidea,incentivecompatible

– Indep ofpaymenttype– Butselection?

BigDatainEconomics

Page 61: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Example:consumerspending

• Attemptsindevelopingeconomics– Usesmartphonesasscannerormeansofpayment– whatcanweinferaboutindividualsfromsmartphoneuse(dedicatedusers)

– Selectionintowhohassmartphones– Butshouldbeseenagainstotherwaysofcollectingdata

• Qs:– Howcanweusesmartphonestoinferspendingbetter?– Whatkindsofeconomicallyinterestingdatacanwecollectviasmartphones?

BigDatainEconomics

Page 62: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

StatisticalanalysisofBigData

• Manyobservations:whatdoesstatisticalsignificancemean?– Andwhatispracticalrelevance?Sizeeffects

• Multipletestingproblems?Ifbigdatageneratesmanyvariables,whynotrunthroughthemalltoseewhatissignificant?– Correctstandarderrors

• Insomecases,‘eyeballeconometrics’canbedifficult– Needsystematicapproach

BigDatainEconomics

Page 63: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Statistical/machinelearning

• Supposeyouhavenoorverylittletheorytoguideyou

• OLSisnotonlylinear,butalsopresumessomeideaofwhatactuallygoesinthereandhow

• Varian’sTitanicexample:whosurvivedtheTitanic– Twovariables:Classandage– Researcherdecide/guessvs.dataanalysisyieldmostlikely(decisiontree,butlotsmorecomplicated->Sebastian,later)

– Einav,Levin:Econshouldconsidermachinelearning

BigDatainEconomics

Page 64: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

StatisticalanalysisofBigData

• Butwhatifyouhavetheory(orthinkyouhave)– e.g.combine econometricsandmachinelearning

• Goesbacktoolddebateineconomics– MiltonFriedman(1953): judgeamodelbyitspredictions,notitsassumptions

– Machinelearningmadeforpredictionnotforhypothesistestingandtheory(in)validation

BigDatainEconomics

Page 65: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

roadmap

• Different datafordifferent questions• Theory andempirics,forecasting andhypothesis testing

• Effects ofcauses vs.Causes ofeffects• Datagenerating process• Modesofdatacollection – pros andcons• Strategicdatamanagementanddataproduction

Differenttypesofdata 65

Page 66: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Strategicdatamanagementandproduction

• People/firms/governmentsdonotalwaysprovidetruthfuland/orcompletedata

• Example:Nopenaltyforlyinginsurveys– butnoreasonnottoeither

• Politicalreasonsforobscuringorinventingdata:GreeceinEU,Chineseeconomy

• Firms:Proprietaryinfo,competitionreasons,foolingcustomersandregulators(VW)

BigDatainEconomics

Page 67: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Strategicdatamanagementandproduction

• Individualdemandforprivacy(Wereturntothis)– Couldbeinstrumental:• lackofprivacydecreasesconsumersurplusbybetterestimateofreservationprice(e.g.Steering:Macvs PCwhenorderingonline)• Concernsaboutpoliticalissues

– Oranobjectiveinitself:Privacyasapoliticalgoal

BigDatainEconomics

Page 68: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Socialdesirability biasI

• Key concern insurveys,butmoregeneralproblem:What ifpeople answer soastoconformwithgeneralnotions ofwhat’s desirable?– Examples:Won’t admit tonotvoting orhavingsexually transmitted diseases,exaggerates income

– Reportsbuying healthy food vs unhealthy food– Important forasking/assessing sensitivequestions

BigDatainEconomics

Page 69: Social Data Science Data and Big data · (paid surveys) – Cheap talk – Diverse interpretations (e.g. 1-10 scales, Maasai example) – Very different quality: interview vs. internet

Socialdesirability biasII

• Why?• Distinguish

a) self-deceptionb) impression management

• Example:What doyou value mostinapotentialmate?– Peoplesay:"kindandunderstanding”– Fromdatingdata:physical attractiveness,status– Biascould be both (a)and(b)

BigDatainEconomics