SoDA 501: Approaches and Issues in Big Social Data
Spring 2019

Burt L. Monroe
Office: Sparks B002 (The Databasement) or Pond 207
Course website: https://burtmonroe.github.io/SoDA501
Office Hours: TBD, but Friday. Appointments: http://burtmonroe.youcanbook.me
Contact: burtmonroe@psu.edu, 814-867-2726 or 814-865-9215

Description

This seminar is part of the core seminar series for students in the Social Data Analytics dual-title PhD and doctoral minor. The primary objective of the seminar is interdisciplinary exposure to, engagement with, and integration of the tools, practices, language, and standards used in the collection and management of data in the component disciplines of the Social Data Analytics field.

Each of you is well on your way toward a PhD – formal certification as an “expert” – in one of the component disciplines of Social Data Analytics and has in your coursework and research become well versed in one or more of the many computational, informational, statistical, visual analytic, or social scientific approaches to data, and the issues faced by those approaches. Here, we are interested in trying to integrate your multidisciplinary expertise, particularly in the context of data that are social (about, or arising from, human interaction) and big or intensive (of sufficient scale, variety, or complexity to strain the informational, computational, or cognitive limits of conventional approaches to data collection, management, manipulation, or analysis).

The SoDA core seminars are organized around the metaphor of the social data stack. The social data stack consists of three fuzzily boundaried layers: the “data layer,” the “analytics layer,” and the “relevance layer” (Fig. 1).

The data layer is comprised of the processes and technologies by which human interactions are translated into data about human interactions. These are the themes emphasized in SoDA 501, “Approaches and Issues in Big Social Data,” offered in the spring semester. Some SoDA / IGERT students will take more in-depth seminars with a focus on computational and informational aspects (primarily in Information Sciences & Technology, Geography, or engineering departments) and research design aspects (primarily in social science departments or Statistics) of the data layer.

The analytics layer is comprised of the processes and technologies by which social data are translated into knowledge about society. These are the themes emphasized in SoDA 502, “Approaches and Issues in Social Data Analytics.” Some of you will take more in-depth seminars on machine / statistical learning, visual analytics, or other statistical or social scientific approaches to inference.

The relevance layer is comprised of the processes and technologies by which knowledge about society is translated into value for science or society. Within the SoDA seminars, this is addressed primarily through exposure to and participation in projects that require an interdisciplinary team science approach.



Figure 1: SoDA and the social data stack.

Assignments and Grades

That leads us to the main pedagogical components of SoDA 501:

• Engagement in Seminar - 40%

  – Guest Speakers. For half of the session most weeks, we will host a guest speaker (typically a member of the Graduate Faculty in Social Data Analytics, drawn from the full range of participating disciplines) discussing an active research project or related topic that touches on one or more areas of concern in the course. For each speaker, we will have two or more of you acting as “designated respondents,” with extra responsibility for having questions for discussion with the speaker.

  – Readings and Seminar Discussion. The readings, discussion, and what lecturing I will do will focus on interdisciplinary integration. In part, this involves identifying those concepts that may be new to some of you in this setting – e.g., how “big” or “machine learning” or “visual analytics” approaches challenge conventional social science methodology, or how social scientific thinking challenges emerging practices and conventional wisdom in data science – and tools associated with those concepts. In part, this involves interdisciplinary arbitrage and translation – identifying common concepts and structure that may go by slightly different names in different disciplines and settings. To this end, I want you each to send me – by SUNDAY, 7:00am each week, by email – lists of terms / concepts that you encountered in that week’s reading, in three categories: (1) terms/concepts that were new but you think you now understand, (2) terms/concepts that seem to be used differently than in the context of your home discipline, and (3) terms/concepts you still find confusing.

  – Grading Criteria. Full 40 points if you are present every week, have made a good faith effort to provide your lists of confusing terms and concepts on time, have thoughtfully read all of the assignments, are prepared to talk about the week’s readings and themes, and consistently contribute in ways that are productive to the discussion (good questions, thoughtful responses, etc.), with all of that weighted more heavily when you are a designated respondent. If you don’t do any of that, 0 points. Sliding scale in between.

• Exercises - 20%. It is explicitly not an objective of this course to “train” you in all of the tools we will mention, much less those that we could mention, a task that would take years. In the interest of collectively “moving the ball forward” for each of you, however, we will have a few assigned exercises. Some of these will be done in (assigned) interdisciplinary teams. Some exercises will be individual.

• Semester Team Project - 40%. You will, in a team consisting of at least three disciplines, create, gather, and/or organize/manipulate/prepare for analytics a “big” “social” dataset for some plausible social scientific purpose. The data must be at least partly social (arise from human interactions). There must be some nontrivial computational or informational element to the project. There need not be a final analysis of the data, but there must be some basic calculation of descriptive statistics over the data and some demonstration of the validity of the data for the (or an) intended scientific purpose – e.g., representativeness (and of what), balance, randomization, measurement validity, etc.

  – Feb 26 - Deadline for approval of teams and (proposed) projects.

  – March 12 - 25% Project Review

  – March 26 - 50% Project Review

  – April 9 - 75% Project Review

  – April 23 - Team Project Presentations

  – May 1 - White Paper and Data / Replication Archive Due. Submit a 4-5 page summary paper that explains what was done and why, discusses problems you encountered, provides an assessment of the validity of the data for an analytic purpose (or how it might be validated), and discusses what further work might be done to make the data into a useful resource for others and/or to publish an analysis based on the data. Share your code and data with me in maximally documented and reproducible form (ideally a notebook stored on github or similar). Your code / notebooks / documentation / data can be as long or as large as necessary.


Course Schedule 2019

January 8

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

  – 10RulesforData, SoftwareCarpentry Lessons

  – Section “General Resources for Python and R”

  – Recommended: Python via Anaconda (https://www.anaconda.com), R (https://www.r-project.org) & RStudio (https://www.rstudio.com), Account on ICS-ACI (https://ics.psu.edu), Git (https://git-scm.com)

  – Exercise 1 (on website, due Jan 15th by 7:00 am)

January 15

• Readings (send list of confusing terms / concepts by 7:00am Sunday):

  – BitByBit, Ch. 1 & 2

  – CompSocSci, Monroe-No, Monroe-5Vs, Jordan-NotYetAI

  – One article, from a discipline different than yours, listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData.

• Further reference: Section “Big Data & Social Data Analytics.”

January 22

• Readings (send list of confusing terms / concepts by 7:00 am Sunday):

  – GoogleFlu, GoogleBooks, EmbeddingsBias, MachineBias, RacistBot, BDSS-Census

  – ResearchMethodsKB “Measurement,” Quinn-Topics

• Further reference:

  – Section “¡Cuidado!”

  – Section “Measurement Reliability and Validity”

• Determine “discussion lead” dates.

Jan 29

• Speakers may start as soon as this, which makes everything that follows somewhat unpredictable. See the Spring 2018 syllabus for a rough indication of how topics will be arranged.


February 5

February 12

February 19

Feb 26

• Semester Projects Must be Approved Before Spring Break

March 5 - Spring Break

March 12

• 25% Project Review

March 19

• 50% Project Review

April 2

April 9

• 75% Project Review

April 16

April 23 - LAST CLASS MEETING

• Team Project Presentations

May 1 - Team Projects Due


Readings and References (updated Spring 2019)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is, or will be, provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915):721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374). Ch. 1 “Introduction,” Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018. Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction,” Ch. 2 “Observing Behavior.”

• DSHandbook-Py, Ch. 1 “Introduction: Becoming a Unicorn,” Ch. 2 “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution. Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10):70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds. Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• ‡[Jordan-NotYetAI] Michael Jordan. 2018. “Artificial Intelligence – The Revolution Hasn’t Happened Yet.” https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7


• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343). 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

  – Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica. May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

  – Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353:6307, pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science. May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS, 1.2.2-1.2.3 on Bonferroni.

• Also: BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3 “Build Reality Checks into Your Research”


• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit, Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4 “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17 “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8 “Sampling and Massive Data”

• BitByBit, Ch. 3 “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt, et al. (eds.) Counting Civilians. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” (Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1):206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS, Hash Functions (1.3.2), Sampling in streams (4.2)


• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book. Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4 “Running Experiments”; Ch. 2, natural experiments in observable data examples.

• ResearchMethodsKB, “Design”

• 7Rules, Ch. 2 “Look for Differences that Make a Difference, and Report Them,” §Ch. 5 “Compare Like with Like,” Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change.”

• Shalizi-ADA, Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.) 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1 “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126:64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit, Ch. 5 “Creating Mass Collaboration”

• NRCreport, Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009


• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com. ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1 “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11 “Web-Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html


Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

  – Human Subjects Research / IRB: https://www.research.psu.edu/irb

  – Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

  – Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

  – Research Misconduct: https://www.research.psu.edu/researchmisconduct

  – SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014) “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
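To make the point about arbitrary groupings concrete, here is a minimal sketch (not drawn from any of the listed readings) in which a single k-means fit is read three ways: as lossy compression, as a reduction to one discrete latent code, and as a feature extractor. The toy data, the `kmeans` helper, and all variable names are assumptions made for illustration; only NumPy is used.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (illustrative helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

# Synthetic toy data: three blobs in 2-D (an assumption, for illustration only)
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

centroids, labels = kmeans(X, k=3)

# "Compression" / quantization: keep one integer per row plus k centroids
X_hat = centroids[labels]                 # lossy reconstruction of X
mse = float(((X - X_hat) ** 2).mean())    # reconstruction error

# "Dimensionality reduction" / "latent variable": 2 columns -> 1 discrete code
codes = labels.reshape(-1, 1)

# "Feature extraction": distance to each centroid as a new k-column representation
features = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
```

The same fitted centroids serve all three readings; which label applies depends on what is kept afterward (`X_hat`, `codes`, or `features`), not on the algorithm.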

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6, “Clustering”

• DataScience-Python NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002) “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition Sections 2.6-7, “Feature Selection, Feature Extraction”

• DSHandbook-Py Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning Section 5.11.3, “Manifold Learning”

• DataScience-Python NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
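For orientation before the readings above, the split-apply-combine pattern can be sketched in a few lines of plain Python. This is only an illustrative toy, not the tidyverse or any MapReduce framework: the records and the helper names `split`, `apply_each`, and `combine` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records: (group key, value)
records = [("PA", 3), ("NY", 5), ("PA", 7), ("OH", 2), ("NY", 1)]

def split(rows):
    """Split: partition the rows by key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups

def apply_each(groups, fn):
    """Apply: run the same function independently on every group."""
    return {key: fn(values) for key, values in groups.items()}

def combine(results):
    """Combine: gather the per-group results into one sorted table."""
    return sorted(results.items())

summary = combine(apply_each(split(records), sum))
# summary == [("NY", 6), ("OH", 2), ("PA", 10)]
```

Because each group is processed independently in the apply step, that step is the one that could be distributed across workers; the split and combine steps are where data movement happens.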

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py Ch. 13, “Big Data”

• DeepLearning Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi, et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13, “Geographic Data with Basemap”

• Time, “Longitudinal”: intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time, “Time Series, Time Series Cross Section, Panel”: as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time, “Sequential, Data Streams”: as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing, https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jorn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (httpwwwlyndapsu).


Figure 1: SoDA and the social data stack

Assignments and Grades

That leads us to the main pedagogical components of SoDA 501:

• Engagement in Seminar - 40%

Guest Speakers: For half of the session most weeks, we will host a guest speaker (typically a member of the Graduate Faculty in Social Data Analytics, drawn from the full range of participating disciplines) discussing an active research project or related topic that touches on one or more areas of concern in the course. For each speaker, we will have two or more of you acting as “designated respondents,” with extra responsibility for having questions for discussion with the speaker.

Readings and Seminar Discussion: The readings, discussion, and what lecturing I will do will focus on interdisciplinary integration. In part, this involves identifying those concepts that may be new to some of you in this setting – e.g., how “big” or “machine learning” or “visual analytics” approaches challenge conventional social science methodology, or how social scientific thinking challenges emerging practices and conventional wisdom in data science – and tools associated with those concepts. In part, this involves interdisciplinary arbitrage and translation – identifying common concepts and structure that may go by slightly different names in different disciplines and settings. To this end, I want you each to send me – by SUNDAY 7:00am each week, by email – lists of terms / concepts that you encountered in that week’s reading in three categories: (1) terms/concepts that were new, but you think you now understand; (2) terms/concepts that seem to be used differently than in the context of your home discipline; and (3) terms/concepts you still find confusing.

Grading Criteria: Full 40 points if you are present every week, have made a good faith effort to provide your lists of confusing terms and concepts on time, have thoughtfully read all of the assignments, are prepared to talk about the week’s readings and themes, and consistently contribute in ways that are productive to the discussion (good questions, thoughtful responses, etc.), with all of that weighted more heavily when you are a designated respondent. If you don’t do any of that, 0 points. Sliding scale in between.

• Exercises - 20%. It is explicitly not an objective of this course to “train” you in all of the tools we will mention, much less those that we could mention, a task that would take years. In the interest of collectively “moving the ball forward” for each of you, however, we will have a few assigned exercises. Some of these will be done in (assigned) interdisciplinary teams. Some exercises will be individual.

• Semester Team Project - 40%. You will, in a team consisting of at least three disciplines, create, gather, and/or organize/manipulate/prepare for analytics a “big” “social” dataset for some plausible social scientific purpose. The data must be at least partly social (arise from human interactions). There must be some nontrivial computational or informational element to the project. There need not be a final analysis of the data, but there must be some basic calculation of descriptive statistics over the data, and some demonstration of the validity of the data for the (or an) intended scientific purpose – e.g., representativeness (and of what), balance, randomization, measurement validity, etc.

Feb 26 - Deadline for approval of teams and (proposed) projects.

March 12 - 25% Project Review

March 26 - 50% Project Review

April 9 - 75% Project Review

April 23 - Team Project Presentations

May 1 - White Paper and Data Replication Archive Due. Submit a 4-5 page summary paper that explains what was done and why, discusses problems you encountered, provides an assessment of the validity of the data for an analytic purpose (or how it might be validated), and discusses what further work might be done to make the data into a useful resource for others and/or to publish an analysis based on the data. Share your code and data with me in maximally documented and reproducible form (ideally a notebook stored on github or similar). Your code / notebooks / documentation / data can be as long or as large as necessary.


Course Schedule 2019

January 8

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

10RulesforData, SoftwareCarpentry Lessons

Section “General Resources for Python and R”

Recommended: Python via Anaconda (https://www.anaconda.com), R (https://www.r-project.org) & RStudio (https://www.rstudio.com), Account on ICS-ACI (https://ics.psu.edu), Git (https://git-scm.com)

Exercise 1 (on website, due Jan 15th by 7:00 am)

January 15

• Readings (send list of confusing terms / concepts by 7:00am Sunday):

BitByBit, Ch. 1 & 2

CompSocSci, Monroe-No, Monroe-5Vs, Jordan-NotYetAI

One article from a discipline different than yours, listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData.

• Further reference: Section “Big Data & Social Data Analytics.”

January 22

• Readings (send list of confusing terms / concepts by 7:00 am Sunday):

GoogleFlu, GoogleBooks, EmbeddingsBias, MachineBias, RacistBot, BDSS-Census

ResearchMethodsKB “Measurement,” Quinn-Topics

• Further reference:

Section “¡Cuidado!”

Section “Measurement, Reliability, and Validity”

• Determine “discussion lead” dates

Jan 29

• Speakers may start as soon as this, which makes everything that follows somewhat unpredictable. See the Spring 2018 syllabus for a rough indication of how topics will be arranged.


February 5

February 12

February 19

Feb 26

• Semester Projects Must be Approved Before Spring Break

March 5 - Spring Break

March 12

• 25% Project Review

March 19

• 50% Project Review

April 2

April 9

• 75% Project Review

April 16

April 23 - LAST CLASS MEETING

• Team Project Presentations

May 1 - Team Projects Due


Readings and References (updated Spring 2019)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is, or will be, provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374). Ch. 1 “Introduction”; Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018. Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction”; Ch. 2 “Observing Behavior.”

• DSHandbook-Py: Ch. 1 “Introduction: Becoming a Unicorn”; Ch. 2 “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• ‡[Jordan-NotYetAI] Michael Jordan. 2018. “Artificial Intelligence – The Revolution Hasn’t Happened Yet.” https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7


• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers. (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS: 1.2.2-1.2.3 on Bonferroni

• Also: BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB: “Measurement”

• 7Rules: Ch. 3 “Build Reality Checks into Your Research”


• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules: Ch. 4 “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA: Ch. 17 “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. httpsk8es4mc2lsearchserialssolutionscomsid=sersolampSS_jc=TC_024492330amptitle=Wiley20Desktop20Editions203A20Sampling

• ResearchMethodsKB: “Sampling”

• NRCreport: Ch. 8 “Sampling and Massive Data”

• BitByBit: Ch. 3 “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: httpciteseerxistpsueduviewdocdownloaddoi=1011469939amprep=rep1amptype=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)


• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1 “A definition of causal effect”; Ch. 2 “Randomized experiments”; Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit: Chapter 4 “Running Experiments”; Ch. 2: Natural experiments in observable data, examples

• ResearchMethodsKB: “Design”

• 7Rules: Ch. 2 “Look for Differences that Make a Difference, and Report Them”; §Ch. 5 “Compare Like with Like”; Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA: Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit: Ch. 5 “Creating Mass Collaboration”

• NRCreport: Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009


• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py: Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction”; Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11 “Web-Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html


Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit: Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002


• DataMatching: Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021(2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py: Ch. 9 “Technical Communication and Documentation”; Ch. 15 “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage.”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation.”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations.”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1, Section 2.1, “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book, Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0, Ch. 2, “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
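
One way to make the “compression” reading of k-means concrete: the fitted centroids act as a small codebook, and each observation is stored as the index of its nearest centroid. The sketch below is illustrative only. The toy data, the function names (`sq_dist`, `assign`, `kmeans`), and the simplistic initialization from the first k points are assumptions made for this example and do not come from any of the readings.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the lower index)."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; initializing from the first k points is a simplification."""
    centroids = list(points[:k])
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign(points, centroids)


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

codebook, codes = kmeans(points, k=2)

# "Decompress": replace each code with its centroid, then measure the loss.
reconstructed = [codebook[c] for c in codes]
mse = sum(sq_dist(p, r) for p, r in zip(points, reconstructed)) / len(points)

print(codes)
print(round(mse, 4))
```

Here six two-dimensional points are stored as six small integers plus a two-row codebook. The same integer labels can equally be read as a one-dimensional categorical feature or as a discrete latent variable, which is the sense in which the groupings are arbitrary.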

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering.”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection,” “Feature Extraction.”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas.”

• DataScience-Python, NB 05.04, “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords.

• Features in images: DataScience-Python, NB 05.14, “Image Features.”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning.”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
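
The split-apply-combine pattern named in the TidyData-R entry above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the tidyverse itself; the function name `split_apply_combine` and the toy records are hypothetical and were invented for this example.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's `value`s, collect the results."""
    groups = defaultdict(list)
    for row in rows:  # split: route each record to its group
        groups[row[key]].append(row[value])
    # apply `func` within each group, then combine into one summary table
    return {g: func(vals) for g, vals in groups.items()}


# Hypothetical records, invented for this sketch.
rows = [
    {"group": "a", "score": 10},
    {"group": "b", "score": 30},
    {"group": "a", "score": 20},
]

result = split_apply_combine(rows, "group", "score", mean)
print(result)
```

Because each group is summarized independently of the others, the apply step is the part that can be distributed across workers, which is the link to the MapReduce readings in this list.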

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing, https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jorn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David Mackay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts

Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others

Disability Accomodation

Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)

In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


learning” or “visual analytics” approaches challenge conventional social science methodology, or how social scientific thinking challenges emerging practices and conventional wisdom in data science – and tools associated with those concepts. In part, this involves interdisciplinary arbitrage and translation – identifying common concepts and structure that may go by slightly different names in different disciplines and settings. To this end, I want you each to send me – by SUNDAY, 7:00am each week, by email – lists of terms / concepts that you encountered in that week’s reading in three categories: (1) terms/concepts that were new but you think you now understand; (2) terms/concepts that seem to be used differently than in the context of your home discipline; and (3) terms/concepts you still find confusing.

Grading Criteria: Full 40 points if you are present every week, have made a good faith effort to provide your lists of confusing terms and concepts on time, have thoughtfully read all of the assignments, are prepared to talk about the week’s readings and themes, and consistently contribute in ways that are productive to the discussion (good questions, thoughtful responses, etc.), with all of that weighted more heavily when you are a designated respondent. If you don’t do any of that, 0 points. Sliding scale in between.

• Exercises - 20%. It is explicitly not an objective of this course to “train” you in all of the tools we will mention, much less those that we could mention, a task that would take years. In the interest of collectively “moving the ball forward” for each of you, however, we will have a few assigned exercises. Some of these will be done in (assigned) interdisciplinary teams. Some exercises will be individual.

• Semester Team Project - 40%. You will, in a team consisting of at least three disciplines, create, gather, and/or organize/manipulate/prepare for analytics a “big” “social” dataset for some plausible social scientific purpose. The data must be at least partly social (arise from human interactions). There must be some nontrivial computational or informational element to the project. There need not be a final analysis of the data, but there must be some basic calculation of descriptive statistics over the data and some demonstration of the validity of the data for the (or an) intended scientific purpose – e.g., representativeness (and of what), balance, randomization, measurement validity, etc.

Feb 26 - Deadline for approval of teams and (proposed) projects

March 12 - 25% Project Review

March 26 - 50% Project Review

April 9 - 75% Project Review

April 23 - Team Project Presentations

May 1 - White Paper and Data Replication Archive Due. Submit a 4-5 page summary paper that explains what was done and why, discusses problems you encountered, provides an assessment of the validity of the data for an analytic purpose (or how it might be validated), and discusses what further work might be done to make the data into a useful resource for others and/or to publish an analysis based on the data. Share your code and data with me in maximally documented and reproducible form (ideally a notebook stored on github or similar). Your code / notebooks / documentation / data can be as long or as large as necessary.


Course Schedule 2019

January 8

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

10RulesforData, SoftwareCarpentry Lessons

Section “General Resources for Python and R”

Recommended: Python via Anaconda (https://www.anaconda.com); R (https://www.r-project.org) & RStudio (https://www.rstudio.com); Account on ICS-ACI (https://ics.psu.edu); Git (https://git-scm.com)

Exercise 1 (on website, due Jan 15th by 7:00 am)

January 15

• Readings (send list of confusing terms / concepts by 7:00am Sunday):

BitByBit, Ch. 1 & 2

CompSocSci, Monroe-No, Monroe-5Vs, Jordan-NotYetAI

One article from a discipline different than yours listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData.

• Further reference: Section “Big Data & Social Data Analytics”

January 22

• Readings (send list of confusing terms / concepts by 7:00 am Sunday):

GoogleFlu, GoogleBooks, EmbeddingsBias, MachineBias, RacistBot, BDSS-Census

ResearchMethodsKB “Measurement”, Quinn-Topics

• Further reference:

Section “¡Cuidado!”

Section “Measurement, Reliability, and Validity”

• Determine “discussion lead” dates

Jan 29

• Speakers may start as soon as this, which makes everything that follows somewhat unpredictable. See the Spring 2018 syllabus for a rough indication of how topics will be arranged.


February 5

February 12

February 19

Feb 26

• Semester Projects Must be Approved Before Spring Break

March 5 - Spring Break

March 12

• 25% Project Review

March 19

• 50% Project Review

April 2

April 9

• 75% Project Review

April 16

April 23 - LAST CLASS MEETING

• Team Project Presentations

May 1 - Team Projects Due


Readings and References (updated Spring 2019)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915):721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374). Ch. 1 “Introduction”; Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018. Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction”; Ch. 2 “Observing Behavior.”

• DSHandbook-Py. Ch. 1 “Introduction: Becoming a Unicorn”; Ch. 2 “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10):70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• ‡[Jordan-NotYetAI] Michael Jordan. 2018. “Artificial Intelligence – The Revolution Hasn’t Happened Yet.” https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7


• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353:6307, pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science, May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS. 1.2.2-1.2.3 on Bonferroni.

• Also BDSS-Census.

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB. “Measurement”

• 7Rules. Ch. 3 “Build Reality Checks into Your Research.”


• Reliability: see also FightinWords.

• Validity: see also Quinn-Topics, Monroe-5Vs.

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit. Examples in Chapter 2.

Multiple Measures, Latent Variable Measurement

• 7Rules. Ch. 4 “Replicate Where Possible.”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA. Ch. 17 “Factor Models.”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB. “Sampling”

• NRCreport. Ch. 8 “Sampling and Massive Data.”

• BitByBit. Ch. 3 “Asking Questions.”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt, et al. (eds.), Counting Civilians. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” (Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1):206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance).

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS. Hash Functions (1.3.2); Sampling in streams (4.2).


• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book. Ch. 1 “A definition of causal effect”; Ch. 2 “Randomized experiments”; Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit. Chapter 4 “Running Experiments”; Ch. 2: Natural experiments in observable data, examples.

• ResearchMethodsKB. “Design”

• 7Rules. Ch. 2 “Look for Differences that Make a Difference, and Report Them”; §Ch. 5 “Compare Like with Like”; Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change.”

• Shalizi-ADA. Part IV “Causal Inference.”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1 “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126:64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction”; Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit. Ch. 5 “Creating Mass Collaboration.”

• NRCreport. Ch. 9 “Human Interaction with Data.”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009


• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py. Ch. 12 “Data Encodings and File Formats.”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com. ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1 “Introduction”; Ch. 2 “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1 “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web-Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html


Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit. Ch. 6 (“Ethics”).

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002


• DataMatching. Ch. 8 “Privacy Aspects of Data Matching.”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021(2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py. Ch. 9 “Technical Communication and Documentation”; Ch. 15 “Software Engineering Best Practices.”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph. “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport, Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification, Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1, Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book, Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0, Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick.” (Lecture Notes) https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
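To make the overlap concrete, here is a minimal sketch (not taken from any of the readings) of one-dimensional k-means read as vector quantization. The function name `kmeans_1d`, the toy data, and the deterministic starting centroids are all illustrative assumptions; real analyses would normally use a library implementation with random restarts.

```python
def kmeans_1d(values, k, iters=25):
    """Lloyd's algorithm on a list of floats; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start (an illustrative choice): evenly spaced order statistics.
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]

    def assign(cs):
        # Each point gets the index of its nearest centroid.
        return [min(range(k), key=lambda j: (v - cs[j]) ** 2) for v in values]

    for _ in range(iters):
        labels = assign(centroids)
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = sum(members) / len(members)
    return centroids, assign(centroids)


data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8, 9.0]  # made-up values
codebook, codes = kmeans_1d(data, k=3)

# "Compression": keep only the k-entry codebook plus one small integer per point.
reconstructed = [codebook[c] for c in codes]
# "Feature extraction" / "latent variable": the code is a discrete derived feature.
max_error = max(abs(x - r) for x, r in zip(data, reconstructed))
print(codes, [round(c, 2) for c in codebook], round(max_error, 2))
```

The same output supports several of the readings above: the integer codes are a one-dimensional discrete representation of the data, the codebook is a lossy compressed form, and the reconstruction error measures what the compression discards.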

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7 “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf.idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R. (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
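As a rough illustration of the split-apply-combine idea these readings share, the sketch below groups made-up (key, value) records, reduces each group independently, and collects the results. It uses only the Python standard library; the records and variable names are hypothetical, and it is a single-process toy, not how Hadoop, Spark, or the tidyverse implement the pattern.

```python
from collections import defaultdict
from functools import reduce

# Made-up (key, value) records, e.g. counts by state.
records = [("PA", 3), ("NY", 5), ("PA", 4), ("CA", 2), ("NY", 1)]

# Split: group values by key (roughly the "shuffle" step in MapReduce terms).
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply: reduce each group on its own; groups do not interact,
# which is what makes this step easy to run in parallel.
totals = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

# Combine: gather the per-group results into one sorted table.
summary = sorted(totals.items())
print(summary)
```

Because addition is associative, each group's total could also be computed from partial sums on separate machines, which is the property the “Associative Statistics” chapter title points to.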

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20 “Programming Language Concepts”

• See also Haskell.

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning,” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal” / intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series / Time Series Cross Section / Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential / Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity / connectivity / neighborhood structure. See DeepLearning Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing, https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library, https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

Course Schedule 2019

January 8

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

10RulesforData, SoftwareCarpentry Lessons

Section “General Resources for Python and R”

Recommended: Python via Anaconda (https://www.anaconda.com), R (https://www.r-project.org) & RStudio (https://www.rstudio.com), Account on ICS-ACI (https://ics.psu.edu), Git (https://git-scm.com)

Exercise 1 (on website; due Jan 15th by 7:00 am)

January 15

• Readings (send list of confusing terms/concepts by 7:00 am Sunday):

BitByBit, Ch. 1 & 2

CompSocSci, Monroe-No, Monroe-5Vs, Jordan-NotYetAI

One article from a discipline different than yours listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData

• Further reference: Section “Big Data & Social Data Analytics”

January 22

• Readings (send list of confusing terms/concepts by 7:00 am Sunday):

GoogleFlu, GoogleBooks, EmbeddingsBias, MachineBias, RacistBot, BDSS-Census

ResearchMethodsKB “Measurement,” Quinn-Topics

• Further reference:

Section “¡Cuidado!”

Section “Measurement Reliability and Validity”

• Determine “discussion lead” dates

Jan 29

• Speakers may start as soon as this, which makes everything that follows somewhat unpredictable. See the Spring 2018 syllabus for a rough indication of how topics will be arranged.


February 5

February 12

February 19

Feb 26

• Semester Projects Must be Approved Before Spring Break

March 5 - Spring Break

March 12

• 25% Project Review

March 19

• 50% Project Review

April 2

April 9

• 75% Project Review

April 16

April 23 - LAST CLASS MEETING

• Team Project Presentations

May 1 - Team Projects Due

Readings and References (updated Spring 2019)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018. Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior.”

• DSHandbook-Py: Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution. Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• ‡[Jordan-NotYetAI] Michael Jordan. 2018. “Artificial Intelligence – The Revolution Hasn’t Happened Yet.” https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers. (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353:6307, pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS: 1.2.2-1.2.3 on Bonferroni

• Also: BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB: “Measurement”

• 7Rules: Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules: Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. (http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308) (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA: Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB: “Sampling”

• NRCreport: Ch. 8, “Sampling and Massive Data”

• BitByBit: Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. (Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit: Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data examples

• ResearchMethodsKB: “Design”

• 7Rules: Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA: Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1, “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit: Ch. 5, “Creating Mass Collaboration”

• NRCreport: Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py: Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com. ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1, “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit: Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research, IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching: Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py: Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport: Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py: Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/, esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py: Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification: Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching: Ch. 4, “Indexing”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA: Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Py: NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning: Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/ Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures.


Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf
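The connection named in the Similarity reading's title can be checked numerically: the Pearson correlation of two vectors equals the cosine similarity of their mean-centered versions. The following is only a minimal sketch; the function names (`cosine`, `center`, `pearson`) and the data are illustrative assumptions, not taken from any of the readings.

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center(x):
    """Subtract the mean from every element."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]

def pearson(x, y):
    """Pearson correlation from its covariance / standard deviation definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

# Made-up illustrative vectors.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]

r = pearson(x, y)
c = cosine(center(x), center(y))
# r and c agree up to floating-point rounding.
```

Centering is what separates the two measures: cosine similarity on the raw vectors is sensitive to a constant shift in either vector, while Pearson correlation is not.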

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
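To make the “compression” reading of k-means concrete, here is a minimal sketch of Lloyd's algorithm used as a vector quantizer: each point is replaced by the index of its nearest centroid (its code), and the centroids form the codebook. The function names, the toy data, and the hand-picked starting centroids are all illustrative assumptions. Practical implementations usually initialize randomly or with k-means++, and the result can depend on that choice.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def kmeans(points, init, iters=20):
    """Lloyd's algorithm from the given starting centroids."""
    centroids = list(init)
    for _ in range(iters):
        # Assignment step: label each point with its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]

# Made-up data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

# Start from one point in each group (an assumption made for determinism).
codebook, codes = kmeans(points, init=[points[0], points[-1]])

# "Decompression": replace every code by its codebook entry.
reconstructed = [codebook[c] for c in codes]
# codebook -> [(0.5, 0.5), (10.5, 10.5)]
# codes    -> [0, 0, 0, 0, 1, 1, 1, 1]
```

Eight two-dimensional points are stored as eight small integers plus a two-entry codebook. The same output can be read as a cluster assignment, a one-dimensional latent variable, or an extracted feature, which is the sense in which the groupings are arbitrary.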

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948


• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)


• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf


• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Python

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD.
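The split-apply-combine pattern named in this subsection can be sketched in a few lines of plain Python: split the records into groups by a key, apply a summary function to each group independently, and combine the per-group results. The helper name `split_apply_combine`, the record fields, and the numbers are illustrative assumptions. In practice this is what a pandas `groupby` or a dplyr `group_by` + `summarise` does.

```python
from collections import defaultdict

def split_apply_combine(records, key, apply):
    """Group records by key(record), then summarize each group with apply(group)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)                   # split
    return {k: apply(g) for k, g in groups.items()}    # apply, then combine

def mean_turnout(group):
    """Average of the (hypothetical) turnout field within one group."""
    return sum(rec["turnout"] for rec in group) / len(group)

# Made-up records for illustration only.
records = [
    {"state": "PA", "turnout": 0.61},
    {"state": "PA", "turnout": 0.57},
    {"state": "OH", "turnout": 0.64},
]

by_state = split_apply_combine(records, key=lambda rec: rec["state"], apply=mean_turnout)
# by_state has one entry per state, each holding that state's mean turnout.
```

Because each group is summarized independently, the apply step can run in parallel across groups. MapReduce scales up the same idea: the map phase emits keyed records, a shuffle performs the split, and reducers carry out the apply step.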

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”


• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org/ (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org/

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org/

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org/

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org/

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai/


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential Data, Streams” as in NLP, machine learning; Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/ (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io/ François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days/week): 877-229-6400

• Crisis Text Line (24 hours / 7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).



February 5

February 12

February 19

February 26

• Semester Projects Must be Approved Before Spring Break

March 5 - Spring Break

March 12

• 25% Project Review

March 19

• 50% Project Review

April 2

April 9

• 75% Project Review

April 16

April 23 - LAST CLASS MEETING

• Team Project Presentations

May 1 - Team Projects Due


Readings and References (updated Spring 2019)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org/ (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018. Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior”

• DSHandbook-Py, Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8, October. https://hbr.org/2012/10/big-data-the-management-revolution Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10):70-6, October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• ‡[Jordan-NotYetAI] Michael Jordan. 2018. “Artificial Intelligence – The Revolution Hasn’t Happened Yet.” https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni.

• Also BDSS-Census.

Research Design and Measurement

Overviews

bull BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB “Measurement.”

• 7Rules Ch. 3 “Build Reality Checks into Your Research.”

• Reliability: see also FightinWords.

• Validity: see also Quinn-Topics, Monroe-5Vs.

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage.)

• BitByBit: Examples in Chapter 2.

Multiple Measures, Latent Variable Measurement

• 7Rules Ch. 4 “Replicate Where Possible.”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA Ch. 17 “Factor Models.”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html https://cran.r-project.org/web/views/Psychometrics.html https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB “Sampling.”

• NRCreport Ch. 8 “Sampling and Massive Data.”

• BitByBit Ch. 3 “Asking Questions.”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: httpciteseerxistpsueduviewdocdownloaddoi=1011469939amprep=rep1amptype=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance).

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2).

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit Chapter 4 “Running Experiments”; Ch. 2: natural experiments in observable data, examples.

• ResearchMethodsKB “Design.”

• Firebaugh-7Rules Ch. 2 “Look for Differences that Make a Difference, and Report Them”; §Ch. 5 “Compare Like with Like”; Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change.”

• Shalizi-ADA Part IV “Causal Inference.”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit Ch. 5 “Creating Mass Collaboration.”

• NRCreport Ch. 9 “Human Interaction with Data.”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py Ch. 12 “Data Encodings and File Formats.”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com https://market.mashape.com ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. httpwwwmorganclaypoolcomtocwbe111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. https://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching Ch. 8 “Privacy Aspects of Data Matching.”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices.”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org.)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport Ch. 3 “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py Ch. 14 “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification Chapter 5 “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport Ch. 5 “Large-Scale Data Representations.”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition Section 2.3 “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. httpciteseerxistpsueduviewdocdownloaddoi=10112126533amprep=rep1amptype=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick.” (Lecture Notes) https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
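
As a rough illustration of the k-means point (not drawn from any of the readings), the sketch below fits k-means once and then reads the same fit two ways: as compression (each row replaced by a cluster index plus a small codebook of centroids) and as feature extraction / dimensionality reduction (each row re-expressed as its distances to the k centroids). The synthetic data, the helper name `kmeans`, and all parameter values are assumptions chosen for the example; the helper is a plain NumPy version of Lloyd's algorithm, not a library API.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


# Synthetic data (an assumption for illustration): 3 blobs in 10 dimensions.
rng = np.random.default_rng(1)
true_centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(100, 10)) for c in true_centers])

centroids, labels = kmeans(X, k=3)

# View 1, compression / vector quantization: store one small integer per row
# plus the (k x d) codebook, and reconstruct each row as its centroid.
X_hat = centroids[labels]
mse = ((X - X_hat) ** 2).mean()
total_var = ((X - X.mean(axis=0)) ** 2).mean()

# View 2, feature extraction / dimensionality reduction: replace the 10
# original columns with k = 3 "distance to centroid" features.
Z = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))

print("codebook shape:", centroids.shape)
print("reduced feature shape:", Z.shape)
print("reconstruction MSE vs. total variance:", mse, total_var)
```

The same `labels` vector could also be read as an estimate of a discrete latent variable, which is the "latent variable measurement" reading; which label applies depends on what is done with the output rather than on the algorithm itself.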

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition Sections 2.6-2.7 “Feature Selection, Feature Extraction”

• DSHandbook-Py Ch. 7 “Interlude: Feature Extraction Ideas.”

• DataScience-Python NB 05.04 “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features.”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction.”

• DSHandbook-Py Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis.” Latent

• DeepLearning Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression: e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning Section 5.11.3 “Manifold Learning.”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)


• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf


• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4 “Mining Data Streams”


• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential Data / Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity / connectivity / neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network / Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf, https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


Readings and References (updated Spring 2019)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb. 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374.) Ch. 1 “Introduction,” Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018. Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction,” Ch. 2 “Observing Behavior”

• DSHandbook-Py, Ch. 1 “Introduction: Becoming a Unicorn,” Ch. 2 “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution. Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• ‡[Jordan-NotYetAI] Michael Jordan. 2018. “Artificial Intelligence – The Revolution Hasn’t Happened Yet.” https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7


• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections/

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS, 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

bull BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3 “Build Reality Checks into Your Research”


• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit, Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4 “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17 “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8 “Sampling and Massive Data”

• BitByBit, Ch. 3 “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4 “Running Experiments”; Ch. 2, natural experiments in observable data examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2 “Look for Differences that Make a Difference, and Report Them”; §Ch. 5 “Compare Like with Like”; Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5 “Creating Mass Collaboration”

• NRCreport, Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15): 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py, Ch. 9 “Technical Communication and Documentation”; Ch. 15 “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport, Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/ esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py, Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification, Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/ Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” “collaborative filtering.”
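A minimal sketch of why k-means sits comfortably under several of these labels at once. It assumes a toy two-dimensional dataset and a hypothetical `kmeans` helper written only for illustration (it is not taken from any of the readings). It seeds the centroids with the first k points, which is a simplification. The same output can be read as a clustering (the labels), as a compression or vector quantization (a small codebook plus one integer code per point), or as a one-dimensional categorical "latent variable" standing in for the original features.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm on a list of equal-length tuples."""
    # Assumption for illustration: seed with the first k points.
    # Practical implementations use random restarts or k-means++.
    centroids = list(points[:k])
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(centroids[j])  # leave an empty cluster's centroid in place
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]

codebook, codes = kmeans(points, k=2)

# "Clustering" / "latent variable" view: one discrete label per observation.
print(codes)  # [0, 0, 0, 1, 1, 1]

# "Compression" view: store 2 centroids plus 6 small integers instead of
# 6 full points, then decode (lossily) by looking up each code.
reconstructed = [codebook[c] for c in codes]
print(reconstructed)
```

Which heading the method belongs under depends on which of these outputs the analyst keeps, which is the sense in which the groupings are arbitrary.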

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Morgan Kaufmann. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6 “Clustering”

• DataScience-Python, NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7 “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning”

• DataScience-Python, NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5 “Link Analysis”; Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387; †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11 “Statistical Analysis of Shape”; Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/ (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io/; François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds. Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• ‡[Jordan-NotYetAI] Michael Jordan. 2018. “Artificial Intelligence – The Revolution Hasn’t Happened Yet.” https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7


• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections/

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/

• †[CriticalQuestions] danah boyd and Kate Crawford. 2012. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3 “Build Reality Checks into Your Research”


• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2.

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4 “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17 “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8 “Sampling and Massive Data”

• BitByBit, Ch. 3 “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)


• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4 “Running Experiments”; Ch. 2, natural experiments in observable data, examples.

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2 “Look for Differences that Make a Difference, and Report Them”; §Ch. 5 “Compare Like with Like”; Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1 “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5 “Creating Mass Collaboration”

• NRCreport, Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009


• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org/. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com/, https://market.mashape.com/; ProgrammableWeb: https://www.programmableweb.com/

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1 “Introduction”; Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping: https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html


Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002


• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3


Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)


• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures


Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
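As one hedged illustration of why these labels overlap, the sketch below runs a plain Lloyd's-algorithm k-means in Python on made-up toy points and then reads the same fitted output three ways. The function names, starting centroids, and data are assumptions chosen for illustration; they do not come from any of the readings.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]


def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm; `centroids` are hypothetical starting guesses."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # Move the centroid to the coordinate-wise mean of its members.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster in place
        if new_centroids == centroids:  # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Toy data (made up): two well-separated groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])

# Dimensionality reduction / latent variable: two coordinates become one label.
# Compression (vector quantization): keep the codebook plus one label per point.
reconstructed = [centroids[lab] for lab in labels]
# Feature extraction: distances to each centroid form a new feature vector.
features = [[dist(p, c) for c in centroids] for p in points]
```

Nothing about the fit changes between the three readings; only the part of the output that is kept (labels, codebook, or distances) differs, which is why the same method shows up under several headings below.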

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948


• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)


• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf


• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
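The split-apply-combine idea named in the TidyData-R entry, and its kinship with MapReduce, can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the records, the grouping key, and the helper names are all hypothetical, and no claim is made that any of the readings implements it this way.

```python
from collections import defaultdict
from functools import reduce
from statistics import mean

# Hypothetical records: (group key, value) pairs, e.g. a score by state.
records = [("PA", 3.0), ("NY", 5.0), ("PA", 5.0), ("NY", 1.0), ("OH", 4.0)]


def split(rows):
    """Split: partition the values by their group key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def split_apply_combine(rows, func):
    """Apply `func` to each group, then combine the results into one dict."""
    return {key: func(values) for key, values in split(rows).items()}


group_means = split_apply_combine(records, mean)


# The same computation in MapReduce style: the mapper emits (key, (sum, count))
# and the reducer adds pairs. Because that addition is associative, partial
# results could be merged in any order, for example across machines.
def map_step(row):
    key, value = row
    return key, (value, 1)


def reduce_step(a, b):
    return (a[0] + b[0], a[1] + b[1])


partials = defaultdict(list)
for key, pair in map(map_step, records):
    partials[key].append(pair)

mr_means = {}
for key, pairs in partials.items():
    total, count = reduce(reduce_step, pairs)
    mr_means[key] = total / count
```

Carrying the pair (sum, count) rather than the mean itself is the design choice that matters here: means of sub-groups cannot simply be averaged, whereas sums and counts can be added.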

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”


• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential Data, Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David Mackay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io; François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

    Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

    Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters’.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS: 1.2.2–1.2.3 on Bonferroni

• Also: BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement: Reliability and Validity

• ResearchMethodsKB: “Measurement”

• 7Rules: Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules: Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA: Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB: “Sampling”

• NRCreport: Ch. 8, “Sampling and Massive Data”

• BitByBit: Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit: Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data (examples)

• ResearchMethodsKB: “Design”

• Firebaugh-7Rules: Ch. 2, “Look for Differences that Make a Difference, and Report Them”; § Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA: Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit: Ch. 5, “Creating Mass Collaboration”

• NRCreport: Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py: Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit: Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

    Human Subjects Research (IRB): https://www.research.psu.edu/irb

    Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

    Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

    Research Misconduct: https://www.research.psu.edu/researchmisconduct

    SARI @ PSU (Scholarship and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching: Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py: Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport: Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py: Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: the tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py: Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification: Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA: Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates). DataScience-Python: NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning: Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport: Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1. Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/. Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0. Ch. 2, “Data Mapping and Data Dictionaries”

• See also: Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition: Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

    M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

    Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

    Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

    Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick.” (Lecture Notes) https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations: Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
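As a rough illustration of why k-means resists a single label, the sketch below runs a bare-bones version of Lloyd's algorithm on a few made-up 2-D points and then reads the same output three ways. It is not drawn from any of the readings; the function names (`sq_dist`, `nearest`, `kmeans`), the toy data, and the deterministic initialization are all assumptions chosen to keep the example reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=25):
    """Minimal Lloyd's algorithm (illustrative sketch, not production code)."""
    n = len(points)
    # Assumption: deterministic initialization from evenly spaced input points,
    # used here instead of random starts so the result is reproducible.
    centers = [points[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Final assignment so the labels match the returned centers.
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Made-up data: three loose groups of 2-D points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
          (9.0, 1.0), (9.2, 0.9), (8.9, 1.2)]
centers, labels = kmeans(points, k=3)

# "Dimensionality reduction" / "latent variable" view:
# each 2-D point is summarized by a single discrete cluster index.
codes = labels

# "Compression" view (vector quantization): store k centers plus one code per
# point, and reconstruct each point lossily as its cluster center.
reconstructed = [centers[c] for c in codes]

# "Feature extraction" view: describe each point by its distance to every center.
features = [[sq_dist(p, c) ** 0.5 for c in centers] for p in points]
```

The fitted model is the same in all three readings; only what is kept (`codes`, `reconstructed`, or `features`) and how it is used downstream differs, which is why the placement of k-means in the headings here is a matter of convenience.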

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding; blocking via vector quantization)

• MMDS: Ch. 3, “Finding Similar Items” (minhashing, locality-sensitive hashing)

• Multivariate-R: Ch. 6, “Clustering”

• DataScience-Python: NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition: Sections 2.6–7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py: Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python: NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS: Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py: Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

    See also Multivariate-R: Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning: Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression: e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning.”

• DataScience-Python, NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap).

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE).

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”).

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing. 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent).

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations. 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM).

• Bayes; ComputationalStatistics-Py.

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow).

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce.”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations.”

• TidyData-R. (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD.

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20 “Programming Language Concepts.”

• See also Haskell.

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4 “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data.”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook. (Python-based.) Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13 “Geographic Data with Basemap.”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential Data, Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R. 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16 “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7 “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr/).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• Reliability: see also FightinWords.

• Validity: see also Quinn10-Topics, Monroe-5Vs.

Indirect / Unobtrusive / Nonreactive Measures, Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit, Examples in Chapter 2.

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4 “Replicate Where Possible.”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17 “Factor Models.”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling. 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling.”

• NRCreport, Ch. 8 “Sampling and Massive Data.”

• BitByBit, Ch. 3 “Asking Questions.”

bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp

rep=rep1amptype=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance).

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS, Hash Functions (1.3.2), Sampling in streams (4.2).

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4 “Running Experiments”; Ch. 2, natural experiments in observable data, examples.

• ResearchMethodsKB, “Design.”

• Firebaugh-7Rules, Ch. 2 “Look for Differences that Make a Difference, and Report Them,” Ch. 5 “Compare Like with Like,” Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change.”

• Shalizi-ADA, Part IV “Causal Inference.”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit, Ch. 5 “Creating Mass Collaboration.”

• NRCreport, Ch. 9 “Human Interaction with Data.”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15): 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12 “Data Encodings and File Formats.”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web.”

bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww

morganclaypoolcomtocwbe111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”).

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002


• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3


Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)


• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures


Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
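
To make the point about k-Means concrete, here is a minimal sketch (not drawn from any of the readings) of 1-D k-means used as vector quantization. The function name `kmeans_1d`, the quantile-style initialization, and the toy data are all assumptions chosen for illustration. The same fitted object can be read several ways: the labels are cluster assignments (or a discrete latent variable), and the centroid "codebook" plus labels is a compressed, lower-dimensional stand-in for the raw values.

```python
# Illustrative sketch only: 1-D k-means (Lloyd's algorithm) read as vector quantization.
# kmeans_1d, its initialization rule, and the toy data are hypothetical choices.

def kmeans_1d(values, k, iters=20):
    """Return (centroids, labels) for a list of floats."""
    srt = sorted(values)
    # Deterministic initialization: pick k starting centroids spread across the sorted data.
    centroids = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each point is labeled with its nearest centroid.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2) for v in values]
        # Update step: each centroid moves to the mean of the points assigned to it.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.1, 8.9, 9.0]
centroids, labels = kmeans_1d(data, k=3)

# "Compression": store 3 centroids plus 9 small integers instead of 9 floats,
# then decode each point as the centroid of its cluster.
reconstructed = [centroids[lab] for lab in labels]
```

Here `labels` plays the role of the clustering or latent class, while `reconstructed` is the lossy decoding that a compression or quantization reading would emphasize.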

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948


• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval, re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)


• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf


• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
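
For readers new to the split-apply-combine pattern named in the list above, the following is a minimal sketch in plain Python rather than R or the tidyverse. The helper `split_apply_combine`, the field names, and the toy rows are hypothetical and are not taken from any of the readings. The idea it illustrates: split records into groups by a key, apply a summary function to each group independently, and combine the per-group results. Because each group is processed independently, the apply step is the part that could in principle be run in parallel.

```python
from collections import defaultdict
from statistics import mean

# Illustrative sketch only; split_apply_combine and the toy data are hypothetical.

def split_apply_combine(rows, key, value, func):
    """Group rows by rows[key], apply func to each group's values, return a dict."""
    groups = defaultdict(list)
    for row in rows:
        # Split: route each record's value to the bucket for its key.
        groups[row[key]].append(row[value])
    # Apply func to each bucket independently, then combine into one result.
    return {k: func(vals) for k, vals in groups.items()}


rows = [
    {"state": "PA", "turnout": 0.60},
    {"state": "PA", "turnout": 0.70},
    {"state": "OH", "turnout": 0.50},
]
by_state = split_apply_combine(rows, "state", "turnout", mean)
counts = split_apply_combine(rows, "state", "turnout", len)
```

Swapping `mean` for `len`, `max`, or any other function of a list changes the summary without touching the grouping logic, which is the design point of the pattern.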

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”


• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential Data, Streams,” as in NLP and machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing, https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library, https://keras.io; François Chollet, Directory of Keras tutorials, https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University's educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients' cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4 “Running Experiments”; Ch. 2, natural experiments in observable data examples.

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2 “Look for Differences that Make a Difference, and Report Them,” §Ch. 5 “Compare Like with Like,” Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5 “Creating Mass Collaboration”

• NRCreport, Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009


• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com. ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html


Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002


• DataMatching, Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1: 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py, Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3


Databases and Data Management

• NRCreport, Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/ esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py, Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)


• DataSimplification, Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”
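
For readers new to record linkage, the idea of blocking mentioned in the entries above can be sketched in a few lines of Python. This is only an illustrative toy, not the method of any reading listed here: the records are made up, and the blocking key (surname initial plus birth year) is a hypothetical choice. The point is that only records sharing a key are compared, so the number of candidate pairs falls well below the all-pairs count.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: surname initial plus birth year."""
    return (record["surname"][:1].lower(), record["birth_year"])


def candidate_pairs(records):
    """Return sorted pairs of record ids that fall in the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        # Compare only within a block, never across blocks.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Made-up toy records, for illustration only.
records = [
    {"id": 1, "surname": "Haddad", "birth_year": 1980},
    {"id": 2, "surname": "Hadad", "birth_year": 1980},
    {"id": 3, "surname": "Khalil", "birth_year": 1975},
    {"id": 4, "surname": "Halabi", "birth_year": 1980},
    {"id": 5, "surname": "Khalil", "birth_year": 1991},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2  # comparisons without blocking
print(pairs, "of", all_pairs, "possible pairs")
```

A real pipeline would follow this with a detailed comparison of each candidate pair; the trade-off is that true matches whose keys differ (for example, a misrecorded birth year) are never compared at all.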

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/ Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries”

bull See also Social Data Structures


Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf
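
Two standard identities underlie the topics in this list, and a small numerical check may help before the readings. The sketch below is illustrative only, with made-up vectors and hypothetical function names: it shows that Pearson correlation equals the cosine similarity of mean-centered vectors, and that a degree-2 polynomial kernel on 2-D inputs equals an ordinary dot product after an explicit feature map (the "kernel trick" is computing the left side without ever building the right).

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def center(x):
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def pearson(x, y):
    # Pearson correlation is cosine similarity after mean-centering.
    return cosine(center(x), center(y))


def poly_kernel(x, y):
    # Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2.
    return dot(x, y) ** 2


def phi(x):
    # Explicit feature map for that kernel on 2-D inputs.
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


u = [1.0, 2.0, 3.0]
v = [7.0, 9.0, 11.0]          # v = 2u + 5, an exact linear function of u
r = pearson(u, v)             # perfectly correlated
c = cosine(u, v)              # close to, but not exactly, 1

x = [1.0, 2.0]
y = [3.0, 0.5]
kernel_value = poly_kernel(x, y)        # computed in the 2-D input space
explicit_value = dot(phi(x), phi(y))    # computed in the 3-D feature space
```

Here the uncentered cosine differs from the correlation because `v` has an intercept, while the two kernel computations agree up to floating-point error.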

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
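
To make the "compression" reading of k-means concrete, here is a minimal sketch, assuming scalar data and hand-picked starting centers (both are illustrative choices, as are the function names). Each value is replaced by the index of its nearest center, so nine floats are stored as three centers plus nine small integer codes, at the cost of some reconstruction error.

```python
def assign(values, centers):
    """Index of the nearest center for each value (the codebook lookup)."""
    return [min(range(len(centers)), key=lambda k: (v - centers[k]) ** 2)
            for v in values]


def kmeans_1d(values, centers, n_iter=20):
    """Lloyd's algorithm on scalars, starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        labels = assign(values, centers)
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = sum(members) / len(members)
    return centers, assign(values, centers)


# Made-up data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.1, 8.9, 9.0]
centers, labels = kmeans_1d(values, centers=[0.0, 5.0, 10.0])

# "Decompress": replace each code by its center and measure the loss.
reconstructed = [centers[lab] for lab in labels]
max_error = max(abs(v - r) for v, r in zip(values, reconstructed))
```

The same labels can equally be read as a one-dimensional discrete feature or a latent class assignment, which is the sense in which the categories overlap.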

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948


• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7 “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis” Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)


• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM).

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow).

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R. (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
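
As a minimal illustration of the split-apply-combine pattern named in this section heading (a sketch only, not taken from any of the readings; the toy records and the helper name `split_apply_combine` are hypothetical), the following Python groups records by a key, applies a summary function to each group, and combines the results into one dictionary. The three steps correspond loosely to the map, shuffle, and reduce stages of MapReduce.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(records, key, value, summarize):
    """Group records by `key`, summarize each group's `value`s, and
    return one {group: summary} dictionary."""
    # Split: route each record's value to the bucket for its key.
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    # Apply + combine: summarize every bucket and collect the results.
    return {group: summarize(values) for group, values in groups.items()}


# Hypothetical toy data: survey responses tagged by state.
responses = [
    {"state": "PA", "score": 4},
    {"state": "NY", "score": 2},
    {"state": "PA", "score": 2},
    {"state": "NY", "score": 5},
    {"state": "OH", "score": 3},
]

mean_by_state = split_apply_combine(responses, "state", "score", mean)
count_by_state = split_apply_combine(responses, "state", "score", len)
```

Because each group is summarized independently of the others, the apply step is the part that could, in principle, be handed to separate workers; this single-process version only shows the logic.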

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also Haskell.

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook. (Python-based.) Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• †[MTurkQuality] Ryan Kennedy, Scott Clifford, Tyler Burleigh, Ryan Jewell, Philip Waggoner. 2018. “The Shape of and Solutions to the MTurk Quality Crisis.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3272468

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats.”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web-Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”).

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles …” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching.”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices.”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage.”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers); Ch. 8 (Splines); Sect. 14.4 (Kernel Density Estimates). DataScience-Python, NB 05.13, “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization).

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks).

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation.”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/ Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also: Social Data Structures


Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
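To make the overlap concrete, here is a minimal sketch of k-means read as compression (vector quantization) and as dimensionality reduction. It is an illustration only, not taken from any of the readings: the toy points, the helper names (`sq_dist`, `nearest`, `kmeans`), and the evenly spaced initialization are all assumptions chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on a list of equal-length tuples.

    Initialization takes evenly spaced points from the input list
    (an assumption made for reproducibility; random restarts are
    the more usual choice).
    """
    step = max(1, len(points) // k)
    centroids = list(points[::step][:k])
    for _ in range(iters):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)

# "Dimensionality reduction" / "compression": every 2-D point is now
# represented by a single integer code into a k-row codebook.
codes = labels
# "Decompression": replace each code by its codebook entry (lossy).
reconstructed = [centroids[c] for c in codes]
# Total squared reconstruction error, the quantity k-means minimizes.
distortion = sum(sq_dist(p, r) for p, r in zip(points, reconstructed))
```

Read this way, the cluster labels are the derived (one-dimensional, categorical) representation, the centroids are the codebook, and `distortion` measures what the compression throws away.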

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948


• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis” Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)


• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf


• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also: FactorSGD
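The split-apply-combine and MapReduce ideas named in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not code from any of the readings: the `records` data and the function names (`split`, `apply_mean`, `combine`, `mapper`, `reducer`) are hypothetical, and a real MapReduce system would distribute the map and reduce steps across machines.

```python
from collections import defaultdict

# Hypothetical (group, value) records, e.g. a count observed per state.
records = [("PA", 3), ("NY", 5), ("PA", 4), ("NY", 1), ("OH", 2)]


def split(rows):
    """Split: partition the values by group key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def apply_mean(values):
    """Apply: the per-group computation (here, a mean)."""
    return sum(values) / len(values)


def combine(groups, fn):
    """Combine: gather the per-group results into one structure."""
    return {key: fn(values) for key, values in sorted(groups.items())}


group_means = combine(split(records), apply_mean)


# The same statistic in a MapReduce flavour. The mean itself is not
# associative, but (sum, count) pairs are, so partial results can be
# merged in any order, which is what makes the computation parallelizable.
def mapper(row):
    key, value = row
    return key, (value, 1)


def reducer(a, b):
    return (a[0] + b[0], a[1] + b[1])


partials = {}
for key, pair in map(mapper, records):
    partials[key] = reducer(partials[key], pair) if key in partials else pair

mr_means = {key: total / count for key, (total, count) in partials.items()}
```

Both routes give the same per-group means; the second shows why readings on scalable algorithms emphasize statistics that can be accumulated associatively.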

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also: Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”


• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6; †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910; ‡Scala site: https://www.scala-lang.org/ (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847; ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf; ‡Julia site: https://julialang.org/

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415; †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html; †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html; Haskell site: https://www.haskell.org/

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1; †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848; †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html; ‡Clojure site: https://clojure.org/

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695; ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730; ‡TensorFlow site: https://www.tensorflow.org/

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly; Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O; ‡H2O site: https://www.h2o.ai/


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology: †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology: DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential Data, Streams,” as in NLP, machine learning: Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387; †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/ (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io/; François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days a week): 877-229-6400

• Crisis Text Line (24 hours / 7 days a week): Text LIONS to 741741

Educational Equity

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research (IRB): https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002


• DataMatching, Ch. 8, “Privacy Aspects of Data Matching.”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014) “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices.”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3


Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)


• DataSimplification, Chapter 5, “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage.”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples); Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation.”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations.”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.


Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
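
To make the point about overlapping labels concrete, here is a minimal sketch (not taken from any of the readings) of k-means on scalar data in plain Python. The function names, the toy data, and the starting centroids are all assumptions made for illustration. The same fitted object can be read as a clustering (the labels), a compression scheme (one small integer per value plus a short codebook), or a discrete latent variable.

```python
def nearest(value, centroids):
    """Index of the centroid closest to a scalar value."""
    return min(range(len(centroids)), key=lambda k: (value - centroids[k]) ** 2)


def kmeans_1d(values, init_centroids, n_iter=20):
    """Lloyd's algorithm on scalar data, started from the given centroids."""
    centroids = list(init_centroids)
    for _ in range(n_iter):
        # Assignment step: label each value with its nearest centroid.
        labels = [nearest(v, centroids) for v in values]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = sum(members) / len(members)
    return centroids, [nearest(v, centroids) for v in values]


# Hypothetical toy data: six measurements that fall into two obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, labels = kmeans_1d(values, init_centroids=[0.0, 6.0])

# Clustering view: the labels themselves.
# Compression view: store `labels` plus the k-entry codebook `centroids`.
# Latent-variable view: each label is a one-dimensional discrete summary.
reconstruction = [centroids[lab] for lab in labels]
print(labels)
print(reconstruction)
```

Replacing each value with its centroid is lossy; how much is lost depends on the number of centroids, which is the same trade-off that vector quantization makes in the compression readings below.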

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering.”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002) “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948


• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction.”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas.”

• DataScience-Python, NB 05.04, “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features.”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning.”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)


• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf


• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
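
For readers new to the split-apply-combine pattern named above, the following is a minimal plain-Python sketch. It is an illustration only: the helper names and the survey records are hypothetical, and this is not how the tidyverse or MapReduce is implemented internally. In practice the same three steps are usually written with dplyr's `group_by()` and `summarise()` in R or with `groupby` in pandas.

```python
from collections import defaultdict


def split_apply_combine(records, key, value, func):
    """Group records by `key`, apply `func` to each group's values, combine."""
    # Split: partition the values by group key.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])
    # Apply and combine: one summary per group, collected into a dict.
    return {group: func(vals) for group, vals in groups.items()}


def mean(xs):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(xs) / len(xs)


# Hypothetical survey records, invented for illustration.
records = [
    {"state": "PA", "age": 30},
    {"state": "PA", "age": 40},
    {"state": "NY", "age": 25},
]
summary = split_apply_combine(records, key="state", value="age", func=mean)
print(summary)
```

Because each group is summarized independently of the others, the apply step can in principle be distributed across workers, which is the connection to the MapReduce readings in this list.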

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”


• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning, Spark, and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny: http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing, https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients' cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching.”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Mercè Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014) “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021(2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices.”

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

• ‡[FairAlgorithms] Rachel Courtland. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” https://www.nature.com/articles/d41586-018-05469-3

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDSHandbook-Py), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O'Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage.”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation.”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations.”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures.”

• ‡[Similarity] Brendan O'Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick.” (Lecture Notes) https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
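
To make the “compression” reading of k-means concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration and is not taken from any of the readings: the function names `nearest`, `kmeans`, and `reconstruction_error`, the synthetic two-blob data, and the choice k = 2 are all assumptions made for the example. Each point is replaced by the index of its nearest centroid, so the data set is stored as a short codebook plus one small integer per point, which is the vector-quantization view of clustering.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def kmeans(points, k, iters=25, seed=0):
    """Plain Lloyd's algorithm; returns (codebook, codes)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k of the data points
    for _ in range(iters):
        codes = [nearest(p, centroids) for p in points]  # assignment step
        for j in range(k):  # update step
            members = [p for p, c in zip(points, codes) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    codes = [nearest(p, centroids) for p in points]  # final assignment
    return centroids, codes


def reconstruction_error(points, codebook, codes):
    """Mean squared distance between each point and its codebook entry."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, codebook[c]))
        for p, c in zip(points, codes)
    )
    return total / len(points)


# Hypothetical data: two well-separated 2-D blobs of 50 points each.
rng = random.Random(1)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
data += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]

codebook, codes = kmeans(data, k=2)
# `codes` is the compressed data: one integer per point instead of two floats.
# `codebook[c]` is the lossy reconstruction of every point with code c.
print(codebook, reconstruction_error(data, codebook, codes))
```

The same two outputs support the other readings in the list: `codes` is a one-dimensional discrete summary of two-dimensional data (dimensionality reduction, latent variable measurement), and the reconstruction error measures how much information that summary discards.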

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering.”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS '14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction.”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas.”

• DataScience-Python, NB 05.04, “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords.

• Features in images: DataScience-Python, NB 05.14, “Image Features.”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning.”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don't Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack.” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R. (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
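
As a small illustration of the split-apply-combine pattern named in this subsection, the following sketch uses only the Python standard library. The helper name `split_apply_combine`, the record fields, and the toy numbers are hypothetical, chosen for the example rather than drawn from the readings; tools such as dplyr's `group_by()` followed by `summarise()`, or pandas `groupby`, express the same three steps more compactly.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(records, key, value, func):
    """Group `records` by `key`, apply `func` to each group's `value`s, return a dict."""
    groups = defaultdict(list)
    for rec in records:  # split: route each record's value to its group
        groups[rec[key]].append(rec[value])
    # apply `func` within each group, then combine the results into one mapping
    return {group: func(values) for group, values in groups.items()}


# Hypothetical toy records: one row per county, with a made-up turnout rate.
rows = [
    {"state": "PA", "turnout": 0.5},
    {"state": "PA", "turnout": 0.25},
    {"state": "OH", "turnout": 0.75},
]

by_state = split_apply_combine(rows, key="state", value="turnout", func=mean)
print(by_state)
```

Because each group is processed independently of the others, the apply step is the part that a MapReduce-style system can distribute across workers.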

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also Haskell.

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O'Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O'Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User's Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet. Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

Databases and Data Management

• NRCreport Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse. http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler. http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4 “Indexing”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures

bull See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” “collaborative filtering.”
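As one illustration of why these labels overlap, the sketch below runs a plain k-means (Lloyd's algorithm) on six made-up 2-D points and then reads the same output in more than one way. The function name `kmeans`, the toy points, and the fixed starting centroids are all assumptions chosen for the example; none of them come from the readings.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm; `centroids` holds the (hypothetical) starting centers."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy data: two visually obvious groups (illustrative values only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

# "Dimensionality reduction" / "latent variable": each 2-D point becomes one label.
# "Compression": six points are stored as six labels plus two centroids,
# and can be approximately reconstructed from that codebook.
reconstructed = [centroids[k] for k in labels]
```

Here `labels` plays the role of a one-dimensional discrete feature, while `centroids` acts as the codebook of a vector quantizer; which name applies depends only on what is done with the output next.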

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering”, NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition Sections 2.6-7 “Feature Selection, Feature Extraction”

• DSHandbook-Py Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9 “Recommendation Systems”, Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R Ch. 3 “Principal Components Analysis”, Ch. 4 “Multidimensional Scaling”, Ch. 5 “Exploratory Factor Analysis”, Latent

• DeepLearning Ch. 2 “Linear Algebra”, Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding”, “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning Section 5.11.3 “Manifold Learning”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py Ch. 21 “Performance and Computer Memory”, Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning Ch. 4 “Numerical Computation”, Ch. 6 “Deep Feedforward Networks”, Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning Ch. 17 “Monte Carlo Methods”, Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms Ch. 3 “Scalable Algorithms and Associative Statistics”, Ch. 4 “Hadoop and MapReduce”

• NRCreport Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
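For readers new to the split-apply-combine idea named in the TidyData-R entry above, a minimal sketch in plain Python may help. The helper `split_apply_combine`, the field names, and the turnout figures are hypothetical and invented for illustration; the readings themselves work in R (tidyverse) or at cluster scale (MapReduce, Spark).

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(rows, key, value, func):
    """Group `rows` by `key`, apply `func` to each group's `value`s, return a dict."""
    groups = defaultdict(list)
    for row in rows:                      # split: route each record to its group
        groups[row[key]].append(row[value])
    # apply `func` within each group, then combine the results into one mapping
    return {k: func(v) for k, v in groups.items()}


# Hypothetical records (made-up numbers).
rows = [
    {"state": "PA", "turnout": 60},
    {"state": "PA", "turnout": 62},
    {"state": "OH", "turnout": 55},
    {"state": "OH", "turnout": 59},
    {"state": "NY", "turnout": 50},
]
mean_turnout = split_apply_combine(rows, "state", "turnout", mean)
```

Because each group is processed independently of the others, the apply step could in principle be run on separate workers, which is roughly the intuition that connects this pattern to MapReduce-style parallelism.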

Functional Programming

• ComputationalStatistics-Python “Functions are first class objects” through first exercises

• DSHandbook-Py Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py Ch. 13 “Big Data”

• DeepLearning Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential Data, Streams,” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS Ch. 5 “Link Analysis”, Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

21

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval.

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics.

• FightinWords.

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining).

• CausalInference.

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io; François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage.”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization).

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks).

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation.”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations.”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
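One way to see why the same method can sit under several of these headings: the sketch below runs a tiny k-means (Lloyd's algorithm) in plain Python and then reads the result as compression, replacing each 2-D point with a single cluster index plus a shared codebook of centroids. The data, the function names (`sq_dist`, `nearest`, `kmeans`), and the fixed initialization are illustrative assumptions, not taken from any of the readings.

```python
# Minimal k-means (Lloyd's algorithm) in pure Python, read as vector quantization.
# All names and data here are illustrative assumptions, not from the course readings.

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, centroids, n_iter=20):
    """Alternate assignment and centroid-update steps for a fixed number of iterations."""
    centroids = list(centroids)
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
# Deterministic start: use the first and last points as initial centroids.
codebook, codes = kmeans(points, centroids=[points[0], points[-1]])

# "Compression": six 2-D points become six small integers plus a 2-entry codebook.
reconstructed = [codebook[c] for c in codes]
distortion = sum(sq_dist(p, r) for p, r in zip(points, reconstructed)) / len(points)

print(codes)       # cluster index per point
print(codebook)    # centroids, i.e. the learned "dictionary"
print(distortion)  # mean squared reconstruction error
```

Read one way, `codes` is a clustering; read another, it is a one-dimensional discrete representation of two-dimensional data (dimensionality reduction), and `codebook` plus `codes` is a lossy compressed encoding whose quality is measured by `distortion`.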

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Morgan Kaufmann. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization).

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing).

• Multivariate-R, Ch. 6, “Clustering.”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection,” “Feature Extraction.”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas.”

• DataScience-Python, NB 05.04, “Feature Engineering.”

• Features in text: NLP and InfoRetrieval, re: tfidf and similar; FightinWords.

• Features in images: DataScience-Python, NB 05.14, “Image Features.”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning.”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap).

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE).

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”).

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent).

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM).

• Bayes; ComputationalStatistics-Py.

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow).

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
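The split-apply-combine pattern named above, and its MapReduce cousin, can be sketched in a few lines of plain Python: split records by a key, apply a summary to each group independently, and combine the per-group results. The records, field names, and helpers (`split_apply_combine`, `map_words`, `reduce_counts`) below are hypothetical illustrations; this is a single-process sketch of the idea, not how Hadoop, Spark, or the tidyverse implement it.

```python
# Split-apply-combine in pure Python; records and names are hypothetical examples.
from collections import defaultdict
from functools import reduce

def split_apply_combine(records, key, apply):
    """Group records by `key`, run `apply` on each group, return {key: result}."""
    groups = defaultdict(list)
    for rec in records:              # split
        groups[key(rec)].append(rec)
    return {k: apply(g) for k, g in groups.items()}  # apply + combine

records = [
    {"state": "PA", "turnout": 0.61},
    {"state": "PA", "turnout": 0.59},
    {"state": "OH", "turnout": 0.55},
]
mean_turnout = split_apply_combine(
    records,
    key=lambda r: r["state"],
    apply=lambda g: sum(r["turnout"] for r in g) / len(g),
)

# The same idea in MapReduce form: map each document to (word, 1) pairs,
# then reduce by summing counts per word. Because addition is associative,
# partial counts from separate chunks could be merged in any order.
def map_words(doc):
    return [(w, 1) for w in doc.lower().split()]

def reduce_counts(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc

docs = ["big social data", "social data analytics"]
pairs = [pair for doc in docs for pair in map_words(doc)]
word_counts = reduce(reduce_counts, pairs, {})

print(mean_turnout)
print(word_counts)
```

The design point is that each group (or each chunk of documents) can be processed without seeing the others, which is what makes the pattern straightforward to parallelize.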

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution

bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran

r-projectorgwebviewsTimeSerieshtml

Network Graphs

bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome

kleinbernetworks-book

bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo

bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https

link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8

bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook

1010072F978-3-319-53004-8

Hierarchy Clustered Data Aggregation Mixed Models

bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457

bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction

docID=10748641

21

Social Data Channels

Language Text Speech Audio

bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3

bull InfoRetrieval

bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP

bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan

mps028

bull Quinn-Topics

bull FightinWords

bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http

linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8

bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo

bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml

bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml

bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam

pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI

bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11

bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet. Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See the documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
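To make the point about k-Means concrete, here is a minimal sketch, not drawn from any of the readings, that runs Lloyd's algorithm on hypothetical toy data and then treats the result as a compressed representation: each point is stored as a single cluster index plus a shared table of centers. The function names, the toy points, and the choice to initialize from the first k points are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm; the first k points are the initial centers (a simplification)."""
    centers = list(points[:k])
    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, k=2)

# "Compression": keep k centers plus one small integer per point,
# then rebuild an approximation of the data from that codebook.
reconstructed = [centers[lab] for lab in labels]
worst_error = max(squared_distance(p, r) for p, r in zip(points, reconstructed))
print(labels, worst_error)
```

Read one way, `labels` is a one-dimensional discrete summary of two-dimensional data (dimensionality reduction, or a measured latent class); read another way, `centers` plus `labels` is a lossy code for the original points (compression via vector quantization). The same output supports each of the labels listed above, which is why the groupings are hard to keep apart.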

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering.”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction.”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas.”

• DataScience-Python, NB 05.04, “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords.

• Features in images: DataScience-Python, NB 05.14, “Image Features.”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning.”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
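For readers who have not met split-apply-combine before, the pattern named in the TidyData-R entry above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper name `split_apply_combine`, the field names, and the toy records are hypothetical and do not come from any of the listed readings.

```python
from collections import defaultdict


def split_apply_combine(rows, key, value, func):
    """Group `rows` by `key`, apply `func` to each group's values, return a dict."""
    groups = defaultdict(list)
    for row in rows:
        # Split: route each record's value to the group named by its key.
        groups[row[key]].append(row[value])
    # Apply `func` to every group, then combine the results into one mapping.
    return {name: func(values) for name, values in groups.items()}


# Hypothetical records: one row per county, with a turnout percentage.
rows = [
    {"state": "PA", "turnout_pct": 60},
    {"state": "PA", "turnout_pct": 70},
    {"state": "OH", "turnout_pct": 50},
]
mean_turnout = split_apply_combine(
    rows, "state", "turnout_pct", lambda values: sum(values) / len(values)
)
print(mean_turnout)
```

Because each group is processed independently of the others, the apply step could in principle run on separate workers, which is the connection to the MapReduce readings in this list.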

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity. “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP and machine learning; Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial (https://cran.r-project.org/web/views/Spatial.html), Spatiotemporal (https://cran.r-project.org/web/views/SpatioTemporal.html), Time Series (https://cran.r-project.org/web/views/TimeSeries.html)

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

21

Social Data Channels

Language Text Speech Audio

bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3

bull InfoRetrieval

bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP

bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan

mps028

bull Quinn-Topics

bull FightinWords

bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http

linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8

bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo

bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml

bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml

bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam

pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI

bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11

bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11

Vision Image Video

bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook

bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo

bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution

bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww

morganclaypoolcomtocivm11

bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom

toccov11

22

Approaches to Learning from Data (The Analytics Layer)

bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo

bull MMDS (data-mining)

bull CausalInference

bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21

bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen

econometrics

bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu

~garethISLindexhtml

bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer

bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http

linkspringercombook1010072F978-0-387-92407-6

bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007

2F978-1-4614-2299-0

bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer

bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038

nature14539)

See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources

bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook

101007978-981-10-2534-1

23

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts

Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others

Disability Accomodation

Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)

In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients' cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 26-7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also: Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401:788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6; †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910; ‡Scala site: https://www.scala-lang.org/ (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847; ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf; ‡Julia site: https://julialang.org/

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415; †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html; †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html; Haskell site: https://www.haskell.org/

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1; †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848; †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html; ‡Clojure site: https://clojure.org/

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695; ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730; ‡TensorFlow site: https://www.tensorflow.org/

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly; Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O; ‡H2O site: https://www.h2o.ai/

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network, Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387; †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David Mackay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/ (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io/; François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Page 18: SoDA 501: Approaches and Issues in Big Social Data Spring ... · spectives"section, excludingBusiness-BigData. Further reference: Section \Big Data & Social Data Analytics." January

bull Dagger[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi 2003 ldquoLaplacian eigenmaps for di-mensionality reduction and data representationrdquo Neural Computation 15(6) 13731396 http

webcseohio-stateedu~belkin8papersLEM_NC_03pdf

bull DeepLearning Ch14 (ldquoAutoencodersrdquo)

bull Dagger[word2vec] Tomas Mikolov Kai Chen Greg Corrado Jeffrey Dean 2013 ldquoEfficient Estimation ofWord Representations in Vector Spacerdquo httpsarxivorgabs13013781

bull Dagger[word2vecExplained] Yoav Goldberg and Omer Levy 2014 ldquoword2vec Explained DerivingMikolov et als Negative-Sampling Word-Embedding Methodrdquo httpsarxivorgabs14023722

bull Dagger[t-SNE] Laurens van der Maaten and Geoffrey Hinton 2008 ldquoVisualising Data using t-SNErdquo Jour-nal of Machine Learning Research 9 2579-2605 httpjmlrcsailmitedupapersvolume9

vandermaaten08avandermaaten08apdf

Computation and Scaling Up

Scientific computing computation at scale

bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo

bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo

bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https

peopledukeedu~ccc14sta-663indexhtml

bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml

Numerical computing

bull [NumericalComputing] Ward Cheney and David Kincaid 2013 Numerical Mathematics andComputing 7th ed BrooksCole Cengage Learning

bull CRAN Task View Numerical Mathematicshttpscranr-projectorgwebviewsNumericalMathematicshtml

Optimization (eg MLE gradient descent stochastic gradient descent EM algorithm neural nets)

bull DSHandbook-Py Ch 23 ldquoMaximum Likelihood Estimation and Optimizationrdquo (gradient descent)

bull DeepLearning Ch 4 ldquoNumerical Computationrdquo Ch 6 ldquoDeep Feedforward Networksrdquo Ch 8ldquoOptimization for Training Deep Modelsrdquo

bull CRAN Task View Optimization and Mathematical Programminghttpscranr-projectorgwebviewsOptimizationhtml

Linear algebra matrix computations

bull [MatrixComputations] Gene H Golub and Charles F Van Loan 2013 Matrix Computations 4thed Johns Hopkins University Press

bull Dagger[NetflixMatrix] Yehuda Koren Robert Bell and Chris Volinsky 2009 ldquoMatrix FactorizationTechniques for Recommender Systemsrdquo Computer August 42-9 httpsdatajobscomdata-

science-repoRecommender-Systems-[Netflix]pdf

bull Dagger[BigDataPCA] Jianqing Fan Qiang Sun Wen-Xin Zhou Ziwei Zhu ldquoPrincipal Component Anal-ysis for Big Datardquo httpwwwprincetonedu~ziweizpcapdf

18

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi, et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


24

Background Knowledge

Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others

With regard to specialization students are expected to have advanced (graduate) training in ONEof the following

bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR

bull statistics (as would be the case for a second-year PhD student in Statistics) OR

bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR

bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)

This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor

With regard to general preparation students are expected to have ALL of the following technicalknowledge

bull basic programming skills including basic facility with R ampor Python AND

bull basic knowledge of relational databases ampor geographic information systems AND

bull basic knowledge of probability applied statistics ampor social science research design AND

bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)

It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program

To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)

25

Page 20: SoDA 501: Approaches and Issues in Big Social Data Spring ... · spectives"section, excludingBusiness-BigData. Further reference: Section \Big Data & Social Data Analytics." January

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai


Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP and machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views. Spatial: https://cran.r-project.org/web/views/Spatial.html Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641


Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


Page 21: SoDA 501: Approaches and Issues in Big Social Data Spring ... · spectives"section, excludingBusiness-BigData. Further reference: Section \Big Data & Social Data Analytics." January

Social Data Structures

Space and Time

bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094

bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford

bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007

2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016

bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16

bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo

bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998

bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill

bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings

bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution

bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran

r-projectorgwebviewsTimeSerieshtml

Network Graphs

bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome

kleinbernetworks-book

bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo

bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https

link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8

bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook

1010072F978-3-319-53004-8

Hierarchy Clustered Data Aggregation Mixed Models

bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457

bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction

docID=10748641

21

Social Data Channels

Language Text Speech Audio

bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3

bull InfoRetrieval

bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP

bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan

mps028

bull Quinn-Topics

bull FightinWords

bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http

linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8

bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo

bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml

bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml

bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam

pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI

bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11

bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11

Vision Image Video

bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook

bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo

bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution

bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww

morganclaypoolcomtocivm11

bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom

toccov11

22

Approaches to Learning from Data (The Analytics Layer)

bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo

bull MMDS (data-mining)

bull CausalInference

bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21

bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen

econometrics

bull Dagger[StatisticalLearning] Gareth James Daniela Witten Trevor Hastie and Robert Tibshirani 2013An Introduction to Statistical Learning with Applications in R Springer httpwww-bcfuscedu

~garethISLindexhtml

bull [MachineLearning] Christopher Bishop 2006 Pattern Recognition and Machine Learning Springer

bull dagger[Bayes] Peter D Hoff 2009 A First Course in Bayesian Statistical Methods Springer http

linkspringercombook1010072F978-0-387-92407-6

bull dagger[GraphicalModels] Soslashren Hoslashjsgaard David Edwards Steffen Lauritzen 2012 Graphical Mod-els with R Springer httpslink-springer-comezaccesslibrariespsuedubook101007

2F978-1-4614-2299-0

bull [InformationTheory] David Mackay 2003 Information Theory Inference and Learning AlgorithmsSpringer

bull Dagger[DeepLearning] Ian Goodfellow Yoshua Bengio and Aaron Courville 2016 Deep Learning MITPress httpwwwdeeplearningbookorg (see also Yann LeCun Yoshua Bengio and Geof-frey Hinton 2015 ldquoDeep Learningrdquo Nature 521(7553) 436ndash444 httpsdoiorg101038

nature14539)

See also Dagger[Keras] Keras The Python Deep Learning Library httpskerasio Francoios CholletDirectory of Keras tutorials httpsgithubcomfcholletkeras-resources

bull dagger[SignalProcessing] Jose Maria Giron-Sierra 2013 Digital Signal Processing with MATLAB Exam-ples (Volumes 1-3) Springer httpslink-springer-comezaccesslibrariespsuedubook

101007978-981-10-2534-1

23

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open honest and responsible manner Academicintegrity is a basic guiding principle for all academic activity at The Pennsylvania State University and allmembers of the University community are expected to act in accordance with this principle Consistent withthis expectation the Universitys Code of Conduct states that all students should act with personal integrityrespect other students dignity rights and property and help create and maintain an environment in whichall can succeed through the fruits of their efforts

Academic integrity includes a commitment by all members of the University community not to engage in ortolerate acts of falsification misrepresentation or deception Such acts of dishonesty violate the fundamentalethical principles of the University community and compromise the worth of work completed by others

Disability Accomodation

Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)

In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation

bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395

bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400

bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


Social Data Channels

Language Text Speech Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf, https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision Image Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization, http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library, https://keras.io. François Chollet, Directory of Keras tutorials, https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.
