jonathan-stray
7/29/2019 Computational Journalism at Columbia, Fall 2013: Lecture 1, Basics
Frontiers of Computational Journalism
Columbia Journalism School
Week 1: Basics
September 4, 2013
Lecture 1: Basics
  Computer Science and Journalism
  Representing Data
  Interpreting High Dimensional Data
Computational Journalism: Definitions
Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking.
- Cohen, Hamilton, Turner, Computational Journalism
Computational Journalism: Definitions
Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents.
- Cohen, Hamilton, Turner, Computational Journalism
Cohen et al. model
[Diagram labels: Data, Reporting, User, Computer Science]
CS for presentation / interaction
[Diagram labels: Data, Reporting, User, CS, CS]
Filter many stories for user
[Diagram: three Data, Reporting, CS chains, a Filtering stage, and the User, with additional CS labels]
Examples of filters
  What an editor puts on the front page
  Google News
  Reddit's comment system
  Twitter
  Facebook news feed
  Techmeme
Memetracker, by Leskovec, Backstrom, Kleinberg
Kony 2012 early network, by Gilad Lotan / SocialFlow
Track effects
[Diagram: three Data, Reporting, CS chains, a Filtering stage, the User, and an Effects stage, with additional CS labels]
Computer Science in Journalism
  Reporting
  Presentation
  Filtering
  Tracking
Computational Journalism: Definitions
the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government.
- Jonathan Stray, A Computational Journalism Reading List
Course Structure
  Information retrieval: TF-IDF, search engines
  Text analysis: clustering and topic modeling
  Information filtering systems
  Social network analysis
  Knowledge representation
  Drawing conclusions from data
  Information Security
  Tracking flow and effects
[Diagram labels: Natural Language Processing, Data Science, Sociology, Artificial Intelligence, Cognitive Science, Statistics, Graph Theory, Clustering, Text Analysis, Filter Design, Social Network Analysis, Knowledge Representation, Drawing Conclusions, Information Retrieval]
Administration
  Assignment after each class
  Four assignments require programming, but your writing counts for more than your code!
  Course blog
    http://jmsc.hku.hk/courses/jmsc6041spring2013/
  Final project
    to be completed Feb-April
Lecture 1: Basics
  Computer Science and Journalism
  Representing Data
  Interpreting High Dimensional Data
Definition of data
a collection of similar pieces of information
structured data
unstructured data
Vector representation of objects
Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.
\[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \]
Each x_i is a numerical or categorical feature
N = number of features, or "dimension"
\[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \]
Examples of features
  number of claws
  latitude
  color {red, yellow, blue}
  number of break-ins
  1 for bought X, 0 for did not buy X
  time, duration, etc.
  number of times word Y appears in document
  votes cast
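As a rough illustration of how features like these can be packed into a single vector, the sketch below encodes one record as a list of numbers. The record fields (`claws`, `latitude`, `color`, `bought_x`) and the choice of one-hot encoding for the categorical `color` feature are assumptions made for this example; the slides do not prescribe an encoding.

```python
# Hypothetical example: turn one record into a numeric feature vector.
COLORS = ["red", "yellow", "blue"]  # categorical feature with three values


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1.0 if value == c else 0.0 for c in categories]


def to_vector(record):
    """Concatenate the numeric features and the one-hot encoded color."""
    numeric = [
        float(record["claws"]),              # a count
        float(record["latitude"]),           # a continuous value
        1.0 if record["bought_x"] else 0.0,  # 1 for bought X, 0 otherwise
    ]
    return numeric + one_hot(record["color"], COLORS)


record = {"claws": 4, "latitude": 40.8, "color": "yellow", "bought_x": True}
vector = to_vector(record)
print(vector)  # [4.0, 40.8, 1.0, 0.0, 1.0, 0.0]
```

Here N = 6: three numeric entries plus one indicator per color.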
Feature selection
Technical meaning in machine learning etc.: which variables matter?
We're journalists, so we're interested in an earlier process: how to describe the world in numbers?
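Choosing Features
Journalism: How do we represent the world numerically?
\[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \]
Machine learning: Which variables carry the most information?
\[ \begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix} \quad \text{where } k \ll N \]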
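The machine learning side of this picture, picking k of the N features, might be sketched as follows. This is only an illustration: it uses per-feature variance as a crude stand-in for "carries the most information", which is an assumption of this example and not a method named on the slide. The function name `select_features` and the toy data are likewise hypothetical.

```python
from statistics import pvariance


def select_features(rows, k):
    """Return the indices f(1)..f(k) of the k highest-variance columns,
    together with the rows reduced to those columns."""
    n_features = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
    # Rank column indices by variance, largest first, and keep the top k.
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    chosen = sorted(ranked[:k])
    reduced = [[row[j] for j in chosen] for row in rows]
    return chosen, reduced


# Toy data: column 0 is constant, column 3 varies the most.
rows = [
    [1.0, 5.0, 0.0, 10.0],
    [1.0, 6.0, 0.0, 20.0],
    [1.0, 5.0, 0.0, 30.0],
    [1.0, 6.0, 1.0, 40.0],
]
chosen, reduced = select_features(rows, k=2)
print(chosen)  # [1, 3]
```

Variance depends on the units of each feature, so in practice features are often rescaled first, or a different criterion is used altogether.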
Different types of quantitative
Numeric
  continuous
  countable
  bounded?
  units of measurement?
Categorical
  finite, e.g. {on, off}
  infinite, e.g. {red, yellow, blue, ... chartreuse}
  ordered?
  equivalence classes or other structure?
Different types of scales
Temperature
  Continuous scale, fixed zero point, physical units, comparative, uniform
Likert Scale
  Discrete scale, no fixed origin, abstract units, comparative, non-uniform
Likert scales are non-uniform
No averages on a non-uniform scale
It's not linear, so is 2X1 twice as good?
(X1+c)(X2+c)X1X2
Lots of things don't make much sense, such as
sum(X1 ... XN) / N = ?
Average is not well defined! (Nor stddev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
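A small sketch of why the average is shaky on a non-uniform scale while rank order statistics hold up. The two response groups and the alternative numeric coding `recode` are invented for illustration; the point is only that any coding preserving the order of the answers is equally defensible on such a scale.

```python
from statistics import mean, median

# Hypothetical Likert responses (1..5) from two groups.
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 2, 5, 5]

# Hypothetical recoding: same order of answers, different spacing between them.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
group_a_recoded = [recode[x] for x in group_a]
group_b_recoded = [recode[x] for x in group_b]


def ranks_higher(stat, a, b):
    """Return True if the statistic puts group a above group b."""
    return stat(a) > stat(b)


# The mean comparison flips under the recoding...
mean_before = ranks_higher(mean, group_a, group_b)                    # 3 vs 2.8
mean_after = ranks_higher(mean, group_a_recoded, group_b_recoded)     # 3 vs 4.8
# ...while the median comparison does not.
median_before = ranks_higher(median, group_a, group_b)                # 3 vs 2
median_after = ranks_higher(median, group_a_recoded, group_b_recoded) # 3 vs 2

print(mean_before, mean_after)      # True False
print(median_before, median_after)  # True True
```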
Other issues with quantitative
Where did the data come from?
  physical measurement
  computer logging
  human recording
What are the sources of error?
  measurement error
  missing data
  ambiguity in human classification
  process errors
  intentional bias / deception
Even with all these caveats, the vector representation is incredibly flexible and powerful.
\[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} \]
Examples of vector representations
Obvious
  movies watched / items purchased
  Legislative voting history for a politician
  crime locations
Less obvious, but standard
  document vector space model
  psychological survey results
Tricky research problem: disparate field types
  Corporate filing document
  Wikileaks SIGACT
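To make the "document vector space model" entry concrete, here is a minimal sketch that turns each document into a vector of word counts over a shared vocabulary. The two example sentences and the whitespace tokenizer are assumptions for illustration; real systems usually add weighting such as TF-IDF, which is not shown here.

```python
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def doc_vector(text, vocabulary):
    """Count how many times each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical two-document corpus.
docs = [
    "the council voted on the budget",
    "the budget vote was delayed",
]

# One dimension per distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})
vectors = [doc_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
# ['budget', 'council', 'delayed', 'on', 'the', 'vote', 'voted', 'was']
print(vectors[0])  # [1, 1, 0, 1, 2, 0, 1, 0]
print(vectors[1])  # [1, 0, 1, 0, 1, 1, 0, 1]
```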
What can we do with vectors?
Predict one variable based on others
  this is called "regression"
  supervised machine learning
Group similar items together
  This is classification or clustering
  We may or may not know pre-existing classes
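The first of these two tasks, predicting one variable from another, can be sketched with ordinary least squares on a single feature. The data points below are made up so that they lie exactly on a line, and `fit_line` is a hypothetical helper written for this example, not a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # prediction for an unseen x = 5

print(slope, intercept, predicted)  # 2.0 1.0 11.0
```

With more than one input feature the same idea applies, but the fit is usually done with a linear algebra routine instead of these closed-form sums.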
Interpreting High Dimensional Data
UK House of Lords voting record, 2000-2012.
N = 1043 votes by M = 1630 lords
2 = aye, 4 = nay, -9 = didn't vote
Vote vectors
Let v(i,j) = vote of MP i on issue j. Then we can look at all votes for a particular MP:
\[ mp_i = \begin{bmatrix} v(i,0) & v(i,1) & \cdots & v(i,N) \end{bmatrix} \]
Now we have 1630 vectors, one per member, each of dimension 1043.
What could we learn from this? What is their structure?
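A minimal sketch of building such vote vectors from raw codes, assuming the coding 2 = aye, 4 = nay, -9 = didn't vote. The tiny `raw_votes` table is invented, and recoding to +1 / -1 / 0 before comparing members is an assumption of this example, not a step stated on the slides.

```python
# Assumed recoding for analysis: aye -> +1, nay -> -1, didn't vote -> 0.
RECODE = {2: 1, 4: -1, -9: 0}

# Hypothetical raw table: one row per member, one column per vote.
raw_votes = [
    [2, 4, 2, -9],   # member 0
    [2, 4, 4, 2],    # member 1
    [4, 2, -9, 4],   # member 2
]


def vote_vector(i):
    """Return the recoded vote vector for member i (one entry per vote)."""
    return [RECODE[code] for code in raw_votes[i]]


def agreement(u, v):
    """Dot product: +1 per matching vote, -1 per opposing vote,
    0 for any vote where either member did not vote."""
    return sum(a * b for a, b in zip(u, v))


vectors = [vote_vector(i) for i in range(len(raw_votes))]

print(vectors[0])                         # [1, -1, 1, 0]
print(agreement(vectors[0], vectors[1]))  # 1
print(agreement(vectors[0], vectors[2]))  # -2
```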
Visualizing High Dimensional Data
We can visualize 3 dimensions at a time.
What do we do with 1043?
Looking at all MPs for votes 100, 200, 300
Dimensionality reduction
Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional.
We have to go from
\[ x \in \mathbb{R}^N \]
to much lower dimensional points
\[ y \in \mathbb{R}^K \]
This is called "projection"
Projection from 3 to 2 dimensions
Think of this as rotating to align the "screen" with coordinate axes, then simply throwing out values of higher dimensions.
Projection from 3 to 2 dimensions
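The rotate-then-discard description can be sketched for the 3-to-2 case as below. Rotating about the y-axis and dropping the z coordinate is one arbitrary choice of "screen" made for this example; the function names and sample points are hypothetical.

```python
import math


def rotate_about_y(point, theta):
    """Rotate a 3D point by angle theta (radians) about the y-axis."""
    x, y, z = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)


def project(point, theta):
    """Rotate so the screen lines up with the x-y plane, then drop z."""
    x, y, _ = rotate_about_y(point, theta)
    return (x, y)


# Two points that differ only along z.
p = (0.0, 0.0, 0.0)
q = (0.0, 0.0, 5.0)

# Viewed along z (theta = 0) the two points land on the same screen position.
front = [project(p, 0.0), project(q, 0.0)]

# Viewed after a quarter turn they are about 5 units apart on the screen.
side = [project(p, math.pi / 2), project(q, math.pi / 2)]

print(front)
print(side)
```

The same pair of points is indistinguishable from one viewing angle and clearly separated from another, so the choice of angle decides what structure survives the projection.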
Direction of projection matters!
Which direction should we look from?
Intuition: find a direction that "spreads out" points.
House of Lords PCA analysis
Principal Components Analysis finds the directions of maximum variance. Here, we're plotting the two dimensions of greatest variance.
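One common way to compute such a two-dimensional PCA view is through the singular value decomposition of the mean-centered data matrix. The sketch below uses NumPy on synthetic data standing in for a members-by-votes matrix; the data, the two "blocs", and the helper name `pca_2d` are all assumptions for illustration and do not reproduce the House of Lords analysis.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the two directions of greatest variance."""
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:2]
    return centered @ components.T, components, singular_values


rng = np.random.default_rng(0)

# Synthetic stand-in: 200 "members" by 10 "votes", with two blocs that
# differ strongly on the first three votes and only by noise elsewhere.
bloc = rng.integers(0, 2, size=200) * 2 - 1          # -1 or +1 per member
X = rng.normal(scale=0.3, size=(200, 10))
X[:, :3] += bloc[:, None] * 2.0

Y, components, singular_values = pca_2d(X)

print(Y.shape)                           # (200, 2)
print(Y[:, 0].var() >= Y[:, 1].var())    # True
```

In this synthetic case the first coordinate of `Y` separates the two blocs, because the bloc difference is by far the largest source of variance.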
Interpretation requires context
  Conservative and Liberal Democrats really do vote together, mostly
  Cross-benchers and bishops in the middle
  Labour opposite