Frontiers of Computational Journalism
Columbia Journalism School
Week 9: Social Network Analysis
November 20, 2015
Network
A set of people and a set of connections between pairs of them.
Types of connections
Social network analysis: only one type of connection between individuals (e.g. "friend")
Link analysis: multiple types of connections
• friend
• brother
• employer
• went to university with
• sold a car to
• owns 51% of
Link analysis is much more relevant to journalism, because it allows representation of much more detail and context.
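The difference between the two representations can be sketched in a few lines of Python. This is only an illustration: the people, the company, and the helper names (build_link_graph, collapse_to_social_network) are invented here and do not come from the lecture.

```python
from collections import defaultdict

# Hypothetical example data: (source, relation, target) triples.
LINKS = [
    ("Alice", "friend", "Bob"),
    ("Alice", "works for", "Acme Corp"),
    ("Bob", "owns 51% of", "Acme Corp"),
    ("Bob", "sold a car to", "Carol"),
]

def build_link_graph(links):
    """Link analysis view: keep the relation type on every connection."""
    graph = defaultdict(list)
    for source, relation, target in links:
        graph[source].append((relation, target))
    return graph

def collapse_to_social_network(links):
    """Social network view: drop the relation type, keep plain undirected edges."""
    return {frozenset((source, target)) for source, _, target in links}

graph = build_link_graph(LINKS)
edges = collapse_to_social_network(LINKS)

print(graph["Bob"])   # typed links leaving Bob
print(len(edges))     # number of untyped edges
```

The collapsed version can no longer distinguish ownership from friendship, which is the detail and context that link analysis keeps.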
People Act in Groups
Family and friendships: I am most closely connected to a small set of people, who are usually closely connected to each other.
Business: I am much more likely to do business with people I already know.
Influence: I listen to people I know more than I listen to strangers.
Norms: what is right depends on what the people around me think.
Homophily
People tend to marry, do business with, spend time with, etc. people from similar backgrounds... and people who have social ties tend to be similar.
Homophily is the principle that contact between similar people occurs at a higher rate than among dissimilar people. The pervasive fact of homophily means that cultural, behavioral, genetic, or material information that flows through networks will tend to be localized. Homophily implies that distance in terms of social characteristics translates into network distance, the number of relationships through which a piece of information must travel to connect two individuals.
- McPherson, Smith-Lovin, Cook, Birds of a Feather: Homophily in Social Networks
Structure Relates to Behavior
In a 1951 experiment, researchers had five people work together, only allowed to communicate according to one of the patterns above. They were each given a card with several symbols on it. The task was to determine which symbol was in common between all of the cards. It was repeated many times.
How did the groups organize themselves? Which patterns were fastest?
From H. Leavitt, Some effects of certain communication patterns on group performance, Journal of Abnormal Psychology 46(1)
Correlation of different types of info
Suppose you have a record of phone numbers called, a database of political campaign donations, and a list of government appointees. Put them together, and you have this story:
WASHINGTON — Time and again, Texas Gov. Rick Perry picked up his office phone in the months before he would announce his bid for the presidency. He dialed wealthy friends who were his big fundraisers and state officials who owed him for their jobs.
Perry also met with a Texas executive who would later co-found an independent political committee that has promised to raise millions to support Perry but is prohibited from coordinating its activities with the governor.
- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
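As a rough sketch of what "putting them together" can mean in practice, the Python below joins three small record sets on a person's name. Every name, number, and amount is invented for illustration, and the helpers annotate_calls and of_interest are hypothetical; real records rarely match this cleanly on names.

```python
# All names, numbers, and amounts below are invented for illustration.
calls = [
    {"number": "555-0101", "name": "Pat Donor"},
    {"number": "555-0102", "name": "Sam Official"},
    {"number": "555-0103", "name": "Lee Stranger"},
]
donations = {"Pat Donor": 250000}            # name -> total donated
appointees = {"Sam Official": "State Board"}  # name -> appointed position

def annotate_calls(calls, donations, appointees):
    """Attach donation and appointment context to each call record."""
    annotated = []
    for call in calls:
        name = call["name"]
        annotated.append({
            "name": name,
            "donated": donations.get(name, 0),
            "appointed_to": appointees.get(name),
        })
    return annotated

def of_interest(annotated):
    """Keep only calls to people who donated or hold an appointment."""
    return [row for row in annotated if row["donated"] or row["appointed_to"]]

leads = of_interest(annotate_calls(calls, donations, appointees))
for row in leads:
    print(row)
```

The output is a list of leads to report out, not a finding in itself.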
Social Network Analysis in Journalism
• Identify people or communities
• Track money and criminal networks
• Understand spread of information and behavior
• Illustrate complex stories
Useful in all areas where CS intersects journalism! (Reporting, communication, filtering, effect tracking)
Two major analysis methods
…after you have the network data, which may be a very manual process.
• Look at a visualization
• Apply algorithm
In both cases, the results are not interpretable without context!
Force-Directed Layout
Each edge is a "spring" with a fixed preferred length. Plus a global repulsive force that pushes all nodes apart.
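A minimal sketch of this idea in plain Python follows. The force constants, the step clamp, and the function name force_directed_layout are assumptions chosen for illustration; production layout tools use more refined force models and cooling schedules.

```python
import math
import random

def force_directed_layout(nodes, edges, spring_length=1.0, spring_k=0.1,
                          repulsion=0.05, max_step=0.1, steps=500, seed=0):
    """Return {node: [x, y]} after simulating springs plus global repulsion."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}

    def apply(force, a, b, magnitude):
        # Positive magnitude pulls a and b together, negative pushes them apart.
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        dist = math.hypot(dx, dy) or 1e-9
        f = magnitude(dist)
        force[a][0] += f * dx / dist
        force[a][1] += f * dy / dist
        force[b][0] -= f * dx / dist
        force[b][1] -= f * dy / dist

    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Global repulsion between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                apply(force, a, b, lambda d: -repulsion / (d * d))
        # Each edge is a spring with a preferred length.
        for a, b in edges:
            apply(force, a, b, lambda d: spring_k * (d - spring_length))
        # Move each node, clamping the step so the simulation stays stable.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = min(1.0, max_step / size) if size > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return pos

def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# Hypothetical three-node chain: a - b - c
layout = force_directed_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(distance(layout, "a", "b"), distance(layout, "a", "c"))
```

Connected nodes end up near the preferred spring length, while unconnected nodes drift farther apart; the exact coordinates depend on the random start, which is one reason the same network can look quite different from one drawing to the next.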
From The Effect of Graph Layout on Inference from Social Network Data, Blythe et al.
We asked respondents three questions about the same five focal nodes in each sociogram:
1) how many subgroups were in the sociogram
2) how “prominent” was each player in the sociogram
3) how important a “bridging” role did each player occupy in the sociogram
Centrality
Often identified with "influence" or "power." Often important in journalism.
We can visualize the graph and use our eyes, or we can compute centrality values algorithmically.
Degree centrality: number of edges
Models: cases where the number of connections is important. Example: which celebrity can reach the most people at once?
Closeness centrality: average distance to all other nodes
Models: cases where time taken to reach a node is important. Example: who finds out about gossip first?
Betweenness centrality: number of shortest paths that pass through node
Models: cases where control over transmission is important. Example: who has the most power to make introductions?
Eigenvector centrality: how likely you are to end up at a node on a random walk (same idea as PageRank)
Models: cases where importance of neighbors is important. Example: the private adviser to the president
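The four measures above can be sketched in plain Python on a small invented graph. The graph, the function names, and the choice of power iteration on A + I for the eigenvector score are assumptions made for this illustration; the code assumes an undirected, connected graph with at least two nodes.

```python
from collections import deque

# Hypothetical example: a star around A with a tail D - E.
GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}

def degree_centrality(adj):
    return {v: len(nbrs) for v, nbrs in adj.items()}

def closeness(adj):
    """Average hop distance to all other nodes (lower means more central)."""
    result = {}
    for s in adj:
        dist, queue = {s: 0}, deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[s] = sum(dist.values()) / (len(dist) - 1)
    return result

def betweenness(adj):
    """Shortest paths through each node, fractional when paths tie (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist, queue = {s: 0}, deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Undirected graph: every pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I, which also settles on bipartite graphs."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        y = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        total = sum(y.values())
        x = {v: val / total for v, val in y.items()}
    return x

degree = degree_centrality(GRAPH)
close = closeness(GRAPH)
between = betweenness(GRAPH)
eigen = eigenvector_centrality(GRAPH)
print(degree, close, between, eigen, sep="\n")
```

In this toy graph all four measures agree that A is the most central node, but on real networks they frequently disagree, which is why the choice of measure has to follow from the question being asked.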
Journalism centrality: how important is this person to this story?
Who is "important"? What type of person do you want to identify in the network?
Often assumed we're after "influential." But sociology says "power" is a complicated thing and difficult to define and measure.
Network analysis has mostly ignored this problem. I know of no successful use of centrality metrics in journalism – maybe you'll be the first.
Finding Communities
No one definition of "community." Could mean a town, or a club, or an industry network.
But for our purposes, a community is "a group of people with pre-existing patterns of association."
In social network analysis, that translates into clusters in the graph.
Friends/followers
Co-consumption – Network of political book sales, Orgnet.com
Communications network – Exploring Enron, Jeffrey Heer
Web link structure – Map of Iranian Blogosphere, Berkman Center
Individual time/location trails – CitySense, Sense Networks
Warning: no network is ever "complete." Otherwise there would be 7 billion people in it.
Mathematical definitions of "cluster"
You've already seen several! If you can compute distance between any two items, you can cluster.
But in social networks, not everyone is connected to everyone else...
Modularity
Are there more intra-group edges than we would expect randomly?
Modularity
$n$ = number of vertices
$k_i$ = degree of vertex $i$
$A_{ij}$ = 1 if edge between $i, j$, 0 otherwise
$g_{ij}$ = 1 if $i, j$ in same group, 0 otherwise
There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph. If they go between random vertices, then the expected number of edges between $i, j$ is $k_i k_j / 2m$.
Modularity
$Q = \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$
If $Q > 0$ then there are "excess" edges inside the groups (and fewer edges between them).
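To make the formula concrete, here is a small Python sketch that evaluates Q exactly as written on this slide, summing over all ordered pairs (i, j). The example graph (two triangles joined by one edge) and the function name modularity are invented for illustration. Many references divide this sum by 2m to normalize it; that factor is left out here to match the slide.

```python
from itertools import product

def modularity(edges, group):
    """Q = sum over ordered pairs (i, j) of (A_ij - k_i * k_j / 2m) * g_ij."""
    adjacent = set()
    degree = {v: 0 for v in group}
    for a, b in edges:
        adjacent.add((a, b))
        adjacent.add((b, a))
        degree[a] += 1
        degree[b] += 1
    two_m = sum(degree.values())  # equals 2m, since m = (1/2) * sum of degrees
    q = 0.0
    for i, j in product(group, repeat=2):
        if group[i] == group[j]:  # g_ij = 1
            a_ij = 1.0 if (i, j) in adjacent else 0.0
            q += a_ij - degree[i] * degree[j] / two_m
    return q

# Hypothetical graph: two triangles (1, 2, 3) and (4, 5, 6) joined by edge 3 - 4.
EDGES = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

q_split = modularity(EDGES, {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"})
q_single = modularity(EDGES, {v: "x" for v in range(1, 7)})
print(q_split, q_single)
```

Splitting along the bridge gives a positive Q (5 here, or 5/14 after dividing by 2m = 14), while putting every node in one group gives Q = 0, since the observed and expected edge counts then cancel.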
Modularity algorithm
• Look for a division of nodes into two groups that maximizes Q
• Can find this through eigenvector technique
• Possible that no division has Q > 0, in which case the graph is a single community
• If a division with Q > 0 found, split
• Recursively split sub-graphs
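The first split can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the lecture's own code: it builds the matrix B with entries A_ij − k_i k_j / 2m, finds its leading eigenvector by shifted power iteration, and divides nodes by the sign of their entries. The function name, the shift, and the iteration count are choices made here, the graph is assumed to have at least one edge, and the recursive step on sub-graphs is omitted.

```python
import math

def leading_eigenvector_split(nodes, edges, iterations=1000):
    """Split nodes in two by the sign of the leading eigenvector of B.

    Returns [group1, group2], or [all nodes] when no split has Q > 0.
    """
    n = len(nodes)
    index = {v: i for i, v in enumerate(nodes)}
    A = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        A[index[a]][index[b]] = 1.0
        A[index[b]][index[a]] = 1.0
    k = [sum(row) for row in A]
    two_m = sum(k)
    B = [[A[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]

    # Shifting by a bound on |eigenvalue| makes every eigenvalue non-negative,
    # so power iteration converges to the most positive eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    x = [float(i + 1) for i in range(n)]  # deterministic, non-uniform start
    for _ in range(iterations):
        y = [sum(B[i][j] * x[j] for j in range(n)) + shift * x[i]
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]

    side = [v > 0 for v in x]
    q = sum(B[i][j] for i in range(n) for j in range(n) if side[i] == side[j])
    first = [v for v in nodes if side[index[v]]]
    second = [v for v in nodes if not side[index[v]]]
    if q <= 1e-9 or not first or not second:
        return [list(nodes)]  # no division with Q > 0: a single community
    return [first, second]

# Hypothetical graphs: two triangles joined by one edge, and a lone triangle.
two_triangles = leading_eigenvector_split(
    [1, 2, 3, 4, 5, 6],
    [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)],
)
one_triangle = leading_eigenvector_split([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
print(two_triangles, one_triangle)
```

On the two-triangle graph the split falls along the bridge edge, and the single triangle is reported as one community.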
The Hairball problem
Real social networks are big, with complex, overlapping communities in the central component. Modularity and other community detection algorithms give poor results.
K-core Decomposition
Find the nodes at the "center" of a network.
for k = 1 to maximum node degree
    repeat
        remove all nodes with degree < k
    until all remaining nodes have degree >= k
    set "core number" of remaining nodes to k
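The pseudocode above translates fairly directly into Python. The sketch below assumes the graph is given as a dictionary mapping each node to the set of its neighbors; the function name core_numbers and the example graph are invented for illustration.

```python
def core_numbers(adj):
    """Return {node: core number} by repeatedly peeling low-degree nodes."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    core = {v: 0 for v in adj}
    max_degree = max((len(nbrs) for nbrs in adj.values()), default=0)

    for k in range(1, max_degree + 1):
        # Repeat until all remaining nodes have degree >= k.
        while True:
            low = [v for v, nbrs in remaining.items() if len(nbrs) < k]
            if not low:
                break
            for v in low:
                for u in remaining.pop(v):
                    if u in remaining:
                        remaining[u].discard(v)
        if not remaining:
            break
        # Everything that survived this round is in the k-core.
        for v in remaining:
            core[v] = k
    return core

# Hypothetical graph: a triangle A, B, C with a pendant node D attached to C.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cores = core_numbers(GRAPH)
print(cores)
```

The triangle survives the k = 2 round and gets core number 2, while the pendant node is peeled off and keeps core number 1.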
K-core Decomposition
Carmi et al., A model of Internet topology using k-shell decomposition
Protest Dynamics on Twitter
González-Bailón et al., The Dynamics of Protest Recruitment through an Online Network
k-core number vs. maximum cascade size. Color = sent at least one tweet which reached this fraction of users (orange = reached all users)
Key insight: triangles not edges
Simmel's theory of sociology (early 20th C.) says the relationship between two people cannot be understood without context.
Idea: count shared triangles
1. For each node A and each of A's friends B, count the number of triangles involving A and B (= number of shared friends of A and B).
2. Rank A's friends (each B) by number of shared friends (number of C's for A, B) to create a "top friends" list for A.
3. Keep the edge between nodes A, D only if there is some threshold percentage overlap in their top friends lists.
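The three steps above can be sketched as follows. This is a simplified illustration, not the published backbone algorithm: the list length top_n, the threshold min_overlap, and the way overlap is measured (shared top friends divided by the shorter list's length) are all assumptions made here, and node labels are assumed to be sortable so that ties break deterministically.

```python
def simmelian_backbone(adj, top_n=3, min_overlap=0.5):
    """Keep only edges whose endpoints have overlapping "top friends" lists."""
    # Steps 1 and 2: rank each node's friends by number of shared friends.
    top = {}
    for a, friends in adj.items():
        ranked = sorted(friends, key=lambda b: (-len(adj[a] & adj[b]), b))
        top[a] = set(ranked[:top_n])

    # Step 3: keep an edge only if the two top lists overlap enough.
    kept = set()
    for a in adj:
        for b in adj[a]:
            shorter = min(len(top[a]), len(top[b]))
            overlap = len(top[a] & top[b]) / shorter if shorter else 0.0
            if overlap >= min_overlap:
                kept.add(frozenset((a, b)))
    return kept

def clique(members):
    """Helper for the example: everyone connected to everyone else."""
    return {v: set(members) - {v} for v in members}

# Hypothetical graph: two groups of four friends joined by a single edge 4 - 5.
GRAPH = {**clique([1, 2, 3, 4]), **clique([5, 6, 7, 8])}
GRAPH[4].add(5)
GRAPH[5].add(4)

backbone = simmelian_backbone(GRAPH)
print(len(backbone), frozenset((4, 5)) in backbone)
```

All twelve edges inside the two groups survive, while the bridge between them is dropped because nodes 4 and 5 share no top friends.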
Simmelian Backbones
SNA in journalism
• ICIJ Offshore Tax Haven leak
• ICIJ human tissue investigation
• Organized Crime and Corruption Reporting Project
• WSJ Galleon's Web insider trading story
• SCMP's Who Runs Hong Kong
• Muckety.com
The other challenge was the data itself. How to separate the extraordinary from the routine and find the public interest inside a maze of more than 37,000 offshore company holders? A first step was to build as many lists as possible of public figures: Politburo members, military commanders, mayors of large cities, billionaires listed in Forbes and Hurun’s rankings of the mega-wealthy and so-called princelings (relatives of the current leadership or former Communist Party elders).
Through painstaking database work, a reporter in Spain cross-referenced the lists of notable Chinese against the names of offshore clients listed within ICIJ’s Offshore Leaks data. The added difficulty was that in most cases, names in the offshore files were registered in Romanized form, not Chinese characters. This made making exact matches extremely hard, because Romanized spellings from Chinese characters tend to vary widely: Wang might be spelled Wong, Zhang could be Cheung, and Ye might be spelled Yeh. Addresses and ID numbers helped confirm many identities, but many other names were dropped because the reporting team could not be 100 percent sure that the person was a correct match.
A picture slowly began to emerge: China’s elites were aggressively using offshore havens to hold assets, list companies in the world’s stock exchanges, buy and sell real estate and conduct their business away from Beijing’s red tape and capital controls.
How We Did Offshore Leaks China, ICIJ
Analyzing the Data behind Skin and Bone, ICIJ
Who Runs HK? The Fight over Stanley Ho's Fortune, South China Morning Post, 2010
SNA that could be used in Journalism
• The Network of Global Corporate Control paper
• Network of campaign finance contributions (Super PACs)
• International financial system / HFT
• "Revolving door" / regulatory capture
• Political elite in any country
• Find audience for story, akin to targeted marketing
• ...
Vitali, Glattfelder, Battiston, The Network of Global Corporate Control
SNA in journalism
• Visualization widely used
• Link analysis successful in investigative reporting
• Most of the work required to do these types of stories is traditional research, not algorithmically-guided.
• I am not aware of successful application of centrality metrics or community detection algorithms.
• This may change as the graphs journalism examines get bigger...
• Would it be possible to use community detection to find the "right" audience for a story?