CoSSci High Performance Computing for Anthropology and the Social Sciences

Lukasz Lacinski (1), Presenter (ECSS, U Chicago)
Douglas White (2), Presenter and PI
Rachana Ananthakrishnan (1) (Future planning)
Tom Uram (3) (ECSS developer, year 1)
Tolga Oztan (2) (Section on DEf Modeling)
Bob Sinkovits (4) (Section on Cohesion Modeling)
Paul Rodriguez (4) (Section on Causal Modeling)
Nancy Wilkins-Diehr (4) (SDSC ECSS)

27 slides plus 2 live demos; 2+20 minutes and 10 min discussion

(1) University of Chicago
(2) University of California, Irvine
(3) Argonne National Laboratory
(4) San Diego Supercomputer Center

Page 2: CoSSci High Performance Computing for Anthropology and the Social Sciences

Outline: Lukasz
• Motivation
• Architecture
• Gateway tools
• Anthropology and Social Sciences
• Gateway to Galaxy
• Demo screencast
• LiveDemo: How to Share Histories
• LiveDemo: Complex Social Science Gateway
• New work for new modeling approaches

Page 3: CoSSci High Performance Computing for Anthropology and the Social Sciences

Motivation
Create a gateway to support researchers and students, without requiring them to understand the underlying computational resources or how to use them.
Research analysis performed with the gateway should be:
• Accessible – users can easily specify parameters and run tools
• Reproducible – input parameters and results are captured so that any user can repeat and understand any result as a complete computational analysis
• Transparent – users share and publish analyses through online histories and clips of variables for model storage and reproduction of results

Page 4: CoSSci High Performance Computing for Anthropology and the Social Sciences

Architecture (1)
• Web service
    Galaxy – scientific workflow, data integration, and data and analysis persistence and publishing platform
• Compute resources
  • 2 UCI virtual machines, one planned at Santa Fe Institute, each with 2 cores of Xeon CPU
  • XSEDE cluster – Trestles: 324 compute nodes, each with 4 AMD Magny-Cours CPUs (32 cores)

Page 5: CoSSci High Performance Computing for Anthropology and the Social Sciences

Architecture (2)

[Diagram labels]
End users
CoSSci Gateway: UCI VMs, 2 Xeon cores each
XSEDE Trestles: 324 compute nodes, 4 AMD Magny-Cours CPUs each

http://socscicompute.ss.uci.edu/

Page 6: CoSSci High Performance Computing for Anthropology and the Social Sciences

Gateway tools
• Can perform cross-cultural analysis on six different ethnological datasets with 2,657 variables to date:

SCCS – Standard Cross-Cultural Sample (n=186, v=2109)
LRB – Lewis R. Binford’s forager data (n=339, v=506)
WNAI – Western North American Indians (n=172, v=496)
XC – merged variables above from EA cases (n=371, v=2657)
EA – Ethnographic Atlas (n=1270, v=166)
AWC – Atlas of World Cultures (n=557, v=166)

• Use the Dow-Eff functions implemented in an R workspace. The functions estimate OLS, logit, and multinomial logit models, using multiple imputation to handle the problem of missing data, and network lag terms to handle Galton’s Problem.
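
The network lag term mentioned above can be illustrated with a small sketch. It is written in Python for illustration only and is not the Dow-Eff R code. It assumes, as in standard spatial-lag models, that the lag term is the product of a row-normalized proximity matrix W and the dependent variable y, so each society's lag value is a weighted average of its neighbors' values. The matrix and the y values below are made up.

```python
def row_normalize(weights):
    """Scale each row of a proximity matrix so it sums to 1 (all-zero rows stay zero)."""
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total else 0.0 for w in row])
    return normalized


def network_lag(weights, y):
    """Return Wy: for each case, the weighted average of the other cases' y values."""
    w_norm = row_normalize(weights)
    return [sum(w * y_j for w, y_j in zip(row, y)) for row in w_norm]


# Hypothetical proximity matrix for three societies (zero diagonal):
# society 0 is linked to societies 1 and 2; societies 1 and 2 are linked only to 0.
W = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
y = [1.0, 3.0, 5.0]  # made-up values of a dependent variable

wy = network_lag(W, y)  # society 0 receives the mean of its two neighbors
```

In a regression, a term like `wy` would enter as an additional predictor so that similarity among linked societies is modeled explicitly instead of being treated as independent evidence.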

Page 7: CoSSci High Performance Computing for Anthropology and the Social Sciences

Future work, through fall 2014
• Use the mkmapping package to generate world maps with convex hulls for autocorrelation clusters.
• Improve color maps generated by the rworldmap package. Reduce the ordinal categories to a maximum of 9 values with 9 corresponding node colors.
• Add scaling to support the fv4scale and mkscale functions implemented in the DEf01d R workspace.
• Extend the information printed to output CSV files.

Live demo by Lukasz can begin here

Page 8: CoSSci High Performance Computing for Anthropology and the Social Sciences

[Gateway screenshot: instructional YouTube videos and options for VM and Galaxy modeling]

Page 9: CoSSci High Performance Computing for Anthropology and the Social Sciences

[Gateway screenshot: windows for entering model variables and modeling histories]

Page 11: CoSSci High Performance Computing for Anthropology and the Social Sciences

Outline: Doug
• Anthropology and Social Sciences
• Examples of Current Modeling (DEf: Dow and Eff functions)
• Testing prior anthropological theories (e.g., Tolga Oztan)
  • (and the discovery process with new models, methods, manuals)
• Ongoing: Predictive cohesion in Complex Social Networks (Bob Sinkovits)
• Future work for new modeling approaches
• Cutting edge: Install and use libraries for causal models
  • Testing New Procedures: Bayesian Network
  • Finding (Causal) Network Structure
  • Comparing two (Moral Gods) models
  • Trestles bootstraps, Paul Rodriguez SDSC
• Does solving Galton’s Problem lead to different results?

Page 12: CoSSci High Performance Computing for Anthropology and the Social Sciences

Anthropology and Social Sciences
• CoSSci tools provide great advantages to observational sciences:

A solution to the problem of greatly inflated significance tests caused by clustering, evolutionary histories, and proximities of sample units, which produce 50% or more spurious results

Thus:
  • Identify the actual determinants of cross-cultural variation in beliefs and behaviors
  • Show how evolutionary and economic processes are deeply embedded in culture – new fields like roots of economic development
  • Link ecology to human cultural behaviors
  • Make adjustments appropriate to archaeological and ethnographic data
  • Understanding our human past is basic to understanding our future

• The databases analyzed by Dow-Eff functions, compensating for missing data and Galton’s Problem of nonindependence of cases, are essential in integrating physical and biological science understandings with anthropological, historical and social sciences.

Page 13: CoSSci High Performance Computing for Anthropology and the Social Sciences

Examples of Current Modeling
• Scores of models have been done as chapters for the Wiley Companion for Cross-Cultural Research and in the classroom, for which CoSSci manuals are now available

e.g., Regression Diagnostics

Weaknesses in this model are (1) Wald: some additional variables may be missing; (2) the error terms (residuals) are heteroskedastic and not normally distributed.

Page 14: CoSSci High Performance Computing for Anthropology and the Social Sciences

Testing prior anthropological theories
E.g., anthropologists have assumed that if a couple lives with the wife’s family, the WiMo (wife’s mother) is likely to be avoided; if with the husband’s family, the HuFa (husband’s father) is avoided, and so on, but there is no such evidence in any of our data. Nor is there evidence that avoidances of any sort arise from projection of incest taboos or variants of the Oedipus complex. Good samples and the correct statistical methods have been lacking.

Tolga Oztan, using DEf and our databases, shows the first evidence that avoidance behaviors involving kin predict broader networks of cooperative behavior through new in-laws and predict the expansion of political alliances and population sizes up to the appearance of formalized intercommunity government. The discovery here is that kin behavior is a key source of the development of cooperation in foragers, and probably in early human evolution. The data match Fred Eggan’s and Radcliffe-Brown’s descriptions of formalized kin avoidances as maintaining respectful distance rather than conflicts with in-laws.

Page 15: CoSSci High Performance Computing for Anthropology and the Social Sciences

DEf Autocorrelation regression shows evolutionary development of types of Avoidances

1 Variables 2 WiMo 3 WiFa 4 HuFa 5 HuMo 6 WBW 7 Sum Frequency

LoPopDen X X X X     4

Lo Hunting   X X X     3

JurisHier1 X X     -X   3

JurisHier2 X       -X   2

JH1 X JH2 -X       -X   2

NuclearFam X         2

LoPDSquared     X X     2

Bio.2     X X     2

Distance 0 0 80% 80% 80% 20% 3x80%

Language 65% 20% 0 0 0 0 1X65%

Ecology 35% 80% 20% 20% 20% 80% 1x80%

Sum 5 3 4 4 3 1 21/40

Cases:Av/total 25/60 7/35 14/50 3/33 7/13    

Frequency of shared predictors for different types of Avoidances

Page 16: CoSSci High Performance Computing for Anthropology and the Social Sciences

more
• Avoidance theory is supported by the SCCS and WNAI data.
• We take the evidence of eventual decline in the co-evolution of Avoidances and greater complexity to be due to the competition from other forms of integrative hierarchy with the expansion of political complexity.

• Matrilineality, which disperses matrilineage men, is also a predictor of avoidances and creates effective defense against raiding, again linking avoidances to extended kin networks.

• Avoidance relationships are not based on fear but on respect. Gift-giving following stability in a recent marriage often leads to cessation of Avoidance.

• All these features are key to understanding cooperativity in human societies, which operates through cohesiveness.

Page 17: CoSSci High Performance Computing for Anthropology and the Social Sciences

Social networks: Cohesiveness
• With Bob Sinkovits of SDSC, a second ECSS award aims to achieve new measurements for one of the most important and complex problems in network mathematics: finding large overlapping sets of nodes that are structurally cohesive. Cohesion can be measured either by multi-connectedness or by how many nodes must be removed to separate a cluster. Menger’s theorem proves that these two measures are precisely equivalent, a fundamental key to understanding cooperativity in networks.

• These larger-scale network models lend a high level of predictability to sets of network science measures, which are often loosely defined and imprecise. Menger-based methods provide the tools for understanding how the larger contexts of human societies and their multilevel organizational entities give the network embedding for other phenomena. (They provide a potential for transforming our understanding of how complex networks act dynamically in today's globally networked world.)

Page 18: CoSSci High Performance Computing for Anthropology and the Social Sciences

Menger’s Theorem in a nutshell

Number of vertex-disjoint paths between source S and target T (no two simultaneous paths share a vertex)
=
Minimum number of vertices that need to be removed so that source S and target T are no longer connected
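
The equality stated by Menger’s theorem can be checked on a small example. The Python sketch below is illustrative only and is not the project's code. It counts vertex-disjoint paths with a unit-capacity maximum flow on a node-split graph, and finds the minimum vertex cut by brute force. The five-vertex graph is made up, and S and T are assumed to be non-adjacent (otherwise no vertex cut exists).

```python
from collections import deque
from itertools import combinations


def vertex_disjoint_paths(adj, s, t):
    """Count vertex-disjoint s-t paths via max flow on the node-split graph."""
    cap = {}

    def add_arc(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # residual (reverse) arc

    for v in adj:
        # Each vertex becomes v_in -> v_out; capacity 1 forces paths to be disjoint.
        add_arc((v, "in"), (v, "out"), len(adj) if v in (s, t) else 1)
        for w in adj[v]:
            add_arc((v, "out"), (w, "in"), 1)

    source, sink, flow = (s, "in"), (t, "out"), 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # breadth-first search for an augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        v = sink
        while parent[v] is not None:  # push one unit of flow along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1


def still_connected(adj, s, t, removed):
    """True if t is reachable from s after deleting the vertices in `removed`."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen and w not in removed:
                seen.add(w)
                stack.append(w)
    return t in seen


def min_vertex_cut_size(adj, s, t):
    """Smallest number of vertices whose removal disconnects s from t (brute force)."""
    others = [v for v in adj if v not in (s, t)]
    for k in range(len(others) + 1):
        for removed in combinations(others, k):
            if not still_connected(adj, s, t, set(removed)):
                return k


# Made-up undirected graph; S and T are not adjacent.
graph = {
    "S": ["a", "b"],
    "a": ["S", "c", "T"],
    "b": ["S", "c"],
    "c": ["a", "b", "T"],
    "T": ["a", "c"],
}
n_paths = vertex_disjoint_paths(graph, "S", "T")
cut_size = min_vertex_cut_size(graph, "S", "T")
```

Here the two disjoint paths are S-a-T and S-b-c-T, and removing a and c separates S from T, so both quantities equal 2. The brute-force cut search is exponential and only suitable for toy graphs.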

Page 19: CoSSci High Performance Computing for Anthropology and the Social Sciences

Pair-wise cohesion matrix
• Element (i,j) of the pair-wise cohesion matrix (PCM) is the number of vertex-disjoint paths between vertices i and j
• Binarized PCM: if m_ij ≥ k, then m_ij = 1; otherwise m_ij = 0
• Treat the binarized PCM as a connectivity matrix; cliques are upper bounds on the k-components

Pair-wise cohesion matrix:

0 3 2 3 3
3 0 2 3 4
2 2 0 2 2
3 3 2 0 3
3 4 2 3 0

Binarized PCM (k = 3):

0 1 0 1 1
1 0 0 1 1
0 0 0 0 0
1 1 0 0 1
1 1 0 1 0

Vertices (1,2,4,5) form a candidate 3-component
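
The binarize-then-find-cliques step can be sketched in a few lines. The Python below is an illustration, not the project's implementation. It uses the 5x5 matrix from this slide with k = 3 and a basic Bron-Kerbosch search for maximal cliques; the function names are hypothetical.

```python
def binarize(pcm, k):
    """Set entry (i, j) to 1 when the pair-wise cohesion is at least k, else 0."""
    return [
        [1 if i != j and value >= k else 0 for j, value in enumerate(row)]
        for i, row in enumerate(pcm)
    ]


def maximal_cliques(binary):
    """Basic Bron-Kerbosch enumeration of maximal cliques with at least two vertices."""
    n = len(binary)
    neighbors = [{j for j in range(n) if binary[i][j]} for i in range(n)]
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) > 1:
                found.append(sorted(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & neighbors[v], excluded & neighbors[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return found


# Pair-wise cohesion matrix copied from the slide.
pcm = [
    [0, 3, 2, 3, 3],
    [3, 0, 2, 3, 4],
    [2, 2, 0, 2, 2],
    [3, 3, 2, 0, 3],
    [3, 4, 2, 3, 0],
]
binary = binarize(pcm, k=3)
# Report vertices with the slide's 1-based numbering.
candidates = [[v + 1 for v in clique] for clique in maximal_cliques(binary)]
```

For this matrix the only clique is the vertex set (1,2,4,5), matching the candidate 3-component named on the slide. A clique is only an upper bound on a k-component, so a candidate may still contain vertices that must be rejected later.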

Page 20: CoSSci High Performance Computing for Anthropology and the Social Sciences

Tackling the co-author data set

Co-author data set obtained from sociology journals 1963-99 (vertices are authors, edges connect co-authors). 128,151 authors reduced to 29,462 by focusing on the largest bi-connected component.

[Figure labels: 128,151; 20,181 disjoint clusters w/ 2-48 vertices; 68,285; 29,462; 25,822 biconnected clusters w/ 2-36 vertices]

Page 21: CoSSci High Performance Computing for Anthropology and the Social Sciences

Constructing the PCM
Constructing the PCM is a lot of work. The effort can be reduced by a factor of more than 10 by using some clever techniques to fill in the 2s and 3s:
• Use 2-vertex separators to find 2s
• Use 3-vertex separators to find 3s

[Figure labels: D1, D2]

Fill in the remaining elements of the PCM using more expensive algorithms from the igraph library and the power of parallel computing
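
One way the separator idea could work is sketched below in Python. This is an assumption about the technique, not the project's code. In a biconnected graph every pair of vertices has cohesion of at least 2. If removing two vertices splits the graph, any two vertices left in different pieces can be joined by at most 2 vertex-disjoint paths (by Menger’s theorem), so their PCM entry is exactly 2. The six-vertex ring used as input is made up.

```python
from itertools import combinations


def components_without(adj, removed):
    """Connected components of the graph after deleting the vertices in `removed`."""
    seen, components = set(removed), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            component.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components


def fill_twos(adj):
    """Return {pair: 2} for vertex pairs split by some 2-vertex separator.

    Assumes the input graph is biconnected, so cohesion is never below 2.
    """
    twos = {}
    for a, b in combinations(sorted(adj), 2):
        pieces = components_without(adj, {a, b})
        for piece_1, piece_2 in combinations(pieces, 2):
            for u in piece_1:
                for v in piece_2:
                    twos[frozenset((u, v))] = 2
    return twos


# Made-up biconnected graph: a ring of six vertices, 0-1-2-3-4-5-0.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
twos = fill_twos(ring)
```

In the ring, every non-adjacent pair is separated by some pair of vertices, so all of those PCM entries are filled with 2 without running a path-counting algorithm. Adjacent pairs are never separated this way and would be left for the more expensive methods.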

Page 22: CoSSci High Performance Computing for Anthropology and the Social Sciences

Not quite done

• The methods described above are a big step in the right direction, but the results are too inclusive and contain both the k-component and possibly other vertices (k-candidates)

• Currently working on techniques to address these shortcomings

• Construct a modified pair-wise cohesion that will lead to less-inclusive k-candidates

• Identify vertices or sets of vertices within the k-candidate that can be rejected

• The object here, using HPC, is to be able to move from analysis of small-scale networks to the very large scale of complex or contemporary networks.

Page 23: CoSSci High Performance Computing for Anthropology and the Social Sciences

Future work for new modeling approaches
• In the first round of work on Cross-Cultural Anthropological Modeling, Aug 2012-Aug 2014, ECSS installed 4 successive improvements of R software by Mathematical Anthropologist Malcolm Dow and Comparative Econometrician Anthon Eff, ending with DEf01, DEf01c, DEf01d and code for creating scales.

• New work involving Paul Rodriguez at SDSC. These single-dependent-variable models also provided networks of variables that were fully imputed and analyzed on Trestles using the R library(bnlearn) for Bayesian graphical network models. Next: Paul Rodriguez

• A second round, Aug 2014-2016, is proposed for Paul Rodriguez and Tolga Oztan to develop these new modeling approaches using Trestles HPC, illustrated in the next slides for the variables in White’s HighGods models. The other big problems tackled will be time series, Akaike Information Criterion (AICc) multivariate modeling, and path analysis with imputed variables, none of which are discussed here.

Page 24: CoSSci High Performance Computing for Anthropology and the Social Sciences

Testing New Procedures: Bayesian Network
• Get probability tables (i.e., frequency counts) for all variables (i.e., nodes)
• Consider the joint probability over all configurations of variable values, e.g.
    P(HiGd, FxCmW, AnXbw, NoRnDry, Wrtng, v1695, v270, v1650)
• Dependencies (edges) determine the conditioning variables for each table, e.g.
    P(HiGod | AnXbwlth, No_Rn_Dry) = P(HiGod | AnXbwlth)

[Diagram nodes: Anxbw, HiGod, NoRnD]
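
Turning frequency counts into a conditional probability table is a short computation. The Python sketch below is illustrative only (the project uses the R package bnlearn). It uses the HiGod by Writing & Records counts that appear in the results tables later in this deck, and assumes that rows are HiGod codes 1-4 and columns are Writing & Records codes 1-5.

```python
# Counts copied from the "Writing & Records" by HiGod table in this deck.
# Assumed layout: key = HiGod code, list position = Writing & Records code 1..5.
counts = {
    1: [35, 16, 10, 0, 8],
    2: [25, 17, 6, 0, 3],
    3: [7, 9, 3, 2, 2],
    4: [6, 7, 2, 10, 18],
}


def conditional_table(table):
    """Return P(HiGod = h | Writing = w), indexed as result[w][h]."""
    n_columns = len(next(iter(table.values())))
    result = {}
    for w in range(n_columns):
        column_total = sum(row[w] for row in table.values())
        result[w + 1] = {h: row[w] / column_total for h, row in table.items()}
    return result


total = sum(sum(row) for row in counts.values())
marginal = {h: sum(row) / total for h, row in counts.items()}  # P(HiGod = h)
cpt = conditional_table(counts)  # P(HiGod = h | Writing = w)
```

With these counts, P(HiGod=4 | Writing=5) is 18/31 while the unconditional P(HiGod=4) is 43/186. A gap of that kind between a conditional and a marginal probability is what an edge between two nodes represents.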

Page 25: CoSSci High Performance Computing for Anthropology and the Social Sciences

Finding Network Structure
• Network Fit Measure
    For a given graph (i.e., dependencies), all frequency counts can be reproduced
• Dependencies are given or discovered: all searching needs to score the network on fit
    locally (are edges good)
    globally (is whole network good)
  greedy search or ‘hill climbing’ (heuristics guide search),
  BUT, many solutions with same fit

Approach: using the R package bnlearn with bootstrap samples to get network statistics (borrowing ideas from biological network discovery)

Page 26: CoSSci High Performance Computing for Anthropology and the Social Sciences

Causal modeling (Trestles HPC)

New Bayesian Network Learning Results with DEf imputed data and library(bnlearn) in comparing two Moral Gods models (left/right): White, Oztan & Snarey (2014); Brown & Eff (2010)

HiGod (rows) by AnimXbwealth (columns)

HiGod    0   1   2   3   4   5   7   8   9
1       54   7   6   1   0   0   0   1   0
2       40   6   5   0   0   0   0   0   0
3       13   1   4   3   1   0   1   0   0
4       21   2   0   9   3   1   0   3   4

FxCmtyWages (rows) by HiGod (columns)

FxCmtyWages    1   2   3   4
0             43  27  11  17
1             18  11   5  23

3=not Islam or Christianity; 4=supportive of morality

HiGod (rows) by Writing & Records (columns)

HiGod    1   2   3   4   5
1       35  16  10   0   8
2       25  17   6   0   3
3        7   9   3   2   2
4        6   7   2  10  18

Page 27: CoSSci High Performance Computing for Anthropology and the Social Sciences

Trestles bootstraps, Paul Rodriguez SDSC, cont.
• 1000 bootstrap resamples were taken by sampling the original dataset with replacement (takes only a few minutes)
• For each new sample dataset, a Bayesian network was found using the grow-shrink algorithm
• The binary-valued adjacency matrix was averaged across all 1000 networks
• Adjacency matrices were sorted and counted

[Figure: frequency of unique adjacency matrices; axis labels: Unique Adjacency Matrices, frequency; tick values 0 and 155]
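
The resample, learn, average, and count loop described above can be sketched as follows. The Python below is illustrative only: the actual analysis used R's bnlearn with the grow-shrink algorithm on Trestles. The structure learner here is a deliberately simple, hypothetical stand-in (it links two binary variables when their covariance exceeds a threshold) and is not grow-shrink. The data set is made up.

```python
import random
from collections import Counter


def toy_learner(rows, threshold=0.1):
    """Hypothetical stand-in for a structure learner (NOT grow-shrink).

    Links two binary variables when the absolute covariance exceeds `threshold`
    and returns the adjacency matrix as a hashable tuple of tuples.
    """
    n, k = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(k)]
    adjacency = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            cov = sum(r[i] * r[j] for r in rows) / n - mean[i] * mean[j]
            if abs(cov) > threshold:
                adjacency[i][j] = adjacency[j][i] = 1
    return tuple(tuple(row) for row in adjacency)


def bootstrap_networks(rows, n_boot, seed=0):
    """Learn one network per bootstrap resample; average and count the matrices."""
    rng = random.Random(seed)
    k = len(rows[0])
    counts = Counter()
    for _ in range(n_boot):
        resample = [rng.choice(rows) for _ in rows]  # sample with replacement
        counts[toy_learner(resample)] += 1
    # Averaging the 0/1 matrices gives the expected presence of every edge.
    edge_freq = [
        [sum(m[i][j] * c for m, c in counts.items()) / n_boot for j in range(k)]
        for i in range(k)
    ]
    return edge_freq, counts


# Made-up data: column 1 copies column 0; column 2 is independent of both.
data = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 10
edge_freq, matrix_counts = bootstrap_networks(data, n_boot=200)
most_frequent = matrix_counts.most_common(3)  # sorted and counted networks
```

The averaged matrix `edge_freq` plays the role of the edge weights, while `most_frequent` lists which whole networks actually recur across resamples, the two summaries the slides distinguish.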

Page 28: CoSSci High Performance Computing for Anthropology and the Social Sciences

library(bootstrap), biocLite("Rgraphviz")

Page 29: CoSSci High Performance Computing for Anthropology and the Social Sciences

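Paul Rodriguez SDSC

# Bioconductor installer script (biocLite.R), then Rgraphviz
source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")
library(Rgraphviz)   # also loads the graph package, which provides randomGraph
library(bootstrap)
V = letters[1:10]    # ten node names, "a" to "j"
M = 1:4
g1 = randomGraph(V, M, 0.2)
plot(g1)

Probabilities are generated by bootstrap, run on the SDSC Trestles supercomputer

1695=No Scarification, 270=Class stratification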

Page 30: CoSSci High Performance Computing for Anthropology and the Social Sciences

3rd-step regression with imputed variables: White et al (bold or red) vs. Brown & Eff

Coefficients:
                 Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)      0.006272   0.473491     0.013   0.98945
Wy               0.651359   0.136649     4.767   0.00061  p<.001
FxCmtyWages*     0.751684   0.259420     2.898   0.00426  p<.01
+Missions        0.334426   0.140836     2.375   0.01868  p<.02
bio.5 (temp)    -0.150861   0.079799    -1.891   0.06039  p<.10
PCsizeSq**      -0.077855   0.041264    -1.887   0.06090  p<.10
Writing          0.115742   0.064661     1.790   0.07524  p<.10
Caste            0.171590   0.104619     1.640   0.10283
AnimXbwealth     0.090378   0.059296     1.524   0.12932
DistantFather   -0.129960   0.087188    -1.491   0.13793
No_rain_Dry      0.120057   0.083832     1.432   0.15395
PCsize           0.102367   0.078573     1.303   0.19440
ExtWar          -0.013503   0.010783    -1.252   0.21221
AgPot           -0.053785   0.064506    -0.834   0.40556
FoodScarcity     0.018975   0.056114     0.338   0.73567
Anim            -0.006878   0.057393    -0.120   0.90476

*The FxCmtyWages variable is, as hypothesized, significant.
**Works in both models.
All variables imputed for n=186

Page 31: CoSSci High Performance Computing for Anthropology and the Social Sciences

                     Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)          1.019415   0.729651     1.397   0.16577
dx$FxCmtyWages       0.023184   0.273012     0.085   0.93251  <- n.s.
dx$v2006 Missions    0.457471   0.220324     2.076   0.04068  p<.05
dx$v149 Writing      0.260651   0.104351     2.498   0.01429  p<.05
dx$v272              0.193109   0.182208     1.060   0.29203
dx$AnimXbwealth      0.105582   0.079593     1.327   0.18798
dx$v3               -0.003290   0.072426    -0.045   0.96387
dx$No_rain_Dry       0.340791   0.126310     2.698   0.00831  p<.01
dx$v1650            -0.012738   0.015911    -0.801   0.42546
dx$v1685            -0.038787   0.082818    -0.468   0.64066
dx$v206             -0.008370   0.072604    -0.115   0.90848
dx$bio.5            -0.002922   0.001762    -1.659   0.10064
PCAP                 0.139448   0.101782     1.370   0.17404
PCsize               0.025052   0.140601     0.178   0.85898
PCsizeSq            -0.054963   0.057641    -0.954   0.34284

In regard to autocorrelation, i.e., Galton’s Problem, do our DEf results differ from OLS? Yes: the estimates above are OLS.

Page 33: CoSSci High Performance Computing for Anthropology and the Social Sciences

Trestles bootstraps, Paul Rodriguez SDSC

A bootstrap procedure was used to explore the distribution of possible network models (Efron & Tibshirani, 1986). One thousand bootstrap resamples were taken by sampling the original dataset with replacement. For each new sample dataset, a Bayesian network was found using the grow-shrink algorithm (heeding independencies in the data). The binary-valued adjacency matrix for each network was saved and then averaged across all 1000 networks, thereby producing an expectation for the presence of every edge (figure with graph in file named 'BNwboot_nowy_05thresh'). This approach has proved very useful in biological network discovery (e.g., Marbach et al. 2012). The expectation serves as a weight on the edge, but it does not indicate what typical networks appear in the bootstrap samples. Therefore, we also sorted and counted the adjacency matrices, and printed out the most frequent networks.

Efron, B.; Tibshirani, R. 1993. An Introduction to the Bootstrap. Chapman & Hall/CRC.
Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR; The DREAM5 Consortium, Kellis M, Collins JJ, Stolovitzky G. 2012. Wisdom of crowds for robust gene network inference. Nature Methods 9(8):796-804. 58 collaborators.
Margaritis, D. and Thrun, S. 2000. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems 12. (“the bootstrap.”)
