It’s About Data: 50,000 ft Overview of Data To
Insight Center
Beth Plale Professor and Director, Data To Insight Center
Dataset: D2I-AMSR-E-Provenance Dataset
Owner and Creator: Data to Insight Center
Size: 15MB
The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4, 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph, which is approximately 13MB when represented as XML.
Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei, Conover, Helen. 2012. Provenance of AMSR-E Data from the National Snow and Ice Data Center (NSIDC). OPM XML Ver. 1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight Center. http://dx.doi.org/10.5967/M0F47M2D
Dataset: 10GB Noisy Provenance Dataset
Owner and Creator: Data to Insight Center
Size: 10GB
Provenance of scientific data is a key piece of the metadata record for data's ongoing discovery and reuse. Provenance collection systems capture provenance on the fly; however, the protocol between application and provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, or simply inaccurate. The Gigabyte Synthetic Provenance Database is a “noisy” data collection generated using the Workflow Emulator Tool (WORKEM) with a number of scientific workflow examples that include modeled failures. http://d2i.indiana.edu/provenance_gigabyte_database
Dataset: D2I-Vortex2 Dataset
Owner and Creator: Data to Insight Center
Size: 100GB
D2I generated a couple hundred short-term (12-15 hour) regional (state-sized) weather forecasts in late spring 2010. These forecasts were generated to support the Vortex2 campaign. Funded by the National Science Foundation, Vortex2 was an effort to move sensitive mobile data gathering instruments to the location of severe weather activity during the late spring tornado season. The Vortex2 campaign covered locations from Texas to Wisconsin over the course of six weeks. The dataset was carefully curated in 2011 with metadata added for each forecast. It has since become a significant dataset, used in subsequent projects to develop other research products such as an ontology and federated searching of research data metadata.
Plale, Beth, Brewster, Keith, Mattocks, Craig, Bhangale, Ashish, Withana, Eran C., Herath, Chathura, Terkhorn, Felix, Chandrasekar, Kavitha. Weather Forecast Data from the D2I-Vortex2 project. May 1 to June 15, 2010. Bloomington, Indiana: Data to Insight Center. https://scholarworks.iu.edu/dspace/handle/2022/14983
Visualization of Network Data Provenance
With Global Research Network Operations Center (GNOC)
Multi-layer provenance capture
• Provenance: lineage of a data product or collection of data resulting from computational execution of some sort
• Applications often run in multiple phases, on multiple machines, and over multilayer software stacks. Provenance can be captured at multiple layers of the stack.
What if provenance is captured only at layers below the application?
Case I: GENI WiMAX DDoS
WiMAX network; DoS attack exploiting WiMAX system parameters. Experiment uses 100 subscribers with varied configurations of 6 parameters. Current version runs on NS2.
With Clemson University
Provenance of WiMAX DDoS Experiment
• Provenance capture with NetKarma. NetKarma
  • captures provenance of packet movement, and
  • infers critical provenance about packets that were dropped, and by doing so is able to convey information about DDoS attacks through visualization
• Improvement over earlier hand-worked ANOVA analysis.
Provenance captures causality:
Dropped packets increase as frame duration increases from 0.01s to 0.02s
Run id  Frame duration  Number of attackers  Attack backoff start  Attack request retry  BW backoff start  BW request retry
1       0.004           20/80                1                     2                     1                 2
244     0.01            20/80                1                     2                     1                 2
487     0.02            20/80                1                     2                     1                 2
Provenance capture AMSR-E data processing pipeline
Aug 2013 10
Advanced Microwave Scanning Radiometer (AMSR-E): one of six
sensors aboard the Aqua satellite. A passive microwave radiometer.
It observes atmospheric, land, oceanic, and cryospheric
parameters, including precipitation, sea surface
temperatures, ice concentrations, snow water equivalent, surface
wetness, wind speed, atmospheric cloud water, and water vapor.
NASA AMSR-E imagery ingest processing schedule at Univ of Alabama Huntsville
Provenance History Layout Algorithm
• Provenance for 1 month of processing in the NASA satellite ingest processing pipeline.
• Can help trace an error back to its cause.
• Shows the relationship between daily products (each clover flower in the clover leaf chain) and final monthly products at the left end.
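Tracing an error back to its cause amounts to walking a provenance graph upstream from a product. The following is a minimal sketch, not the Karma implementation: the `derived_from` mapping and all product names in it are hypothetical, and a real OPM graph would also carry processes and agents.

```python
from collections import deque

# Hypothetical provenance edges: each item maps to the items it was derived from.
derived_from = {
    "monthly_rain": ["daily_rain_01", "daily_rain_02"],
    "daily_rain_01": ["swath_a"],
    "daily_rain_02": ["swath_b", "swath_c"],
}

def trace_lineage(product, derived_from):
    """Return every upstream item reachable from `product`, breadth-first."""
    lineage = []
    visited = {product}
    queue = deque([product])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, []):
            if parent not in visited:
                visited.add(parent)
                lineage.append(parent)
                queue.append(parent)
    return lineage

# Everything the (hypothetical) monthly product depends on, nearest first.
lineage = trace_lineage("monthly_rain", derived_from)
print(lineage)
```

If a monthly product looks wrong, the returned list gives the daily products and inputs to inspect, nearest ancestors first.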
Provenance of a seaIce daily workflow
Provenance graph compare: failed runs
The provenance graph on the left is the complete provenance of a successful execution. Comparing it with the provenance graph on the right shows that the right one is a failure, because the final data product (green) in the left graph cannot be matched.
Graph compare: dropped provenance
The left graph is the provenance of a successful execution. The comparison shows that although the right graph is also a successful execution, it has dropped notifications in provenance capture, because all nodes in the left graph can be matched but some of its edges cannot.
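The two comparisons above can be sketched as a set difference between a reference graph and a candidate graph. This is an illustration only: it assumes nodes match when their labels are identical, which is simpler than real provenance graph matching, and the node names and the `diagnose` rules are hypothetical.

```python
def compare_provenance(reference, candidate):
    """Report nodes and edges of `reference` that have no match in `candidate`.

    Each graph is a (nodes, edges) pair: a set of labels and a set of
    (source, target) tuples. Matching by identical label is an assumption.
    """
    ref_nodes, ref_edges = reference
    cand_nodes, cand_edges = candidate
    return {
        "unmatched_nodes": sorted(ref_nodes - cand_nodes),
        "unmatched_edges": sorted(ref_edges - cand_edges),
    }

def diagnose(report, final_product):
    """Map a comparison report onto the two cases described on the slides."""
    if final_product in report["unmatched_nodes"]:
        return "failed run"            # final data product is missing
    if not report["unmatched_nodes"] and report["unmatched_edges"]:
        return "dropped provenance"    # all nodes present, some edges lost
    if not report["unmatched_nodes"]:
        return "complete"
    return "incomplete"

# Hypothetical three-node workflow and two degraded captures of it.
success = ({"input", "process", "product"},
           {("input", "process"), ("process", "product")})
failed = ({"input", "process"}, {("input", "process")})
dropped = ({"input", "process", "product"}, {("input", "process")})

print(diagnose(compare_provenance(success, failed), "product"))
print(diagnose(compare_provenance(success, dropped), "product"))
```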
Hathi Trust Research Center
Text mining at scale
#HTRC #HathiTrust
• HathiTrust is a large corpus providing opportunity for new forms of computational investigation.
• The bigger the data, the less able we are to move it to a researcher’s desktop machine.
• Future research on large collections will require that computation move to the data, not vice versa.
HTRC Partners
IU SoIC and Libraries (Beth Plale and Robert McDonald); UIUC GSLIS and Libraries (J. Stephen Downie and Beth Sandore); Brandeis (John Unsworth); University of Michigan (HathiTrust) http://www.hathitrust.org/htrc
HTRC Non-Consumptive Research Paradigm
• No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the collection of copyrighted works to reassemble pages from the collection.
• Definition disallows collusion between users, or accumulation of material over time. Differentiates a human researcher from a proxy, which is not a user. Users are human beings.
Themes for Authors
• Two topics with identical centralities but separate themes
Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899.
Underwood et al. Research
Corpus Usage Patterns
Access by chapter
Access by page
Access by special contents (table of contents, index, glossary)
• Philosophy: computation moves to data
• REST-based Web services architecture and protocols
• Registry of services and algorithms
• Solr full-text index
• NoSQL store as volume store
• OpenID authentication
• Portal front-end, programmatic access
• SEASR text mining algorithms
8/13/13 22
Studies in Social-Ecological Systems Data Management
with David Leake and Xiaozhong Liu
What is the SES Framework?
Source: Ostrom, E. 2009. A General Framework for Analyzing the Sustainability of Social-Ecological Systems. Science 325: 419-422
e.g., Forests, fisheries, grazing, irrigation systems for agriculture
SES Coding categories
Int’l Forestry Resources and Institutions (IFRI) database: collection
• Data collected in 23 countries by 13 Collaborating Research Centers.
• Data collected using a research instrument with 10 form packages, totaling 180 pages, with some packages used 30 times or more per visit
IFRI database: data
• Over 18 years of longitudinal data on forest resources, use, and governance
• Consists of 346 separate site visits
• Relational database captures relations between data collection packages
• Responses to each question are a column in the database; 922 questions in the IFRI database
[Figure: ordered heatmap of data density (Z-score, scale -2 to 8) for study sites, site visits 161 - 180, by SES category: Resource System (RS), Resource Units (RU), Governance System (GS), Users (U), Interactions (I), Outcomes (O), Social, Economic, Political Settings (S)]
IFRI map to SES for data discovery
IFRI data densities seen
through SES
ordered heatmap
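The heatmap scale is a Z-score of data density. The slides do not say how density is computed, so the sketch below assumes it is the count of answered questions per SES category for one site visit; the counts themselves are made-up illustration values.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]   # no spread, so every score is zero
    return [(v - mu) / sigma for v in values]

# Hypothetical counts of answered questions for one site visit, by SES code.
answered = {"RS1": 2, "RS2": 4, "RU2": 4, "GS1": 4,
            "U1": 5, "I1": 5, "O1": 7, "S1": 9}

density = dict(zip(answered, z_scores(list(answered.values()))))
for code, z in density.items():
    print(f"{code}: {z:+.2f}")
```

Cells well above zero mark categories where a site visit is data-rich relative to the rest, which is what makes the heatmap useful for data discovery.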
Automated Classification of Survey Questions to SES
• Categorization of each question in the instrument to the SES Framework in an automated way
• Carried out word frequency calculation on instrument questions and SES categories. Used this to cluster questions.
• Applied machine learning to avoid the need for a learning data set
  – Examined decision tree classifiers, Naïve Bayes, support vector machine (SVM). Decision tree classifiers performed best*.
* Jensen, S., et al. IEEE Int’l Conference on e-Science, Oct 2012
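The word frequency step can be illustrated with a bag-of-words match between a question and short category descriptions. This is a simplified stand-in, not the classifiers evaluated in the cited work (decision trees performed best there): the category descriptions, stopword list, and sample question are all hypothetical.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to", "what"}

def word_counts(text):
    """Lowercase word frequencies with stopwords removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical one-line descriptions of three SES categories.
categories = {
    "Resource System": "forest size location boundaries resource system",
    "Governance System": "rules property rights monitoring sanctioning government",
    "Users": "number of users history of use leadership of user group",
}

def classify(question, categories):
    """Assign the question to the category with the most similar wording."""
    q = word_counts(question)
    return max(categories,
               key=lambda name: cosine(q, word_counts(categories[name])))

label = classify("What rules govern monitoring of the forest?", categories)
print(label)
```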
With IU Libraries, University of Michigan (lead), and UIUC
SEAD
SEAD Virtual Archive (SVA) -- manage sustainability science
window to multiple IRs -- OAIS model
IU Scholarworks IR
publish associate
discover
UIUC IDEALS IR
UMich Deep Blue IR
ingest
Hathi Trust Research Center http://www.hathitrust.org/htrc
SEAD DataNet http://www.sead-data.net
socioeco informatics http://d2i.indiana.edu/socio-eco-informatics
data provenance http://d2i.indiana.edu/provenance