Addressing the Challenges of the Scientific Data Deluge


Kenneth Chiu, SUNY Binghamton


Outline

• Overview of collaborative projects that I’m working on.

• Discussion of challenges and approaches.

• Technical overview of specific projects


Autoscaling Project

• Traditional sensor-network research focuses on energy, routing, etc.

• In “environmental observatories”, management is the problem.

• Adding a sensor takes a lot of manual reconfiguration.
– Calibration, recalibration.
– QA/QC is also a major issue.

• What corrections have been applied to the data, and what calibration/maintenance has been applied to the sensor?

• With U. Wisconsin, SDSC, and Indiana University.

Motivation

• Adding a sensor requires a great deal of manual effort.
– Reconfiguring the datalogger
– Reconfiguring data acquisition software
– Reconfiguring QA/QC triggers
– Reconfiguring database tables

• QA/QC is not very automated.

• Result: Sensor networks are not very scalable.

• Goal: Automate.

Metadata (for each final table)

• Describes each final table.

• Used to generate forms dynamically for data retrieval from the website.

• Entered manually.

Approach

• Use an agent-based, bottom-up approach.

• Agents coordinate among themselves as much as possible.

• Unify communications: all communication is done via data streams.

• Data streams are represented as content-based publish-subscribe systems.

Long-Term Ecological Research (LTER)

[Architecture diagram: at Trout Lake Station, sensors and a buoy feed a datalogger and an ORB; QA, environment, and configuration agents exchange CIMA configuration and environment events over ARTS and other connections; data reaches an Oracle database and web server on the University of Wisconsin campus, which web browsers at other locations access via JDBC/ODBC.]

Agents

• Characteristics
– Autonomous
– Bottom-up
– Distributed coordination
– Independent/loosely coupled

• Can be thought of as a “style” for implementing distributed systems.

Sensor Metadata

• Each sensor has intrinsic and extrinsic properties.
– Intrinsic properties: type, model number, etc.
• Static: cannot be changed.
• Dynamic: e.g., the SDI-12 address.
– Extrinsic properties: location, sampling rate, etc.

• Use code-generation techniques to generate the proper code from the sensor metadata (a sketch follows below).
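A minimal sketch of this code-generation step, assuming hypothetical names (SensorMetadata, emitLoggerConfig) and a made-up configuration syntax; the project's actual generator would target a real datalogger language.

#include <iostream>
#include <sstream>
#include <string>

struct SensorMetadata {
    // Intrinsic properties (fixed by the device).
    std::string model;        // e.g. "CS547A"
    std::string measurement;  // e.g. "water temperature"
    int sdi12Address;         // a dynamic intrinsic property
    // Extrinsic properties (assigned by the deployment).
    std::string location;
    int sampleIntervalSec;
};

// Emit a (made-up) datalogger program fragment for one sensor.
std::string emitLoggerConfig(const SensorMetadata& s) {
    std::ostringstream out;
    out << "; auto-generated for " << s.model << " at " << s.location << "\n"
        << "SDI12Recorder(addr=" << s.sdi12Address
        << ", interval=" << s.sampleIntervalSec << "s"
        << ", label=\"" << s.measurement << "\")\n";
    return out.str();
}

int main() {
    SensorMetadata s{"CS547A", "water temperature", 3, "buoy, 2 m depth", 60};
    std::cout << emitLoggerConfig(s);
}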


Automatic Sensor Detection and Inventory

[Workflow diagram: (1) a detection event from the sensor/datalogger request-response on the acquisition computer; (2, 3) the instrument agent on the field-station computer consults the sensor metadata repository via a web service; (4) it generates and (5) uploads a new datalogger program; (6, 7) data then flows to the database at the data center.]

QA/QC

• Malfunctioning anemometer detected as an abnormal occurrence of zero wind speed values.

[Chart: frequency of zero hourly-average wind-speed values per month, Jan-95 through Jan-03; counts range from 0 to roughly 250.]

Another Example

• Buoy was pulled down in the water by the ice.

[Charts: water temperature (°C, −2 to 4), 23-Nov to 21-Feb with the sensors displaced, versus 15-Nov to 13-Feb in a normal winter. Source: Hu and Benson.]

Crystal Grid Framework

• Seeks to develop standards and middleware for integrating instrument and sensor data into wide-area infrastructures, such as grid computing.

• With Indiana University.


Motivation

• The process of collecting and generating data is often critical.
– Current mechanisms for monitoring and control either require physical presence or use ad hoc protocols and formats.

• Instruments and sensors are already “wired”.
– Usually via obscure, or perhaps proprietary, protocols.

• Using standard mechanisms and protocols can give these devices a grid presence.
– Benefit from a single, unified paradigm and terminology.
– Single set of standards; exploit existing grid standards.
– Simplifies end-to-end provenance tracking.
– Faster, more seamless interactions between data acquisition and data processing.
– Greater interoperability and compatibility.

Philosophy: Push grid standards as close to the instrument or sensor as possible. (But no further!) Deal with “impedance mismatches” close to the instrument, so as to localize complexity.

Goals

• Develop a set of standard grid services for accessing and controlling instruments.
– Based on Web standards such as WSDL, SOAP, XML, etc.

• Develop an instrument ontology for describing instruments.
– Applications use the description to interact.

• Develop middleware that abstracts and layers functionality.
– Minor differences in instruments should only result in minor loss of functionality to the application.

• Move metadata and provenance as close to the instrument as possible.

Overview

[Layered architecture diagram: a data pipeline of acquisition, analysis, and curation components, each built from a device-independent application module (acquisition/analysis/curation code) atop a device-dependent instrument-access virtualization module with a shared implementation; the instrument (a controller with sensors) is reached over the physical network transport, and the scientist works through a GUI, remote access, and instrument presentation layers.]

Distributed X-Ray Crystallography

• The crystallographer, chemist, and technician may be separated.
– Large resources such as synchrotrons
– Convenience and productivity
– Expanding usage to smaller institutions

• Data collection, analysis, and curation may be separated.

• Approximate data requirements: 1–10 TB/year.
– Currently stored at IU.

• Real-time data collection and control.

• Collaboration with IU, Sydney, JCU, and Southampton.

X-Ray Crystallography

• Scientists are (understandably) very reluctant to have your software installed on the acquisition machine.
– Use a proxy box to access its files via CIFS or NFS.
– Scan for files that indicate activity.

• Unfortunately, scientists can manually create files, which can confuse the scanner. There is no ideal solution.

• For sensor data, request-response is not ideal.
– Push data using one-way messages.

• In WSDL 2.0, consider “connecting” out-only services to in-only services.

X-Ray Crystallography

[Deployment diagram: at Indiana University and the University of Sydney, a proxy box beside each acquisition machine receives diffractometer output via CIFS and exposes it through instrument services; portals, instrument managers, and data archives run at Indiana University, Argonne National Labs, and the University of Southampton; the legend distinguishes grid from non-grid services and persistent from non-persistent ones.]

TASCS: Center for Technology for Advanced Scientific Component Software

• Multi-institution DOE project.

• Seeks to develop a common component architecture for scientific components.

• My focus within it is to develop a BabelRMI/Proteus implementation.
– And to develop C++ reflection techniques to improve dynamic connection abilities.

• With LLNL and many other institutions.

Babel

• Language interoperability toolkit developed at LLNL.

• Allows writing objects in a number of languages, including non-OOP ones such as Fortran.

• Began as a purely in-process tool, now includes an RMI interface.


Proteus

• Started off as a unification API for messaging over multiple standards and implementations, such as CORBA, JMS, SOAP.

• Moving towards focusing on multiprotocol web services.

• Though almost always bound to SOAP, WSDL actually fully supports almost any protocol.


Runtime

[Diagram: the Babel RMI call path — a generated stub and IOR on the caller side invoke an RMI stub, Babel-Proteus adapter, and serializable objects over Proteus and WSIT; on the callee side, the skeleton, IOR, and C++ skeleton dispatch to the implementation; the legend marks user, library, generated, and Babel-Proteus-generated components.]

Multiprotocol

[Diagram: multiprotocol dispatch — a Proteus client in process 1 reaches provider A over protocol A and provider B over protocol B in process 2, across the network.]

Lake Sunapee

• Most e-Science/cyberinfrastructure R&D is for institutional science.
– It assumes significant resources and expertise.

• There is much less work on CI for citizen science, non-profit organizations, etc.

• This project explores how to engage them in the development of cyberinfrastructure and e-Science.
– Also with a focus on how to use e-Science to engage and educate K-12 students.
– Also with a focus on how to train CS students to better engage scientists.

• With U. Wisconsin, U. Michigan, LSPA, and IES.

• Hold a series of workshops to understand needs.

• Research and develop systems that give them accessible means to interpret the sensor data.

• Course component: seminar/project course where students will work with citizen scientists in small groups to define and implement e-Science projects with the lake association.


• Semantic publish-subscribe.
– Content-based publish-subscribe needs a content model.
– Semantic web/description logics provide an ideal content model (a content-based sketch follows below).


Many Small Datasets

• Much ecological data is characterized not by a few large datasets, but by many small datasets.
– e-Science has up to now mostly chosen to focus on a few large datasets.

Flexible Electronics and Nanotechnology

• Work with Howard Wang in BU ME.

• “Ontologies” for materials science processes (internal).

• Undergraduate education project (NSF).


Material Processes

• The product of materials science research is the characterization of a process (vibration, heating, chemical, electrical, etc.).

• Applying such research means finding a sequence of processes that will transform material A (with certain properties, such as particle size) into material B (with certain other properties).

• It is very difficult to search the research literature for this.

• This is also a type of path-finding problem.

[Diagram: two small graphs; in each, an anonymous node has a hasName edge to “annealing” and a tempSchedule edge to “a schedule” (or “a different schedule”). The anonymous node only serves to “bind” the other nodes together; you can think of it as representing the process as a whole.]

Conceptually, the schedule is just a function that gives the temperature as output given the time as input. One question is whether to represent it partially in the graph model, or to treat its representation as completely outside the model. For example, a function can be represented as a table, a Fourier series, wavelets, etc. (See the sketch below.)

Information is sparse.
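A small illustration of the two options, with made-up types: a sampled table that could be represented inside the graph, versus an opaque function kept entirely outside the model.

#include <algorithm>
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// Option 1: (time s, temperature C) samples, representable in the graph.
using ScheduleTable = std::vector<std::pair<double, double>>;

// Option 2: any callable time -> temperature, outside the graph model.
using ScheduleFn = std::function<double(double)>;

int main() {
    ScheduleTable table{{0.0, 25.0}, {600.0, 300.0}, {3600.0, 300.0}};
    ScheduleFn ramp = [](double t) {
        return 25.0 + std::min(t, 600.0) * (275.0 / 600.0);  // ramp, then hold
    };
    std::cout << "table:    t=600 -> " << table[1].second << " C\n";
    std::cout << "function: t=300 -> " << ramp(300.0) << " C\n";
}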


Undergraduate Education

• Groups of nanotechnology students develop senior design projects with CS students.


Programs: Australia, Canada, China, Finland, Florida, New Zealand, Israel, South Korea, Taiwan, United Kingdom, Wisconsin

First meeting: San Diego, March 7–9, 2005

Source: T. Kratz


Vision and Driving Rationale for GLEON

• A global network of hundreds of instrumented lakes, data, researchers, and students.

• Predict lake ecosystems' responses to natural and anthropogenically mediated events
– through improved data inputs to simulation models,
– to better plan and preserve freshwater resources on the planet.

• More or less a grassroots organization.

• Led by Peter Arzberger at SDSC, and with U. Wisconsin.

Why develop such a network?

• Global e-Science is becoming increasingly possible.

• Developments in sensors and sensor networks allow some key measurements to be automated.

Porter, Arzberger, C. Lin, F. P. Lin, Kratz, et al. (2005), July 2005 issue. Source: T. Kratz

Outline

• Overview of collaborative projects that I’m working on.

• Discussion of challenges and approaches.

• Technical overview of specific projects


Research Challenges

• The biggest challenge is data.

• Much time and effort is spent managing data in time-consuming and human-intensive ways.
– Often stored in Excel, text files, SAS.
– Metadata in notebooks, gray matter.

• There are no incentives to make data reusable.
– Providing data is not valued academically.

• Too much manual work is involved in acquisition.
– This means much is not captured automatically and semantically.

• Standardization of things such as ontologies is very slow, and tends to be top-down.
– Can we first build a system that provides some benefit without forcing scientists through a painful standardization process?

Cyberinfrastructure and e-Science

• There have been huge improvements in hardware.

• There have been huge local improvements in software.

• Not so many improvements in large-scale integration and interoperability.


Data, Data, and More Data!

• Data is the driver of science.

• Recent advances in technology have given us the ability to acquire and generate prodigious amounts of data.

• Processing power, disk, and memory have increased at exponential rates.

It’s Not a Few Huge Datasets

• Huge datasets get more attention.
– More glamorous.
– A traditional type of CS problem.
– Easier to think about.

• But it's the number of different datasets that is the real problem.
– With one big dataset, efforts can be concentrated on that one problem.
– Not very amenable to traditional CS “thinking”, since there is a very significant human-in-the-loop component.
– The best CS research is useless if the human ignores it.

We Are The Same! (More or Less)

Technology advances fast.

People advance slowly! People compose our institutions, our organizations, our modes of practice.

Result: The old ways of doing things don't cut it. But we haven't yet figured out the new ways.

Technology Impacts Slowly

• Technologies often require many systemic changes to bring benefits.
– Sometimes they require other complementary technologies to be invented.

• The steam engine was invented in 1712, but did not become a huge economic success until the 1800s.

• The motor and generator were invented in the early 1800s.
– The real benefits did not occur until the 1900s.

Steam To Electric

• Steam-powered factories were built around a single large engine.

• Belts and other mechanical drives distributed the power.

• If you brought a motor to a factory foreman:
– His factory wasn't built for it.
– He might not be able to power it.
• A chicken-and-egg problem.
– He doesn't even know how to use it.

• It took decades.

• Similarly, I believe we are in the early stages when it comes to computer technology.

Socio-Technical Problem

• What will it take to figure out how to use all this data?

• Not a pure CS problem: people's actions affect how easy it is to use all the data.

• Many problems these days are sociotechnical in nature.
– Password security is a solved problem.
– Interoperability is a solved problem.

• Figuring out how to use data is even harder than power was, since power distribution is physical and easy to see.
– Data/information flow is hard to see.

A Vision

• A scientist sits in his office.

• He wonders: “Do children who live closer to cell towers have higher rates of autism?”

• How much time would it take him to test this hypothesis?
– Find the data.
– Reformat the data, convert it, etc.
– Run some analysis tools. Maybe find time on a large resource.

• But the data is out there!
– There are many hypotheses that are never tested because it would take too much work.

• This vision also applies to business, military, medicine, industry, management, etc.

• There are a million sources of data out there.
– Real-time data streams, archived data, scientific publications, etc.

• How can we build a flexible infrastructure that will allow analyses to be composed and answered on the fly?

• How do we go from data+computation to knowledge?


RDF-like Data Model

• We hypothesize that part of the problem is that RDBMSs are based on data models that do not fit scientific data well.
– This “impedance mismatch” is a barrier.

• Thus, develop models that more closely resemble the mental model scientists use when thinking about data.
– The less a priori structure imposed on the data, the better.

Goals

• Allow some common subset of code and design to be used for many kinds of scientific data and applications.

• Suggest a data and information architecture for querying and storage.

• Provide some fundamental semantics. Each discipline would then refine these semantics.

• Don't get bogged down in trying to figure out everything. Just try to find some lowest common denominator.

• This is a logical model of data. We also need a “physical” model to handle transport, archiving, etc., and a mapping from the physical model to the logical model. For example, an image file has more than just the raw intensities, but some metadata may not be in the file. We don't want the logical model to be concerned with how the data is actually arranged.

• Promote bottom-up, grassroots approaches to building standards.

One Person’s Metadata Is Another Person’s Data

• The distinction between data and metadata is artificial and problematic.
– What is metadata in one context becomes data in another. For example, suppose you are taking the temperature at a set of locations (determined via GPS). For each reading, the temperature is the data and the location is the metadata. But now suppose you need the error of the location: does the error become the metametadata of the location metadata?
– A made-up example based loosely on crystallography: the spatial correction is based on a calibration image obtained from a brass plate, so the calibration image is metadata for the set of frames. Now suppose the temperature of the brass plate when the image was made is needed. The temperature is now metametadata.

• Use a graph-based model.
– Based on RDF.
– The actual data is stored as a graph.
• Contrast with models like E-R, where the graph “models” the data rather than actually being the data.
• A node in E-R might be “customer”, representing the class of entities that are customers rather than any specific customer.

• The model (sketched below):
– Each node is a datum.
– Each edge denotes an association/attribute/property.
– Nodes can be grouped into nodesets, which are also nodes.
• A node may be in more than one nodeset.
– A node-edge-node triple can also be a node.
– The main difference from RDF is an attempt to build reification into the model.

• Somewhat similar to a hypergraph.
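A minimal sketch of the model as C++ structs (illustrative, not the project's code): data, triples, and nodesets are all nodes, so a triple can itself be the subject of another triple — reification is built in.

#include <string>
#include <utility>
#include <vector>

struct Node { virtual ~Node() = default; };  // everything in the graph is a node

struct Datum : Node {                        // a concrete value
    std::string value;
    explicit Datum(std::string v) : value(std::move(v)) {}
};

struct Triple : Node {                       // node-edge-node; itself a node
    const Node* subject;
    std::string property;
    const Node* object;
    Triple(const Node* s, std::string p, const Node* o)
        : subject(s), property(std::move(p)), object(o) {}
};

struct NodeSet : Node {                      // a group of nodes; also a node
    std::vector<const Node*> members;
};

int main() {
    Datum thirteen("13"), temperature("temperature");
    NodeSet set;
    set.members.push_back(&thirteen);                   // group a datum
    Triple attr(&set, "set_attr_1", &thirteen);         // attribute on the nodeset
    Triple onEdge(&attr, "triple_prop", &temperature);  // attribute on that triple
    (void)onEdge;
}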


• The edge with the attribute name set_attr_1 is an attribute of a nodeset.

• The edge with the attribute name triple_prop is an attribute of the above edge.

[Diagram: two nodesets over data nodes such as 13 (temperature) and 20 (angle); a set_attr_1 edge attaches an attribute to one nodeset, and a triple_prop edge attaches an attribute to that edge itself.]


Complete Capture of Raw Data

• Complete digital capture of data and metadata.
– The data is already digital.

• Must have full provenance and other metadata.

Put Everything In the Triplestore

• Unify semantic networks and data graphs.

• Metadata relationships can use reified triples.

• Don't wait for standards; people take too long to decide.
– Bottom-up standards tend to work better.
– There must first be demand for the standard.

• All data is read-only.

But We Can Never Store That Much

• Maybe we can.

• But to drive a technology, we first need to show the need.

• RDBMSs have had several decades of research to improve performance.

Publications Are Data

• In some fields, such as materials science, papers are 80% boilerplate text.

• It's better to directly publish this as structured, semantic data.
– No natural language (NL).

• Use NL annotations where needed.

• A scientist runs experiments.
– All data is captured.

• She reaches a point where she wishes to publish.

• She reviews her experimental data (all captured with provenance and full metadata, sensor calibration, etc.), and drags and drops what is most relevant.

• She creates a narrative by creating annotated links between experiments to explain the insights.
– Typically at most one page of text, maybe less.

• She clicks a button to submit for publication.

Closer Ties Between Theoreticians and Practitioners

• In the real world, semantic data treatments will likely need to deal with uncertainty, quantitativeness, ambiguity, and fuzziness.
– There is research in these areas, but not a lot of penetration into practice, which prevents good feedback to the theoreticians.
– For example, many practitioners don't even know about polyhierarchies (Clay Shirky).

• Attempts to create ontologies often devolve into trying to figure out which class is the parent.

Outline

• Overview of collaborative projects that I’m working on.

• Discussion of challenges and approaches.

• Technical overview of specific projects


Distributed Triplestores

• Published in e-Science 2007.

• With IU student Tharaka Devadithya.


Motivation

• Data in some domains is dynamically structured.

• Predefining structures (e.g., schemas in an RDBMS) creates a barrier to storing such data.
– Certain minute details may get discarded.

• Scientists generally store experiment details in text or binary files (e.g., spreadsheets, word-processing documents).
– These files can be stored in databases as BLOBs.
– However, it is not possible to query these data efficiently.
– Sharing data with collaborators requires that everyone can read the format used by the author.

Storing Dynamically Structured Data

• An RDBMS can be used by modifying its schema each time the structure of the data changes.
– Not a feasible option if the schemas need to be modified very frequently.

• Data can be stored in a file system, with a hierarchical directory structure to organize it.
– The author needs to remember the organization of the data.
– Difficult to share data among collaborators.

• There is a strong need for a store of dynamically structured data that does not hinder efficient querying.

Dynamic Structures with Databases

Timestamp           | Value | Units
2006-10-12 14:23:33 | 25.2  | Celsius
2006-10-12 16:44:25 | 25.5  | Celsius

Adding a timezone (new column):

Timestamp           | Timezone       | Value | Units
2006-10-12 14:23:33 | EST (or NULL?) | 25.2  | Celsius
2006-10-12 16:44:25 | EST (or NULL?) | 25.5  | Celsius

Splitting date from time:

Date       | Time     | Timezone       | Value | Units
2006-10-12 | 14:23:33 | EST (or NULL?) | 25.2  | Celsius
2006-10-12 | 16:44:25 | EST (or NULL?) | 25.5  | Celsius

Dynamic Structures with Databases… More Issues

• Suppose the following information is stored about a sensor:
– Manufacturer
– Measurement type (e.g., temperature, humidity)
– Measurement units

• What if there is one sensor whose manufacturer is not known?
– Insert NULL into the Manufacturer field?

• Now, what if the purchase date must be stored for only one sensor?
– Add a new column? What value goes in that column for the other sensors?
– Add another table and join with the original table?

Semantic Web Solution

• Semantic web solutions have been used successfully in both scientific and commercial environments.
– They do not impose any structure on the data.
– Data is modeled as a directed graph.

• The Resource Description Framework (RDF) is the most commonly used standard for representing such graphs.
– It can be used to describe any property of any resource.

RDF and Triplestores

• Triple
– Subject: the resource being described
– Predicate: the property being described
– Object: the value of the property

• E.g., (methyl-cyanide, crystallographer, John)
– The crystallographer for methyl-cyanide is John.

• A graph in RDF is represented as a set of triples.
– Each triple connects a subject node to an object node in the graph.

• A persistent set of such triples is known as a triplestore (a toy version is sketched below).

[Diagram: Subject —Predicate→ Object]
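A toy triplestore along these lines, with wildcard pattern matching over (subject, predicate, object) strings; this is illustrative only, not the paper's implementation.

#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct Triple { std::string s, p, o; };

// std::nullopt acts as a wildcard in each position.
std::vector<Triple> query(const std::vector<Triple>& store,
                          std::optional<std::string> s,
                          std::optional<std::string> p,
                          std::optional<std::string> o) {
    std::vector<Triple> out;
    for (const auto& t : store)
        if ((!s || t.s == *s) && (!p || t.p == *p) && (!o || t.o == *o))
            out.push_back(t);
    return out;
}

int main() {
    std::vector<Triple> store{
        {"methyl-cyanide", "crystallographer", "John"},
        {"methyl-cyanide", "startTime", "2006-10-12T14:23:33"},
    };
    // Who is the crystallographer for methyl-cyanide?
    for (const auto& t : query(store, "methyl-cyanide", "crystallographer", std::nullopt))
        std::cout << t.s << " " << t.p << " " << t.o << "\n";
}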


Example of RDF Graph

[Figure: an example RDF graph.]

XML Databases

• Proposed as suitable for such dynamically structured data.

• Commercial databases are starting to provide native support for XML.

• XML is extensible and does not impose any structure on the data.
– Therefore, it allows structures to be built dynamically.

• Suffers from update anomalies.

Update Anomalies with XML

• Assume an XML database is used for storing information about crystallography experiments, as follows.

<experiment>
  <crystallographer>
    <name>John Smith</name>
    <designation>Scientist</designation>
    <address>...</address>
  </crystallographer>
  <startTime>...</startTime>
  <location>IUMSC</location>
  ...
</experiment>

• This results in storing redundant information.
– The address of John Smith will be the same for all experiments.
– What happens if he changes his address? Update all previous XML fragments?

• Solution: Normalize certain details, as in a relational DBMS.
– E.g., separate the address information from the experiment details and provide a link (reference) to an address document.

• However, in order to normalize, the schema must be known in advance.

• This is not possible when data gets added arbitrarily, without being compliant with any predefined schema.

• The user has to determine how to normalize the data.

• Solution: normalize everything, resulting in only attribute-value pairs. E.g.,

<experiment>
  <crystallographer ref="JohnSmith"/>
</experiment>

<JohnSmith>
  <name ref="John Smith"/>
</JohnSmith>

– This is very similar to the RDF model.

Need for a Distributed Triplestore

• Origination points

• Ownership

• Scalability
– Large numbers of triples.
• E.g., consider a table in an RDBMS with 15 columns. Migrating its data to a triplestore would produce 15 triples for each row of the table.
• There will also be:
– data from more than one table, and
– data that normally does not get stored in a database.
– This leads to scalability issues.
• E.g., querying would be slow; indices might often need to be fetched from or stored to disk.
– To go beyond the scalability limits of a single triplestore, triples need to be distributed across multiple triplestores.

Our Approach

• Clients access the triplestores via a mediator.

• The mediator maintains several indexes to facilitate efficient querying.

• When the mediator receives a query, it
– breaks the query down into several sub-queries, and
– finds out which triplestores are capable of responding to each sub-query.

• The indexes are mainly used to
– build a cost model for the querying, and
– eliminate triplestores that cannot contribute results for a given sub-query (see the sketch below).
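A minimal sketch of the elimination step, assuming a hypothetical node index that maps each resource to the triplestores containing it; a sub-query is sent only to its candidate stores.

#include <iostream>
#include <map>
#include <set>
#include <string>

struct SubQuery { std::string subjectResource; };        // simplified

using NodeIndex = std::map<std::string, std::set<int>>;  // resource -> store ids

std::set<int> candidateStores(const NodeIndex& idx, const SubQuery& q) {
    auto it = idx.find(q.subjectResource);
    return it == idx.end() ? std::set<int>{} : it->second;
}

int main() {
    NodeIndex idx{{"ns:crystallographer", {1, 3}}, {"ns:experiment", {2}}};
    for (const SubQuery& q : {SubQuery{"ns:experiment"}, SubQuery{"ns:crystallographer"}}) {
        std::cout << q.subjectResource << " ->";
        for (int s : candidateStores(idx, q)) std::cout << " store" << s;
        std::cout << "\n";
    }
}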


Types of Indexes at the Mediator

• Predicate Index
– Contains details about the predicates in each triplestore.
– Certain fields are used for cost estimation of sub-queries.

• Node Index
– Maintains a list of nodes in the triple graph, along with the triplestores in which each node exists.
– Contains only resources (e.g., ns:crystallographer); literals (e.g., “John Smith”) are not stored.
– Used to eliminate certain triplestores when sub-querying.

• Edge Index
– Two edge indexes are used, for outgoing and incoming edges respectively.
– Used to avoid querying triplestores that do not have the corresponding edges from or to them.

Future Work

• Minimize joins between triplestores.
– Identify frequent joins.
– Instruct the triplestores to redistribute their triples so that most future joins are performed locally.

• Avoid the extra network hop through the mediator by using a mediator cache.

• Consider network communication when estimating costs for the query plan.

Parallel XML Parsing

• Published in Grid 2006, CCGrid 2007, e-Science 2007, IPDPS 2008, ICWS 2008 (streaming), HiPC 2008 (streaming).

• With BU students Yinfei Pan and Ying Zhang.


Motivation

• XML has gained wide prevalence as a data format for input and output.

• Multicore CPUs are becoming widespread.
– There are plans for 100 cores.

• If you have 100 cores, and you are using only one of them to read and write your data, that could be a significant waste.

Parallel XML Parsing

• How can XML parsing be parallelized?
– Task parallelism.
– Pipeline parallelism.
– Data parallelism.

• Task parallelism.
– Multiple independent processing steps.
– The sauce for a dish can be made in parallel with the main part.

[Diagram: step 1 runs on core 1; steps 2A and 2B then run concurrently on cores 1 and 2; step 3 finishes on core 1.]

• Pipeline parallelism.
– Multiple stages, all simultaneously performed in parallel.
– If you are making two cakes (but only have one oven), you can start mixing the batter for the second cake while the first one is in the oven.

[Diagram: cores 1–3 run stages 1–3 as an assembly line; at each time step core 1 starts a new datum while cores 2 and 3 process earlier ones, e.g., stage 1 on data C, stage 2 on data B, stage 3 on data A.]

• Data parallelism.
– Divide the data up; process multiple pieces in parallel.

[Diagram: input chunks 1–3 are processed on cores 1–3 to produce output chunks 1–3, which are merged into the final output.]

But XML is Inherently Sequential

• How can a chunk be parsed without knowing what came before?

• The parser doesn't know what state to start in.

• One could scan forwards and backwards in various ways, but that is ad hoc and tricky.
– Special characters like < can appear in comments.

<element attr="value">content</element>

Previous work

• We used a fast, sequential preparse scan.
– It builds an outline of the document (the skeleton).
– The skeleton guides the full parse by first decomposing the XML document into well-formed fragments at well-defined, unambiguous positions.
– The XML fragments are parsed separately on each core using the libxml2 APIs.
– The results are merged into the final DOM with libxml2 APIs.

• The preparse is sequential, however, so Amdahl's law kicks in. We scale well to about 4 cores.

• So how can we parallelize the preparse?

Example: The Preparsing DFA

• The preparsing DFA has two actions, START and END, which are used to build the skeleton during execution of the DFA. (A simplified sketch follows the diagram below.)

[State diagram: an eight-state DFA (states 0–7) over characters such as <, >, /, !, quotes, and name characters; recognizing the start of an element name fires the START action, and closing a tag fires the END action.]
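A heavily simplified stand-in for the preparser (illustrative, not the published code): a five-state scanner that records START/END byte offsets as the skeleton, ignoring comments, quoting, and the other cases the real DFA handles.

#include <iostream>
#include <string>
#include <vector>

enum class State { Text, Open, StartName, Empty, EndTag };

struct SkeletonEntry { char kind; std::size_t offset; };  // 'S' = START, 'E' = END

std::vector<SkeletonEntry> preparse(const std::string& xml) {
    std::vector<SkeletonEntry> skel;
    State st = State::Text;
    for (std::size_t i = 0; i < xml.size(); ++i) {
        char c = xml[i];
        switch (st) {
        case State::Text:
            if (c == '<') st = State::Open;
            break;
        case State::Open:                     // just saw '<'
            if (c == '/') st = State::EndTag;
            else { skel.push_back({'S', i - 1}); st = State::StartName; }  // START
            break;
        case State::StartName:
            if (c == '/') st = State::Empty;
            else if (c == '>') st = State::Text;
            break;
        case State::Empty:                    // self-closing tag
            if (c == '>') { skel.push_back({'E', i}); st = State::Text; }  // END
            break;
        case State::EndTag:
            if (c == '>') { skel.push_back({'E', i}); st = State::Text; }  // END
            break;
        }
    }
    return skel;
}

int main() {
    for (auto e : preparse("<foo>sample</foo>"))
        std::cout << e.kind << "@" << e.offset << " ";  // prints: S@0 E@16
    std::cout << "\n";
}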


Example of running preparsing DFA

<foo>sample</foo>

[Trace: starting in state 0, the DFA steps through the string character by character, firing START as the opening tag name begins and END as each tag closes.]

How can this be parallelized?

Meta-DFA

• Goal
– Pursue all possible states simultaneously when a processor begins parsing a chunk.

• Achieved by:
– Transforming the original DFA into a meta-DFA whose transition function runs multiple instances of the original DFA in parallel via sub-DFAs.
– For each state q of the original DFA, the meta-DFA includes a complete copy of the DFA as a sub-DFA, which begins execution in state q at the beginning of the chunk.
– During actual execution, the meta-DFA transitions from one set of states to another set of states.

Output Merging

• Since the meta-DFA pursues multiple possibilities simultaneously, there are also multiple outputs when a chunk is finished.
– One corresponding to each possible initial state.

• We know definitively the state at the end of the first chunk.
– This is used to select which output of the second chunk is the correct one.
– The definitive state at the end of the second chunk is then known.
– And so on. (See the sketch below.)
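A runnable sketch of the meta-DFA plus merging, using the simplified five-state scanner from the earlier sketch: each chunk is scanned once from every possible start state (logically parallel sub-DFAs; in the real system the chunks are handled on separate cores), then the chunks are stitched left to right by selecting the copy whose start state matches the previous chunk's now-known end state.

#include <array>
#include <iostream>
#include <string>
#include <vector>

constexpr int NSTATES = 5;  // Text, Open, StartName, Empty, EndTag

int delta(int st, char c) {  // transition function of the simplified DFA
    switch (st) {
    case 0:  return c == '<' ? 1 : 0;                   // Text
    case 1:  return c == '/' ? 4 : 2;                   // Open
    case 2:  return c == '/' ? 3 : (c == '>' ? 0 : 2);  // StartName
    case 3:  return c == '>' ? 0 : 3;                   // Empty
    default: return c == '>' ? 0 : 4;                   // EndTag
    }
}

// Run one copy of the DFA from every possible start state; a full version
// would also collect the START/END output of each copy.
std::array<int, NSTATES> runChunk(const std::string& chunk) {
    std::array<int, NSTATES> end{};
    for (int q = 0; q < NSTATES; ++q) {
        int s = q;
        for (char c : chunk) s = delta(s, c);
        end[q] = s;
    }
    return end;
}

int main() {
    std::vector<std::string> chunks{"<foo>sam", "ple</fo", "o>"};
    int state = 0;  // the document is known to begin in the Text state
    for (const auto& ch : chunks) {
        auto end = runChunk(ch);
        state = end[state];  // keep the copy matching the known start state
    }
    std::cout << "final state: " << state << "\n";  // 0 (Text): well-formed
}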


Performance Evaluation

• Machine:
– Sun E6500 with 30 400-MHz UltraSPARC II processors
– Operating system: Solaris 10
– Compiler: g++ 4.0 with -O3
– XML library: libxml2 2.6.16

• Tests:
– We take the average of ten runs.
– The test file is from the well-known Protein Data Bank (PDB) project, sized to 34 MB.
– All speedups are measured against parsing with stand-alone libxml2.

• The full parsing process:
– First do a parallel preparse using a meta-DFA. This generates an outline of the document known as the skeleton.
– Then use techniques based on parallel depth-first tree search to parallelize the full parse.
– Subtrees of the document are parsed using unmodified libxml2 (as sketched below).


Preparser Speedup

• Speedup of the parallel preparser relative to the non-parallel preparser.

Speedup on parallel full parsing

• After applying our meta-DFA technique to parallelize the preparsing stage, the full parallel parse is now scalable.

Summary

• Data-parallel XML parsing is challenging because the parser does not know in which state to begin a chunk.
– One solution is to simply begin the parser in all states simultaneously.

• This can be achieved by modeling the parser as a DFA with actions, then transforming the DFA into a meta-DFA (a product machine).

• The meta-DFA runs multiple instances of the original DFA, one instance for each state of the original DFA.

• The number of states in the meta-DFA is finite, so it is also a DFA and can be executed by a single core.
– The parallelism of the meta-DFA is logical parallelism.

Future Work

• Parallelizing XPath.
– Significantly more challenging; but due to Amdahl's law, parsing must be parallelized first.

• Offload preparsing to an FPGA or perhaps a GPU.

Acknowledgements

• Grateful for the support provided by the NSF and the DOE for this work.
– NSF awards 0836667, 0753178, 0513687, and 0446298
– DOE award DE-FG02-07ER25803
