John Wu Searching Large Scientific Data John Wu Scientific Data Management Lawrence Berkeley National Laboratory


Page 1

Searching Large Scientific Data

John Wu

Scientific Data Management

Lawrence Berkeley National Laboratory

Page 2

Outline

• Highlight of Accomplishments

• Grid Collector (accelerate others’ work)

• Query-Driven Visualization (enabling a new way of knowledge discovery)

• Molecular docking (enabling others to accomplish great things)

• Outlook

• More complex searches

• Parallelization

• Supporting more data formats

• Integration with large framework

Page 3

FastBit In a Nutshell

• FastBit is designed to search multi-dimensional append-only data

• Conceptually in table format

• rows represent objects

• columns represent attributes

• FastBit uses vertical (column-oriented) organization for the data

• Efficient for searching

• FastBit uses bitmap indices with our compression method

• Proven in analysis to be optimal for one-dimensional queries

• Faster than other optimal indexes for multi-dimensional queries

[Figure: a table with its rows and columns labeled]

[Wu, Otoo, Shoshani 2006]
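
The ideas on this slide can be illustrated with a minimal sketch, which is not FastBit's implementation. It assumes a tiny column-oriented table, uses plain Python integers as uncompressed bitmaps (FastBit compresses its bitmaps, which this sketch does not attempt), and all names and data (`build_bitmap_index`, `range_query`, `energy`, `charge`) are hypothetical.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Equality-encoded bitmap index: one bitmap per distinct value.

    Bit r of index[v] is set when row r of the column holds value v.
    Python integers stand in for the bitmaps (no compression here).
    """
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)

def range_query(index, low, high):
    """OR together the bitmaps of all values v with low <= v <= high."""
    result = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            result |= bitmap
    return result

def rows_of(bitmap):
    """List the row numbers whose bits are set, in increasing order."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Column-oriented table: each attribute is stored as its own list.
table = {
    "energy": [3, 7, 7, 1, 5, 3],
    "charge": [0, 1, 1, 0, 1, 0],
}
indexes = {name: build_bitmap_index(col) for name, col in table.items()}

# Multi-dimensional query: 3 <= energy <= 7 AND charge == 1.
# Each condition is answered from its own index; a bitwise AND combines them.
hits = range_query(indexes["energy"], 3, 7) & range_query(indexes["charge"], 1, 1)
matching_rows = rows_of(hits)
```

The point of the design is that each one-dimensional condition touches only its own column's index, and the conditions are combined with cheap bitwise operations.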

Page 4

Motivation

• Scientific datasets are getting larger fast

• Most data analysis algorithms cannot handle a whole dataset

• Therefore, most data analysis tasks are performed on a subset of the data

• Some examples of searches

• Find the collision events with the most distinct features of Quark-Gluon Plasma from a high-energy physics experiment

• Find and track ignition in a combustion simulation

• Identify the puppet-master behind a distributed denial-of-service attack on a computer network

Page 5

Highlight 1 – Grid Collector

• Searching over billions of objects with hundreds of attributes each:

• Distributed analysis over the Grid

• Make petabytes of raw data available for worldwide analyses

• Benefits of the Grid Collector:

• Transparent object access, select objects based on their attributes

• Improvement of analysis system's throughput

• Best Paper Award (ISC'05) [Wu, Gu, Lauret, Poskanzer, Shoshani, Sim and Zhang 2005]

Page 6

Grid Collector Speeds up Analyses

[Figure: two plots of speedup versus selectivity for Sample 1, Sample 2, and Sample 3; one with selectivity from 0 to 1 and speedup from 0 to 5, the other with selectivity from 0.00001 to 1 and speedup from 1 to 1000]

• Test machine: 2.8 GHz Xeon, 27 MB/s read speed

• When searching for rare events, say, selecting one event out of 1000, using GC is 20 to 50 times faster

• Using GC to read 1/2 of the events, speedup > 1.5; 1/10 of the events, speedup > 2

• Bottom line: improve the throughput of data analyses!

Page 7

Highlight 2 – Visualization

• Query-Driven Visualization: collaboration between SDM and VACET

• Use FastBit indexes to efficiently select the most interesting data for visualization

• Above example: laser wakefield accelerator simulation

• VORPAL produces 2D and 3D simulations of particles in a laser wakefield

• Finding and tracking particles with large momentum is key to designing the accelerator

• The brute-force algorithm is quadratic (taking 5 minutes on 0.5 million particles); FastBit time is linear in the number of results (taking 0.3 s, a 1000 X speedup)
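
The difference between the quadratic and the index-based approach to tracking can be sketched as follows. This is an illustration only: a Python dictionary stands in for an index on particle IDs, the selection step is a plain scan for brevity, and the function names and the two tiny time steps are made up.

```python
def select_ids(ids, momentum, threshold):
    """Return the IDs of particles whose momentum exceeds the threshold."""
    return [pid for pid, p in zip(ids, momentum) if p > threshold]

def track_brute_force(selected, ids_at_step):
    """Scan the whole time step once per selected ID (quadratic overall)."""
    found = {}
    for pid in selected:
        for row, other in enumerate(ids_at_step):
            if other == pid:
                found[pid] = row
                break
    return found

def track_with_index(selected, id_index):
    """Look up each selected ID in a prebuilt index (linear in the results)."""
    return {pid: id_index[pid] for pid in selected if pid in id_index}

# Hypothetical data for two time steps of a simulation.
ids_t0 = [10, 11, 12, 13, 14]
momentum_t0 = [0.2, 9.5, 0.1, 8.7, 0.3]
ids_t1 = [14, 13, 12, 11, 10]  # same particles, stored in a different order

selected = select_ids(ids_t0, momentum_t0, threshold=5.0)

# The index is built once per time step and reused for every lookup.
id_index_t1 = {pid: row for row, pid in enumerate(ids_t1)}
rows_indexed = track_with_index(selected, id_index_t1)
rows_scanned = track_brute_force(selected, ids_t1)
```

Both functions return the same rows; only the amount of work differs as the number of particles grows.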

Page 8

Bin-Based Parallel Coordinate Display

• Integrate FastBit with H5Part, an HDF5 package for particle physics data

• Use FastBit to compute histograms efficiently

• Bin-based parallel coordinate display reduces the number of lines displayed on screen, reduces visual clutter, reduces response time

• FastBit further speeds up the response time
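
One way an index can help with histograms is sketched below, under the assumption of a binned bitmap index with one bitmap per bin. The histogram is then a count of set bits per bitmap, and a conditional histogram needs only a bitwise AND with the selection. The helper names, bin edges, and data are hypothetical, and Python integers again stand in for compressed bitmaps.

```python
def build_binned_index(column, edges):
    """One bitmap per bin; bin i covers edges[i] <= value < edges[i + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(column):
        for i in range(len(bitmaps)):
            if edges[i] <= value < edges[i + 1]:
                bitmaps[i] |= 1 << row
                break
    return bitmaps

def histogram(bitmaps, selection=None):
    """Count set bits per bin, optionally restricted to the selected rows."""
    counts = []
    for bitmap in bitmaps:
        if selection is not None:
            bitmap &= selection
        counts.append(bin(bitmap).count("1"))
    return counts

# Hypothetical particle data: position x and momentum px.
x = [0.5, 1.5, 2.5, 1.2, 0.1, 2.9]
px = [5.0, 1.0, 7.0, 8.0, 0.5, 9.0]

x_bins = build_binned_index(x, edges=[0.0, 1.0, 2.0, 3.0])
px_bins = build_binned_index(px, edges=[0.0, 4.0, 10.0])

# Histogram of x over all particles, read directly from the index.
all_counts = histogram(x_bins)

# Histogram of x restricted to particles with 4 <= px < 10.
high_px_counts = histogram(x_bins, selection=px_bins[1])
```

Once the index exists, neither histogram reads the raw columns again.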

Page 9

FastBit Speeds up Histogramming

• Time needed to compute desired histograms

• Custom code that uses the raw data directly

• FastBit can be 1000 X faster than the custom code (left)

• FastBit maintains the performance advantage on a parallel system

[Figure: timing plots, lower is better; annotation: ~ 104 X]

Page 10

Highlight 3 – Molecular Docking

• Jochen Schlosser, Center for Bioinformatics, University of Hamburg

• Application: Structure-based virtual screening (ACS Fall 2007)

[Figure: n ligands are matched with the cavity of one target protein in n docking runs, producing a hit list]

Hit list:

Name   Score
1bef   -16,4
4dab   -12,3
4d2a   -11,6
…      …

Standard approach: match every ligand with every target protein

New approach: using FastBit indexes to avoid brute-force matching

Page 11

Use of FastBit for Molecular Docking

Method

• Specification of the descriptor as triangle geometry

• Types of interaction centers

• Triangle side lengths

• Interaction directions

• 80 bulk dimensions

• Receptors

• Receptor descriptors are generated similarly

• Using complementary information where necessary

• Use of pharmacophore constraints on receptor triangles

• Reduces number of queries

• Improved query selectivity because the pharmacophore tends to be inside the protein cavity
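
A heavily simplified sketch of a triangle descriptor and of matching as a multi-dimensional range query follows. It keeps only the interaction center types and the sorted side lengths; the interaction directions, the 80 bulk dimensions, and the complementary receptor information mentioned above are left out. The function names, type labels, coordinates, and tolerance are assumptions for illustration, not the descriptor used in the actual work.

```python
import math
from itertools import combinations

def triangle_descriptor(centers):
    """Build a descriptor from three (type, (x, y, z)) interaction centers.

    Returns the sorted center types and the sorted side lengths.
    """
    types = tuple(sorted(t for t, _ in centers))
    sides = tuple(sorted(math.dist(a, b)
                         for (_, a), (_, b) in combinations(centers, 2)))
    return types, sides

def matches(query, candidate, tolerance):
    """Equality on the types plus one range condition per side length."""
    q_types, q_sides = query
    c_types, c_sides = candidate
    return q_types == c_types and all(
        q - tolerance <= c <= q + tolerance
        for q, c in zip(q_sides, c_sides))

# Hypothetical triangles; coordinates are in arbitrary length units.
ligand = triangle_descriptor([
    ("acceptor", (0.0, 0.0, 0.0)),
    ("donor", (3.0, 0.0, 0.0)),
    ("hydrophobic", (0.0, 4.0, 0.0)),
])
similar = triangle_descriptor([
    ("donor", (3.1, 0.0, 0.0)),
    ("acceptor", (0.0, 0.0, 0.0)),
    ("hydrophobic", (0.0, 4.0, 0.0)),
])
different = triangle_descriptor([
    ("donor", (3.0, 0.0, 0.0)),
    ("donor", (0.0, 0.0, 0.0)),
    ("hydrophobic", (0.0, 4.0, 0.0)),
])

is_hit = matches(ligand, similar, tolerance=0.2)
is_miss = matches(ligand, different, tolerance=0.2)
```

Each side-length condition is a range condition on one attribute, which is the kind of query the slide describes bitmap indexes as answering well.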

Page 12

Use of FastBit for Molecular Docking

Method

• Indexing system

• Properties of the problem:

• Billions of descriptors (~ 1,000 for each ligand)

• High dimensional query

• Properties of bitmap indexes

• Well suited for this kind of query

• Can be run stand alone

• Further compression possible

• FastBit uses compression

Bitmap index for attribute(i):

        [0]  ...  …    …    [n]
desc1    0    1    0    0    0
desc2    0    0    0    1    0
desc3    0    1    0    0    0
desc4    0    0    0    0    1
desc5    1    0    0    0    0

Results

TrixX-BMI is an efficient tool for virtual screening, with average runtime in the sub-second range

Screens libraries of ligands 12 times faster than FlexX without pharmacophore constraints

With pharmacophore constraints, the speedup is 140 to 250

Page 13

Outline

• Highlight of Accomplishments

• Grid Collector

• Query-Driven Visualization

• Molecular docking

• Outlook

• More complex searches

• Parallelization

• Supporting more data formats

• Integration with large framework

Page 14

Complex Searches

• So far, FastBit software primarily handles range queries of the form “pressure > 105 and temperature between 800 and 1000”

• Need to support complex types of searches

• GTC data analysis: find all particles with certain energy level that have passed through a region with specified properties on the electric field

• Network security: find the hosts that have contacted all identified drones within an hour of the start of an attack

• Protein sequences: identify known proteins with specified molecular weight

• Catalog matching: matching records of stars and galaxies from one survey / simulation to another one

• Subqueries: searching the results of previous searches
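
To show why such searches go beyond a range query, here is a minimal sketch of the network security example written with plain Python sets rather than any index. It assumes connection records of the form (timestamp in seconds, source host, destination host) and reads "within an hour of the start" as at most 3600 seconds before or after it; the function name and the records are hypothetical.

```python
def hosts_contacting_all(connections, drones, start, window=3600):
    """Return the hosts that contacted every drone within the time window.

    connections: iterable of (timestamp, source_host, destination_host).
    """
    drones = set(drones)
    contacted = {}
    for timestamp, src, dst in connections:
        # First a range condition on time and a membership test on the target.
        if dst in drones and abs(timestamp - start) <= window:
            contacted.setdefault(src, set()).add(dst)
    # Then the "contacted all" part: a host qualifies only if the set of
    # drones it reached equals the full set of identified drones.
    return sorted(h for h, seen in contacted.items() if seen == drones)

# Hypothetical connection log.
connections = [
    (100, "h1", "d1"),
    (200, "h1", "d2"),
    (300, "h2", "d1"),
    (400, "h2", "x"),
    (5000, "h3", "d1"),
    (5100, "h3", "d2"),
]

suspects = hosts_contacting_all(connections, drones=["d1", "d2"], start=0)
```

The filtering step is an ordinary range query, but the final step compares groups of records per host, which a single range condition cannot express.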

Page 15

Complex Searches

• Extending the histogramming functionality: group by, top-k, automatic computation of derived fields

• Implement join algorithms

• Existing bitmap indexes are efficient for filtering out the desired records for common join algorithms such as sort-merge join

• Existing bitmap index based join algorithms appear promising from back-of-envelope calculations

• A* algorithm: for problems such as neighborhood expansion, formulating them as joins may not be as efficient as using alternative search algorithms, such as A*
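
The filter-then-join pattern mentioned above can be sketched as follows. The filtering is a list comprehension standing in for an index lookup, followed by a textbook sort-merge join on a shared key. This is not FastBit code, and the catalog records and names are made up.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, payload) pairs on equal keys."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left record with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out

# Hypothetical catalogs: (object id, magnitude) and (object id, label).
survey_a = [(3, 18.5), (1, 21.0), (2, 19.2), (5, 17.0)]
survey_b = [(2, "b2"), (3, "b3"), (3, "b3x"), (4, "b4")]

# Filtering step: keep only the records the join actually needs.
bright = [(key, mag) for key, mag in survey_a if mag < 20.0]

matched = sort_merge_join(bright, survey_b)
```

Filtering first shrinks the inputs that have to be sorted and merged, which is where an index can pay off before the join itself starts.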

Page 16

Parallelization

• For I/O dominated tasks,

• Take advantage of parallel I/O system, PVFS

• Better data layout to effectively utilize the I/O hardware

• Active Storage, In-Situ data processing

• For CPU dominated tasks,

• Devise new algorithms, e.g., parallel join algorithms, new join indexes

• Algorithms for GPU, Cell processor, and many-core architectures

Page 17

More Data Formats

• Working with application specialists to integrate FastBit with their data libraries

• H5Part: HDF5

• ROOT (?)

• ADIOS

• Restructure FastBit to make it easier to work with different data formats

• Virtualize data sources

Page 18

Integrated Data Analysis Framework

• Iterator for coarse grain data

• Examples: ROOT and Map-Reduce

• Indexing provides a way to implement a “smart iterator”, e.g., Grid Collector for STAR data analysis framework (using ROOT)

• Framework for fine grain data

• Tighter integration with programmatic API

• Provide scripting support for productivity layer (end user)
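
The "smart iterator" idea can be sketched with a Python generator that consults an index and yields only the matching events, so the analysis loop never sees the rest. This is an illustration under assumed names and data (`build_value_index`, `smart_iterator`, `n_tracks`); it is not the Grid Collector or ROOT API.

```python
def build_value_index(events, attribute):
    """Map each attribute value to the list of rows that hold it."""
    index = {}
    for row, event in enumerate(events):
        index.setdefault(event[attribute], []).append(row)
    return index

def plain_iterator(events):
    """Baseline: hand every event to the analysis code."""
    for event in events:
        yield event

def smart_iterator(events, index, predicate):
    """Yield only the events whose indexed value satisfies the predicate."""
    hit_rows = sorted(row
                      for value, rows in index.items() if predicate(value)
                      for row in rows)
    for row in hit_rows:
        yield events[row]

# Hypothetical event records.
events = [
    {"id": 0, "n_tracks": 120},
    {"id": 1, "n_tracks": 950},
    {"id": 2, "n_tracks": 40},
    {"id": 3, "n_tracks": 700},
]
index = build_value_index(events, "n_tracks")

# The analysis loop stays the same; only the iterator changes.
selected_ids = [e["id"] for e in smart_iterator(events, index, lambda n: n >= 500)]
all_ids = [e["id"] for e in plain_iterator(events)]
```

Because the selection happens inside the iterator, existing analysis code can benefit without being rewritten.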

Page 19

Indexes Facilitate Smart Analysis

Indexes go here!

Or

How to make your system smarter!