18
Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining Storage Management and Data Mining in High Energy Physics Applications in High Energy Physics Applications Arie Shoshani Arie Shoshani Doron Rotem Doron Rotem Henrik Nordberg Henrik Nordberg Luis Bernardo Luis Bernardo (http:/gizmo.lbl.gov/DM.html) (http:/gizmo.lbl.gov/DM.html) Scientific Data Management R&D Group Scientific Data Management R&D Group Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory October , 1997 October , 1997

Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Embed Size (px)

Citation preview

Page 1: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

1

Storage Management and Data Mining in Storage Management and Data Mining in High Energy Physics ApplicationsHigh Energy Physics Applications

Arie ShoshaniArie Shoshani

Doron RotemDoron Rotem

Henrik NordbergHenrik Nordberg

Luis BernardoLuis Bernardo(http:/gizmo.lbl.gov/DM.html)(http:/gizmo.lbl.gov/DM.html)

Scientific Data Management R&D GroupScientific Data Management R&D Group

Lawrence Berkeley National LaboratoryLawrence Berkeley National Laboratory

October , 1997October , 1997

Page 2: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

2

Events Clustering and Access: Main ComponentsEvents Clustering and Access: Main Components

Tapereorganization

module

MultidimensionalClusteringAnalyzer

Clusterindex

AccessPatterns

Input Datasetdescription

HardwareCharacteristics

ClusteringSpec

ApplicationProgram

StorageManager

Originalinput DST

Restructuredoutput

DST

data path

interactionpath

n-tupleProperties

Data Manager’sWorkbench

To disk cache

M-dimqueryrequest

extractedmicro-DST

Page 3: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

3

Main TasksMain Tasks

° Discover event clusters– based on natural distribution - Data Mining– based on access patterns - consult physicists– simulate performance - data manager’s workbench

• Manage cluster access– given a query, determine clusters to access,

use multi-dimensional indexes to select events that qualify

• Reorganize DST tapes according to clusters– long process - done initially, then rarely– flow control - restart after interruption

• Cache management– determine if in cache, which incremental clusters to cache,

which clusters to purge from cache

Page 4: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

4

Discover events clustersDiscover events clusters

• Top down approach– partition each dimension into “bins”

(e.g. 1-2 GEV, ..., 1-3 pions, ...)– select subset of dimensions based on

physicist’s experience– analyze which events fall into the same “cell”

(i.e. m-dim rectangles formed by the bins)– eliminate empty cells– combine cells to form similar size clusters

• Assumption– most queries are “range queries”

Page 5: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

5

Top Down Cell ManagementTop Down Cell Management

• Assume: 7 dimensions, 10 bins each

-- Number of cells: 107, 4 byte counters

-- Number of bytes: 4x107, 40 MB

• For e.g. small dataset 97% of cells are empty:

-- store only populated cells

-- use hash tables to locate existing cells

-- use 2 bytes for bin_id per property: ratio for p% full is: 200/ (n+2)p

-- No of bytes: for 7 dim, 3% => 5.4 MB

• Sort cells by size (number of events)

no ofevents

cell #

..... .

..... .

..... .

..... .

..... .

..... .

..... .

..... .

..... .

Page 6: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

6

Cluster discoveryCluster discovery

• sort cells by size

• pick larger cell to start forming a cluster

• find all neighbors of “Manhattandistance” equal to 1

• include cells above a threshold

• iterate for all cells in cluster

• when no more cells above threshold,pick larger remaining cell and start forming a new cluster

• Display cluster distributions

..... .

..... .

..... .

..... .

..... .

..... .

..... .

..... .

..... .

Page 7: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

7

The effect of bin sizes on events distributionThe effect of bin sizes on events distribution

Distribution of events over cells

0

100

200

300

400

500

600

700

800

1 51 101 151

Cell #

Nu

mb

er

of

Ev

en

ts

1331 cells

3328 cells

7056 cells

33,046 cells

202,581 cells

1,030,301 cells

8,120,601 cells

random data

Page 8: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

8

Blowup of cell histogramBlowup of cell histogram

Distribution of events over cells

0

100

200

300

400

500

600

700

800

1 11 21 31 41

Cell #

Nu

mb

er o

f E

ve

nts

1331 cells

3328 cells

7056 cells

33,046 cells

202,581 cells

1,030,301 cells

8,120,601 cells

random data

Page 9: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

9

Cluster StabilityCluster Stability

number of cells

number of clusters

bend

Flatregion

• We apply the clustering algorithm whileincreasing the number of bins (and thus the number of cells)

• Initially the number of clustersis small (cells are too large)

• The flat region shows stabilityin the number of clusters

• When cells are too smallthe number of cell grows

• The lower bend is the largestcell size that shows clusters

Page 10: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

10

Cache ManagementCache Management

Hardware scenarios

tape library disk RAID

HPSS

CPUfarm

tape library

parallel disk system

HPSS

CPUfarm

scenario A scenario B

Page 11: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

11

Cache Management IssuesCache Management Issues

• Scenario A– RAID more expensive than a Parallel Disk System

(factor 2-3)– but, rely on HPSS to manage disk– storage management simplified

• Scenario B– a Parallel Disk System is cheaper, does not depend

on RAID vendor– but, need to manage disk allocation– has control over placement of events on cache

• Planned initial pilot– Scenario A under NERSC

Page 12: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

12

Situation at NERSCSituation at NERSC

• To test scenario A (will be done first)– tertiary storage AND cache under HPSS– use for “event reorganization module” only– dedicated experimental environment

• rs6000 connected to HPSS, dedicated RAID partition• experiment with “stage”, “migrate”, and “purge”

• To test scenario B (later)– tertiary storage under HPSS, cache: DPSS– link not set up yet

Page 13: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

13

Situation at NERSC (Con’d)Situation at NERSC (Con’d)

• To test Analysis Framework communication with Storage Manager– go over the net from PDSF to NT-machine using

CORBA

• To test Analysis Framework interacting with a cache (which is controlled by the Storage Manager)– can’t run analysis framework on T3E and C90– must use PDSF - DPSS

Page 14: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

14

QueryEstimator

CacheManager

QueryMonitor

index

cache map

(1) Query estimation request

(2) Query execution request

(3) Notify that cell was cached or removed

(5) Finished with cell / abort

(6) Ask if finished with cell

(4) Notify that cell was cached

ObjectManager

QueryObject

EventIterator

Analysis Framework

(7) Request to re-cache cell

Storage Manager

Communication between Communication between The Analysis Framework and the Storage ManagerThe Analysis Framework and the Storage Manager

Page 15: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

15

Typical ScenarioTypical Scenario

• Query Estimation request (can be repeated many times)

• Query Execution request

• Query Monitor makes decision which cells to cache

• Cache Manager checks if cell is in cache

• If so notifies process (event iterator)

• If not it schedules a “stage cell”, “purge” unused cells

• When cells is cached, Cache Manager notifies Object Manager, as well as Event Iterator

• Event Iterator notifies Query Monitor when it is done with a cell or when it aborts

Page 16: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

16

Process MisbehaviorProcess Misbehavior

• Use of a time out mechanism by the Query Monitor

• Query Monitor asks Event Iterator “are you alive”

• If no response within some time limit, all cells are removed for this process (if no other processes need them)

• Event Iterator can request a re-cache of a cell

• => status of queries that are “non-responsive” is saved for a period of time

Page 17: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

17

Query Estimator ResponseQuery Estimator Response

• Approximate Estimate (quick response)(index in memory)– No_of_events: (min,max) -- nearest bin boundaries– no_of_cells to be cached– total_MBs_to_be_moved– %_of_events_in_cells that qualify for a query (max)– no_of_events_in_cache– time_to_process_query

• Precise estimate (slower response)(Index on disk)– Precise No_of_events that qualify– ALL other measures the same

Page 18: Computer Science Research and Development Department Computing Sciences Directorate, L B N L 1 Storage Management and Data Mining in High Energy Physics

Computer Science Research and Development Department Computing Sciences Directorate, L B N L

18

Cache Management StrategyCache Management Strategy

• Number of cells to fetch ahead - parameter– e.g. 2 cells => new cell cached only after process

informs that it is finished with one of the 2 cells

• Priority given to cells needed by multiple processes

• Priority given to cells that reside on the same tape

• Algorithm should not starve a process