Metrics & monitoring: Understanding of workflows, DQ2 traces, file read fractions, …
Doug Benjamin (Duke University)

Understanding Analysis workflows

• User analysis is an interactive activity
  o Some steps are very repetitive
  o Much of the repetition is done
• Understanding the workflows requires getting information from users
  o Long-term goal: understand the pattern all the way down to the local analysis cluster
• User analysis workflow is evolving
  o Data volume in 2011 allowed for certain behavior
  o Data volume in 2012 requires changes from some users
  o Analyses are migrating from simple cut-and-count to more sophisticated ones (multivariate analyses)
    • Implies an increase in computing resources
  o Input data products evolve
    • AOD to D3PD mix changes with time

User analysis interviews

• Recently conducted in-depth interviews with over 20 people
  o Goal: understand how people actually work
  o Develop a series of questions that can be put into the US ATLAS analysis survey and the ATLAS-wide distributed analysis survey (both will go out before Software and Computing week in mid-October)
• General themes are emerging, but there are many different solutions
  o Solutions vary from person to person
  o And from task to task
• Some efforts require AODs and others group D3PDs
• Most users work both on the grid and off it
  o ~10% of user datasets are used as inputs to further processing steps on the grid

Some basic workflows

• On the grid, process AODs to produce private ntuples. Private ntuples are fetched to local computing (usually with dq2-get)
• On the grid, process AODs to produce histograms; the histograms are fetched to local computing
• On the grid, use group-produced D3PDs to skim events by simple cuts; bring the skimmed D3PDs to local computing
  o Data volume in 2012 makes this almost impossible
• On the grid, use group-produced D3PDs to skim events and slim the D3PDs by dropping branches. Pull the results to local computing for further analysis
• On the grid, input histogram files, run pseudo-experiments, and output histogram files for further review
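
The difference between skimming (dropping events with simple cuts) and slimming (dropping branches) can be illustrated with a minimal sketch. It works on in-memory dictionaries rather than real D3PD files, and the branch names and cut values are made up for illustration; they are not taken from any ATLAS analysis.

```python
from typing import Callable, Dict, Iterable, List

# One event is modelled as a mapping from branch name to value.
Event = Dict[str, float]


def skim(events: Iterable[Event], keep: Callable[[Event], bool]) -> List[Event]:
    """Skimming: keep only the events that pass a selection."""
    return [ev for ev in events if keep(ev)]


def slim(events: Iterable[Event], branches: Iterable[str]) -> List[Event]:
    """Slimming: keep only the named branches of every event."""
    wanted = set(branches)
    return [{name: value for name, value in ev.items() if name in wanted}
            for ev in events]


# Hypothetical events; branch names and values are illustrative only.
events = [
    {"lep_pt": 32.0, "met": 45.0, "jet_n": 3.0, "unused_var": 1.0},
    {"lep_pt": 12.0, "met": 80.0, "jet_n": 2.0, "unused_var": 2.0},
    {"lep_pt": 55.0, "met": 10.0, "jet_n": 4.0, "unused_var": 3.0},
]

# Simple cuts (assumed thresholds), then keep two branches.
selected = skim(events, lambda ev: ev["lep_pt"] > 25.0 and ev["met"] > 20.0)
reduced = slim(selected, ["lep_pt", "met"])
```

Skimming reduces the number of events while slimming reduces the size of each event, which is why the combined workflow remains practical when skimming alone no longer is.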

Workflows at CERN

• EOS system highly successful; popular with users
• On Sept 8th, almost 700 users had EOS space, close to 50% of the people using Panda on the grid
• During the past two months, almost 900 TB of data were read from user space within EOS
• EOS can deliver files over the WAN
• Interesting workflow at CERN: dq2-get from lxbatch nodes
  o Talked with one user to understand why he/she is doing it
  o Doing trigger studies: the code is only at CERN and the data are elsewhere; it is faster to fetch the data to a batch node than to transfer the data via DATRI to scratch disk

Workflow conclusions

• Physicists are intelligent and creative people
  o They have a job to do, so they will use creative ways to get the job done
• There is no one way to work; any system must be adaptable
• Increasing data volumes will provide challenges this year and in the future
• Wide-area access and caching of the data could and should be part of the solution

DQ2 trace analysis

• DQ2 trace data dumps, provided by Thomas Beerman, were vital for the results presented here
  o Have dumps for the past two weeks and for April, May, June and July 2012

Analysis from the last 2 weeks of activity

• At CERN
  o 6 users use dq2-get on lxbatch
    • Datasets not at CERN, 20418 dq2-get requests, 666 unique datasets
  o 457 users use dq2-get on lxplus or other machines at CERN (90 US people stationed at CERN, 19.7% of users)
    • 161 TB transferred (11.5 TB by US users)
    • 24559 requests (925 by US users), 20500 unique datasets, mostly user datasets
• Activity away from CERN
  o 25210 requests, ~70 TB, 428 users, 355 nodes (US numbers: 15.7 TB, 72 nodes, 101 users)
  o Single site with the most activity: Univ. of Chicago Tier 3 (they also have a Tier 2 there)
• In July, 168 sites (away from CERN) received data via dq2-get
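
Numbers of this kind (requests, users, unique datasets, nodes, volume) can be obtained by a simple aggregation over trace records. The sketch below is only an illustration: the `Trace` fields, host names and dataset names are hypothetical and do not reflect the actual layout of the DQ2 trace dumps. The last line checks the 19.7% US share quoted above from the slide's own numbers (90 of 457 users).

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Trace:
    """One dq2-get request; the field names are assumptions for this sketch."""
    user: str
    dataset: str
    node: str
    nbytes: int


def summarize(traces: Iterable[Trace]) -> Dict[str, float]:
    """Count requests, distinct users/datasets/nodes, and total volume in TB."""
    records = list(traces)
    return {
        "requests": len(records),
        "users": len({t.user for t in records}),
        "datasets": len({t.dataset for t in records}),
        "nodes": len({t.node for t in records}),
        "terabytes": sum(t.nbytes for t in records) / 1e12,
    }


# Hypothetical records, for illustration only.
sample = [
    Trace("alice", "user.alice.ntuple.v1", "lxplus001", 2_000_000_000_000),
    Trace("alice", "user.alice.ntuple.v1", "lxplus002", 1_000_000_000_000),
    Trace("bob", "group.d3pd.sample", "t3node01", 500_000_000_000),
]
summary = summarize(sample)

# Cross-check of the slide's figure: 90 US users out of 457 users at CERN.
us_share_percent = round(100 * 90 / 457, 1)
```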

How much data do users read in D3PDs?

• Centrally produced D3PDs are a very important part of most users' analysis activities
• Group D3PDs are the OR of all possible variables
  o Many thousands of branches, and users only use a few hundred
  o Activity started on many different fronts to determine which branches are really being used
• Using the information collected by Sarah Williams (Indiana University, MWT2) and published on the web
• Can see what is happening from the storage system's point of view (due to direct reads over the LAN); extremely helpful, though incomplete
• EOS team at CERN collected similar information
  o Joint effort between ATLAS and CERN IT to understand the information

[Figure: Fraction of file read, May ’12]

[Figure: Fraction of file read, Aug/Sep ’12]

Integral (removing the 100%-read peak, which corresponds to file transfers)

• 85% of all reads of group D3PDs read less than 20% of the file
• 76% of all reads of user files read less than 20% of the file
• Files are full of unread data
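
The integral quoted above can be sketched as follows: discard the reads that cover the whole file (treated as file transfers), then take the share of the remaining reads that touch less than 20% of the file. The function name and the sample values are hypothetical; only the 20% threshold and the removal of the 100% peak come from the slide.

```python
from typing import Sequence


def fraction_below(read_fractions: Sequence[float],
                   threshold: float = 0.20,
                   drop_full_reads: bool = True) -> float:
    """Share of reads that read less than `threshold` of the file.

    Reads with a fraction of 1.0 or more are treated as whole-file
    transfers and removed first when `drop_full_reads` is True.
    """
    kept = [f for f in read_fractions
            if not (drop_full_reads and f >= 1.0)]
    if not kept:
        return 0.0
    return sum(1 for f in kept if f < threshold) / len(kept)


# Made-up per-read fractions (bytes read / file size), illustration only.
reads = [0.05, 0.10, 0.15, 0.50, 1.0, 1.0]
share = fraction_below(reads)
share_with_transfers = fraction_below(reads, drop_full_reads=False)
```

Leaving the whole-file transfers in the sample lowers the share, which is why the peak at 100% is removed before integrating.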

Efforts to understand what users are doing

• Users are productive (proof: the Higgs result and the impressive paper output of ATLAS)
• Use of prun, while very effective, is missing critical monitoring features
  o Cannot determine what users are actually reading
• Simple changes to existing user code can result in significant speed increases
  o Important to start from the user code rather than telling users from on high how to change what they are doing
  o Want evolutionary change
  o Do not want to cause users to be unproductive
• In the US, started an activity to review user code and work to speed it up
  o Helps the users with faster code, and we see what variables they are using

D3PD analysis wiki

• Helped at least 5 users see a 3-4 times increase in processing speed
• Tool for others to read and speed up their own code
• As part of the message from Panda, plan to tell users their processing speed (Hz) and compare it to other jobs using the same dataset

https://twiki.cern.ch/twiki/bin/view/Atlas/D3PDanalysisOptimizations
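
One way the planned speed report could be computed is sketched below: the processing speed is events per second of wall time, and a job is compared with the median speed of other jobs on the same dataset. The use of the median, the function names and the numbers are all assumptions made for this sketch; the slides do not say how Panda would make the comparison.

```python
from statistics import median
from typing import Sequence


def rate_hz(n_events: int, wall_seconds: float) -> float:
    """Processing speed in events per second (Hz)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return n_events / wall_seconds


def relative_speed(my_rate: float, other_rates: Sequence[float]) -> float:
    """Ratio of this job's speed to the median speed of comparable jobs."""
    return my_rate / median(other_rates)


# Hypothetical jobs that all read the same 120,000-event dataset.
mine = rate_hz(120_000, 600.0)
others = [
    rate_hz(120_000, 150.0),
    rate_hz(120_000, 200.0),
    rate_hz(120_000, 300.0),
]
ratio = relative_speed(mine, others)
```

A ratio well below 1 would flag a job whose code is a candidate for the kind of optimizations collected on the wiki.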

Conclusions

• Users are creative, adaptive and goal oriented; the primary goal is to get the science out
• Increasing data volumes present a challenge (including this year)
• WAN access could be a good tool for the users; it needs to be trivial to use and have good performance
• Cannot expect radical changes in user code; change must be evolutionary to really happen
• Can partial event caching and block IO help us?
• Need better monitoring to increase the information density of the files being used; want them to contain the data that users really want and little else