
Virtual Logbooks and Collaboration
in Science and Software Development

Dimitri Bourilkov, Vaibhav Khandelwal, Archis Kulkarni, Sanket Totala

University of Florida

IPAW’06
International Provenance and Annotation Workshop

Chicago, USA, May 3, 2006

D. Bourilkov et al. Virtual Logbooks and Collaboration

Virtual Logbook Motivations

• Data track-ability and result audit-ability: "Virtual Logbook"

• In the nature of science

• Reproducibility of / confidence in results

• Individuals explore other scientists’ work and build from it

• Different teams can work in a modular, semi-autonomous fashion: reuse previous data/code/results or entire analysis chains

• Newcomers get a flying start

• Make group knowledge easily accessible 24/7

• Keep project history

Consider a researcher A working on her terminal

A’s session includes the commands and scripts executed, the output produced, the environment changes made and any other information involved in achieving a particular goal.

All such sessions of A can be logged easily using CODESH and reproduced later.

Consider a collaborator B working remotely

B can extract all of A’s sessions.

B can modify some of A’s sessions and log them again, and can also log sessions of his own.

This extends to a fully distributed system of many collaborators, using either private or shared repositories.
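The A/B workflow above can be sketched in a few lines of Python. This is only an illustration of the idea, not the CODESH implementation: the `Session` class, its `run` and `replay` methods, and the use of the Python interpreter as a stand-in for A's tools are all assumptions made for this example.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical logged session: who ran which commands, and their output."""
    owner: str
    entries: list = field(default_factory=list)  # (argv, stdout) pairs

    def run(self, argv):
        """Run one command, record it with its output, and return the output."""
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        self.entries.append((list(argv), result.stdout))
        return result.stdout

    def replay(self, new_owner):
        """Re-run every logged command, in order, into a new owner's session."""
        copy = Session(owner=new_owner)
        for argv, _ in self.entries:
            copy.run(argv)
        return copy


# Researcher A logs a session (the interpreter stands in for her programs).
a = Session(owner="A")
a.run([sys.executable, "-c", "print(6 * 7)"])

# Collaborator B reproduces A's session, then logs a step of his own.
b = a.replay(new_owner="B")
b.run([sys.executable, "-c", "print('extra step by B')"])
```

A real logbook would also need to store the code of the executed programs and the environment, as described on the following slides; this sketch records only the command lines and their standard output.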

Collaborative Community Tools: CAVES / CODESH Projects

• Concentrate on the interactions between scientists collaborating over extended periods of time

• Seamlessly log, exchange and reproduce results and the corresponding methods, algorithms and programs

• Automatic and complete logging and reuse of work / analysis sessions: collect all user activities on the command line + code of all executed programs

• Extend the power of users working / performing analyses in their habitual way, giving them virtual logbook capabilities

• CAVES is used in normal data analysis sessions with the analysis package ROOT (High Energy Physics)

• CODESH is a UNIX shell with virtual logbook capabilities

• Build functioning collaboration suites - close to users!

• Formed a team: CODESH: DB & Vaibhav Khandelwal; CAVES: DB & Sanket Totala & Archis Kulkarni

CAVES / CODESH Architectures – Scalable and Distributed

First prototypes use popular tools: Python, C++, ROOT and CVS. For example, all ROOT commands plus the CAVES commands, or all UNIX shell commands plus the CODESH commands, are available.

CAVES / CODESH Architectures – CVS Server Backend

• Work on per session basis

• CVS provides version control by tagging releases

• CVS tags act as unique IDs for virtual sessions (the namespace can be structured by a collaborating group e.g. one big cave or many barrels in a cave, selected on a session basis)

• Both local and remote modes of working

• CVS pservers (secure, efficient remote stores):

  • Only CVS user accounts with password authentication, no UNIX accounts on the server (gridmapfile uses same idea)

  • read/write access control lists (per user & directory)
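A minimal sketch of how a session tag could map onto CVS operations against a pserver. It only builds command lines and does not run them. The helper names (`session_tag`, `pserver_root`, `log_commands`, `extract_command`), the host `cvs.example.org`, the repository path and the module name are hypothetical; the `cvs` subcommands and options used (`-d`, `commit -m`, `tag`, `checkout -r`) are standard CVS, but whether CAVES / CODESH issue exactly these calls is an assumption.

```python
import re

# CVS tag names must start with a letter and may contain only
# letters, digits, '-' and '_'.
TAG_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def session_tag(analysis, user, suffix=""):
    """Build a session tag such as 'analX_user1' and check it is a legal CVS tag."""
    parts = [analysis, user] + ([suffix] if suffix else [])
    tag = "_".join(parts)
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"not a valid CVS tag: {tag!r}")
    return tag


def pserver_root(user, host, repo_path):
    """CVSROOT string for password-authenticated (pserver) access."""
    return f":pserver:{user}@{host}:{repo_path}"


def log_commands(root, tag, annotation):
    """Commands to commit a session with a brief annotation, then tag it."""
    return [
        ["cvs", "-d", root, "commit", "-m", annotation],
        ["cvs", "-d", root, "tag", tag],
    ]


def extract_command(root, tag, module):
    """Command to check out the files exactly as they were at a tagged session."""
    return ["cvs", "-d", root, "checkout", "-r", tag, module]


# Hypothetical server, repository path and module name.
root = pserver_root("user1", "cvs.example.org", "/home/cvs/caves")
tag = session_tag("analX", "user1")
commands = log_commands(root, tag, "first look at sample X")
fetch = extract_command(root, tag, "analysis")
```

Validating the tag before calling CVS keeps the session namespace predictable, which matters when the tag doubles as the session's unique ID.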

Annotations

• Log annotations (& possibly summary results or metadata) in the repository /data equivalence!/

• Brief annotations – use cvs -m “Annotation” …

• Optional complete annotations - MySQL servers for metadata (fast search for large data volumes e.g. with indexing)

• The users can browse (a subset of) the existing tags, inspect the annotations, and, if interested, reproduce the results (or just get the how-to), modify them etc.
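The idea of a searchable metadata store for complete annotations can be sketched as follows. The slides name MySQL; this example swaps in Python's built-in `sqlite3` module so that it runs without a server. The table layout, column names and the helper functions `annotate`, `search` and `tags_by_author` are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations ("
    " tag TEXT PRIMARY KEY,"      # the session tag is the unique ID
    " author TEXT NOT NULL,"
    " annotation TEXT NOT NULL)"
)
# Index so that lookups by author stay fast as the logbook grows.
conn.execute("CREATE INDEX idx_annotations_author ON annotations (author)")


def annotate(conn, tag, author, annotation):
    """Store or overwrite the complete annotation for one session tag."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO annotations VALUES (?, ?, ?)",
            (tag, author, annotation),
        )


def search(conn, keyword):
    """Return the tags whose annotation contains the keyword."""
    rows = conn.execute(
        "SELECT tag FROM annotations WHERE annotation LIKE ? ORDER BY tag",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


def tags_by_author(conn, author):
    """Return all tags logged by one author (served by the author index)."""
    rows = conn.execute(
        "SELECT tag FROM annotations WHERE author = ? ORDER BY tag", (author,)
    )
    return [row[0] for row in rows]


annotate(conn, "analX_user1", "user1", "Higgs search, first pass")
annotate(conn, "analX_user2_mod_code", "user2", "Higgs search, tighter cut")
annotate(conn, "calib_user1", "user1", "calibration check")
```

Here `search(conn, "Higgs")` returns the two analysis tags and `tags_by_author(conn, "user1")` returns that user's two sessions; a user could then pass any of these tags to an extract step.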

Extensible Command Set

• Session commands
  • open <session>
  • close <session>

• During analysis
  • help <command>
  • browse <tag>
  • inspect <tag> <b|c>
  • startlog
  • log <tag> <annot>
  • annotate <tag>
  • extract <tag>

• Administrative tasks
  • copy <tag> <from> <to>
  • move <tag> <from> <to>
  • delete <tag> <from>
  • archive <tag> <to>
  • retrieve <tag> <from>

• CODESH commands:
  • run, shell
  • getenv, getalias etc
  • take/get snapshot
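One way to keep such a command set extensible is a small registry that maps command names to handlers, with everything else passed through to the host environment. The sketch below is an assumption about how this could be organised, not the CAVES / CODESH source: the `command` decorator, the `state` dictionary, and the reading of `browse <tag>` as a prefix match are all hypothetical.

```python
import shlex

COMMANDS = {}  # command name -> (handler, usage string)


def command(name, usage):
    """Decorator that registers a handler; new commands need no other changes."""
    def register(func):
        COMMANDS[name] = (func, usage)
        return func
    return register


@command("help", "help <command>")
def do_help(state, name):
    return COMMANDS[name][1]


@command("log", "log <tag> <annot>")
def do_log(state, tag, *annot):
    # Store everything typed since the last log under the given tag.
    state["sessions"][tag] = {
        "annotation": " ".join(annot),
        "lines": list(state["buffer"]),
    }
    state["buffer"].clear()
    return f"logged {tag}"


@command("browse", "browse <tag>")
def do_browse(state, prefix=""):
    # Assumption: the argument selects tags by prefix.
    return sorted(t for t in state["sessions"] if t.startswith(prefix))


def dispatch(state, line):
    """Run a logbook command, or record the line as ordinary user input."""
    words = shlex.split(line)
    if not words:
        return None
    if words[0] in COMMANDS:
        handler, _ = COMMANDS[words[0]]
        return handler(state, *words[1:])
    state["buffer"].append(line)  # would be handed to ROOT or the UNIX shell
    return None


state = {"sessions": {}, "buffer": []}
dispatch(state, "ls -l data/")
dispatch(state, "./analyze run1.root")
logged = dispatch(state, "log analX_user1 first try")
```

After these calls the two ordinary lines are stored under `analX_user1` with the annotation `first try`, and `dispatch(state, "browse analX")` lists that tag.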

Working Releases - CODESH

• Virtual log-book for “shell” sessions

• Parts can be local (private) or shared

• Tracks environment variables, aliases, invoked program code etc during a session

• Reproduces complete working sessions

• Complex high energy physics examples for the CMS experiment operational

• E.g. all CMS data generations for a Florida group during several months are stored in CODESH and the knowledge is available
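Tracking environment changes during a session can be illustrated by comparing two snapshots. This is a minimal sketch under the assumption that a snapshot is a plain copy of the environment variables; the function names `take_snapshot` and `diff_snapshots` are hypothetical and are not the CODESH snapshot commands.

```python
import os


def take_snapshot(environ=None):
    """Copy the environment (os.environ by default) into a plain dict."""
    return dict(os.environ if environ is None else environ)


def diff_snapshots(before, after):
    """Report which variables were added, removed or changed between snapshots."""
    added = {k: v for k, v in after.items() if k not in before}
    removed = sorted(k for k in before if k not in after)
    changed = {
        k: (before[k], after[k])
        for k in before
        if k in after and before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Illustrative values only; a real session would call take_snapshot()
# at the start and at the end.
start = take_snapshot({"PATH": "/usr/bin", "SCRATCH": "/tmp/a"})
end = take_snapshot({"PATH": "/usr/bin:/opt/root/bin", "ROOTSYS": "/opt/root"})
delta = diff_snapshots(start, end)
```

Storing only the difference alongside the logged commands is enough to re-create the environment when the session is reproduced, provided the starting point is known.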

Working Releases - CAVES

Higgs W+W-

Working Releases

• The client codes are easy to download and install (stored in CVS) – checkout and build

• All you need to run the clients is ROOT 3.05 or higher and CVS (from v 1_0_0 the client is fully ROOT-compliant), or Python 2.1 or higher

• Users can browse the virtual logbooks and reproduce the examples stored there

• They can play with e.g. new analyses and store them locally or on our server, or install new servers for groups collaborating on a project

• Stress test: CODESH repository with

  • 10 sessions – 1 sec to inspect a session

  • 10 000 sessions – 2 sec to inspect a session

Prototype CMS Analysis

Developed a stand-alone C++ code to analyze data stored in tree format:

• Lightweight, no external dependencies besides ROOT, used for iGrid2005 and SC|05 demos and by users at UF for analysis and CMS production validation

• Sessions stored by CAVES in private or group repositories, data stored locally or on remote xrootd servers

• Fully distributed application

Outlook

• Work in progress – first CAVES and CODESH releases out

• We are looking forward to user feedback

• Possible future directions:

  • Different back-ends: web/grid service oriented (e.g. Clarens)

  • Extend remote data access: e.g. xrootd …

  • Add GSI security

  • Automatically convert session log to workflow

  • Tune on smaller samples, schedule on grid for larger tasks

A picture is better than 1000 words: Try out the releases!

CAVES white paper arXiv: physics/0401007

CODESH / CAVES paper arXiv: physics/0410226

Virtual States and Transitions ICCS 2005, LNCS 3516, pp. 342-345, 2005; © Springer-Verlag

More info at http://cern.ch/bourilkov/caves.html

Backup Slides

Possible Scenarios

Case 1: Simple

User 1: Does some analysis and produces a result with tag analX_user1.

User 2: Browses all current tags in the repository and fetches the session stored with tag analX_user1.

Case 2: Complex

User 1: Does some analysis and produces a result with tag analX_user1.

User 2: Browses all current tags in the repository and fetches the session stored with tag analX_user1.

User 2: Does a modification in the program obtained from the session of user1 and stores the same along with a new result with tag analX_user2_mod_code.

User 1: Browses the repository, finds that his program was modified and decides to extract that session using the tag analX_user2_mod_code.

This scenario can be extended to include an arbitrary number of steps and users in a working group or groups in a collaboration.
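The complex case can be walked through with an in-memory stand-in for the shared repository. Everything here is illustrative: the `Repository` class, the toy `analyze` function, the sample values and the cut parameter are assumptions, and only the tag names come from the scenario above.

```python
import copy

SAMPLE = [5, 15, 25, 35, 45]  # toy data standing in for a physics sample


def analyze(code):
    """Toy analysis: count the entries passing the cut defined in the code."""
    return sum(1 for value in SAMPLE if value > code["cut"])


class Repository:
    """Hypothetical shared logbook: tag -> stored code and result."""

    def __init__(self):
        self._sessions = {}

    def log(self, tag, code, result):
        if tag in self._sessions:
            raise KeyError(f"tag already used: {tag}")
        # Deep copies keep stored sessions immutable from the outside.
        self._sessions[tag] = copy.deepcopy({"code": code, "result": result})

    def browse(self):
        return sorted(self._sessions)

    def extract(self, tag):
        return copy.deepcopy(self._sessions[tag])


repo = Repository()

# User 1: does some analysis and logs it.
code1 = {"cut": 20}
repo.log("analX_user1", code1, analyze(code1))

# User 2: browses the tags, extracts the session, modifies the code, logs again.
tags_seen_by_user2 = repo.browse()
session = repo.extract("analX_user1")
session["code"]["cut"] = 30
repo.log("analX_user2_mod_code", session["code"], analyze(session["code"]))

# User 1: finds the modified session and extracts it.
modified = repo.extract("analX_user2_mod_code")
```

Because each tag is written once and extracted as a copy, user 1's original session stays intact while the modified variant lives beside it under its own tag.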

High Energy Physics Visualization in CAVES

Before detector simulation: PYTHIA 4-vectors – CMKINViewer (DB)

After reconstruction