
D-NA2.1: Pilot applications and scientific use cases: first selection, prioritization and enabling work plan

Friday 18th May, 2012

Project acronym: VERCE
Project n°: 283543
Funding Scheme: Combination of CP & CSA
Call Identifier: FP7-INFRASTRUCTURES-2011-2
WP: WP2/NA2, Pilot applications and use cases
Filename: D-NA2.1.pdf
Author(s): A. Michelini
Location: http://www.verce.eu/Repository/Deliverables/RP1/
Type of document: Deliverable
Dissemination level: Public
Status: Final
Due date of delivery: 15/05/2012
Reviewer: Malcolm Atkinson
Keywords: data-intensive, cpu-intensive, HPC, earthquake, seismology, data infrastructure, forward modeling, inversion

Version  Author               Date        Comments
1        A. Michelini (INGV)  23/04/2012  Initial draft for comments
2        M. Atkinson (UEDIN)  25/04/2012  Comments
3        A. Michelini (INGV)  02/05/2012  2nd draft with comments taken into account
4        A. Michelini (INGV)  07/05/2012  Modified and translated into LaTeX
5        A. Michelini (INGV)  09/05/2012  Added comments and corrections by Malcolm
6        M. Atkinson (UEDIN)  10/05/2012  Added comments and corrected typos
7        M. Galea (UEDIN)     10/05/2012  Added comments and corrected typos
8        A. Michelini (INGV)  12/05/2012  Inserted corrections and updated glossary
9        A. Michelini (INGV)  13/05/2012  Improved overall text and fixed graphics and tables (e.g., used the "ltablex" package and positioning)
10       M. Atkinson (UEDIN)  14/05/2012  Executive summary review
11       G. Moguilny (IPGP)   15/05/2012  Format review


Copyright notice

COPYRIGHT © VERCE PROJECT, 2011-2015. SEE www.verce.eu FOR DETAILS ON VERCE.

VERCE, Virtual Earthquake and seismology Research Community e-science environment in Europe, is a project co-funded by the European Commission as an Integrated Infrastructure Initiative within the 7th Framework Programme. VERCE began in October 2011 and will run for 4 years.

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0 or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

The work must be attributed by attaching the following reference to the copied elements: COPYRIGHT © VERCE PROJECT, 2011-2015. SEE www.verce.eu FOR DETAILS ON VERCE. Using this document in a way and/or for purposes not foreseen in the license requires the prior written permission of the copyright holders. The information contained in this document represents the views of the copyright holders as of the date such views are published.


Contents

Executive Summary

1 Introduction

2 Survey of the use cases
  2.1 Structure of the questionnaire
  2.2 Results of the questionnaire
    2.2.1 Proposed use cases
    2.2.2 Identified data sets
    2.2.3 Software
    2.2.4 External software libraries
    2.2.5 Coordinate system and spatial scale-length
    2.2.6 Input data formats
    2.2.7 Processors/Memory and implementation on clusters
    2.2.8 Data sizes and storage requirements
    2.2.9 Data availability
    2.2.10 Application availability
    2.2.11 Data preparation, metadata and identification
    2.2.12 Documentation, current users and distribution
  2.3 Discussion and summary

3 Selection of the use cases
  3.1 Data intensive
    3.1.1 Data staging
    3.1.2 Data ingestion
    3.1.3 Data pre-processing
    3.1.4 Data analysis
    3.1.5 Results post-processing
  3.2 CPU-intensive
    3.2.1 "Forward Modeling and Inversion: data space comparison"
    3.2.2 "Forward Modeling and Inversion: adjoints and model inversion"
    3.2.3 TsuMaps use case

4 Prioritization and Recommendations

References

Glossary and Links

List of Figures

1  NA2 use cases proposed by the partners of VERCE.
2  Typologies of the applications and software (or scripting) languages adopted.
3  External libraries used in the use cases (Question 9.c of the survey).
4  Histogram of the spatial scale-lengths and of the adopted coordinate systems.
5  Histogram summarizing the adopted input file formats (Question 9.d of the survey).
6  Processors, memory requirements and, for the data-intensive pilot applications, whether they have been run on clusters as embarrassingly parallel jobs.
7  Input, output and temporary data sizes.
8  Data availability.
9  Pre-processing and metadata.
10 Output and PID.
11 Documentation, distribution and users.
12 Macro-modules of the data-intensive use cases.
13 Main use cases of the data-intensive group of use cases.
14 Noise cross-correlation full workflow: from raw data downloading to tomographic images and velocity-variation models.
15 WaveHRL full workflow: from initial event detection to a series of independent modules linked to the central database.
16 Main blocks of the "Forward Modeling and Inversion" use case.
17 Main blocks of the "adjoints and model inversion" part of the "Forward Modeling and Inversion" use case.
18 Flow diagram of the TsuMaps use case.


Executive Summary

This deliverable report describes the activities undertaken by WP2 during the first six months. The objective of the deliverable is the initial prioritization of the use cases of the project. The report is subdivided into two main parts. The first part describes and comments on the results of the survey designed to better understand the typology of the use cases offered by the project participants. The second part focuses on the analysis of the use cases: it provides both a synthesis of the common features identifiable in the use cases and an initial ranking of the submitted use cases. Importantly, this prioritization has also benefited from the ongoing discussion in seismology that took place repeatedly in various meetings and dedicated events (e.g., EGU and AGU Assemblies; workshops on eScience Applications for Seismology, Edinburgh, March and June 2011; workshop for the Coordination of IT developments in on-going (EC-)projects: EPOS/ORFEUS seismology, February 2012).

The survey aimed to provide a general description of the use cases by attempting to identify their scientific relevance and impact upon science and society, their requirements in terms of data storage and CPUs, the languages adopted by the developers, the potential gain(s) resulting from re-factoring the use cases and their pilot applications, special requirements such as the adoption of identifiers, and other related topics. A total of 23 composite questions were prepared for the questionnaire. Nine use cases were proposed, spanning (observational) data- and CPU-intensive applications. The scientific topics ranged from velocity-structure monitoring and determination using data- and CPU-intensive approaches, to seismic event detection and location, to rapid tsunami modeling to be used for tsunami early warning.

The second part of the report aims at identifying those parts common to the use cases. The reason follows from the need to provide software tools on the VERCE platform that are highly optimized and re-usable both by the suite of different use cases proposed in VERCE and by others not currently included but likely to be added as the project develops. This introduces the concept of macro-modules — one or more software tools designed to perform similar tasks within the workflows. The macro-modules identified so far include data staging, data ingestion and data pre-processing. These all precede the data-analysis stage sensu stricto (i.e., the pilot-application-specific macro-module(s)) and the output-results macro-module(s), the latter not yet thoroughly examined in this report.

The data transfer from archives to local storage, although extremely relevant for assembling the data set, is not considered here since other EC-supported projects (e.g., EUDAT) are already examining optimal solutions for this requirement. For data staging (scalable data storage and file management systems, transparency, fine-grain access, high throughput and heavy access concurrency) across distributed storage systems in various computational environments (local data-intensive cluster, HPC or Grid), it is important that VERCE takes full advantage of existing and/or developing technologies to test optimal solutions for managing and staging massive data sets. Similarly, data ingestion is very relevant to VERCE since the analysis of huge data volumes requires intensive I/O. This concerns in particular the problems of format conversion, indexing strategy, shared identification of large binary objects composed of a large number of small files, fine-grain concurrent access and version management. Since the classical formats and indexing schemes in seismology were designed for archiving the data, investigating other data structures and formats (e.g., HDF5), as well as the potential of large-scale parallel databases, is a key scalability issue. The data pre-processing macro-module is analogously relevant since similar pre-processing schemes are required by the different use cases. In this regard, the development of a "general purpose" pre-processing macro-module incorporating the many signal-analysis routines now present in the use cases is thought to be both instrumental to the implementation of the use cases on the VERCE platform and an attractor for other use cases (and developers) not currently part of the VERCE community. The concept of macro-modules extends nicely across data- and CPU-intensive applications.

In summary, subdividing the use cases into data- and CPU-intensive, our analysis of the development of macro-modules has led us to select, among the data-intensive use cases, the cross-correlation of ambient noise as the primary candidate for software factorization and re-design. In fact, three out of nine use cases address this topic, thus providing the ample audience of seismological developers and users that is indispensable for the feedback needed by the software engineers in their development of the required e-Infrastructure and common macro-modules. Given the envisaged modularity, this choice will nevertheless also benefit the other data-intensive use cases. What is certain is that the solutions found within VERCE will be pivotal for making the analysis of huge data sets computationally feasible — such as those of the Japanese seismic networks (i.e., the joint French-Japanese Namazu project) — which is otherwise nearly impossible using the current resources and test-bed architectures. Three use cases are CPU-intensive. The first two entail earthquake wave modeling and inversion at regional scale. The first — Forward Modelling and Inversion — is special in the sense that it will provide tools for model-misfit comparison in the data space, allowing detailed comparison of seismograms along predefined parts of the model. The expectation is that it will provide much insight into the velocity model, besides offering tools for data-misfit calculation — a fundamental ingredient for tomographic inversion and for our second CPU-intensive use case, which nicely complements the former; the two use cases are necessarily sequential. The last use case selected — TsuMaps — consists of the rapid determination of the tsunami wave following alerts issued by the real-time software Early-Est. This use case is very different from all of the others and requires dedicated developments.


1 Introduction

Increasingly dense seismic and geodetic networks are continuously transmitting a growing wealth of data from around the world. The multiple uses of these data led the seismological community to pioneer globally distributed open-access data infrastructures, standard services and formats, e.g., the Federation of Digital Seismic Networks (FDSN) and the European Integrated Data Archives (EIDA). However, our ability to acquire observational data outpaces our ability to manage, analyze and model them. Enabling advanced data-intensive analysis and modeling applications is thus challenged by conventional storage, computation and communication models, and requires a new holistic approach. Such an approach is instrumental to exploiting the cornucopia of data and to guaranteeing optimal operation and design of the high-cost monitoring facilities.

Computational seismology and information management increasingly revolve around massive amounts of data that stem from: i.) the flood of data from the observational systems; ii.) the flood of data from large-scale simulations and inversions; iii.) the ability to economically store petabytes of data online; iv.) the evolving Internet and data-aware computing capabilities. As data-intensive applications rapidly increase in scale and complexity, they require service-oriented architectures offering virtualization-based flexibility for complex and re-usable workflows.

The primary objective of this work package is the prioritization of the use cases contributed by thepartners of the VERCE project. Each use case encompasses a primary pilot application and a specificdata set. This entails the following main activities.

• Selection of a number of existing pilot applications and definition of scientific use case scenariosbased on actual research goals.

• Analysis and prioritization of the pilot applications and use case scenarios in terms of their impact,complexity and requirements (services and tools). (With the help of JRA1, SA1 and JRA2)

The prioritization will be based on the typology of the use cases (since many of the applications share common software requirements, e.g., the same core routines or common parts of the workflow), on the similarity of their HPC or data-resource requirements on the test-bed, and on a ranking of the impact of each use case for the community. The identified selection criteria address the following main aspects.

• Scientific impact (i.e., top-level research since VERCE wants to drive a new paradigm in data andCPU intensive analysis in seismology).

• Challenge (i.e., only through using the VERCE methods and e-Infrastructure is it possible to ad-dress the problem).

• Ready to go due to the existence of a well defined, identifiable group of people working on thespecific use case and seeking results — this is essential for continuous feedback between seismol-ogists and ICT specialists.

• Potentially achievable (it must be complex but achievable in the time span).

• Expected gain from factorizing the procedure/code into modules so that they can be re-used to buildother applications (e.g., cross-correlation, back-projection, filtering, other types of pre-processing,....) and thus stimulate the development of further advances in the methods used by Earth scientists.

• Impact on the development of the VERCE e-Infrastructure; in particular, coverage is important, so that, by the end of the VERCE project, all of the computational and data challenges that the longer-term use of the e-Infrastructure will encounter will have been explored.


Results of the selection and analysis processes just described consist of the documentation of the pilotapplications and of the use-case scenarios, and the requirements for new services, tools, software andtailored workbenches. As VERCE develops, these will be integrated into a consistent Virtual ResearchEnvironment (VRE), through the actions of other work packages, particularly JRA2 and SA3, so thatfuture users will have a convenient and intuitive working environment that significantly accelerates theirresearch.

2 Survey of the use cases

The initial work focused on understanding what the different use cases effectively do. To this end a survey was prepared and all of the contributing software partners were requested to complete it. As anticipated in the previous section, the objectives of the survey were to:

• obtain a clear idea of the use cases and the software requirements,

• help shape a strategy for defining, prioritizing and implementing the use cases, and

• identify the common components of the workflows employed in the use cases.

2.1 Structure of the questionnaire

The survey consisted of 23 main questions and concerned i) the identification of the use case (e.g., data- vs CPU-intensive, purpose, scientific relevance, main software/routines, pipeline, ...), ii) some important technicalities (e.g., memory requirements, formats, operating systems, I/O requirements, external libraries, ...) and iii) how it has been disseminated and used until now (e.g., documentation, modalities of access, ...).

2.2 Results of the questionnaire

The full results of the questionnaire are available under the NA2 wiki section of Redmine(http://www.verce-project.eu/projects/verce1/wiki/NA2_Use_Cases_Survey).

2.2.1 Proposed use cases

The use cases resulting from compilation of the survey are listed below.

1. Forward Modelling and Inversion (LMU).

2. Xspect: cross-spectrum analysis on noise cross-correlations (INGV).

3. High Resolution Tomography from 3D full waveform inversion in Italy (INGV).

4. TSUMAPS: near real-time forecasting of tsunami wave height (INGV).

5. Automatic detection and High Resolution Location of Italian Seismicity (INGV, EOST).

6. L’Aquila 2009 quake: crustal velocity variation by means of seismic noise cross correlation (INGV).

7. Noise cross-correlations at the Valhall field (IPGP).

8. Automatic high resolution location of Maule aftershocks (ULIV).

9. Velocity and velocity changes of Japan: the Namazu project (IPGP/ISTERRE).


The use cases can be sub-divided, according to the results they produce, into three main groups — i) velocity structure and its temporal variations, ii) seismic-event detection and high-resolution location, and iii) tsunami early warning. The first group targets the determination of 3D, laterally inhomogeneous velocity structures using different state-of-the-art methodologies, both data- and CPU-intensive. These methods include: i) noise cross-correlation and inversion of the resulting maps of group velocity, ii) comparison and inversion of earthquake broadband data, and iii) monitoring of velocity variations using noise cross-correlation data. The second group of use cases and related pilot applications targets event detection and high-resolution location from large to huge data sets consisting of continuous waveforms. The third group addresses a very special case of analysis, which combines real-time analysis of continuous data streams with CPU-intensive tsunami modeling. Figure 1 summarizes the typology of the use cases according to the scientific results that are sought, while indicating the type of pilot application (i.e., data- or CPU-intensive).

Figure 1 – NA2 use cases proposed by the partners of VERCE. The tree diagram is organized according to the scientific goals that are sought. Each scientific case can include data- and/or CPU-intensive pilot applications, indicated in blue and red respectively. The numbers reported within parentheses for each use case refer to the numbered list of the proposed use cases.

2.2.2 Identified data sets

The partners will provide access to the following data sets throughout the course of the VERCE project.

Continuous data:

• Japan-Namazu (∼ 90TB)

• Maule 2010 (M8.8) aftershocks ∼ 1 year (3.3 TB)

• Italy — 5-6 years (∼ 45TB)

• L’Aquila earthquake aftershock 1 Month (200 GB)

• Continuous stream for Tsunami Early Warning (TsuMaps) (∼ 10.5 kB/s) (e.g., from the IRIS global stations SeedLink server).


• European data from Orfeus-EIDA data archive - 5-6 years (∼ 30 TB)

Event data:

• Available using EIDA web-services. [1000 events recorded at regional scale with 20' time windows by 1000 stations at 100 Hz for 3C data result in ∼ 400 GB of storage requirements.]

In practice, there are two very large continuous data sets — Japan-Namazu, which corresponds to the joint data sets of the Japanese seismographic networks (short period, broadband and strong motion), and the Italian seismic networks (broadband). They are both in the size range of tens of TB of continuous waveform data. One data set — the aftershocks of the 2010 (M8.8) Maule earthquake — is a little more than 1.1 TB, and the Valhall field and L'Aquila earthquake aftershock data sets lie in the range of some hundreds of gigabytes. The tsunami early warning (TEW) use case — TsuMaps — requires a real-time, continuous stream of miniSEED data available through SeedLink connections. This flux will be replicated by using the same procedures but on miniSEED data stored on local disk. Finally, it is worth noting the size projection of the data storage requirements for the event data: 1000 events recorded by 1000 stations for time windows of 20 minutes amount to on the order of ∼ 400 GB.
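As a back-of-the-envelope check of this figure, the projection can be reproduced as follows (a sketch; the effective compression of ∼1.1 bytes per sample is an assumption, roughly what Steim-compressed miniSEED achieves — uncompressed 4-byte integers would give about four times as much):

```python
# Rough reproduction of the ~400 GB event-data storage projection.
# bytes_per_sample is an assumed effective miniSEED compression figure.
events, stations, components = 1000, 1000, 3
window_s, rate_hz = 20 * 60, 100            # 20-minute windows at 100 Hz
bytes_per_sample = 1.1

samples = events * stations * components * window_s * rate_hz
print(f"~{samples * bytes_per_sample / 1e9:.0f} GB")  # ~396 GB
```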

2.2.3 Software

The script pipelines and/or software of the pilot applications are quite heterogeneous. Figure 2 shows that in almost all of the use cases there is no single language adopted, and that the use cases are composed of different programming and scripting languages.

Figure 2 – Typologies of the application and software (or scripting) language adopted. The numbers (e.g., 8.a and8.b) refer to the question of the survey. The pie-chart on top (Question 8.a) summarizes the distribution of the pilotapplication procedures (see text to the right). The bottom histogram (Question 8.b) indicates the languages usedfor the pilot application development.

Just two use cases — Valhall field and Xspect — adopt a single language (C/C++ and MATLAB, respec-tively) throughout.


Figure 3 – External libraries used in the use cases (Question 9.c of the survey). The histogram indicates thefrequency of usage of the various libraries.

2.2.4 External software libraries

Figure 3 shows the external software libraries adopted by the different use cases.

Two primary external libraries — ObsPy (http://obspy.org) and GMT (http://gmt.soest.hawaii.edu/) — are adopted in most use cases. ObsPy is a Python library consisting of modules designed to ingest seismological data in various formats. It offers a number of signal-processing functionalities that make it very valuable for data processing. GMT has become a graphics standard for mapping in the Earth Sciences, besides offering 2D processing functionalities, which can be extremely useful for representing fields in various canonical geographical projections. SAC (Seismic Analysis Code, http://www.iris.edu/software/sac/) is also used by some use cases, although its limited use suggests that it may have been partially replaced by ObsPy. Finally, some use cases adopt the Numerical Recipes library, which provides numerical analysis software.
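To illustrate why ObsPy is attractive as a common ingestion layer, a minimal sketch of reading one format and converting to another (the file names are hypothetical):

```python
from obspy import read

# ObsPy auto-detects the input format (SAC, miniSEED, full SEED, GSE2, ...).
st = read("event_waveform.sac")      # hypothetical file name
print(st[0].stats)                   # unified view of the trace metadata
st.write("event_waveform.mseed", format="MSEED")  # convert to miniSEED
```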

2.2.5 Coordinate system and spatial scale-length

The coordinate systems adopted by the use cases appear to favour Cartesian coordinates for local- and regional-scale analysis over spherical coordinates at global scale. Consistently, the scale-lengths of the use cases are primarily local to regional (Fig. 4).

2.2.6 Input Data formats

The input data formats are dominated by SAC (5 use cases) and miniSEED and fullSEED (4 use cases).

The SAC input format is very popular in seismology, but it is likely not optimal for data ingestion, especially for large data sets. miniSEED is a compressed format and, in an SDS [Simple (or SeedLink) Data Structure], can potentially be ingested very rapidly. For the output (see Fig. 5), the survey indicates a very complex suite of output formats. The majority of the use cases have their own binary or ASCII format, very few rely on generic XML or on QuakeML, and a few adopt, for the output waveforms, standard formats like SAC or miniSEED.


Figure 4 – Histogram of the spatial scale-lengths and of the adopted coordinate systems. The top panel (Question 9.a) shows that the majority of the use cases target problems from local to regional scale-lengths. The bottom panel (Question 9.b) shows that Cartesian coordinates are adopted in most cases.

Figure 5 – Histogram summarizing the adopted input file formats as resulting from Question 9.d of the survey.Detail on the histogram with the actual numbers reported is provided on the right side of the panel.

2.2.7 Processors/Memory and implementation on clusters

The great majority of the use cases run on Intel x86 and AMD Opteron processors, but the results of the survey show that the same pilot applications have also been run on other, less common processors (e.g., IBM Power 5/6, Sun, SGI, Compaq, NEC, HPC systems, ...). The picture is less clear for the CPU memory requirements per processor, which appear to vary from O(100-1000 GB) for the forward-modeling application proposed by LMU down to about 1 GB, although we note that many did not answer this question. Also, the great majority of the pilot applications have been run on clusters as embarrassingly parallel jobs, a pattern very well suited to data-intensive analysis of large numbers of, e.g., independent time windows (Fig. 6).
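As an illustration of the embarrassingly parallel pattern reported in the survey, a minimal sketch in Python (the file names and the per-day function body are hypothetical placeholders):

```python
from multiprocessing import Pool

def process_day(day_file):
    # Stand-in for the per-window analysis of a pilot application; each
    # daily volume is independent, so no inter-process communication is
    # needed -- the embarrassingly parallel pattern noted above.
    ...
    return day_file, "ok"

day_files = [f"2011.{d:03d}.mseed" for d in range(1, 366)]  # hypothetical names

if __name__ == "__main__":
    with Pool(processes=16) as pool:
        for name, status in pool.imap_unordered(process_day, day_files):
            print(name, status)
```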

2.2.8 Data sizes and storage requirements

The questions on the data storage requirements were designed to obtain a better understanding of the data volumes that come into play (Fig. 7). It appears, as anticipated in the data description, that the input data volumes of the proposed use cases span several orders of magnitude. An important issue is represented by the temporary storage requirements of many use cases — primarily those dealing with noise cross-correlation (the number of cross-correlation files essential to the analysis of seismic noise increases as n², where n is the number of continuous waveform channels used).


Figure 6 – Processors, memory requirements and if the pilot application is data intensive if it has been run onclusters as embarrassingly parallel job. As in the previous figures, the numbers above each panel refer to thequestion number of the survey.

Finally, as regards the target input data size of the proposed use cases, it appears that the majority of the pilot applications were designed for input data sizes on the order of 1-10 TB. This implies that some significant re-factoring of the applications is required in order to deal with data sizes on the order of tens to hundreds of TB.

In any event, the optimization of the data ingestion (input) and of the disk writing of the large amounts of temporary data produced by some use cases (e.g., the cross-correlation files) is one of the primary issues that VERCE needs to address.
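To make the n² growth of the temporary cross-correlation archive concrete, a small worked example (the ∼1 MB per correlation file is an illustrative assumption, not a measured figure):

```python
# Growth of the temporary cross-correlation archive with network size.
# Assumes one correlation file per channel pair per day, ~1 MB each
# (an illustrative figure, not a measured one).
per_file_mb = 1.0
for n in (100, 500, 1000, 2000):           # continuous waveform channels
    pairs = n * (n - 1) // 2
    print(f"n={n:5d}  pairs={pairs:8d}  ~{pairs * per_file_mb / 1e3:.1f} GB/day")
```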

2.2.9 Data availability

This is a critical issue that involves both data access (i.e., the discovery and downloading of the data archived at the seismological data centers) and data staging (i.e., the protocols and software adopted for transferring the data from the archives or from the local machines to the HPC facilities¹). The survey indicates that most use cases require "staging to HPC facilities" for analysis and that only a fraction (3) already have the data stored locally (Fig. 8). We note that data staging can be a severe bottleneck for the data-intensive use cases, especially when the data are not located on the same LAN. For VERCE this problem is extremely relevant but, given the resources allocated to the project, its solution is to be found in the results/developments of other concurrent projects (e.g., EUDAT, ENVRI, ...) that specifically address this issue.

2.2.10 Application availability

All pilot applications of the use cases are public.

2.2.11 Data preparation, metadata and identification

Data pre-processing is a very important part of the analysis for all use cases, and it includes the data quality-control step. Depending on the specific application, pre-processing involves a series of signal-processing tools (e.g., band-, high- and low-pass filtering, deconvolution of the instrument response, decimation, synchronization, whitening, binning, identification of spikes or data gaps, calculation of characteristic functions).

¹ In the computer literature, "data staging" is reserved for moving the data from local storage to the nearby computing resource (see definition in glossary), whereas here its definition is broadened to include the movement of the data from the archives as well.


Figure 7 – Input, output and temporary data sizes. This multi-panel figure attempts to provide estimates of the data volumes that come into play for the data-intensive applications. The numbers refer to the questions of the survey. The top panel histogram provides information on the original target size of the data set to be ingested by the pilot application. The middle panel summarizes the output data sizes as multiples of the input size, and the bottom panel shows the histogram for the temporary files.

A fundamental issue that needs a solution is the description of the data through metadata. This issue is emerging ever more often in seismology as the community moves toward the analysis of very large data volumes. Proper selection of descriptive metadata can greatly speed up data discovery and can describe derived data sets to be used for further (or different) analyses.

Another issue — to some extent the other side of the same book-keeping problem introduced with the definition of the metadata — is the introduction of persistent identifiers, both for seismological data and for the whole analysis workflow in general (Fig. 9).

2.2.12 Documentation, current users and distribution

The proposed use cases, though very important scientifically, are not yet widely used, and the survey shows that, since they have been developed by scientists for specific scientific targets, they still lack, in most cases, documentation, user guides or tutorials (Fig. 10 and Fig. 11).


Figure 8 – Data availability. The pie chart (top) and the histogram refer to questions 13.e and 13.f of the survey. The pie chart shows clearly that most of the use cases need fast methodologies for staging the data to HPC facilities, whereas the bottom histogram shows that most of the data used by the data-intensive applications are not available through on-line services provided by the archival data centers.

Figure 9 – Pre-processing and metadata. The histogram (top) and pie chart (bottom) show the answers to questions 16 and 17 of the survey. The histogram on top exemplifies some of the data pre-processing stages that the use cases require; the bottom pie chart shows that most developers believe that proper data description using metadata is relevant for data discovery.

2.3 Discussion and summary

Nine use cases have been surveyed; they sub-divide into 6 data-intensive and 3 CPU-intensive applications. The partners bring 4 data sets overall to the VERCE platform (Japan, Italy, L'Aquila, Maule), ranging in terms of storage requirements from ∼ 200 GB to 30 TB of compressed data (i.e., ∼ 90 TB uncompressed). Without ranking the use cases scientifically — a subjective exercise that is not the purpose of this survey — it can certainly be stated that the presented use cases address state-of-the-art issues in seismology (and in the solid Earth sciences in general) and also have the potential to provide important scientific answers to issues of great relevance, both for the study of the Earth and for important services like seismic monitoring.


Figure 10 – Output and PID. The figure displays the histogram summarizing the answers to the survey (top)regarding the output format (question 19) and a pie chart showing the distribution of the answers regarding theimportance of persistent identifiers (question 20) for data re-usability.

An example is the analysis of the seismic data acquired by the Japanese networks prior to the great 2011 (M9.0) Tohoku earthquake. Questions such as "Did any velocity change occur in the crust in the years, months or weeks before the earthquake?" are scientifically very relevant, since they can open new perspectives on the nucleation process of mega-earthquakes and, once the process is better understood, they can also importantly affect the manner in which seismic monitoring will be carried out in the future. We note, however, that the ingestion and analysis of 30 TB of compressed data, though currently feasible, is still far too lengthy (weeks to months) to assure thorough analysis and appraisal of results. This alone is one good reason why the optimizations conceived and developed within VERCE can be very relevant. Other reasons follow from the need to optimize and standardize analysis on the VERCE platform to ensure that massive-data seismology can be effectively carried out using the resources available in the 21st century.

3 Selection of the use cases

Nine (9) applications have been proposed — 6 data-intensive and 3 CPU-intensive. The 4 data sets (Japan, Italy, L'Aquila, Maule) of the data-intensive use cases encompass applications and analyses carried out at different scale-lengths (i.e., from regional to local scale). The same can be said for the CPU-intensive use cases, which address continental-scale (a few thousands of km) and regional/local-scale analyses. In the following, we distinguish between data- and CPU-intensive use cases. We are also aware that the data pre-processing parts of the use-case workflows are very similar — the only difference being the amount of data used (i.e., the scaling of the problem). More specifically, analysis of the use cases above shows that all can be partitioned into main blocks of specific analysis, which in the following we call macro-modules.


Figure 11 – Documentation, distribution and users. The panel summarizes with histograms and pie charts theanswers to questions 21 and 22 of the survey.

For both data- and CPU-intensive use cases, we can identify i.) a data staging part, ii.) a data ingestion part and iii.) a data pre-processing part that includes quality-control procedures (Fig. 12). This indicates that, rather than focusing on the optimization of individual use cases, it is of great importance in the VERCE project to develop and progressively improve these macro-modules, which, on the VERCE architecture/platform, will provide the initial basic analysis services required for carrying out the analyses. These macro-modules will offer both standard functionalities/analysis tools (e.g., band-pass filtering, QC tools, ...) and specific ones (e.g., data stretching, one-bit normalization, ...), all within the same general architecture. Another important concern that emerged from the survey is the need for proper book-keeping and accessibility throughout the various stages of the use-case workflows (e.g., data, results). This stresses the need for metadata description services and for persistent identifiers (PIDs) that will allow data to be referenced (and re-used) uniquely and, more generally, provide reliable, long-term access to resources.


Figure 12 – Macro-modules of the data-intensive use cases. Note that the term "staging" is used somewhat unconventionally to indicate the whole process that takes the data from the data archives to the computing resources. The data staging box is de-emphasized since only a part of the data movement process — from the local storage to the computer facility — is addressed by VERCE. This part is discussed in more detail in the JRA1 report.

3.1 Data intensive

The generic structure of the data-intensive use cases (Fig. 13) comprises i.) the three general macro-modules mentioned above, ii.) one or more use-case-specific modules for data analysis, and iii.) results post-processing (e.g., visualization, statistical analysis, ...).

3.1.1 Data staging

Data staging is the initial part of the workflow that makes the data available for the analysis. In this document, data staging is taken to mean the entire process that, through data discovery, first selects the data and then uploads them onto the computational facility. In VERCE, however, the initial part of the staging process (i.e., discovery and data movement from the archive to the user's local storage facility) is not really considered, as it is assumed that the data have already been selected and transferred. As noted previously, optimized data transfer is not an object of development in VERCE, since other concurrent data-related projects supported by the EC (e.g., EUDAT and ENVRI, among others) are addressing such optimized procedures.

3.1.2 Data ingestion

This part of the workflow is of primary importance since the main target of the use cases is the analysis of massive seismological continuous-waveform data sets, which need to be ingested for the analysis.


Seismological data are stored in various formats and follow diversified hierarchical file structures. A generally accepted data file system structure in seismology is the Simple (or SeedLink) Data Structure (http://www.seiscomp3.org/wiki/doc/applications/slarchive), which adopts an organizational tree based on station name, year, Julian day and channel, with the files in compressed miniSEED format. For rapid data ingestion it is of primary importance to reduce the delay introduced by the file system in finding the data to access. In fact, this process can be extremely demanding when millions of files are stored, even for a well-organized system like SDS. In this regard, the use of parallel file systems (e.g., [1]) appears very promising to satisfy the demand for fast access to the data. We note also that another important aspect is the data format itself, since it remains to be verified whether the compressed miniSEED or the uncompressed SAC format is optimal for data reading. Both formats are de facto standards in seismology, but they were designed for data archiving at a time (the late 1980s) when there was no interest in reading the continuous data. For these reasons, it would be worth investigating whether other formats (e.g., HDF5) can offer increased I/O speed.
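As an illustration of the SDS organization just described, a minimal sketch that builds the path of a day volume and reads it with ObsPy (the archive root and station are hypothetical):

```python
import os
from obspy import UTCDateTime, read

def sds_path(root, net, sta, loc, chan, t):
    # SDS layout: <root>/<year>/<net>/<sta>/<chan>.D/
    #             <net>.<sta>.<loc>.<chan>.D.<year>.<julian day>
    return os.path.join(
        root, str(t.year), net, sta, f"{chan}.D",
        f"{net}.{sta}.{loc}.{chan}.D.{t.year}.{t.julday:03d}")

t = UTCDateTime(2009, 4, 6)
path = sds_path("/data/sds", "IV", "AQU", "", "HHZ", t)  # hypothetical root/station
st = read(path)  # miniSEED day volume, format auto-detected by ObsPy
```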

Figure 13 – Main use cases of the data-intensive group of use cases. (For the de-emphasis of “staging” see thedigression made in Fig. 12).

Data indexing with the assignment of persistent identifiers (PIDs) is another important requirement. Digital waveform data are effectively digital objects on which the seismological analyses are performed. Two issues are becoming ever more relevant — the ability to re-use the data, and mining/discovering the data using the associated metadata. For the first, the adoption of a standard approach is important to ensure maximum inter-operability with other communities and certainly within EPOS.


For the second issue — the metadata — it appears that they can be split into at least two main groups. The first group consists of the standard geographical location parameters associated with the station and the instrumental characteristics; the second group instead involves parameters that should be determined automatically from the data, with the purpose of providing descriptive values of the data themselves. This second kind of descriptive parameter can be very relevant, for example, to identify particular types of data without having to ingest the whole waveforms.
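A minimal sketch of how such descriptive metadata could be derived automatically from the waveforms (the field names and file name are illustrative, not a community standard):

```python
import json
import numpy as np
from obspy import read

def describe(trace):
    # Derive simple descriptive parameters from the waveform itself;
    # the field names are illustrative, not a community standard.
    data = trace.data.astype(float)
    return {
        "id": trace.id,
        "start": str(trace.stats.starttime),
        "npts": int(trace.stats.npts),
        "rms": float(np.sqrt(np.mean(data ** 2))),
        "max_abs": float(np.max(np.abs(data))),
    }

st = read("day_volume.mseed")  # hypothetical day-long file
print(json.dumps([describe(tr) for tr in st], indent=2))
```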

Note that one of the main goals of VERCE is to construct a computational platform that will allow the tackling of problems otherwise impossible to address using the standard resources available at the individual institutions. This implies that either Grid resources or a grid-like structure among the VERCE participating institutions is to be built. Overall, the goal is to allow users to run over Grid-distributed datasets the very same analysis they ran locally, and to collect the results at the end. This requires the use and exploitation of a distributed file system over a grid-type architecture.

3.1.3 Data pre-processing

Data pre-processing is essential to any use case and has two primary targets. The first consists of identifying poor-quality data and either removing them altogether or "fixing" the detected problems (e.g., for some applications, data interpolation can be used when small data gaps are encountered). The second target is to process the data for the specific needs of the analysis carried out in the use case (e.g., band-pass filtering, kurtosis processing, pre-whitening, ...). Thus, a main objective of VERCE is to develop a pre-processing macro-module that contains all the signal-analysis routines required by the use cases. The module must also allow maximum flexibility in the sequence in which the various processing steps are applied (a sketch of such a chain follows the list). The routines identified during this initial prioritization analysis include

• Remove mean

Continuous waveform data do often display a DC offset that requires removal.

• Remove trend

More sophisticated removal of the DC offset

• Decimation

After low-pass filtering, waveform data can be decimated to fewer samples per second (sps)

• Resampling

Time shifting and/or increasing the sampling frequency of the time series. This can be achieved either in the time domain by interpolation or in the frequency domain by zero-padding followed by an inverse FFT

• Synchronization

Essential operation that synchronizes all the target time series to the same time base. It can be performed in both the time and frequency domain.

• Tapering

Very important operation required to avoid the introduction of high frequencies resulting fromthe non-zero values at the start and end of the time series.

• Pre-whitening

Whitening of a time series changes its power spectrum such that it is (in theory) flat and any two samples of the whitened time series are uncorrelated.


• Band-, low- and high-pass filtering

Basic operation that passes frequencies within a certain range and rejects (attenuates) frequen-cies outside that range.

• 1-bit normalization

Process of mapping the positive and negative values of the signal onto +/-1 values

• Stretching

Process of extending (stretching) or compressing the waveform, used to reveal changes in the phase of the signal

• Dominant period

Time domain processing that seeks to resolve the dominant period of the signal

• Eigenvalues and eigenvectors for 3C data

Eigenvalues and eigenvectors of the covariance matrix determined from the 3C data for pre-defined windows.

• Gaps and glitches detection

Not really a signal-processing operation; it seeks to verify the signal integrity

• Data gap interpolation

For some processing, the presence of small gaps can be overcome by signal interpolation.

• ....

These and other signal processing algorithms are currently implemented in the proposed use cases.
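A minimal sketch, using ObsPy and NumPy, of how several of the routines above could be chained in a configurable sequence (the input file name is hypothetical; one-bit normalization is applied via NumPy since it is not a built-in Stream method):

```python
import numpy as np
from obspy import read

st = read("day_volume.mseed")               # hypothetical input file
st.merge(fill_value="interpolate")          # data-gap interpolation (small gaps)
st.detrend("demean")                        # remove mean
st.detrend("linear")                        # remove trend
st.taper(max_percentage=0.05)               # tapering
st.filter("bandpass", freqmin=0.05, freqmax=1.0)  # band-pass filtering
st.decimate(factor=5)                       # decimation (anti-alias filter included)
for tr in st:
    tr.data = np.sign(tr.data)              # 1-bit normalization
```

The order of these steps is use-case dependent, which is exactly why the pre-processing macro-module must allow the sequence to be configured rather than fixed.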

3.1.4 Data Analysis

In the following, the basic data-analysis steps are only outlined, since dedicated documents for each use case will provide the required detail during the course of the project. For completeness, however, a general description is provided here of the successive steps that each basic use case entails. In general, and as described in the previous section, the macro-module scheme of Figure 13 indicates that, after the data pre-processing macro-module common to all the use cases, the VERCE partnership proposes essentially two basic analysis pilot applications — ambient noise cross-correlation and event detection/location analysis.

Ambient noise cross-correlation The cross-correlation analysis of noise spawns two types of analysis that can be loosely termed "static" and "dynamic". The former seeks to determine the average pseudo-Green's functions between pairs of stations; the dynamic one strives to detect temporal variations of the pseudo-Green's functions. The main aim of both analyses is to study the properties of the velocity structure and its variations through time. In both cases, the noise cross-correlations are stored and then stacked as a whole (or in groups through time) depending on the type of analysis. This stacking process — though less demanding than the cross-correlation analysis — still requires optimized access to the file system on which the individual cross-correlation files are stored. When the data are stored across storage servers located at different sites, the bandwidth of the connection and the type of file system adopted become ever more decisive. The workflow of Figure 14 shows in stages 4 to 6 the calculation of the correlations and their different stacking strategies, depending on whether "static" tomographic images or "dynamic" velocity-model variations are sought.
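A minimal sketch of the core computation — frequency-domain cross-correlation of two pre-processed, synchronized traces, followed by a "static" stack (synthetic noise stands in for real day-long records; this is not the WHISPER code):

```python
import numpy as np

def cross_correlate(a, b, max_lag):
    # Frequency-domain cross-correlation of two pre-processed, synchronized
    # traces; returns lags from -max_lag to +max_lag samples.
    nfft = 1 << (len(a) + len(b) - 1).bit_length()
    cc = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
    return np.roll(cc, max_lag)[: 2 * max_lag + 1]

# "Static" analysis: stack daily correlations into a pseudo-Green's function
# (synthetic noise stands in for real pre-processed day-long records).
rng = np.random.default_rng(0)
days = [(rng.standard_normal(86400), rng.standard_normal(86400)) for _ in range(5)]
green = np.mean([cross_correlate(a, b, 600) for a, b in days], axis=0)
print(green.shape)  # (1201,) -> lags -600..+600 samples
```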


Figure 14 – Noise cross-correlation full workflow: from the raw data downloading to tomographic images and velocity variation models.

The VERCE partnership is very experienced in this use case ([8, 7]). The most evolved development from the software engineering perspective is that of ISTERRE, achieved within the WHISPER project (http://code-whisper.isterre.fr/html/). The development has been carried out in Python, availing of the scientific libraries (numpy, scipy and matplotlib) and of the seismological library obspy (http://obspy.org). Developments in Python have also been carried out by the INGV group, based on an initial development in MATLAB.
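As an illustration of the “static” versus “dynamic” stacking described above, the following toy sketch (not the WHISPER code; the array sizes and the 30-day window are arbitrary assumptions) correlates day-long records for one station pair and stacks the results in the two modes.

```python
# Toy sketch of daily cross-correlations and the two stacking modes.
import numpy as np
from scipy.signal import correlate

def daily_cc(a, b, max_lag):
    """Full cross-correlation of two synchronized day records,
    trimmed to +/- max_lag samples around zero lag."""
    cc = correlate(a, b, mode="full")
    zero = len(a) - 1                       # index of zero lag
    return cc[zero - max_lag: zero + max_lag + 1]

rng = np.random.default_rng(1)
# 60 "days" of synthetic noise for one station pair (1 Hz for brevity)
days = [(rng.standard_normal(86400), rng.standard_normal(86400))
        for _ in range(60)]

ccs = np.array([daily_cc(a, b, max_lag=600) for a, b in days])

static_stack = ccs.mean(axis=0)             # average pseudo-Green's function
# "dynamic": sliding 30-day stacks to track temporal velocity variations
dynamic_stacks = np.array([ccs[i:i + 30].mean(axis=0)
                           for i in range(len(ccs) - 30 + 1)])
```

In a production workflow each daily correlation would be written to disk and the stacks recomputed from the stored files, which is precisely why the file-system throughput discussed above matters.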

Event detection and location The second analysis targets the whole series of operations that, starting from seismic event detection in the continuous waveforms, leads to the high-resolution location of the same events. This is effectively a suite of macro-modules linked to one another through a central database (DB) that stores the results obtained in the analysis (Figure 15). Although the DB can be populated with the event data in various manners (e.g., from an earthquake catalogue), a particular feature of this use case is the event detection part, which involves WaveLoc, an algorithm that stacks the origin times resulting from matching the pre-processed waveforms with pre-defined reference signals. In WaveLoc, the best location is found from the histogram of the origin times. The algorithm is geared toward detecting diverse types of events, since the recorded waveform can be matched to pre-defined reference signals. The central DB thus stores all the event window waveforms, and a number of operations can be automatically or manually triggered on the stored data. A typical sequence of analysis involves the following steps: i.) event detection and initial location (WaveLoc); ii.) insertion of the data time windows of the detected events in the DB; iii.) calculation of the magnitude of the events; iv.) P- and S-wave onset phase picking based on the initial detection/location; v.) location using NonLinLoc ([3], [2]; http://www.alomax.net/nlloc); vi.) high-resolution, relative event locations (e.g., [9], [6], [4]).
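The histogram-of-origin-times idea can be illustrated with the following back-projection toy, a deliberately simplified analogue in the spirit of WaveLoc rather than the actual implementation; the grid, the travel times and the detection lists are all synthetic assumptions.

```python
# Toy origin-time stacking in the spirit of WaveLoc (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
n_nodes = 50
stations = ["ST%d" % i for i in range(8)]

# Hypothetical predicted travel times (s) per station and grid node.
tt = {s: rng.uniform(2.0, 30.0, n_nodes) for s in stations}

# Synthesize detections from a "true" event at node 17, origin time 100 s,
# plus a few random false detections per station.
true_node, t0 = 17, 100.0
det = {s: np.concatenate(([t0 + tt[s][true_node]],
                          rng.uniform(0.0, 300.0, 3))) for s in stations}

bins = np.arange(0.0, 300.0, 1.0)           # 1 s origin-time bins
best = (-1, None, None)
for node in range(n_nodes):
    # Back-project every detection to a candidate origin time at this node.
    origins = np.concatenate([det[s] - tt[s][node] for s in stations])
    hist, edges = np.histogram(origins, bins=bins)
    if hist.max() > best[0]:
        best = (hist.max(), node, edges[hist.argmax()])

print(best)   # expected: stack count 8 at node 17, origin time ~100 s
```

The node and origin-time bin where the detections stack most coherently yield the best detection; real implementations of this idea of course work on waveform-derived characteristic functions rather than ready-made detection times.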

Figure 15 – WaveHRL full workflow: from initial event detection to a series of independent modules linked to the central database

3.1.5 Results post-processing

Both use cases described above generate results that require dedicated post-processing aimed toward visualization of the results and/or additional statistical analysis. The details will be provided in the dedicated use case documents.

3.2 CPU-Intensive

The second big class of use cases is represented by the CPU-intensive pilot cases. These include the forward-model solvers, which are at the core of the use cases. The primary goal of this class of use cases is to generate synthetic seismograms that match the observed ones as closely as possible. The range of applications of forward modeling (and inversion) is almost limitless in seismology, and the impact, both scientific and societal, is equally enormous. For example, the ability to perform accurate simulations of the wavefield at frequencies as high as 10 Hz for a given area is extremely relevant for engineering applications; similarly, the ability to model accurately the ground motion at frequencies as high as 2-3 Hz at continental scale can provide remarkable insights into the complexities of the velocity structure and, after transposing this information into a geological model, important information on the complex geodynamics, for instance, of Europe. Clearly, the accuracy of the reconstructed model and of the synthetics generated depends strongly on various factors. The first, and perhaps the largest, obstacle is posed by our often poor knowledge of the velocity model: regardless of the solver used, poor and inaccurate knowledge of the velocity model makes the replication of the observed data difficult. A second big problem, for larger earthquakes where the finiteness of the seismic source becomes relevant, is our often poor knowledge of the source-time function at high frequencies; recall here that the observed wavefield results from the convolution between the source and the response of the velocity model to a unit impulse (i.e., the Green’s function). One last important factor that can severely affect the reconstructions is the source-receiver geometry provided by the adopted data set. In particular, earthquakes repeat in certain seismically active areas and do not distribute evenly throughout a region. This inhomogeneous spatial distribution of the seismic events conditions the resolution of the target volumes that are the object of the imaging and, unfortunately, little can be done besides waiting for new earthquakes to occur in the poorly sampled areas (or relying on artificial sources).

Concerning the modeling and inversion of seismic data, an issue that has not been taken much into account by seismologists is the quantitative comparison of the adopted models (inverted or based on the available literature). To date, there is no easy way to compare the adopted models in either the model space or the data space (e.g., by calculating differences between the synthetic seismograms according to some criterion and/or norm). [Note also that the same data-space algorithm can be used to determine the differences between observed and synthetic seismograms.]

Turning to computational matters, the current solvers, depending on the scale-length of the problem and its extent (both temporal and geographic), can require CPU times that still make it intractable to obtain results in reasonable times. This problem will, however, become progressively more tractable as the speed and architectures of computers improve through time.

If we combine the arguments outlined above, it is evident that the role of VERCE consists not only of providing the computational platform for intensive wavefield modeling of challenging use cases, but also of providing tools, for example, for comparing models in the data space. We note here that data comparison (misfit) and inverse modeling are intimately linked, and proper assessment of the former is critical for the inversion. Thus, an initial priority for the CPU-intensive forward modeling use case will be the development of a web portal providing access to functionalities such as computing synthetic seismograms for a number of Earth models using different forward solvers. This development is greatly needed since it is expected to provide the users with insights into the different parts of the velocity models that are the target of the comparison. The engineering of these “comparison” functionalities on the VERCE portal is also expected to attract new potential users interested, for example, in comparing model-generated synthetics and observed data. Nevertheless, the final true challenge of the CPU-intensive use cases is carrying out the full adjoint, iterative inversion automatically. This strongly suggests that the comparison activity with the portal service and the full waveform inversion activities should be brought forward simultaneously.

In summary, the CPU-intensive use case(s) should address the following topics in the initial part of the project:

• Implementation of data misfit algorithms for exploration and assessment of models in data space. These algorithms will provide statistics of the misfits and will help to better understand the spatial fidelity of the models to the “true” one (e.g., which model/code/station-set is best, sorting by events and stations, processing steps, frequency band, Earth models and solvers, and according to different misfit functions). Moreover, this activity will foster the development of visualization tools for the misfits, providing geographical and interactive analysis of the Earth model quality. An important by-product of this activity will be the definition of a standard, unified model for comparison. (A minimal sketch of such a misfit table is given after this list.)

• Automation (even if partial) of the adjoint inversion scheme. In this regard, note that although the data input (and output) are small, the intermediate data for the adjoint can be massive (tens of TBs). Another important and decisive aspect is the meshing of the velocity model, which is critical especially because, at each iteration of the adjoint inversion, the mesh needs to be updated on the basis of the perturbations calculated by the inversion.
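As a concrete illustration of the misfit-statistics idea in the first bullet, the sketch below builds a sortable misfit table over (event, station, frequency band) records. The normalized least-squares misfit, the station codes and the bands are all assumptions for illustration.

```python
# Sketch of a sortable data-space misfit table (illustrative records).
import numpy as np

def l2_misfit(obs, syn):
    """Normalized least-squares waveform misfit."""
    return float(np.sum((obs - syn) ** 2) / np.sum(obs ** 2))

rng = np.random.default_rng(3)
records = []
for station in ("AQU", "CII", "VSL"):                # hypothetical stations
    for band in ("0.02-0.05 Hz", "0.05-0.1 Hz"):     # hypothetical bands
        obs = rng.standard_normal(500)
        syn = obs + 0.2 * rng.standard_normal(500)   # imperfect "model"
        records.append({"event": "ev2012-001", "station": station,
                        "band": band, "misfit": l2_misfit(obs, syn)})

# Sort to see which station/band combination the model fits best.
for row in sorted(records, key=lambda r: r["misfit"]):
    print(row["station"], row["band"], round(row["misfit"], 3))
```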

Side issues of importance concern: i.) what needs to be stored permanently and whether it must be properly indexed with unique identifiers; ii.) what metadata to use (this issue should be addressed jointly with JRA1); iii.) within the workflow frame, the level and the stage at which user interaction takes place; iv.) topical issues regarding model meshing, namely the software to adopt for meshing and, for model exchange, whether to adopt standard formats such as HDF5 or netCDF, the identification of all the sources of data (e.g., topography, bathymetry, velocity models), and how to best update the velocity model from the meshing perspective; v.) how to organize a uniform environment for transferring the data and the model to/from the HPC storage (GridFTP, Tsunami UDP Protocol, ...); and vi.) what data pre-processing already under development in the data-intensive use cases can also be adopted by the CPU-intensive use cases.

The motivations described above have led to the conception of two, initially independent, main use cases. The first focuses on the forward calculation of seismograms and the assessment of the models in the data space through comparison of data and synthetics. This use case will rely on the development of a web portal where the various procedures will be made available interactively. The second use case addresses the tomographic inversion part; it is essentially the natural continuation of the former, and for this reason they have been merged into a single use case: Forward Modeling and Inversion.

3.2.1 “Forward Modeling and Inversion: data space comparison”

The aim of this part of the use case is to develop a system that allows the comparison of observed seismograms for earthquakes on a continental scale (here: Europe) with synthetic seismograms for various Earth models via one or more web clients (e.g., http://www.seismicportal.eu). The use case is organized into 4 main blocks (see also Fig. 16).

1. The user chooses a spatio-temporal region of interest (event information and geographic coordinates), as is already possible in http://www.seismicportal.eu.

2. After choosing from available solvers and Earth models, perform a forward simulation of the wavefield.

3. Store the synthetic seismograms for all the recording stations and at phantom stations where the stations are sparse.

4. The resulting database of observed and calculated waveforms should be interactively mined (e.g., multi-parameter misfits, frequency-dependent time misfits, amplitudes, surface wave dispersion, etc.) via a portal, based on Python/Obspy, and the results would be geographically visualized via the web client (a minimal sketch of one such measurement follows this list).
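One of the measurements named in block 4, the frequency-dependent time misfit, can be sketched as follows: both traces are band-pass filtered with a zero-phase FFT mask and the lag of the cross-correlation peak in each band is taken as the time misfit. The signals, bands and sampling are illustrative assumptions.

```python
# Sketch of a frequency-dependent time misfit (illustrative signals/bands).
import numpy as np
from scipy.signal import correlate

def band_time_shift(obs, syn, dt, fmin, fmax):
    freqs = np.fft.rfftfreq(len(obs), dt)
    mask = (freqs >= fmin) & (freqs <= fmax)   # zero-phase band-pass mask
    fo = np.fft.irfft(np.fft.rfft(obs) * mask, n=len(obs))
    fs = np.fft.irfft(np.fft.rfft(syn) * mask, n=len(syn))
    cc = correlate(fo, fs, mode="full")
    lag = cc.argmax() - (len(fo) - 1)          # positive: obs delayed w.r.t. syn
    return lag * dt

dt = 0.1
t = np.arange(0.0, 60.0, dt)
obs = np.exp(-0.5 * (t - 30.0) ** 2)           # "observed" pulse
syn = np.exp(-0.5 * (t - 30.5) ** 2)           # "synthetic" delayed by 0.5 s

for fmin, fmax in [(0.05, 0.1), (0.1, 0.3), (0.3, 1.0)]:
    print(fmin, fmax, band_time_shift(obs, syn, dt, fmin, fmax))  # about -0.5 s
```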


Figure 16 – Main blocks of the “Forward Modeling and Inversion” use case

3.2.2 “Forward Modeling and Inversion: adjoints and model inversion”

This is the second part of the use case and it addresses the progressive update of the model. It is organized in 4 macro-modules: i) the observed data processing and ingestion (red), ii) the design and meshing of geological volumes (blue), iii) the forward simulations (orange) and iv) the iteration of the inversion problems (green) (see Fig. 17). In detail, the first macro-module is composed of a) the selection of the suitable earthquakes (source solutions) and stations inside the simulated volume, b) the download of the corresponding waveforms, and c) the quality-check phase, to discard signals with gaps or a bad signal-to-noise ratio. The meshing macro-module is divided into a) the filtering of the topographic information at the selected scale, b) the acquisition of the major known geological discontinuities in the selected volume, c) the choice of the tomographic model of the region, d) the construction of a hexahedral mesh following the previous constraints, and e) the partitioning of the resulting mesh according to the available computational resources. The forward simulations of the use case will adopt the code SPECFEM3D to simulate the wavefields for the considered events and stations. This phase produces an archive of synthetic seismograms and, for each simulation, the state of the variables at the last time step, which will be used in the next modules.

The inversion macro-module is composed of a) the processing of the data and synthetics (synchronization, deconvolution of the instrument response, filtering in the chosen frequency bands), b) the selection of the time windows suitable for measurements on data and synthetics, c) the measurement of the chosen misfit functions to quantify the difference between observed and synthetic waveforms, d) the definition of the corresponding adjoint sources, e) using SPECFEM3D, the construction, for each event, of the corresponding event kernel by means of the interaction of the forward and adjoint wavefields, f) the aggregation of the event kernels to create the misfit kernels of the model parameters (preconditioning and smoothing can be applied to the misfit kernels), and g) the steplength test in order to update the tomographic model. The procedure then iterates back to the forward simulations in order to successively improve the velocity model of the selected region.
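The loop structure of this scheme (forward runs, misfit, kernels, steplength, model update) can be conveyed with a deliberately tiny, runnable analogue in which a two-parameter “model” with analytic gradients stands in for the SPECFEM3D forward/adjoint simulations and the event kernels; everything here is illustrative, not inversion code.

```python
# Tiny, runnable analogue of the iterative loop: gradient descent on a
# two-parameter "model"; the pieces stand in for the real macro-modules
# (SPECFEM3D forward/adjoint runs, kernel summation, steplength test).
import numpy as np

def forward(model, x):                     # stand-in for the forward solver
    return model[0] * np.sin(x) + model[1] * np.cos(x)

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 200)
data = forward(np.array([1.3, -0.7]), x) + 0.01 * rng.standard_normal(x.size)

model = np.zeros(2)
for iteration in range(50):
    residual = forward(model, x) - data    # "synthetics minus observed"
    misfit = 0.5 * np.sum(residual ** 2)   # chosen misfit function
    # analytic "kernels": gradient of the misfit w.r.t. the model parameters
    gradient = np.array([np.sum(residual * np.sin(x)),
                         np.sum(residual * np.cos(x))])
    step = 1e-3                            # fixed step, in lieu of a steplength test
    model = model - step * gradient        # model update; iterate forward again

print(model)                               # should approach [1.3, -0.7]
```

In the real use case each “gradient” evaluation costs one forward plus one adjoint SPECFEM3D run per event, and the model update additionally implies re-meshing, which is what makes the automation of this loop the central challenge.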


Figure 17 – Main blocks of the “adjoints and model inversion” part of the “Forward Modeling and Inversion” use case

3.2.3 TsuMaps use case

This CPU-intensive use case is quite different since it combines a real-time analysis for event detection and location (Early-Est, http://early-est.rm.ingv.it) with the calculation of tsunami wave propagation using the pCOMCOT solver. The use case is aimed toward tsunami early warning. Figure 18 provides a flow diagram of the use case, which is sub-divided into three parts. The first is the real-time determination of event detection, location, magnitude and a number of tsunami discriminants that carry important information about the source.


Figure 18 – Flow diagram of the TsuMaps use case.

The second part of the flow diagram merges the information provided by Early-Est with the geologic and tectonic knowledge of the area where the earthquake has occurred. The result of this part is the fault mechanism and a first estimate of the fault finiteness and, of course, of the depth, which is an extremely important parameter. The final part of the workflow solves the forward problem of the tsunami modeling.

4 Prioritization and Recommendations

The analysis of the proposed use cases, together with the evaluation work carried out through participation in various meetings and events (EGU and AGU Assemblies; workshops on eScience Applications for Seismology, Edinburgh, March and June 2011; workshop for the Coordination of IT developments in on-going (EC-)projects, EPOS/ORFEUS seismology, February 2012), has led to the following initial prioritization of the use cases:

• Ambient noise cross-correlation

• Forward Modelling and Inversion

• TsuMaps: near real-time forecasting of tsunami wave height

The “ambient noise cross-correlation” is a mature workflow which has recently been software engineered in Python using the Obspy libraries by the CNRS participants in the project (INGV also has expertise with this pilot application). Despite the simplicity of the stages of the use case, much re-thinking must go into the way the huge data sets of available continuous waveform data are analyzed and mined. Particular attention must be paid to selecting file systems that ensure high throughput which, in the context of data-intensive applications, translates into massively parallel data access that has to be handled efficiently by the underlying storage service. Since data-intensive applications like the ambient noise cross-correlation spend considerable time performing I/O, high throughput is an important property that impacts the total application execution time. This results in the need to also support highly parallel data workflows. In particular, the ambient noise cross-correlation consists of multiple phases where data staging from the local storage to the compute resources interleaves with data processing, generating highly parallel data workflows (cf. [5]). This depends also on the granularity of the data set, which in our case is of the order of one-day data files per station channel. Finally, another important issue for the data-intensive use cases is represented by the seismological formats now in use, which have been designed primarily for data archiving. The existence of formats highly optimized for complex data structures, like HDF5, is worth investigating.
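As a pointer in this direction, the sketch below stores a day-long channel as a chunked, compressed HDF5 dataset with h5py; the group layout and the attributes are assumptions made for illustration, not a proposed VERCE or community format.

```python
# Illustrative HDF5 layout for day-long continuous data (h5py).
import h5py
import numpy as np

rng = np.random.default_rng(4)
day = rng.standard_normal(86400 * 20).astype("float32")   # 20 Hz, one day

with h5py.File("noise_archive.h5", "w") as f:
    ds = f.create_dataset("IV/STA/HHZ/2012.001", data=day,
                          chunks=(72000,),                # one-hour chunks
                          compression="gzip")
    ds.attrs["sampling_rate"] = 20.0
    ds.attrs["starttime"] = "2012-01-01T00:00:00Z"

# Chunking allows reading one hour without touching the rest of the day,
# e.g. f["IV/STA/HHZ/2012.001"][0:72000].
```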

The second prioritized use case is “Forward Modelling and Inversion”, a composite workflow consisting of a highly modular, interactive and automatic scheme for gathering data, calculating synthetics and comparing the two. It progressively includes and analyses more data to be used for comparison and inversion, using the adjoint formulation for determining the sensitivity kernels. The amount of observational data used is not very significant, but the adjoint sensitivity kernels are very large. Again, the availability of high-performance parallel file systems for optimized data transfer is envisaged to be important. Two groups, LMU and INGV, are very active in this use case, and a full examination of it is provided in the JRA1 report.

Finally, the TsuMaps use case is again composite: it couples a real-time module that determines earthquake location, magnitude and tsunami discriminants carrying information on the tsunamigenic potential with a tsunami wave-height solver for calculating the impact of the tsunami along the coast. The challenge is to provide the modeling as quickly and as accurately as possible, as a function of the distance of the earthquake from the coast. The use case is CPU-intensive and would require access to HPC resources once a tsunami alert is issued.

In conclusion, all three use cases have the potential to give a strong impulse to data- and compute-intensive applications in computational seismology, and they represent state-of-the-art applications that are important to build upon.


Appendix 1: Use cases survey - questionnaire

The answers to the questionnaire below (in “excel” format) and summary documents providing a concise description of the use cases can be found on the Redmine wiki page of NA2 (http://www.verce-project.eu/projects/verce1/wiki/NA2_Use_Cases_Survey).

1.a) Please give your name

1.b) Please give your affiliation

2.a) Name of the use case

2.b) Purpose of the use case

2.c) How can the use case be classified?

2.d) Describe in detail the use case by listing its main steps

2.e) Based on your experience and extrapolating the current total CPU timings required by the application for this use case, please estimate the CPU time that the use case described above would take on a single server/cluster, and provide an estimate of the gain you would expect from using the VERCE platform.

2.f) If the use case is "data intensive", what is the input data size (in GB) ?

2.g) Have you provided a picture describing the workflow ?

3) Scientific impact of the use case

4) Other types of impact of the use case

6.a) Name of the primary applications/procedures employed in the use case

6.b) Authors of the application and affiliation

7.a) What kind of application is it ?

7.b) Describe concisely what the application does

7.c) Provide references of articles that have adopted the application

8.a) What is the pipeline of the application/procedure ?

8.b) What language(s) is(are) used ?

9.a) Spatial scale of the application ?

9.b) What coordinate system is adopted by the application ?

9.c) What software libraries are used ?

9.d) If the application is data intensive, what data format is it ingesting?

10.a) What processors has the application been tested on?

10.b) How much memory (in GB) per processor does the application require ?

10.c) If the application is data intensive, has it already been set up and run on a cluster as an embarrassingly parallel application?

11.) Operating system upon which the application has been tested ?

12.) Does the application require any graphics libraries (if so, please specify)?

13.a) If the application is data intensive, what size of input data (D) has it originally been designed for?


13.b) If the application is data intensive, what multiple of the input data (D) results as output?

13.c) If the application is data intensive, what multiple of the input data (D) is generated as intermediate temporary results?

13.d) If the application is data intensive, what is the input data size target ?

13.e) Do the data need to be transferred from the "data centers" to the HPC facilities before their analysis?

13.f) How are the data provided ?

14.a) Is the application open source ?

14.b) If the application is not entirely open source, please specify what s/w requires licensing.

15.) Does the application adopt a DB ?

16.) Do the data require any pre-processing ?

17.) Do you think that adoption of metadata summarizing specific features of the input data can be valuable to optimize the data selection?

18.) If profiling of the application was available, what operation/routine is taking the longest execution time?

19.) What kind of output format does the application adopt?

20.) Do you find it important to assign a PID to the results of the analysis for future reference and sharing?

21.a) What is the level of documentation of the application ?

21.b) How many users have tested the application?

22.a) How is the application distributed (or made available)?

22.b) Do you make it available with a test case (mandatory)?

23.) Final comments on the application: strengths, weaknesses, additional developments needed, suggestions, ...


References

[1] P.H. Carns, W.B. Ligon III, R.B. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase & Conference, volume 4, pages 28–28, 2000.

[2] A. Lomax, A. Michelini, and A. Curtis. Earthquake Location, Direct, Global-Search Methods. In Encyclopedia of Complexity and Systems Science, pages 2449–2473. 2009.

[3] A. Lomax, J. Virieux, P. Volant, and C. Berge. Probabilistic earthquake location in 3D and layered models: Introduction of a Metropolis-Gibbs method and comparison with linear locations. In C.H. Thurber and N. Rabinowitz, editors, Advances in Seismic Event Location, pages 101–134, 2000.

[4] Alberto Michelini and Anthony J. Lomax. The effect of velocity structure errors on double-difference earthquake location. Geophysical Research Letters, 31(9), May 2004.

[5] B. Nicolae, G. Antoniu, L. Bougé, D. Moise, and A. Carpen-Amarie. BlobSeer: Next-generation data management for large scale infrastructures. Journal of Parallel and Distributed Computing, 71(2):169–184, 2011.

[6] Andreas Rietbrock and Felix Waldhauser. A narrowly spaced double-seismic zone in the subducting Nazca plate. Geophys. Res. Lett., 31, 2004.

[7] Nikolai M. Shapiro and Michel Campillo. Emergence of broadband Rayleigh waves from correlations of the ambient seismic noise. Geophys. Res. Lett., 31(7):1615–1619, 2004.

[8] Nikolai M. Shapiro, Michel Campillo, L. Stehly, and Michael H. Ritzwoller. High-resolution surface-wave tomography from ambient seismic noise. Science, 307(5715):1615, 2005.

[9] Felix Waldhauser, William L. Ellsworth, et al. Slip-parallel seismic lineations on the northern Hayward Fault, California. 1999.


Glossary and Links

3C Three-component (e.g., a seismometer recording ground motion along three dimensions).

cross-correlation In signal processing, cross-correlation is a measure of similarity of two waveforms as a function of a time lag applied to one of them.
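For sampled waveforms the discrete analogue is used; in continuous form the definition reads:

```latex
% Cross-correlation of x(t) and y(t) as a function of the lag \tau:
C_{xy}(\tau) = \int_{-\infty}^{+\infty} x(t)\, y(t+\tau)\,\mathrm{d}t
```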

cpu-intensive applications Compute-intensive applications are those that devote most of their execution time to computational requirements and typically require small volumes of data, although they can produce very large to huge data volumes. Compute-intensive is a term that applies to any computer application that demands a lot of computation, such as forward modeling programs for seismic wave propagation or other scientific applications.

data Any digitally encoded information that can be stored, processed and transmitted by computers. Includes text files, database records, images, video sequences and recordings. In seismology, it is customary to refer to data as the digital values acquired by the seismometer/data-logger pair deployed in the field to record the ground motion that is the object of the analysis.

data-intensive applications Data-intensive applications are those that are used to process large volumes of data, typically of the order of terabytes or petabytes in size and referred to as Big Data.

data staging Indicates the process of moving (stage-in) the data to the site of execution. In some cases, a data stage-out is specified to download the results and clean up the data after execution. In a use case, there may be several such data staging activities, which could occur sequentially or in parallel.

ENVRI “Implementation of common solutions for a cluster of ESFRI infrastructures in the field of Environmental Sciences” is a project currently funded by the EC.

EPOS “European Plate Observing System” is an ESFRI approved infrastructure currently in its preparatory phase and funded by the EC (http://www.epos-eu.org).

EUDAT EUropean DATa is a project currently funded by the EC for the development of the Common Data Interface (http://www.eudat.eu).

HDF5 Hierarchical Data Format (HDF, HDF4, or HDF5) is the name of a set of file formats and libraries designed to store and organize large amounts of numerical data (http://www.hdfgroup.org/HDF5/).

JRA1 WP8

JRA2 WP9

miniSEED The miniSEED format is a subformat of the commonly used SEED data format, used for archiving seismological data.

NA2 WP2

NA3 WP3

NA4 WP4

PID — Persistent Identifier A persistent identifier is a permanent, location-independent and globally unique identifier for a resource. Persistent identifiers are generally assigned by agencies who undertake to provide reliable, long-term access to resources. Examples of persistent identifiers include Digital Object Identifiers, Uniform Resource Names, Handles and Archival Resource Keys.


pilot application Main software routine within a use case (e.g., the cross-correlation analysis in the use case addressing the velocity variations of the Italian peninsula crust properties).

pre-processing One or more operations performed on the observed data to prepare them for the analysis and/or for performing quality control checks.

SA1 WP5

SA2 WP6

SA3 WP7

SPECFEM3D A simulation software code based on the spectral-element method for 3D seismic wave propagation (http://www.seg.ethz.ch/software/specfem3D).

use case In software and systems engineering, a use case is a list of steps, typically defining interactions between a role and a system, to achieve a goal. The actor can be a human or an external system (cf. http://en.wikipedia.org/wiki/Use_case). In VERCE it is assumed to represent the entire scientific application (e.g., analysis of the noise cross-correlation of the Italian seismic networks over a 6-year period to detect temporal variations of the crust material properties).

workflow A process composed of data-handling tasks, computational tasks and human interactions, intended to implement a research method or established working practice.
