BIG DATA - inf.uniroma3.ittorlone/bigdata/S7-Cern.pdfstreaming (Spark or Flume) or Sqoopjobs Extraction Transformation and Load (or ELT) procedures are running daily for each dataset

BIG DATA: MY EXPERIENCE AT CERN
LUCA MENICHETTI

Università Roma Tre, Big Data, 6 June 2016

CERN


Large Hadron Collider


Experiments


Events


Tier 0 (CERN Computing Centre): Data Recording & Offline Analysis

(for each experiment…)

Data Flow


Storage

200-400 MB/sec

Data flow to permanent storage: 4-6 GB/sec

1.25 GB/sec

1-2 GB/sec

1-2 GB/sec

Reconstruction and archival


Tiers - WLCG

Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution

Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis

Tier-2 (~130 centres):
• Simulation
• End-user analysis

WLCG


Hadoop
◦ Experiments and IT services running 24/7
◦ Millions of jobs submitted daily to the grid (physicists)
◦ Monitoring data for each service are collected and properly stored independently
◦ Cross-project analysis activities are coordinated by working groups that often share a common platform on which to dump data and run jobs (IT)
◦ Among these: Hadoop Service provided by CERN IT
  ◦ A common repository (data lake)
  ◦ A production environment for other services


Main activities
◦ Service provider
  ◦ Cluster(s) maintenance (ROTA)
  ◦ Framework/Applications troubleshooting
  ◦ Analysis environment configuration (clients)
  ◦ External service integration (from the transport layer up to the UI)
  ◦ …
◦ Data analysis
  ◦ Mainly about resource utilization and job performance
  ◦ File format and framework evaluation
  ◦ User support
  ◦ …


Data Flow: ETL
◦ Data are stored in HDFS using REST APIs, streaming (Spark or Flume) or Sqoop jobs
◦ Extraction, Transformation and Load (or ELT) procedures run daily for each dataset
◦ Results are CSV, JSON, Avro, Parquet, …
◦ Each dataset can be present more than once
  ◦ Written with different technologies or formats
  ◦ Merged with other datasets (denormalization)
  ◦ Written with fewer or more fields
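
The idea that one dataset may exist several times, in different formats and with fewer or more fields, can be illustrated with a small sketch. It uses only the Python standard library, not the Spark, Flume or Sqoop jobs named on the slide. The records, the field names (`job_id`, `host`, `cpu_s`, `wall_s`) and the helper functions are hypothetical, chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical raw job records, standing in for data extracted from a source system.
RAW_RECORDS = [
    {"job_id": "1", "host": "node-a", "cpu_s": "90", "wall_s": "100"},
    {"job_id": "2", "host": "node-b", "cpu_s": "30", "wall_s": "120"},
]


def transform(record):
    """Transform step: cast numeric fields so consumers need not parse strings."""
    return {
        "job_id": int(record["job_id"]),
        "host": record["host"],
        "cpu_s": float(record["cpu_s"]),
        "wall_s": float(record["wall_s"]),
    }


def to_csv(records, fields):
    """Load step, variant 1: serialize the chosen fields as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json_lines(records, fields):
    """Load step, variant 2: one JSON object per line, restricted to the chosen fields."""
    return "\n".join(json.dumps({f: r[f] for f in fields}) for r in records)


clean = [transform(r) for r in RAW_RECORDS]

# The same dataset written twice: all fields as CSV, a narrower projection as JSON lines.
full_csv = to_csv(clean, ["job_id", "host", "cpu_s", "wall_s"])
narrow_json = to_json_lines(clean, ["job_id", "cpu_s"])
```

In a real cluster the two outputs would be files in HDFS rather than in-memory strings; the sketch only shows why several copies of a dataset can legitimately coexist.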


Data Analysis within Hadoop
◦ How to answer your question?
◦ Different frameworks and tools can be used, depending on the use case:
  ◦ Data size
  ◦ Frequency
  ◦ Number of fields per record
  ◦ Final result


Apache Spark
◦ Fast (in-memory approach)
◦ Easy to learn and to use
◦ RDD and DataFrame bring the focus onto the dataset
◦ Spark Components!
  ◦ Spark SQL, MLlib, GraphX

Analysis example - workflow

Dashboard (experiment jobs)

LSF (batch jobs)

LanDB (host info)

Sqoop

Flume

Analysis Examples
◦ Job efficiency: Wigner vs Geneva
  ◦ Spark, Python (Pandas)
◦ Memory profiling
  ◦ Spark (SQL)
◦ Data popularity (block replicas location)
  ◦ Pig, Spark (SQL, GraphX)
◦ Job monitoring system discrepancy analysis
  ◦ Spark, Python (Pandas)
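
A job-efficiency comparison between two sites could look roughly like the sketch below. It rests on assumptions not stated in the slides: efficiency is taken as total CPU time divided by total wall-clock time, and the site of each job is obtained by joining job records with a host-to-site lookup. The data, the field names and the `efficiency_by_site` helper are hypothetical, and plain Python is used here in place of Spark or Pandas.

```python
from collections import defaultdict

# Hypothetical host-to-site lookup, standing in for host information.
HOST_SITE = {"node-a": "Wigner", "node-b": "Geneva", "node-c": "Geneva"}

# Hypothetical batch job records (times in seconds).
JOBS = [
    {"host": "node-a", "cpu_s": 80.0, "wall_s": 100.0},
    {"host": "node-a", "cpu_s": 40.0, "wall_s": 100.0},
    {"host": "node-b", "cpu_s": 90.0, "wall_s": 100.0},
    {"host": "node-c", "cpu_s": 150.0, "wall_s": 200.0},
]


def efficiency_by_site(jobs, host_site):
    """Join jobs with host info, then return total CPU / total wall-clock per site.

    Assumption: efficiency is defined as CPU time over wall-clock time.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # site -> [cpu_sum, wall_sum]
    for job in jobs:
        site = host_site.get(job["host"], "unknown")  # the denormalization step
        totals[site][0] += job["cpu_s"]
        totals[site][1] += job["wall_s"]
    # Skip sites with no recorded wall-clock time to avoid division by zero.
    return {site: cpu / wall for site, (cpu, wall) in totals.items() if wall > 0}


result = efficiency_by_site(JOBS, HOST_SITE)
for site, eff in sorted(result.items()):
    print(f"{site}: {eff:.2f}")
```

Summing CPU and wall-clock time before dividing weights long jobs more heavily than averaging per-job ratios would; which of the two is appropriate depends on the question being asked.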


Web Notebooks


◦ It is “an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media” [http://ipython.org/notebook.html]

IPython / Jupyter


Jupyter example - matplotlib


Zeppelin

Zeppelin – Example DF

Zeppelin – Example Chart

[Plot labels: WallClock, CPU; circle size: job duration]

Zeppelin – Example Plot

The end

