BIG DATA: MY EXPERIENCE AT CERN
LUCA MENICHETTI
Università Roma Tre, Big Data, 6 June 2016

BIG DATA - inf.uniroma3.ittorlone/bigdata/S7-Cern.pdfstreaming (Spark or Flume) or Sqoopjobs Extraction Transformation and Load (or ELT) procedures are running daily for each dataset

CERN

Large Hadron Collider

Experiments

Events

Data Flow

Tier 0 (CERN Computing Centre): Data Recording & Offline Analysis

(for each experiment…)

Storage

Data flow to permanent storage: 4-6 GB/sec

Other rates labelled in the diagram: 200-400 MB/sec, 1.25 GB/sec, 1-2 GB/sec, 1-2 GB/sec
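
To give a sense of scale, the 4-6 GB/sec figure can be converted into a daily volume. The sketch below assumes the rate is sustained for a full 24 hours and uses decimal units (1 TB = 1000 GB); both are simplifying assumptions, not statements from the slides, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_gb_per_sec: float) -> float:
    """Convert a sustained rate in GB/sec into TB/day (decimal units)."""
    return rate_gb_per_sec * SECONDS_PER_DAY / 1000


# Assumption: the quoted rate holds around the clock.
low = daily_volume_tb(4)
high = daily_volume_tb(6)
print(f"{low:.1f} to {high:.1f} TB/day")
```

Under these assumptions the range works out to roughly 346 to 518 TB per day.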

Reconstruction and archival

6/8/16 DOCUMENTREFERENCE 8

Tiers - WLCG

Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution

Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis

Tier-2 (~130 centres):
• Simulation
• End-user analysis

WLCG

Hadoop
◦ Experiments and IT services running 24/7
◦ Millions of jobs submitted daily in the grid (physicists)
◦ Monitoring data for each service are collected and properly stored independently
◦ Cross-project analysis activities are coordinated by working groups that often share a common platform on which to dump data and run jobs (IT)
◦ Among these: Hadoop Service provided by CERN IT
  ◦ A common repository (data lake)
  ◦ A production environment for other services

Main activities
◦ Service provider
  ◦ Cluster(s) maintenance (ROTA)
  ◦ Framework/Applications troubleshooting
  ◦ Analysis environment configuration (clients)
  ◦ External service integration (from the transport layer up to the UI)
  ◦ …
◦ Data analysis
  ◦ Mainly about resource utilization and job performance
  ◦ File format and framework evaluation
  ◦ User support
  ◦ …

Data Flow: ETL
◦ Data are stored in HDFS using REST APIs, streaming (Spark or Flume) or Sqoop jobs
◦ Extraction, Transformation and Load (or ELT) procedures run daily for each dataset
◦ Results are CSV, JSON, Avro, Parquet, …
◦ Each dataset can be present more than once
  ◦ Written with different technologies or formats
  ◦ Merged with other datasets (denormalization)
  ◦ Written with fewer or more fields
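
The extract, transform and load steps listed above can be sketched in a few lines of plain Python. This is only an illustration: the field names, the sample records and the host lookup table are all hypothetical, and a real job would read from and write to HDFS rather than in-memory strings.

```python
import csv
import io
import json

# Hypothetical raw inputs; field names are invented for illustration.
RAW_JOBS_CSV = """job_id,host,cpu_sec,wall_sec
1,node01,90,100
2,node02,30,120
"""
HOST_INFO = {"node01": {"site": "Geneva"}, "node02": {"site": "Wigner"}}


def extract(csv_text):
    """Extract: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(records, host_info):
    """Transform: cast numeric fields and merge in host attributes
    (a small instance of denormalization)."""
    out = []
    for rec in records:
        row = {
            "job_id": int(rec["job_id"]),
            "host": rec["host"],
            "cpu_sec": float(rec["cpu_sec"]),
            "wall_sec": float(rec["wall_sec"]),
        }
        row.update(host_info.get(rec["host"], {}))
        out.append(row)
    return out


def load(records):
    """Load: serialize as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


json_lines = load(transform(extract(RAW_JOBS_CSV), HOST_INFO))
print(json_lines)
```

Writing the same records a second time with a different `load` function (for example, one that emits CSV) is how a dataset ends up present more than once in different formats.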

Data Analysis within Hadoop
◦ How to answer your question?
◦ Different frameworks and tools can be used, depending on the use case:
  ◦ Data size
  ◦ Frequency
  ◦ Number of fields per record
  ◦ Final result

Apache Spark
◦ Fast (in-memory approach)
◦ Easy to learn and to use
◦ RDD and DataFrame bring the focus onto the dataset
◦ Spark Components!
  ◦ Spark SQL, MLlib, GraphX

Analysis example - workflow

Diagram labels: Dashboard (experiment jobs), LSF (batch jobs), LanDB (host info), Sqoop, Flume

Analysis Examples
◦ Job efficiency: Wigner vs Geneva
  ◦ Spark, Python (Pandas)
◦ Memory profiling
  ◦ Spark (SQL)
◦ Data popularity (block replicas location)
  ◦ Pig, Spark (SQL, GraphX)
◦ Job monitoring system discrepancy analysis
  ◦ Spark, Python (Pandas)
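
As an illustration of the first example, job efficiency is commonly taken as CPU time divided by wall-clock time. The slides give neither the definition nor the data, so both the formula and the records below are assumptions; this plain-Python sketch stands in for the Spark and Pandas code mentioned above and does not reproduce it.

```python
from collections import defaultdict

# Hypothetical job records: (site, cpu_seconds, wall_clock_seconds).
JOBS = [
    ("Geneva", 90.0, 100.0),
    ("Geneva", 60.0, 100.0),
    ("Wigner", 40.0, 100.0),
    ("Wigner", 80.0, 100.0),
]


def efficiency_by_site(jobs):
    """Return total CPU time / total wall-clock time for each site."""
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for site, cpu_sec, wall_sec in jobs:
        cpu[site] += cpu_sec
        wall[site] += wall_sec
    # Skip sites with no wall-clock time to avoid dividing by zero.
    return {site: cpu[site] / wall[site] for site in wall if wall[site] > 0}


result = efficiency_by_site(JOBS)
for site, eff in sorted(result.items()):
    print(f"{site}: {eff:.2f}")
```

Summing the totals before dividing weights each job by its duration; averaging per-job ratios instead would be an equally plausible choice and can give a different ranking.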

Web Notebooks

◦ It is “an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media” [http://ipython.org/notebook.html]

IPython / Jupyter

Jupyter example - matplotlib

Zeppelin

Zeppelin – Example DF

Zeppelin – Example Chart

Zeppelin – Example Plot

Plot labels: WallClock, CPU; circle size: job duration

The end
