BIG DATA: MY EXPERIENCE AT CERN
LUCA MENICHETTI
Università Roma Tre, Big Data, 6 June 2016
CERN
Large Hadron Collider
Experiments
Events
Tier 0 (CERN Computing Centre): Data Recording & Offline Analysis
(for each experiment…)
Data Flow
Storage
200-400 MB/sec
Data flow to permanent storage: 4-6 GB/sec
1.25 GB/sec
1-2 GB/sec
1-2 GB/sec
Reconstruction and archival
6/8/16 DOCUMENTREFERENCE 8
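To put the rates above in perspective, the sketch below converts a sustained rate in GB/sec into a daily volume in TB. It assumes decimal units (1 TB = 1000 GB) and a rate held constant for a full 24 hours, which is an idealization for illustration only; the function name `daily_volume_tb` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_gb_per_sec: float) -> float:
    """Volume in TB written in one day at a constant rate.

    Assumption: decimal units, 1 TB = 1000 GB.
    """
    return rate_gb_per_sec * SECONDS_PER_DAY / 1000


# Range quoted for the data flow to permanent storage: 4-6 GB/sec.
low = daily_volume_tb(4)
high = daily_volume_tb(6)
print(f"{low:.1f} - {high:.1f} TB/day")
```

Under these assumptions, 4-6 GB/sec corresponds to a few hundred TB per day.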
Tiers - WLCG
Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution
Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis
Tier-2 (~130 centres):
• Simulation
• End-user analysis
WLCG
Hadoop
◦ Experiments and IT services running 24/7
◦ Millions of jobs submitted daily in the grid (physicists)
◦ Monitoring data for each service are collected and properly stored independently
◦ Cross-project analysis activities are coordinated by working groups that often share a common platform where they dump data and run jobs (IT)
◦ Among these: Hadoop Service provided by CERN IT
  ◦ A common repository (data lake)
  ◦ A production environment for other services
Main activities
◦ Service provider
  ◦ Cluster(s) maintenance (ROTA)
  ◦ Framework/Applications troubleshooting
  ◦ Analysis environment configuration (clients)
  ◦ External service integration (from the transport layer to the UI)
  ◦ …
◦ Data analysis
  ◦ Mainly about resource utilization and job performance
  ◦ File format and frameworks evaluation
  ◦ User support
  ◦ …
Data Flow: ETL
◦ Data are stored in HDFS using REST APIs, streaming (Spark or Flume) or Sqoop jobs
◦ Extraction, Transformation and Load (or ELT) procedures run daily for each dataset
◦ Results are CSV, JSON, Avro, Parquet, …
◦ Each dataset can be present more than once
  ◦ Written with different technologies or formats
  ◦ Merged with other datasets (denormalization)
  ◦ Written with fewer or more fields
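The extract, transform and load steps listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the procedures actually run at CERN: the CSV content, the field names (`job_id`, `host`, `cpu_sec`, `wall_sec`, `site`) and the host lookup table are all hypothetical, and the data stay in memory instead of going to HDFS.

```python
import csv
import io
import json

# Hypothetical raw extract: one CSV row per job (field names are illustrative).
RAW_JOBS_CSV = """job_id,host,cpu_sec,wall_sec
1,node01,90,100
2,node02,30,120
3,node01,45,50
"""

# Hypothetical lookup dataset to merge in (for example, host information).
HOST_INFO = {
    "node01": {"site": "Geneva"},
    "node02": {"site": "Wigner"},
}


def extract(csv_text):
    """Extract: parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows, host_info):
    """Transform: cast numeric fields and denormalize by merging host info."""
    out = []
    for row in rows:
        record = {
            "job_id": int(row["job_id"]),
            "host": row["host"],
            "cpu_sec": float(row["cpu_sec"]),
            "wall_sec": float(row["wall_sec"]),
        }
        # Denormalization: copy the host attributes into every job record.
        record.update(host_info.get(row["host"], {}))
        out.append(record)
    return out


def load(records):
    """Load: serialize as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


records = transform(extract(RAW_JOBS_CSV), HOST_INFO)
json_lines = load(records)
print(json_lines)
```

The same dataset now exists twice, once as CSV and once as denormalized JSON, which mirrors the point that a dataset can be present more than once in different formats.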
Data Analysis within Hadoop
◦ How to answer your question?
◦ Different frameworks and tools can be used, depending on the use case:
  ◦ Data size
  ◦ Frequency
  ◦ Number of fields per record
  ◦ Final result
Apache Spark
◦ Fast (in-memory approach)
◦ Easy to learn and to use
◦ RDD and DataFrame bring the focus onto the dataset
◦ Spark Components!
  ◦ Spark SQL, MLlib, GraphX
Analysis example - workflow
Dashboard (experiment jobs)
LSF (batch jobs)
LanDB (host info)
Sqoop
Flume
Analysis Examples
◦ Job efficiency, Wigner vs Geneva
  ◦ Spark, Python (Pandas)
◦ Memory profiling
  ◦ Spark (SQL)
◦ Data popularity (block replica location)
  ◦ Pig, Spark (SQL, GraphX)
◦ Job monitoring system discrepancy analysis
  ◦ Spark, Python (Pandas)
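As an illustration of the first example, the sketch below compares job efficiency between two sites. It assumes efficiency is defined as total CPU time divided by total wall-clock time, which is a common convention but is not stated on the slides. The job records are made up, and plain Python is used here instead of Spark or Pandas to keep the sketch self-contained.

```python
from collections import defaultdict

# Hypothetical job records: (site, cpu_seconds, wall_clock_seconds).
JOBS = [
    ("Geneva", 90.0, 100.0),
    ("Geneva", 45.0, 50.0),
    ("Wigner", 30.0, 120.0),
    ("Wigner", 60.0, 80.0),
]


def efficiency_by_site(jobs):
    """Return {site: total CPU time / total wall-clock time}.

    Assumption: this ratio is the efficiency metric of interest.
    """
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for site, cpu_sec, wall_sec in jobs:
        cpu[site] += cpu_sec
        wall[site] += wall_sec
    # Skip sites with zero wall-clock time to avoid division by zero.
    return {site: cpu[site] / wall[site] for site in wall if wall[site] > 0}


result = efficiency_by_site(JOBS)
for site, eff in sorted(result.items()):
    print(f"{site}: {eff:.2f}")
```

Summing CPU and wall-clock time per site before dividing weights long jobs more heavily than averaging per-job ratios would; which of the two is appropriate depends on the question being asked.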
Web Notebooks
IPython / Jupyter
◦ It is “an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media” [http://ipython.org/notebook.html]
Jupyter example - matplotlib
Zeppelin
Zeppelin - Example DF
Zeppelin - Example Chart
Zeppelin - Example Plot
[Plot labels: Wall Clock, CPU; circle size: job duration]
The end