BIG DATA - inf.uniroma3.ittorlone/bigdata/S7-Cern.pdfstreaming (Spark or Flume) or Sqoopjobs...

BIG DATA
MY EXPERIENCE AT CERN
LUCA MENICHETTI

Università Roma Tre, Big Data, 6 June 2016

Large Hadron Collider

Experiments

Events

Tier 0 (CERN Computing Centre): Data Recording & Offline Analysis

(for each experiment…)

Data Flow

Storage

200-400 MB/sec

Data flow to permanent storage: 4-6 GB/sec

1.25 GB/sec

1-2 GB/sec

1-2 GB/sec

Reconstruction and archival

Tiers - WLCG

Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution

Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis

Tier-2 (~130 centres):
• Simulation
• End-user analysis

Hadoop
◦ Experiments and IT services running 24/7
◦ Millions of jobs submitted daily in the grid (physicists)
◦ Monitoring data for each service are collected and properly stored independently
◦ Cross-project analysis activities are coordinated by working groups that often share a common platform on which to dump data and run jobs (IT)
◦ Among these: Hadoop Service provided by CERN IT
  ◦ A common repository (data lake)
  ◦ A production environment for other services

Main activities
◦ Service provider
  ◦ Cluster(s) maintenance (ROTA)
  ◦ Framework/application troubleshooting
  ◦ Analysis environment configuration (clients)
  ◦ External service integration (from the transport layer up to the UI)
  ◦ …
◦ Data analysis
  ◦ Mainly about resource utilization and job performance
  ◦ File format and framework evaluation
  ◦ User support
  ◦ …

Data Flow: ETL
◦ Data are stored in HDFS using REST APIs, streaming (Spark or Flume) or Sqoop jobs
◦ Extraction, Transformation and Load (or ELT) procedures run daily for each dataset
◦ Results are CSV, JSON, Avro, Parquet, …
◦ Each dataset can be present more than once
  ◦ Written with different technologies or formats
  ◦ Merged with other datasets (denormalization)
  ◦ Written with fewer or more fields
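The extract, transform and load steps, including the merge with another dataset (denormalization), can be sketched in a few lines. This is a minimal illustration in plain Python, not the CERN pipeline: the field names (`job_id`, `host`, `cpu_s`, `wall_s`, `site`) and the sample rows are hypothetical, and JSON Lines stands in for any of the output formats listed on the slide.

```python
import csv
import io
import json

# Hypothetical raw inputs (assumption); on the real service such data
# would land in HDFS via REST APIs, streaming or Sqoop jobs.
JOBS_CSV = """job_id,host,cpu_s,wall_s
1,node01,90,100
2,node02,30,60
3,node01,45,50
"""

HOSTS_CSV = """host,site
node01,Geneva
node02,Wigner
"""


def extract(text):
    """Extract: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(jobs, hosts):
    """Transform: cast numeric fields and merge the host info into
    each job record (denormalization)."""
    site_by_host = {h["host"]: h["site"] for h in hosts}
    out = []
    for job in jobs:
        out.append({
            "job_id": int(job["job_id"]),
            "host": job["host"],
            # Hosts missing from the lookup table are marked "unknown".
            "site": site_by_host.get(job["host"], "unknown"),
            "cpu_s": float(job["cpu_s"]),
            "wall_s": float(job["wall_s"]),
        })
    return out


def load(records):
    """Load: serialize the records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


records = transform(extract(JOBS_CSV), extract(HOSTS_CSV))
jsonl = load(records)
print(jsonl)
```

Writing the same records a second time in another format, or with a different subset of fields, only requires swapping the `load` function, which is one way a dataset ends up present more than once.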

Data Analysis within Hadoop
◦ How to answer your question?
◦ Different frameworks and tools can be used, depending on the use case:
  ◦ Data size
  ◦ Frequency
  ◦ Number of fields per record
  ◦ Final result

Apache Spark
◦ Fast (in-memory approach)
◦ Easy to learn and to use
◦ RDD and DataFrame bring the focus on the dataset
◦ Spark components!
  ◦ Spark SQL, MLlib, GraphX

Analysis example - workflow

Dashboard (experiment

LSF (batch jobs)

LanDB (host info)

Sqoop, Flume

Analysis Examples
◦ Job efficiency: Wigner vs Geneva
  ◦ Spark, Python (Pandas)
◦ Memory profiling
  ◦ Spark (SQL)
◦ Data popularity (block replicas location)
  ◦ Pig, Spark (SQL, GraphX)
◦ Job monitoring system discrepancy analysis
  ◦ Spark, Python (Pandas)
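As an illustration of the first example, a job-efficiency comparison between two sites could look like the sketch below. It is written in plain Python rather than Spark or Pandas so that it stays self-contained. The definition of efficiency as CPU time divided by wall-clock time, the record layout and the sample values are all assumptions, not figures from the talk.

```python
from collections import defaultdict

# Hypothetical job records: (site, CPU seconds, wall-clock seconds).
JOBS = [
    ("Geneva", 90.0, 100.0),
    ("Geneva", 45.0, 50.0),
    ("Wigner", 30.0, 60.0),
    ("Wigner", 70.0, 100.0),
]


def efficiency_by_site(jobs):
    """Return {site: total CPU time / total wall-clock time}.

    This definition of efficiency is an assumption. Jobs with a
    non-positive wall-clock time are skipped to avoid dividing by zero.
    """
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for site, cpu_s, wall_s in jobs:
        if wall_s <= 0:
            continue
        cpu[site] += cpu_s
        wall[site] += wall_s
    return {site: cpu[site] / wall[site] for site in wall}


result = efficiency_by_site(JOBS)
for site, eff in sorted(result.items()):
    print(f"{site}: {eff:.2f}")
```

Summing CPU and wall-clock time per site before dividing weights long jobs more heavily than averaging per-job ratios would; which of the two is appropriate depends on the question being asked.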

Web Notebooks

◦ It is “an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media” [http://ipython.org/notebook.html]

IPython / Jupyter

Jupyter example - matplotlib

Zeppelin

Zeppelin – Example DF

Zeppelin – Example Chart

Wall clock

Circle size: job duration

Zeppelin – Example Plot

The end
