Hadoop IT Services Hadoop Users Forum CERN October 7th, 2015 CERN IT-D*


Hadoop IT Services

Hadoop Users Forum

CERN October 7th, 2015

CERN IT-D*


Hadoop
A framework for large scale data processing

• Distributed storage and processing
• Shared nothing architecture – scales horizontally
• Optimized for high throughput on sequential data access

[Figure: shared nothing cluster. Node 1 through Node X, each with its own CPU, memory, and disks, linked by an interconnect network.]


How Hadoop Can Help You
• Parallel processing of large amounts of data
• Perform analytics on a big scale
• Dealing with diverse data: structured, semi-structured, unstructured
• ‘Cold’ storage / Archives

Performance is usually suboptimal for
• Random reads and real-time access
• ‘Small’ datasets
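The parallel processing model mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps on one machine, not Hadoop code: the function names, the `splits` input, and the log-like records are all made up for the example, and a real job would run each split on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for each word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    # On a cluster each split is handled by a separate node;
    # here a plain loop stands in for that parallelism.
    mapped = chain.from_iterable(
        map_phase(record) for split in splits for record in split
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two splits of log-like records.
splits = [
    ["error disk full", "info job started"],
    ["error network timeout", "info job finished"],
]
counts = run_job(splits)
print(counts["error"], counts["job"])
```

Because every record is mapped independently and every key is reduced independently, the work splits cleanly across nodes. That is also why the model suits large sequential scans better than random reads or small datasets.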


There are already interesting use cases of Hadoop @CERN

• WLCG grid monitoring
  • Data Transfers etc.
• ATLAS Events Indexing
• CASTOR log aggregation
• Data Warehousing
  • Logging/time series data
• IT monitoring


Hadoop Service in IT
• Set up and run the infrastructure
• Provide consultancy
• Build the community
• Joint work
  • IT-DB and IT-DSS


Hadoop Clusters in IT (Oct 2015)

• lxhadoop (22 nodes)
  • general purpose cluster (mainly used by ATLAS)
  • stable software setup
  • recent hardware
• analytix (56 nodes)
  • for analysis of monitoring data
  • varied hardware specifications
  • the biggest in terms of number of nodes
• hadalytic (17 nodes)
  • general purpose cluster with additional services
  • recent hardware


Many Configuration Options
• Hadoop is a platform
• Many components and key decisions in the implementation
• Rapidly evolving field
• Examples
  • Data access: domain specific language or SQL
  • Many components and data formats
  • Data loading and unloading tools
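The choice between SQL and a programmatic interface for data access can be illustrated with a small stand-in example. The sketch below uses Python's built-in `sqlite3` in place of a SQL engine such as Hive or Impala, and plain Python in place of a Pig or Spark style pipeline. The `events` table and its rows are hypothetical, and nothing here talks to a Hadoop cluster.

```python
import sqlite3
from collections import Counter

# Hypothetical monitoring records: (host, status).
rows = [
    ("node1", "ok"),
    ("node2", "error"),
    ("node1", "error"),
    ("node3", "ok"),
    ("node2", "error"),
]

# SQL style: declare the result and let the engine decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (host TEXT, status TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute(
        "SELECT host, COUNT(*) FROM events "
        "WHERE status = 'error' GROUP BY host"
    ).fetchall()
)
conn.close()

# Programmatic style: spell out the filter and count steps explicitly.
errors = (host for host, status in rows if status == "error")
api_result = dict(Counter(errors))

print(sql_result == api_result)
```

Both forms produce the same error count per host. Which one fits better tends to depend on the users' background and on how much control over the processing steps is needed.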


Currently available components


• HDFS – Hadoop Distributed File System
• YARN – Cluster resource manager
• MapReduce
• HBase – NoSQL columnar store
• Hive – SQL
• Pig – Scripting
• Flume – Log data collector
• Sqoop – Data exchange with RDBMS
• ZooKeeper – Coordination
• Impala – SQL
• Spark – Large scale data processing


Software version policy
• Align to CDH distributions

Component   lxhadoop     analytix     hadalytic
            (22 nodes)   (56 nodes)   (17 nodes)
CDH         5.1.0        5.4.2        5.4.2
HDFS        2.3.0        2.6.0        2.6.0
HBase       0.98.1       1.0.0        1.0.0
Hive        0.12.0       1.1.0        1.1.0
Pig         0.12.0       0.12.0       0.12.0
Spark       1.0.0        1.3.0        1.3.0
Impala      -            -            2.2.0
Sqoop       1.4.4        1.4.5        1.4.5


Maintenance activities
• Actions
  • Upgrades to a newer CDH
• Frequency
  • Typically twice a year
• Impact
  • Downtime 1-3 hours
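As a rough reading of these figures, two upgrades a year at 1-3 hours each add up to 2-6 hours of planned downtime per cluster per year. The short calculation below turns that into an availability figure. It assumes a 365-day year and counts only planned upgrades, so unplanned outages are outside its scope. The function name is made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def planned_availability(upgrades_per_year, downtime_hours):
    """Fraction of the year a cluster is up, counting only planned upgrades."""
    return 1 - (upgrades_per_year * downtime_hours) / HOURS_PER_YEAR


best = planned_availability(2, 1)   # 2 hours of downtime per year
worst = planned_availability(2, 3)  # 6 hours of downtime per year
print(f"{worst:.4%} to {best:.4%}")
```

Under these assumptions, planned maintenance alone leaves availability at roughly 99.93% to 99.98%.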


Recent activities (last 3 months)
• Hadoop Tutorials – during summer
• Deployment of Cloudera Impala component
• Monitoring of hanging HBase region servers
• Self-service Oracle2Hadoop integration (work in progress)
• Building a database of users’ data sources


Contact points

• Service is available in SNOW
  • SE: Hadoop Service
    • FE: Hadoop Components
    • FE: Hadoop Core
• E-group: [email protected]
• Show up at the Wednesday meetings
  • Analytic Working Group
  • Hadoop User Forum


How to Learn More
• Hadoop tutorials at CERN, summer 2015
  • Introduction to Hadoop (Architecture, HDFS, MapReduce, Spark) https://indico.cern.ch/event/404527/
  • SQL on Hadoop (Hive, Impala) https://indico.cern.ch/event/434650/
  • NoSQL on Hadoop (HBase) https://indico.cern.ch/event/442004/
• We plan to do more/repeats in the future


Future plans
• Infrastructure
  • HDFS backups
  • Rolling upgrades
  • Support from Cloudera?
• Users community
  • Write a Knowledge Base (SNOW)
• New features/technology testing
  • Kudu – a new columnar storage engine from Cloudera
  • Tachyon – in-memory file system