Hadoop IT Services
Hadoop Users Forum
CERN, October 7th, 2015
CERN IT-D*
Hadoop
A framework for large scale data processing
• Distributed storage and processing
• Shared-nothing architecture: scales horizontally
• Optimized for high throughput on sequential data access
[Diagram: Node 1, Node 2, Node 3, Node 4, Node 5 ... Node X, each with its own CPU, memory and disks, linked by an interconnect network]
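The shared-nothing model can be illustrated with a small word count in the MapReduce style. This is a minimal single-process sketch, not Hadoop code: each inner list stands in for the data held on one node's local disks, and the function names and sample lines are hypothetical. On a real cluster the map step would run in parallel on each node and the shuffle would move data over the interconnect.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition):
    """Runs against one node's local data: emit (word, 1) for every word."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Group the emitted values by key across the outputs of all nodes."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    # Sequential here; a cluster would run one map task per partition in parallel.
    mapped = [map_phase(partition) for partition in partitions]
    return reduce_phase(shuffle(mapped))


# Hypothetical input: three "nodes", each holding its own slice of the data.
partitions = [
    ["hadoop stores data", "hadoop processes data"],
    ["data is distributed"],
    ["nodes share nothing"],
]
counts = word_count(partitions)
print(counts["data"], counts["hadoop"])
```

Because no map task needs data from another node, adding nodes adds both storage and processing capacity, which is what "scales horizontally" refers to.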
How Hadoop Can Help You
• Parallel processing of large amounts of data
• Perform analytics on a big scale
• Dealing with diverse data: structured, semi-structured, unstructured
• ‘Cold’ storage / Archives
Performance is usually suboptimal for
• Random reads and real-time access
• ‘Small’ datasets
There are already interesting use cases of Hadoop @CERN
• WLCG grid monitoring
  • Data Transfers etc.
• ATLAS Events Indexing
• CASTOR log aggregation
• Data Warehousing
  • Logging/time series data
• IT monitoring
Hadoop Service in IT
• Set up and run the infrastructure
• Provide consultancy
• Build the community
• Joint work
  • IT-DB and IT-DSS
Hadoop Clusters in IT (Oct 2015)
• lxhadoop (22 nodes)
  • general purpose cluster (mainly used by ATLAS)
  • stable software setup
  • recent hardware
• analytix (56 nodes)
  • for analysis of monitoring data
  • varied hardware specifications
  • the biggest in terms of number of nodes
• hadalytic (17 nodes)
  • general purpose cluster with additional services
  • recent hardware
Many Configuration Options
• Hadoop is a platform
• Many components and key decisions in the implementation
• Rapidly evolving field
• Examples
  • Data access: domain specific language or SQL
  • Many components and data formats
  • Data loading and unloading tools
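The choice between SQL and a scripting-style language can be sketched with the same aggregation written both ways. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for a SQL engine such as Hive or Impala, a plain loop stands in for a dataflow script such as Pig, and the table, column names and sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (cluster name, job runtime in seconds).
rows = [("cluster_a", 120), ("cluster_b", 300), ("cluster_a", 80), ("cluster_b", 100)]

# Declarative style: state the result wanted and let the engine plan the work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (cluster TEXT, runtime INTEGER)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute("SELECT cluster, SUM(runtime) FROM jobs GROUP BY cluster")
)
conn.close()

# Scripting style: spell out the grouping and summing step by step.
totals = defaultdict(int)
for cluster, runtime in rows:
    totals[cluster] += runtime
script_result = dict(totals)

print(sql_result == script_result)
```

Both forms give the same totals; the practical difference is in who decides the execution plan and how much custom logic can be embedded in the pipeline.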
Currently available components
• HDFS: Hadoop Distributed File System
• YARN: Cluster resource manager
• MapReduce
• HBase: NoSQL columnar store
• Hive: SQL
• Pig: Scripting
• Flume: Log data collector
• Sqoop: Data exchange with RDBMS
• Zookeeper: Coordination
• Impala: SQL
• Spark: Large scale data processing
Software version policy
• Align to CDH distributions

          lxhadoop      analytix      hadalytic
          (22 nodes)    (56 nodes)    (17 nodes)
CDH       5.1.0         5.4.2         5.4.2
HDFS      2.3.0         2.6.0         2.6.0
HBase     0.98.1        1.0.0         1.0.0
Hive      0.12.0        1.1.0         1.1.0
Pig       0.12.0        0.12.0        0.12.0
Spark     1.0.0         1.3.0         1.3.0
Impala    -             -             2.2.0
Sqoop     1.4.4         1.4.5         1.4.5
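The version table can be kept as data and queried, for example to see which clusters lag behind the newest deployed version of a component. The sketch below copies the values from the table; the helper names `parse` and `outdated` are hypothetical, and `None` marks a component that is not deployed.

```python
# Versions per cluster, copied from the table above (None = not deployed).
versions = {
    "lxhadoop": {"CDH": "5.1.0", "HDFS": "2.3.0", "HBase": "0.98.1", "Hive": "0.12.0",
                 "Pig": "0.12.0", "Spark": "1.0.0", "Impala": None, "Sqoop": "1.4.4"},
    "analytix": {"CDH": "5.4.2", "HDFS": "2.6.0", "HBase": "1.0.0", "Hive": "1.1.0",
                 "Pig": "0.12.0", "Spark": "1.3.0", "Impala": None, "Sqoop": "1.4.5"},
    "hadalytic": {"CDH": "5.4.2", "HDFS": "2.6.0", "HBase": "1.0.0", "Hive": "1.1.0",
                  "Pig": "0.12.0", "Spark": "1.3.0", "Impala": "2.2.0", "Sqoop": "1.4.5"},
}


def parse(version):
    """Turn '0.98.1' into (0, 98, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def outdated(versions, component):
    """Clusters running an older version of the component than the newest deployed."""
    deployed = {
        cluster: parse(components[component])
        for cluster, components in versions.items()
        if components[component] is not None
    }
    newest = max(deployed.values())
    return sorted(cluster for cluster, ver in deployed.items() if ver < newest)


with_impala = sorted(c for c, comps in versions.items() if comps["Impala"] is not None)
print(outdated(versions, "CDH"), outdated(versions, "Pig"), with_impala)
```

Comparing parsed tuples rather than strings matters here: as plain strings, "0.98.1" would sort after "0.12.0" only by accident, and "1.10.0" would sort before "1.9.0".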
Maintenance activities
• Actions
  • Upgrades to a newer CDH
• Frequency
  • Typically twice a year
• Impact
  • Downtime of 1-3 hours
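As a rough worked example, the figures above translate into planned downtime of about 2 to 6 hours per year. The sketch below assumes exactly two upgrades per year and a 365-day year, and counts planned maintenance only; it is an illustration of the arithmetic, not a service-level figure.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def planned_availability(upgrades_per_year, downtime_hours):
    """Fraction of the year the cluster is up, counting planned upgrades only."""
    downtime = upgrades_per_year * downtime_hours
    return 1 - downtime / HOURS_PER_YEAR


best = planned_availability(2, 1)   # two upgrades of 1 hour each
worst = planned_availability(2, 3)  # two upgrades of 3 hours each
print(f"{best:.2%} {worst:.2%}")
```

Under these assumptions the planned availability lies between roughly 99.93% and 99.98%.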
Recent activities (last 3 months)
• Hadoop tutorials (during summer)
• Deployment of the Cloudera Impala component
• Monitoring of hanging HBase region servers
• Self-service Oracle2Hadoop integration (work in progress)
• Building a database of users’ data sources
Contact points
• Service is available in SNOW
  • SE: Hadoop Service
    • FE: Hadoop Components
    • FE: Hadoop Core
• E-group: [email protected]
• Show up at the Wednesday meetings
  • Analytic Working Group
  • Hadoop User Forum
How to Learn More
• Hadoop tutorials at CERN, summer 2015
  • Introduction to Hadoop (Architecture, HDFS, MapReduce, Spark) https://indico.cern.ch/event/404527/
  • SQL on Hadoop (Hive, Impala) https://indico.cern.ch/event/434650/
  • NoSQL on Hadoop (HBase) https://indico.cern.ch/event/442004/
• We plan to run more tutorials and repeats in the future
Future plans
• Infrastructure
  • HDFS backups
  • Rolling upgrades
  • Support from Cloudera?
• User community
  • Write a Knowledge Base (SNOW)
• New features/technology testing
  • Kudu: a new columnar storage engine from Cloudera
  • Tachyon: in-memory file system