Hadoop on Palmetto HPC
Pengfei Xuan
Jan 8, 2015
School of Computing, Clemson University
Outline
• Introduction
• Hadoop over Palmetto HPC Cluster
• Hands-on Practice
HPC Cluster vs. Hadoop Cluster
[Diagram: HPC cluster vs. Hadoop cluster. In an HPC cluster, compute nodes reach shared storage over the network; in a Hadoop cluster, each compute node is also a data node with local RAM, SSD, and HDD storage.]
HPC Clusters
Forge (NVIDIA GPU Cluster)
• 44 GPU Nodes
• 6 or 8 NVIDIA Fermi M2070 GPUs per Node
• 6 GB Graphics Memory per GPU
• 600 TB GPFS File System
• 40 Gb/sec InfiniBand QDR per Node (point-to-point unidirectional speed)
The National Center for Supercomputing Applications
[Photos: InfiniBand switch + 40 Gb InfiniBand adapter + 8 NVIDIA Fermi M2070 GPUs]
Hadoop Clusters
History of Hadoop
1998 — Google founded
2003 — GFS paper
2004 — MapReduce paper; Nutch DFS implementation
2005 — Nutch MapReduce implementation
2006 — BigTable paper; Hadoop project launched
2008 — World's largest Hadoop cluster
2010 — Facebook: 21 PB of data
2011 — Microsoft, IBM, Oracle, Twitter, Amazon
Now — Everywhere, including our class!
[Photos: Jeffrey Dean and Doug Cutting]
Google vs. Hadoop Infrastructures
[Diagram: Google's internal stack (GFS, MapReduce, Bigtable, Chubby, Sawzall, Dremel, Evenflow, MySQL Gateway) alongside open-source Hadoop stacks (Hadoop, HBase, Hive, Pig, Zookeeper, Oozie, Azkaban, Sqoop, Kafka, Flume, Scribe, Voldemort, Databee, Data Highway, HiPal, Hue, Crunch), organized into four layers:]
I. Data Coordination Layer
II. Data Storage and Computing Layer
III. Data Flow Layer
IV. Data Analysis Layer
MapReduce Word Count Example
cat * | grep | sort | uniq -c | cat > file
(input → map → shuffle/sort → reduce → output)
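The Unix pipeline above mirrors the three MapReduce phases. As a minimal sketch (the input lines here are made up for illustration), the same map → shuffle → reduce flow can be written in Python:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word, like the WordCount mapper.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group values by key, like Hadoop's sort between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word, like `uniq -c`.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop runs on palmetto", "hadoop uses hdfs"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # → 2
```

In real Hadoop, the map and reduce functions run in parallel across data nodes and the shuffle moves intermediate pairs over the network; this sketch only shows the dataflow.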
Run Hadoop over Palmetto Cluster
1. Setup Hadoop configuration files
2. Start Hadoop services
3. Copy input files to HDFS (stage-in)
4. Run Hadoop job (MapReduce WordCount)
5. Copy output files from HDFS to your home directory (stage-out)
6. Stop Hadoop services
7. Clear up
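The steps above can be sketched as a PBS job script. This is only an illustrative outline, not the actual runHadoop.pbs provided below: the resource request, paths, and directory names are assumptions, and it uses the standard Hadoop start/stop scripts and HDFS shell commands rather than the myHadoop helpers.

```shell
#!/bin/bash
#PBS -N hadoopWordCount
#PBS -l select=4:ncpus=8:mem=16gb,walltime=01:00:00

# 2. Start Hadoop services (assumes HADOOP_HOME/sbin is on PATH
#    and the per-job configuration from step 1 is already in place)
start-dfs.sh && start-yarn.sh

# 3. Stage input files into HDFS (stage-in); input directory is illustrative
hdfs dfs -mkdir -p /user/$USER/wordcount-input
hdfs dfs -put "$PBS_O_WORKDIR"/input/* /user/$USER/wordcount-input

# 4. Run the MapReduce WordCount example that ships with Hadoop
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/$USER/wordcount-input /user/$USER/wordcount-output

# 5. Copy output from HDFS back to the job directory (stage-out)
hdfs dfs -get /user/$USER/wordcount-output "$PBS_O_WORKDIR"/wordcount-output

# 6. Stop Hadoop services
stop-yarn.sh && stop-dfs.sh
```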
Commands
1. Create job directory:
$> mkdir myHadoopJob1
$> cd myHadoopJob1
2. Get Hadoop PBS script:
$> cp /tmp/runHadoop.pbs .
Or:
$> wget https://raw.githubusercontent.com/pfxuan/myhadoop/master/examples/runHadoop.pbs
3. Submit job to Palmetto cluster:
$> qsub runHadoop.pbs
4. Check status of your job:
$> qstat -anu your_cu_username
5. Verify the correctness of your result:
$> grep Hadoop wordcount-output/* | grep 51