Everything you (freaking) need to know about Hadoop Now
Andrew C. Oliver
@acoliver
#ATO2014
{All Things Open | Raleigh}
{Open Software Integrators} { www.osintegrators.com} {@osintegrators}
Andrew C. Oliver
● Programming since I was about 8
● Java since ~1997
● Founded the POI project (currently hosted at Apache) with Marc Johnson ~2000
  ○ Former member of the Jakarta PMC
  ○ Emeritus member of the Apache Software Foundation
● Joined JBoss ~2002
● Former board member / current helper / lifetime member: Open Source Initiative (http://opensource.org)
● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
  ○ I make fanboys cry.
Open Software Integrators
● Founded Nov 2007 by Andrew C. Oliver (me) in Durham, NC
● Pivoted from Java/Linux consulting to full-on Hadoop/NoSQL this year
● We’re hiring
  ○ mid- to senior-level (Java/Linux and database background)
  ○ devops-type people (Puppet, Chef, Salt, etc.; Linux background, database understanding, Ruby/Python/etc.)
  ○ up to 50% travel, salary + bonus, 401k, health, etc.
  ○ preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery
  ○ nice to have: Hadoop, Neo4j, MongoDB, Cassandra, Ruby, at least one cloud platform
Overview
● What is Hadoop anyhow?
● What is Hadoop good for?
● What isn’t it good for?
● How do you get data into Hadoop?
● How do you get data out of Hadoop?
● How do you process data in Hadoop?
● How do you analyze data in Hadoop?
● How do you secure Hadoop?
But first...
● This is an overview talk intended as a roadmap to point you at the most important bits to learn along the way
● It is not comprehensive training
● It is not an in-depth look at any part of Hadoop
● It is a rather high-level, selective overview of the Hadoop ecosystem
What is Hadoop Anyhow?
Hadoop is
● A platform for distributed computing
● 2011
  ○ HDFS
  ○ Hive
● 2012
  ○ HDFS
  ○ YARN
  ○ Hive
  ○ HBase
● 2014
  ○ HDFS
  ○ Hive
  ○ YARN
  ○ HBase
  ○ Spark
  ○ Storm
  ○ Kafka
  ○ Mahout
  ○ Sqoop
  ○ Oozie
  ○ ...
Hadoop is
● HDFS
  ○ Distributed filesystem similar to Gluster, Ceph, etc.
  ○ You can use other distributed filesystems in place of HDFS
  ○ Blocks are distributed and, by default, replicated on at least one other node
  ○ 128 MB default block size
  ○ RESTful API, CLI tools, third-party tools to “mount” HDFS on Linux (stable), Windows (ymmv), Mac (?)
● DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
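To make the block math concrete, here is a back-of-the-envelope sketch in plain Python (not a Hadoop API). It assumes the usual defaults of a 128 MB block size and a replication factor of 3:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default HDFS block size
REPLICATION = 3                 # common default replication factor

def hdfs_footprint(file_size_bytes):
    """Rough block count and raw storage for one file on HDFS."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    # Blocks only hold actual data, so raw usage is size x replicas.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

# A 1 GB file spans 8 blocks and, with 3 replicas, ~3 GB of raw disk.
blocks, raw = hdfs_footprint(1024 * 1024 * 1024)
```

The takeaway: losing one node never loses data, because every block lives on other nodes too, which is also exactly why a SAN underneath the data nodes defeats the design.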
Hadoop is
● YARN
  ○ Yet Another Resource Negotiator
  ○ schedules “work” among nodes and distributes the “processing”
● MapReduce is
  ○ an API
  ○ an algorithm: data is mapped across nodes, and the answers are “reduced” to a single answer
● Hive is
  ○ HDFS/Hadoop-based data warehousing
  ○ SQL, JDBC, ODBC
  ○ Tables map to files on HDFS
  ○ No updates, deletes, or transactions (but coming in “Stinger.next”)
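The map/shuffle/reduce flow can be sketched in a few lines of plain Python (a toy word count illustrating the semantics, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (key, value) pairs - here (word, 1)
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group values by key (Hadoop does this between map and reduce)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: collapse each key's values to a single answer
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["hadoop is", "hadoop scales"])))
# counts == {"hadoop": 2, "is": 1, "scales": 1}
```

On a cluster, the map and reduce steps run in parallel across nodes, with YARN handing out the containers they run in.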
Hadoop is
● HBase
  ○ a column-family database
  ○ ACID
  ○ relatively low-latency
● And a whole lot more
Hadoop is
● An ecosystem of tools for distributed processing and storage of data.
What is Hadoop Good For?
What is Hadoop Good For?
● Working with large amounts of data in batch
  ○ ETL processing / data transformation
  ○ Analytics / BI
  ○ Integration (Data Lake, Enterprise Data Hub)
● Working with streams of data
  ○ Events
    ■ Log data
● Time-series or similar data (HBase)
What is Hadoop bad at?
● Quick jobs - Hive/MapReduce setup time is measured in seconds to minutes
● Lots of small files - with a 128 MB block size, every file, even a 0-byte one, occupies its own block and namenode metadata entry
● General DBMS stuff - HBase is a much more “specific” database than MySQL/etc.
● High availability
  ○ WHA???
  ○ Knox, Oozie, etc. all have shaky support, if any, for HA namenodes
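The small-files point is really about namenode memory: every file and block is an object in the namenode's heap. Using the oft-quoted ~150 bytes-per-object rule of thumb (an approximation, not a spec) as an assumption:

```python
OBJECT_BYTES = 150  # rough rule of thumb per file/block object in namenode heap

def namenode_overhead(num_files, blocks_per_file=1):
    # each file costs one file object plus one object per block
    objects = num_files * (1 + blocks_per_file)
    return objects * OBJECT_BYTES

# 10 million single-block files ~= 3 GB of namenode heap,
# whether those files hold 1 byte or 128 MB each.
heap = namenode_overhead(10_000_000)
```

That is why concatenating small files (or using container formats like SequenceFile) matters far more than raw disk space.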
How do you get data into/out of Hadoop?
How do you get data into Hadoop?
● Sqoop it in from an RDBMS
● Use JDBC or ODBC and push into Hive from an external DB
● Push data into Hive with the RESTful API
● Put an extract file onto HDFS with the REST API, then:
  ○ process it into Hive directly with a LOAD DATA statement
  ○ transform/process it into Hive using Pig
  ○ use Java
● Message it in there with Kafka, RabbitMQ, or a similar MQ and a custom “spout” for Storm
● Use any of a multitude of APIs that write data into HDFS, HBase, Hive, etc.
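For the REST route, HDFS's WebHDFS API drives everything off an `op` query parameter. A sketch of how the URLs are shaped (the namenode host below is hypothetical; port 50070 was the usual default at the time):

```python
from urllib.parse import urlencode

def webhdfs_url(path, op, host="namenode.example.com", port=50070, **params):
    """Build a WebHDFS URL; op is e.g. CREATE, OPEN, LISTSTATUS, MKDIRS."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# An HTTP PUT to this URL (following the redirect to a datanode)
# uploads an extract file onto HDFS:
url = webhdfs_url("/staging/orders.csv", "CREATE", overwrite="true")
```

The namenode answers the initial request with a redirect to a datanode, and the actual file bytes go to the datanode, so clients never need the Java libraries on board.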
How do you get data out of Hadoop?
● Should you be getting it out, or should you process it there?
● JDBC/ODBC to Hive
● HBase can be mounted into Hive
● REST APIs for Hive/HDFS
● APIs for Kafka, Spark, Storm, etc. (subscribe)
● DistCp to another HDFS
● Mount it with FUSE and use your favorite Linux tool
● hadoop fs -cat /path/to/file/on/hdfs | grep stuff > mynewlocalfile
How do you process data in Hadoop?
How do you process data in Hadoop?
● MapReduce Java API
● Hive supports SQL (for now a subset, but not for much longer)
● Pig can munge files on HDFS and can work with Hive
● Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data
● There are numerous toolkits
  ○ Mahout - common machine learning algorithms (many not very parallelizable)
  ○ MLlib - machine learning built on Spark
  ○ GraphX - graph processing built on Spark
How do you analyze data in Hadoop?
● Most major BI tools now support Hadoop
  ○ Tableau
  ○ Pentaho
  ○ Datameer
  ○ Your favorite is probably here
● All that stuff is for l4m3rs; use the command-line interface :-)
  ○ hive -e 'select * from sometable'
  ○ pig hdfs://some/dir/myscript.pig
● Use RStudio and write some R to predict what sales will be next month (you will probably be sort of wrong)
● Use your favorite SQL tool that supports JDBC/ODBC
● Use Hue
How do you secure Hadoop?
How do you secure Hadoop?
● HDFS supports POSIX-style (that means Linux-style) filesystem permissions
● The most complete authentication throughout Hadoop is based on Kerberos (yeah, I know)
● You can do it with just straight LDAP too, but it isn’t integrated
● Knox supplies “perimeter-based security” for (only):
  ○ Hive
  ○ HDFS
  ○ Oozie
  ○ HBase
  ○ HCatalog
● Supposedly Argus will save us from all of this!
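Those POSIX-style permissions are the familiar owner/group/other triads. A quick sketch of the bit math behind a mode string and the octal you would pass to a chmod-style command (plain Python, purely illustrative):

```python
def mode_to_octal(mode):
    """Convert a 9-char mode string like 'rwxr-x---' to octal '750'."""
    assert len(mode) == 9
    digits = []
    for i in range(0, 9, 3):          # owner, group, other triads
        triad = mode[i:i + 3]
        value = ((triad[0] == "r") * 4 +   # read  bit
                 (triad[1] == "w") * 2 +   # write bit
                 (triad[2] == "x") * 1)    # execute bit
        digits.append(str(value))
    return "".join(digits)

# hadoop fs -chmod 750 /secure/dir   <- same bits as rwxr-x---
octal = mode_to_octal("rwxr-x---")
```

Permissions only control authorization, though; without Kerberos, HDFS trusts whatever username the client claims, which is why authentication is the hard part.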
Other Considerations
Cacophony
● Disaster recovery
  ○ Falcon (alpha quality)
● Workflow
  ○ Flume
● Schedule/trigger/orchestrate those ETL jobs
  ○ Oozie
● Install, configure, and monitor Hadoop
  ○ Ambari
● Use tables in both Pig and Hive
  ○ HCatalog
Ambari (screenshot)

Hue (screenshot)

Hue editing Oozie (screenshot)
Pig Script

REGISTER file:///usr/lib/pig/piggybank.jar;
define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

rows = load '$FILEPATH'
    using org.apache.pig.piggybank.storage.CSVExcelStorage('\u001a')
    as (a0:chararray, a1:chararray, a2:chararray, a3:chararray, a4:chararray,
        a5:chararray, a6:chararray, a7:chararray, a8:chararray, a9:chararray);

row = foreach rows GENERATE
    REPLACE((TRIM($0)), 'NULL', '') as orderid,
    REPLACE((TRIM($1)), 'NULL', '') as customerid,
    REPLACE((TRIM($2)), 'NULL', '') as customername,
    REPLACE((TRIM($3)), 'NULL', '') as address,
    REPLACE((TRIM($4)), 'NULL', '') as city,
    REPLACE((TRIM($5)), 'NULL', '') as state,
    REPLACE((TRIM($6)), 'NULL', '') as zip,
    REPLACE((TRIM($7)), 'NULL', '') as status,
    REPLACE((TRIM($8)), 'NULL', '') as store;

store row into 'stage.orders' using org.apache.hcatalog.pig.HCatStorer('loaddate=$LOADDATE');