Finding the needles in the haystack: An overview of analyzing big data with Hadoop
Presented by Chris Baglieri, Architect, Research Architecture & Development
Technology Forum, August 30th, 2010




A presentation I gave to R&D Informatics, broadly introducing large-scale data processing with Hadoop, with a focus on HDFS, MapReduce, Pig, and Hive.


Page 1: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Finding the needles in the haystack: An overview of analyzing big data with Hadoop

Presented by Chris Baglieri, Architect, Research Architecture & Development

Technology Forum, August 30th, 2010

Page 2: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop-alooza (ahem, Agenda)

Opening Act (Big Data)
- Overview
- Analytical Challenges
- It's Not Just <insert big web company name here>'s Problem

Headliner (Hadoop Core)
- Overview
- Hadoop Distributed File System
- Hadoop MapReduce

Secondary Stage (Tooling Atop MapReduce)
- Overview
- Pig
- Hive

Page 3: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Setting The Stage; What If We Could…

…centrally log instrument activity across R&D?
- Are there times when a lab is exceptionally proficient or inefficient?
- Do certain instruments that look like sound investments cost us more as a result of downtime, or rarely produce actionable data?
- Do certain instruments fail following a predictable chain of events?

…centrally log service or application activity across informatics?
- Do certain applications fail following a predictable chain of events?
- Are there two seemingly disparate applications performing similar tasks; can we extract those tasks into a common service?

…collect exhaust data on samples?
- What's the logical starting point to conduct analysis on this sample?
- Could we construct a sample's analytical resume on the spot?
- Could we identify friction in our discovery process?

Page 4: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Oh there’s the big data!

Facebook
- 36 PB of uncompressed data
- 2,250 machines; 23,000 cores; 32 GB of RAM per machine
- Processing 80-90 TB/day

Yahoo!
- 70 PB of data in distributed storage; 170 PB spread across the globe
- 34,000 servers
- Processing 3 PB/day

Twitter
- 7 TB/day into distributed storage

LinkedIn
- 120 billion relationships
- 16 TB of intermediate data

* All references pulled from Hadoop Summit 2010

Page 5: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

We’re not them, right?

Research Pushes Boundaries
- Sequencing (25 GB/day)
- Gene Expression Omnibus (>2 TB)
- Studies consisting of 5 billion rows
- Aggregating log data across R&D
- More collaborations; more partners

AWS Public Data Sets
- GenBank (~200 GB)
- Ensembl Genomic Data (~200 GB)
- PubChem Library (~230 GB)

Alright, so we're not exactly close to them, but we'll still run into similar problems sooner rather than later.

Page 6: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

So what if we store everything?

A typical drive from 1990 could store 1,370 MB and had a transfer speed of 4.4 MB/s, so all of its data could be read in about 5 minutes. Twenty years later, drives of 1 TB or more are the norm, while transfer rates have only climbed to ~100 MB/s, yielding a full read time of more than 2.5 hours.

* Graph extrapolated from data generated on Tom’s Hardware

Page 7: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Our business is data…

…and it's already
- Big (just not as big).
- Very complex.
- Highly relational.
- Very diverse.

…and it may not
- Fit nicely in a database.
- Scale nicely in a database.

…and we need tooling to
- Aggregate it.
- Organize it.
- Analyze it.
- Share it.
- Do more in the area of metrics.

Page 8: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Enter the elephant!

At the highest level, Hadoop is…
- …a reliable shared storage system.
- …a batch analytical processing engine.
- …a toy elephant.

Technically speaking, Hadoop is…
- …an open source, Apache-licensed project with numerous sub-projects.
- …based on Google infrastructure (GFS, MapReduce, Sawzall, BigTable, …).
- …written in Java.
- …under active development.
- …distributed, fault tolerant, and incredibly scalable.
- …built with commodity hardware in mind.
- …efficient with big data.
- …an ecosystem*.
- …a market*.

Page 9: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop (and a small fraction of) The Ecosystem

Page 10: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop (and a small fraction of) The Market

Page 11: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop The Apache Project

[Diagram: the Hadoop stack of Apache sub-projects. Hadoop Commons and ZooKeeper underpin the platform; HDFS provides unstructured storage and HBase structured storage; MapReduce handles raw processing; Pig and Hive provide compiled processing; Chukwa handles data collection.]

Page 12: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop Distributed File System (HDFS)

HDFS is a file system designed for storing very large files (with hooks for smaller files) with streaming data access patterns, running on clusters of commodity hardware.
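To make that concrete, here is a minimal sketch of reading a file out of HDFS with the Java FileSystem API; the namenode address and file path are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Point the client at the cluster; the address is hypothetical.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);
    // Stream the file end to end; HDFS is optimized for exactly this kind of whole-file scan.
    Path path = new Path("/data/sample.log"); // hypothetical file
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}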

Page 13: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop Distributed File System (HDFS)

[Slides 13-14 showed HDFS architecture diagrams; the images are not preserved.]

Page 15: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

HDFS Anti-Patterns

Low-latency data access
- Applications requiring access times in the tens of milliseconds should look elsewhere.
- HDFS is optimized for delivering high throughput at the expense of latency.
- HBase is better suited for these access patterns.

Lots of small files
- The capacity of the namenode is a limiting factor.
- Millions of files is feasible; billions is beyond the capability of today's hardware.

Multiple writers, arbitrary file modifications
- Files may be written to by only a single writer at a given point in time.
- Writes are always made at the end of the file.

Page 16: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

MapReduce

Generalities
- A simple programming model.
- A framework for processing large data sets.
- Abstracts away the complexities of parallel programming.
- Computation is routed to data.

A Language Unto Itself
- A MapReduce job is a unit of work; it consists of input data, a MapReduce program, and configuration.
- Hadoop runs the job by dividing it into tasks, namely map and reduce tasks.
- Two node types control the execution: the jobtracker and the tasktrackers.
- Hadoop divides a job's input into fixed-size pieces called splits, and creates one map task per split (often the size of a block).
- Hadoop employs a data locality optimization.
- Map task results are intermediate; they are stored locally, not on HDFS.

Page 17: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

MapReduce

Basic steps (the mapper and reducer change to fit the problem):
1. Read data.
2. Extract something you care about from each record (mapping).
3. Shuffle and sort the extracted data.
4. Aggregate, summarize, filter, or transform (reducing).
5. Write the results.

Page 18: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

MapReduce (Job Configuration)
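The configuration code shown on this slide isn't preserved. Below is a minimal sketch of wiring up the classic word-count job with the old org.apache.hadoop.mapred API (the one whose map and reduce signatures, including the Reporter, appear on the next two slides). The WordCountMapper and WordCountReducer class names refer to the sketches that follow; the input and output paths are whatever the caller passes in.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("word count");

    // The map and reduce implementations sketched on the next two slides.
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // k2/k3 == word (Text), v2/v3 == count (IntWritable).
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(conf, new Path(args[0]));   // input on HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // must not exist yet

    // Submits the job to the jobtracker and blocks until it completes.
    JobClient.runJob(conf);
  }
}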

Page 19: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

MapReduce (Map)

general form => (k1, v1) → list(k2, v2)
key => offset from the beginning of the file
value => line of text
output => { word == Text, 1 == IntWritable }
reporter => container that holds metadata about the job
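The code image from this slide isn't preserved; here is a minimal word-count mapper matching the form above, written against the old mapred API. The class name is a stand-in referenced by the job configuration sketch earlier.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // key is the byte offset into the file; value is one line of text.
    // Emit (word, 1) for every token on the line.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);
    }
  }
}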

Page 20: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

MapReduce (Reduce)

general form => (k2, list(v2)) → list(k3, v3)
key => a word emitted by the map (Text)
value => collection of word counts where each count is 1
output => { word == Text, number of times it appeared in the full text == IntWritable }
reporter => container that holds metadata about the job
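Again, the slide's code image isn't preserved; here is a minimal reducer matching the form above. It simply sums the 1s the mapper emitted for each word.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // The framework has already grouped and sorted by word (the shuffle).
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum)); // (word, total count)
  }
}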

Page 21: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop-able Problems Courtesy of Cloudera

Properties of Data
- Complex data.
- Multiple data sources.
- Lots of data.

Types of Analysis
- Text mining.
- Index building.
- Graph creation and analysis.
- Pattern recognition.
- Collaborative filtering.
- Prediction models.
- Sentiment analysis.
- Risk assessment.

Examples
- Modeling true risk
- Customer churn analysis
- Recommendation engines
- Ad targeting
- Point-of-sale transaction analysis
- Failure prediction
- Threat analysis
- Trade surveillance
- Search quality
- Data "sandbox"

Page 22: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Tooling Atop MapReduce

Challenges
- MapReduce is incredibly powerful but quite verbose.
- It's a developer tool through and through.
- Common tasks are implemented over and over and over again.
- Better tooling is needed for data preparation and presentation.

Higher-Order Abstractions Are Emerging
- Writing a job in a higher-order MR language requires orders of magnitude less code, and thus orders of magnitude less time, at an acceptable cost in execution time.

Page 23: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop Pig Overview

Born at Y!; roughly 40% of Y!’s Hadoop jobs are scripted in Pig.

Operates directly on data in HDFS via…
- …a command line.
- …scripts.
- …plug-ins (Eclipse has one called Pig Pen).

Well suited for “data factory” (e.g. ETL) type of work. Raw data in, data ready for consumers out.

Users are often engineers, data specialists, and researchers.

Consists of a…
- …data flow language (Pig Latin); 10 lines of Pig Latin ~= 200 lines of Java.
- …shell (Grunt).
- …server with a JDBC-like interface (Pig Server).

Has lots of constructs suitable for data manipulation
- Join
- Merge
- Distinct
- Union
- Limit

Page 24: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Pig Makes Something Like This…

Page 25: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

…Look Like This
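The before-and-after code images from these two slides aren't preserved. As a representative sketch in the same spirit: word count, which takes pages of Java MapReduce, fits in a handful of lines of Pig Latin (the file paths here are hypothetical).

-- Load raw text, one record per line.
lines = LOAD '/data/input.txt' USING TextLoader() AS (line:chararray);
-- Split each line into words; FLATTEN turns the bag into one word per record.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/word_counts';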

Page 26: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

A More Interesting Pig Example

A more interesting example: the script shown on this slide (taken from a presentation given by Twitter's Analytics Lead, and not preserved here) breaks down where users are tweeting from (e.g. the API, the front page, their profile page). Note the use of scripts in Twitter's "PiggyBank".

Page 27: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Hadoop Hive Overview

Born at Facebook.

Does not operate directly on HDFS data, but rather on a Hive metastore
- HDFS data is pre-configured and loaded into the metastore.
- You may have something like /home/hive/warehouse on HDFS.
- Tables are created and stored as sub-directories in the warehouse.
- Contents are stored as files within table sub-directories.

Well suited for “data presentation” (e.g. warehouse) type of work.

Users are often engineers using data for their systems, analysts, or decision-makers.

Consists of a…
- …SQL-like data language (HiveQL).
- …shell.
- …server.

Mirrors SQL semantics
- Table creation.
- Select clauses.
- Insert clauses.
- Join clauses.
- Etc.

Page 28: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Working With Hive

[Slides 28-32 stepped through a Hive session in a series of screenshots, which are not preserved.]
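In their place, here is a minimal HiveQL session in the same spirit: create a table, load data into the warehouse, and query it. The table name, schema, and path are hypothetical, chosen to echo the instrument-log scenario from the opening slides.

-- Define a table; Hive stores it as a sub-directory under the warehouse.
CREATE TABLE instrument_logs (instrument STRING, event STRING, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Move a tab-delimited file on HDFS into the table's directory.
LOAD DATA INPATH '/data/instrument_logs.tsv' INTO TABLE instrument_logs;

-- Each query compiles down to one or more MapReduce jobs.
SELECT instrument, COUNT(1) AS failures
FROM instrument_logs
WHERE event = 'FAILURE'
GROUP BY instrument
ORDER BY failures DESC;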

Page 33: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Closing Thoughts

re: Data
- We already have big (or at the very least medium) data.
- Our data is inherently diverse, complex, and highly related.
- Our data footprint is growing fast; can informatics keep up?

re: Technology
- The Hadoop ecosystem and market are maturing and expanding rapidly.
- The Hadoop community is strong (oh yeah, and helpful too).
- The cost to get in the game is not outrageous in terms of software and hardware.
- Provisioning "hardware" is drastically easier, especially given our AWS efforts.

re: Opportunity
- Capturing data doesn't always need to be structured or regimented.
- R&D could benefit from additional "free form" data exploration.
- While relatively inexpensive to get in the game, it does require an investment.

Page 34: Finding the needles in the haystack. An Overview of Analyzing Big Data with Hadoop

Questions

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."

Grace Hopper

Fun fact: she is also credited with "It's easier to ask for forgiveness than it is to get permission."