Introduction to Big Data

Roi Blanco

What is Big Data?

•  A fashionable term used by some IT vendors to re-market old-fashioned hardware and software

•  “The term itself is vague, but it is getting at something that is real… Big Data is a tagline for a process that has the potential to transform everything.” Jon Kleinberg

•  What I want to talk about:
–  Big Data science, cool use cases
–  Access to data, tools to process the data (Hadoop and friends’ ecosystem)
–  What’s next (now!)

Now, that’s Big data


Data?

•  Advances in digital sensors, communications, computation, and storage have created huge collections of data, capturing information of value to business, science, government, and society.

•  Example: search engine companies
–  transformed how people find and make use of information on a daily basis.

•  Other forms of big data are transforming the activities of companies and scientific researchers.

•  Machine learning on large data-sets for decision making, product shaping.


Motivation

•  BIG DATA is an OPEN SOURCE Software Revolution
•  BIG DATA Analytics 2.0
•  What is happening right now
•  Why do we need new tools?
•  Improve decision making:
•  Measure and react in REAL-TIME

Data Explosion

[Figure: kinds of data shown: relational, text, audio, video, images. Picture from Big Data Integration.]

Real Time Decision Making

Companies need to know:
•  what is happening right now, in real time, to be able to
–  react
–  anticipate and detect new business opportunities.

Wal-Mart


LHC


WWW


Mobile


Intelligence agencies


Social media


Big Data 3(+3) Vs

•  Volume
•  Variety
•  Velocity

•  Value
•  Variability
•  Veracity

Volume vs Velocity


Controversy of Big Data

•  All data is BIG now
•  Hype to sell Hadoop-based systems
•  Ethical concerns about accessibility
•  Limited access to Big Data creates new digital divides

Controversy of Big Data

•  Statistical Significance:
–  When the number of variables grows, the number of fake correlations also grows
–  Leinweber: S&P 500 stock index correlated with butter production in Bangladesh
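
The effect can be reproduced with pure noise. The sketch below is an illustration only, not Leinweber's analysis: the function names, the sample size of 20, and the seed are assumptions. It correlates one random target series with a growing pool of unrelated random series and reports the strongest correlation found.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_fake_correlation(n_variables, n_samples=20, seed=0):
    """Largest |correlation| between a random target and unrelated random variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        # Each candidate is independent noise: any correlation is spurious.
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(strongest_fake_correlation(n), 3))
```

Because every run uses the same seed, the smaller pools are prefixes of the larger ones, so the reported maximum can only stay the same or grow as more variables are added.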


Need for Big Data

McKinsey Global Institute (MGI) Report on Big Data, 2011

•  WEF defined data as an asset just like gold or currency

•  Business opportunities for companies that can analyze information in the right way

•  What do your customers need?

•  What will they demand in the future?

Need for Big Data


•  How do you know the investment was worth it?

•  In successful cases, predictive analysis has led to income improvements of ~70%

McKinsey Global Institute (MGI) Report on Big Data, 2011

Crude Oil


Data Analysis

•  Most businesses are still running on small data!
•  Is more data always better?
–  Hardly
–  Past a certain point, the return on adding more data diminishes to the point that you’re only wasting time gathering more
•  Do you need data?
–  Of course
–  … but the right data (+ interpretation)
    •  Unbiased, context
•  Big data is not a magic wand for inferring causality
•  Most AI problems have been tackled from a data perspective
–  Still unsolved (Google’s cat detector).
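
One textbook way to see the diminishing return is the standard error of a sample mean, which shrinks only with the square root of the sample size. The sketch assumes independent samples and a simple mean estimate. The slides do not name a model, so the function name and the numbers are illustrative assumptions.

```python
import math


def standard_error(std_dev, n_samples):
    """Standard error of a sample mean: std_dev / sqrt(n_samples)."""
    return std_dev / math.sqrt(n_samples)


if __name__ == "__main__":
    std_dev = 1.0  # hypothetical spread of the measured quantity
    for n in (100, 10_000, 1_000_000):
        print(n, standard_error(std_dev, n))
```

Under these assumptions, each 100-fold increase in data buys only a 10-fold reduction in error.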


What is data science?


Why is interest in Machine Learning increasing?

•  Data is everywhere
–  Increasingly captured
–  Increasingly comprehensive
•  Storage is now much cheaper, and so is processing
–  In-house Hadoop clusters
–  Cloud-based processing (Amazon EC2)
•  Data is important
–  Machine learning provides an effective development methodology
–  … when you cannot program a solution by hand
–  … but you have data available
•  Let the data figure out the program
•  Any company with large data sets will have an interest

(HADOOP)


Big Data Challenges

Sort 10TB on 1 node = 2 days

100-node cluster = 30 min
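
The two figures are consistent with roughly linear speedup. A back-of-the-envelope check, assuming ideal scaling with no coordination overhead (the function name is hypothetical):

```python
def ideal_cluster_minutes(single_node_days, n_nodes):
    """Running time in minutes under ideal linear speedup (no overhead)."""
    single_node_minutes = single_node_days * 24 * 60
    return single_node_minutes / n_nodes


if __name__ == "__main__":
    # 2 days on one node, spread over 100 nodes
    print(ideal_cluster_minutes(2, 100))  # 28.8
```

2 days is 2,880 minutes, so 100 nodes would ideally need 28.8 minutes, close to the 30 minutes quoted.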


Big Data Challenges

“Fat” servers imply high cost

–  use cheap commodity nodes instead

A large number of cheap nodes implies frequent failures

–  leverage automatic fault-tolerance
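
A rough sense of why failures become routine at scale: the sketch assumes independent failures and a constant per-node failure rate. The three-year mean time between failures is a made-up figure for illustration, not a number from the slides.

```python
def expected_failures_per_day(n_nodes, mtbf_days):
    """Expected node failures per day, assuming independent failures
    and a constant per-node failure rate of 1 / mtbf_days."""
    return n_nodes / mtbf_days


if __name__ == "__main__":
    mtbf_days = 3 * 365  # hypothetical: one failure per node every 3 years
    for n_nodes in (1, 100, 10_000):
        print(n_nodes, round(expected_failures_per_day(n_nodes, mtbf_days), 3))
```

With these assumed numbers, a single machine fails about once every three years, while a 10,000-node cluster sees several failures every day, so recovery has to be automatic.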


Big Data Challenges

We need a new data-parallel programming model for clusters of commodity machines

MapReduce

Published in 2004 by Google

–  MapReduce: Simplified Data Processing on Large Clusters

Popularized by the Apache Hadoop project, started by Yahoo!

–  Now used by virtually everybody else: Facebook, Twitter, Amazon, …
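
The programming model can be sketched in a few lines on a single machine. This is a toy illustration of the map, shuffle, and reduce steps using word count, not Hadoop's API; all function names are assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, shuffle by key, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters", "data data"]))
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle over the network. The programmer writes only the map and reduce functions.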


Who uses Hadoop?


Map Reduce Philosophy

– hide complexity

– make it scalable

– make it cheap

1.  System Shall Manage and Heal Itself

2.  Performance Shall Scale Linearly

3.  Compute Should Move to Data

4.  Simple Core, Modular and Extensible

Hadoop High-Level Architecture

•  Name Node: maintains mapping of file blocks to data node slaves
•  Job Tracker: schedules jobs across task tracker slaves
•  Data Node: stores and serves blocks of data
•  Task Tracker: runs tasks (work units) within a job
•  Data Node and Task Tracker share a physical node
•  Hadoop Client: contacts Name Node for data or Job Tracker to submit jobs

Pig


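-- Load three integer fields per record from the file 'data'
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group the records by the value of f1
B = GROUP A BY f1;
-- Count the records in each group (COUNT takes the bag A, not the group key $0)
C = FOREACH B GENERATE COUNT(A);
DUMP C;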
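
For readers who do not know Pig, the computation in the script above (group records by f1, count the records in each group) can be written on a single machine as follows. This is an illustrative equivalent with made-up sample rows. It keeps the group key next to each count for readability, and it says nothing about how Pig executes the script.

```python
from collections import Counter


def count_by_first_field(rows):
    """Group rows by their first field (f1) and count the rows in each group."""
    return Counter(f1 for f1, _f2, _f3 in rows)


if __name__ == "__main__":
    rows = [(1, 2, 3), (4, 2, 1), (1, 5, 6)]  # made-up sample tuples
    print(count_by_first_field(rows))
```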

Pig: Similar to SQL

Pig powers

HBase

•  Apache HBase™ is the Hadoop database, a distributed, scalable, big key-value store
–  Linear and modular scalability.
–  Strictly consistent reads and writes.
–  Automatic and configurable sharding of tables
–  Failover support
–  Interoperable with Java, Hadoop

Hive

•  Apache project for querying and analyzing datasets in HDFS
–  Tools to enable easy data extract/transform/load (ETL)
–  A mechanism to impose structure on a variety of data formats
–  Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
–  Query execution via MapReduce

Apache S4


Twitter Storm


Apache Mahout


MOVING TOWARDS (NEAR)REALTIME

Runaway Complexity


Future

•  Process data fast enough
–  BI analytics
•  Key drivers: connected devices/services
–  Tablets, smartphones, etc.
–  Your data is “always connected to the cloud”
–  Low latency (again)/enormous amount of data
•  User data
–  Categorize data to infer knowledge about a user
    •  Targeting, personalization
    •  100B events per day
–  ML: from information to knowledge
–  Behavioral targeting (user features)
    •  How likely am I to be interested in fashion? For how long?
    •  Map to behavioral targeting categories, segment for targeting

Future (II)

•  Data processed in batches
–  There are gaps!
–  Things you’ve calculated half an hour ago
–  Ok for monthly reports, not for online NRT prediction
–  Think of GEO targeting
•  You can’t go fast enough with MR
–  From big long windows to small incremental iterations
–  Micro-batches updating user knowledge
•  Use cases
–  Ad campaign allocation
    •  Delay between click and deducting budget from an advertiser (overspending)
–  Personalization and targeting
    •  Y! Homepage
    •  Use every event on the stream to detect the interest
–  How do we train machine learning models when the data is arriving non-stop?
    •  You want parameters to adapt, to change slowly
    •  Maybe 99% of the data is the same! Incrementally is better
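
A minimal sketch of the incremental idea, assuming the "model" is a single interest score per user that is nudged by each incoming event. The class name, the learning rate, and the event stream are illustrative assumptions, not the system described in the slides.

```python
class IncrementalInterest:
    """Running estimate of a user's interest, updated one event at a time."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate  # small rate => the estimate changes slowly
        self.estimate = 0.0

    def update(self, clicked):
        """Move the estimate a small step toward the newest observation."""
        observation = 1.0 if clicked else 0.0
        self.estimate += self.learning_rate * (observation - self.estimate)
        return self.estimate


if __name__ == "__main__":
    model = IncrementalInterest(learning_rate=0.1)
    for clicked in [True, True, False, True]:  # made-up event stream
        print(round(model.update(clicked), 4))
```

Each event costs constant time and memory, and nothing is recomputed from the full history, which is the property batch MapReduce jobs lack.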


Beyond Hadoop

•  YARN
–  What if you just want to interact with the data in Hadoop?
    •  Hive (SQL-like), HBase (NoSQL) and Pig (scripted data access)
–  Those apps are great but limited to running as a single application system with MapReduce at the core
–  Spark (see below) and Storm have been ported to YARN already
•  Streaming
–  SAMOA
•  RDDs
–  Spark
    •  Shark (Hive on Spark)
•  Analytics Architecture
–  Visualization http://visualize.yahoo.com/mail/

Future Challenges for Big Data

•  Evaluation
•  Time evolving data
•  Distributed mining
•  Compression
•  Visualization
•  Hidden Big Data

Hadoop 2.0

•  No longer “only” running MR jobs
–  MR + low-latency and streaming processing
•  Iterative processing
–  Hold data in memory to re-process
•  Figure out what questions to ask of the data
–  BI users who want to explore the data really fast
•  Possible thanks to YARN + Storm (S4) + Spark + … ?
–  350PB of data
–  >30K nodes with YARN
–  400K jobs per day (6 jobs/sec)
–  10M hours of compute with YARN

Future key take-aways

•  Scalability
•  Performance
•  Flexibility
•  Programming paradigms
–  MAP/MAP/MAP .. OR REDUCE/REDUCE/REDUCE

Big Data Myths

•  Big Data is new
•  Big Data is objective
•  Big Data doesn’t discriminate
•  Big Data makes things smart
•  Big Data is anonymous
•  You can opt-out

Big Data vs Big Reality

•  Big Data is an oxymoron
•  Big Data raises bigger issues. The term suggests assembling many facts to create greater, previously unseen truths. It suggests the certainty of math.
•  It's not the data itself but what you do with it that counts.
