Taewook Eom Data Infrastructure Team SK planet 2014-01-28

Strata Conference NYC 2013

Page 2: Strata Conference NYC 2013

Taewook Eom

http://www.flickr.com/photos/oreillyconf/10616622085/

Data Programmer
Plaster (Planet Master) of Big Data Infra
Pre-Assessor of Hiring Programmers
Mentor of 101 Startup Korea

Twitter: @taewooke
LinkedIn: http://kr.linkedin.com/in/taewookeom

Page 3: Strata Conference NYC 2013

http://strataconf.com/

by O’Reilly

Web 2.0 : Open, Sharing, Participation

Santa Clara : Technical

New York with Cloudera : Financial, Business

Europe : Privacy, Government

Boston : Medical

Big Data : Making Data Work Change the World with Data.

Page 4: Strata Conference NYC 2013

Data

When hardware became commoditized, software was valuable. Now that software is being commoditized, data is valuable.

– Tim O’Reilly, 2011

Data is like the blood of the enterprise.

– Amr Awadallah, CTO at Cloudera, 2013

Page 5: Strata Conference NYC 2013

Big Data Architectural Patterns http://strataconf.com/stratany2013/public/schedule/detail/30397

What is Big Data?

All data that is not a fit for a traditional RDBMS, whether used for OLTP or Analytics purposes

Page 6: Strata Conference NYC 2013

http://blog.vitria.com/Portals/47881/images/3values-resized-600.png

Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data - Gartner, 2011

Page 7: Strata Conference NYC 2013

http://image-store.slidesharecdn.com/ae63030a-3d9b-11e3-9cff-22000a970267-original.jpg

Page 8: Strata Conference NYC 2013

Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968

Page 9: Strata Conference NYC 2013

Data Science

http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 10: Strata Conference NYC 2013

Big Data

http://mappingignorance.org/fx/media/2013/07/Figura-11.jpg

Open Mind!

Page 11: Strata Conference NYC 2013

Big Data

Gartner's 2013 Hype Cycle for Emerging Technologies (2013-08-19)

Page 12: Strata Conference NYC 2013

More than half of the technical sessions were presented by Chinese or Indian speakers

39 of 125 sessions are sponsored sessions

Page 13: Strata Conference NYC 2013

Big Data: 4 Approaches

Search-based Hadoop-based

RDB-based NoSQL

Page 14: Strata Conference NYC 2013

Real-time Processing

Real-time Recommendations for Retail: Architecture, Algorithms, and Design http://strataconf.com/stratany2013/public/schedule/detail/30217

Page 15: Strata Conference NYC 2013

Real-time Stream Processing

Apache Storm

Streaming

Apache Kafka Gathering

Processing

Querying Search-based

NoSQL

Stringer/Tez Shark SQL

Page 16: Strata Conference NYC 2013

… not yet Graph Processing

Page 17: Strata Conference NYC 2013

Big Data Space

No one tool is the right fit for all Big Data problems.
Do not be afraid to recommend the right solution for the problem over the popular solution.
To do this, you must be aware of the entire ecosystem.

Big Data Architectural Patterns http://strataconf.com/stratany2013/public/schedule/detail/30397

Page 18: Strata Conference NYC 2013

Practical Performance Analysis and Tuning for Cloudera Impala http://strataconf.com/stratany2013/public/schedule/detail/30551

Page 19: Strata Conference NYC 2013

Hadoop and the Relational Data Warehouse – When to Use Which? http://strataconf.com/stratany2013/public/schedule/detail/30964

Page 20: Strata Conference NYC 2013

Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968

Page 21: Strata Conference NYC 2013

Ignite

Signal Detection Theory: Man vs Machine

Co-Founder @VividCortex Kyle Redinger

http://www.youtube.com/watch?v=Fg6mN-jevds

(5 minutes 6 seconds)

http://www.slideshare.net/realkyleredinger/man-vs-machine-signal-detection-theory-and-big-data

Page 22: Strata Conference NYC 2013

Signal Detection Theory: Man vs Machine

Remove the obvious and look at what is important Remember: Less is more.

Page 23: Strata Conference NYC 2013

Towards Strata 2014

Director of market research at O’Reilly Media Roger Magoulas

http://www.youtube.com/watch?v=Ytd5VkEgQf8

(5 minutes 26 seconds)

http://strataconf.com/stratany2013/public/schedule/detail/31935

Keynote

http://www.oreilly.com/data/free/files/stratasurvey.pdf

Page 24: Strata Conference NYC 2013

Towards Strata 2014

Page 25: Strata Conference NYC 2013

Towards Strata 2014

Page 26: Strata Conference NYC 2013

Towards Strata 2014

Page 27: Strata Conference NYC 2013

Towards Strata 2014

Page 28: Strata Conference NYC 2013

Beyond R and Ph.D.s: The Mythology of Data Science Debunked
Douglas Merrill (ZestFinance)

http://www.youtube.com/watch?v=J2sgObXbIWY (8 minutes 9 seconds)

Science is fundamentally about data, but data is not fundamentally about science

Page 29: Strata Conference NYC 2013

People

A data scientist is a data analyst who lives in California. – George Roumeliotis (Intuit)

Page 32: Strata Conference NYC 2013

Scientists think they can code, software engineers think they are scientists. Team them up so they collaborate.

– Scott Sorenson (Ancestry.com)
Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop

Page 33: Strata Conference NYC 2013

How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce http://strataconf.com/stratany2013/public/schedule/detail/30707

Page 34: Strata Conference NYC 2013

Data scientists spend their lives as data janitors instead of leveraging their skills

– Wes McKinney (DataPad)
Building More Productive Data Science and Analytics Workflows

Page 35: Strata Conference NYC 2013

Keynote

Is Bigger Really Better? Predictive Analytics

with Fine-grained Behavior Data

Professor at the NYU Stern School of Business Foster Provost

http://www.youtube.com/watch?v=1jzMiAfLH2c

(10 minutes 16 seconds)

http://strataconf.com/stratany2013/public/schedule/detail/31685

Page 36: Strata Conference NYC 2013

Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data

Page 37: Strata Conference NYC 2013

Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data

Page 38: Strata Conference NYC 2013

Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data

Predictive does not mean actionable. – Scott Sorenson (Ancestry.com)

Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop

Page 39: Strata Conference NYC 2013

Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data

More data gives you more precision, not more prediction. Use multiple datasets to reduce errors when measuring values.

- Ravi Iyer (Ranker.com)
Using Graphs of Data to Understand your Customers, Users, and Employees
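
One way to read the "more precision" half of this quote is the textbook standard error of a mean. The sketch below is an illustration added here, not something shown in the session: it assumes independent measurements with the same noise level, and the function name `standard_error` and the numbers are hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean of n independent measurements.

    Assumes every measurement has the same standard deviation `sigma`
    (an assumption made for this illustration only).
    """
    return sigma / math.sqrt(n)


# Hypothetical noise level of 10 units per measurement.
errors = {n: standard_error(10.0, n) for n in (1, 100, 10_000)}

for n, err in errors.items():
    print(f"n={n:>6}: standard error = {err:.2f}")
```

Under these assumptions the error shrinks only with the square root of the sample size: the estimate of the same quantity gets tighter, but nothing new becomes predictable.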

Page 40: Strata Conference NYC 2013

Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data

Page 41: Strata Conference NYC 2013

Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data

Page 42: Strata Conference NYC 2013

Big Impact from Big Data

Head of Analytics at Facebook Ken Rudin

http://www.youtube.com/watch?v=RJFwsZwTBgg

(11 minutes 57 seconds)

http://strataconf.com/stratany2013/public/schedule/detail/31903

Keynote

Page 43: Strata Conference NYC 2013

Big Impact from Big Data

Page 44: Strata Conference NYC 2013

Designing Your Data-Centric Organization
Josh Klahr (Pivotal)

http://www.youtube.com/watch?v=D86udfrVzrI (12 minutes)

Hadoop is a hammer, but you need other tools along with it.

Page 45: Strata Conference NYC 2013

Big Impact from Big Data

The way you organize information depends on the question you intend to ask of it.

- Richard Saul Wurman
Building a Data Platform

Page 46: Strata Conference NYC 2013

HaDump: Loading data into Hadoop for no reason.

Data Science Without a Scientist http://strataconf.com/stratany2013/public/schedule/detail/31801

Page 47: Strata Conference NYC 2013

Big Impact from Big Data

Technical people still don't understand the business needs of business people! Business people don't know what a table is.

- Anurag Tandon (MicroStrategy)
Inject Big Data into your Corporate DNA: Enable Every Employee to Make Data Driven Decisions

Page 48: Strata Conference NYC 2013

Ask the Right Questions

Organizations already have people who know their own data better than mystical data scientists. Learning Hadoop is easier than learning the company’s business.

- Gartner, 2012

Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968

Page 49: Strata Conference NYC 2013

Non-linear Storytelling: Towards New Methods and Aesthetics for Data Narrative http://strataconf.com/stratany2013/public/schedule/detail/30207

Page 50: Strata Conference NYC 2013

Every Soldier is a Sensor: Countering Corruption in Afghanistan http://strataconf.com/stratany2013/public/schedule/detail/30828

Page 51: Strata Conference NYC 2013

Big Impact from Big Data

Page 52: Strata Conference NYC 2013

Big Impact from Big Data

Page 53: Strata Conference NYC 2013

Big Impact from Big Data

Page 54: Strata Conference NYC 2013

< Actionable Usable < Useful

with Impact

If you can't answer "so what?", you only have facts, not insight.

- Baron Schwartz (VividCortex Inc)
Making Big Data Small

Descriptive (Easy) What happened?

Predictive (Medium) What will happen?

Prescriptive (Hard) What should we do about it?

Hadoop & Data Science for the Enterprise

Value of Data

Page 55: Strata Conference NYC 2013

Big Data is the first industry that was created by open source.

- Jack Norris (MapR Technologies)
Separating Hadoop Myths from Reality

The Future of Hadoop : What Happened

& What's Possible?

Co-Founder of Hadoop Doug Cutting

http://www.youtube.com/watch?v=_WwuZI6AhN8

(14 minutes 41 seconds)

http://strataconf.com/stratany2013/public/schedule/detail/31591

Hadoop: the kernel of the OS for data.

Page 56: Strata Conference NYC 2013

Hadoop's Impact on the Future of Data Management
Mike Olson (Cloudera)

http://www.youtube.com/watch?v=puHS2JNKgRM
http://strataconf.com/stratany2013/public/schedule/detail/31380

Page 57: Strata Conference NYC 2013

Single:
- S/W & H/W system
- security model
- management model
- metadata model
- audit model
- resource management model

Common:
- storage & schema

http://www.slideshare.net/cloudera/enterprise-data-hub-the-next-big-thing-in-big-data

Page 58: Strata Conference NYC 2013

- Last generation of data management is not sufficient
- More copies, representations, transformations increase risk
- Index once and reuse across workloads, lifecycle
- NoSQL: indexing and updates for interactive apps
- Hadoop: staging, persistence, and analytics

Data Governance for Regulated Industries Using Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30738

Page 59: Strata Conference NYC 2013

Rethink How You See Data
Sharmila Shahani-Mulligan (ClearStory Data)

http://www.youtube.com/watch?v=07hGulTOZGk (9 minutes 6 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31742

Data Intelligence

Page 60: Strata Conference NYC 2013

?

Question Analysis & Discovery

Access Sampling Modeling Presentation

The Data Availability Problem

Insight

Data Prep – too slow!

Loading

Introducing a New Way to Interact with Insight http://strataconf.com/stratany2013/public/schedule/detail/31743

Information Supply Chain

Page 61: Strata Conference NYC 2013

Running Non-MapReduce Big Data applications on Apache Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30755

Page 62: Strata Conference NYC 2013

What’s Next for Apache HBase: Multi-tenancy, Predictability, and Extensions. http://strataconf.com/stratany2013/public/schedule/detail/30857

Apache HBase for Architects http://strataconf.com/stratany2013/public/schedule/detail/30619

Page 63: Strata Conference NYC 2013

Securing the Apache Hadoop Ecosystem http://strataconf.com/stratany2013/public/schedule/detail/30302

Page 64: Strata Conference NYC 2013

An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB http://strataconf.com/stratany2013/public/schedule/detail/30959

Page 65: Strata Conference NYC 2013

Schema

Information does not exist until a schema is defined and data is stored in a relational database

- anonymous

Building a Data Platform http://strataconf.com/stratany2013/public/schedule/detail/31400

Page 66: Strata Conference NYC 2013

Lessons Learned From A Decade’s Worth of Big Data At The U.S. National Security Agency (NSA) http://strataconf.com/stratany2013/public/schedule/detail/30913

Page 67: Strata Conference NYC 2013

Managing a Rapidly Evolving Analytics Pipeline http://strataconf.com/stratany2013/public/schedule/detail/30635

Page 68: Strata Conference NYC 2013

SQL on/in Hadoop/HBase Solutions

Stringer/Tez Shark

Perception is Key: Telescopes, Microscopes and Data http://strataconf.com/strataeu2013/public/schedule/detail/32351

Page 69: Strata Conference NYC 2013

All SQL on Hadoop Solutions are Missing the Point of Hadoop

Every solution makes you define a schema
- SQL (Structured Query Language) is expressed over an assumed schema

Major reasons why Hadoop has taken off include:
- Ability to load data without defining a schema
- Ability to process data using schema-on-read instead of first defining a schema

Hadoop contains a lot of:
- Raw, granular data sets with potentially inconsistent schemas
- Data sets in JSON, key-value, and other self-describing (non-relational) models designed for schema-on-read processing

SQL on Hadoop solutions that make you first define a schema are missing a major part of Hadoop’s usage patterns

Flexible Schema and the End of ETL http://strataconf.com/stratany2013/public/schedule/detail/31868
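
The schema-on-read idea above can be illustrated with a minimal sketch. It is not taken from the session: the sample records, field names, and the helper `read_with_schema` are all hypothetical, and plain Python stands in for whatever engine would run over data in Hadoop. Raw JSON lines are stored untouched, and a schema (here just a list of field names) is applied only when the data is read.

```python
import json

# Hypothetical raw records with inconsistent fields, stored as-is.
RAW_LINES = [
    '{"user": "a", "event": "play", "track_id": 42}',
    '{"user": "b", "event": "search", "query": "jazz"}',
    '{"user": "c", "event": "play", "track_id": 7, "device": "mobile"}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto `fields` at read time.

    Missing fields become None instead of causing the record to be
    rejected, so no schema has to be declared before loading.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two different "schemas" applied to the same raw data at query time.
plays = [
    row
    for row in read_with_schema(RAW_LINES, ["user", "track_id"])
    if row["track_id"] is not None
]
searches = [
    row
    for row in read_with_schema(RAW_LINES, ["user", "query"])
    if row["query"] is not None
]

print(plays)
print(searches)
```

The same raw lines serve both queries; only the reader decides which fields matter, which is the usage pattern the slide says schema-first SQL layers miss.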

Page 70: Strata Conference NYC 2013

Lessons Learned

Page 71: Strata Conference NYC 2013

Hadoop Adventures At Spotify http://strataconf.com/stratany2013/public/schedule/detail/30570

Page 72: Strata Conference NYC 2013

Hadoop Adventures At Spotify http://strataconf.com/stratany2013/public/schedule/detail/30570

Page 73: Strata Conference NYC 2013

- Prototyping is key to overcoming resistance to change
- Technical architecture is heavily influenced by people organization
- Developing a team of experienced Hadoop users can often be done using internal employees
- A culture of experimentation and innovation yields the best result

- Quick prototyping is the fastest way to internal advocacy. Ship It!
- Cloud == Speed
- We don’t always need a complicated solution. KISS
- Play to your differentiating strengths. Experience >> Data
- Bias towards impact.
- It Takes a Village
- EASE!! (Emulate, Analyze, Scale, Evaluate)

Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30499

How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce http://strataconf.com/stratany2013/public/schedule/detail/30707

Page 74: Strata Conference NYC 2013
Page 75: Strata Conference NYC 2013

Questions? SELECT questions FROM audience;