What is Big Data in a Nutshell?: An Introduction to Problems and Bottlenecks in Data Systems Zach Gazak David E Drummond Insight Data Science & Engineering

Data for Action Talk - 2016-02-22

Program mentors are data teams from top technology companies including:

500+ Fellows

100+ Companies

Goals

• Understand what can be done with “Big Data” and the scale of the data.

• Understand the hardware bottlenecks that dictate the technology “stack”.

• Understand different stacks that are used for different types of companies, and why.

Facebook is Data

Types of Data

• Audio / Visual: Images and Videos

• Text: Comments, Notes, Profile Content

• Interactions: Likes, Friendships, Groups

• Site usage: Log in, Scroll, Click, Post, etc.

Unstructured

Structured

How is it Used?

• Business Intelligence / Analytics

• Customer engagement

• Research and Development

• Product Iteration and Improvement

How much data is there?

For Zach:

• ~1 MB per month

• Unstructured data only

For 1.2 billion Zachs: ~1.2 petabytes per month
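The scale-up from one user to 1.2 billion users is simple multiplication. A minimal sketch of that arithmetic follows; the decimal unit definitions (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) and the function name are assumptions made here, not something the slides state.

```python
# Back-of-envelope check of the per-user and total figures from the slide.
BYTES_PER_MB = 10**6    # assumed decimal megabyte
BYTES_PER_PB = 10**15   # assumed decimal petabyte


def monthly_volume_pb(users, mb_per_user_per_month):
    """Return the total monthly data volume in petabytes."""
    total_bytes = users * mb_per_user_per_month * BYTES_PER_MB
    return total_bytes / BYTES_PER_PB


# "1.2 billion Zachs" at "~1 MB per month" each
total_pb = monthly_volume_pb(1_200_000_000, 1)
print(f"{total_pb:.1f} PB per month")
```

The same function can be reused to see how sensitive the total is to the per-user figure, for example by passing 10 MB per user instead of 1 MB.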

How is this done?

Hardware basics

• Various ports (I/O): up to ~10 GB/s

• CPU (processor): ~1 GHz

• Hard Drive (storage): ~250 GB

• RAM (memory): ~8 GB

Network Processing Storage

Bottlenecks in Data Systems

Proper data system design should consider these limiting bottlenecks:

• Processing time by the CPU

• Loading data into the CPU and memory

• Finding data on the disk

• Reading data from the disk

• Moving data across the network

Bottlenecks: Processing Data

• All data that is processed must be loaded into the CPU

[Figure: Disk Storage, Memory, and CPU arranged along Price and Speed axes]

• Solution: Storage Hierarchy, Supercomputers, Distributed Systems

Bottlenecks: Finding Data

• Finding a new file on disk (known as random seeks)

[Figure: hard disk with labels: actuator arm with head that reads from disk; beginning of desired file; end of desired file]

• Solution: SSD and structured databases for specific use cases

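One way to get a feel for the seek bottleneck is to read the same blocks of a file once in order and once in shuffled order. The sketch below is a rough experiment and not a rigorous benchmark: the block size, file size, and function names are assumptions chosen here, and on a machine with an SSD or a warm operating-system cache the two timings may be nearly identical.

```python
import os
import random
import tempfile
import time

BLOCK_SIZE = 4096    # bytes per read (assumed)
NUM_BLOCKS = 2048    # 8 MB test file (assumed, kept small on purpose)


def read_blocks(path, order):
    """Read the given block indices in order; return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb", buffering=0) as f:
        for index in order:
            f.seek(index * BLOCK_SIZE)
            total += len(f.read(BLOCK_SIZE))
    return total, time.perf_counter() - start


def compare_access_patterns():
    """Time sequential versus shuffled reads of a temporary file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(BLOCK_SIZE * NUM_BLOCKS))
        path = tmp.name
    try:
        sequential = list(range(NUM_BLOCKS))
        shuffled = sequential[:]
        random.shuffle(shuffled)
        seq_bytes, seq_time = read_blocks(path, sequential)
        rnd_bytes, rnd_time = read_blocks(path, shuffled)
    finally:
        os.remove(path)
    return seq_bytes, seq_time, rnd_bytes, rnd_time


if __name__ == "__main__":
    seq_bytes, seq_time, rnd_bytes, rnd_time = compare_access_patterns()
    print(f"sequential: {seq_bytes} bytes in {seq_time:.4f} s")
    print(f"shuffled:   {rnd_bytes} bytes in {rnd_time:.4f} s")
```

Both passes read exactly the same bytes; only the order differs, which is what isolates the cost of moving around the file from the cost of reading it.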

Bottlenecks: Moving Data

• Moving data from machine to machine over a network

• Solution: Keeping data close to the processors (MapReduce)

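The idea of keeping data close to the processors can be illustrated with a toy, single-process imitation of the MapReduce pattern. This is a simplified sketch and not a real distributed framework: the partitions, event names, and function names are all hypothetical, and each list merely stands in for data stored on a separate machine.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each list stands in for the data held on one machine.
partitions = [
    ["like", "post", "like", "click"],
    ["click", "like", "scroll"],
    ["post", "post", "like"],
]


def map_partition(events):
    """Runs where the data lives and returns a small per-partition summary."""
    return Counter(events)


def reduce_counts(left, right):
    """Merge two summaries; only these small Counters would cross the network."""
    return left + right


def count_events(parts):
    """Summarize each partition locally, then combine the summaries."""
    summaries = [map_partition(part) for part in parts]
    return reduce(reduce_counts, summaries, Counter())


totals = count_events(partitions)
print(totals)
```

The raw event lists never leave their partition; only the per-partition counts are combined, which is the property that reduces network traffic in the real systems.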

Bottlenecks: Example

• Processing a 2 kB transaction in memory, sequentially and randomly on disk, or across the network

100:1   200:1   50:1
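A simple way to reason about such comparisons is to model each access as a fixed latency plus the payload size divided by bandwidth. The sketch below applies that model to a 2 kB payload. Every latency and bandwidth figure in it is an assumed, order-of-magnitude value chosen for illustration; none of them comes from the slides, and the results are not meant to reproduce the ratios shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Medium:
    name: str
    latency_s: float      # fixed cost before the first byte arrives
    bytes_per_s: float    # throughput once data is streaming


# Assumed, order-of-magnitude figures; NOT taken from the slides.
MEDIA = [
    Medium("memory", 100e-9, 4e9),
    Medium("disk (sequential)", 0.0, 100e6),
    Medium("disk (random seek)", 10e-3, 100e6),
    Medium("network (same data center)", 0.5e-3, 125e6),
]


def transfer_time(medium, size_bytes):
    """Estimated seconds to access size_bytes on the given medium."""
    return medium.latency_s + size_bytes / medium.bytes_per_s


def estimate(size_bytes=2_000):
    """Return {medium name: estimated seconds} for one payload size."""
    return {m.name: transfer_time(m, size_bytes) for m in MEDIA}


if __name__ == "__main__":
    for name, seconds in estimate().items():
        print(f"{name:28s} {seconds * 1e6:12.1f} microseconds")
```

For small payloads the fixed latency term dominates, which is why a random disk seek or a network hop costs far more than the transfer itself under these assumptions.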

Open Questions

• Will processors continue to improve?

• Are there new types of processing?

• What if memory replaced hard disks?

Quantum Computing

GPU and Deep Learning

Memory Optimized

Tech Stacks for Companies

Depending on your growth plans:

• Single system with small data

• Distributed data center with large data

• Renting computers for flexibility (cloud)

Small Firms with Small Data

• Example: Small medical firm with slow growth

• Pros: Easy to maintain, data locality, inexpensive

• Cons: Difficult to grow quickly, risky, not ideal for analysis

Large Firms with Stable Growth

• Example: Facebook with steadily growing data centers

• Pros: Economies of scale, redundancy, innovative design

• Cons: Upfront capital, dedicated maintenance

• >100 PB of Data

• 7 PB / Day

• 1 kW / TB

• ~$20 / TB / Month

Start-Ups with Exponential Growth

• Example: Airbnb - rent processing and storage from AWS

• Pros: Scales easily, no maintenance, no upfront capital

• Cons: Expensive in the long run, dependent on the provider

• 50 GB / Day

• $20-50 / TB / Month
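The per-terabyte figures on these slides can be turned into rough monthly bills. The sketch below assumes that the quoted prices are monthly storage costs per terabyte held, uses 100 PB as a lower bound for the large firm, and assumes one year of accumulated data for the start-up; those assumptions and the function name are introduced here for illustration only.

```python
TB_PER_PB = 1_000   # assumed decimal units
GB_PER_TB = 1_000


def monthly_storage_cost(terabytes, dollars_per_tb_month):
    """Monthly cost in dollars for holding the given number of terabytes."""
    return terabytes * dollars_per_tb_month


# Large firm: ">100 PB of Data" at "~$20 / TB / Month" (100 PB as a lower bound).
owned_cost = monthly_storage_cost(100 * TB_PER_PB, 20)

# Start-up: "50 GB / Day" at "$20-50 / TB / Month".
# One year of accumulation is an assumption made for this example.
startup_tb = 50 * 365 / GB_PER_TB
rented_low = monthly_storage_cost(startup_tb, 20)
rented_high = monthly_storage_cost(startup_tb, 50)

print(f"large firm: at least ${owned_cost:,.0f} per month")
print(f"start-up after one year: {startup_tb:.2f} TB, "
      f"${rented_low:,.2f} to ${rented_high:,.2f} per month")
```

At the start-up's scale the absolute bill is small at either end of the price range, which is consistent with renting being attractive early on and becoming expensive only as data volume grows.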

Start-Ups with Exponential Growth

• Example: Netflix - AWS fails on Christmas Eve

• Con: You can rent the computers, but you own the failure