In-Memory Computing: How, Why, and Common Patterns

Description

Traditionally, big data is read from disk and then processed. However, most big data systems are latency bound: the CPU often sits idle waiting for data to arrive. The problem is most prevalent in use cases, such as graph searches, that need to randomly access different parts of a dataset. In-memory computing proposes an alternative model in which data is loaded into or kept in memory and processed there, instead of being processed from disk. Although such designs cost more in terms of memory, the resulting systems can sometimes be orders of magnitude faster (e.g. 1000X), which can lead to savings in the long run, and with memory prices falling rapidly, the cost difference is shrinking by the day. Furthermore, in-memory computing can enable use cases, such as ad hoc analysis over large datasets, that were not possible earlier. This talk provides an overview of in-memory technology, discusses how WSO2 technologies like complex event processing can be used to build in-memory solutions, and previews upcoming improvements in the WSO2 platform.


In-Memory Computing

Srinath Perera
Director, Research, WSO2 Inc.

Performance Numbers (based on Jeff Dean's numbers)

Operation                              Cost (memory ops)    If a memory access took 1 second
L1 cache reference                     0.05                 1/20 sec
Main memory reference                  1                    1 sec
Send 2K bytes over 1 Gbps network      200                  3 min
Read 1 MB sequentially from memory     2,500                41 min
Disk seek                              1*10^5               27 hours
Read 1 MB sequentially from disk       2*10^5               2 days
Send packet CA->Netherlands->CA        1.5*10^6             17 days
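The right-hand column is just a unit change: each relative cost, measured in main-memory references, is reread as seconds. A minimal Scala sketch of that conversion (the cost values are the ones in the table; the helper is illustrative):

import scala.Predef._

// Scale each relative cost so that one main-memory reference (cost 1)
// takes one second, then print it in a human-friendly unit.
val costs = Seq(
  "L1 cache reference"                 -> 0.05,
  "Main memory reference"              -> 1.0,
  "Send 2K bytes over 1 Gbps network"  -> 200.0,
  "Read 1 MB sequentially from memory" -> 2500.0,
  "Disk seek"                          -> 1e5,
  "Read 1 MB sequentially from disk"   -> 2e5,
  "Send packet CA->Netherlands->CA"    -> 1.5e6
)

def human(sec: Double): String =
  if (sec < 60) f"$sec%.2f sec"
  else if (sec < 3600) s"${(sec / 60).toInt} min"
  else if (sec < 86400) s"${(sec / 3600).toInt} hours"
  else s"${(sec / 86400).toInt} days"

costs.foreach { case (op, c) => println(f"$op%-36s ${human(c)}") }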

Operation                  Speed (MB/sec)
Hadoop Select              3
Terasort benchmark         18
Complex Query (Hadoop)     0.2
CEP                        60
CEP Complex                2.5
SSD                        300-500
Disk                       50-100


Most Big Data Apps are Latency-Bound!

Often, your app wastes CPU cycles waiting for data to arrive.

Latency Lags Bandwidth

• Observation from Prof. Patterson's 2004 keynote
• Bandwidth improves, but latency does not keep up
• The same holds now, and the gap is widening with new systems

Handling Speed Differences in the Memory Hierarchy

1. Caching (see the sketch after this list)
   – E.g. processor caches, file cache, disk cache, permission cache
2. Replication
   – E.g. RAID, Content Distribution Networks (CDN), web caches
3. Prediction
   – Predict what data will be needed and prefetch it, trading bandwidth for latency
   – E.g. disk caches, Google Earth
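To make technique 1 concrete, here is a minimal caching sketch in Scala: an LRU (least-recently-used) cache built on java.util.LinkedHashMap's access-ordered mode, the eviction policy that processor, file, and disk caches approximate. The class name and capacity are illustrative, not from the talk.

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Access-ordered map that evicts the least recently used entry once
// the capacity is exceeded.
class LruCache[K, V](capacity: Int)
    extends JLinkedHashMap[K, V](capacity, 0.75f, true) { // true = access order
  override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > capacity
}

val cache = new LruCache[String, Array[Byte]](1024)
cache.put("block-17", new Array[Byte](4096)) // cache a 4KB block
val hit = cache.get("block-17")              // served from memory, no disk seek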

The Above Three Do Not Always Work

• Limitations
  – Caching works only if the working set is small
  – Prefetching works only when access patterns are predictable
  – Replication is expensive and limited by the capacity of the receiving machines

• Example: reading and filtering 10GB of data (at 6 bytes per record, roughly 1.7 billion records)
  – About 3 minutes to read the data from disk
  – About 35 ms to filter 10M records (one 60MB chunk) on my laptop, so roughly 6 seconds to process it all
  – Keeping the data in memory can therefore give about a 30X speedup (see the back-of-envelope check below)
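A quick back-of-envelope check of that 30X figure in Scala, using the slide's numbers (the ~55 MB/s disk rate is an assumption chosen so that reading 10GB takes about 3 minutes):

// 10 GB at 6 bytes/record ~= 1.7 billion records, filtered at
// 35 ms per 10M-record (60 MB) chunk.
val totalBytes = 10e9
val chunkBytes = 60e6                      // 10M records * 6 bytes = 60 MB
val chunks     = totalBytes / chunkBytes   // ~170 chunks
val readSec    = totalBytes / 55e6         // ~180 s, about 3 minutes
val procSec    = chunks * 0.035            // 35 ms per chunk => ~6 s
println(f"read ${readSec}%.0f s, process ${procSec}%.0f s, speedup ${readSec / procSec}%.0f X")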

Data Access Patterns in Big Data Applications

• Read from disk, process once (basic analytics)
  – Data can be prefetched; with sequential batch loads, memory is only about 100 times faster than disk
  – OK if processing time > data read time
• Read from disk, process iteratively (machine learning algorithms, e.g. k-means)
  – Need to load data from disk once and process it many times (e.g. Spark supports this)
• Interactive (OLAP)
  – Queries are random and data may be scattered; once a query has started, its data can be loaded into memory and processed
• Random access (e.g. graph processing)
  – Very hard to optimize
• Realtime access
  – Process data as it arrives

In-Memory Computing

Four Myths

• Myths
  – It is too expensive (in fact, a 1TB RAM cluster costs 20-40k, about 1$/GB)
  – It is not durable
  – Flash is fast enough
  – It is only about in-memory DBs
• From Nikita Ivanov's post:
  http://gridgaintech.wordpress.com/2013/09/18/four-myths-of-in-memory-computing/

Let us look at each big data access pattern and see where in-memory computing can make a difference.

Access Pattern 1: Read from Disk, Process Once

• If Tp = 35 ms vs. Td = 1.2 sec per 60MB chunk, keeping all data in memory gives about a 30X speedup (1.2 s / 35 ms ≈ 34)
• However, the benefit is smaller if the computation is more complex (e.g. sort), since processing time then dominates

Access Pattern 2: Read from Disk, Process Iteratively

• Very common pattern for machine learning algorithms (e.g. k-means)
• In this case, the advantages are greater
  – If we cannot hold the data fully in memory, we need to offload part of it and read it again on every iteration
  – That repeated cost of loading and processing is very high, and in-memory computing is much faster
• Spark lets you load the data fully into memory and process it there

Spark

• New programming model built on functional programming concepts
• Can be much faster for iterative use cases
• Has a complete stack of products

val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
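The word-count snippet above reads its input once. For the iterative pattern, the relevant call is cache(), which keeps the dataset in cluster memory so later passes skip the disk. A sketch under the same assumptions as the slide's snippet (spark is a SparkContext; the per-pass computation is a placeholder, not a real k-means step):

// Read once, cache in memory, then run many passes over RAM-resident data.
val points = spark.textFile("hdfs://...")
  .map(line => line.split(",").map(_.toDouble))
  .cache() // keep the parsed dataset in memory across iterations

for (i <- 1 to 10) {
  val cost = points.map(p => p.sum).reduce(_ + _) // each pass scans RAM, not disk
  println(s"iteration $i, cost $cost")
}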

Access Pattern 3: Interactive Queries

• Need to be responsive (< 10 sec)
• Harder to predict what data is needed
• Queries tend to be simpler
• Can be made faster by a RAM cloud
  – SAP HANA
  – VoltDB
• With smaller queries, disk may still be OK; Apache Drill is an alternative (a query sketch follows below)
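For a feel of the interactive pattern: both Apache Drill and VoltDB ship standard JDBC drivers, so a query is plain java.sql code. A hypothetical sketch (the connection string uses Drill's embedded zk=local form; the dfs.`/logs` table path and column names are made up for illustration):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:drill:zk=local")
val rs = conn.createStatement().executeQuery(
  "SELECT page, COUNT(*) AS hits FROM dfs.`/logs` GROUP BY page ORDER BY hits DESC LIMIT 10")
while (rs.next()) println(s"${rs.getString("page")} -> ${rs.getLong("hits")}")
conn.close()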

VoltDB Story

• The VoltDB team (Michael Stonebraker et al.) observed that about 92% of the work in a traditional DB is related to disk

• By building a completely in-memory database cluster, they made it about 20X faster!

Distributed Cache (e.g. Hazelcast)

• Stores data partitioned and replicated across many machines
• Used as a cache that spans multiple machines
• Key-value access
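A minimal sketch against Hazelcast's Java API from Scala: a distributed map whose entries are partitioned (and backed up) across the members that join the cluster. The map name and keys are illustrative.

import com.hazelcast.core.Hazelcast

val node  = Hazelcast.newHazelcastInstance() // start or join a cluster member
val cache = node.getMap[String, String]("user-cache")
cache.put("user-42", "profile-data")  // stored on whichever member owns the key
println(cache.get("user-42"))         // fetched over the network if remote
node.shutdown()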

Access Pattern 4: Random Access

• E.g. graph traversal
• This is the hardest use case
• In easy cases there is a small working set that can be handled with a cache (e.g. checking users against a blacklist); that is not the case for most graph operations like traversal
• In the hard cases, in-memory computing is the only real solution
• Can be 1000X faster or more
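Why traversal is so disk-hostile: every hop is a random lookup at an unpredictable key. Against an in-memory adjacency map each hop is O(1); on disk each hop can cost a full seek. A small breadth-first-search sketch over a toy graph (not from the talk):

import scala.collection.mutable

def bfs(graph: Map[Int, Seq[Int]], start: Int): Set[Int] = {
  val visited = mutable.Set(start)
  val queue   = mutable.Queue(start)
  while (queue.nonEmpty) {
    val node = queue.dequeue()
    for (n <- graph.getOrElse(node, Seq.empty) if visited.add(n))
      queue.enqueue(n) // each hop is a random, O(1) in-memory lookup
  }
  visited.toSet
}

val g = Map(1 -> Seq(2, 3), 2 -> Seq(4), 3 -> Seq(4), 4 -> Seq())
println(bfs(g, 1)) // Set(1, 2, 3, 4)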

Access Pattern 5: Realtime Processing

• This is already in-memory technology, built with tools like complex event processing (e.g. WSO2 CEP) or stream processing (e.g. Apache Storm)
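The state such an engine keeps is itself in memory. As a concept-only sketch (the real WSO2 CEP and Storm APIs are different), here is a sliding-window average updated per event, the kind of standing aggregation these engines evaluate as data arrives:

import scala.collection.mutable

// Keep the last N values in RAM; each event updates the aggregate in O(1).
class SlidingAverage(windowSize: Int) {
  private val window = mutable.Queue.empty[Double]
  private var sum = 0.0
  def onEvent(value: Double): Double = {
    window.enqueue(value)
    sum += value
    if (window.size > windowSize) sum -= window.dequeue() // expire oldest
    sum / window.size
  }
}

val avg = new SlidingAverage(3)
Seq(10.0, 20.0, 30.0, 40.0).foreach(v => println(avg.onEvent(v)))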

Faster Access to Data

• In-memory databases (e.g. VoltDB, MemSQL)
  – Provide the same SQL interface
  – Can be thought of as a fast database
  – VoltDB has been shown to be about 20X faster than MySQL
• Distributed cache
  – Can be integrated as a large cache

Load Data Set to Memory and Analyze

• Used with the interactive and random-access use cases
• Can be up to 1000X faster for some use cases
• Tools
  – Spark
  – Hazelcast
  – SAP HANA

Realtime Processing

• Realtime analytics tools
  – CEP (e.g. WSO2 CEP)
  – Stream processing (e.g. Apache Storm)

• Can generate results within a few milliseconds to seconds

• Can process tens of thousands to millions of events per second

• Not all algorithms can be implemented in this model

In-Memory Computing with the WSO2 Platform

Thank You
