©2011 Azul Systems, Inc. Speeding up Enterprise Search with In-Memory Indexing Gil Tene, CTO & co-Founder, Azul Systems

Enterprise Search Summit - Speeding Up Search



In this presentation, Gil Tene, CTO of Azul Systems, discusses the benefits of using in-memory, in-process, and in-heap index representations with heap sizes that can now make full use of current commodity server scale. He also compares throughput and latency characteristics of the various architectural alternatives, using some of the configurable choices available in Apache Lucene™ 4.0 for specific examples.


Page 1: Enterprise Search Summit - Speeding Up Search

©2011 Azul Systems, Inc.

Speeding up Enterprise Search

with In-Memory Indexing

Gil Tene, CTO & co-Founder, Azul Systems

Page 2: Enterprise Search Summit - Speeding Up Search


This Talk’s Purpose / Goals

Describe how in-memory indexing can help enterprise search

Explain why in-memory and large memory use has been delayed

Show how in-memory indexing is now (finally) practical for interactive search applications

Page 3: Enterprise Search Summit - Speeding Up Search


About me: Gil Tene

co-founder, CTO @Azul Systems

Have been working on “think different” GC approaches since 2002

Created Pauseless & C4 core GC algorithms (Tene, Wolf)

A long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc.

* Working on real-world trash compaction issues, circa 2004

Page 4: Enterprise Search Summit - Speeding Up Search


About Azul

We make scalable Virtual Machines

Have built “whatever it takes to get job done” since 2002

3 generations of custom SMP Multi-core HW (Vega)

Now Pure software for commodity x86 (Zing)

“Industry firsts” in Garbage collection, elastic memory, Java virtualization, memory scale

Vega

C4

Page 5: Enterprise Search Summit - Speeding Up Search


About Azul

Supporting mission-critical deployments around the world

Page 6: Enterprise Search Summit - Speeding Up Search


High level agenda

How can in-memory help things?

Why isn’t everyone just using it?

The “Application Memory Wall” problem

Some Garbage Collection banter

What an actual solution looks like...

Page 7: Enterprise Search Summit - Speeding Up Search


Memory use

How many of you use heap sizes of:

☐ more than ½ GB?
☐ more than 1 GB?
☐ more than 2 GB?
☐ more than 4 GB?
☐ more than 10 GB?
☐ more than 20 GB?
☐ more than 50 GB?

Page 8: Enterprise Search Summit - Speeding Up Search


Why is In-Memory indexing interesting?

Page 9: Enterprise Search Summit - Speeding Up Search


Interesting things happen when an entire problem fits in a single heap

No more out-of-process access for “hot” stuff

~1 microsecond instead of ~1 millisecond access

10s of GB/sec instead of 100s of MB/sec rates

Work elimination: no more reading/parsing/deserializing per access

No network or I/O load

These can amount to dramatic speed benefits
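The work-elimination point can be made concrete with a small stand-in sketch (hypothetical catalog data, not from the talk): fetching a serialized record and re-parsing it on every access, versus a single in-heap hash lookup on objects that were parsed once.

```java
import java.util.HashMap;
import java.util.Map;

// Contrast per-access parsing (out-of-process style) with in-heap access.
public class WorkElimination {

    // Simulates fetching a serialized record and parsing it on every access.
    static int priceByParsing(String serializedCatalog, String sku) {
        for (String row : serializedCatalog.split("\n")) {
            String[] cols = row.split(",");
            if (cols[0].equals(sku)) return Integer.parseInt(cols[1]);
        }
        return -1;
    }

    public static void main(String[] args) {
        String catalog = "sku-1,100\nsku-2,250\nsku-3,75";

        // In-heap representation: parse once, keep live objects.
        Map<String, Integer> inHeap = new HashMap<>();
        for (String row : catalog.split("\n")) {
            String[] cols = row.split(",");
            inHeap.put(cols[0], Integer.parseInt(cols[1]));
        }

        // Both paths return the same answer; only one re-parses per access.
        System.out.println(priceByParsing(catalog, "sku-2")); // parses every time
        System.out.println(inHeap.get("sku-2"));              // single hash lookup
    }
}
```

At scale, the second path is what turns ~1 millisecond accesses into ~1 microsecond ones.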

Page 10: Enterprise Search Summit - Speeding Up Search


Many business problems can now comfortably fit in server memory

Product Catalogs are often in the 2-10GB size range

Overnight reports are often done on 10-20GB data sets

All of English Language Wikipedia text is ~29GB

Indexing “bloats” things, but not by that much...

A full index of English Language Wikipedia (including stored fields and term vectors for highlighting) fits in ~78GB.

You’ll need even more than that in heap size...
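The sizing arithmetic above can be sketched as follows; the 1.5x heap headroom factor is an illustrative assumption (the collector needs room to work in), not a figure from the talk.

```java
// Rough heap-sizing arithmetic using the numbers from the slides.
public class HeapSizing {
    static long requiredHeapGB(long indexGB, double headroomFactor) {
        return (long) Math.ceil(indexGB * headroomFactor);
    }

    public static void main(String[] args) {
        long wikipediaTextGB = 29;   // English Wikipedia text (from the slides)
        long wikipediaIndexGB = 78;  // full index incl. stored fields + term vectors
        System.out.printf("Index bloat: ~%.1fx%n",
                wikipediaIndexGB / (double) wikipediaTextGB);
        // Assumed ~1.5x headroom over the live index.
        System.out.println("Suggested heap: ~" + requiredHeapGB(wikipediaIndexGB, 1.5) + " GB");
    }
}
```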

Page 11: Enterprise Search Summit - Speeding Up Search

[Chart: “Heap size vs. GC CPU %” — GC CPU % (up to 100%) plotted against heap size, with the live set marked on the heap-size axis]

Page 12: Enterprise Search Summit - Speeding Up Search


Case in point: the impact of an in-memory index in Lucene

Lucene provides multiple index representation options:

SimpleFSDirectory, NIOFSDirectory (on-disk, file-based)

MMapDirectory (mapped files, better leverage of OS cache memory)

RAMDirectory (in-process, in-heap)

Comparing throughput (e.g. on English Lang. Wikipedia):

RAMDirectory: ~2-3x throughput vs. MMapDirectory

Similar impact on average response.

Even higher impact with promising things in Lucene 4.0

But average is not all that matters:

There is a good reason people aren’t doing this (yet...)
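RAMDirectory keeps the whole index as in-heap objects. As a self-contained stand-in (deliberately not Lucene code — no Lucene classes are used), a minimal in-heap inverted index looks roughly like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-heap inverted index: term -> postings list of doc ids.
public class InHeapIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    // A query is a plain heap lookup: no I/O, no parsing, no deserialization.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        InHeapIndex index = new InHeapIndex();
        index.addDocument(1, "speeding up enterprise search");
        index.addDocument(2, "in-memory search indexes");
        System.out.println(index.search("search")); // [1, 2]
    }
}
```

The catch, as the rest of the talk explains, is that heap-resident postings like these are exactly the long-lived data that stop-the-world collectors struggle with.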

Page 13: Enterprise Search Summit - Speeding Up Search


Why isn’t all search done with in-memory indexes already?

Page 14: Enterprise Search Summit - Speeding Up Search


The “Application Memory Wall”

Page 15: Enterprise Search Summit - Speeding Up Search


Memory use

How many of you use heap sizes of:

☐ more than ½ GB?
☐ more than 1 GB?
☐ more than 2 GB?
☐ more than 4 GB?
☐ more than 10 GB?
☐ more than 20 GB?
☐ more than 50 GB?

Page 16: Enterprise Search Summit - Speeding Up Search

Reality check: servers in 2012

Retail prices, major web server store (US $, Oct 2012)

Cheap (< $1/GB/Month), and roughly linear to ~1TB

10s to 100s of GB/sec of memory bandwidth

24 vCore, 128GB server ≈ $5K

24 vCore, 256GB server ≈ $8K

32 vCore, 384GB server ≈ $14K

48 vCore, 512GB server ≈ $19K

64 vCore, 1TB server ≈ $36K

Page 17: Enterprise Search Summit - Speeding Up Search


The Application Memory Wall

A simple observation:

Application instances appear to be unable to make effective use of modern server memory capacities

The size of application instances as a % of a server’s capacity is rapidly dropping

Page 18: Enterprise Search Summit - Speeding Up Search


How much memory do applications need?

“640KB ought to be enough for anybody”

WRONG!

So what’s the right number? 6,400K? 64,000K? 640,000K? 6,400,000K? 64,000,000K?

There is no right number

Target moves at 50x-100x per decade

“I've said some stupid things and some wrong things, but not that. No one involved in computers would ever say that a certain amount of memory is enough for all time …” - Bill Gates, 1996

Page 19: Enterprise Search Summit - Speeding Up Search


“Tiny” application history

100KB apps on a ¼ to ½ MB Server

10MB apps on a 32 – 64 MB server

1GB apps on a 2 – 4 GB server

??? GB apps on a 256 GB server

Assuming Moore’s Law means:

“transistor counts grow at ≈2x every ≈18 months”

It also means memory size grows ≈100x every 10 years
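The “≈100x every 10 years” figure follows directly from the doubling rate; a quick check:

```java
public class MooresLawMemory {
    // 2x every `monthsPerDoubling` months, compounded over 120 months (a decade).
    static double growthPerDecade(double monthsPerDoubling) {
        return Math.pow(2.0, 120.0 / monthsPerDoubling);
    }

    public static void main(String[] args) {
        // 2^(120/18) = 2^6.67 ≈ 101.6 -- i.e. roughly 100x per decade
        System.out.printf("Growth per decade: ~%.1fx%n", growthPerDecade(18.0));
    }
}
```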

[Timeline graphic: 1980, 1990, 2000, 2010 — “Tiny”: would be “silly” to distribute; labeled “Application Memory Wall”]

Page 20: Enterprise Search Summit - Speeding Up Search


What is causing the Application Memory Wall?

Garbage Collection is a clear and dominant cause

There seem to be practical heap size limits for applications with responsiveness requirements

[Virtually] All current commercial JVMs will exhibit a multi-second pause on a normally utilized 2-6GB heap.

It’s a question of “When” and “How often”, not “If”.

GC tuning only moves the “when” and the “how often” around

Root cause: The link between scale and responsiveness

Page 21: Enterprise Search Summit - Speeding Up Search

What quality of GC is responsible for the Application Memory Wall?

It is NOT about overhead or efficiency: CPU utilization, bottlenecks, memory consumption and utilization

It is NOT about speed: average speeds, 90%, even 95% speeds, are all perfectly fine

It is NOT about minor GC events (right now): GC events in the 10s of msec are usually tolerable for most apps

It is NOT about the frequency of very large pauses

It is ALL about the worst observable pause behavior

People avoid building/deploying visibly broken systems
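The point above — that averages and mid-range percentiles hide exactly the pauses that matter — can be demonstrated with a nearest-rank percentile over a latency series containing a single GC-sized outlier (illustrative numbers, not measurements from the talk):

```java
import java.util.Arrays;

public class WorstCaseMatters {
    // Nearest-rank percentile over a sorted copy of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 99 fast responses plus one multi-second GC pause.
        double[] millis = new double[100];
        Arrays.fill(millis, 2.0);
        millis[99] = 8000.0;

        System.out.println("95th percentile: " + percentile(millis, 95));  // 2.0 ms
        System.out.println("max:             " + percentile(millis, 100)); // 8000.0 ms
    }
}
```

The 95th (and even 99th) percentile here look perfectly healthy; only the max exposes the pause users actually notice.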

Page 22: Enterprise Search Summit - Speeding Up Search


GC Problems

Page 23: Enterprise Search Summit - Speeding Up Search


Framing the discussion: Garbage Collection at modern server scales

Modern Servers have 100s of GB of memory

Each modern x86 core (when actually used) produces garbage at a rate of ¼ - ½ GB/sec +

We want to make use of that memory and throughput in responsive applications

Monolithic stop-the-world operations are the cause of the current Application Memory Wall

Even if they are done “only a few times a day”

Page 24: Enterprise Search Summit - Speeding Up Search

One way to deal with Monolithic-STW GC

Page 25: Enterprise Search Summit - Speeding Up Search

[Chart: “Hiccups by Time Interval” and “Hiccups by Percentile Distribution” — hiccup duration (msec) vs. elapsed time (sec) and percentile (0% to 99.9999%); Max = 16023.552 msec]

Page 26: Enterprise Search Summit - Speeding Up Search

[Chart: same “Hiccups by Time Interval” / “Hiccups by Percentile Distribution” plot; Max = 16023.552 msec]

Page 27: Enterprise Search Summit - Speeding Up Search
Page 28: Enterprise Search Summit - Speeding Up Search


Another way to cope: “Creative Language”

“Guarantee a worst case of X msec, 99% of the time”

“Mostly” Concurrent, “Mostly” Incremental
Translation: “Will at times exhibit long monolithic stop-the-world pauses”

“Fairly Consistent”
Translation: “Will sometimes show results well outside this range”

“Typical pauses in the tens of milliseconds”
Translation: “Some pauses are much longer than tens of milliseconds”

Page 29: Enterprise Search Summit - Speeding Up Search


Actually measuring things

Page 30: Enterprise Search Summit - Speeding Up Search


Discontinuities in Java platform execution

[Chart: “Hiccups by Time Interval” and “Hiccups by Percentile Distribution” — hiccup duration (msec) vs. elapsed time (sec) and percentile; Max = 1665.024 msec]

Page 31: Enterprise Search Summit - Speeding Up Search


But reality is simple

People deal with GC problems primarily by keeping heap sizes artificially small.

Keeping heaps small has translated to avoiding the use of in-process, in-memory contents

That needs to change...

Page 32: Enterprise Search Summit - Speeding Up Search


Solving the problems behind the Application Memory Wall

Page 33: Enterprise Search Summit - Speeding Up Search


We needed to solve the right problems

Focus on the causes of the Application Memory Wall:
Scale is artificially limited by responsiveness

Responsiveness must be unlinked from scale:
Data size, throughput, queries per second, concurrent users, ...
Heap size, live set size, allocation rate, mutation rate, ...

Responsiveness must be continually sustainable:
Can’t ignore “rare” events

Eliminate all stop-the-world fallbacks:
At modern server scales, any stop-the-world fallback is a failure

Page 34: Enterprise Search Summit - Speeding Up Search


The GC problems that needed solving

Robust concurrent marking:
In the presence of high throughput (mutation and allocation rates)

Compaction that is not stop-the-world:
E.g. stay responsive while compacting 100+GB heaps
Must be robust: not just a tactic to delay STW compaction

Young-gen that is not stop-the-world:
Stay responsive while promoting multi-GB data spikes
Surprisingly little work done in this specific area

Page 35: Enterprise Search Summit - Speeding Up Search


Azul’s “C4” Collector: Continuously Concurrent Compacting Collector

Concurrent, guaranteed-single-pass marker:
Oblivious to mutation rate
Concurrent ref (weak, soft, final) processing

Concurrent compactor:
Objects moved without stopping mutator
References remapped without stopping mutator
Can relocate entire generation (new, old) in every GC cycle

Concurrent, compacting old generation

Concurrent, compacting new generation

No stop-the-world fallback:
Always compacts, and always does so concurrently

Page 36: Enterprise Search Summit - Speeding Up Search


Benefits

Page 37: Enterprise Search Summit - Speeding Up Search


Sample responsiveness behavior

๏ SpecJBB + slow-churning 2GB LRU cache
๏ Live set is ~2.5GB across all measurements
๏ Allocation rate is ~1.2GB/sec across all measurements

Page 38: Enterprise Search Summit - Speeding Up Search


A JVM for Linux/x86 servers

ELIMINATES Garbage Collection as a concern for enterprise applications

Decouples scale metrics from response time concerns

Transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.

Leverages elastic memory for resilient operation

Page 39: Enterprise Search Summit - Speeding Up Search


Sustainable Throughput: the throughput achieved while safely maintaining service levels

Unsustainable Throughput

Page 40: Enterprise Search Summit - Speeding Up Search


Instance capacity test: “Fat Portal”

HotSpot CMS: peaks at ~3GB / 45 concurrent users

* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times

Page 41: Enterprise Search Summit - Speeding Up Search


Instance capacity test: “Fat Portal”

C4: still smooth @ 800 concurrent users

Page 42: Enterprise Search Summit - Speeding Up Search


Lucene: English Language Wikipedia

RAMDirectory based index

Page 43: Enterprise Search Summit - Speeding Up Search


Fun with jHiccup
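jHiccup’s core idea is a measurement thread that repeatedly sleeps a fixed interval and records how much longer the sleep actually took; any overshoot is time the platform (e.g. a GC pause) stalled the thread. A minimal sketch of that idea (not jHiccup’s actual implementation, which records full HdrHistogram percentiles):

```java
public class HiccupMeter {
    // Sleep for intervalMs repeatedly and record the worst overshoot:
    // extra delay beyond the requested sleep is platform-induced stall time.
    static long worstHiccupMs(int iterations, long intervalMs) throws InterruptedException {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            worst = Math.max(worst, elapsedMs - intervalMs);
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worst hiccup: " + worstHiccupMs(50, 1) + " ms");
    }
}
```

Because the meter does almost no work of its own, whatever it observes is a lower bound on the stalls every application thread experienced.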

Page 44: Enterprise Search Summit - Speeding Up Search


Oracle HotSpot CMS, 1GB in an 8GB heap

[Chart: hiccups by time interval and by percentile distribution, hiccup duration in msec; Max = 13156.352 msec]

Zing 5, 1GB in an 8GB heap

[Chart: hiccups by time interval and by percentile distribution, hiccup duration in msec (0-25 msec axis); Max = 20.384 msec]

Page 45: Enterprise Search Summit - Speeding Up Search


Oracle HotSpot CMS, 1GB in an 8GB heap

[Chart: hiccups by time interval and by percentile distribution, 0-14000 msec axis; Max = 13156.352 msec]

Zing 5, 1GB in an 8GB heap

[Chart: hiccups by time interval and by percentile distribution, plotted on the same 0-14000 msec axis; Max = 20.384 msec]

Drawn to scale

Page 46: Enterprise Search Summit - Speeding Up Search


Summary: in-memory indexing and search

Interesting content can now fit in memory

In-memory indexing makes things go fast

In-memory indexing makes new things possible:
Change batch to interactive. Change business processes.

It is finally practical to use in-memory indexing for interactive search applications

Zing makes this work...

Page 47: Enterprise Search Summit - Speeding Up Search


Q & A

Lucene In-RAM English Lang. Wikipedia test: http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html

jHiccup: http://www.azulsystems.com/dev_resources/jhiccup

GC: G. Tene, B. Iyengar and M. Wolf, “C4: The Continuously Concurrent Compacting Collector”, in Proceedings of ISMM’11, ACM, pages 79-88