Agenda
‣ Introduction
- Search Architecture
- Inverted Index 101
- Realtime Posting Lists
Search @twitter
Introduction
Twitter has more than 230 million monthly active users.
500 million tweets are sent per day.
More than 300 billion tweets have been sent since the company was founded in 2006.
Tweets-per-second world record: 33,388 TPS.
More than 2 billion search queries per day.
Timeline, 2008-2014:
• Twitter acquires Summize (MySQL-based RT search engine)
• Modified Lucene (Earlybird) ships and replaces MySQL indexes
• New Earlybird features: image/video search; index compression; efficient relevance search in time-sorted index
• Tweet archive search on SSD with vanilla Lucene
• New RT posting list format that supports arbitrary document lengths, but keeps performance optimizations for tweets
Realtime Search @twitter
Agenda
- Introduction
‣ Search Architecture
- Inverted Index 101
- Realtime Posting Lists
Search Architecture
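[Figure: architecture diagram. The RT stream feeds raw tweets to the Analyzer/Partitioner, which writes analyzed tweets to the RT index (Earlybird). Raw tweets are also stored in the tweet archive on HDFS, where the Mapreduce Analyzer produces analyzed tweets for the archive index. The Blender receives search requests and searches both indexes.]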
Analyzer/Partitioner
• Pre-processes Tweets for indexing
• Analyzing (tokenization/normalization) of text
• Geo-coding, URL expansion, etc.
• Hash partitioning
RT index (Earlybird)
• Modified Lucene index implementation optimized for realtime search
• IndexWriter buffer is searchable (no need to flush to allow searching)
• In-memory
• Hash-partitioned, static layout
Cluster layout
• n hash partitions (docId % n)
• Each partition is served by multiple Earlybird replicas
• Each partition is further divided into timeslices: one writable timeslice and a series of complete timeslices
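A minimal sketch of the partitioning rule above, assuming nothing beyond docId % n. The function names and the choice of four partitions are illustrative and not taken from Earlybird.

```python
from collections import defaultdict


def partition_for(doc_id: int, num_partitions: int) -> int:
    """Hash partition as given on the slide: docId % n."""
    return doc_id % num_partitions


def group_by_partition(doc_ids, num_partitions):
    """Map each partition number to the doc IDs it would index."""
    partitions = defaultdict(list)
    for doc_id in doc_ids:
        partitions[partition_for(doc_id, num_partitions)].append(doc_id)
    return dict(partitions)


# Hypothetical cluster with n = 4 partitions.
layout = group_by_partition([5, 15, 9000, 9002, 100000, 100090], num_partitions=4)
```

Because the partition depends only on the doc ID, a write touches exactly one partition, while a search has to ask every partition.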
Mapreduce Analyzer
• Daily jobs that process raw tweets
• Analyzes text
• Aggregates metadata and signals
Archive index
• Standard Lucene (4.4) indexes
• Reverse time-sorted (new to old)
• Cluster layout similar to realtime search cluster
• Two tiers: in-memory and on SSD
• In-memory index: contains a small number of the best tweets of all time
• SSD index: much bigger index with more tweets; lower max. QPS, limited by SSD IOPS. Only needs to be queried if the in-memory index did not yield enough results
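The fallback between the two tiers could look roughly like this. It is a sketch only: `search_in_memory` and `search_ssd` are hypothetical callables that take a query and a result count, and the tiers are assumed to hold disjoint tweets.

```python
def search_tiers(query, k, search_in_memory, search_ssd):
    """Return up to k hits, touching the SSD tier only when needed."""
    hits = list(search_in_memory(query, k))
    if len(hits) < k:
        # Assumption: the two tiers hold disjoint tweets, so no de-duplication.
        hits.extend(search_ssd(query, k - len(hits)))
    return hits[:k]
```

Queries that the in-memory tier can satisfy never consume SSD IOPS, which is the resource the slide names as the limit.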
Blender
• Blender is our Thrift service aggregator
• Queries multiple Earlybirds, merges results
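A sketch of the merge step, assuming each Earlybird returns doc IDs sorted newest first and that recency is the only ranking signal. The real Blender is a Thrift service and may merge on other criteria; `blend` is an illustrative name.

```python
import heapq
from itertools import islice


def blend(per_partition_hits, k):
    """Merge per-partition hit lists (each sorted newest first), keep the newest k."""
    merged = heapq.merge(*per_partition_hits, reverse=True)
    return list(islice(merged, k))
```

For example, `blend([[100090, 9000], [100000, 15], [9002, 5]], 4)` yields the four newest doc IDs across the three partitions.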
[Figure: the same architecture diagram with an additional update path. Labels: Tweets, Analyzer/Partitioner, RT index (Earlybird), Archive index, Blender, Mapreduce Analyzer, HDFS, queue, search requests, searches, writes. Updates: deletes and engagement (e.g. retweets/favs).]
Realtime Search @twitter
Agenda
- Introduction
- Search Architecture
‣ Inverted Index 101
- Realtime Posting Lists
Inverted Index 101
1 The old night keeper keeps the keep in the town
2 In the big old house in the big old gown.
3 The house in the town had the big old keep
4 Where the old night keeper never did sleep.
5 The night keeper keeps the keep in the night
6 And keeps in the dark and sleeps in the light.
Table with 6 documents
Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR), v.38 n.2, p.6-es, 2006
Dictionary and posting lists
term     freq  postings
and      1     <6>
big      2     <2> <3>
dark     1     <6>
did      1     <4>
gown     1     <2>
had      1     <3>
house    2     <2> <3>
in       5     <1> <2> <3> <5> <6>
keep     3     <1> <3> <5>
keeper   3     <1> <4> <5>
keeps    3     <1> <5> <6>
light    1     <6>
never    1     <4>
night    3     <1> <4> <5>
old      4     <1> <2> <3> <4>
sleep    1     <4>
sleeps   1     <6>
the      6     <1> <2> <3> <4> <5> <6>
town     2     <1> <3>
where    1     <4>
Query: keeper
Answered by looking up the term in the dictionary and reading its posting list: <1> <4> <5>
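The dictionary and posting lists for the six example documents can be rebuilt with a few lines. This is a minimal sketch: the tokenizer (lowercase, letters only) is an assumption, and each posting list holds doc IDs only.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def build_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = set(re.findall(r"[a-z]+", docs[doc_id].lower()))
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


def search(index, term):
    """Single-term query: return the posting list, or an empty list."""
    return index.get(term.lower(), [])


index = build_index(DOCS)
```

Here `search(index, "keeper")` returns the posting list `[1, 4, 5]`, and `len(index["in"])` gives the freq column for `in`.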
Posting list encoding
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5 10 8985 2 90998 90
VInt compression:
• Values 0 <= delta <= 127 need one byte, e.g. 5 is stored as 00000101
• Values 128 <= delta <= 16383 need two bytes, e.g. 8985 is stored as 11000110 00011001
• First bit indicates whether next byte belongs to the same value
• Variable number of bytes: a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically
Read direction: old to new
• Each posting depends on previous one; decoding only possible in old-to-new direction
• With recency ranking (new-to-old) no early termination is possible
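A small sketch of both steps, using the doc IDs from the slides. The byte order follows the bit patterns shown above (high-order 7-bit group first); Lucene's own VInt writer emits the low-order group first, so this is an illustration of the idea rather than Lucene's exact wire format.

```python
def delta_encode(doc_ids):
    """Replace each doc ID by its distance to the previous one."""
    deltas, previous = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode; only works front to back."""
    doc_ids, current = [], 0
    for delta in deltas:
        current += delta
        doc_ids.append(current)
    return doc_ids


def vint_encode(value):
    """7 payload bits per byte; a set top bit means another byte follows."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # high-order group first, as drawn on the slide
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode_all(data):
    """Decode a concatenation of VInts back into integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:
            values.append(current)
            current = 0
    return values
```

Decoding has to start at the first byte and carry a running sum, which is why a reader cannot start at the newest posting.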
Posting list encoding
• By default Lucene uses a combination of delta encoding and VInt compression
• VInts are expensive to decode
• Problem 1: How to traverse posting lists backwards?
• Problem 2: How to write a posting atomically?
Realtime Search @twitter
Agenda
- Introduction
- Search Architecture
- Inverted Index 101
‣ Realtime Posting Lists
Realtime Posting Lists
Posting list encoding in Earlybird v1
int (32 bits):
• docID: 24 bits, max. 16.7M
• textPosition: 8 bits, max. 255
• Tweet text can only have 140 chars
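A sketch of packing one posting into 32 bits. The field widths come from the slide; placing the docID in the upper 24 bits is an assumption, and the function names are illustrative.

```python
POSITION_BITS = 8
MAX_DOC_ID = (1 << 24) - 1          # 16,777,215
MAX_POSITION = (1 << POSITION_BITS) - 1  # 255


def pack_posting(doc_id, text_position):
    """Combine docID and text position into one 32-bit value."""
    if not 0 <= doc_id <= MAX_DOC_ID:
        raise ValueError("docID does not fit into 24 bits")
    if not 0 <= text_position <= MAX_POSITION:
        raise ValueError("textPosition does not fit into 8 bits")
    # Assumed layout: docID in the upper 24 bits, position in the lower 8.
    return (doc_id << POSITION_BITS) | text_position


def unpack_posting(posting):
    """Split a packed posting back into (docID, textPosition)."""
    return posting >> POSITION_BITS, posting & MAX_POSITION
```

With this layout, a term that occurs twice in a tweet produces two postings with the same docID and different positions.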
Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Earlybird encoding: 5 15 9000 9002 100000 100090
Read direction: new to old
Early query termination
• E.g. 3 results are requested: here we can terminate after reading 3 postings
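A sketch of early termination over absolute doc IDs, assuming a single-term query where every posting is a hit. `newest_first` is an illustrative name.

```python
def newest_first(postings, k):
    """postings: absolute doc IDs in append (old-to-new) order.

    Returns the newest k doc IDs and the number of postings read.
    """
    hits, reads = [], 0
    for doc_id in reversed(postings):
        reads += 1
        hits.append(doc_id)
        if len(hits) == k:
            break  # early termination: older postings are never touched
    return hits, reads
```

With the six doc IDs above and k = 3, only the last three postings are read.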
Inverted index components
• Dictionary (parallel arrays): pointer to the most recently indexed posting for a term
• Posting list storage: ?
Posting lists storage - Objectives
• Store many single-linked lists of different lengths space-efficiently
• The number of Java objects should be independent of the number of lists or number of items in the lists
• Every item should be a possible entry point into the lists for iterators, i.e. items should not be dependent on other items (e.g. no delta encoding)
• Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads)
• Traversal in backwards order
Memory management
• 4 int[] pools, each built from 32K int[] blocks
• Each pool can be grown individually by adding 32K blocks
• For simplicity we can forget about the blocks for now and think of the pools as continuous, unbounded int[] arrays
• Small total number of Java objects (each 32K block is one object)
• Slices can be allocated in each pool
• Each pool has a different, but fixed slice size: 2^1, 2^4, 2^7, 2^11
Adding and appending to a list
[Figure: four pools with slice sizes 2^1, 2^4, 2^7, 2^11; legend: available, allocated, current list]
• Store first two postings in this slice (first pool)
• When first slice is full, allocate another one in second pool
• Allocate a slice on each level as list grows
• On uppermost level one list can own multiple slices
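The growth pattern can be sketched by listing which pool each slice of a list comes from. This is a simplification: it assumes every slot of a slice holds a posting and ignores any space a slice needs for links or headers.

```python
SLICE_SIZES = [2 ** 1, 2 ** 4, 2 ** 7, 2 ** 11]  # one fixed size per pool


def slices_for(num_postings):
    """Pool index of each slice that a list with num_postings postings occupies."""
    pools_used = []
    remaining = num_postings
    level = 0
    while remaining > 0:
        pools_used.append(level)
        remaining -= SLICE_SIZES[level]
        # Move up one pool per slice; stay in the last pool once it is reached.
        level = min(level + 1, len(SLICE_SIZES) - 1)
    return pools_used
```

Rare terms stay in the small slices and waste little memory, while frequent terms quickly reach the pool with 2^11-sized slices.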
Addressing items
• Use 32 bit (int) pointers to address any item in any list unambiguously:
int (32 bits):
• poolIndex: 2 bits, 0-3
• sliceIndex: 19-29 bits, depends on pool
• offset in slice: 1-11 bits, depends on pool
• Nice symmetry: Postings and address pointers both fit into a 32 bit int
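A sketch of such a pointer. The field widths follow from the slice sizes 2^1, 2^4, 2^7 and 2^11 (offset bits 1, 4, 7, 11; the remaining 29, 26, 23 or 19 bits hold the slice index). The order of the fields inside the int is an assumption.

```python
OFFSET_BITS = [1, 4, 7, 11]  # log2 of the slice size of each pool
POINTER_BITS = 30            # 32 bits minus the 2-bit pool index


def encode_pointer(pool, slice_index, offset):
    """Pack (pool, slice, offset) into one 32-bit value."""
    bits = OFFSET_BITS[pool]
    if not 0 <= offset < (1 << bits):
        raise ValueError("offset outside slice")
    if not 0 <= slice_index < (1 << (POINTER_BITS - bits)):
        raise ValueError("slice index too large for this pool")
    # Assumed layout: pool index on top, then slice index, then offset.
    return (pool << POINTER_BITS) | (slice_index << bits) | offset


def decode_pointer(pointer):
    """Recover (pool, slice, offset); the pool decides how to split the rest."""
    pool = pointer >> POINTER_BITS
    bits = OFFSET_BITS[pool]
    slice_index = (pointer >> bits) & ((1 << (POINTER_BITS - bits)) - 1)
    offset = pointer & ((1 << bits) - 1)
    return pool, slice_index, offset
```

The pool index has to be read first, because it determines where the boundary between slice index and offset lies.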
Linking the slices
[Figure: the slices of the current list across the four pools (slice sizes 2^1, 2^4, 2^7, 2^11), linked to each other; legend: available, allocated, current list]
• Dictionary (parallel arrays): pointer to the last posting indexed for a term
Posting list encoding - Summary
• ints can be written atomically in Java
• Backwards traversal easy on absolute docIDs (not deltas)
• Every posting is a possible entry point for a searcher
• Skipping can be done without additional data structures as binary search, though there are better approaches (skip lists)
• Repeating docIDs if a term occurs multiple times in the same document only works for small docs
• Max. segment size: 2^24 = 16.7M tweets
New posting list encoding
• Objectives:
  - 32 bit positions and variable-length payloads
  - Store term frequency (TF) instead of repeating docIDs
• Keep:
  - Concurrency model
  - Space-efficiency for short documents
  - Performance
Two streams per posting list:
• DocID, termFreq: fixed length for each posting (32 bits)
• Position, Payload: variable length
• Store TF instead of repeating the same DocID
• Store DocID/TF pairs separately from position/payloads
• Find a way to synchronously decode the two streams without storing a pointer for each posting (expensive)
• Idea: Use an embedded skip list as periodical “synchronization points”
• Keeps memory overhead for pointers low and improves search performance
Slice header
• Header contains:
  - Back-pointer to previous slice (as before)
  - Skip list
  - Slice id
Posting format so far, int (32 bits): docID 24 bits (max. 16.7M), textPosition 8 bits (max. 255)
• Observation: Most tweets don’t need all 8 bits for text position
• Idea: Use the position “inlining” approach for short documents, but support Lucene’s 32-bit positions and variable length payloads
int (32 bits):
• docID: 24 bits, max. 16.7M
• 1 bit: 0 = textPosition, 1 = termFreq
• textPosition or termFreq: 7 bits, max. 127
As a storage optimization, the text position is stored with the docID if:
  - termFreq == 1 (term occurs once only in the doc) AND
  - textPosition <= 127 AND
  - Posting has no payload AND
  - Posting is not at a skip point of the docID posting list (see later).
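The inlining rule can be sketched as follows. The bit order (docID in the upper 24 bits, then the flag bit, then 7 value bits) is an assumption, the function names are illustrative, and term frequencies above 127 are rejected because the slides do not say how they are stored.

```python
MAX_DOC_ID = (1 << 24) - 1
MAX_SMALL = 127  # largest value that fits into 7 bits


def pack_doc_posting(doc_id, term_freq, first_position,
                     has_payload=False, at_skip_point=False):
    """Build the fixed-length 32-bit entry of the docID stream."""
    if not 0 <= doc_id <= MAX_DOC_ID:
        raise ValueError("docID does not fit into 24 bits")
    inline = (term_freq == 1
              and 0 <= first_position <= MAX_SMALL
              and not has_payload
              and not at_skip_point)
    if inline:
        flag, value = 0, first_position  # position stored with the docID
    else:
        if not 1 <= term_freq <= MAX_SMALL:
            raise ValueError("termFreq outside 7 bits is not covered by this sketch")
        flag, value = 1, term_freq       # positions live in the second stream
    return (doc_id << 8) | (flag << 7) | value


def unpack_doc_posting(posting):
    """Return (docID, is_term_freq, value)."""
    return posting >> 8, bool((posting >> 7) & 1), posting & 0x7F
```

A posting whose flag is 0 needs no entry in the position/payload stream at all, which keeps short documents about as compact as in the previous format.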
New posting list encoding - Summary
• Support for 32 bit positions and arbitrary length payloads stored in separate data structure
• Performance and space consumption very similar compared to previous encoding for tweet search
• Skip lists used for speed and synchronization points
• For short documents positions can still be inlined
Questions?
Michael Busch
Previous talk: http://vimeo.com/31195040