Sensei Volodymyr Zhabiuk

SenseiDB


A tech talk given at LinkedIn


Page 1: SenseiDB

Sensei

Volodymyr Zhabiuk

Page 2: SenseiDB

Agenda

1.  History and motivation

2.  High level architecture

3.  Data guarantees

4.  Features detailed overview

5.  Quick demo

Page 3: SenseiDB

What is Sensei

•  Search engine and database

•  Built on top of Lucene

•  Full text search, relevance, faceting

•  Distributed, horizontally scalable

Page 4: SenseiDB

History

•  Technology stack for LinkedIn.com's search, analytics and homepage

•  Open sourced in 2009, first 1.0.0 release February 2012

•  https://github.com/linkedin/sensei

•  http://senseidb.com

•  sensei-search Google group

•  Used by Xiaomi, several other OS deployments

Page 8: SenseiDB

Why yet another Lucene based search engine?

•  Indexing elevates query latency

•  Hard to distribute

•  Large memory overhead

•  Comparatively slow

SenseiDB

•  Designed for LinkedIn search use cases and the Homepage

Page 10: SenseiDB

Motivation

•  Indexing/Query isolation

•  Structured vs. unstructured data (e.g. fulltext search support)

•  Faceted search

•  Business intelligence

Page 11: SenseiDB

Sensei’s features

•  Fast updates

•  Rich query language - BQL

•  Fulltext and faceted search

•  Distributed and elastic

•  Indexing and search customization

•  In memory M/R

Page 12: SenseiDB

What Sensei doesn’t do

•  Transactions and OLTP

•  Dynamic shard rebalancing

•  Multi tenancy and table joins

•  Dynamic schema

Page 13: SenseiDB

Volume

•  5-100 million documents per node

•  ~300K updates per minute

•  Query latency < 100 ms

Page 14: SenseiDB

Deployments

•  Search engine for SeaS

•  Backend for USCP: 400 nodes

•  More than 6 deployments in the team

•  Other companies (2 deployments at Xiaomi)

Page 18: SenseiDB

Sensei’s technologies

[Diagram: the components Sensei builds on: Zoie, Bobo, Norbert, Lucene, Zookeeper]

Page 19: SenseiDB

Vocabulary

•  Node

•  Shard/Partition

•  Replica

Page 21: SenseiDB

High level architecture

Page 22: SenseiDB

Data injection

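[Diagram: a Gateway feeds each Sensei node from a data source such as Kafka, RabbitMQ, Databus, or JDBC]

•  Every event carries a version

•  A node gets only the events whose version is greater than the one it already has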
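The versioning rule on this slide can be sketched in a few lines. This is a minimal illustration, not Sensei's gateway API: the class names `VersionedEvent` and `VersionedConsumer`, and the use of a plain `long` as the version, are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event type: a payload tagged with an increasing version.
final class VersionedEvent {
    final long version;
    final String payload;

    VersionedEvent(long version, String payload) {
        this.version = version;
        this.payload = payload;
    }
}

// Hypothetical consumer: applies only events newer than the last version it has seen.
final class VersionedConsumer {
    private long lastVersion;
    private final List<String> indexed = new ArrayList<>();

    VersionedConsumer(long startVersion) {
        this.lastVersion = startVersion;
    }

    // Returns true if the event was applied, false if it was skipped as already seen.
    boolean consume(VersionedEvent event) {
        if (event.version <= lastVersion) {
            return false; // replayed event, for example after a restart
        }
        indexed.add(event.payload);
        lastVersion = event.version;
        return true;
    }

    long lastVersion() {
        return lastVersion;
    }

    List<String> indexed() {
        return indexed;
    }
}
```

Because stale versions are skipped, replaying the stream from an earlier position after a restart does not index the same event twice.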

Page 23: SenseiDB

Data guarantees

•  Availability: replication

•  Eventually consistent across replicas

•  Write durability: provided by the data stream

•  Write consistency: provided by the data stream

Page 24: SenseiDB

Configuration

•  schema.xml

o  Indexed fields

o  Forward index customization

•  sensei.properties

o  Ports, plugins, Zookeeper URLs, etc.

Page 25: SenseiDB

Features

Page 26: SenseiDB

Lucene realtime extension

Disk Index

Page 27: SenseiDB

Realtime updates

•  Updates become visible less than 1 s after insertion

•  Handles deletes and updates

•  Indexing latency stable as index size grows

•  Incremental and balanced segment merges
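One way to picture how writes become searchable before they reach the disk index is a small in-memory buffer that reads consult first. The sketch below is a toy model under that assumption; `ToyRealtimeIndex` is a hypothetical class and does not reproduce the actual design of the realtime extension.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model: documents keyed by uid sit in an in-memory buffer until flushed.
final class ToyRealtimeIndex {
    // Recent writes; a null value marks a delete.
    private final Map<Long, String> memory = new HashMap<>();
    // Stand-in for the on-disk index.
    private final Map<Long, String> disk = new HashMap<>();

    void upsert(long uid, String doc) {
        memory.put(uid, doc);
    }

    void delete(long uid) {
        memory.put(uid, null);
    }

    // Reads check the buffer first, so a write is visible immediately.
    Optional<String> get(long uid) {
        if (memory.containsKey(uid)) {
            return Optional.ofNullable(memory.get(uid));
        }
        return Optional.ofNullable(disk.get(uid));
    }

    // Batch flush: apply buffered writes and deletes to the disk index, then clear the buffer.
    void flush() {
        for (Map.Entry<Long, String> e : memory.entrySet()) {
            if (e.getValue() == null) {
                disk.remove(e.getKey());
            } else {
                disk.put(e.getKey(), e.getValue());
            }
        }
        memory.clear();
    }
}
```

In this model the cost of a write does not depend on the size of the disk index, which is one plausible reading of the claim that indexing latency stays stable as the index grows.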

Page 28: SenseiDB

Hourglass (Time Series)

Page 29: SenseiDB

Offline indexing and archive

•  Efficient M/R index generation on Hadoop over ETL'd data

•  Bootstrap from HDFS

Page 30: SenseiDB

Query Engine - Bobo

•  Query planning/optimization

•  Access to both inverted and forward data structures

•  High performance faceting

•  Dynamic sorting

•  Dynamic relevance support

•  Map/Reduce analytics engine

Page 31: SenseiDB

Bobo (cont.)

[Diagram: three Lucene segments, each paired with a custom (forward) index, feeding a single Result]
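To illustrate why a forward index helps faceting, the sketch below counts facet values for a set of matching documents by looking each one up in a per-field array. It is a simplified model: `ToyFacetCounter` and its dictionary-plus-ordinal layout are assumptions for the example, and Bobo's real data structures are more elaborate.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy forward index for one field: docId -> ordinal into a dictionary of distinct values.
final class ToyFacetCounter {
    private final String[] dictionary; // distinct field values
    private final int[] forwardIndex;  // forwardIndex[docId] = index into dictionary

    ToyFacetCounter(String[] dictionary, int[] forwardIndex) {
        this.dictionary = dictionary;
        this.forwardIndex = forwardIndex;
    }

    // Counts how many of the matching docs carry each value: one array lookup per hit.
    Map<String, Integer> count(int[] hitDocIds) {
        int[] counts = new int[dictionary.length];
        for (int docId : hitDocIds) {
            counts[forwardIndex[docId]]++;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                result.put(dictionary[i], counts[i]);
            }
        }
        return result;
    }
}
```

The inverted index answers "which documents match the query", and the forward index answers "what value does this document have", so counting facets never needs to re-read stored documents.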

Page 32: SenseiDB

Sensei API - BQL

SELECT color, category, year, makemodel
FROM cars
WHERE NOT MATCH(color, category) AGAINST("*van")
GROUP BY category TOP 1
LIMIT 1000

Page 33: SenseiDB

Dynamic relevance

SELECT * FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
DEFINED AS (String favoriteColor, String favoriteTag)
BEGIN
    float boost = 1.0;
    if (tags.contains(favoriteTag)) boost += 0.5;
    if (color.equals(favoriteColor)) boost += 1.2;
    return _INNER_SCORE * boost;
END
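The scoring logic of this relevance model can be written as a standalone Java method to make the arithmetic easy to check. This is only a sketch: `ToyRelevanceModel` is a hypothetical class, `innerScore` stands in for `_INNER_SCORE`, and the color check is assumed to compare against the `favoriteColor` parameter.

```java
import java.util.Set;

// Hypothetical standalone version of the model body; not Sensei's runtime API.
final class ToyRelevanceModel {
    // innerScore stands in for _INNER_SCORE, the score computed by the text query.
    static float score(float innerScore, String color, Set<String> tags,
                       String favoriteColor, String favoriteTag) {
        float boost = 1.0f;
        if (tags.contains(favoriteTag)) {
            boost += 0.5f; // document carries the user's favorite tag
        }
        if (color.equals(favoriteColor)) {
            boost += 1.2f; // document has the user's favorite color
        }
        return innerScore * boost;
    }
}
```

A document matching both preferences gets a boost of 1.0 + 0.5 + 1.2 = 2.7 times its inner score, while a document matching neither keeps its inner score unchanged.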

Page 34: SenseiDB

Partial updates

•  Storing data outside of Lucene

•  High update rate

•  Perfect for counters
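A counter kept outside the index can be as simple as an array addressed by document id, which is why updating it is cheap. The sketch below assumes that layout; `ToyCounterStore` is a hypothetical class, not Sensei's storage implementation.

```java
// Toy side store: one counter per document, held in a plain array outside the index.
final class ToyCounterStore {
    private final long[] counters;

    ToyCounterStore(int maxDoc) {
        this.counters = new long[maxDoc];
    }

    // Updating a counter touches one array slot; the document is not re-indexed.
    void increment(int docId, long delta) {
        counters[docId] += delta;
    }

    long get(int docId) {
        return counters[docId];
    }
}
```

With this layout a change to a click or view count costs one array write, whereas changing a field stored inside Lucene would mean deleting and re-adding the whole document.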

Page 35-40: SenseiDB

Sensei in memory M/R

[Diagram, built up over six slides: a Broker in front of Node1 and Node2, each node holding Lucene segments. The slides introduce three phases in turn:]

•  map(IntArray docs, FieldAccessor, FacetCountAccessor)

•  List<MapResult> combine(List<MapResult>)

•  JSONObject reduce(List<MapResult>)

Page 41: SenseiDB

Sensei in memory M/R

•  select distinctCount(memberId), sum(clickCount) where geo = 'US/CA/SF' group by seniority, age
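The map, combine, and reduce phases can be sketched for the `sum(clickCount)` part of such a query, grouped by a single field. The signatures below are simplified stand-ins, not Sensei's M/R interface: `ToySumByGroup`, the plain arrays used as field accessors, and the `Map<String, Long>` result type are all assumptions for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sum(clickCount) group by seniority, split into three phases.
final class ToySumByGroup {
    // Map: runs over the matching docs of one segment and produces a partial sum per group.
    static Map<String, Long> map(int[] docs, String[] seniority, long[] clickCount) {
        Map<String, Long> partial = new HashMap<>();
        for (int doc : docs) {
            partial.merge(seniority[doc], clickCount[doc], Long::sum);
        }
        return partial;
    }

    // Combine: merges the per-segment results (assumed here to run on each node).
    static List<Map<String, Long>> combine(List<Map<String, Long>> mapResults) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> m : mapResults) {
            m.forEach((group, sum) -> merged.merge(group, sum, Long::sum));
        }
        List<Map<String, Long>> out = new ArrayList<>();
        out.add(merged);
        return out;
    }

    // Reduce: merges the combined results from all nodes (assumed here to run on the broker).
    static Map<String, Long> reduce(List<Map<String, Long>> combined) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> m : combined) {
            m.forEach((group, sum) -> total.merge(group, sum, Long::sum));
        }
        return total;
    }
}
```

A sum merges by simple addition at every level. A distinct count does not: each phase would have to pass along the set of values seen (or a sketch of it) rather than a number, since the same `memberId` can appear on several segments.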

Page 42: SenseiDB

Roadmap

•  Just finished

o  Sensei aggregation functions

o  Map/Reduce analytics engine

•  Plan

o  Goshawk, for business intelligence (WVMP v2, LI Impressions)

o  Zoie redesign to support fixed-length in-memory segments

Page 43: SenseiDB

Sensei tweets demo

Page 44: SenseiDB

Questions?

•  SeaS Homepage: http://go/seas

•  Questions: ask_seas@

•  Sensei homepage: senseidb.com

•  Sensei Google group: sensei-search