SenseiDB

Sensei

Volodymyr Zhabiuk

Agenda

1.  History and motivation

2.  High level architecture

3.  Data guarantees

4.  Features detailed overview

5.  Quick demo

What is Sensei

•  Search engine and database

•  Built on top of Lucene

•  Full text search, relevance, faceting

•  Distributed, horizontally scalable

History

•  Technology stack for LinkedIn.com's search, analytics and homepage

•  Open sourced in 2009, first 1.0.0 release February 2012

•  https://github.com/linkedin/sensei

•  http://senseidb.com

•  sensei-search Google group

•  Used by Xiaomi, several other OS deployments

Why yet another Lucene based search engine?


•  Indexing elevates query latency

•  Hard to distribute

•  Large memory overhead

•  Comparatively slow

SenseiDB

•  Designed for LinkedIn search use cases and the Homepage

Motivation

•  Indexing/Query isolation

•  Structured vs. unstructured data (e.g. fulltext search support)

•  Faceted search

•  Business intelligence

Sensei’s features

•  Fast updates

•  Rich query language - BQL

•  Fulltext and faceted search

•  Distributed and elastic

•  Indexing and search customization

•  In memory M/R

What Sensei doesn’t do

•  Transactions and OLTP

•  Dynamic shard rebalancing

•  Multi tenancy and table joins

•  Dynamic schema

Volume

•  5-100 million documents per node

•  ~300K updates per minute

•  Query latency < 100 ms

Deployments

•  Search engine for SeaS

•  Backend for USCP: 400 nodes

•  More than 6 deployments in the team

•  Other companies (2 deployments at Xiaomi)

Sensei’s technologies

•  Lucene

•  Zoie

•  Bobo

•  Norbert

•  Zookeeper

•  Sensei

Vocabulary

•  Node

•  Shard/Partition

•  Replica
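To make the three terms concrete, here is a minimal Python sketch of a small cluster. The node names, the shard count, the `shard_for` helper and the routing rule `uid % NUM_SHARDS` are all illustrative assumptions, not taken from Sensei's code.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical number of shards (partitions)

# Hypothetical cluster layout: each node hosts a subset of the shards.
NODE_SHARDS = {
    "node1": [0, 1],
    "node2": [2, 3],
    "node3": [0, 1],
    "node4": [2, 3],
}


def shard_for(uid: int) -> int:
    """Assumed routing rule: a document lives in shard uid % NUM_SHARDS."""
    return uid % NUM_SHARDS


def replicas_by_shard(node_shards):
    """Invert the layout: for each shard, list the nodes holding a replica of it."""
    replicas = defaultdict(list)
    for node, shards in sorted(node_shards.items()):
        for shard in shards:
            replicas[shard].append(node)
    return dict(replicas)


REPLICAS = replicas_by_shard(NODE_SHARDS)
```

In this layout every shard has two replicas, so a query for a given shard can be served by either of two nodes.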

High level architecture

Data injection

•  Gateway sources: Kafka, RabbitMQ, Databus, JDBC

•  The gateway delivers each event with a version

•  A Sensei node fetches only events whose version is greater than the one it already has
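The version-based catch-up described on this slide can be sketched roughly as follows. This is a toy Python model under stated assumptions: the `Event`, `Gateway` and `Node` classes and their methods are hypothetical names for illustration, and the gateway is an in-memory list rather than a real Kafka, RabbitMQ, Databus or JDBC source.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    version: int  # position of the event in the data stream (assumed to increase)
    doc: dict


@dataclass
class Gateway:
    """In-memory stand-in for a data stream (hypothetical interface)."""
    events: list = field(default_factory=list)

    def events_after(self, version: int):
        return [e for e in self.events if e.version > version]


@dataclass
class Node:
    version: int = 0  # highest version already indexed
    index: dict = field(default_factory=dict)

    def catch_up(self, gateway: Gateway) -> int:
        """Apply every event newer than the node's version; return how many were applied."""
        applied = 0
        newer = sorted(gateway.events_after(self.version), key=lambda e: e.version)
        for event in newer:
            self.index[event.doc["id"]] = event.doc
            self.version = event.version
            applied += 1
        return applied
```

Because a node only asks for versions above its own, replaying the stream after a restart is harmless: a second `catch_up` call against an unchanged gateway applies nothing.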

Data guarantees

•  Availability - replication

•  Eventually consistent across replicas

•  Write durability - data stream

•  Write consistency - data stream

Configuration

•  schema.xml

o  Indexed fields

o  Forward index customization

•  sensei.properties

o  Ports, plugins, ZooKeeper URLs, etc.

Features

Lucene realtime extension

Disk Index

Realtime updates

•  Updates are visible almost immediately (< 1 s after insertion)

•  Handles deletes and updates

•  Indexing latency stable as index size grows

•  Incremental and balanced segment merges

Hourglass (Time Series)

Offline indexing and archive

•  Efficient M/R indexing generation on Hadoop over ETL'd data

•  Bootstrap from HDFS

Query Engine - Bobo

•  Query planning/optimization

•  Access to both inverted and forward data structures

•  High performance faceting

•  Dynamic sorting

•  Dynamic relevance support

•  Map/Reduce analytics engine
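As a rough illustration of why access to both inverted and forward structures helps faceting, the Python sketch below selects hits through an inverted index and counts facet values through a forward index. It is a toy model, not Bobo's implementation: the corpus, the helper names and the plain-list data structures are assumptions made for the example.

```python
from collections import Counter

# Toy corpus; field names echo the "cars" examples, values are made up.
DOCS = [
    {"color": "red", "category": "sedan"},
    {"color": "blue", "category": "van"},
    {"color": "red", "category": "van"},
    {"color": "red", "category": "suv"},
]


def build_inverted(docs, field):
    """Inverted index: value -> ascending list of doc ids that contain it."""
    inverted = {}
    for doc_id, doc in enumerate(docs):
        inverted.setdefault(doc[field], []).append(doc_id)
    return inverted


def build_forward(docs, field):
    """Forward index: doc id -> value, stored as a dense list."""
    return [doc[field] for doc in docs]


def facet_counts(matching_doc_ids, forward):
    """Count facet values over the hits with one list lookup per hit."""
    return Counter(forward[doc_id] for doc_id in matching_doc_ids)


color_inverted = build_inverted(DOCS, "color")
category_forward = build_forward(DOCS, "category")

hits = color_inverted["red"]                   # selection via the inverted index
counts = facet_counts(hits, category_forward)  # faceting via the forward index
```

The inverted index answers "which documents match", while the forward index answers "what value does this document have", which is the lookup faceting and sorting need for every hit.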

Bobo (cont.)

[Diagram: three Lucene segments, a custom (forward) index, and the result]

Sensei API - BQL

SELECT color, category, year, makemodel
FROM cars
WHERE NOT MATCH(color, category) AGAINST("*van")
GROUP BY category TOP 1
LIMIT 1000

Dynamic relevance

SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
DEFINED AS (String favoriteColor, String favoriteTag)
BEGIN
    float boost = 1.0;
    if (tags.contains(favoriteTag)) boost += 0.5;
    if (color.equals(favoriteColor)) boost += 1.2;
    return _INNER_SCORE * boost;
END
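To show what this relevance model computes for a single hit, here is a small Python rendering of its body. It is only a sketch: the `doc` dictionary and the function signature are hypothetical, and the color check is read as a comparison against the `favoriteColor` parameter.

```python
def my_model(inner_score, doc, favorite_color, favorite_tag):
    """Python rendering of the slide's relevance model for one document."""
    boost = 1.0
    if favorite_tag in doc["tags"]:
        boost += 0.5
    if doc["color"] == favorite_color:
        boost += 1.2
    # inner_score plays the role of _INNER_SCORE, the score from the text query.
    return inner_score * boost
```

A black car tagged "cool" therefore gets its inner score multiplied by roughly 2.7, while a car matching neither preference keeps its inner score unchanged.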

Partial updates

•  Storing data outside of Lucene

•  High update rate

•  Perfect for counters
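The idea of keeping frequently changing values outside the index can be sketched as a flat array of counters addressed by document id. This is an assumption-laden illustration in Python: `CounterStore` is a hypothetical class, and it says nothing about how Sensei actually stores such fields.

```python
from array import array


class CounterStore:
    """Per-document counters held in a flat array, separate from the search index."""

    def __init__(self, num_docs: int):
        # One signed 64-bit slot per document id, all starting at zero.
        self.values = array("q", [0] * num_docs)

    def increment(self, doc_id: int, delta: int = 1) -> None:
        # In-place write: the indexed document itself is not rebuilt.
        self.values[doc_id] += delta

    def get(self, doc_id: int) -> int:
        return self.values[doc_id]
```

Each update touches one array slot, which is why this kind of storage suits counters that change far more often than the rest of the document.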

Sensei in memory M/R

[Diagram: the broker sends the query to Node1 and Node2; each node works on its own Lucene segments]

•  map(IntArray docs, FieldAccessor, FacetCountAccessor): runs on each node, over its Lucene segments

•  List<MapResult> combine(List<MapResult>): runs on each node, merging its map results

•  JSONObject reduce(List<MapResult>): runs on the broker

•  select distinctCount(memberId), sum(clickCount) where geo = 'US/CA/SF' group by seniority, age
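The query above can be traced through the map, combine and reduce phases with a small Python sketch. It is a simplified model rather than Sensei's Java API: documents are plain dictionaries, each segment is a list, and the distinct count is computed exactly with sets, which is an assumption made for clarity.

```python
def map_segment(docs):
    """Per-segment step: partial aggregates keyed by the group-by columns."""
    partial = {}
    for doc in docs:
        if doc["geo"] != "US/CA/SF":
            continue
        key = (doc["seniority"], doc["age"])
        members, clicks = partial.get(key, (set(), 0))
        partial[key] = (members | {doc["memberId"]}, clicks + doc["clickCount"])
    return partial


def merge(partials):
    """Merge partial aggregates: union the member sets, add the click sums."""
    merged = {}
    for partial in partials:
        for key, (members, clicks) in partial.items():
            old_members, old_clicks = merged.get(key, (set(), 0))
            merged[key] = (old_members | members, old_clicks + clicks)
    return merged


def combine(map_results):
    """Per-node step: shrink the segment results before they are sent to the broker."""
    return [merge(map_results)]


def reduce(combined_results):
    """Broker step: final merge, then turn member sets into distinct counts."""
    merged = merge(combined_results)
    return {
        key: {"distinctCount": len(members), "sum": clicks}
        for key, (members, clicks) in merged.items()
    }
```

With two nodes, the broker would compute `reduce(combine(node1_maps) + combine(node2_maps))`, where each `nodeN_maps` is the list of `map_segment` results for that node's segments.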


Roadmap

•  Just finished

o  Sensei aggregation functions

o  Map/Reduce analytics engine

•  Plan

o  Goshawk: for business intelligence (WVMP v2, LI Impressions)

o  Zoie redesign to support fixed-length in-memory segments

Sensei tweets demo

Questions?

•  SeaS Homepage: http://go/seas

•  Questions: ask_seas@

•  Sensei homepage: senseidb.com

•  Sensei Google group: sensei-search
