How Lucene Powers the LinkedIn Segmentation and Targeting Platform


Hien Luu & Raj Rangaswamy

About Us

Hien Luu, Rajasekaran Rangaswamy

Agenda

•  A little bit about LinkedIn
•  Segmentation & Targeting Platform overview
•  How Lucene powers the Segmentation & Targeting Platform
•  Q&A

Our Mission: Connect the world’s professionals to make them more productive and successful.

Our Vision: Create economic opportunity for every professional in the world.

Members First!

©2013 LinkedIn Corporation. All Rights Reserved.

The world’s largest professional network

Over 65% of members are now international

•  Company Pages: >3M
•  Languages: 19
•  >30M
•  >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
•  Professional searches in 2012: >5.7B

Other Company Facts

•  Headquartered in Mountain View, Calif., with offices around the world
•  LinkedIn has ~4200 full-time employees located around the world

Segmentation & Targeting Platform Overview


1. Create Attributes

§  Name
§  Email
§  State
§  Occupation
§  Etc.

2. Attributes Added to Table

Name         Email             State        Occupation   …
John Smith   jsmith@blah.com   California   Engineer
Jane Smith   smithj@mail.com   Nevada       HR Manager

3. Create Target Segment: California, Engineer

Name         Email             State        Occupation
John Smith   jsmith@blah.com   California   Engineer
Jane Doe     jdoe@email.com    California   Engineer

4. Export List & Send to Vendor

Jane Doe     jdoe@email.com    California   Engineer
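The four steps above can be sketched in a few lines of Python. This is only an illustration built on the sample rows from the slides: the helper names `build_segment` and `export_list`, and the in-memory list of dictionaries, are assumptions made for the example and say nothing about how the platform actually stores attributes.

```python
import csv
import io

# Sample member rows taken from the slides (steps 2 and 3).
MEMBERS = [
    {"Name": "John Smith", "Email": "jsmith@blah.com", "State": "California", "Occupation": "Engineer"},
    {"Name": "Jane Smith", "Email": "smithj@mail.com", "State": "Nevada", "Occupation": "HR Manager"},
    {"Name": "Jane Doe", "Email": "jdoe@email.com", "State": "California", "Occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Return the rows whose attributes equal every given criterion (hypothetical helper)."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def export_list(rows, columns):
    """Serialize a segment as CSV text, keeping only the requested columns (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Step 3: target segment "California, Engineer". Step 4: export the list.
segment = build_segment(MEMBERS, State="California", Occupation="Engineer")
exported = export_list(segment, ["Name", "Email"])
print(exported)
```

A linear scan like this is fine for three rows; the rest of the deck is about doing the same kind of filtering over hundreds of millions of members, which is where an inverted index comes in.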

•  Business definition
–  Business would like to launch new campaigns often
–  Business would like to specify targeting criteria using an arbitrary set of attributes
–  Attributes need to be computed to fulfill the targeting criteria
–  The attribute data resides on Hadoop or TD
–  Business is most comfortable with a SQL-like language


Attribute Computation Engine

Attribute Serving Engine

Attribute Computation Engine

•  Self-service
•  Support various data sources
•  Attribute consolidation
•  Attribute availability

Attribute computation

[Figure: attribute computation diagram with the labels ~238M, PB, TB, TB, ~440]

Attribute Serving Engine

•  Self-service
•  Attribute predicate expression
•  Build segments
•  Build lists

Attribute Serving Engine

[Figure: serving engine diagram. Operations shown: count, filter, sum, complex expressions. Labels: ~238M, ~440]

Who are the North American recruiters who don’t work for a competitor?

Who are the LinkedIn Talent Solutions prospects in Europe?

Who are the job seekers?

How Lucene powers Segmentation & Targeting Platform

•  Architecture
–  Indexer Architecture
–  Serving Architecture
•  Load Balanced Model
•  Next Steps - Distributed Model
•  DocValues
•  Lessons Learnt
•  Why not use an existing solution?

Architecture

[Figure: platform architecture. Components: Data Storage Layer, Attribute Creation Engine, Attribute Materialization Engine, Attribute Computation Engine, Attribute Metastore, Attribute Indexing, Attribute Serving Engine]

Architecture

[Figure: indexer architecture. Components: Avro data in HDFS, MySQL attribute store, Attribute Definitions, Hadoop Indexer MR, shard 1 … shard n in HDFS, Index Merger, Web Servers]

Hadoop Indexer MR

•  Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
•  Reducer: K => NullWritable, V => LuceneDocumentWrapper
•  Output classes: LuceneOutputFormat, RecordWriter, LuceneDocumentWrapper, Document, Index
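As a rough illustration of what an indexing job of this shape produces, the sketch below partitions records into shards and builds one inverted index per shard. It is plain Python, not Hadoop or Lucene code: `shard_for`, `build_shards`, the `member_id` key, and the modulo partitioning are all assumptions made for the example, since the slides do not say how records are assigned to shards.

```python
from collections import defaultdict


def shard_for(member_id, num_shards):
    """Pick a shard by member ID modulo the shard count (assumed partitioning scheme)."""
    return member_id % num_shards


def build_shards(records, num_shards):
    """Build one inverted index per shard.

    Each index maps a (field, value) term to the sorted list of member IDs
    that carry it, which is the role a posting list plays in a real index.
    """
    shards = [defaultdict(list) for _ in range(num_shards)]
    # Visit records in ID order so every posting list comes out sorted.
    for record in sorted(records, key=lambda r: r["member_id"]):
        index = shards[shard_for(record["member_id"], num_shards)]
        for field, value in record.items():
            if field != "member_id":
                index[(field, value)].append(record["member_id"])
    return shards


# Illustrative records only; the field names are made up for the example.
records = [
    {"member_id": 1, "state": "California", "occupation": "Engineer"},
    {"member_id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"member_id": 3, "state": "California", "occupation": "Engineer"},
    {"member_id": 4, "state": "California", "occupation": "Recruiter"},
]

shards = build_shards(records, num_shards=2)
for number, index in enumerate(shards, start=1):
    print(f"shard {number}: {dict(index)}")
```

Because each shard is built independently, the work parallelizes naturally across reducers, and the resulting shards can be merged or served separately.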

Architecture

[Figure: serving architecture. Components: JSON Predicate Expression, JSON Lucene Query Parser, Inverted Index (x3), Segment & List]
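The serving path above turns a JSON predicate expression into a query over inverted indexes. The slides do not show the JSON format, so the structure used below (`op`, `args`, `field`, `value`) is purely hypothetical, as are the toy index and member IDs. The sketch only shows the general technique: each term resolves to a set of member IDs, and boolean operators become set operations.

```python
import json

# Toy inverted index: (field, value) -> set of member IDs. Illustrative data only.
INDEX = {
    ("region", "North America"): {1, 2, 3, 5},
    ("region", "Europe"): {4, 6},
    ("occupation", "Recruiter"): {1, 2, 4, 5},
    ("employer", "Competitor"): {2},
}
ALL_MEMBERS = {1, 2, 3, 4, 5, 6}


def evaluate(node):
    """Recursively evaluate a predicate tree into the set of matching member IDs."""
    op = node["op"]
    if op == "term":
        return set(INDEX.get((node["field"], node["value"]), set()))
    if op == "and":
        return set.intersection(*(evaluate(child) for child in node["args"]))
    if op == "or":
        return set.union(*(evaluate(child) for child in node["args"]))
    if op == "not":
        return ALL_MEMBERS - evaluate(node["arg"])
    raise ValueError(f"unknown operator: {op}")


# "North American recruiters who don't work for a competitor", in the assumed format.
expression = json.loads("""
{"op": "and", "args": [
    {"op": "term", "field": "region", "value": "North America"},
    {"op": "term", "field": "occupation", "value": "Recruiter"},
    {"op": "not", "arg": {"op": "term", "field": "employer", "value": "Competitor"}}
]}
""")

segment = evaluate(expression)
print(sorted(segment), len(segment))
```

The size of the resulting set corresponds to a count over the segment, and the IDs themselves are what a list export would start from.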


Serving – Load Balanced Model

[Figure: load balanced serving. Components: HTTP Request, Load Balancer, Web Server 1, Web Server 2 … Web Server n, Shared Drive holding Shard 1, Shard 2 … Shard n]

But wait…

•  Is load balancing alone good enough?
•  What about distribution and failover?


Next Steps – Distributed Model

Apache Helix

•  A generic cluster management framework
•  Manages partitioned and replicated resources in distributed systems
•  Built on top of ZooKeeper, hiding the complexity of ZK primitives
•  Provides distributed features such as leader election, two-phase commit, etc. via a state machine model

http://helix.incubator.apache.org/

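Next Steps – Distributed Model

[Figure: distributed serving. An HTTP Request reaches the Load Balancer, followed by Scatter Gather across Web Server 1, Web Server 2 and Web Server 3. Each server holds one active and one standby shard: (Shard 1 active, Shard 2 standby), (Shard 2 active, Shard 3 standby), (Shard 3 active, Shard 1 standby)]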

Next Steps – Distributed Model

[Figure: the same layout after a failure. The server holding (Shard 3, Shard 1) is marked as failed; the server holding (Shard 2, Shard 3) now has both shards active; the remaining server still has Shard 1 active and Shard 2 standby]
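The failover behaviour in the two diagrams can be modelled with a small state table. The sketch below is a toy in plain Python and does not use the Helix API; the `Cluster` class, the server names `web1` to `web3`, and the promotion rule (a live standby becomes active when its active replica fails) are assumptions chosen to mirror the pictures.

```python
class Cluster:
    """Toy model of shard replicas spread across servers (not the Helix API)."""

    def __init__(self, placement):
        # placement: server name -> {"active": shard, "standby": shard}
        self.replicas = {}  # shard -> {"active": server, "standby": server}
        for server, roles in placement.items():
            for role, shard in roles.items():
                self.replicas.setdefault(shard, {})[role] = server
        self.live = set(placement)

    def fail(self, server):
        """Mark a server as down and promote the standby of every shard it was serving."""
        self.live.discard(server)
        for roles in self.replicas.values():
            if roles.get("active") == server:
                standby = roles.get("standby")
                # Promote the standby only if its server is still alive.
                roles["active"] = standby if standby in self.live else None
                roles["standby"] = None
            elif roles.get("standby") == server:
                roles["standby"] = None

    def scatter_gather(self, query):
        """Send the query to the active replica of every shard and collect the answers."""
        results = {}
        for shard, roles in sorted(self.replicas.items()):
            if roles.get("active") is None:
                raise RuntimeError(f"{shard} has no live replica")
            results[shard] = f"{roles['active']} answered {query!r}"
        return results


cluster = Cluster({
    "web1": {"active": "shard1", "standby": "shard2"},
    "web2": {"active": "shard2", "standby": "shard3"},
    "web3": {"active": "shard3", "standby": "shard1"},
})
cluster.fail("web3")
print(cluster.scatter_gather("count"))
```

After `web3` fails, `web2` serves both `shard2` and `shard3`, which matches the second diagram. A framework such as Helix would drive these transitions through its state machine rather than through an in-process method call.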


DocValues – Use Case

•  Once segments are built, users want to forecast, that is, see a target revenue projection for the campaigns that they want to run
•  Campaigns can be run on various revenue models
•  This involves adding per-member propensity scores and dollar amounts

DocValues – Why not Stored Fields?

•  Stored fields have one indirection per document, resulting in two disk seeks per document
•  Performance cost quickly adds up when fetching millions of documents

Document ID

.fdx: fetch file pointer to field data

.fdt: scan by ID until field is found

•  Why not use Field Cache?
–  Is memory resident
–  Works fine when there is enough memory
–  But keeping millions of un-inverted values in memory is impossible
–  Additional cost to parse values (from String and to String)

DocValues

•  Dense column-based storage
–  (1 value per document and 1 column per field and segment)
•  Accepts primitives
•  No conversion from/to String needed
•  Loads 80x-100x faster than building a FieldCache
•  All the work is done during indexing
•  DocValue fields can be indexed and stored too
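The difference between row-oriented stored fields and a dense per-field column can be shown with a small Python model. This is not Lucene code: the lists and arrays below are stand-ins, the field names `propensity` and `dollars` are hypothetical, and the revenue formula (propensity times dollar amount, summed over the segment) is an assumption about how a projection might be computed.

```python
from array import array

# Row-oriented "stored fields": one record per document, values kept as strings.
stored_fields = [
    {"propensity": "0.10", "dollars": "200"},
    {"propensity": "0.50", "dollars": "100"},
    {"propensity": "0.25", "dollars": "400"},
    {"propensity": "0.80", "dollars": "50"},
]

# Column-oriented layout: one dense primitive array per field, indexed by document ID.
propensity = array("d", [0.10, 0.50, 0.25, 0.80])
dollars = array("q", [200, 100, 400, 50])


def projected_revenue_rows(doc_ids):
    """Fetch each document's record, then parse the strings for every value."""
    return sum(
        float(stored_fields[d]["propensity"]) * int(stored_fields[d]["dollars"])
        for d in doc_ids
    )


def projected_revenue_columns(doc_ids):
    """Read primitives straight from the per-field columns; no parsing involved."""
    return sum(propensity[d] * dollars[d] for d in doc_ids)


# Document IDs that matched some segment query (illustrative).
segment_doc_ids = [0, 2, 3]
print(projected_revenue_rows(segment_doc_ids), projected_revenue_columns(segment_doc_ids))
```

Both functions return the same number. The column version does so with one array lookup per value and no string conversion, which is the property the slide attributes to DocValues.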


Lessons Learnt

Indexing

•  Reuse index writers, field and document instances
•  Create many partitions and merge them in a different process
•  Rebuild (bootstrap) entire index if possible
•  Use partial updates with caution
•  Analyze the index
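The advice to build many partitions and merge them in a separate step can be pictured with toy inverted indexes. The sketch below is plain Python rather than Lucene; `merge_partitions` and the `field:value` term strings are hypothetical, and it assumes document IDs are unique across partitions so that posting lists can be combined with a simple sorted merge.

```python
import heapq
from collections import defaultdict


def merge_partitions(partitions):
    """Merge several partial inverted indexes into one.

    Each partition maps a term to a sorted list of document IDs. Document IDs
    are assumed to be unique across partitions, so merging a term's posting
    lists is a plain sorted merge.
    """
    by_term = defaultdict(list)
    for partition in partitions:
        for term, postings in partition.items():
            by_term[term].append(postings)
    return {term: list(heapq.merge(*lists)) for term, lists in by_term.items()}


# Two partitions built independently, for example by different reducers.
part_a = {"state:California": [1, 3], "occupation:Engineer": [1, 3]}
part_b = {"state:California": [4], "state:Nevada": [2], "occupation:Engineer": [2]}

merged = merge_partitions([part_a, part_b])
print(merged)
```

Keeping the merge separate from the partition builds means slow merging does not hold up the jobs that produce partitions. In Lucene itself, existing indexes are typically combined with `IndexWriter.addIndexes`.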

Serving

•  Reuse a single instance of IndexSearcher
•  Limit usage of stored fields and term vectors
•  Plan for load balancing and failover
•  Cache term frequencies
•  Use different machines for serving and indexing


Why not use existing solutions?

•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Indexing elevates query latency

•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Larger memory overhead
•  Comparatively slow

Questions?

More info: data.linkedin.com
