Recruiting Solutions
Sriram Sankar, Principal Staff Engineer
Daniel Tunkelang, Head, Query Understanding
Socializing Search. Professionally.
Whether you’ve tried to find an Apache committer…
…or an Apache commander,
you’ve probably used LinkedIn Search.
Let’s talk about…
• Infrastructure (Sriram)
• Quality (Daniel)
LinkedIn Search leverages the economic graph.
Social means that relevance is highly personalized.
Machine-learned ranking, socially.
Relevance models incorporate user features:
score = P(Document | Query, User)
Our model: tree with logistic regression leaves.
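[Figure: decision tree. Internal nodes test features, e.g. "X2 = ?" (branches X2 = 0 and X2 = 1) and "X10 < 0.1234 ?" (branches Yes and No).]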
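A minimal Python sketch of how a tree with logistic regression leaves might produce a score: the tree routes a feature vector to one leaf, and that leaf applies its own logistic regression. The split tests mirror the figure's labels (X2, X10), but the tree shape, weights, and biases are invented for illustration and are not LinkedIn's model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Leaf:
    """A leaf holding its own logistic regression (weights are hypothetical)."""

    def __init__(self, weights, bias):
        self.weights = weights  # feature index -> weight
        self.bias = bias

    def score(self, x):
        z = self.bias + sum(w * x[i] for i, w in self.weights.items())
        return sigmoid(z)


class Split:
    """An internal node that routes the feature vector to one child."""

    def __init__(self, test, if_true, if_false):
        self.test = test
        self.if_true = if_true
        self.if_false = if_false

    def score(self, x):
        child = self.if_true if self.test(x) else self.if_false
        return child.score(x)


# Hypothetical tree: split on X2 first, then on X10 < 0.1234.
model = Split(
    test=lambda x: x[2] == 1,
    if_true=Leaf({10: 2.0}, bias=-1.0),
    if_false=Split(
        test=lambda x: x[10] < 0.1234,
        if_true=Leaf({10: 0.5}, bias=-2.0),
        if_false=Leaf({10: 1.5}, bias=0.0),
    ),
)

# Features are keyed by index; in practice they would combine
# document, query, and user signals.
print(model.score({2: 1, 10: 0.5}))
```

The tree partitions the feature space so that each region gets a separately fitted linear model, which is one way to capture interactions that a single logistic regression would miss.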
LinkedIn’s focus: entity-oriented search.
[Figure: entity search screenshot. Visible labels: Company, Employees, Jobs, Name, Search.]
Query understanding can act as a relevance filter.
for i in [1..n]
    s ← w1 w2 … wi
    B[i] ← {}
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs U {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] U {a}
    sort B[i] by prob
    truncate B[i] to size k
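The segmentation procedure above can be sketched in runnable Python. Here `beams[i]` plays the role of B[i] (the best k segmentations of the first i words), and `beams[0]` holds one empty segmentation so that the "whole prefix is one segment" case and the "extend an earlier segmentation" case share one loop. The probability table `PC` is a made-up stand-in for Pc.

```python
def segment(words, pc, k=5):
    """Return up to k segmentations of `words` as (prob, segments), best first."""
    n = len(words)
    beams = [[] for _ in range(n + 1)]
    beams[0] = [(1.0, [])]  # one empty segmentation with probability 1
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            s = " ".join(words[j:i])  # candidate last segment: words j+1 .. i
            p = pc(s)
            if p > 0:
                for prob, segs in beams[j]:
                    candidates.append((prob * p, segs + [s]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[i] = candidates[:k]  # truncate the beam to size k
    return beams[n]


# Hypothetical segment probabilities, for illustration only.
PC = {
    "linkedin": 0.5,
    "software": 0.2,
    "engineer": 0.2,
    "software engineer": 0.3,
}

results = segment("linkedin software engineer".split(), lambda s: PC.get(s, 0.0), k=2)
for prob, segs in results:
    print(round(prob, 4), segs)
```

With these numbers the top segmentation keeps "software engineer" together as one segment, because its probability exceeds the product of the two single-word segments.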
Less is more.
warren buffett
Coming soon: entity-driven search assist.
Search: link
Jobs at LinkedIn
People currently working at LinkedIn
People who used to work at LinkedIn
Infrastructure
Lucene
Map of terms to documents – the index
Provides an API to add documents to and remove documents from the index
Provides an API to query the index
[Figure: two example documents, (1) "… Daniel … LinkedIn …" and (2) "… Sriram … LinkedIn …". The inverted index maps the terms Daniel, Sriram, and LinkedIn to the numbers of the documents containing them; the forward index maps each document number to its content.]
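To make the two structures concrete, here is a toy Python index with add, remove, and query operations. It is only a sketch of the idea: the class `TinyIndex` and its methods are invented for this example and are not Lucene's API, and tokenization is a plain lowercase whitespace split.

```python
from collections import defaultdict


class TinyIndex:
    def __init__(self):
        self.inverted = defaultdict(set)  # term -> ids of documents containing it
        self.forward = {}                 # doc id -> stored document text

    def add(self, doc_id, text):
        self.remove(doc_id)  # re-adding a document replaces the old version
        self.forward[doc_id] = text
        for term in set(text.lower().split()):
            self.inverted[term].add(doc_id)

    def remove(self, doc_id):
        text = self.forward.pop(doc_id, None)
        if text is None:
            return
        for term in set(text.lower().split()):
            self.inverted[term].discard(doc_id)
            if not self.inverted[term]:
                del self.inverted[term]

    def query(self, *terms):
        """AND query: sorted ids of documents that contain every term."""
        if not terms:
            return []
        postings = [self.inverted.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*postings))


index = TinyIndex()
index.add(1, "blah blah Daniel blah LinkedIn blah")
index.add(2, "blah Sriram blah LinkedIn blah blah")

print(index.query("linkedin"))            # both documents
print(index.query("daniel", "linkedin"))  # document 1 only
print(index.forward[2])                   # forward lookup: id -> content
```

The inverted index answers "which documents contain this term?", while the forward index answers "what is in this document?", which is what a later scoring step needs.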
A standard scoring capability is built in
Extremely easy to build a search engine
But difficult to get sophisticated
The LinkedIn Search Stack
[Figure: architecture diagram. Request path: Request, Query Rewriter, Index Retrieval, Scorer, Sorter/Blender, Response. Data path: Offline Data Building, Live Updates.]
Search Index Served by Lucene
Inverted index
Forward index
Static rank based document ordering
Offline Data Builds on Hadoop
Multi-stage MapReduce pipeline allows complex data processing
Produces a sharded, single-segment Lucene index with documents sorted by static rank
Produces data models for use in query rewriting
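As a rough illustration of the "sharded, sorted by static rank" output, the following plain-Python sketch imitates one map step and one reduce step. It does not use Hadoop or Lucene; the field names `id` and `static_rank`, the modulo sharding rule, and the function names are all assumptions made for the example.

```python
from collections import defaultdict


def map_phase(docs, num_shards):
    """Emit (shard_id, doc) pairs; sharding by id modulo is an assumed rule."""
    for doc in docs:
        yield doc["id"] % num_shards, doc


def reduce_phase(pairs):
    """Group documents per shard and order each shard by static rank, best first."""
    grouped = defaultdict(list)
    for shard_id, doc in pairs:
        grouped[shard_id].append(doc)
    return {
        shard_id: sorted(shard_docs, key=lambda d: d["static_rank"], reverse=True)
        for shard_id, shard_docs in grouped.items()
    }


docs = [
    {"id": 1, "static_rank": 0.2},
    {"id": 2, "static_rank": 0.9},
    {"id": 3, "static_rank": 0.7},
    {"id": 4, "static_rank": 0.1},
]

shards = reduce_phase(map_phase(docs, num_shards=2))
for shard_id in sorted(shards):
    print(shard_id, [d["id"] for d in shards[shard_id]])
```

Because each shard stores its documents best-first, a document's position within the shard can double as its local document number, which is what makes rank-ordered retrieval possible later.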
Live Data Updates
Feed-based framework to support updates to offline data builds
Lucene enhanced with a partial index update capability
Query Rewriting (and Planning)
Accepts raw query and user metadata
Produces Lucene retrieval query and metadata for scoring
May use data models built offline
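The rewriter's contract (raw query plus user metadata in, retrieval query plus scoring metadata out) can be sketched as a single function. Everything specific here is hypothetical: the `ENTITY_MODEL` table stands in for a data model built offline, the field names `company` and `title` are invented, and the output only imitates the `field:"phrase"` style of Lucene query strings rather than reproducing the actual rewriter.

```python
# Hypothetical offline-built model: known phrase -> index field.
ENTITY_MODEL = {
    "linkedin": "company",
    "software engineer": "title",
}


def rewrite(raw_query, user):
    """Return (retrieval_query, scoring_metadata) for a raw query and a user."""
    words = raw_query.lower().split()
    clauses = []
    i = 0
    while i < len(words):
        # Greedy longest match of words[i:j] against the model.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in ENTITY_MODEL:
                clauses.append(f'{ENTITY_MODEL[phrase]}:"{phrase}"')
                i = j
                break
        else:
            clauses.append(words[i])  # unknown word: keep as a plain term
            i += 1
    retrieval_query = " AND ".join(clauses)
    scoring_metadata = {
        "user_id": user["id"],
        "connections": set(user.get("connections", ())),
    }
    return retrieval_query, scoring_metadata


query, metadata = rewrite("LinkedIn software engineer", {"id": 7, "connections": [1, 2]})
print(query)
print(metadata)
```

Keeping the scoring metadata separate from the retrieval query lets retrieval stay cheap while the personalized signals travel alongside for the later scoring step.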
Index Retrieval
Lucene query built by query rewriter is used to retrieve documents from the Lucene index
Documents are retrieved in static rank order (best document first)
Retrieval may be terminated early, because documents are retrieved in static rank order
No scoring is performed during retrieval
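A small Python sketch of why static rank ordering permits early termination. It assumes posting lists hold document numbers in ascending order and that a smaller number means a better static rank, so the first matches found are already the best candidates by static rank. The function `retrieve` and its `limit` parameter are illustrative names, not part of Lucene.

```python
def retrieve(posting_lists, limit):
    """AND-intersect posting lists and stop after `limit` matches.

    Each posting list is a list of doc numbers in ascending order,
    where a smaller number means a better static rank (an assumption).
    """
    if not posting_lists:
        return []
    shortest = min(posting_lists, key=len)
    others = [set(p) for p in posting_lists if p is not shortest]
    hits = []
    for doc_id in shortest:  # walks documents best static rank first
        if all(doc_id in other for other in others):
            hits.append(doc_id)
            if len(hits) == limit:
                break  # early termination: the rest can only rank lower
    return hits


linkedin_postings = [0, 2, 3, 5, 8, 9]
engineer_postings = [2, 3, 4, 8, 9]

print(retrieve([linkedin_postings, engineer_postings], limit=2))
print(retrieve([linkedin_postings, engineer_postings], limit=10))
```

No scoring happens inside the loop, consistent with the slide: retrieval only collects candidates, and stopping early bounds the work handed to the scorer.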
Scoring
Scoring is performed after retrieval
Its input is the retrieved document (i.e., includes the forward index), a description of how the retrieval query matched the document, and the scoring metadata produced by the rewriter
Costly features can be computed offline during the index building process in Hadoop – e.g., tf/idf calculations
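The three scoring inputs listed above can be shown as arguments to one function. This is a sketch under invented assumptions: the linear formula, the weights, and the field names (`static_rank`, `offline_tfidf`, `matched_fields`, `connections`) are placeholders, not the production scorer, which the earlier slides describe as a machine-learned model.

```python
def score(doc, match_info, metadata):
    """Combine forward-index fields, match description, and rewriter metadata."""
    s = doc["static_rank"]
    s += doc.get("offline_tfidf", 0.0)            # costly feature precomputed offline
    s += 0.5 * len(match_info["matched_fields"])  # how the query matched the document
    if doc["id"] in metadata["connections"]:      # personalization from user metadata
        s += 1.0
    return s


def rank(retrieved, metadata, top_k):
    """Score every retrieved (doc, match_info) pair and return the best doc ids."""
    scored = [(score(doc, info, metadata), doc["id"]) for doc, info in retrieved]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


retrieved = [
    ({"id": 1, "static_rank": 0.9, "offline_tfidf": 0.1}, {"matched_fields": ["name"]}),
    ({"id": 2, "static_rank": 0.4, "offline_tfidf": 0.2}, {"matched_fields": ["name", "company"]}),
]
metadata = {"connections": {2}}

print(rank(retrieved, metadata, top_k=2))
```

In this example the document with the lower static rank ends up first, because it matches more fields and belongs to one of the searcher's connections; that reordering is the reason scoring runs as a separate step after rank-ordered retrieval.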
Summary
Quality
LinkedIn Search leverages the economic graph.
Social means that relevance is highly personalized.
Less is more: query understanding is a relevance filter.
Moving in the direction of suggesting structured queries.
System
Powered by Lucene, but with additional components.
Offline data builds on Hadoop, partial index updates.
Index uses static ranking and early termination.
Scoring performed outside of Lucene.
Sriram Sankar
https://linkedin.com/in/sriramxsankar
Daniel Tunkelang
https://linkedin.com/in/dtunkelang