Jayesh Govindarajan
Search Relevance @ Salesforce
Improving Enterprise findability
Jayesh Govindarajan
Senior Director Search Relevance, Data Science
Salesforce
1. How is search in the enterprise different?
2. Enterprise findability problem
3. Relevance, LETOR algorithms
4. Deploying models in Solr
5. A model for every customer
6. Putting the pieces together
Forward-Looking Statements
Statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
Largest Enterprise Search Service!
- 1.6 Average Click Rank
- 300TB+ Index Size
- 500B+ Documents in the Index
- 600M+ Queries / Week
- 7B+ Index Updates / Day
- <2min Incremental Index Latency
- <120ms Query Latency on Search Server
The Search Vision
- Empower enterprise users to effortlessly find all the information they need in order to be successful with Salesforce
- Intelligent, Fast and Powerful
- Be a competitive differentiator for Salesforce
What information do you need?
Demo time
1. How is enterprise search different?
Diversity of Data is a Challenge!
- Sales Cloud: structured data (SFA, B2C)
- Service Cloud: unstructured data (Case Mgmt, KB, Field Svc)
- Community Cloud: enterprise social data (Q&A, Chatter, Files)
- App Cloud: Search APIs
- Person Search: people data
Diversity of Intentions:
- A service agent exploring a community forum to educate himself: Recall
- A service agent looking for a case similar to the one she is currently assigned: Precision
- A sales rep looking for a named account to call: Precision
- A sales rep looking for contacts in an industry within a certain geo: Recall

Patterns of search and discovery differ by user role and by the entity being searched.
Customer diversity: one size doesn't fit all
Matching models to customer orgs:
- Some orgs want a lower coefficient in some cases
- Some orgs want a higher coefficient
2. Understanding the enterprise findability problem
Most ranking functions start off with a few boosts and end up like... this.

Form:
1. Query-independent signals: multiplicative boosts in the range [1-3]
2. Entity-specific signals: additive boosts in the range [1-12]
   a. Accounts, Contacts, Leads: LastActivityScore, LastModifiedScore
   b. Cases: CaseStatus, CaseEscalationScore
3. ...
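The shape of such a hand-tuned function can be sketched as follows. This is a toy illustration only; the field names and boost values are assumptions, not the production equation:

```python
def legacy_rank_score(lucene_score, doc):
    """Toy hand-tuned ranker: a multiplicative query-independent boost,
    then entity-specific additive boosts, as described above."""
    score = lucene_score
    # Query-independent multiplicative boost, clamped to the [1, 3] range
    score *= min(3.0, 1.0 + doc.get("page_view_boost", 0.0))
    # Entity-specific additive boosts
    if doc.get("entity") in ("Account", "Contact", "Lead"):
        score += doc.get("last_activity_score", 0.0) + doc.get("last_modified_score", 0.0)
    elif doc.get("entity") == "Case":
        score += doc.get("case_status", 0.0) + doc.get("case_escalation_score", 0.0)
    return score
```

Every new signal adds another hand-picked constant, which is why these functions grow unwieldy over time.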
Getting to a machine-learned function has challenges.

Constraint:
1. Customers build apps on enterprise search platforms. One cannot simply cut over to a new ranking system.

Key lessons:
2. Understanding the current search equation is key to anticipating customer breakage/impact.
3. It is important to formalize the "human intelligence" equation behind a working system.
3. Machine learning and Learning to Rank methods
Master Relevance Equation

Build a probabilistic model of relevance. The chance that the user clicks on the i-th record:

pr(r, q, i) = L(r, q) * Rb(i)

- L() maps the (record, query) pair to the likelihood that a user clicks on r in response to q
- Rb() corrects for the positional-bias probabilities
Goal: learn the best linear function of these variables from queries and clicks.
Logistic Regression

queryId      docPV  score   Clicked
-hnjnxlbxd   0      10.892  1
1ttuuy6n3    5      0.230   0
1ttuuy6n3    0      0.232   0
1ttuuy6n3    0      0.230   0
1ttuuy6n3    0      0.230   1
1ttuuy6n3    0      0.244   0
1ttuuy6n3    0      0.231   0
1ttuuy6n3    6      0.228   0
1ttuuy6n3    5      0.228   0
1ttuuy6n3    0      0.231   0
If this were the data, the simplest approach would be logistic regression:

P(clicked) = sigmoid(a0 + a1*docPV + a2*score)

- a1, a2: the incremental effects of docPV and score on relevance
- a0: a bias term that affects all observations equally

The only thing to change is that we want a separate bias term for each positional rank.
Example: Query/Click Data
- Features: the page views and Lucene score for the result in each position (1 through 5)
- Label: the position that was clicked

Learning
[Diagram: Result 1 through Result 5, each scored with shared weights and its own bias b1 ... b5]
- Five logistic regressions with shared weights, but different biases.
- Coefficients and biases are learned via MLE (SGD).
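The learning setup above can be sketched with NumPy. The data here is synthetic and the true coefficients and learning-rate choices are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic click data: two features per result plus a result position.
rng = np.random.default_rng(0)
n, n_pos = 5000, 5
X = rng.normal(size=(n, 2))           # features: docPV, Lucene score
pos = rng.integers(0, n_pos, size=n)  # result position of each observation
true_w = np.array([0.3, 0.6])
true_b = np.array([0.0, -0.12, -0.17, -0.24, -0.06])  # positional biases
y = (rng.random(n) < sigmoid(X @ true_w + true_b[pos])).astype(float)

# Shared weights w, one bias per position: gradient descent on the
# log-loss (maximum likelihood estimation).
w, b, lr = np.zeros(2), np.zeros(n_pos), 0.5
for _ in range(500):
    g = sigmoid(X @ w + b[pos]) - y   # gradient of log-loss w.r.t. the logit
    w -= lr * (X.T @ g) / n
    for k in range(n_pos):
        b[k] -= lr * g[pos == k].mean()
```

This recovers weights close to `true_w` and per-position biases close to `true_b`; the per-position bias update is the only difference from plain logistic regression.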
Results
---------- ACCOUNTS ----------

Coefficients:
docPC               0.203
docPV               0.312
doclm_score         0.642
lastAccessed_score  0.34
score               0.251

Rank Bias:
Rank 1  1.0
Rank 2  0.884
Rank 3  0.843
Rank 4  0.788
Rank 5  0.94
The model is much better at predicting which of the five results will be clicked.

It detects 50% of the occurrences where the 5th result was clicked, and is wrong on 1 out of 8 attempts.

All else being equal, the odds of clicking position 2 are about 0.884 times the odds of clicking position 1.
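As a worked example, here is one way to turn those numbers into click probabilities. The feature values are invented, and reading the rank-bias column as a multiplicative odds factor relative to rank 1 is an assumption:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# ACCOUNTS coefficients from the results above; feature values are made up.
coeffs = {"docPC": 0.203, "docPV": 0.312, "doclm_score": 0.642,
          "lastAccessed_score": 0.34, "score": 0.251}
feats = {"docPC": 1.0, "docPV": 2.0, "doclm_score": 0.5,
         "lastAccessed_score": 0.0, "score": 1.2}

p1 = sigmoid(sum(coeffs[f] * v for f, v in feats.items()))  # P(click) at rank 1
odds2 = 0.884 * p1 / (1.0 - p1)   # scale the odds by the rank-2 bias
p2 = odds2 / (1.0 + odds2)        # convert odds back to a probability
```

The same record is strictly less likely to be clicked at rank 2 than at rank 1, which is exactly the positional bias the model corrects for.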
Opportunities are the subject of more general searches. E.g. “Which opportunities are John Smith working on?”
Searches for accounts or cases are more likely to be very specific. E.g. “I have a specific account in mind…”
Results: Coefficients (normalized by relative influence)

Feature columns: q1/docPV, q2/docPC, d1/doclm_score, d2/lastaccessed, d3/oppclosedate, d4/oppclosed, d5/caseEscState, d6/caseClosed, and the Lucene score. Each entity uses only the features that apply to it; the last value in each row is the Lucene score coefficient.

users     3.12  7.24
groups    4.59  4.59
files     1.32  3.74
cases     0.85  0.15  0.49  0.33  1.21
leads     2.01  2.30  1.01  1.62
contacts  1.87  1.42  1.15  0.76  2.76
accounts  1.64  1.07  1.65  0.96  4.17
oppy      2.27  0.53  0.53  0.62  0.81  3.64
kb        0.50  0.32  2.00
● The Lucene score is always the most important (except for leads).
● LastModified is extremely important for accounts and leads, but not at all for cases.
4. Implementing the model representation in Solr
Relevance Metadata JSON format:

{
  "schema": 1.0,
  "global": {
    "pc_s": 1.5,    // Boost Parent Child scores by 1.5
    "pv_s": 2.0,    // Boost PageView counts by 2
    "lm_s": 1.333   // Last Modified
  },
  "entity": {
    "500": {        // Specific to Cases (key prefix 500)
      "cc_s": 4.0,  // Boost Open Cases
      "lm_s": 1.0   // Apply a different boost for Last Modified
    }
  }
}
Relevance Metadata (RMD) lives in config/settings/default/search.xml under the RelevancyMetadata section.
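A minimal sketch of how such a document could be consumed, using the field names from the RMD example above. The merge rule (entity-specific values override global ones) is an assumption about the intended semantics:

```python
import json

# The RMD example from above, minus the inline annotations (JSON has no comments).
RMD = json.loads("""
{
  "schema": 1.0,
  "global": {"pc_s": 1.5, "pv_s": 2.0, "lm_s": 1.333},
  "entity": {"500": {"cc_s": 4.0, "lm_s": 1.0}}
}
""")

def boosts_for(rmd, key_prefix):
    """Effective boosts for a record: global values, overridden per entity."""
    merged = dict(rmd["global"])
    merged.update(rmd["entity"].get(key_prefix, {}))
    return merged

# A Case record (key prefix "500") gets lm_s overridden from 1.333 to 1.0
# and picks up the case-specific cc_s boost.
print(boosts_for(RMD, "500"))
```

Other entities fall through to the global section untouched, so adding a per-entity override never requires changing the global defaults.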
Relevance Model:

{
  schemaVersion: "V1",
  orgId: "00D1234567",
  Account: { PV: 2.1, LastMod: 0.5 },
  ...
  QIR: "Solr",
  DBRerank: "CoreApp"
}
Model Store
- The JSON is stored in a blob field in a Setup BPO or HBase table.
- Changes to the format/schema won't affect the table (it's just a blob).
AB Experiment: Name, Org, Params (JSON)

Model Deploy: Org, Params (JSON)
Querying
- Pass the same JSON to the query layer and the Solr server; each has code that knows what to do given the JSON.
- Model Builder (offline): tools to help us (devs, PMs, etc.) build the model JSON files.
- Pass the entire JSON to Solr, or just the boost function.
- The same JSON is used to run an A/B experiment and eventually deploy into production.
- Relevancy coefficients are expressed in a JSON data structure, so we can easily specify per-entity or global-to-org coefficients.
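One plausible way to turn coefficients into a Solr request is via edismax function-query boosts. The field names (`page_views`, `last_modified`) and the mapping from RMD keys to function queries are assumptions, not the production schema:

```python
# Hypothetical sketch: per-org coefficients become edismax boost parameters.
coeffs = {"pv_s": 2.0, "lm_s": 1.333}

params = {
    "defType": "edismax",
    "q": "acme",
    # Multiplicative boost built from the page-view coefficient:
    "boost": f"sum(1,mul({coeffs['pv_s']},log(sum(page_views,1))))",
    # Additive recency boost built from the last-modified coefficient
    # (recip over ms since last_modified is the standard Solr recency idiom):
    "bf": f"mul({coeffs['lm_s']},recip(ms(NOW,last_modified),3.16e-11,1,1))",
}
print(params["boost"])
```

Because the coefficients are interpolated at query time, shipping a new model is just shipping new JSON; no Solr config change is required.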
5. Stacking base and custom models
Recap: one size doesn't fit all
Cluster orgs based on their ACR (average click rank) response curves:
- Green orgs are hurt badly by increasing coefficient changes.
- Blue orgs are hurt badly by decreasing coefficient changes.
- Red orgs are hurt badly either way.

Three distinct clusters observed in hierarchical clustering.
Stacking models (RELEVANCE PIPELINE)
- Base Model (all orgs, all entities)
- Entity models: Accounts Model, Case Model, Knowledge Article Model, Feeds Model, ...
- Entity signals feed per-org models: Org 1, Org 2, ..., Org n
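A minimal sketch of the stacking idea. The combination rule (summing base, entity, and org adjustments) and all values here are assumptions for illustration:

```python
# Base model applies to all orgs and entities; entity and org layers
# contribute adjustments on top of it.
base = {"pv": 1.0, "lm": 1.0}
entity = {"Account": {"pv": 0.6}, "Case": {"lm": -0.5}}
org = {"00D1": {"Account": {"pv": 0.2}}}

def stacked_coeff(org_id, ent, feature):
    """Effective coefficient: base + entity adjustment + org adjustment."""
    value = base.get(feature, 0.0)
    value += entity.get(ent, {}).get(feature, 0.0)
    value += org.get(org_id, {}).get(ent, {}).get(feature, 0.0)
    return value

print(stacked_coeff("00D1", "Account", "pv"))  # approximately 1.8
```

Orgs with no custom layer simply fall through to the base and entity models, so a small number of shared models covers the long tail of customers.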
Putting the pieces together: Relevance ML Pipeline, Runtime

Relevance ML Pipeline (RELEVANCE PIPELINE)
Common representation of the ranking model. Infrastructure to automate training, A/B testing, and deployment.

Feature details:
- Config-driven model deployment
- Automated model generation
- Training and A/B experimentation

[Diagram: LEARN / TEST / SHIP loop. Model building on the ML training infrastructure consumes search query logs and produces trained models as RMD JSON (the relevance model representation); model evaluation and A/B experiments test them; model deployment ships the RMD JSON to the relevance runtime (Core App and Solr), where the ranker consumes it.]
Relevance Runtime Infrastructure (RELEVANCE RUNTIME)
Executes machine learning models as a service in Solr, at scale.

Feature details:
- Ranking functions in Solr (linear and non-linear)
- Support for org- and entity-specific models
- Query understanding

[Diagram: the machine learning pipeline at query time. A query passes through query understanding (NLP, Q&A) for intent; enterprise data (content, users/actions) is indexed, along with interaction and behavior signals (clicks, likes, mentions) used as training data. Level 1 retrieves the top K via TF/IDF; Level 2 applies the ML ranker model with feature engineering; Level 3 performs post-ranking and snippet generation before search results are returned from the Salesforce Cloud.]
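The multi-level ranking in the runtime diagram can be sketched as a two-phase retrieve-then-rerank loop. The level structure follows the diagram, but the scoring functions here are toy stand-ins, not the production models:

```python
def level1_topk(index, query_terms, k):
    """Level 1: cheap TF/IDF-style retrieval of the top K candidates."""
    scored = [(sum(doc["tf"].get(t, 0) for t in query_terms), doc) for doc in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def level2_rerank(candidates, weights):
    """Level 2: ML ranker -- a linear model over engineered features."""
    def ml_score(doc):
        return sum(weights.get(f, 0.0) * v for f, v in doc["features"].items())
    return sorted(candidates, key=ml_score, reverse=True)

index = [
    {"id": 1, "tf": {"acme": 3}, "features": {"pv": 0.1}},
    {"id": 2, "tf": {"acme": 1}, "features": {"pv": 0.9}},
    {"id": 3, "tf": {"widgets": 2}, "features": {"pv": 0.5}},
]
top = level2_rerank(level1_topk(index, ["acme"], 2), {"pv": 2.0})
print([d["id"] for d in top])  # [2, 1]: level 2 reorders level 1's candidates
```

Keeping level 1 cheap and running the learned model only over the top K is what makes executing ML models inside the search server feasible at this scale.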
Thank you
We are Hiring!
- ML Engineers
- Engineering Managers
- Software Engineers
- Data Scientists

Join Salesforce Search Cloud
Mining Intent @ Work