Enhancing Search with Predictive Analytics Text Analytics World – San Francisco 2014 Andrew Fast Chief Scientist Elder Research, Inc. [email protected]


Page 1

Enhancing Search with Predictive Analytics

Text Analytics World – San Francisco 2014

Andrew Fast Chief Scientist

Elder Research, Inc. [email protected]

Page 2

Page 3

The Elephant Test

•  “It is difficult to describe, but you know it when you see it.” –  Lord Justice Stuart-Smith, Cadogan Estates Limited v. Morris (1998)

•  Likewise, most textual concepts cannot be easily defined with a single keyword query

Quote from: http://www.bailii.org/ew/cases/EWCA/Civ/1998/1671.html

Page 4

Goals for Today

Show how predictive analytics can be used to improve the findability of elephants through user-customizable search filters

Page 5

Effective Search is Simple, Right?

KEYWORD QUERY + SEARCH INDEX → RELEVANT DOCS

Page 6

[Diagram: KEYWORD QUERY + SEARCH INDEX → RELEVANT DOCS, with added factors {Intent}, {Vocabulary Mismatch}, and # of Users]

Page 7

What Users Want…

Page 8

What Users Get…

Page 9

xkcd, “Second”: http://xkcd.com/1334/

Page 10

“Blind Men and The Elephant”

•  Text mining can be viewed from many different perspectives

•  No single view provides a complete solution

•  Must consider the entire “beast” to get the best solution

Page 11

The 9 Levels of Analytics

Descriptive Techniques:

1 – Standard Reporting
    “How much did we sell last quarter?”
2 – Custom Reporting or “Slicing and Dicing” the Data (Excel)
    “How many investigations did we perform in each state last year?”
3 – Queries/drilldowns (SQL, OLAP)
    “Which contractors received over $10 million in sole-source contracts last year?”
4 – Dashboards/alerts (Business Intelligence)
    “In what sectors have customer complaints grown since last quarter?”
5 – Statistical Analysis
    “Is frequency of communication with the customer correlated with satisfaction?”
6 – Clustering (Unsupervised Learning)
    “How many fundamentally different types of behaviors are in the data and what do they generally look like?”

Predictive Techniques:

7 – Predictive Modeling
    “Which contracts are most likely to be fraudulent?”
8 – Optimization & Simulation
    “What number of investigators would we put on each case to maximize expected return?”
9 – Next Generation Analytics – Text Mining & Link Analysis
    “Do the transactions reveal a coordinated set of people likely to be a fraud ring?”

Page 12

Why Predictive Analytics?

•  Search and predictive modeling each provide a different trade-off between power and generality.

[Chart: Power vs. Generality, locating Document Classification and Keyword Search]

Keyword queries can answer any query, but with limited depth for complex queries.

A predictive model can answer one query well, especially a complex query.

Page 13

Trough of Disillusionment

Source: Hype Cycle for Emerging Technologies 2012, Gartner

Page 14

Our Approach

•  A “search ensemble” ranking function that “boosts” keyword relevance based on a predictive model

[Chart: Keyword Relevance vs. Model Ranking, with four quadrants]

•  High Keyword Relevance, High Model Ranking
•  High Keyword Relevance, Low Model Ranking
•  Low Keyword Relevance, High Model Ranking
•  Low Keyword Relevance, Low Model Ranking
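
The slides do not give the formula behind the “search ensemble” ranking function. As a minimal sketch, assuming a multiplicative boost, the keyword relevance could be scaled up by the model's score. The `Hit` class, the `weight` parameter, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    keyword_score: float  # relevance from the search engine (non-negative)
    model_score: float    # predictive-model score, assumed to lie in [0, 1]


def boosted_score(hit, weight=1.0):
    """Assumed multiplicative boost: keyword relevance scaled by the model score."""
    return hit.keyword_score * (1.0 + weight * hit.model_score)


def rerank(hits, weight=1.0):
    """Return hits sorted by boosted score, best first."""
    return sorted(hits, key=lambda h: boosted_score(h, weight), reverse=True)


# Illustrative values only, one hit for three of the four quadrants.
hits = [
    Hit("a", keyword_score=2.0, model_score=0.10),  # high keyword, low model
    Hit("b", keyword_score=1.8, model_score=0.90),  # high keyword, high model
    Hit("c", keyword_score=0.5, model_score=0.95),  # low keyword, high model
]
ranking = [h.doc_id for h in rerank(hits)]
```

With `weight=0` the ranking falls back to plain keyword order. A larger weight lets the model promote entries, but a hit with low keyword relevance still stays low, so results remain on topic.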

Page 15

CASE STUDY

Page 16

The Problem

•  The Goal: Explore NEW interesting ideas using OLD social entrepreneurship contest entries

•  The Data: A collection of contest entries from 19 different contests sponsored by our client
   –  Contests cover a range of topics such as health, education, literacy, finance, technology, and geo-tourism.

•  The Challenge: Emphasize high-quality entries in the results as entry quality varies widely

Page 17

Combining Search and Predictive Models

•  Keyword ranking does not help you find high-quality entries …

•  … but Model Ranking is not topic centric.

•  Complementary strengths
   –  Search for exploration and discovery
   –  Predictive Models for long-term trends and correlations

Page 18

Target Variable

•  Identify characteristics of past entries that are correlated with the entry being ‘Shortlisted’ by the Contest Judges

•  Rankings:
   1 – Likely Finalist
   2 – Top Tier
   3 – Honorable Mention
   4 – Passed Screening
   5 – No

•  Note: Not every contest used all 5 rankings

‘Shortlisted’  

Page 19

The Inputs

•  Learn a logistic regression model to fit the feature weights

•  Inputs:
   –  Structured Data: Budget Size, Maturity, Impact
   –  Taxonomy: Auto-tagging taxonomy terms
   –  Textual Features: Length, Lexical Diversity
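
A rough sketch of how the textual features and the logistic regression fit might look. The slides do not define “lexical diversity”; the type-token ratio used here is an assumption, as are the whitespace tokenizer, the learning rate, and the toy rows, which are made up and are not the contest data.

```python
import math


def text_features(text):
    """Length in tokens and lexical diversity (unique tokens / total tokens)."""
    tokens = text.lower().split()
    n = len(tokens)
    return {
        "length": n,
        "lexical_diversity": len(set(tokens)) / n if n else 0.0,
    }


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability of the positive class; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit feature weights by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Made-up toy rows: [lexical_diversity, length scaled to 0..1]; 1 = shortlisted.
rows = [[0.2, 0.1], [0.3, 0.2], [0.8, 0.7], [0.9, 0.6]]
labels = [0, 0, 1, 1]
weights = fit_logistic(rows, labels)
```

`predict(weights, x)` then gives a score between 0 and 1 that can be used to rank entries. The structured and taxonomy inputs would enter the same way, as extra columns in each row.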

Page 20

The Taxonomy

•  Joint work with Beth Maser and Richard Iams at PPC

•  Non-traditional, general approach
   –  Broad, flexible taxonomy

•  Focus on the range of interests of the organization

Page 21

Using the Taxonomy

•  Each contest emphasizes different branches of the taxonomy
   –  Taxonomy features need to be contest specific

•  Step 1: Use the “Wisdom of Crowds” to find the center of each contest

•  Step 2: Rate each entry based on the distance from the center
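
The two steps above can be sketched as follows. The slides do not say how the “center” or the distance was computed; this sketch assumes each entry is a bag of auto-tagged taxonomy terms, takes the center as the average tag vector across a contest's entries, and uses cosine distance. The tags and entries are invented for illustration.

```python
import math
from collections import Counter


def centroid(entries):
    """Step 1: average tag-frequency vector over all entries in one contest."""
    total = Counter()
    for tags in entries:
        total.update(tags)
    n = len(entries)
    return {tag: count / n for tag, count in total.items()}


def cosine_distance(a, b):
    """Step 2: 1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical contest: each entry maps taxonomy terms to tag counts.
contest = [
    {"health": 1, "education": 1},
    {"health": 1, "technology": 1},
    {"health": 1},
    {"finance": 1},
]
center = centroid(contest)
distances = [cosine_distance(entry, center) for entry in contest]
```

Here the finance-only entry ends up farthest from the center, since most entries in this toy contest are tagged with health. Because the center is recomputed per contest, the resulting feature is contest specific.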

Page 22

Evaluation: Area Under the ROC Curve

•  Evaluate the overall ranking provided by the model.
   –  Higher means more ‘Shortlisted’ entries at the top of the list
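
For reference, the area under the ROC curve equals the probability that a randomly chosen ‘Shortlisted’ entry is scored above a randomly chosen non-shortlisted one. A small sketch of that pairwise definition, with made-up scores and labels:

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative values: 1 = shortlisted, 0 = not shortlisted.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to a random ordering and 1.0 to every shortlisted entry ranked above every other entry.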

Page 23

Evaluation: Lift

•  Evaluates the improvement using the model at a fixed amount of work
   –  How much more efficient are the judges using our model alone?

•  Every contest showed positive lift.
   –  Maximum lift of 3.3
   –  Average lift of 1.67
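
Lift at a fixed amount of work is commonly computed as the shortlisted rate among the top-scored fraction of entries divided by the overall shortlisted rate. The slides do not state the fraction used; the 20% default and the data below are assumptions for illustration.

```python
def lift_at(scores, labels, fraction=0.2):
    """Shortlisted rate in the top `fraction` by model score, relative to the base rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("no positive labels, lift is undefined")
    return top_rate / base_rate


# Ten hypothetical entries, two of them shortlisted (label 1).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
example_lift = lift_at(scores, labels, fraction=0.2)
```

A lift of 1.0 means reading the top of the model's list is no better than reading entries in random order; values above 1.0 mean reviewers find shortlisted entries faster.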

Page 24

New Contest Data (English only)

Page 25

THE SEARCH APPROACH

Page 26

Our Approach

•  A new search ranking function that “boosts” keyword relevance for probable shortlisted entries

[Chart: Keyword Relevance vs. Model Ranking, with four quadrants]

•  High Keyword Relevance, High Model Ranking
•  High Keyword Relevance, Low Model Ranking
•  Low Keyword Relevance, High Model Ranking
•  Low Keyword Relevance, Low Model Ranking

Page 27

The Prototype Platform

[Diagram components: Data; ERI Text Mining; Model (Predictive + Taxonomy); Search Index; Custom Search Interface]

Page 28

Faceted Search with Solr

Apache Solr is an open-source faceted search engine (http://lucene.apache.org/solr)
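
The slides do not show how the model score was wired into Solr. One possible setup, sketched here as an assumption, stores the model's score in an indexed numeric field and applies it through the Extended DisMax parser's multiplicative `boost` parameter, with faceting enabled on a contest field. The core name `entries` and the field names `shortlist_prob` and `contest` are hypothetical.

```python
from urllib.parse import urlencode


def build_solr_url(base_url, keywords, boost_field="shortlist_prob", facet_field="contest"):
    """Build a select URL that multiplies keyword relevance by (1 + model score)."""
    params = [
        ("q", keywords),
        ("defType", "edismax"),                # Extended DisMax query parser
        ("boost", f"sum(1,{boost_field})"),    # function query multiplied into the score
        ("facet", "true"),
        ("facet.field", facet_field),
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core named "entries".
url = build_solr_url("http://localhost:8983/solr/entries", "clean water")
```

Using `sum(1,…)` rather than the bare field keeps entries with a model score of zero from being multiplied out of the results entirely.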

Page 29

Predictive Analytics works when…

•  You have a specific question in mind
   –  May be a piece of categorical metadata
   –  May be able to be extracted from text
      •  e.g., Country, Disease

•  Human-validated historical data is available

•  Relevant concepts are complex or otherwise hard to define.

Page 30

Text Mining Taxonomy (From Chapter 2)

[Decision-tree figure, rooted at “Text Mining Foundations”. Questions, with the branch labels shown for each:]

•  Are you interested in results about individual words or at a higher level (i.e., sentences, paragraphs or documents)? (Words / Documents)
•  Do you want to sort all documents into categories or search for specific documents? (Search / Sort)
•  Do you want to automatically identify specific facts or gain overall understanding? (Specific Facts / Understanding)
•  Do you have categories already? (No Categories / Have Categories)
•  Are your documents independent or connected via hyperlinks? (Independent / Connected)
•  Is your focus on the meaning of the text or the structure? (Structure / Meaning)

[Text mining areas at the leaves: Information Retrieval, Web Mining, Document Classification, Document Clustering, Information Extraction, Concept Extraction, Natural Language Processing]

Page 31

Make Effective Trade-offs

•  Each text mining area provides a different trade-off between power and generality.

[Chart: Power vs. Generality, locating Document Classification, “More Like This”, Controlled Vocabulary, and Expanded Keyword Search]

Page 32

“Seeing Elephants”

•  “This means to become experienced, or to have passed through life or some event (or series of events) and come out on the other side wiser, or to just plain see, hear, feel, and experience everything that an occasion, or life itself, can provide.” –  Grant Barrett, Host of A Way with Words

http://grantbarrett.com/the-elephant-in-the-language

Page 33

Contact Information

Andrew Fast, Ph.D. Chief Scientist

[email protected]

(434) 973-7673 www.datamininglab.com

Page 34

Practical Text Mining

•  Winner of the 2012 PROSE Award for Computing and Information Science

•  Written for a technical audience seeking more text experience

•  Includes trial versions of major software tools

Page 35

Andrew Fast, Chief Scientist, Elder Research, Inc.

Dr. Fast graduated Magna Cum Laude from Bethel University and earned Master’s and Ph.D. degrees in Computer Science from the University of Massachusetts Amherst. There, his research focused on causal data mining and mining complex relational data such as social networks. At ERI, Andrew leads the development of new tools and algorithms for data and text mining for applications of capabilities assessment, fraud detection, and national security. Dr. Fast has published on an array of applications including detecting securities fraud using the social network among brokers, and understanding the structure of criminal and violent groups. Other publications cover modeling peer-to-peer music file sharing networks, understanding how collective classification works, and predicting playoff success of NFL head coaches (work featured on ESPN.com). With John Elder and other co-authors, Andrew has written a book, Practical Text Mining, that was awarded the PROSE Award for Computing and Information Science in 2012.

Dr. Andrew Fast leads research in Text Mining and Social Network Analysis at Elder Research, the nation’s leading data mining consultancy. ERI was founded in 1995 and has offices in Charlottesville, VA and Washington, DC (www.datamininglab.com). ERI focuses on Federal, commercial, investment, and security applications of advanced analytics, including stock selection, image recognition, biometrics, process optimization, cross-selling, drug efficacy, credit scoring, risk management, and fraud detection.