A Static Rank Framework for Lucene/Solr Mike Schultz [email protected]

Static Rank for Solr/Lucene

• Dynamic Rank

• Why Static Rank

• Combining Scores

• Static Rank Components

Multiple Fields / Multiple Types

• PubDate: Continuous (Date, Int, Float, …)
• IsNews: Boolean (True, False)
• MediaType: Enum (Book, CD, DVD, Cassette)
• TextBody: Text (Natural Language)

Dynamic Rank

• Query Dependent = F(Q,D)
• Huge dynamic range (0.001-1502.3)
• Not comparable across queries
• Not easily normalized

[Diagram labels: PubDate, IsNews, MediaType, TextBody, Query, TF * IDF, Dynamic Score]

Why Static Rank?

All (dynamic) things equal, I want
– Newer over older
– CD over cassette
– Arbitrary feature A over arbitrary feature B

[Diagram labels: PubDate, IsNews, MediaType, TextBody, Query, Static Rank System, Static Score]

Static Rank

• Query Independent = F(D)
– i.e. static across queries
• More easily bounded

[Diagram labels: PubDate, IsNews, MediaType, TextBody, Query, Static Rank System, Static Score]

Combined Rank

[Diagram labels: PubDate, IsNews, MediaType, TextBody, Query, TF * IDF, Static Rank System, Custom Query, Combined Score]

Framework - Requirements

• Intuitive, hand-tunable, debuggable
• Query-time only, no re-indexing
• Minimal parameters
• Static Rank should boost / demote
– But not too much!
– Docs should stay in their own dynamic rank “neighborhood”.

[Diagram labels: Custom Query, Combined Score]

Combining Scores - Approaches

• Addition?
– Dynamic(0.0001) + Static(0.3) = 0.3001
– Dynamic(1542.1) + Static(0.3) = 1542.4
– Difficult to get right across queries

[Diagram labels: Custom Query, Combined Score]

Combining Scores - Approaches

• Multiplication?
– Dynamic(50.0) * Static(0.3) = 15.0
– Dynamic(10.0) * Static(2.0) = 20.0
– Could work, but awkward

[Diagram labels: Custom Query, Combined Score]

Combining Scores - Approaches

1. Bound StaticScore: -1.0 to 1.0
2. CScore = DScore*(100+S%*SScore)
– At most, staticRank will boost/demote dynamicScore by S%
– CScore = 0.014 * (100+30*0.5)
– CScore = 145.3 * (100+30*-0.5)

[Diagram labels: Linear Query, Combined Score]

LinearQuery
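
The three combination approaches from the preceding slides can be compared side by side. This is a minimal sketch, not the framework's LinearQuery code: the function names are made up for illustration, and the default S% = 30 is taken from the worked examples on the slide. The clamp on SScore is an assumption that mirrors step 1 (bound StaticScore to -1.0 to 1.0).

```python
def combine_add(dscore, sscore):
    """Addition: the static score swamps tiny dynamic scores and vanishes next to large ones."""
    return dscore + sscore


def combine_mul(dscore, sscore):
    """Multiplication: works, but the static score needs an awkward, unbounded scale."""
    return dscore * sscore


def combine_linear(dscore, sscore, s_pct=30.0):
    """Linear form from the slide: CScore = DScore * (100 + S% * SScore).

    sscore is clamped to [-1.0, 1.0] (assumed to match the 'Bound StaticScore' step),
    so the static rank can move the dynamic score by at most s_pct percent.
    """
    sscore = max(-1.0, min(1.0, sscore))
    return dscore * (100.0 + s_pct * sscore)


if __name__ == "__main__":
    # The two worked examples from the slide (S% = 30).
    for dscore, sscore in [(0.014, 0.5), (145.3, -0.5)]:
        cscore = combine_linear(dscore, sscore)
        # Relative change versus a neutral static score of 0.
        change = cscore / combine_linear(dscore, 0.0) - 1.0
        print(f"DScore={dscore} SScore={sscore} CScore={cscore:.4f} change={change:+.0%}")
```

The constant 100 rescales every document's score by the same factor, so it does not affect ordering. With S% = 30, the combined score stays between 70 and 130 times DScore whatever the magnitude of DScore, which is what keeps documents in their own dynamic rank "neighborhood".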

Static Rank

• Extend solr.ValueSource/Parser
• Uses field cache for inputs
• Extremely fast

[Diagram labels: PubDate, IsNews, MediaType, TextBody, Query, Static Rank System, Static Score]

Static Rank

[Diagram labels: PubDate, IsNews, MediaType; AgoValueSource (output: years ago)]

Static Rank

[Diagram labels: PubDate, IsNews, MediaType; AgoValueSource; MuxValueSource (0, T, F); outputs in years ago]

MuxValueSource Config

Static Rank

[Diagram labels: PubDate, IsNews, MediaType; AgoValueSource; MuxValueSource (0, T, F); EnumValueSource; outputs in years ago]

EnumValueSource Config

• Maps Fixed-Vocabulary to YEARS AGO
• A hierarchy and 3 values: MIN,0,MAX
• All things equal (dynamically), DVD = +3.3 years

Static Rank

[Diagram labels: PubDate, IsNews, MediaType; AgoValueSource; MuxValueSource (0, T, F); EnumValueSource; SumValueSource; intermediate values in years ago; open question: years ago ? to the range -1 to 1]

Mapping YearsAgo to -1.0 – 1.0

• Step Function: if > 10 years-ago = -1, else = +1
– 1 parameter
– Too abrupt
• Linear
– No parameters (fixed)
– Too gradual over 2000+ years
• Sigmoid
– 2 parameters
– Smooth over entire range
– Easy to calculate

Sigmoid

• Slope
• x-intercept (year)

[Plot: score from 1.0 to -1.0 versus Years-ago; x0 = 1.5 years ago]
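
The three candidate mappings from years-ago to the -1.0 to 1.0 range (step, linear, sigmoid) can be sketched as below. The slides do not give the exact sigmoid formula, so a negated tanh is used here as one plausible function with a slope and an x-intercept. The function names, the 2000-year span of the linear map and the default slope are assumptions. The 10-year cutoff and x0 = 1.5 come from the slides.

```python
import math

LINEAR_SPAN_YEARS = 2000.0  # assumed fixed range, echoing "2000+ years" on the slide


def step_map(years_ago, cutoff=10.0):
    """Step function: -1 if older than the cutoff, else +1 (one parameter, abrupt)."""
    return -1.0 if years_ago > cutoff else 1.0


def linear_map(years_ago):
    """Linear: +1 at 0 years ago, -1 at LINEAR_SPAN_YEARS, clamped (no tunable parameters)."""
    x = max(0.0, min(LINEAR_SPAN_YEARS, years_ago))
    return 1.0 - 2.0 * x / LINEAR_SPAN_YEARS


def sigmoid_map(years_ago, slope=1.0, x0=1.5):
    """Sigmoid with two parameters: slope and x-intercept x0 (score is 0 at x0).

    Newer than x0 gives a positive score, older gives a negative score,
    and the result always stays within [-1, 1].
    """
    return -math.tanh(slope * (years_ago - x0))


if __name__ == "__main__":
    for years in (0.0, 1.0, 1.5, 3.0, 10.0, 50.0):
        print(years, step_map(years), round(linear_map(years), 4), round(sigmoid_map(years), 4))
```

Printing the values shows the trade-off described on the slide. The step map flips only at the cutoff, the linear map barely distinguishes 1 year from 10, and the sigmoid spends most of its range around x0, where recency differences matter.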

Static Rank

[Diagram labels: PubDate, IsNews, MediaType; AgoValueSource; MuxValueSource (0, T, F); EnumValueSource; SumValueSource; SigmoidValueSource; intermediate values in years ago, final output from -1 to 1]
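
The chain in the diagram can be sketched in plain Python to show how "years ago" acts as the common unit. This is not the Solr ValueSource code, and several details are assumptions: the mux is taken to pass the document age through for news and 0 otherwise, the media table values are hypothetical (chosen only so that CD ranks over cassette), and the sigmoid is a negated tanh because the slides do not give the exact formula.

```python
import math
from datetime import date

# Hypothetical years-ago offsets per media type; only the ordering
# "CD over cassette" comes from the slides.
MEDIA_YEARS_AGO = {"CD": 0.0, "Cassette": 5.0}


def ago_years(pub_date, today):
    """AgoValueSource stand-in: document age in years."""
    return (today - pub_date).days / 365.25


def mux(flag, if_true, if_false):
    """MuxValueSource stand-in: a boolean field selects one of two inputs."""
    return if_true if flag else if_false


def enum_years(value, table, default=0.0):
    """EnumValueSource stand-in: fixed vocabulary mapped to years ago."""
    return table.get(value, default)


def sigmoid(years_ago, slope=1.0, x0=1.5):
    """SigmoidValueSource stand-in: squash years ago into [-1, 1], zero at x0."""
    return -math.tanh(slope * (years_ago - x0))


def static_score(doc, today, media_table=MEDIA_YEARS_AGO):
    age = ago_years(doc["PubDate"], today)
    # Assumption: age only counts for news; other documents contribute 0.
    date_term = mux(doc["IsNews"], age, 0.0)
    media_term = enum_years(doc["MediaType"], media_table)
    total_years_ago = date_term + media_term  # SumValueSource stand-in
    return sigmoid(total_years_ago)


if __name__ == "__main__":
    today = date(2012, 1, 1)
    docs = [
        {"PubDate": date(2011, 7, 1), "IsNews": True, "MediaType": "CD"},
        {"PubDate": date(2002, 1, 1), "IsNews": True, "MediaType": "CD"},
        {"PubDate": date(2011, 7, 1), "IsNews": True, "MediaType": "Cassette"},
    ]
    for doc in docs:
        print(doc["PubDate"], doc["MediaType"], round(static_score(doc, today), 4))
```

Because every component emits years ago, adding a new static feature only requires deciding how many years of recency it is worth. The single sigmoid at the end keeps the final score bounded.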

SigmoidValueSource Config

Static Rank Config

Conclusion

• solr.ValueSource/Parser - fast and flexible
• CScore = DScore * (100 + S% * SScore)
• -1.0 < SScore < 1.0
• “Time” as a common currency for static features