Investigating per Topic Upper Bound for Session Search ...zhiwen.georgetown.domains/slides/ICTIR17_evaluation_V6.pdf · "The water filling model and the cube test: multi-dimensional

Investigating per Topic Upper Bound for Session Search Evaluation

Zhiwen Tang

Department of Computer Science, Georgetown University

Grace Hui Yang

[email protected] [email protected]

Session Search

• Multiple runs of search

• Complex information need

• Evaluation needs to consider the whole process

Evaluation of Session Search

• Useful information that the user gains
  • Raw relevance score

• Discounting
  • Based on document ranking
  • Based on diversity

• User's efforts
  • Time spent
  • Lengths of documents being viewed

The Problem

• Most session search metrics fold all of those factors into one overwhelmingly complex formula

• The optimal value (the upper bound) of those metrics varies greatly across search topics

• In Cranfield-like settings (e.g., TREC), this difference is often ignored

Toy example

• Two systems

• All the systems return 5 docs per round

• Each system conducts one round of interaction

• Metric: Cube Test
  • Luo, Jiyun, et al. "The water filling model and the cube test: multi-dimensional evaluation for professional search." CIKM, 2013.

Toy example

\[
CT = \frac{\sum_{i=1}^{L} \sum_{j=1}^{|list_i|} \sum_{c} \theta_c \, rel(i,j) \, \gamma^{n(c,i,j-1)}}{\sum_{i=1}^{L} \sum_{j=1}^{|list_i|} cost(i,j)}
\]

Relevance score regarding topic-subtopic (subtopic columns: 1-1, 1-2, 2-1, 2-2, 2-3, 2-4, 2-5)

Doc   Scores
d1    1, 4
d2    3, 4
d3    4
d4    4
d5    4

System    Topic 1                          CT-topic1   Topic 2                  CT-topic2   CT-avg   Normalized CT-avg
System1   d1, irrel, irrel, irrel, irrel   1           d1, d3, d4, d5, irrel    16          8.5      0.596
System2   d2, irrel, irrel, irrel, irrel   3           d1, d3, d4, d5, irrel    14          8.5      0.787
Optimal   d1, d2, irrel, irrel, irrel      4           d1, d2, d3, d4, d5       17

Research Questions

• What is the optimal metric value that a system can achieve?

• How to get the upper bound for each search topic?

• How does it affect the evaluation conclusions?
  • Variance of different topics
  • Normalization

\[
score_A = \sum_{topic} \frac{raw\_score(topic, A) - lower\_bound(topic)}{upper\_bound(topic) - lower\_bound(topic)}
\]
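The normalization can be checked against the numbers in the toy example. The sketch below is only an illustration: it assumes a lower bound of 0 for every topic and averages the normalized scores over topics, which is what reproduces the "Normalized CT-avg" column; the function and variable names are hypothetical.

```python
def normalized_score(raw, upper, lower=0.0):
    """Rescale a raw metric value using per-topic bounds (lower bound assumed 0)."""
    return (raw - lower) / (upper - lower)

# Raw Cube Test values from the toy example, as (topic 1, topic 2).
raw_ct = {"System1": (1, 16), "System2": (3, 14)}
# Per-topic optimal values taken from the "Optimal" row.
upper = (4, 17)

def mean_normalized(system):
    """Average the per-topic normalized scores of one system."""
    scores = [normalized_score(r, u) for r, u in zip(raw_ct[system], upper)]
    return sum(scores) / len(scores)

for name, values in raw_ct.items():
    raw_avg = sum(values) / len(values)
    print(name, raw_avg, round(mean_normalized(name), 3))
```

Both systems have the same raw average (8.5), but after normalization System2 scores higher because it gets closer to the optimum on the topic with the small upper bound.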

Session search metrics

• Session-DCG (sDCG)
  • Järvelin, Kalervo, et al. "Discounted cumulated gain based evaluation of multiple-query IR sessions." Advances in Information Retrieval (2008): 4-15.

• Cube Test (CT)
  • Luo, Jiyun, et al. "The water filling model and the cube test: multi-dimensional evaluation for professional search." CIKM, 2013.

• Expected Utility (EU)
  • Yang, Yiming, and Abhimanyu Lad. "Modeling expected utility of multi-session information distillation." Conference on the Theory of Information Retrieval. Springer, Berlin, Heidelberg, 2009.

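\[
EU = \sum_{\omega} P(\omega) \left( \sum_{(i,j) \in \omega} \left( \sum_{c \in d_{i,j}} \theta_c \, \gamma^{n(c,i,j-1)} - a \cdot cost(i,j) \right) \right)
\]

\[
CT = \frac{\sum_{i=1}^{L} \sum_{j=1}^{|list_i|} \sum_{c} \theta_c \, rel(i,j) \, \gamma^{n(c,i,j-1)}}{\sum_{i=1}^{L} \sum_{j=1}^{|list_i|} cost(i,j)}
\]

\[
sDCG = \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} \frac{rel(i,j)}{(1 + \log_b j)(1 + \log_{bq} i)}
\]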
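As a concrete reading of the sDCG formula, the following sketch evaluates it for a small session. The layout of `session` (one list of relevance scores per search iteration) and the defaults `b=2`, `bq=4` are assumptions made for illustration; they are not stated on the slides.

```python
import math

def sdcg(session, b=2.0, bq=4.0):
    """Session DCG.

    session[i-1][j-1] holds rel(i, j): the relevance of the j-th document
    returned in the i-th search iteration (i and j are 1-based in the formula).
    """
    total = 0.0
    for i, ranked_list in enumerate(session, start=1):
        query_discount = 1.0 + math.log(i, bq)      # 1 + log_bq(i)
        for j, rel in enumerate(ranked_list, start=1):
            rank_discount = 1.0 + math.log(j, b)    # 1 + log_b(j)
            total += rel / (rank_discount * query_discount)
    return total

# Two iterations: the first returns two documents, the second returns one.
example = [[4, 0], [2]]
print(sdcg(example))
```

In this example the document in the second iteration is divided by 1 + log_4(2) = 1.5, so it contributes 2 / 1.5 on top of the 4 gained in the first iteration.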

Deconstruct the metrics

• Gain
  • The amount of useful information a user can learn from a document

• Cost
  • The effort the user spends on that document

• Ranking discounts:
  • Based on the original ranking position of a document
  • Assumption: the lower a document ranks, the less likely the user is to read it

• Novelty discounts:
  • Measure the user's knowledge coverage; a general form of ranking discount
  • Assumption: if a document is related to a subtopic/nugget that the user has read before, it contributes less novel information about that subtopic/nugget

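Legend: Cost, Gain, Rank_discount, Novelty_discount

• sDCG

\[
sDCG = \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} \frac{rel(i,j)}{(1 + \log_b j)(1 + \log_{bq} i)}
\]

• Cube Test

\[
CT = \frac{\sum_{i=1}^{L} \sum_{j=1}^{|list_i|} \sum_{c} \theta_c \, rel(i,j) \, \gamma^{n(c,i,j-1)}}{\sum_{i=1}^{L} \sum_{j=1}^{|list_i|} cost(i,j)}
\]

• Expected Utility

\[
EU = \sum_{\omega} P(\omega) \left( \sum_{(i,j) \in \omega} \left( \sum_{c \in d_{i,j}} \theta_c \, \gamma^{n(c,i,j-1)} - a \cdot cost(i,j) \right) \right)
\]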

• sDCG

\[
sDCG = DiscountedGain = \sum_{d} rank\_discount_d \cdot gain_d
\]

• Cube Test

\[
CT = \frac{DiscountedGain}{Cost} = \frac{\sum_{d} \sum_{c} novelty\_discount_{d,c} \cdot gain_{d,c}}{\sum_{d} cost_d}
\]

• Expected Utility

\[
EU = DiscountedGain - DiscountedCost = \sum_{d} \sum_{c} novelty\_discount_{d,c} \cdot rank\_discount_d \cdot gain_{d,c} - \sum_{d} rank\_discount_d \cdot cost_d
\]
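The three decomposed forms can be written as small functions over the same four ingredients. This is a minimal sketch under assumed data layouts (one entry per document `d`, and one inner entry per subtopic `c`); the function names are hypothetical and the discounts are passed in precomputed rather than derived from a ranking.

```python
def sdcg_from_parts(gain, rank_discount):
    """sDCG = sum_d rank_discount_d * gain_d."""
    return sum(r * g for r, g in zip(rank_discount, gain))

def ct_from_parts(gain_dc, novelty_dc, cost):
    """CT = (sum_d sum_c novelty_discount_{d,c} * gain_{d,c}) / sum_d cost_d."""
    discounted_gain = sum(
        n * g
        for gains, novelties in zip(gain_dc, novelty_dc)
        for g, n in zip(gains, novelties)
    )
    return discounted_gain / sum(cost)

def eu_from_parts(gain_dc, novelty_dc, rank_discount, cost):
    """EU = discounted gain minus discounted cost, both weighted by rank_discount_d."""
    discounted_gain = sum(
        r * n * g
        for gains, novelties, r in zip(gain_dc, novelty_dc, rank_discount)
        for g, n in zip(gains, novelties)
    )
    discounted_cost = sum(r * c for r, c in zip(rank_discount, cost))
    return discounted_gain - discounted_cost

# Two documents, two subtopics; the second document repeats subtopic 0.
gain_dc = [[4, 0], [4, 1]]
novelty_dc = [[1.0, 1.0], [0.5, 1.0]]
rank_discount = [1.0, 0.5]
cost = [1.0, 1.0]
print(ct_from_parts(gain_dc, novelty_dc, cost))
print(eu_from_parts(gain_dc, novelty_dc, rank_discount, cost))
```

Written this way, every metric is a sum of products between a discount sequence and a gain or cost sequence, which is the shape the optimization step relies on.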

Optimization Method

• Factors considered in the metrics:
  • Gain, Cost, Ranking discount, Novelty discount

• We are dealing with rankings
  • How to maximize/minimize the discounted sum?

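Our solution

• Rearrangement Inequality

\[
x_1 y_n + x_2 y_{n-1} + \dots + x_n y_1 \le x_{\sigma(1)} y_1 + x_{\sigma(2)} y_2 + \dots + x_{\sigma(n)} y_n \le x_1 y_1 + x_2 y_2 + \dots + x_n y_n
\]

for $x_1 \le x_2 \le \dots \le x_n$ and $y_1 \le y_2 \le \dots \le y_n$, where $\sigma$ is any permutation of $1, \dots, n$

• In IR, the Probability Ranking Principle [4]
  • The overall effectiveness of an IR system is best achieved by ranking the documents by their usefulness in descending order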
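The rearrangement inequality can be verified by brute force on a small instance. The sketch below pairs a list of gains with a list of discounts in every possible order and compares the extremes with the sorted pairings; the numbers and names are made up for illustration.

```python
from itertools import permutations

def paired_sum(xs, ys):
    """Sum of element-wise products of two equally long sequences."""
    return sum(x * y for x, y in zip(xs, ys))

def rearrangement_bounds(xs, ys):
    """Lower and upper bound of paired_sum over all re-orderings of xs."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    lower = paired_sum(xs_sorted, reversed(ys_sorted))  # opposite order
    upper = paired_sum(xs_sorted, ys_sorted)            # same order
    return lower, upper

gains = [4, 1, 3, 0]
discounts = [1.0, 0.5, 0.25, 0.125]

lower, upper = rearrangement_bounds(gains, discounts)
all_sums = [paired_sum(p, discounts) for p in permutations(gains)]
print(lower, min(all_sums))
print(upper, max(all_sums))
```

The maximum is reached when the largest gain meets the largest discount factor, which is the same intuition as ranking documents by usefulness in descending order.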

Our solution

• But in our problem:
  • Multiple ranking lists need to be optimized simultaneously
  • E.g., maximize the gain on all the subtopics simultaneously

• How?
  • Optimize each required ranking list independently to approximate the overall bound

sDCG

• Only one ranking list needs to be optimized

\[
sDCG = DiscountedGain = \sum_{d} rank\_discount_d \cdot gain_d
\]

\[
\text{maximize} \; \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} \frac{rel(i,j)}{(1 + \log_b j)(1 + \log_{bq} i)}
\]
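Because sDCG involves a single ranking list, its upper bound for a topic follows directly from the rearrangement inequality: sort the available relevance scores and the slot discounts in the same order and pair them. The sketch below assumes a fixed number of documents per iteration, that each document is returned at most once in a session, and the defaults `b=2`, `bq=4`; none of these are specified on the slides, and the names are hypothetical.

```python
import math

def sdcg_discount(i, j, b=2.0, bq=4.0):
    """Discount factor of rank j in iteration i (both 1-based)."""
    return 1.0 / ((1.0 + math.log(j, b)) * (1.0 + math.log(i, bq)))

def sdcg_upper_bound(relevance_scores, iterations, docs_per_iteration,
                     b=2.0, bq=4.0):
    """Pair the best documents of a topic with the least-discounted slots."""
    discounts = sorted(
        (sdcg_discount(i, j, b, bq)
         for i in range(1, iterations + 1)
         for j in range(1, docs_per_iteration + 1)),
        reverse=True,
    )
    gains = sorted(relevance_scores, reverse=True)
    # zip stops at the shorter sequence: surplus slots add nothing,
    # and surplus documents cannot be shown.
    return sum(g * d for g, d in zip(gains, discounts))

# A topic with three relevant documents, two iterations of two documents each.
print(sdcg_upper_bound([4, 3, 1], iterations=2, docs_per_iteration=2))
```

With these parameters the first slot of the second iteration (factor 1/1.5) is discounted less than the second slot of the first iteration (factor 1/2), so the second-best document is paired with the former.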

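Cube Test (CT)

• #(C) + 1 ranking lists need to be optimized

\[
CT = \frac{\sum_{c} \theta_c \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} rel(i,j) \, \gamma^{n(c,i,j-1)}}{\sum_{i=1}^{L} \sum_{j=1}^{|list_i|} cost(i,j)}
\]

\[
CT = \frac{DiscountedGain}{Cost} = \frac{\sum_{d} \sum_{c} novelty\_discount_{d,c} \cdot gain_{d,c}}{\sum_{d} cost_d}
\]

\[
\text{maximize} \; \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} rel_c(i,j) \, \gamma^{\sum_{k=1}^{i-1} |list_k| + j - 1} \quad \forall c
\]

\[
\text{minimize} \; \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} cost(i,j)
\]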
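One way to read the "#(C) + 1 ranking lists" idea is sketched below: each subtopic gets its own best-case ranking for the gain, the cost gets its own best-case selection, and the two are combined into an approximate bound. This is an illustration under assumptions, not the authors' implementation: the data layout, the function name, the fixed number of slots, and `gamma=0.5` are all hypothetical.

```python
def ct_upper_bound(rel_by_subtopic, theta, costs, num_slots, gamma=0.5):
    """Approximate per-topic upper bound of the Cube Test.

    rel_by_subtopic[c]: relevance scores of the topic's documents on subtopic c.
    theta[c]: weight of subtopic c.
    costs: cost of each candidate document.
    num_slots: total number of documents shown over the whole session.
    """
    best_gain = 0.0
    for c, scores in rel_by_subtopic.items():
        # One ranking list per subtopic: best scores first.
        ranked = sorted(scores, reverse=True)[:num_slots]
        # Position p (0-based over the whole session) is discounted by gamma**p.
        best_gain += theta[c] * sum(rel * gamma ** p for p, rel in enumerate(ranked))
    # One more list for the cost: the cheapest documents fill the slots.
    least_cost = sum(sorted(costs)[:num_slots])
    return best_gain / least_cost

rel_by_subtopic = {"c1": [4, 2], "c2": [4]}
theta = {"c1": 0.5, "c2": 0.5}
print(ct_upper_bound(rel_by_subtopic, theta, costs=[1, 1, 1], num_slots=2))
```

Since each subtopic is optimized as if it had the top positions to itself, the result is an approximation of the bound rather than a value some single ranking is guaranteed to reach.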

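Expected Utility (EU)

• An approximation of EU [2]

• ω: the subset of documents the user checked

• #(C) + 1 ranking lists need to be optimized

\[
EU = \frac{1}{1 - \gamma} \sum_{c} \theta_c \left( 1 - \gamma^{\sum_{\omega} P(\omega) \, n(c,\omega)} \right) - a \sum_{\omega} P(\omega) \, len(\omega)
\]

\[
\text{maximize} \; \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} rel_c(i,j) \, (1 - p)^{j-1} \quad \forall c
\]

\[
\text{minimize} \; \sum_{i=1}^{L} \sum_{j=1}^{|list_i|} cost(i,j) \, (1 - p)^{j-1}
\]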
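The factor 1/(1 − γ) in the approximated EU is consistent with a geometric series. As a sketch, assume the k-th relevant document the user reads on a subtopic is discounted by γ^{k−1}, so that n such documents contribute:

```latex
\[
\sum_{k=1}^{n} \gamma^{k-1} = \frac{1 - \gamma^{n}}{1 - \gamma}, \qquad 0 < \gamma < 1
\]
```

Replacing n with the expected count \(\sum_{\omega} P(\omega)\, n(c,\omega)\) and weighting by \(\theta_c\) gives a gain term of the same form as the one in the approximation.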

Experiments

• Dataset:
  • Submitted runs of the TREC 2016 Dynamic Domain track
  • Some statistics of the TREC 2016 DD corpus:
    • # Topics = 53
    • # Subtopics = 242
    • # Relevant docs = 14597

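Bounds on different topics

[Figure: bounds on different topics for sDCG = DiscountedGain]

[Figure: bounds on different topics for CT = DiscountedGain / Cost]

[Figure: bounds on different topics for EU = DiscountedGain - DiscountedCost]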

Conclusion 1

• The optimal value of a metric differs widely across topics, and this difference should not be ignored.

Normalization Effect

[Figure: normalization effect for sDCG = DiscountedGain]

[Figure: normalization effect for CT = DiscountedGain / Cost]

[Figure: normalization effect for EU = DiscountedGain - a * DiscountedCost, a = 0.01]

[Figure: normalization effect for EU = DiscountedGain - a * DiscountedCost, a = 0.001]

Conclusion 2

• Using the bounds for normalization makes the evaluation fairer

Summary

• Deconstruction of session search metrics

• Computing the upper bound on each search topic

• Huge variance in the upper bounds among topics

• Normalization provides another viewpoint

Discussion

• Can this bound help us design a better session search system?

• Lazy user, smart system

• If the system has completed the first k iterations and knows its actual score

• If it also knows the upper bound score for k+1 iterations

• Stop or continue?

Resource

• Used in this year's TREC-DD evaluation
  • https://github.com/trec-dd/trec-dd-jig
  • http://trec-dd.org/

Thank you!

Reference

• [1] Kalervo Järvelin, Susan L. Price, Lois M. L. Delcambre, and Marianne Lykke Nielsen. 2008. Discounted cumulated gain based evaluation of multiple-query IR sessions. In European Conference on Information Retrieval. Springer, 4-15.

• [2] Jiyun Luo, Christopher Wing, Hui Yang, and Marti Hearst. 2013. The water filling model and the cube test: multi-dimensional evaluation for professional search. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 709-714.

• [3] Yiming Yang and Abhimanyu Lad. 2009. Modeling expected utility of multi-session information distillation. In Conference on the Theory of Information Retrieval. Springer, 164-175.

• [4] Stephen E. Robertson. 1977. The probability ranking principle in IR. Journal of Documentation 33, 4 (1977), 294-304.
