Fast Mining of Interesting Phrases from Subsets of Text Corpora

IBM Research - India, Bengaluru, India

April 19, 2023

Deepak P, Atreyee Dey, Debapriyo Majumdar*

IBM Research - India, Bengaluru, India

EDBT 2014 Conference, Athens, Greece

*presently with Indian Statistical Institute, Kolkata, India

Problem Description

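[Figure: a keyword query (e.g., "ukraine, crimea …") selects the Chosen Subset D’ from the Text Corpus D; the output is a ranked list of phrases with their interestingness scores.]

Crimea independence, 0.90
USA Russia Relations, 0.85
G8 Membership, 0.81
……

$$ID(p, D') = \frac{freq(p, D')}{freq(p, D)}$$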

Given a text corpus D and a subset D’ specified by a keyword query, find the top-k interesting phrases for D’ with respect to D.
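The problem statement can be made concrete with a brute-force baseline that evaluates ID(p, D') = freq(p, D') / freq(p, D) directly. This is a minimal sketch, assuming that freq(p, X) counts the documents in X containing p, that containment is plain substring matching, and that the candidate phrases are given up front. The toy corpus and the helper names (`docs_with`, `interestingness`, `top_k_phrases`) are illustrative and not taken from the talk.

```python
def docs_with(corpus, term):
    """IDs of documents whose text contains `term` (plain substring match)."""
    return {doc_id for doc_id, text in corpus.items() if term in text}


def interestingness(corpus, phrase, subset):
    """ID(p, D') = freq(p, D') / freq(p, D), with freq taken as document frequency."""
    in_corpus = docs_with(corpus, phrase)
    if not in_corpus:
        return 0.0
    return len(in_corpus & subset) / len(in_corpus)


def top_k_phrases(corpus, phrases, query_words, k):
    """Select D' with an AND keyword query, then rank the candidate phrases by ID."""
    subset = set(corpus)
    for word in query_words:
        subset &= docs_with(corpus, word)
    scored = [(p, interestingness(corpus, p, subset)) for p in phrases]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy corpus: document ID -> text.
corpus = {
    1: "crimea independence vote in ukraine",
    2: "ukraine crimea independence referendum",
    3: "independence day celebrations",
    4: "ukraine crimea talks",
}
candidates = ["independence", "crimea independence"]
result = top_k_phrases(corpus, candidates, ["ukraine", "crimea"], k=2)
print(result)
```

This baseline touches every candidate phrase and every document for each query, which is the cost that index-based methods aim to avoid.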


Earlier Approaches

Phrase Indexing, Simitsis et al., VLDB 2008: O(|P|)

p1: d12 d13 d30 d9901

p9876: d1 d11 d305 d8100

Document Indexing, Bedathur et al., VLDB 2010 and Gao & Michel, EDBT 2012: O(|D’|)

d1: p5 p43 p167 p8970

d9998: p23 p49 p305 p9987
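The two index layouts above can be contrasted with a small sketch. It assumes the O(|P|) and O(|D’|) labels refer to the number of index lists a query has to touch: one list per phrase for phrase indexing, one list per document of the chosen subset for document indexing. The function names, the toy data, and the separate `phrase_freq` table are hypothetical and only illustrate the access pattern, not the cited systems.

```python
from collections import Counter


def rank(scores, k):
    """Top-k (phrase, score) pairs, ties broken by phrase name for determinism."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


def top_k_via_phrase_index(phrase_index, subset, k):
    """Phrase index: visit every phrase list, so work grows with |P|."""
    scores = {}
    for phrase, docs in phrase_index.items():
        hits = len(set(docs) & subset)
        if hits:
            scores[phrase] = hits / len(docs)
    return rank(scores, k)


def top_k_via_document_index(doc_index, phrase_freq, subset, k):
    """Document index: visit one list per document in D', so work grows with |D'|."""
    counts = Counter()
    for doc_id in subset:
        counts.update(doc_index.get(doc_id, ()))
    scores = {p: c / phrase_freq[p] for p, c in counts.items()}
    return rank(scores, k)


# Hypothetical toy data: the same corpus seen through both index layouts.
phrase_index = {"p1": [1, 2, 3], "p2": [2, 4], "p3": [5]}
doc_index = {1: ["p1"], 2: ["p1", "p2"], 3: ["p1"], 4: ["p2"], 5: ["p3"]}
phrase_freq = {p: len(docs) for p, docs in phrase_index.items()}
subset = {2, 4}

by_phrase = top_k_via_phrase_index(phrase_index, subset, k=2)
by_document = top_k_via_document_index(doc_index, phrase_freq, subset, k=2)
print(by_phrase, by_document)
```

Both routes compute the same freq(p, D') / freq(p, D) ratio; they differ only in which lists are scanned.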


Estimating Interestingness: AND Query

Consider an AND query composed of k keywords

Q = {Q1, Q2, …, Qk}

$$ID(p, D') = \frac{freq(p, D')}{freq(p, D)} = \frac{\#Docs(\{Q_1, \ldots, Q_k\}) \cdot p(p \mid Q_1, \ldots, Q_k)}{\#Docs(p)}$$

$$\frac{p(p \mid Q_1, \ldots, Q_k)}{p(p)} \propto p(Q_1, \ldots, Q_k \mid p)$$


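Query Word Independence Assumption

Consider an AND query of two words, Q1 and Q2

We would like to estimate p(P1|Q1, Q2) as an estimate of the interestingness of P1

Instead, we could estimate p(Q1, Q2|P1) (as shown in the previous slide)

$$p(Q_1, Q_2 \mid P_1) = p(Q_1 \mid Q_2, P_1)\, p(Q_2 \mid P_1) \approx p(Q_1 \mid P_1)\, p(Q_2 \mid P_1)$$

For k query words, the same assumption gives

$$p(Q_1, \ldots, Q_k \mid p) \approx \prod_{i=1}^{k} p(Q_i \mid p)$$

Taking logarithms preserves the ranking of phrases and turns the product into a sum:

$$\log \prod_{i=1}^{k} p(Q_i \mid p) = \sum_{i=1}^{k} \log p(Q_i \mid p)$$

For OR query handling details, refer to the paper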
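To see what the independence assumption trades away, the exact quantity p(Q1, …, Qk | p) can be compared with the sum of log p(Qi | p) on a toy example. This is a minimal sketch, assuming every probability is estimated from document sets (the IDs of the documents containing a word or phrase). The function names and the data are hypothetical.

```python
import math


def cond_prob(word_docs, phrase_docs):
    """p(Q_i | p): fraction of the documents containing p that also contain Q_i."""
    return len(word_docs & phrase_docs) / len(phrase_docs)


def exact_joint(query_doc_sets, phrase_docs):
    """p(Q_1, ..., Q_k | p) computed without any independence assumption."""
    subset = set.intersection(*query_doc_sets)
    return len(subset & phrase_docs) / len(phrase_docs)


def independence_log_score(query_doc_sets, phrase_docs):
    """sum_i log p(Q_i | p); -inf if some query word never co-occurs with p."""
    total = 0.0
    for word_docs in query_doc_sets:
        prob = cond_prob(word_docs, phrase_docs)
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


# Hypothetical document sets for two query words and one phrase.
docs_q1 = {1, 2, 3, 4}
docs_q2 = {2, 3, 4, 5}
docs_p = {2, 3, 6, 7}

exact = exact_joint([docs_q1, docs_q2], docs_p)
approx_log = independence_log_score([docs_q1, docs_q2], docs_p)
print(exact, math.exp(approx_log))
```

In this toy case the exact value and the independence estimate differ, which is expected: the approximation only needs to preserve the ranking of phrases, not the absolute scores.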


Our Disk-Resident Indexes

w1: p30 (0.23), p12 (0.21), p990 (0.18), p13 (0.002)

w9876: p810 (0.1), p11 (0.08), p305 (0.007), p8 (0.0001)

The score that is stored along with each phrase is p(w|p). All values are stored in decreasing order of score.
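A word-specific index of this shape could be built as follows. This is a minimal sketch, assuming p(w|p) is estimated as the fraction of documents containing phrase p that also contain word w, and that document sets per word and per phrase are already available. The function name `build_word_index` and the toy data are hypothetical, and the on-disk layout is not modelled.

```python
from collections import defaultdict


def build_word_index(phrase_docs, word_docs):
    """Map each word to a list of (phrase, p(w|p)) sorted by decreasing score."""
    index = defaultdict(list)
    for word, wdocs in word_docs.items():
        for phrase, pdocs in phrase_docs.items():
            both = len(wdocs & pdocs)
            if both:  # phrases that never co-occur with the word are not stored
                index[word].append((phrase, both / len(pdocs)))
        index[word].sort(key=lambda entry: entry[1], reverse=True)
    return dict(index)


# Hypothetical document sets: phrase -> doc IDs, word -> doc IDs.
phrase_docs = {"p1": {1, 2, 3, 4}, "p2": {2, 3}, "p3": {5}}
word_docs = {"w1": {1, 2, 3}, "w2": {3, 5}}

index = build_word_index(phrase_docs, word_docs)
print(index)
```

Sorting each list by score is what later allows a top-k algorithm to stop after reading only a prefix of every list.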


Aggregation Approach: NRA

We use the well-known NRA (No Random Access) algorithm to aggregate the lists corresponding to the query words and arrive at the top phrases

At any point, we have upper and lower bounds on the score of each phrase seen so far. An example of sum aggregation is shown below

w1: (P1, 0.04167), (P5, 0.0333)

w2: (P103, 0.26), (P1, 0.113)

P1 – [0.1547, 0.1547]
P5 – [0.0333, 0.1463]
P103 – [0.26, 0.2933]

……
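The bound bookkeeping behind NRA can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes non-negative scores summed across lists (as in the example, whereas the method itself aggregates log p(Qi|p)), a score of 0 for a phrase missing from a list, and round-robin sorted access at equal depth on every list. The function names are hypothetical.

```python
def bounds_at_depth(lists, depth):
    """[lower, upper] bounds per seen phrase after `depth` >= 1 reads on every list."""
    seen = {}       # phrase -> {word: score read so far}
    last_seen = {}  # word -> score at the current read position (0.0 once exhausted)
    for word, entries in lists.items():
        for phrase, score in entries[:depth]:
            seen.setdefault(phrase, {})[word] = score
        last_seen[word] = entries[depth - 1][1] if depth <= len(entries) else 0.0
    table = {}
    for phrase, scores in seen.items():
        lower = sum(scores.values())
        # An unread score in a list can be at most the last value read from that list.
        upper = lower + sum(v for w, v in last_seen.items() if w not in scores)
        table[phrase] = (lower, upper)
    return table, sum(last_seen.values())  # second value bounds any unseen phrase


def nra_top_k(lists, k):
    """Read deeper until the k best lower bounds beat every other upper bound."""
    max_len = max((len(entries) for entries in lists.values()), default=0)
    for depth in range(1, max_len + 2):
        table, unseen_upper = bounds_at_depth(lists, depth)
        ranked = sorted(table, key=lambda p: table[p][0], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        threshold = max([table[p][1] for p in rest] + [unseen_upper])
        if len(top) == k and table[top[-1]][0] >= threshold:
            return top, depth
    return top, depth


# The two score-sorted lists from the example, two entries each.
lists = {
    "w1": [("P1", 0.04167), ("P5", 0.0333)],
    "w2": [("P103", 0.26), ("P1", 0.113)],
}
table, unseen_upper = bounds_at_depth(lists, 2)
print(table)
print(nra_top_k(lists, k=1))
```

Recomputing the table at every depth keeps the sketch short; a practical implementation would update the bounds incrementally as each entry is read.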


Our In-Memory Indexes

w1: p12 (0.21), p13 (0.002), p30 (0.23), p990 (0.18)

w9876: p8 (0.0001), p11 (0.08), p305 (0.007), p810 (0.1)

The score that is stored along with each phrase is p(w|p). All values are stored in PhraseID-sorted order.

Indexes may be created by preserving just the top-10% values of each list

We will use simple Sort-Merge-Join on these lists for In-Memory operation
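A sort-merge join over PhraseID-sorted lists might look like the sketch below. It assumes integer phrase IDs, an AND query scored by the sum of log p(w|p), and pruning that keeps the highest-scoring fraction of each list before re-sorting by phrase ID. The list for `w1` reuses the values shown above; the list for `w2`, the 50% pruning fraction in the usage note, and all function names are hypothetical.

```python
import math


def keep_top_fraction(entries, fraction):
    """Keep the highest-scoring `fraction` of a list, returned in phrase-ID order."""
    keep = max(1, math.ceil(len(entries) * fraction))
    best = sorted(entries, key=lambda entry: entry[1], reverse=True)[:keep]
    return sorted(best)


def merge_join(a, b):
    """Join two phrase-ID-sorted lists of (phrase_id, score), adding the scores."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return out


def and_query_top_k(index, words, k):
    """Score phrases present in every query word's list by sum of log p(w|p)."""
    lists = [[(pid, math.log(prob)) for pid, prob in index[w]] for w in words]
    joined = lists[0]
    for other in lists[1:]:
        joined = merge_join(joined, other)
    return sorted(joined, key=lambda entry: entry[1], reverse=True)[:k]


index = {
    "w1": [(12, 0.21), (13, 0.002), (30, 0.23), (990, 0.18)],
    "w2": [(12, 0.05), (30, 0.10), (305, 0.2)],  # hypothetical second list
}
top = and_query_top_k(index, ["w1", "w2"], k=2)
print(top)
```

Calling `keep_top_fraction(index["w1"], 0.5)` keeps only phrases 12 and 30, which shows how pruning shrinks the lists at the cost of possibly dropping low-scoring join partners.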


Example Results

Query: trade reserves (Reuters Dataset)

– economic minister

– reserves

– taiwan’s foreign exchange reserves

– economic planning

– economic planning and development


Result Quality Evaluation

[Figure: result quality (Prec, MRR, NDCG, MAP) for 20-AND and 50-AND on the PubMed Dataset; y-axis from 0.9 to 1.]


Running Times: Disk-based Operation (NRA)

[Figure: running times of AND-NRA, OR-NRA, AND-GM, and OR-GM on the PubMed Dataset; y-axis on a log scale from 100 to 10000000; x-axis: Percentage of NRA Lists Traversed, from 0 to 100.]


Percentages of Lists Traversed (NRA)

[Figure: percentage of lists traversed by NRA for Reuters-AND, Reuters-OR, Pubmed-AND, and Pubmed-OR; axis from 27 to 34.]


Running Times: Mem-based Operation (SMJ)

[Figure: running times of AND-SMJ, OR-SMJ, AND-GM, and OR-GM on the PubMed Dataset; y-axis on a log scale from 1 to 10000000; x-axis: Percentage of Entries Stored, from 0 to 100.]


Shortcomings

Index Sizes

– Earlier approaches index only phrases and documents

– Our method has word-specific indexes, with each word having a list in the index

– The number of words across documents could be much larger than the number of phrases

– If we would like to support querying over all possible words, index sizes could get large

Queries on Metadata Facets

– Instead of using keyword queries, document subsets could also be chosen using metadata facets

– E.g., venue:sigmod AND year:2007, on a set of scholarly publications

– Our independence assumption has not yet been tested on metadata facets


Summary

Proposed an approach for the problem of mining interesting phrases from subsets of text corpora

Outlined the query word independence assumption that is seen to be empirically useful in accurately identifying interesting phrases

Our approach is seen to be up to 90% accurate, while being able to achieve turnaround times that are orders of magnitude better than those of the current techniques

Future Work

– Other potential avenues for leveraging the independence assumption for phrase analytics

– Methods to speed up interesting phrase mining over metadata facets

Thank You

Questions, Comments, Suggestions?
