Fast Mining of Interesting Phrases from Subsets of Text Corpora

Page 1: Fast Mining of Interesting Phrases from Subsets of Text Corpora

IBM Research - India, Bengaluru, India

April 19, 2023 1

Fast Mining of Interesting Phrases from Subsets of Text Corpora

Deepak P, Atreyee Dey, Debapriyo Majumdar*

IBM Research - India, Bengaluru, INDIA

EDBT 2014 Conference, Athens, Greece

*presently with Indian Statistical Institute, Kolkata, India

Page 2: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Problem Description

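Given a text corpus D and a subset D’ specified by a keyword query, find the top-k interesting phrases for D’ with respect to D

Example: the query "ukraine, crimea …" chooses a subset D’ of the text corpus D, and the interesting phrases for that subset, with their scores, are:

– Crimea independence, 0.90

– USA Russia Relations, 0.85

– G8 Membership, 0.81

– …

Interestingness of a phrase p:

$$ID(p, D') = \frac{\mathrm{freq}(p, D')}{\mathrm{freq}(p, D)}$$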
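The interestingness ratio ID(p, D') = freq(p, D') / freq(p, D) can be illustrated with a brute-force sketch. It assumes that freq counts documents containing the phrase and that D' is selected by an AND query; the corpus layout, `TOY_CORPUS`, and the function name are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical toy corpus: doc_id -> (set of words, set of phrases).
TOY_CORPUS = {
    "d1": ({"ukraine", "crimea", "vote"}, {"crimea independence", "g8 membership"}),
    "d2": ({"ukraine", "crimea"}, {"crimea independence"}),
    "d3": ({"ukraine", "trade"}, {"g8 membership", "trade deal"}),
    "d4": ({"sports"}, {"trade deal"}),
}

def top_k_interesting(corpus, query_words, k):
    """Rank phrases by freq(p, D') / freq(p, D) with a full corpus scan."""
    freq_all = Counter()     # freq(p, D): documents in D containing p
    freq_subset = Counter()  # freq(p, D'): documents in D' containing p
    for words, phrases in corpus.values():
        in_subset = all(q in words for q in query_words)  # AND query
        for p in phrases:
            freq_all[p] += 1
            if in_subset:
                freq_subset[p] += 1
    scores = {p: freq_subset[p] / freq_all[p] for p in freq_subset}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

print(top_k_interesting(TOY_CORPUS, ["ukraine", "crimea"], 2))
```

This baseline touches every document for every query, which is the cost that precomputed indexes aim to avoid.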

Page 3: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Earlier Approaches

Phrase Indexing, Simitsis et al., VLDB 2008: O(|P|)

p1: d12 d13 d30 d9901

p9876: d1 d11 d305 d8100

Document Indexing, Bedathur et al., VLDB 2010 and Gao & Michel, EDBT 2012: O(|D’|)

d1: p5 p43 p167 p8970

d9998: p23 p49 p305 p9987
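To make the two cost profiles concrete, here is a simplified sketch of how each index layout could be used to count phrase frequencies inside a chosen subset. It is not the cited systems' actual implementation; the index contents and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy indexes over the same data; all IDs are illustrative.
PHRASE_INDEX = {  # phrase -> set of documents containing it
    "p1": {"d1", "d2", "d3"},
    "p2": {"d2", "d4"},
    "p3": {"d4"},
}
DOC_INDEX = {  # document -> phrases it contains
    "d1": ["p1"],
    "d2": ["p1", "p2"],
    "d3": ["p1"],
    "d4": ["p2", "p3"],
}

def subset_freq_via_phrase_index(subset):
    """Visit every phrase list once: the work grows with |P|."""
    counts = {}
    for phrase, docs in PHRASE_INDEX.items():
        hits = len(docs & subset)
        if hits:
            counts[phrase] = hits
    return counts

def subset_freq_via_doc_index(subset):
    """Visit one forward list per selected document: the work grows with |D'|."""
    counts = Counter()
    for doc in subset:
        counts.update(DOC_INDEX[doc])
    return dict(counts)

print(subset_freq_via_phrase_index({"d1", "d2"}))
print(subset_freq_via_doc_index({"d1", "d2"}))
```

Both functions return the same counts; they differ only in which dimension of the data they must scan.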

Page 4: Fast Mining of Interesting Phrases from Subsets of Text Corpora

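Estimating Interestingness: AND Query

Consider an AND query composed of k keywords

Q = {Q1, Q2, …, Qk}

$$ID(p, D') = p(Q_1, \ldots, Q_k \mid p) = \frac{\#Docs(\{p, Q_1, \ldots, Q_k\})}{\#Docs(p)}$$

Here #Docs(X) denotes the number of documents that contain every item in X

$$ID(p, D') = \frac{\mathrm{freq}(p, D')}{\mathrm{freq}(p, D)}$$

$$\frac{p(p \mid Q_1, \ldots, Q_k)}{p(p)} \propto p(Q_1, \ldots, Q_k \mid p)$$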

Page 5: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Query Word Independence Assumption

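Consider an AND Query of two words Q1 and Q2

We would like to estimate p(P1|Q1, Q2) as an estimate of the interestingness of P1

Instead, we could estimate p(Q1, Q2|P1) (as shown in the previous slide)

$$p(Q_1, Q_2 \mid P_1) = p(Q_1 \mid Q_2, P_1)\, p(Q_2 \mid P_1) \approx p(Q_1 \mid P_1)\, p(Q_2 \mid P_1)$$

For a query of k words:

$$p(Q_1, \ldots, Q_k \mid p) \approx \prod_{i=1}^{k} p(Q_i \mid p), \qquad \log \prod_{i=1}^{k} p(Q_i \mid p) = \sum_{i=1}^{k} \log p(Q_i \mid p)$$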

For OR Query Handling details, refer to the paper
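A minimal sketch of AND-query scoring under the independence assumption: each phrase is scored by the sum of log p(Qi | p) over the query words. The probability table `TOY_P` and the function name are hypothetical, and treating a missing (word, phrase) pair as probability zero is an assumption of this sketch.

```python
import math

# Hypothetical p(w | p) values, keyed by (word, phrase).
TOY_P = {
    ("trade", "p1"): 0.5, ("reserves", "p1"): 0.4,
    ("trade", "p2"): 0.9, ("reserves", "p2"): 0.1,
    ("trade", "p3"): 0.7,  # "reserves" never co-occurs with p3
}

def independence_score(phrase, query_words, p_word_given_phrase):
    """Return sum_i log p(Q_i | p), or -inf if any query word never co-occurs."""
    total = 0.0
    for q in query_words:
        prob = p_word_given_phrase.get((q, phrase), 0.0)
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
    return total

query = ["trade", "reserves"]
ranking = sorted(["p1", "p2", "p3"],
                 key=lambda p: independence_score(p, query, TOY_P),
                 reverse=True)
print(ranking)
```

Working in log space turns the product into a sum, so per-word lists can be combined with any sum-aggregation method.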

Page 6: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Our Disk-Resident Indexes

w1: (p30, 0.23) (p12, 0.21) (p990, 0.18) (p13, 0.002)

w9876: (p810, 0.1) (p11, 0.08) (p305, 0.007) (p8, 0.0001)

The score that is stored along with each phrase is p(w|p)

All values are stored in descending score order
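One way such per-word lists could be built is sketched below. It assumes p(w|p) is estimated as the fraction of documents containing phrase p that also contain word w, and it orders each list by descending score as in the example lists above; the corpus layout and all names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: one (words, phrases) pair per document.
TOY_DOCS = [
    ({"trade", "reserves"}, {"p1", "p2"}),
    ({"trade"}, {"p1"}),
    ({"reserves"}, {"p2"}),
    ({"trade"}, {"p2"}),
]

def build_score_sorted_index(corpus):
    """Return word -> [(phrase, p(w|p)), ...] sorted by descending score."""
    phrase_docs = Counter()  # documents containing phrase p
    pair_docs = Counter()    # documents containing both word w and phrase p
    for words, phrases in corpus:
        for p in set(phrases):
            phrase_docs[p] += 1
            for w in set(words):
                pair_docs[(w, p)] += 1
    index = defaultdict(list)
    for (w, p), n in pair_docs.items():
        index[w].append((p, n / phrase_docs[p]))
    for postings in index.values():
        # Descending score; ties broken by phrase ID for determinism.
        postings.sort(key=lambda entry: (-entry[1], entry[0]))
    return dict(index)

index = build_score_sorted_index(TOY_DOCS)
print(index["trade"])
print(index["reserves"])
```

Sorting by score is what lets a top-k algorithm stop early after reading only a prefix of each list.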

Page 7: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Aggregation Approach: NRA

We use the well-known NRA algorithm to aggregate the lists corresponding to the query words and arrive at the top phrases

At any point, we have upper and lower bounds on the aggregate score of each phrase seen so far. An example of sum-aggregation:

w1: (P1, 0.04167) (P5, 0.0333) …

w2: (P103, 0.26) (P1, 0.113) …

P1 – [0.1547, 0.1547]

P5 – [0.0333, 0.1463]

P103 – [0.26, 0.2933]
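The bound bookkeeping can be sketched as a small NRA (No Random Access) loop with sum-aggregation. This is a simplified illustration, not the authors' implementation: it assumes non-negative scores (so an unseen entry contributes at least 0), one entry per phrase per list, and round-robin sorted access; the function name and return shape are hypothetical.

```python
def nra_top_k(lists, k):
    """lists: one [(phrase, score), ...] per query word, sorted by descending score.
    Returns [(phrase, lower_bound, upper_bound), ...] for the top-k phrases."""
    if k <= 0:
        return []
    m = len(lists)
    seen = {}          # phrase -> {list index: score read from that list}
    last = [0.0] * m   # most recent score read from each list
    depth = 0
    while True:
        progressed = False
        for i, postings in enumerate(lists):  # one round of sorted accesses
            if depth < len(postings):
                phrase, score = postings[depth]
                seen.setdefault(phrase, {})[i] = score
                last[i] = score
                progressed = True
            else:
                last[i] = 0.0  # exhausted list can add nothing more
        depth += 1
        # Lower bound: scores seen so far. Upper bound: add the last-read
        # score of every list in which the phrase has not been seen yet.
        lower = {p: sum(s.values()) for p, s in seen.items()}
        upper = {p: lower[p] + sum(last[i] for i in range(m) if i not in s)
                 for p, s in seen.items()}
        ranked = sorted(lower, key=lambda p: (-lower[p], p))
        top, rest = ranked[:k], ranked[k:]
        kth_lower = lower[top[-1]] if len(top) == k else float("-inf")
        # sum(last) bounds the score of any phrase not seen in any list yet.
        best_other = max([upper[p] for p in rest] + [sum(last)])
        if kth_lower >= best_other or not progressed:
            return [(p, lower[p], upper[p]) for p in top]

w1 = [("P1", 0.04167), ("P5", 0.0333)]
w2 = [("P103", 0.26), ("P1", 0.113)]
print(nra_top_k([w1, w2], k=1))
```

NRA guarantees the membership of the top-k set when it stops; the returned bounds need not have converged to exact scores, which is why both are reported.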

Page 8: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Our In-Memory Indexes

w1: (p12, 0.21) (p13, 0.002) (p30, 0.23) (p990, 0.18)

w9876: (p8, 0.0001) (p11, 0.08) (p305, 0.007) (p810, 0.1)

The score that is stored along with each phrase is p(w|p)

All values are stored in PhraseID sorted order

Indexes may be created by preserving just the top-10% values of each list

We will use simple Sort-Merge-Join on these lists for In-Memory operation
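A minimal sketch of the in-memory path for a two-word AND query: each list is optionally truncated to its highest-scoring fraction, kept in phrase-ID order, and then joined with a linear sort-merge pass. Combining matches as a sum of logs follows the independence formulation and is an assumption here, as are the function names and the second list's values.

```python
import math

def keep_top_fraction(postings, fraction):
    """Keep the highest-scoring fraction of a list, then restore phrase-ID order."""
    keep = max(1, int(len(postings) * fraction))
    best = sorted(postings, key=lambda entry: -entry[1])[:keep]
    return sorted(best)

def sort_merge_join(list_a, list_b):
    """Both lists hold (phrase_id, p(w|p)) pairs in ascending phrase_id order.
    Returns (phrase_id, log score) for phrases present in both lists."""
    i = j = 0
    joined = []
    while i < len(list_a) and j < len(list_b):
        id_a, score_a = list_a[i]
        id_b, score_b = list_b[j]
        if id_a == id_b:
            joined.append((id_a, math.log(score_a) + math.log(score_b)))
            i += 1
            j += 1
        elif id_a < id_b:
            i += 1  # id_a cannot match anything later in list_b
        else:
            j += 1
    return joined

w1 = [(12, 0.21), (13, 0.002), (30, 0.23), (990, 0.18)]  # list from the slide
w_other = [(12, 0.4), (30, 0.9), (99, 0.3)]              # hypothetical second list
print(keep_top_fraction(w1, 0.5))
print(sort_merge_join(w1, w_other))
```

Truncation trades accuracy for memory: a phrase dropped from any one list can no longer appear in the AND result.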

Page 9: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Example Results

Query: trade reserves (Reuters Dataset)

– economic minister

– reserves

– taiwan’s foreign exchange reserves

– economic planning

– economic planning and development

Page 10: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Result Quality Evaluation

[Figure: Prec, MRR, NDCG and MAP for the 20-AND and 50-AND settings on the PubMed Dataset; y-axis from 0.9 to 1]

Page 11: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Running Times: Disk-based Operation (NRA)

[Figure: running times on a log scale (100 to 10000000) for AND-NRA, OR-NRA, AND-GM and OR-GM on the PubMed Dataset. X-Axis: Percentage of NRA Lists Traversed (0 to 100)]

Page 12: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Percentages of Lists Traversed (NRA)

[Figure: percentage of lists traversed by NRA for Reuters-AND, Reuters-OR, Pubmed-AND and Pubmed-OR; axis from 27 to 34]

Page 13: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Running Times: Mem-based Operation (SMJ)

[Figure: running times on a log scale (1 to 10000000) for AND-SMJ, OR-SMJ, AND-GM and OR-GM on the PubMed Dataset. X-Axis: Percentage of Entries Stored (0 to 100)]

Page 14: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Shortcomings

Index Sizes

– Earlier approaches index only phrases and documents

– Our method has word-specific indexes, with each word having a list in the index

– Number of words across documents could be much more than the number of phrases

– If we would like to support querying over all possible words, index sizes could get large

Queries on Metadata Facets

– Instead of using keyword queries, document subsets could also be chosen using metadata facets

– E.g., venue:sigmod AND year:2007, on a set of scholarly publications

– Our independence assumption has not yet been tested on metadata facets

Page 15: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Summary

Proposed an approach for the problem of mining interesting phrases from subsets of text corpora

Outlined the query word independence assumption that is seen to be empirically useful in accurately identifying interesting phrases

Our approach is seen to be up to 90% accurate, while being able to achieve turnaround times that are orders of magnitude better than those of the current techniques

Future Work

– Other potential avenues for leveraging the independence assumption for phrase analytics

– Methods to speed up interesting phrase mining over metadata facets

Page 16: Fast Mining of Interesting Phrases from Subsets of Text Corpora

Thank You

Questions, Comments, Suggestions?