04/18/23 1
A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model
Department of Computer Science
Brigham Young University
Quan Wang
November 2001
Overview
Probabilistic Retrieval Model
– Application ontology
– Document representations
– Ranking documents based on logistic regression analysis
Experimental Results
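Application Ontology
[Figure: car-ads application ontology. The object set Car is connected to Year, Price, Make, Model, Mileage, Feature, and PhoneNr. The edges carry participation constraints: 1:* on the attribute side, and 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, and 0:0.45:1 on the Car side.]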
Document Representation
A set of <index term : term frequency> pairs A1:x1, ..., An:xn.
A density heuristic value y.
A grouping heuristic value z.
Document d: (x1, ..., xn, y, z) = (V, y, z), where V = (x1, ..., xn) is the document vector.
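As a rough illustration of this representation, the sketch below stores the term frequencies together with the two heuristic values. It assumes the index terms have already been recognized in the document; the class name, the function name, and the sample values for y and z are hypothetical and do not come from the slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocumentRepresentation:
    term_frequencies: dict  # index term -> term frequency (the vector V)
    density: float          # density heuristic value y
    grouping: float         # grouping heuristic value z


def term_frequency_vector(recognized_terms, index_terms):
    """Count how often each index term was recognized in a document."""
    counts = Counter(recognized_terms)
    return {term: counts.get(term, 0) for term in index_terms}


# Index terms taken from the car-ads application ontology.
INDEX_TERMS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]

# Hypothetical recognizer output for one document.
recognized = ["Make", "Year", "Price", "Make", "Model"]

doc = DocumentRepresentation(
    term_frequencies=term_frequency_vector(recognized, INDEX_TERMS),
    density=0.4,   # hypothetical y
    grouping=0.7,  # hypothetical z
)
```

Index terms that never occur get frequency 0, so every document yields a vector of the same length n.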
Independence Assumption
P(R|x1, ..., xn, y, z)
Under the independence assumption, this is obtained from the separate terms:
P(R|x1) * ... * P(R|xn) * P(R|y) * P(R|z)
Logistic Regression
[Figure: plot of P(R|x) (vertical axis P) against x, with scattered data points; a value xi on the x-axis is mapped to P(R|xi).]
P(R|x) = 1 / (1 + exp(-(C0 + C1 x))), so ln(O(R|x)) = C0 + C1 x.
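The two forms of the logistic model can be checked against each other numerically. The sketch below evaluates both; the coefficient values C0 and C1 are hypothetical and chosen only for illustration, since the slides do not give fitted values.

```python
import math


def relevance_probability(x: float, c0: float, c1: float) -> float:
    """P(R|x) = 1 / (1 + exp(-(C0 + C1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


def log_odds(x: float, c0: float, c1: float) -> float:
    """ln(O(R|x)) = C0 + C1 * x."""
    return c0 + c1 * x


# Hypothetical coefficients, for illustration only.
C0, C1 = -2.0, 4.0

p = relevance_probability(1.0, C0, C1)
# The log of the odds p / (1 - p) recovers the linear form C0 + C1 * x.
recovered = math.log(p / (1.0 - p))
```

The log-odds form is linear in x, which is why the sign of C0 + C1 x tells whether P(R|x) is above or below 0.5.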
Probabilistic Retrieval Based on Logistic Regression Analysis
Data processing
Data analysis
Probabilistic retrieval on car-ads application ontology
Correlation relations
Data Processing
The corresponding normalized vector V’ = (X1’, ..., Xn’) is computed as
V’ = |V| / |u|
V
where V is a document vector and u is an ontology vector.
Statistical Information: P-Value
A p-value is a significance indicator.
A large p-value indicates either a bad regression model or a statistically insignificant index term.
We should keep only significant index terms.
Select Important Index Terms
P-values for the car-ads application ontology:
Index term   P-value
Year         .679
Make         .002
Model        .074
Mileage      .002
Price        .001
Features     .001
PhoneNr      .034
Density      .052
Grouping     .012
Double S-curve
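A minimal sketch of this selection step, using the p-values listed for the car-ads application ontology. The cutoff ALPHA is an assumption: the slides do not state which significance level was actually used, and a different cutoff keeps a different set of terms.

```python
# p-values as listed for the car-ads application ontology.
P_VALUES = {
    "Year": 0.679, "Make": 0.002, "Model": 0.074, "Mileage": 0.002,
    "Price": 0.001, "Features": 0.001, "PhoneNr": 0.034,
    "Density": 0.052, "Grouping": 0.012,
}


def significant_terms(p_values, alpha):
    """Return the terms whose p-value is at or below the cutoff alpha."""
    return [term for term, p in p_values.items() if p <= alpha]


# Assumed cutoff; not given in the slides.
ALPHA = 0.05
kept = significant_terms(P_VALUES, ALPHA)
dropped = [term for term in P_VALUES if term not in kept]
```

Under this assumed cutoff, Year is dropped by a wide margin, while Model and Density fall just above the line and would be kept with a looser cutoff such as 0.1.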
Probabilistic Retrieval Model
ln(O(R|xi)), ln(O(R|y)), ln(O(R|z))
> 0: relevant
< 0: irrelevant
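The sign test on the log-odds can be sketched as follows. How the individual log-odds values are combined is not spelled out on the slide, so this sketch assumes they are summed into a single score; the function name and the sample values are hypothetical.

```python
def classify(log_odds_terms):
    """Label a document from its log-odds evidence.

    Assumption: the per-property values ln(O(R|xi)), ln(O(R|y)), and
    ln(O(R|z)) are summed into one score. A score of exactly 0 is not
    covered by the slide and is treated as irrelevant here.
    """
    score = sum(log_odds_terms)
    return "relevant" if score > 0 else "irrelevant"


# Hypothetical log-odds values for two documents.
label_a = classify([1.2, -0.4, 0.3])   # positive total
label_b = classify([-2.0, 0.5])        # negative total
```

A log-odds of 0 corresponds to a probability of 0.5, so the rule amounts to asking whether relevance is more likely than not.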
Correlation Relations
Correlation: There are strong positive correlations among document properties (e.g. Death Date & Birth Date in the obituaries).
Correlations are extra information implicitly contained in a document.
Correlation relations handle “patterns”, e.g., the Birth Date-Death Date pair that appears in the obituaries application ontology.
Special Web Documents
Multiple-record Web documents
Similar content, format (e.g. item for sale)
Same lexical object values (e.g. Honda makes cars and motorcycles)
8 documents (motorcycle, boat, snowmobile, bicycle) for the car-ads application, and 5 documents (death notice, bibliography for famous people, find a graveyard, politician died young, famous people died in car accident) for the obituary application.
Experimental Results
            Car-ads   Obituary
Recall      100%      100%
Precision   83.3%*    83.3%
Accuracy    92.9%     92.0%
*Ten out of eighteen negative documents are specially selected.
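For reference, the three measures follow from confusion-matrix counts as in the sketch below. The counts shown are hypothetical: the slides report only percentages, and these numbers were chosen because they reproduce the car-ads figures with eighteen negative documents. They are not confirmed by the source.

```python
def metrics(tp, fp, fn, tn):
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy


# Hypothetical counts: 10 relevant documents, all retrieved, and
# 18 negative documents, 2 of which are wrongly accepted.
recall, precision, accuracy = metrics(tp=10, fp=2, fn=0, tn=16)

as_percent = [round(100 * value, 1) for value in (recall, precision, accuracy)]
```

With these counts, full recall alongside lower precision means every relevant document was found and the only errors were false positives.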