Implementing Query Classification
HYP: End of Semester Update, prepared by Minh

Previously…
Web search queries:
◦ Understand the user's goal
Broder et al. (2002):
◦ Queries are classified into 3 categories:
  Informational, Navigational, Transactional

Previously…
Functional Faceted Web Query Classification
Ambiguity: Polysemous, General, Specific
Authority Sensitivity: Yes / No
Spatial Sensitivity: Yes / No
Temporal Sensitivity: Yes / No
◦ Query’s 4-tuple: <Am, Au, S, T>
◦ 3 * 2 * 2 * 2 = 24 different combinations
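The count of 24 can be checked by enumerating the facet values. This is only an illustrative sketch; the variable names are assumptions and not part of the original system.

```python
from itertools import product

# Facet values as listed on the slide; the constant names are illustrative.
AMBIGUITY = ("Polysemous", "General", "Specific")
AUTHORITY = ("Yes", "No")
SPATIAL = ("Yes", "No")
TEMPORAL = ("Yes", "No")

# Every possible <Am, Au, S, T> tuple.
facet_tuples = list(product(AMBIGUITY, AUTHORITY, SPATIAL, TEMPORAL))
print(len(facet_tuples))  # 24
```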

Temporal Sensitivity
Definition:
◦ A keyword is temporally sensitive if the results returned by querying it on a web search engine tend to change over time.
◦ Example:
  Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc.
  Non-temporally sensitive: video, buying car, etc.

Up-to-date Project Scope
Objective: to analyze the temporal sensitivity facet of web search queries.
Problem: find the temporal correlation between web queries.

Web Query Histogram
Periodic queries: [histogram: Champions League Final]
Non-periodic queries: [histogram: Liverpool]

Queries Correlation
Correlation:
◦ Observation: two keywords are temporally related to each other

Proposed System Framework
1. Ask Google Trends for the query’s histogram
2. Use a histogram digitizer program (Plotparser by WeiHua) to get the numerical data
3. Query correlation:
   • Calculate the correlation coefficient between queries
4. Query classification

Google Trends

Histogram Digitizer

Queries Correlation: 1st attempt
Calculate the correlation coefficient:
◦ Using data from 45 months: January 2004 until September 2007
◦ Calculate the coefficient based on the entire histograms
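The slides do not say which correlation coefficient was used. Assuming it is the Pearson coefficient, computing it over two whole histograms could look like the sketch below. The function name and the sample volumes are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a flat histogram as uncorrelated
    return cov / sqrt(var_x * var_y)

# Made-up monthly volumes for two hypothetical queries (not data from the slides).
query_a = [10, 12, 40, 11, 9, 10]
query_b = [5, 6, 22, 7, 5, 6]
coef = pearson(query_a, query_b)
```

In the first attempt each histogram would hold 45 monthly values rather than the six used here.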

Result classification: 1st attempt
Data of 15 different popular keywords, of which:
◦ Periodic keywords: Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Christmas (!)
◦ Related keywords: PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami
All keywords are compared with each other based on the correlation coefficient of their histograms.
(15 * 14) / 2 = 105 instances
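The instance count follows from pairing every keyword with every other keyword once. A small sketch that enumerates the pairs, using the keyword lists from the slide (the variable names are illustrative):

```python
from itertools import combinations

periodic = ["Champions League Final", "Grammy", "Pro Evolution Soccer",
            "Oscar Winner", "Valentine", "Christmas"]
related = ["PS2", "Xbox", "Jack Nicholson", "Beyonce", "chocolate",
           "chocolateNews", "Liverpool", "EA Sport", "Konami"]

keywords = periodic + related
# Unordered pairs: each keyword is compared with every other keyword once.
pairs = list(combinations(keywords, 2))
print(len(keywords), len(pairs))  # 15 105
```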

Result classification: 1st attempt
Classification based on the threshold method:
◦ Statistical result:
  Threshold value: 0.25

  Correlation Prediction   True Positive Rate   False Positive Rate
  Yes                      88.89%               10.34%
  No                       89.66%               11.11%
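A threshold classifier of this kind can be sketched as follows. Whether the original rule used ">=" or ">" is not stated, so the comparison is an assumption, and the coefficients and labels are made-up values rather than the 105 instances from the experiment.

```python
def classify(coef, threshold=0.25):
    """Predict 'related' when the coefficient reaches the threshold (assumed rule)."""
    return coef >= threshold

def rates(coefs, labels, threshold=0.25):
    """Return (true positive rate, false positive rate) for the 'Yes' class."""
    tp = fp = fn = tn = 0
    for coef, is_related in zip(coefs, labels):
        predicted = classify(coef, threshold)
        if predicted and is_related:
            tp += 1
        elif predicted and not is_related:
            fp += 1
        elif is_related:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Made-up coefficients and ground-truth labels, for illustration only.
coefs = [0.80, 0.30, 0.10, 0.40, 0.05, 0.20]
labels = [True, True, True, False, False, False]
tpr, fpr = rates(coefs, labels)
```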

1st attempt: Problems
Very low threshold value
◦ Only one feature used.
Using the entire histogram, while some keywords are only temporally related to each other during some periods of time.
◦ Example: Valentine – Chocolate (correlation appears during February)

Queries Correlation: 2nd attempt
Interesting period:
◦ Period in which two queries are highly related to each other
-> Segmentation (clustering) problem

Clustering Using Simple K-means
Algorithm to predict the number of clusters
Use WEKA to cluster the histogram

Query Correlation: 2nd attempt
Periodic keywords detection:
◦ Identify repeated patterns using correlation
◦ A periodic query tends to have a high correlation coefficient on the repeated part.
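One way to test for a repeated pattern is to correlate a histogram with a shifted copy of itself. The sketch below assumes a Pearson coefficient and a known candidate period (12 months); the slides do not describe the actual procedure, and the data is synthetic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def lag_correlation(series, lag):
    """Correlate a histogram with a copy of itself shifted by `lag` samples."""
    if not 0 < lag < len(series) - 1:
        raise ValueError("lag must leave at least two overlapping samples")
    return pearson(series[:-lag], series[lag:])

# Synthetic monthly histogram: one spike per year over three years.
year = [1, 1, 1, 1, 10, 1, 1, 1, 1, 1, 1, 1]
yearly_query = year * 3
print(round(lag_correlation(yearly_query, 12), 3))  # 1.0
```

A lag that does not match the period (for example 5 months on this data) gives a much lower coefficient, which is what separates periodic from non-periodic queries in this sketch.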

Interesting Periods Projection
Interesting periods from the related keyword’s histogram are projected onto the periodic keyword’s histogram

Result Classification: 2nd attempt
Using the previous dataset
Related keywords are compared with each of the periodic keywords for correlation
Result:
◦ Managed to increase the threshold value to 0.5

2nd attempt: Problems
K-means clustering does not guarantee correct detection of interesting periods:
◦ We have to provide the number of clusters for K-means
  -> the algorithm implemented to determine the number of clusters failed to provide the correct value
Small training data set.
The threshold detector is too simple a method.

Queries Correlation: 3rd attempt
Need to find another way to identify interesting periods.
Peak period:
◦ Period in which there is a high peak in query volume
Peak detection problem:
◦ Mapping and smoothing using convolution

Clustering using peak detection
Mapping:
Smoothing using convolution:
Peak detection: using a simple slope-change algorithm to determine peaks and valleys
◦ (with threshold value: mean)
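The smoothing and peak-detection steps can be sketched roughly as below. The kernel (a 3-point moving average), the edge handling, and the function names are assumptions. The mapping step is left out because the slides do not define it.

```python
def smooth(series, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Convolve with a small symmetric kernel, clamping indices at the borders."""
    half = len(kernel) // 2
    n = len(series)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - half, 0), n - 1)  # repeat the edge values
            acc += weight * series[j]
        out.append(acc)
    return out

def find_peaks(series):
    """Indices where the slope turns from rising to non-rising and the value
    is above the series mean (the threshold named on the slide)."""
    mean = sum(series) / len(series)
    peaks = []
    for i in range(1, len(series) - 1):
        rising = series[i] > series[i - 1]
        not_rising_after = series[i] >= series[i + 1]
        if rising and not_rising_after and series[i] > mean:
            peaks.append(i)
    return peaks

# Made-up query volumes with one clear burst around index 5.
volumes = [1, 2, 1, 2, 9, 14, 8, 2, 1, 2, 1, 1]
peaks = find_peaks(smooth(volumes))
print(peaks)  # [5]
```

Valleys could be found the same way with the comparisons reversed; an interesting period would then be the span between the valleys on either side of a peak.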

Interesting Periods Projections
Interesting periods from the related keyword’s histogram are projected onto the periodic keyword’s histogram, and vice versa

Result Classification: 3rd attempt
Use larger training data:
◦ 47 popular keywords, of which:
  15 periodic keywords and 32 related keywords
  Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef).
◦ Data size: 15 * 32 = 480 instances

Result Classification: 3rd attempt
Apply Naïve Bayes classifier (WEKA):
6 features:
  Average Coef from related keyword projection (AveRCoef)
  Average Coef from periodic keyword projection (AvePCoef)
  Overall Average Coef [= (AveRCoef + AvePCoef) / 2]
  Max Coef from related keyword projection (MaxRCoef)
  Max Coef from periodic keyword projection (MaxPCoef)
  Average Max Coef [= (MaxRCoef + MaxPCoef) / 2]
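Given the coefficients obtained from each projection direction, the six features could be assembled as in this sketch. The function name, the dictionary layout, and the input values are hypothetical; only the feature definitions come from the slide.

```python
def coef_features(related_coefs, periodic_coefs):
    """Build the six features from per-period correlation coefficients.

    related_coefs:  coefficients from the related keyword's projection
    periodic_coefs: coefficients from the periodic keyword's projection
    """
    ave_r = sum(related_coefs) / len(related_coefs)
    ave_p = sum(periodic_coefs) / len(periodic_coefs)
    max_r = max(related_coefs)
    max_p = max(periodic_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,
    }

# Made-up coefficients for one (related, periodic) keyword pair.
features = coef_features([0.2, 0.6], [0.4, 0.8])
```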

Result Classification: 3rd attempt
Statistical result:

  Correlation Prediction   True Positive Rate   False Positive Rate   Recall   F-Measure
  Yes                      89.3%                5.2%                  0.893    0.725
  No                       94.8%                10.7%                 0.948    0.969

Confusion matrix:

    A     B    <- classified as
   25     3    A = Yes
   16   294    B = No
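The reported rates can be re-derived from the confusion matrix. The sketch below uses the standard definitions of recall, false positive rate, and F-measure; the function name is illustrative.

```python
def class_metrics(tp, fn, fp, tn):
    """Recall (true positive rate), false positive rate, and F-measure for one class."""
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, fpr, f_measure

# Counts read from the confusion matrix: rows are actual classes, columns are predictions.
yes = class_metrics(tp=25, fn=3, fp=16, tn=294)
no = class_metrics(tp=294, fn=16, fp=3, tn=25)

print([round(v, 3) for v in yes])  # [0.893, 0.052, 0.725]
print([round(v, 3) for v in no])   # [0.948, 0.107, 0.969]
```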

Future attempt: Query Normalization
Search volumes tend to increase as the Internet becomes more popular
Histogram for the top 20 most popular keywords of all time:

Future attempt: Normalization
Histograms need to be normalized to ignore this trend’s effect!
Proposed action:
◦ Subtract the time effect
◦ Current problem: more distortion is added due to a scaling problem.
  -> histograms from Google have been scaled. We have no information about the raw data.
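One possible reading of "subtract the time effect" is to fit a straight-line trend and remove it. The sketch below assumes the growth is linear, which the slides do not claim, and it does not address the scaling problem noted above.

```python
def detrend(series):
    """Subtract a least-squares straight line from the series (assumed linear trend)."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

# Made-up volumes: steady growth with one burst at index 5.
volumes = [2 * t + 1 for t in range(10)]
volumes[5] += 20
residual = detrend(volumes)
burst_index = residual.index(max(residual))  # the burst stands out after detrending
```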

Future attempt: From Periodic to Non-periodic
Find the correlation between two non-periodic queries.
Proposed problem: some keywords are highly searched after other keywords
◦ Example: “tsunami” is usually searched after “earthquake” is issued.
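A "searched after" relationship could be probed with a lagged correlation: shift the follower's histogram back by a few samples and see which shift correlates best with the leader. This is a sketch under that assumption, with synthetic data and hypothetical function names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def best_lag(leader, follower, max_lag):
    """Return (lag, coefficient) for the delay at which follower best matches leader."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        coef = pearson(leader[:-lag], follower[lag:])
        if coef > best[1]:
            best = (lag, coef)
    return best

# Synthetic volumes: the follower's burst comes two samples after the leader's.
earthquake = [1, 1, 9, 1, 1, 1, 1, 1]
tsunami = [1, 1, 1, 1, 8, 1, 1, 1]
lag, coef = best_lag(earthquake, tsunami, max_lag=3)
print(lag)  # 2
```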

Future attempt: From Periodic to Non-Periodic
[Histograms: Tsunami, Earthquake]

Potential Applications
Results re-ranking:
◦ Move results that are more up-to-date higher on the result list
  Example: when a user asks for Beyonce during the time of Grammy -> results related to Grammy will have a higher rank
Server buffering:
◦ When a user queries Beyonce, web pages related to Grammy are buffered on a local server, in the hope that the user will eventually search for Grammy.

Question?

The End