Implementing Query Classification HYP: End of Semester Update prepared Minh

Page 1: Implementing Query Classification

HYP: End of Semester Update, prepared by Minh

Page 2: Previously…

Web search queries:
◦ Understand the user's goal
Broder et al. (2002):
◦ Queries are classified into 3 categories: Informational, Navigational, Transactional

Page 3: Previously…

Functional Faceted Web Query Classification
    Ambiguity: Polysemous, General, Specific
    Authority Sensitivity: Yes - No
    Spatial Sensitivity: Yes - No
    Temporal Sensitivity: Yes - No
◦ Query's 4-tuple: <Am, Au, S, T>
◦ 3 * 2 * 2 * 2 = 24 different combinations.
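The count of 24 can be checked by enumerating the facet values directly. This is only a small sketch: the variable names are made up for illustration, and the facet values are taken as listed on the slide.

```python
from itertools import product

# Facet values as listed on the slide.
AMBIGUITY = ("Polysemous", "General", "Specific")
YES_NO = ("Yes", "No")

# Every possible <Am, Au, S, T> tuple (the names here are illustrative).
facet_tuples = list(product(AMBIGUITY, YES_NO, YES_NO, YES_NO))

print(len(facet_tuples))   # number of combinations
print(facet_tuples[0])     # first tuple in enumeration order
```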

Page 4: Temporal Sensitivity

Definition:
◦ A keyword is temporally sensitive if the results returned by querying it on a web search engine tend to change over time.
◦ Example:
    Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc.
    Non-temporally sensitive: video, buying car, etc.

Page 5: Up-to-date Project Scope

Objective: to analyze the temporal sensitivity facet of web search queries.
Problem: find the temporal correlation between web queries.

Page 6: Web Query Histogram

Periodic queries: Champions League Final (histogram figure)
Non-periodic queries: Liverpool (histogram figure)

Page 7: Queries Correlation

Correlation (figure)
Observation: the 2 keywords are temporally related to each other

Page 8: Proposed System Framework

1. Ask Google Trends for the query's histogram
2. Use a histogram digitizer program (Plotparser by WeiHua) to get the numerical data
3. Query correlation:
   • Calculate the correlation coefficient between queries
4. Query classification

Page 9: Google Trends

Page 10: Histogram Digitizer

Page 11: Queries Correlation: 1st attempt

Calculate the correlation coefficient:
◦ Using 45 months of data: January 2004 until September 2007
◦ Calculate the coefficient based on the entire histograms
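The slides do not say which correlation coefficient is used. A minimal sketch, assuming the Pearson coefficient computed over two equal-length monthly series, could look like this (the function name and the sample values are invented for illustration, not real Google Trends data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a flat series as uncorrelated
    return cov / sqrt(var_x * var_y)

# Made-up monthly volumes; query_b is exactly half of query_a.
query_a = [1.0, 2.0, 8.0, 2.0, 1.0, 1.5]
query_b = [0.5, 1.0, 4.0, 1.0, 0.5, 0.75]

print(pearson(query_a, query_b))
```

In the first attempt the two series would each hold the full 45 monthly values, so a single coefficient summarizes the whole histogram pair.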

Page 12: Result classification: 1st attempt

Data for 15 different popular keywords, of which:
◦ Periodic keywords: Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Christmas(!).
◦ Related keywords: PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami
All keywords are compared with each other based on the correlation coefficient of their histograms.
(15*14)/2 = 105 instances
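The instance count follows from taking every unordered pair of the 15 keywords. As a quick check, using the keyword list from the slide (the list layout and variable names are only illustrative):

```python
from itertools import combinations

keywords = [
    # periodic keywords from the slide
    "Champions League Final", "Grammy", "Pro Evolution Soccer",
    "Oscar Winner", "Valentine", "Christmas",
    # related keywords from the slide
    "PS2", "Xbox", "Jack Nicholson", "Beyonce", "chocolate",
    "chocolateNews", "Liverpool", "EA Sport", "Konami",
]

# Each unordered pair of distinct keywords is one instance to classify.
pairs = list(combinations(keywords, 2))

print(len(keywords), len(pairs))
```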

Page 13: Result classification: 1st attempt

Classification based on the threshold method:
◦ Statistical result:
Threshold value: 0.25

Correlation Prediction   True Positive Rate   False Positive Rate
Yes                      88.89%               10.34%
No                       89.66%               11.11%
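A threshold classifier of this kind can be sketched as follows. The comparison direction (`>=`), the function names, and the labelled samples are assumptions for illustration; only the threshold value 0.25 comes from the slide.

```python
def classify(coef, threshold=0.25):
    """Predict 'Yes' (temporally related) when the coefficient reaches the threshold."""
    return "Yes" if coef >= threshold else "No"

def rates(samples, threshold=0.25):
    """Return (TPR, FPR) for the 'Yes' class over (coef, true_label) samples."""
    tp = fp = fn = tn = 0
    for coef, truth in samples:
        predicted = classify(coef, threshold)
        if truth == "Yes" and predicted == "Yes":
            tp += 1
        elif truth == "Yes":
            fn += 1
        elif predicted == "Yes":
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Hypothetical labelled keyword pairs: (correlation coefficient, true label).
samples = [
    (0.80, "Yes"), (0.30, "Yes"), (0.10, "Yes"),
    (0.40, "No"), (0.20, "No"), (0.05, "No"), (-0.20, "No"),
]

tpr, fpr = rates(samples)
print(tpr, fpr)
```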

Page 14: 1st attempt Problems

Very low threshold value
◦ Only one feature used.
The entire histogram is used, while some keywords are temporally related to each other only during certain periods.
◦ Example: Valentine – Chocolate (correlation appears during February)

Page 15: Queries Correlation: 2nd attempt

Interesting period:
◦ Period in which two queries are highly related to each other
-> Segmentation (clustering) problem

Page 16: Clustering Using Simple K-means

Algorithm to predict the no. of clusters
Use WEKA to cluster the histogram

Page 17: Query Correlation: 2nd attempt

Periodic keyword detection:
◦ Identify the repeated pattern using correlation
◦ A periodic query tends to have a high correlation coefficient on the repeated parts.
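One way to read this idea is to cut the histogram into consecutive segments of equal length and correlate each segment with the next. The sketch below assumes a 12-month period and the Pearson coefficient; neither is stated on the slide, and the function names and data are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson coefficient; returns 0.0 when either series is flat (an assumption)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def periodicity_score(series, period=12):
    """Average correlation between consecutive period-length segments."""
    segments = [series[i:i + period]
                for i in range(0, len(series) - period + 1, period)]
    if len(segments) < 2:
        raise ValueError("need at least two full periods")
    coefs = [pearson(a, b) for a, b in zip(segments, segments[1:])]
    return sum(coefs) / len(coefs)

# Made-up data: a spike every year versus a single one-off spike.
yearly = [1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1]
periodic = yearly * 3
one_off = [1] * 12 + [1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1] + [1] * 12

print(periodicity_score(periodic), periodicity_score(one_off))
```

A score close to 1 suggests a repeated yearly pattern; where to draw the line between periodic and non-periodic is left open here.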

Page 18: Interesting Periods Projection

Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram

Page 19: Result Classification: 2nd Attempt

Using the previous dataset
Related keywords are compared with each of the periodic keywords for correlation
Result:
◦ Managed to increase the threshold value to 0.5

Page 20: 2nd attempt problems

K-means clustering does not guarantee correct detection of interesting periods:
◦ We have to provide the no. of clusters for K-means -> the algorithm implemented to determine the no. of clusters failed to provide the correct value
Small training data set.
The threshold detector method is too simple.

Page 21: Queries Correlation: 3rd attempt

Need to find another way to identify interesting periods.
Peak period:
◦ Period in which there is a high peak in query volume
Peak detection problem:
◦ Mapping and smoothing using convolution

Page 22: Clustering using peak detection

Mapping: (figure)

Page 23: Clustering using peak detection

Smoothing using convolution: (figure)

Page 24: Clustering using peak detection

Peak detection: using a simple slope-change algorithm to determine peaks and valleys
◦ (with threshold value: mean)
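The smoothing and peak-detection steps might be sketched as below. The kernel weights, the edge padding, and the function names are assumptions; the mapping step is not described on the slides and is left out. Only the slope-change idea and the mean threshold come from the slide.

```python
def smooth(series, kernel=(0.25, 0.5, 0.25)):
    """Convolve with a small symmetric, odd-length kernel.

    Edges are padded by repeating the first and last values (an assumption).
    """
    half = len(kernel) // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(series))
    ]

def find_peaks(series):
    """Indices where the slope turns from rising to falling above the series mean."""
    threshold = sum(series) / len(series)
    peaks = []
    for i in range(1, len(series) - 1):
        rising = series[i] > series[i - 1]
        falling = series[i] > series[i + 1]
        if rising and falling and series[i] > threshold:
            peaks.append(i)
    return peaks

# Made-up query volumes with two clear bursts.
volumes = [1, 1, 2, 9, 2, 1, 1, 1, 3, 10, 3, 1]

print(find_peaks(smooth(volumes)))
```

Valleys could be found the same way with the comparisons reversed; an interesting period would then be the stretch between the valleys on either side of a peak.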

Page 25: Interesting Periods Projection

Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram, and vice versa
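Projection can be read as computing the correlation only inside each interesting period, instead of over the whole histogram. The following sketch assumes periods are given as (start, end) index ranges with the end exclusive, and that the coefficient is Pearson; both are illustrative choices, as are the names and data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson coefficient; returns 0.0 when either series is flat (an assumption)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def project(periods, source, target):
    """Correlation of the two histograms inside each period found on `source`."""
    return [pearson(source[start:end], target[start:end])
            for start, end in periods]

# Made-up monthly volumes; the burst around index 2 is shared.
valentine = [1, 1, 8, 1, 1, 1, 1, 1]
chocolate = [3, 2, 9, 2, 5, 1, 4, 2]

# Hypothetical interesting period detected on the Valentine histogram.
periods_from_valentine = [(1, 4)]

print(project(periods_from_valentine, valentine, chocolate))
```

The "vice versa" direction would call `project` again with periods detected on the other histogram and the two series swapped.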

Page 26: Result Classification: 3rd attempt

Use a larger training data set:
◦ 47 popular keywords, of which 15 are periodic keywords and 32 are related keywords
◦ Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef).
◦ Data size: 15 * 32 = 480 instances

Page 27: Result Classification: 3rd attempt

Apply Naïve Bayes Classifier (WEKA):
6 features:
    Average Coef from related keyword projection (AveRCoef)
    Average Coef from periodic keyword projection (AvePCoef)
    Overall Average Coef [= (AveRCoef + AvePCoef)/2]
    Max Coef from related keyword projection (MaxRCoef)
    Max Coef from periodic keyword projection (MaxPCoef)
    Average Max Coef [= (MaxRCoef + MaxPCoef)/2]
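Given the per-period coefficients from the two projection directions, the six features could be assembled as in this sketch. The keys `OverallAveCoef` and `AveMaxCoef` are invented names for the two combined features, and the input values are made up.

```python
def build_features(related_coefs, periodic_coefs):
    """Six features for one (related keyword, periodic keyword) pair.

    related_coefs:  coefficients from the related keyword's projection
    periodic_coefs: coefficients from the periodic keyword's projection
    """
    ave_r = sum(related_coefs) / len(related_coefs)
    ave_p = sum(periodic_coefs) / len(periodic_coefs)
    max_r = max(related_coefs)
    max_p = max(periodic_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,  # hypothetical key name
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,      # hypothetical key name
    }

# Made-up per-period coefficients for one keyword pair.
features = build_features([0.8, 0.6], [0.5, 0.7])
print(features)
```

Each of the instances would yield one such feature vector plus a Yes/No label for the classifier.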

Page 28: Result Classification: 3rd attempt

Statistical Result:

Correlation Prediction   True Positive Rate   False Positive Rate   Recall   F-Measure
Yes                      89.3%                5.2%                  0.893    0.725
No                       94.8%                10.7%                 0.948    0.969

Confusion Matrix

  A    B   <- classified as
 25    3   A = Yes
 16  294   B = No
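The rates in the table can be recomputed from the confusion matrix using the standard definitions of true positive rate, false positive rate, precision, and F-measure. The function name is illustrative; the four counts are the ones shown above.

```python
def class_metrics(tp, fn, fp, tn):
    """Return (TPR, FPR, F-measure) for one class treated as the positive class."""
    tpr = tp / (tp + fn)              # true positive rate, same as recall
    fpr = fp / (fp + tn)              # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, f_measure

# Counts read from the confusion matrix: rows are actual, columns predicted.
yes = class_metrics(tp=25, fn=3, fp=16, tn=294)
no = class_metrics(tp=294, fn=16, fp=3, tn=25)

print([round(v, 3) for v in yes])
print([round(v, 3) for v in no])
```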

Page 29: Future attempt: Query Normalization

Search volumes tend to increase as the Internet becomes more popular
Histogram for the top 20 most popular keywords of all time: (figure)

Page 30: Future attempt: Normalization

Histograms need to be normalized to remove the effect of this trend!
Proposed action:
◦ Subtract the time effect
◦ Current problem: more distortion is added due to the scaling problem. -> Histograms from Google have already been scaled. We have no information about the raw data.
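The slides do not define the "time effect". One simple reading, assumed here, is a linear growth trend that is fitted by least squares and subtracted from the series. The function name and data are illustrative, and the scaling caveat above still applies: if the input has already been rescaled, this step may add distortion.

```python
def detrend(series):
    """Subtract a least-squares straight line from the series (needs >= 2 points)."""
    n = len(series)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, series)]

# Made-up volumes: steady growth plus one burst at index 3.
growing = [2, 4, 6, 18, 10, 12, 14, 16]

print([round(v, 2) for v in detrend(growing)])
```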

Page 31: Future attempt: From Periodic to Non-periodic

Find the correlation between two non-periodic queries.
Proposed problem: some keywords are searched heavily after other keywords
◦ Example: “tsunami” is usually searched after “earthquake” is issued.
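A lagged correlation is one possible way to capture "searched after": shift the follower series back by a few steps and see which shift correlates best with the leader. This sketch assumes the Pearson coefficient, equal-length series, and a small maximum lag; the names and data are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson coefficient; returns 0.0 when either series is flat (an assumption)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def best_lag(leader, follower, max_lag=3):
    """Return (lag, scores): the delay at which follower best matches leader."""
    scores = {}
    for lag in range(max_lag + 1):
        # Compare leader[t] with follower[t + lag].
        a = leader[:len(leader) - lag]
        b = follower[lag:]
        scores[lag] = pearson(a, b)
    return max(scores, key=scores.get), scores

# Made-up volumes: each "tsunami" burst comes one step after an "earthquake" burst.
earthquake = [1, 9, 1, 1, 1, 1, 8, 1, 1, 1]
tsunami = [1, 1, 7, 1, 1, 1, 1, 6, 1, 1]

lag, scores = best_lag(earthquake, tsunami)
print(lag)
```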

Page 32: Future attempt: From Periodic to Non-Periodic

Histogram figures: Tsunami, Earthquake

Page 33: Potential Applications

Results re-ranking:
◦ Move results that are more up-to-date higher on the result list
    Example: when a user asks for Beyonce during the time of the Grammy -> results related to Grammy will have a higher rank
Server buffering:
◦ When a user queries Beyonce, web pages related to Grammy will be buffered on the local server, in the expectation that the user will eventually search for Grammy.

Page 34: Question?

Page 35: The End