Trustworthiness assessment (on web pages). Task 3.3.
19/04/23 Planet Data - Madrid 2
Credibility assessment on web
pages
Introduction
• The number of available data sources keeps increasing at a fast pace
  • Sensors embedded in mobile phones, websites, blogs, …
• Data becomes more valuable when combined from different sources
• What about the trustworthiness of this aggregated data?
  • Unknown data sources
  • No standard way to evaluate trustworthiness
  • Subjectivity of the data consumer
  • Strong economic incentive to lie
• Interesting case of the WWW
  • Web credibility assessment
What is the problem of web credibility?
• Non-credible websites represent a significant share of the web
• Credibility seen as an aggregation of objective and subjective components (Fogg)
  • Credibility = trustworthiness AND expertise
• Web users can be naïve or lazy and won't try to verify information
• Focus on domains where expertise is hard for average users to evaluate
  • Medical treatments
  • Trading operations
  • Ideological assertions
• Economic and political interests are at stake
Background
• Trustworthiness components in the context of web credibility:
  • Y. Yamamoto and K. Tanaka. Enhancing credibility judgment of web search results.
    • Accuracy: referential importance
    • Authority: social reputation
    • Objectivity: content typicality
    • Currency: update frequency
    • Coverage: coverage of topic
  • M. J. Metzger. Making sense of credibility on the web: Models for evaluating online information and recommendations for future research.
    • Credentials
    • Advertisements
    • Design
Credibility assessment as a classification problem
• Use historical information on evaluations for future credibility assessment
• A machine learning approach
  • Binary classification
    • Users evaluate pages as credible or non-credible
  • Content-based features
    • Extracted programmatically from web pages
• Training set and test set
  • Leave-one-out cross-validation
  • Tested by category
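The evaluation protocol named above (leave-one-out cross-validation, run separately for each category) could be sketched as follows. This is a minimal illustration and not the project's code: the `predict` hook, the `majority_vote` stand-in classifier, and the toy pages are all hypothetical.

```python
from collections import defaultdict


def leave_one_out(pages, predict):
    """Run leave-one-out cross-validation separately for each category.

    pages:   list of (category, features, label) with label in {-1, 1}
    predict: hypothetical hook, predict(train, features) -> label,
             where train is a list of (features, label) pairs
    Returns {category: [(true_label, predicted_label), ...]}.
    """
    by_category = defaultdict(list)
    for category, features, label in pages:
        by_category[category].append((features, label))

    results = {}
    for category, items in by_category.items():
        outcomes = []
        for k, (features, label) in enumerate(items):
            # Hold out page k, train on every other page of the category.
            train = items[:k] + items[k + 1:]
            outcomes.append((label, predict(train, features)))
        results[category] = outcomes
    return results


def majority_vote(train, _features):
    """Stand-in classifier: predicts the most frequent training label."""
    total = sum(label for _, label in train)
    return 1 if total >= 0 else -1


# Toy data, invented for the example.
pages = [
    ("health", {"ads": 0}, 1),
    ("health", {"ads": 5}, -1),
    ("health", {"ads": 1}, 1),
    ("politics", {"ads": 2}, 1),
    ("politics", {"ads": 3}, 1),
]
results = leave_one_out(pages, majority_vote)
```

Each page is predicted exactly once, from a model that never saw it, which suits small per-category sets where a fixed train/test split would leave too few examples.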
Feature selection
• Categories
  • Act as a filter: only pages from the same category are tested for similarity
• Keywords and entities in the document
  • Reflect the topic of the web page at a finer grain
• Sentiment analysis
  • Computed at the word level
  • Used in conjunction with keywords and entities
• Part of speech
  • Extra feature reflecting the overall structure of the web page
• Number of ads displayed (in progress)
  • Ads distract users from their activity, and the page loses credibility
• Complexity of the CSS files (not included yet)
  • Pages with no structure tend to lose credibility
• PageRank
  • Google's metric, which includes a measure of credibility
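One possible way to hold these features in memory is sketched below. The class name, the field names, and the choice to key sentiment by term are assumptions made for illustration; the slides do not describe the actual data structure. The ad count and CSS complexity are left out because the slides mark them as not yet included.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PageFeatures:
    """Hypothetical container for the content-based features of one page."""
    category: str                                   # used as a filter
    keywords: Set[str] = field(default_factory=set)            # multivalued
    entities: Set[str] = field(default_factory=set)            # multivalued
    # Assumption: word-level sentiment stored per keyword or entity.
    term_sentiment: Dict[str, float] = field(default_factory=dict)
    # Assumption: part of speech summarised as a ratio per tag.
    pos_ratios: Dict[str, float] = field(default_factory=dict)
    pagerank: float = 0.0


def candidates(target: PageFeatures, corpus: List[PageFeatures]) -> List[PageFeatures]:
    """Apply the category filter: keep only other pages of the same category."""
    return [p for p in corpus if p is not target and p.category == target.category]


corpus = [
    PageFeatures("health", keywords={"flu"}),
    PageFeatures("health", keywords={"vaccine"}),
    PageFeatures("politics", keywords={"election"}),
]
same = candidates(corpus[0], corpus)
```

Keeping the multivalued features as sets and the others as numbers mirrors the split between set-based and vector-based similarity measures.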
Experimental setup
• Two machine learning algorithms
  • kNN item-item algorithm
    • Computes a similarity between pages
    • Takes into account only the most similar pages
  • C4.5 decision tree
    • Performs well in general
    • However, not suitable for multivalued features (keywords, entities)
    • Used as a baseline
• Microsoft corpus
  • 1000 pages evaluated for credibility by experts and regular users
  • Divided into 5 topics
    • Top 40 pages retrieved by search engines for 5 queries
  • Ratings rescaled from Likert scale [0;5] to binary scale {-1;1}
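The corpus figures and the rescaling step can be checked with a short sketch. Two things here are assumptions, not facts from the slides: that the 5 queries are issued per topic (which is what makes the total come to 1000), and the cut-off of 3 used to split the Likert scale, since the slides do not say where the boundary lies.

```python
TOPICS = 5
QUERIES_PER_TOPIC = 5   # assumption: 5 queries for each of the 5 topics
PAGES_PER_QUERY = 40    # top 40 results per query

corpus_size = TOPICS * QUERIES_PER_TOPIC * PAGES_PER_QUERY


def to_binary(rating, threshold=3):
    """Map a Likert rating in [0, 5] to the binary scale {-1, 1}.

    The threshold is hypothetical; ratings at or above it count as credible.
    """
    if not 0 <= rating <= 5:
        raise ValueError("rating must lie in [0, 5]")
    return 1 if rating >= threshold else -1


binary_labels = [to_binary(r) for r in [0, 2, 3, 5]]
```

Moving the threshold changes the balance between the two classes, which matters when reading precision and recall.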
Content-based rating
• kNN item-item algorithm
  • Based on similarity between pages rated by the user
• Aggregated similarities
  • Based on the similarity of page features
  • Cosine similarity for single-valued features (POS, PageRank, …)
  • Jaccard similarity for multivalued features (keywords, entities)
  • Only positive similarities are taken into account
$$\hat{r}_{u,i} = \frac{\sum_{j \in \mathrm{similarItems}} s_{i,j}\, r_{u,j}}{\sum_{j \in \mathrm{similarItems}} s_{i,j}}$$
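The rating scheme on this slide can be sketched as a similarity-weighted average over the most similar rated pages. The slides do not say how the cosine and Jaccard scores are aggregated, so the unweighted mean in `page_similarity` is an assumption, as are the dictionary layout of a page, the function names, and the default `k`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def page_similarity(p, q):
    """Assumed aggregation: plain mean of the two feature similarities."""
    return (cosine(p["numeric"], q["numeric"]) + jaccard(p["terms"], q["terms"])) / 2


def predict(target, rated, k=3):
    """Similarity-weighted average of the ratings of the k nearest pages.

    rated: list of (page, rating) pairs already rated by the user.
    Pages with non-positive similarity are ignored.
    """
    scored = [(page_similarity(target, page), rating) for page, rating in rated]
    positive = [pair for pair in scored if pair[0] > 0]
    neighbours = sorted(positive, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(s for s, _ in neighbours)
    if total == 0:
        return 0.0  # no usable neighbour: no evidence either way
    return sum(s * r for s, r in neighbours) / total


target = {"numeric": [1.0, 0.0], "terms": {"flu", "vaccine"}}
rated = [
    ({"numeric": [1.0, 0.0], "terms": {"flu", "vaccine"}}, 1),   # similarity 1.0
    ({"numeric": [0.0, 1.0], "terms": {"flu"}}, -1),             # similarity 0.25
    ({"numeric": [-1.0, 0.0], "terms": set()}, -1),              # negative, dropped
]
score = predict(target, rated)
```

The sign of `score` gives the binary prediction; its magnitude indicates how strongly the neighbours agree.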
Results
• Mixed results
  • Precision ~ 0.7, recall ~ 0.8
  • Credibility cannot be predicted accurately
  • Biased by the distribution of ratings over classes
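To see why a skewed class distribution makes precision and recall hard to interpret, consider a trivial classifier that labels every page as credible. The labels below are invented for illustration and are not taken from the corpus.

```python
def precision_recall(truth, predicted, positive=1):
    """Precision and recall for the positive class."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example: 7 credible pages and 3 non-credible ones.
truth = [1] * 7 + [-1] * 3
always_credible = [1] * len(truth)

precision, recall = precision_recall(truth, always_credible)
```

With 70% of the pages in the positive class, this uninformative baseline already reaches a precision of 0.7 and a recall of 1.0, so scores in that range need to be compared against the class proportions before drawing conclusions.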
Results
[Figure: bar chart of kNN precision, kNN recall, ML precision and ML recall (scale 0 to 1) for the categories celebrities, environment, health, personal finance and politics]
• Tests on keywords + entities + sentiment
  • Similar results (Precision ~ 0.7, Recall ~ 0.8)
Results
[Figure: bar chart of kNN precision, kNN recall, ML precision and ML recall (scale 0 to 1) for the categories celebrities, environment, health, personal finance and politics]
Mixed results among classes
• Tests on all features (POS + keywords + entities + sentiment)
  • Similar results (Precision ~ 0.7, Recall ~ 0.8)
Future work
• Semantic distances
  • Pages seen as sets of concepts
  • Definition of a distance between two sets in the concept space
  • Similarity using a path distance in a concept hierarchy
• Social referrals
  • Use evaluations made by other people
  • Weights based on their trustworthiness
  • Estimate page credibility based on beta reputation
• Combine reputation with classification approaches into an aggregated metric
  • To get a better estimate of credibility than either component alone
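As a rough illustration of the social-referral idea, the sketch below uses the expected value of a beta distribution, (r + 1) / (r + s + 2), where r and s are the accumulated positive and negative evidence. Scaling each evaluation by the trustworthiness of its author is one possible reading of the weighting bullet and is an assumption, as are the function name and the input layout.

```python
def beta_reputation(evaluations):
    """Expected credibility of a page from weighted binary evaluations.

    evaluations: iterable of (label, weight) pairs, with label in {-1, 1}
                 and weight in [0, 1] standing for the evaluator's
                 trustworthiness (hypothetical input layout).
    Returns a value in (0, 1); 0.5 means no evidence either way.
    """
    evaluations = list(evaluations)
    positive = sum(weight for label, weight in evaluations if label == 1)
    negative = sum(weight for label, weight in evaluations if label == -1)
    return (positive + 1) / (positive + negative + 2)


no_evidence = beta_reputation([])
mostly_positive = beta_reputation([(1, 1.0), (1, 1.0), (-1, 1.0)])
discounted = beta_reputation([(1, 1.0), (-1, 0.5)])
```

A score of this kind could then be combined with the classifier output, for example as an extra feature or as a weighted average, which the slides leave open.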
Conclusion
• Project focused on content-based aspects
• Results are promising, although there is room for improvement
  • Accuracy of the prediction
  • Time complexity of the implementation
• Several features remain unimplemented
  • Local extraction of features
  • Integration of new page features
  • Semantic aspect of web pages