16/03/22 Jean-Eudes Ranvier Planet Data - Madrid Trustworthiness assessment (on web pages) Task 3.3

Trustworthiness assessment (on web pages)


19/04/23 Jean-Eudes Ranvier, Planet Data - Madrid

Trustworthiness assessment (on web pages)

Task 3.3

Credibility assessment on web pages

Introduction

• The number of available data sources keeps increasing at a fast pace
  • Sensors embedded in mobile phones, websites, blogs, …
• Data becomes more valuable when combined from different sources
• What about the trustworthiness of this aggregated data?
  • Unknown data sources
  • No standard way to evaluate trustworthiness
  • Subjectivity of the data consumer
  • Strong economic incentive to lie
• Interesting case of the WWW
  • Web credibility assessment


What is the problem of web credibility?

• Non-credible websites represent a significant percentage of the web
• Credibility seen as an aggregation of objective and subjective components (Fogg)
  • Credibility = trustworthiness AND expertise
• Web users can be naïve or lazy and won’t try to verify information
• Focus on domains where expertise is hard for average users to evaluate
  • Medical treatments
  • Trading operations
  • Ideological assertions
• Economic / political interests are at stake


Background

• Trustworthiness components in the context of web credibility:
  • Y. Yamamoto and K. Tanaka. Enhancing credibility judgment of web search results.
    • Accuracy: referential importance
    • Authority: social reputation
    • Objectivity: content typicality
    • Currency: update frequency
    • Coverage: coverage of topic
  • M. J. Metzger. Making sense of credibility on the web: Models for evaluating online information and recommendations for future research.
    • Credentials
    • Advertisements
    • Design



Credibility assessment as a classification problem

• Use historical information on evaluations for future credibility assessment
• A machine learning approach
  • Binary classification
    • Users evaluate pages as credible or non-credible
  • Content-based features
    • Extracted programmatically from web pages
• Training set and test set
  • Leave-one-out cross-validation
  • Tested by category
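The evaluation protocol above (leave-one-out cross-validation, run separately within each category) could be sketched roughly as follows. This is only an illustration: the page dictionary layout, the `predict` callable and the `majority_label` placeholder classifier are all hypothetical and do not come from the slides.

```python
from collections import defaultdict


def leave_one_out_by_category(pages, predict):
    """Run leave-one-out cross-validation separately within each category.

    pages:   list of dicts with keys "category", "features", "label" (+1 or -1)
             (hypothetical layout, chosen for this sketch)
    predict: callable (training_pages, held_out_features) -> +1 or -1
    Returns {category: [(true_label, predicted_label), ...]}.
    """
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)

    results = {}
    for category, group in by_category.items():
        outcomes = []
        for k, held_out in enumerate(group):
            # Train on every page of the category except the held-out one.
            training = group[:k] + group[k + 1:]
            predicted = predict(training, held_out["features"])
            outcomes.append((held_out["label"], predicted))
        results[category] = outcomes
    return results


def majority_label(training, _features):
    """Placeholder classifier: predicts the most common training label."""
    total = sum(page["label"] for page in training)
    return 1 if total >= 0 else -1
```

Any classifier with the same call shape can be passed in place of `majority_label`, which is only there to make the sketch runnable.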


Feature selection

• Categories
  • Act as a filter: only pages from the same category are tested for similarity
• Keywords and entities in the document
  • Reflect the topic of the web page at a finer grain
• Sentiment analysis
  • Computed at the word level
  • Used in conjunction with keywords and entities
• Part of speech
  • Extra feature reflecting the overall structure of the web page
• Number of ads displayed (in progress)
  • They distract users from their activity and the page loses credibility
• Complexity of the CSS files (not included yet)
  • Pages with no structure tend to lose credibility
• PageRank
  • Google’s metric, which includes a credibility measure


Experimental setup

• Two machine learning algorithms
  • kNN item-item algorithm
    • Compute a similarity between pages
    • Take into account only the most similar pages
  • C4.5 decision tree
    • Has good performance in general
    • However, not suitable for multivalued features (keywords, entities)
    • Defined as a baseline
• Microsoft corpus
  • 1000 pages evaluated for credibility by experts and regular users
  • Divided into 5 topics
    • Top 40 pages retrieved by search engines for 5 queries
  • Rescaled from Likert scale [0;5] to binary scale {-1;1}
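The rescaling from the Likert scale to binary labels might look like the sketch below. The slides do not state which cut-off was used, so the `threshold` default of 2.5 (the midpoint of the scale) is purely an assumption, as is the function name.

```python
def likert_to_binary(rating, threshold=2.5):
    """Map a Likert rating in [0, 5] to -1 (non-credible) or +1 (credible).

    The threshold is an assumption; the slides do not give the cut-off.
    Ratings strictly above the threshold map to +1, all others to -1.
    """
    if not 0 <= rating <= 5:
        raise ValueError(f"rating {rating!r} is outside the [0, 5] scale")
    return 1 if rating > threshold else -1
```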


Content-based rating

• kNN item-item algorithm
  • Based on similarity between pages rated by the user
• Aggregated similarities
  • Based on the similarity of page features
  • Cosine similarity for monovalued features (POS, PageRank, …)
  • Jaccard similarity for multivalued features (keywords, entities)
  • Only positive similarities are taken into account

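\[
r_{u,i} = \frac{\sum_{j \in \text{similarItems}} s_{i,j} \, r_{u,j}}{\sum_{j \in \text{similarItems}} s_{i,j}}
\]

where r_{u,j} is the rating given by user u to page j and s_{i,j} is the aggregated similarity between pages i and j.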
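The item-item prediction described on this slide can be sketched as follows. The page layout (`"numeric"` and `"keywords"` keys), the unweighted mean used to aggregate the two similarities, and the default `k=3` are assumptions made for illustration; the slides do not specify how the similarities are aggregated or how many neighbours are kept.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity between two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def page_similarity(p, q):
    """Aggregate similarity: unweighted mean of both measures (assumption)."""
    numeric = cosine_similarity(p["numeric"], q["numeric"])
    multivalued = jaccard_similarity(p["keywords"], q["keywords"])
    return (numeric + multivalued) / 2


def predict_rating(target, rated_pages, k=3):
    """Similarity-weighted average of one user's ratings on the k nearest pages.

    rated_pages: list of (page, rating) pairs already rated by the user.
    Only strictly positive similarities are used, as stated on the slide.
    """
    scored = [(page_similarity(target, page), rating) for page, rating in rated_pages]
    positive = [(s, r) for s, r in scored if s > 0]
    neighbours = sorted(positive, key=lambda pair: pair[0], reverse=True)[:k]
    if not neighbours:
        return 0.0  # no similar page: no evidence either way
    return sum(s * r for s, r in neighbours) / sum(s for s, _ in neighbours)
```

With ratings in {-1; 1}, the sign of the returned value can be read as the predicted class.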


Evaluation

Preliminary results


Results

• Mixed results
  • Precision ~ 0.7, recall ~ 0.8
  • Impossible to predict the credibility accurately
  • Biased by the distribution of ratings over classes
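As a reminder of how these two metrics are computed, and of why a skewed class distribution can inflate them, here is a minimal sketch. The function name and the toy numbers in the usage note are illustrative only and are not taken from the experiments.

```python
def precision_recall(pairs, positive=1):
    """Compute precision and recall for the positive (credible) class.

    pairs: iterable of (true_label, predicted_label) tuples.
    """
    tp = fp = fn = 0
    for truth, predicted in pairs:
        if predicted == positive and truth == positive:
            tp += 1  # correctly predicted credible
        elif predicted == positive:
            fp += 1  # predicted credible, actually non-credible
        elif truth == positive:
            fn += 1  # missed a credible page
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, on a hypothetical set where 7 pages out of 10 are credible, a classifier that always answers "credible" reaches a precision of 0.7 and a recall of 1.0 without learning anything, which is one way the distribution of ratings over classes can bias the figures.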


Results

[Figure: bar chart of kNN precision, kNN recall, ML precision and ML recall (scale 0 to 1) per category: celebrities, environment, health, personal finance, politics]

• Tests on keywords + entities + sentiment
  • Similar results (Precision ~ 0.7, Recall ~ 0.8)


Results

[Figure: bar chart of kNN precision, kNN recall, ML precision and ML recall (scale 0 to 1) per category: celebrities, environment, health, personal finance, politics. Mixed results among classes]

• Tests on all features (POS + keywords + entities + sentiment)
  • Similar results (Precision ~ 0.7 and Recall ~ 0.8)


Future work

• Semantic distances
  • Pages seen as sets of concepts
  • Definition of a distance between two sets in the concept space
  • Similarity using a path distance in a concept hierarchy
• Social referrals
  • Use evaluations from other people
  • Weights based on their trustworthiness
  • Estimate page credibility based on beta reputation
• Combine reputation with classification approaches to obtain an aggregated metric
  • To get a better estimate of credibility than either component alone
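To make the social-referral idea more concrete, a beta reputation score can be sketched as below. It uses the expected value of a Beta(r + 1, s + 1) distribution, where r and s are the accumulated positive and negative feedback. Discounting each evaluation by a rater weight in [0, 1] is an assumption of this sketch, since the slides only say that weights are based on the raters' trustworthiness.

```python
def beta_reputation(evaluations):
    """Expected credibility of a page under a beta reputation model.

    evaluations: iterable of (is_credible, rater_weight) pairs, where
    rater_weight in [0, 1] stands for the rater's trustworthiness
    (hypothetical weighting scheme, chosen for this sketch).
    Returns a value in (0, 1); 0.5 means no evidence either way.
    """
    evaluations = list(evaluations)
    positive = sum(weight for credible, weight in evaluations if credible)
    negative = sum(weight for credible, weight in evaluations if not credible)
    # Mean of Beta(positive + 1, negative + 1).
    return (positive + 1) / (positive + negative + 2)
```

A score of this kind could then be combined with the classifier output, although how the two would be aggregated is left open here.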


Conclusion

• Project based on content-based aspects
• Results are promising, although there is room for improvement
  • Accuracy of the prediction
  • Time complexity of the implementation
• Several features remain unimplemented
  • Local extraction of features
  • Integration of new page features
  • Semantic aspect of web pages
