Privacy Policy Evaluation
A tool for automatic natural language privacy policy analysis
THE AIM AND THE LOOK AND FEEL
Purposes
• A tool to help the user perform a privacy level assessment of a website
• The tool should be able to:
– analyze the privacy policies of a website (written in natural language)
– verify their compliance with the Privacy Principles (PP)
• e.g. the ones specified by the EU
– verify their compliance with the user's preferences
[Diagram: envisioned architecture. Server side: the website publishes a natural language privacy policy and saves data in a DB. User side: the end-user defines his privacy preferences; TEXT ANALYSIS processes the policy, followed by GOALS EXTRACTION, PRIVACY EVALUATION, and PRIVACY ENFORCEMENT, guided by the EU Privacy Principles and Privacy Best Practice; a POLICY GENERATOR (e.g. SPARCLE) turns the set of goals into an <XACML> policy.]
[Diagram: refined architecture. Server side: the web server publishes a natural language privacy policy and saves data in a DB. User side: the end-user defines his preferences and reads the compliance results; TEXT MINING processes the policy, then PRINCIPLES COMPLIANCE CHECKING verifies it against the EU Privacy Principles and produces the COMPLIANCE RESULTS, followed by PRIVACY ENFORCEMENT.]
User's Properties Settings (mock-up)
PET Settings: select the controls you want to carry out on the privacy policy
– NOTICE: verify the presence of statements regarding the way your information is used and collected
– CHOICE
– USE
– CORRECTION
– SECURITY
– ENFORCEMENT
– THIRD PARTIES
User's Preferences View (mock-up)
PET Settings: select the controls you want to carry out on the privacy policy
– NOTIFICATION
– PURPOSE SPECIFICATION: verify the presence of statements specifying the purposes for which the data will be collected
– CONSENT
– ACCESS AND CORRECTION
– SECURITY
– LIMITED DATA COLLECTION
– THIRD PARTIES
– LIMITED TIME RETENTION
Privacy Policy Results View (mock-up)
PET Results: select the checks you want to carry out on the privacy policy
– NOTIFICATION
– PURPOSE SPECIFICATION: "Our purpose is collecting non-personally identifying information to better understand how our visitors use our website."
– CONSENT
– ACCESS AND CORRECTION
– SECURITY
– LIMITED DATA COLLECTION
– THIRD PARTIES
– LIMITED TIME RETENTION
– WARNING: "We may use and disclose medical information about you without your consent"
Example of a real privacy policy for a hospital
• Example 1:
– O'Connor (hospital) is the sole owner and user of all information collected on the Web site.
– O'Connor does not sell, give, or disclose the information (statistical or personal) gathered on the Web site to any third party.
• Example 2:
– We may use and disclose medical information about you without your consent if necessary for reviews preparatory to research, but none of your medical information would be removed from the Hospital in the course of such reviews.
(Key) Action list

Principle – Key words
Notice – collect, use, require, include, ask, describe, notify
Choice – opt-out, delete, remove, unsubscribe
Use – purposes, aiming, using for
Correction – modify, update, correct, access, change, review
Security – security, encryption, SSL, https, password
Enforcement – store, retention, control, protect
Transfer to Third Parties – share, sell, provide, disclose, publish, aggregate, access, send, tell, transfer
Threats - Key words
Sell, disclose, publish, send, aggregate, identify, reveal, violate, tell, transfer
Example of goal

Principle – Sentence
Notice – We collect your credit card information for the time necessary to deliver the product
Approaches
• Looking for key words (i.e. look for the verbs we are interested in)
• Full text analysis (i.e. analyze each sentence and verify to which goals, if any, it corresponds)
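The key-word approach can be sketched in a few lines. This is an illustrative toy using three of the principles from the key-word table above; the principle names and word lists come from the slides, while everything else (function name, naive substring matching) is an assumption, not the tool's actual implementation.

```python
# Toy sketch of the key-word approach: map each principle to its key
# words (taken from the slides) and report which principles a sentence
# touches. Matching is naive substring search, for illustration only.
PRINCIPLE_KEYWORDS = {
    "Notice": ["collect", "use", "require", "include", "ask", "describe", "notify"],
    "Choice": ["opt-out", "delete", "remove", "unsubscribe"],
    "Security": ["security", "encryption", "ssl", "https", "password"],
}

def match_principles(sentence: str) -> list[str]:
    """Return the principles whose key words occur in the sentence."""
    lowered = sentence.lower()
    return [p for p, words in PRINCIPLE_KEYWORDS.items()
            if any(w in lowered for w in words)]

print(match_principles("We collect your credit card information."))
# -> ['Notice']
```

Note that bare substring matching over-triggers (e.g. "use" inside "user"); the full-text-analysis approach exists precisely because key-word lookup alone is this crude.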
THE ARCHITECTURE
The Architecture
Privacy Evaluation GUI
Sentence Splitter
Text MiningMachine Learning Module
Sentence Layout Organization
Privacy Policy document User’s Settings
Results Shown to the user
Sentence Classification
TEXT MINING
Text Mining versus Data Mining
• Text mining is different from data mining
– Why?
• According to Witten, 'text mining' denotes any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns [1]
• Data mining looks for patterns in data, while text mining looks for patterns in text
• In data mining the information is hidden and automatic tools are needed to discover it, while in text mining the information is present and clear, not hidden at all
• In text mining the human is the problem: it is impossible for humans to read the whole text by themselves
– A good output from text mining is a comprehensible summary of the salient features in a large body of text
[1] I. H. Witten, "Text mining," in Practical Handbook of Internet Computing, Boca Raton, FL: Chapman & Hall/CRC Press, 2005, pp. 1–22.
Bag of Words
• A document can be represented as a bag of words:
– i.e. a list of all the words of the document with a counter of how often each word appears in the document
– Some words (the stop words) are discarded
– A query is seen as a set of words
– The documents are consulted to verify which ones satisfy the query
– A relevance-ranking technique allows the importance of each term to be assessed with respect to a collection of documents or to a single document
• this allows each document to be ranked according to the query
• In web search engines the query is usually composed of a few words, while in expert information retrieval systems queries may be far more complex
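The bag-of-words representation and query matching described above can be sketched minimally as follows, assuming a toy stop-word list and a simple regex tokenizer (both illustrative, not the tool's actual choices):

```python
# Minimal bag-of-words sketch: a document becomes a word -> count map
# with stop words discarded; a query "satisfies" a document if all of
# its words appear in the bag.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "to", "and", "is", "in", "we", "your"}

def bag_of_words(document: str) -> Counter:
    """Tokenize, drop stop words, and count the remaining words."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def satisfies(bag: Counter, query: str) -> bool:
    """A document satisfies a query if every query word is in its bag."""
    return all(word in bag for word in query.lower().split())

doc = "We collect your data and we protect your data with encryption."
bag = bag_of_words(doc)
print(bag["data"])                        # -> 2
print(satisfies(bag, "data encryption"))  # -> True
```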
Text categorization (1)
• It is the assignment of documents (or parts of documents) to predefined categories according to their contents
• Automatic text categorization helps detect the topics a document covers
– the dominant approach is to use machine learning to infer categories automatically from a training set of pre-classified documents
• the categories are simple labels with no associated semantics
• in our case the categories will be the individual privacy principles we are interested in
• Machine learning techniques for text categorization are based on rules and decision trees
Text categorization (2)
• Example for the principle "Transfer to Third Parties":
– if (share & data) or (sell & data) or (third party & data)
– then "THIRD PARTY"
• Rules like this can be produced using standard machine learning techniques
• The training data contains a substantial sample of documents (paragraphs) for each category
• Each document (or paragraph) is a positive instance for the category it represents and a negative one for all the other categories
• Typical approaches extract features from each document and use the feature vectors as input to a scheme that learns how to classify documents
– The key words can be the features and the word occurrences the feature values. Using the vector of features it is possible to build a model for each category
– the model predicts whether or not a category can be associated with a given document, based on the words in it and on their occurrences
– With the bag of words the semantics are lost (only words are accounted for), but experiments show that adding semantics does not significantly improve the results
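The hand-written rule and the key-word feature vectors above can be sketched as follows. The feature list and the rule come from the slide; in the real pipeline such rules would be learned by a tool such as WEKA rather than coded by hand, so this is purely illustrative.

```python
# Illustrative sketch: key-word occurrence counts as the feature vector,
# and the slide's hand-written rule for "Transfer to Third Parties".
FEATURES = ["share", "sell", "third party", "data"]

def feature_vector(paragraph: str) -> list[int]:
    """Occurrence count of each key word in the paragraph."""
    lowered = paragraph.lower()
    return [lowered.count(f) for f in FEATURES]

def is_third_party(paragraph: str) -> bool:
    """Rule from the slide: (share & data) or (sell & data) or (third party & data)."""
    share, sell, third, data = (c > 0 for c in feature_vector(paragraph))
    return (share and data) or (sell and data) or (third and data)

print(is_third_party("We may share your data with partners."))  # -> True
print(is_third_party("We encrypt all traffic."))                # -> False
```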
Approach to solve our problem
1. Category Definition
– Define the categories we are interested in
• e.g. the privacy principles
2. Annotation
– Select a certain number of privacy policies (verify the correct number in the literature) and annotate the sentences (in the policy) related to each category
3. Bag of words
– for each category define the bag of words (this can be an automated process)
– e.g. security has: {https, login, password, security, protocol….}
4. Training set
– The annotated documents are our training set
5. Machine Learning
– Apply machine learning algorithms to learn the classification rules
6. Usage
– use the classification rules to classify the sentences of new privacy policies into the right category
1. Category Definition
• The categories are chosen as "privacy principles" adapted from:
– Directive 95/46/EC
– Directive 2002/58/EC
– and the OECD Guidelines
Principle – Description

1. Notification
– Notify the data subject of the identity of the data controller
– Notify the data subject to whom his data are disclosed
– Give the data subject information about the categories of data collected

2. Purpose Specification
– The data controller should specify for which purposes the data will be collected
– Usage different from the original purposes is allowed only after the data controller (or the third party) has obtained a new permission from the user

3. Consent
– The data subject should give his consent to the data collection

4. Access and Correction
– The data subject has the right to access the data collected
– The data subject has the right to modify the data collected
– The data collected by the data controller should be kept up to date, and means to do so should be provided
– Third parties to whom data have been disclosed should be notified of any rectification or erasure of the data

5. Security
– The data controller must implement appropriate technical measures against unlawful use or data loss
– The level of security should be appropriate to the risks associated with the processing and to the nature of the data to be protected
– The data controller should take appropriate security measures (against unauthorized access or data loss) and inform the users about them (free of any charge)
– The data controller should inform the users of any special risk of a security breach

6. Third Parties
– The user should be fully informed when his data is transferred from one service provider to another
– Any transfer of data may take place only if the user is well aware of it, and only for the use and purposes for which the data were collected

7. Cookies
– Cookies are legitimate tools if the user is well informed of the information the cookie keeps and of the purposes for which such information is used
– Users should have the opportunity to refuse cookies, and the service provider has the right to block some services in that case
– The method of giving information, and of offering the option to refuse cookies, should be as user-friendly as possible

8. Limited Collection
– The amount of personal data collected should be reduced to the minimum possible

9. Direct Marketing
– SMS, emails, or other unsolicited communications for direct marketing may be sent only after the user has given his explicit consent
– Users should be given the possibility to withdraw their consent to direct marketing at any time

10. Limited Time Retention
– Data must be erased, or made anonymous, when no longer needed for the purposes for which it was collected
– Data controllers must inform the user for how long the data is needed for the processing
– Data may be stored only for the time necessary to fulfil that purpose
2. Annotation (a)
• One problem for the annotation part is verifying the quality of the annotators.
• Usually, the authors of the experiment are also the ones annotating the documents for the training set.
• Their quality has to be assessed by comparing their annotations with those of other users.
• A way to do this is presented in the following slides.
2. Annotation (b)
• Ask 4 or 5 people to annotate the same privacy policy
– For each sentence, annotate whether it is relevant for one of the ten principles (categories)
• Build a similarity matrix to verify how close the annotators are to each other
– The similarity can be measured with an F-value (or with cosine-similarity measures)
• Bad if < 0.5
• More or less OK if between 0.5 and 0.75
• Very good if > 0.75
2. Annotation (c)

     A0    A1    A2    A3    A4
A0   1     0.75  0.75  0.6   0.8
A1   …     1     0.8   …     …
A2   …     …     1     …     …
A3   …     …     …     1     …
A4   …     …     …     …     1

• If the annotations of A0 and A1 (i.e. the annotations of the authors of the experiment) are on average good, then the authors can be considered good annotators
• Let Fxy be the value in the cell AxAy (the agreement coefficient)
2. Annotation (d)
• How can we compute Fxy, the agreement value between the annotators Ax and Ay (with respect to a given category C)?
• If
– a = number of paragraphs both Ax and Ay say belong to category C
– b = number of paragraphs where Ax says belong to C and Ay says no
– c = number of paragraphs where Ax says do not belong to C and Ay says yes
– d = number of paragraphs where both Ax and Ay say do not belong to C

         Ay: Yes   Ay: No
Ax: Yes     a         b
Ax: No      c         d

• Then Fxy can be computed as (these formulas need to be verified, or better computations need to be found):
– Recall = a / (a + b) ?
– Precision = a / (a + c) ?
– F-value = ?
– Kappa-value = ?
• Maybe recall is the formula most useful in our case
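For reference, textbook definitions that could serve as candidates for the open entries above can be sketched as follows. Recall and precision follow the slide's tentative formulas; the F-value and Cohen's kappa are the standard definitions, offered only as candidates since the slide explicitly says the formulas still need to be verified.

```python
# Candidate agreement measures over the a/b/c/d table from the slide.
# F-value and Cohen's kappa use the standard textbook formulas; they are
# suggestions, not the formulas the authors have settled on. Assumes a
# non-degenerate table (no zero denominators).

def agreement(a: int, b: int, c: int, d: int) -> dict:
    n = a + b + c + d
    recall = a / (a + b)
    precision = a / (a + c)
    f_value = 2 * a / (2 * a + b + c)        # harmonic mean of the two
    p_o = (a + d) / n                        # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return {"recall": recall, "precision": precision,
            "f_value": f_value, "kappa": kappa}

# 8 paragraphs agreed in C, 8 agreed out of C, 2 disagreements each way:
print(agreement(a=8, b=2, c=2, d=8))
```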
3. Bag of Words
• A set of words relevant for the document
– Bag-of-words selection (a.k.a. feature selection) can be done by selecting all the words of a document (in our case a document is a sentence)
– Each word of the bag of words should be given a weight
– This process can be done automatically
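One standard automatic weighting scheme for such a bag of words is TF-IDF; the slides do not fix a scheme, so this is offered only as one common choice, not as the one the tool necessarily uses.

```python
# TF-IDF sketch: weight each word of each "document" (here, a tokenized
# sentence) by its in-sentence frequency times the log inverse of how
# many sentences contain it. Words appearing everywhere get weight 0.
import math

def tf_idf(sentences: list[list[str]]) -> list[dict]:
    """Return, per sentence, a word -> tf*idf weight map."""
    n = len(sentences)
    df = {}                                  # sentences containing each word
    for sent in sentences:
        for word in set(sent):
            df[word] = df.get(word, 0) + 1
    weights = []
    for sent in sentences:
        w = {}
        for word in sent:
            tf = sent.count(word) / len(sent)
            w[word] = tf * math.log(n / df[word])
        weights.append(w)
    return weights

docs = [["data", "security"], ["data", "sharing"]]
w = tf_idf(docs)
print(w[0]["security"] > w[0]["data"])  # -> True ("data" occurs everywhere)
```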
4. Training Set
• The training set is the set of privacy policies we annotate and to which we apply machine learning algorithms
• The policies have to be chosen according to one of two criteria:
– belonging to the same domain (e.g. e-health)
– belonging to domains as different as possible
• The size of the training set needs to be defined
5. Machine Learning
• WEKA can be used for machine learning, and WordNet for the relationships between words
• Cyc is a tool providing an ontology of all the knowledge of the world
– not recommended
• First of all, use the cosine similarity to make a sort of pre-test
• Look at the normal decision tree, which should give rules as a result (??)
Tools
• Weka
• TagHelper Tools
– download the program + the documentation
– Repeat the example shown in TagHelperManual.ppt to understand how it is possible to create a model starting from annotated documents
– The file AccessMyrecordsPrivacyPolicy.xls is the annotation result of the privacy policy available at http://www.accessmyrecords.com/privacy_policy.htm
– The model has been built on these annotations
– You can test the model using the file EMC2.xls
MASTER THESIS REQUIREMENTS
Task 1 – Objectives
1. Literature study on text mining
– I. H. Witten, "Text mining," in Practical Handbook of Internet Computing, Boca Raton, FL: Chapman & Hall/CRC Press, 2005, pp. 1–22.
2. Elisa will provide Yuanhao with a defined set of categories (privacy principles) and guidelines to follow to associate each sentence of a privacy policy with a category
3. Yuanhao will select 2 privacy policies (e-health domain) and will proceed to annotate them according to the categories and the guidelines indicated above
4. Elisa will also annotate these privacy policies
5. Four other people need to be chosen to annotate the 2 above-mentioned privacy policies (these people will be provided with the same annotation guidelines mentioned above)
Task 1 – OUTPUT
• At the end of Task 1 the 6 annotators (Elisa, Yuanhao + 4) will provide annotated privacy policies in the following format:
– an xls file with two columns: the first column indicates the category (e.g. choice, use, security) and the second column indicates the sentence of the privacy policy associated with that category
• Note that every sentence of the privacy policy needs to be associated with a category
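Assuming the two-column deliverable is exported from xls to CSV (an assumption; the task specifies an xls file), loading it into a category-to-sentences map could look like the sketch below. The file name is a placeholder.

```python
# Sketch of loading the two-column annotation deliverable, assuming a
# CSV export of the xls file ("annotations.csv" is a placeholder name).
import csv
from collections import defaultdict
from io import StringIO

def load_annotations(fileobj) -> dict:
    """Map each category to the list of sentences annotated with it."""
    by_category = defaultdict(list)
    for category, sentence in csv.reader(fileobj):
        by_category[category].append(sentence)
    return dict(by_category)

# In-memory stand-in for open("annotations.csv"):
sample = StringIO(
    "security,We use SSL encryption.\n"
    "choice,You may unsubscribe at any time."
)
print(load_annotations(sample)["security"])
# -> ['We use SSL encryption.']
```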
TASK 2 – Objectives
• Validation of Elisa and Yuanhao as annotators
– The method described in the slides Annotation (c) and (d) needs to be applied
– The annotators need to be compared
– The method to do that needs to be formalized
• i.e. a measure to verify the agreement level between two annotators needs to be found (e.g. precision, recall, F-measure)
Task 2 - OUTPUT
• Elisa and Yuanhao are shown to be valid annotators
• They can start annotating the training set of privacy policies
Task 3 – Objectives
• Annotate the privacy policies for the training set
– 10 privacy policies (in the e-health domain) may be sufficient (find support in the literature)
• Build the model using machine learning techniques

OUTPUT
The model for privacy policy categorization
HIGH PRIORITY
Task 3: Suggestions
• Read the paper (very useful):
– F. Sebastiani, "Machine learning in automated text categorization".
• The TagHelper tool can be useful (it is an open-source tool)
– Read the paper for technical details about it (not everything is of interest for us):
• C. Rosé et al., "Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning"
– Look at the classes FeatureSelector and EnglishTokenizer to see how the tool treats stop words and how it creates its own bag of words
– etc/English.stp contains the English stop words
– This system uses WEKA classifiers
• Rainbow
– a program that performs statistical text classification. It is based on the Bow library. For more information about obtaining the source and citing its use, see the Bow home page.
• Verify whether other tools are suitable for executing the following steps:
– Stop-word elimination
– Stemming
– Bag-of-words (feature) extraction
– Word reduction (if necessary)
– Machine learning model (classifier construction: verify which algorithm is best to use)
• Also, start writing the report about this so I can read it once back
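The preprocessing steps listed above (stop-word elimination, stemming, bag-of-words extraction) can be sketched end to end. The suffix-stripping "stemmer" here is a toy stand-in for a real stemmer such as Porter's, and the stop-word list is illustrative only.

```python
# Toy preprocessing pipeline for the steps above: tokenize, drop stop
# words, then strip a few common suffixes as a crude stand-in for a
# real stemmer (e.g. Porter's).
import re

STOP_WORDS = {"the", "a", "of", "to", "and", "we", "your", "may"}
SUFFIXES = ("ing", "tion", "es", "s")

def stem(word: str) -> str:
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(sentence: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("We may disclose your medical information to third parties."))
# -> ['disclose', 'medical', 'informa', 'third', 'parti']
```

The over-aggressive stems ("informa", "parti") show why a proper stemmer is on the tool-evaluation list.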
HIGH PRIORITY
TASK 4 – Objectives
• Apply the model (output of task 3) to new privacy policies
• Measure and document the quality of the model
– Precision
– Recall
– F-value
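The three quality measures above can be sketched for a single category, given gold and predicted sentence labels. This is a sketch of the standard definitions, not the thesis's actual evaluation code.

```python
# Per-category precision, recall and F-value of predicted sentence
# labels against gold labels (standard definitions).
def quality(gold: list[str], predicted: list[str], category: str) -> dict:
    tp = sum(g == category == p for g, p in zip(gold, predicted))
    fp = sum(p == category != g for g, p in zip(gold, predicted))
    fn = sum(g == category != p for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f_value": f_value}

gold = ["security", "choice", "security", "notice"]
pred = ["security", "security", "security", "notice"]
print(quality(gold, pred, "security"))
```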
HIGH PRIORITY
TASK 5 - Objectives
• Build the GUI able to analyze privacy policies and show the different categories (principles) accounted for by each policy
• The GUI may be web-based
• The user should have the possibility to set the categories (privacy principles) he is interested in verifying
• The GUI should be based on the look & feel shown in the first slides of this presentation