Strategy for systematic anonymisation of
multi-lingual interaction corpora.
C. Reffay (1), F.-M. Blondel (1), E. Giguet (2)
(1) STEF – ENS-Cachan / IFÉ – ENS-Lyon
(2) GREYC, Université Caen Basse-Normandie, CNRS
IC'2012 - C Reffay, F-M Blondel, E Giguet 2
Outline
• Introduction
• Anonymisation process
– Marking process
– Finding new forms
– Replacement process
• Testing the process on a Galanet session
• What did we learn? What works?
• Next step…
The corpus
• Galanet Session 2011-2012: “Nômades...nomadi...nómades... des langues” (Resp.: SandrineD)
• 4 teams: Italy, Brazil, France & Spain
• During 3.5 months
• 103 teenagers, 83 authors wrote… 915 messages containing (message body):
– Volume: 47 740 forms, 217 477 characters
– Lexicon: 9 655 distinct forms
The objective is to share!
But anonymisation is hard work (by hand)
– The corpus may be enormous
– Subtleties: homonyms & synonyms
Personal data cannot be shared
Anonymisation… the solution?
Software support is needed
Anonymisation purpose
• Hide personal information systematically
– Names (first names, last names, usernames…)
– Identifiers (Passport, National Student Number, …)
– Locations (city, street, address, coordinates)
– Institution/Workplace (school, sport club, firm, …)
– Contact references (e-mail, mobile, MSN, Skype, Twitter, telephone/fax)
– Explicit references (URL of homepages, blogs)
– Social media usernames (Facebook, MySpace, Hi5, Soundcloud, Badoo, Bebo, Friendster, Netlog, …)
• Maintaining text coherence and consistency
Personal data: examples
• {(f331s2970m2)2011-11-30T19:24 Gabibr Re: Quelques informations ... answers SandrineD (f331s2970m1)} “Eu amo a língua Francesa! Quem sabe falar francês me adiconem no meu FACEBOOK;) J'aime parler français! Qui peut parler français? M'ajouter dans FACEBOOK;) Nom: GABRIELA MEDEIROS.”
• {(f333s3016m2)2011-12-27T09:25 Miche Re: Les stéréotypes culinaires answers SandrineD (f333s3016m1)} “inviate i vostri documenti alla mia mail mikinessi@yahoo.it grazie!!!;)”
• {(f330s2914m8)2011-10-22T19:52 PBS Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Yo me llamo Peimikà Bibiana. Como mi madre es tailandesa y mi padre es italiano, mi primer nombre, Peimikà, es tailandés y significa " dueña del amor ", mientras mi según nombre, Bibiana, es italiano y procede del etrusco " vibius " que significa " vida ". Me gusta mucho tener dos nombres (en Italia es más usual tener un nombre) y sobre todo estoy orgullosa de los orígenes diferentes que tienen y que hacen mi nombre aún más particular (además Peimikà no es muy difundido en tampoco en Tailandia y tampoco Bibiana en Italia”
Just google it!
Peimikà Bibiana… google search (2)
Anonymisation Principles
1. All identified lexical forms must be (computationally) marked even if not modified by a replacement form.
2. Any reference (e.g. name, institution or location) may be imprecise enough to encompass several hundred people.
[Diagram: Original lexical form -> replaced by -> Replacement form; Mark]
Once anonymised, no participant may be identifiable by an external person
Anonymisation
• Before:
{(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes- nous? answers CarlaN (f330s2880m1)}
Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
• After:
{(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes- nous? answers ILG02 (f330s2880m1)}
Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…
Hypotheses
• A fully automated method does not exist for all corpora
• Some decisions have to be taken by the researcher, not by the software
• Accuracy of the method will be achieved only for a given context (e.g. Galanet)
• “Named entities” do not occur randomly
Let’s find the regularities interactively with the expert: the researcher
Concepts manipulated
• Existing objects: Institution, Participant, Public person, Relative, Street, City…
• Named entities: Name, Surname, Username, First name, Last name, Addresses, Tel. number, MSN…
• Lexical forms: Pedro, KellyM, Eli, Elô, Kelly, Bergamo, Canet, Rosa Luxembourg, 0609785643, …
(Diagram labels: Real world, Corpus, Reference)
Anonymisation process
• Corpus to anonymise
• Named entities transformation table (initial list of participants, usernames, institution…)
• Marking process (Process/Rules: discovering new forms) -> Corpus with marked entities
• Replacement process -> Anonymised corpus
Transformation table: example
Synonyms: the same entity has different forms
Homonyms: the same form refers to different entities
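The slides do not show the actual format of the transformation table. As a minimal sketch only, the synonym and homonym cases above could be represented as follows. The class name `Entity`, the identifiers and the replacement "Kitty" are hypothetical; only the forms KellyM, Kellly, Kelly and the replacements FLG01 and Kittty come from the examples in this presentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """One real-world object and the lexical forms that refer to it."""
    entity_id: str   # hypothetical identifier
    category: str    # e.g. "participant", "public person"
    # form -> replacement form; None means "marked but left unchanged"
    forms: Dict[str, Optional[str]] = field(default_factory=dict)


def entities_for(form: str, table: List[Entity]) -> List[Entity]:
    """Return every entity that the given lexical form may refer to."""
    return [entity for entity in table if form in entity.forms]


table = [
    # Synonyms: several forms, one entity ("Kitty" is an invented replacement)
    Entity("participant-KellyM", "participant",
           {"KellyM": "FLG01", "Kellly": "Kittty", "Kelly": "Kitty"}),
    # Public person: marked, but not replaced
    Entity("public-GeneKelly", "public person", {"Kelly": None}),
]

homonyms = entities_for("Kelly", table)   # one form, two candidate entities
synonyms = sorted(table[0].forms)         # three forms of the same entity
```

Looking up "Kelly" returns two entities, which is exactly the situation where the researcher, not the software, has to decide which one each occurrence refers to.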
Marking one form: Example (Kelly)
A- List of all occurrences (with their context) with a concordancer
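The authors use an existing concordancer for this step. Purely as an illustration of what such a listing contains, here is a small keyword-in-context sketch. The function name `concordance`, the message identifiers and the sample message bodies are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple


def concordance(form: str, messages: Dict[str, str],
                width: int = 30) -> List[Tuple[str, str, str, str]]:
    """List each occurrence of `form` with its left and right context.

    Matching is case-insensitive and restricted to whole words.
    """
    pattern = re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
    lines = []
    for message_id, body in messages.items():
        for match in pattern.finditer(body):
            left = body[max(0, match.start() - width):match.start()]
            right = body[match.end():match.end() + width]
            lines.append((message_id, left, match.group(0), right))
    return lines


# Hypothetical message bodies built from forms quoted in the slides
messages = {
    "m1": "Galdric, Kelly et Antonhy",
    "m2": "un film avec Gene Kelly",
    "m3": "je m'appelle Kellly",
}
hits = concordance("Kelly", messages)
```

The variant "Kellly" in the third message is not matched, which is why a separate step for discovering new forms is needed.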
Marking one form: Example (Kelly)
B- Update the transformation table (e.g. add the public person Gene Kelly)
Marking one form: Example (Kelly)
C- Associate each occurrence to the appropriate entity
(=> In the corpus: surround the occurrence with XML tags)
– Last name, normal form, unchanged: refers to the public person Gene Kelly
– First name, normal form, to be changed: refers to the participant KellyM
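The slides say that occurrences are surrounded by XML tags but do not give the tag vocabulary. The sketch below assumes a hypothetical `<ne>` element with `entity`, `type` and `change` attributes, and an illustrative sentence assembled from forms quoted in this presentation.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape, quoteattr

# (start, end, entity id, form type, to be changed?)
Span = Tuple[int, int, str, str, bool]


def mark_occurrences(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span in a hypothetical <ne> element."""
    parts, position = [], 0
    for start, end, entity_id, form_type, change in sorted(spans):
        parts.append(escape(text[position:start]))
        attributes = " ".join([
            "entity=" + quoteattr(entity_id),
            "type=" + quoteattr(form_type),
            "change=" + quoteattr("yes" if change else "no"),
        ])
        parts.append("<ne " + attributes + ">" + escape(text[start:end]) + "</ne>")
        position = end
    parts.append(escape(text[position:]))
    return "".join(parts)


text = "Galdric, Kelly et Antonhy regardent Gene Kelly"
first, last = text.index("Kelly"), text.rindex("Kelly")
marked = mark_occurrences(text, [
    (last, last + 5, "public-GeneKelly", "last name", False),
    (first, first + 5, "participant-KellyM", "first name", True),
])
```

Both occurrences are marked, in line with the first anonymisation principle, but only the participant's first name is flagged for replacement.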
Detecting new forms: 2 strategies
• Lexical rules: similar forms
– Eli -> Elô Ely ELY Seli
– Gabriela -> GABRIELA
– José -> Jose
• Context rules: similar context
– First names: “mi chiamo …”, “accord avec …”
– Cities: “Soy de …”, “vivo en …”, “j’habite à …”
1st Strategy: Lexical variation rules
103 known forms, including: Adriana Alèxia Anthony Baptiste Cleissa Eli… Elouise Emmanuel Federica Ferran Gabriela Guillem Iñigo Jaqueline Jean José Kelly Léo Mariana Mary Michela Monica Olalla Oleguer
31 new forms: adriana Alexia Antonhy baptiste Cleisa Elô Ely ELY Seli Louise MAnuel Federiac fran Fran GABRIELA guillem iñigo Jacqueline jean Jose Kellly Leo léo MariAna mary May Miche michelina moni olalla oleguer
2nd Strategy : Context rules
103 Known first names (Adrià, …, Veronica)
145 contexts: Left/Right
Total: more than 250 tested rules
15 good new forms: Antonhy Belle Bet Christine Fede Federiac Kellly Leo Line Maria May Peimikà Regina fran jean léo
47 rules approved
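As a rough sketch of how a left-context rule could be tested, the code below collects the forms that follow a context such as “llamo” and scores the rule as the share of candidates that are known or validated first names. The function names, the scoring over distinct forms and the second and third sample messages are assumptions; validation of new forms remains a manual decision by the researcher.

```python
import re
from typing import List, Set


def candidates_after(context: str, messages: List[str]) -> List[str]:
    """Return the distinct forms that directly follow a left context."""
    words = [re.escape(word) for word in context.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\s+(\w+)",
                         re.IGNORECASE)
    found = []
    for body in messages:
        for match in pattern.finditer(body):
            if match.group(1) not in found:
                found.append(match.group(1))
    return found


def rule_score(candidates: List[str], known: Set[str],
               validated_new: Set[str]) -> float:
    """Share of candidates that are known or manually validated names."""
    if not candidates:
        return 0.0
    hits = [c for c in candidates if c in known or c in validated_new]
    return len(hits) / len(candidates)


messages = [
    "Yo me llamo Peimikà Bibiana.",   # quoted in the slides
    "Hola, me llamo Kelly",           # invented for the example
    "no me llamo así",                # invented for the example
]
candidates = candidates_after("llamo", messages)
score = rule_score(candidates, known={"Kelly"}, validated_new={"Peimikà"})
```

Here the rule proposes three candidates, one already known, one new and validated, and one common word, so a researcher would have to reject the last one by hand.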
Replacing process
• Before:
{(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes- nous? answers CarlaN (f330s2880m1)}
Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
• After:
{(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes- nous? answers ILG02 (f330s2880m1)}
Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…
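A minimal sketch of the replacement step, assuming the corpus has already been marked with a hypothetical `<ne entity="…" change="yes|no">` element and that replacement forms are stored per (entity, form) pair. The trailing asterisk mirrors the "After" example above; the sample sentence and the function name are invented for illustration.

```python
import re
from typing import Dict, Tuple

# Hypothetical markup: <ne entity="ID" change="yes|no">form</ne>
TAG = re.compile(r'<ne entity="([^"]*)" change="(yes|no)">(.*?)</ne>')


def replace_marked(marked_text: str,
                   replacements: Dict[Tuple[str, str], str]) -> str:
    """Substitute forms marked change="yes"; leave the others unchanged."""
    def substitute(match):
        entity_id, change, form = match.groups()
        if change == "no":
            return form
        # A KeyError here means a form was marked for change
        # without a replacement in the transformation table.
        return replacements[(entity_id, form)] + "*"
    return TAG.sub(substitute, marked_text)


marked = ('je m\'appelle <ne entity="KellyM" change="yes">Kellly</ne>, '
          'comme Gene <ne entity="GeneKelly" change="no">Kelly</ne>')
anonymised = replace_marked(marked, {("KellyM", "Kellly"): "Kittty"})
```

Keying replacements on the form as well as the entity lets a misspelt variant such as Kellly keep a matching misspelt replacement, which helps preserve text coherence.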
Conclusion
1. A new process/algorithm for anonymisation
2. Test hypotheses against a first corpus
– 47 rules approved for first names => 15 new forms
– 103 first names => 31 existing derivations
– Anonymisation not 100% automatic => confirmed
3. Is anonymisation possible in a world with Google?
– Use Google to evaluate the frequency of a first name!
Next steps…
• Finalize concrete anonymisation of this corpus
– Discuss some choices with SandrineD for usernames, cities, email addresses, …
– Get feedback from SandrineD
• Verify on a bigger (Galanet) corpus:
– The process
– The rules
• Co-develop the tool:
– within the research community…
– in the (ANR) CORDIAL project?
Grazie !
More precisely
Discovering new forms: 2 strategies
103 known first names (Adrià, …, Veronica)
• Lexical rules
• Context rules
317 candidates
145 contexts: Left/Right
– Left: one form: 75 => 13780 occ.
– Left: 2-form sequence: 123 => 1700 occ.
– Total: more than 250 tested rules
• 50 Auto: 34 frequent words, 16 known
• 200 Easy: 180 common words, 20 username
• 67 Tests: 5 common, 31 good new forms, 1 relative new (Maria), 30 public names
47 rules approved
15 good new forms
Contexts of 145 occ. of 103 first names (using TXM, case insensitive)
The corpus lexicon
• A list of (lexical form ► frequency)
– de ► 1015
– que ► 965
– la ► 673
– …
– porque ► 48
– …
– Addams ► 1
9 655 unique forms
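The lexicon in the presentation was produced with existing tools. As an illustration only, a comparable frequency list can be sketched with a naive word tokeniser. The `\w+` tokenisation, the function name and the sample messages are assumptions and will not reproduce the counts above.

```python
import re
from collections import Counter
from typing import Iterable


def build_lexicon(message_bodies: Iterable[str]) -> Counter:
    """Count each distinct lexical form (case-sensitive) in the messages."""
    lexicon = Counter()
    for body in message_bodies:
        lexicon.update(re.findall(r"\w+", body))
    return lexicon


# Invented sample using contexts quoted in the slides
lexicon = build_lexicon([
    "Soy de Bergamo",
    "vivo en Bergamo, soy de Italia",
])
total_forms = sum(lexicon.values())   # volume
distinct_forms = len(lexicon)         # lexicon size
most_frequent = lexicon.most_common(2)
```

Keeping the count case-sensitive treats "Soy" and "soy" as two forms, which matters later because case variants of first names are handled as separate new forms.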
Who is concerned?
“Computer applications for pedagogical and educational purposes make use of data that allow natural persons to be identified directly, but also indirectly. Particular attention must be paid to the collection of sensitive data and to the methods used to anonymise the data.”
(Mallet-Poujol 2004: p 21, translated from French)
For more information, see European Directive 95/46/EC
Legal context (95/46/EC)
• (Art7) Member States shall provide that personal data may be processed only if: the data subject has unambiguously given his consent;…
• (Art8) Member States shall prohibit the processing of personal data revealing sensitive information (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life)
• (Art10) […] Inform the data subject on:
– The identity of the controller of the data collection
– The purposes of the processing
– The recipients or categories of recipients of the data
– The existence of the right of access to and the right to rectify the data concerning him
Text coherence and consistency
• {(f330s2914m11)2011-10-20T16:43 M_Cavalcanti Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “aaah, o meu é uma homenagem a uma de minhas tias e minha avó que se chamam Ana e ao resto de minhas tias que se chamam Maria. Daí, Mariana:)”
• {(f330s2914m10)-2011-10-20T21:06 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Gostei da criatividade da sua mãe MariAna! Rsrsrs”
• {(f330s2914m3)2011-10-28T00:54 LineCosta Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Ah meu nome é em homenagem a Jacqueline Kennedy, esposa do ex- presidente dos EUA, e também porque sempre foi um dos nomes preferidos do meu pai.: D”
• {(f330s2914m18)2011-10-19T20:36 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Bem, minha mãe queria que meu nome começasse com a letra E (como o dela!), um certo dia ela viu o nome de uma atriz brasileira chamada Louise Cardoso. Gostou do " Louise ", mas queria com a letra E, então ficou " Elouise "! Só depois, quando eu cresci é que descobri que meu nome era de origem francesa.. Hahaha”
TXM: http://textometrie.ens-lyon.fr/
Named entities
A named entity is a lexical form identifying a precise object (first/last name,
communication ref., city, institution, etc.)
Examples:
Names: Christophe, Blondel, Giguet, Paris,
Communication ref.: 0678600614, …
Location: Grenoble, Paris, Parigi, …
Institution: ENS Cachan, CNRS, …
Managing named entities
• Homonyms refer to different objects
– In the corpus we have 2 participants named “Guillem”: the same first name refers to different persons.
– In “Gene Kelly”, Kelly = public person last name
– In “Galdric, Kelly et Antonhy”, it’s a participant first name
• Different synonyms refer to the same object
– Kellly & Kelly
– Anthony & Antonhy
– Elô & Elouise
Referring to global entities
Overall method and tools
1. Define a process/algorithm for anonymisation
2. Test hypotheses against a first corpus
– Using existing tools (Excel, TXM/Calico, Notepad++)
– Doing much of the work by hand (with automation in mind)
– Facing/solving/avoiding problems
– Evaluating/suggesting (new) hypotheses
3. Discuss the result with the original researcher
4. Verify on a second (bigger) corpus
5. Co-develop the tool within the research community
Find Nei/nei with a concordancer
All occurrences refer to the Italian common word “nei”
Another example
• {(f330s2914m5)2011-10-23T21:52 CR_Martins Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Meu nome é Cleissa Regina, Cleissa porque minha mãe viu na tv uma repórter chamada Cleisa e achou parecido com o nome dela, Cléia e Regina porque o nome do meu pai é Reginaldo. Assim como a PBS gosto muito de ter 2 nomes e Cleissa é bem raro, nunca conheci ninguém chamado assim.”
Peimikà Bibiana… a unique case? No! Let’s try Cleissa Regina…
How to detect new forms?
• Lexical rules (look for similar forms):
– Ignoring accents (e.g. José, Jose)
– Ignoring case (e.g. José, jose, JOSÉ, …)
– Levenshtein distance between 2 forms: number of added, missing or inverted characters
– For forms shorter than 5 characters: distance <= 1
– For forms of 5 characters or more: distance <= 2
• Context rules (e.g. “mi chiamo …”, “merci …”)
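The lexical rule can be sketched as follows: compute a standard Levenshtein distance on the upper-cased forms and accept a candidate when the distance is within the length-dependent threshold. Taking the length of the known form for the threshold, and the function names, are assumptions; the slides do not say which of the two forms is measured.

```python
from typing import Dict, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def is_variant(known: str, candidate: str) -> bool:
    """Apply the case-insensitive distance threshold to one pair of forms."""
    limit = 1 if len(known) < 5 else 2   # assumption: length of known form
    return levenshtein(known.upper(), candidate.upper()) <= limit


def find_variants(known_forms: Iterable[str],
                  lexicon: Iterable[str]) -> Dict[str, List[str]]:
    """Propose, for each known form, the lexicon forms that may be variants."""
    known_forms, lexicon = list(known_forms), list(lexicon)
    return {known: sorted(form for form in lexicon
                          if form not in known_forms and is_variant(known, form))
            for known in known_forms}


proposals = find_variants(["Kelly", "José", "Gabriela"],
                          ["Kellly", "Jose", "GABRIELA", "porque", "de"])
```

With such loose thresholds, short first names will also match common words, so the proposals are candidates for the researcher to validate rather than final decisions.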
Lexical variations 1/2
Known | New | Levenshtein distance (upper-cased) | Levenshtein distance (exact) | Nb differences (case, accents, add/sup/inv)
Adriana | adriana | 0 | 1 | 1
Alèxia | Alexia | 1 | 1 | 1
Anthony | Antonhy | 2 | 2 | 2
Baptiste | baptiste | 0 | 1 | 1
Cleissa | Cleisa | 1 | 1 | 1
Eli | Elô | 1 | 1 | 1
Eli | Ely | 1 | 1 |
Eli | ELY | 1 | 2 | 1
Eli | Seli | 1 | 2 | 1 1
Elouise | Louise | 1 | 2 | 1 1
Emmanuel | MAnuel | 2 | 4 | 2 2
Federica | Federiac | 2 | 2 | 2
Ferran | fran | 2 | 3 | 1 2
Ferran | Fran | 2 | 2 | 2
Lexical variations 2/2
Known | New | Levenshtein distance (upper-cased) | Levenshtein distance (exact) | Nb differences (case, accents, add/sup/inv)
Gabriela | GABRIELA | 0 | 7 | 7
Guillem | guillem | 0 | 1 | 1
Iñigo | iñigo | 0 | 1 | 1
Jaqueline | Jacqueline | 1 | 1 | 1
Jean | jean | 0 | 1 | 1
José | Jose | 1 | 1 | 1
Kelly | Kellly | 1 | 1 | 1
Léo | Leo | 1 | 1 | 1
Léo | léo | 0 | 1 | 1
Mariana | MariAna | 0 | 1 | 1 2
Mary | mary | 0 | 1 | 1 1
Mary | May | 1 | 1 | 1
Michela | Miche | 2 | 2 | 2
Michela | michelina | 2 | 3 | 1 2
Monica | moni | 2 | 3 | 1 2
Olalla | olalla | 0 | 1 | 1
Oleguer | oleguer | 0 | 1 | 1
Some good context rules (1/3)
Context | Total | Known | New | New forms detected | Accuracy
sou <F> | 10 | 2 | | | 20%
appelle <F> | 9 | 4 | 1 | Kelly | 56%
Cara <F> | 7 | 1 | 1 | May | 29%
Ciao <F> | 6 | 1 | | | 17%
Merci <F> | 9 | 1 | 2 | Belle, léo | 44%
soy <F> | 5 | 2 | | | 40%
equipe <F> | 5 | 1 | | | 20%
Hombre <F> | 4 | 1 | | | 25%
dicho <F> | 3 | 1 | | | 33%
llamo <F> | 3 | 2 | 1 | Peimikà | 100%
appel <F> | 3 | 1 | | | 33%
raison <F> | 3 | 1 | | | 33%
choix <F> | 3 | 1 | | | 33%
chamam <F> | 2 | 1 | 1 | Maria | 100%
tampoco <F> | 2 | 1 | | | 50%
Some good context rules (2/3)
Context | Total | Known | New | New forms detected | Accuracy
{BOM} <F>, | 62 | 8 | 1 | Fede | 15%
je m’appelle <F> | 5 | 5 | | | 100%
Accord avec <F> | 9 | 4 | 1 | Bet | 56%
Concordo com a <F> | 3 | 2 | 1 | Line | 100%
meu nome é <F> | 3 | 2 | | | 67%
moi c’est <F> | 2 | 2 | | | 100%
<F>, ho | 8 | 2 | | | 25%
<F>, j’habite | 2 | 2 | | | 100%
<F>, je | 8 | 2 | | | 25%
je m’appel <F> | 1 | 0 | 1 | jean | 100%
suis avec <F> | 2 | 1 | | | 50%
<F> a dit | 1 | 1 | | | 100%
dit el <F> | 1 | 1 | | | 100%
diu el <F> | 1 | 1 | 1 | | 100%
nombre, <F> | 2 | 1 | 1 | Peimikà | 100%
Generic context rules
Context | Total | Known | New | New forms detected | Accuracy
<F>, <Known> | 15 | 2 | 1 | Regina | 20%
<Known> i <F> | 3 | 1 | | | 33%
<F> i <Known> | 1 | 1 | | | 100%
<Known> et <F> | 6 | 2 | 2 | Antonhy, Leo | 67%
<F> et <Known> | 3 | 2 | 1 | Federiac | 100%
<Known> e <F> | 3 | 1 | | | 33%
<F> e <Known> | 3 | 1 | | | 33%