Anonymising CMC corpora: A reasonable way to share
Christophe Reffay
STEF (ENS Cachan), IFÉ (ENS Lyon)
14 feb. 2013 Dortmund'2013 - [email protected] 2
Outline
• Introduction:
– The corpus example
– Anonymisation objective and purpose
– Anonymisation principles and hypotheses
– The Mulce “anony” XML schema
• Anonymisation process
– Marking process
– Finding new forms
– Replacement process
• Conclusion? Suggestions? => Perspectives
The corpus (1/2)
• A GALANET session:
– Based on Inter-Comprehension principles:
  • Write mainly in your mother language
  • Read and try to understand others
– Discussion forum and collaborative production
– For teenagers (~16 years old)
• Multilingual interaction messages
• Many typos & grammatical errors
• Use of internet language (smileys, …)
The corpus (2/2)
• Galanet Session 2011-2012: “Nômades...nomadi...nómades... des langues”
• 4 teams: Italy, Brazil, France & Spain
• During 3.5 months,
• 103 teenagers, 83 authors wrote…
915 messages containing (message body):
• Volume: 47 740 tokens, 217 477 characters
• Lexicon: 9 655 distinct types
The objective is to share!
Personal data are not sharable
Anonymisation… the solution?
But anonymisation is hard work (by hand)
– The corpus may be enormous
– Subtleties: homonyms & synonyms
Need software to support it
Anonymisation purpose
• Hide personal information systematically
– Names (first names, last names, usernames…)
– Identifiers (Passport, National Student Number, …)
– Locations (city, street, address, coordinates)
– Institution/Workplace (school, sport club, firm, …)
– Contact references (e-mail, mobile, MSN, skype, twitter, telephone/fax)
– Explicit references (URL of homepages, blogs)
– Social media usernames (facebook, MySpace, Hi5, Soundcloud, Badoo, Bebo, Friendster, Netlog, …)
• Maintaining text coherence and consistency
Personal data: examples
• {(f331s2970m2)2011-11-30T19:24 Gabibr Re: Quelques informations ... answers SandrineD (f331s2970m1)} “Eu amo a língua Francesa! Quem sabe falar francês me adiconem no meu FACEBOOK;) J'aime parler français! Qui peut parler français? M'ajouter dans FACEBOOK;) Nom: GABRIELA MEDEIROS.”
• {(f333s3016m2)2011-12-27T09:25 Miche Re: Les stéréotypes culinaires answers SandrineD (f333s3016m1)} “inviate i vostri documenti alla mia mail [email protected] grazie!!!;)”
• {(f330s2914m8)2011-10-22T19:52 PBS Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Yo me llamo Peimikà Bibiana. Como mi madre es tailandesa y mi padre es italiano, mi primer nombre, Peimikà, es tailandés y significa " dueña del amor ", mientras mi según nombre, Bibiana, es italiano y procede del etrusco " vibius " que significa " vida ". Me gusta mucho tener dos nombres (en Italia es más usual tener un nombre) y sobre todo estoy orgullosa de los orígenes diferentes que tienen y que hacen mi nombre aún más particular (además Peimikà no es muy difundido en tampoco en Tailandia y tampoco Bibiana en Italia”
Just google it!
Peimikà Bibiana… google search (2)
Anonymisation Principles
1. All identified tokens must be (computationally) marked, even if they are not modified by a replacement token.
2. Any reference (e.g. name, institution or location) should be imprecise enough to encompass several hundred people.
[Diagram: Original lexical form, replaced by, Replacement form; Mark]
Once anonymised, no participant may be identifiable by an external person
Anonymisation
• Before:
{(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)}
Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
• After:
{(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)}
Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…
Looking for a reasonable anonymisation process
Hypotheses
• No fully automated method works for every corpus
• Some decisions have to be taken by the researcher, not by the software
• Accuracy of the method will be achieved only for a given context (e.g. Galanet)
• “Named entities” do not occur randomly
=> Let’s find the regularities interactively with the expert: the researcher
Concepts manipulated
[Diagram: Real world, Reference, Corpus]
• Existing objects: Institution, Participant, Public person, Relative, Street, City…
• Named entities: Name, Surname, Username, First name, Last name, Addresses, Tel. number, MSN…
• Lexical forms: Pedro, KellyM, Eli, Elô, Kelly, Bergamo, Canet, Rosa Luxembourg, 0609785643, …
Example:
• Lexical forms: Marienkirche, Marien-kirche, Maria-kirche
• Existing objects (photos): Marienkirche in Lübeck von Osten; Marienkirche in Dortmund
Anonymisation process
[Diagram, components:]
• Corpus to anonymise
• Initial list of participants, usernames, institution…
• Named entities transformation table
• Process/Rules: discovering new forms
• Marking process
• Corpus with marked entities
• Replacement process
• Anonymised corpus
Transformation table: example
Synonyms (=): the same entity has different forms
Homonyms (≠): the same form refers to different entities
Marking one form: Example (Kelly)
A- List all occurrences (within their context) with a concordancer
B- Update the transformation table (e.g. add the public person Gene Kelly)
C- Associate each occurrence with the appropriate entity (=> in the corpus: surround the occurrence with XML tags)
– Last name, normal form, unchanged: refers to the public person Gene Kelly
– First name, normal form, to be changed: refers to the participant KellyM
Detecting new forms: 2 strategies
• Lexical rules: similar forms
– Eli -> Elô Ely ELY Seli
– Gabriela -> GABRIELA
– José -> Jose
• Context rules: similar context
– First names: “mi chiamo …”, “accord avec …”
– Cities: “Soy de …”, “vivo en …”, “j’habite à …”
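A context rule can be sketched in a few lines of Python. The helper below is hypothetical (apply_context_rule is not part of any tool named in these slides): it assumes a rule is a literal left context followed by a single token, and it splits the captured tokens into known first names and new candidates that the researcher still has to validate. The second message is invented for illustration.

```python
import re

def apply_context_rule(left_context, messages, known_names):
    """Return (hits, known, candidates) for the token following left_context."""
    # Assumption: a rule is a literal left context followed by a single word.
    pattern = re.compile(re.escape(left_context) + r"\s+(\w+)", re.IGNORECASE)
    known_lower = {name.lower() for name in known_names}
    hits = [m.group(1) for text in messages for m in pattern.finditer(text)]
    known = [h for h in hits if h.lower() in known_lower]
    candidates = [h for h in hits if h.lower() not in known_lower]
    return hits, known, candidates

messages = [
    "Bonjour, je m'appelle Kellly. J'ai 16 ans",  # from the corpus example
    "Salut, je m'appelle Kelly !",                # invented for illustration
]
hits, known, candidates = apply_context_rule("je m'appelle", messages, {"Kelly"})
print(hits, known, candidates)
```

Here the candidate "Kellly" is only a proposal: deciding whether it is a new form of a known name remains the researcher's call.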
1st Strategy: Lexical variation rules
Known forms (103): Adriana Alèxia Anthony Baptiste Cleissa Eli … Elouise Emmanuel Federica Ferran Gabriela Guillem Iñigo Jaqueline Jean José Kelly Léo Mariana Mary Michela Monica Olalla Oleguer
New forms (31): adriana Alexia Antonhy baptiste Cleisa Elô Ely ELY Seli Louise MAnuel Federiac fran Fran GABRIELA guillem iñigo Jacqueline jean Jose Kellly Leo léo MariAna mary May Miche michelina moni olalla oleguer
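Lexical variations
Known | New | Levenshtein distance (UPPER) | Levenshtein distance (Exact) | nb differences: Case, accents, Add/Sup/Inv (column positions not recoverable)
Gabriela | GABRIELA | 0 | 7 | 7
Guillem | guillem | 0 | 1 | 1
Iñigo | iñigo | 0 | 1 | 1
Jaqueline | Jacqueline | 1 | 1 | 1
Jean | jean | 0 | 1 | 1
José | Jose | 1 | 1 | 1
Kelly | Kellly | 1 | 1 | 1
Léo | Leo | 1 | 1 | 1
Léo | léo | 0 | 1 | 1
Mariana | MariAna | 0 | 1 | 1 2
Mary | mary | 0 | 1 | 1 1
Mary | May | 1 | 1 | 1
Michela | Miche | 2 | 2 | 2
Michela | michelina | 2 | 3 | 1 2
Monica | moni | 2 | 3 | 1 2
Olalla | olalla | 0 | 1 | 1
Oleguer | oleguer | 0 | 1 | 1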
2nd Strategy : Context rules
103 known first names (Adrià, …, Veronica)
145 contexts: Left/Right
Total: more than 250 tested rules
47 rules approved
15 good new forms: Antonhy Belle Bet Christine Fede Federiac Kellly Leo Line Maria May Peimikà Regina fran jean léo
Replacing process
• Before:
{(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)}
Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
• After:
{(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)}
Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…
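The replacement step can be illustrated with a minimal Python sketch. It is not the authors' tool: replace_forms and TRANSFORMATION_TABLE are hypothetical names, the table holds only the pairs visible in the example above, and the sketch assumes that every occurrence of a listed form has already been confirmed in the marking step (so homonyms are not handled here). The trailing asterisk shown on the slide is not reproduced.

```python
import re

# Pairs taken from the Before/After example; a real table would be much larger.
TRANSFORMATION_TABLE = {
    "KellyM": "FLG01",
    "CarlaN": "ILG02",
    "Kellly": "Kittty",
    "Rosa Luxemburg": "Margherita Duras",
    "Canet": "Aigues-Vives",
}

def replace_forms(text, table):
    """Replace every whole-word occurrence of a listed form by its replacement."""
    # Longest forms first, so multi-word forms are tried before shorter ones.
    forms = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b")
    return pattern.sub(lambda m: table[m.group(0)], text)

before = ("KellyM answers CarlaN: Bonjour, je m'appelle Kellly. "
          "Je suis dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan")
after = replace_forms(before, TRANSFORMATION_TABLE)
print(after)
```

Matching is case-sensitive and exact on purpose: each variant (Kellly, Kelly, …) needs its own row in the table so that the replacement can mimic the original form.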
The Mulce anony.xsd schema
http://lrl-diffusion.univ-bpclermont.fr/mulce/metadata/mce-schemas/mce_anony.xsd
<actordesignation actorref="{actor_code}" person="real/fictitious" process="{process_mark}">
  <firstname type="initial/abbreviated/shortname/complete" correct="exact/modified/wrong" modified="true/false">Replacement form</firstname>
  {May be some separators like: spaces, linefeed, paragraph, or «,.;:!?=+ -/\~*_["]#(')&@$% », others?}
  <surname type="initial/abbreviated/complete" correct="exact/modified/wrong" modified="true/false">Replacement form</surname>
  {May be some separators}
  <lastname type="initial/abbreviated/complete" correct="exact/modified/wrong" modified="true/false">Replacement form</lastname>
</actordesignation>
Anonymisation
• Before:
Bonjour, je m'appelle Kellly. J'ai 16 ans, …
• After:
Bonjour, je m'appelle <actordesignation actorref="FLG01" person="real" process="WBCMC_CR130214"><firstname type="complete" correct="wrong" modified="true">Kittty</firstname></actordesignation>. J'ai 16 ans, …
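As an illustration of how such markup could be generated, here is a small sketch using Python's standard xml.etree.ElementTree. The function name mark_first_name and its parameters are assumptions made for this example; only the element and attribute names come from the schema excerpt shown earlier.

```python
import xml.etree.ElementTree as ET

def mark_first_name(replacement, actor_code, process_mark, correct="exact"):
    """Build an <actordesignation> element wrapping one replaced first name."""
    actor = ET.Element(
        "actordesignation",
        {"actorref": actor_code, "person": "real", "process": process_mark},
    )
    first = ET.SubElement(
        actor,
        "firstname",
        {"type": "complete", "correct": correct, "modified": "true"},
    )
    first.text = replacement  # the replacement form, e.g. "Kittty"
    return ET.tostring(actor, encoding="unicode")

markup = mark_first_name("Kittty", "FLG01", "WBCMC_CR130214", correct="wrong")
print("Bonjour, je m'appelle " + markup + ". J'ai 16 ans, …")
```

Setting correct="wrong" records that the original form (Kellly) was misspelled, which is why the replacement (Kittty) carries a comparable misspelling.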
Overall method and tools
1. Define a process/algorithm for anonymisation
2. Confront hypotheses with a first corpus
– Using existing tools (Excel, TXM/Calico, Notepad++)
– Doing much of the work by hand (with automation in mind)
– Facing/solving/avoiding problems
– Evaluating/suggesting (new) hypotheses
3. Discuss the result with the original researcher
4. Verify on a second (bigger) corpus
5. Co-develop the tool within the research community
Conclusion
1. An anonymisation method:
– For any language
– Not relying on any external information (dictionary)
2. Confront hypotheses with a first corpus
– 47 rules approved for first names => 15 new forms
– 103 first names => 31 existing derivations
– Anonymisation not 100% automatic => confirmed
3. Anonymisation possible in a world with Google?
– Use Google to get the frequency of a named entity!
Strategy for systematic anonymisation of multi-lingual interaction corpora.
C. Reffay (1), F.-M. Blondel (1), E. Giguet (2)
(1) STEF – ENS-Cachan / IFÉ – ENS-Lyon
(2) GREYC, Université Caen Basse-Normandie, CNRS
Acknowledgement: Most of the material used in this presentation was produced for the “Inter-Comprehension” conference, held in Grenoble in June 2012
Vielen Dank!
More precisely
New forms discovery: 2 strategies
103 known first names (Adrià, …, Veronica)
Lexical rules:
– 317 candidates
– 200 Easy: 180 common words, 20 usernames
– 50 Auto: 34 frequent words, 16 known
– 67 Tests: 5 common, 31 good new forms, 1 relative (new: Maria), 30 public names
Context rules:
– 145 contexts: Left/Right
– Left, one form: 75 => 13780 occ.
– Left, 2-form sequences: 123 => 1700 occ.
– Total: more than 250 tested rules
– 47 rules approved
– 15 good new forms
Contexts of 145 occ. of 103 first names (using TXM, case insensitive)
The corpus lexicon
• A list of (lexical form ► frequency)
– de ► 1015
– que ► 965
– la ► 673
– …
– porque ► 48
– …
– Addams ► 1
9655 unique forms
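A lexicon of this kind can be sketched with Python's collections.Counter. The tokenisation rule used here (runs of word characters, case preserved) is an assumption for illustration; the slides do not say how TXM tokenises the corpus, so counts from this sketch would not necessarily match the figures above.

```python
import re
from collections import Counter

def build_lexicon(messages):
    """Count each lexical form (case-sensitive) over all message bodies."""
    counts = Counter()
    for text in messages:
        # Assumption: a token is a maximal run of letters, digits or underscores.
        counts.update(re.findall(r"\w+", text))
    return counts

lexicon = build_lexicon(["de la casa", "que de"])  # toy messages, not corpus data
for form, freq in lexicon.most_common(3):
    print(form, "►", freq)
```

Forms with a frequency of 1, such as Addams in the list above, are the ones a researcher may want to inspect first when looking for rare names.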
Who is concerned?
« Les applications informatiques à des fins pédagogiques et éducatives mobilisent des données permettant d’identifier directement mais aussi indirectement les personnes physiques. Une attention particulière doit être portée sur la collecte de données sensibles ainsi que sur les procédés d’anonymisation des données. »
(Mallet-Poujol 2004: p 21)
For more information, see the European Commission Directive (95/46/EC)
Translation
Who is concerned?
« Software applications for teaching and learning involve data that can identify natural persons directly, and also indirectly. Particular attention must be paid to the collection of sensitive data and to the processes used to anonymise data. »
(Mallet-Poujol 2004: p 21)
For more information, see the European Commission Directive (95/46/EC)
Legal context (95/46/EC)
• (Art7) Member States shall provide that personal data may be processed only if: the data subject has unambiguously given his consent; …
• (Art8) Member States shall prohibit the processing of personal data revealing sensitive information (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life)
• (Art10) […] Inform the data subject on:
– The identity of the controller of the data collection,
– The purposes of the processing
– The recipients or categories of recipients of the data,
– The existence of the right of access to and the right to rectify the data concerning him
Text coherence and consistency
• {(f330s2914m11)2011-10-20T16:43 M_Cavalcanti Re: Por que me chamo assim?! Answers Eloandrade (f330s2914m1)} “aaah, o meu é uma homenagem a uma de minhas tias e minha avó que se chamam Ana e ao resto de minhas tias que se chamam Maria. Daí, Mariana:)”
• {(f330s2914m10)-2011-10-20T21:06 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Gostei da criatividade da sua mãe MariAna! Rsrsrs”
• {(f330s2914m3)2011-10-28T00:54 LineCosta Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Ah meu nome é em homenagem a Jacqueline Kennedy, esposa do ex-presidente dos EUA, e também porque sempre foi um dos nomes preferidos do meu pai.: D”
• {(f330s2914m18)2011-10-19T20:36 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Bem, minha mãe queria que meu nome começasse com a letra E (como o dela!), um certo dia ela viu o nome de uma atriz brasileira chamada Louise Cardoso. Gostou do " Louise ", mas queria com a letra E, então ficou " Elouise "! Só depois, quando eu cresci é que descobri que meu nome era de origem francesa.. Hahaha”
TXM: http://textometrie.ens-lyon.fr/
Named entities
A named entity is a lexical form identifying a precise object (first/last name, communication ref., city, institution, etc.)
Examples:
– Names: Christophe, Blondel, Giguet, Paris, …
– Communication ref.: 0678600614, …
– Location: Grenoble, Paris, Parigi, …
– Institution: ENS Cachan, CNRS, …
Managing named entities
• Homonyms refer to different objects
– In the corpus we have 2 participants named “Guillem”: the same first name refers to different persons.
– In “Gene Kelly”, Kelly = public person last name
– In “Galdric, Kelly et Antonhy”, it’s a participant first name
• Different synonyms refer to the same object
– Kellly & Kelly
– Anthony & Antonhy
– Elô & Elouise
Referring to global entities
Find Nei/nei with a concordancer
All occurrences refer to the Italian common word “nei”
Another example
• {(f330s2914m5)2011-10-23T21:52 CR_Martins Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Meu nome é Cleissa Regina, Cleissa porque minha mãe viu na tv uma repórter chamada Cleisa e achou parecido com o nome dela, Cléia e Regina porque o nome do meu pai é Reginaldo. Assim como a PBS gosto muito de ter 2 nomes e Cleissa é bem raro, nunca conheci ninguém chamado assim.”
Peimikà Bibiana… a unique case? No! Let’s try Cleissa Regina…
How to detect new forms?
• Lexical rules (look for similar forms):
– Ignoring accents (e.g. José, Jose)
– Ignoring case (e.g. José, jose, JOSÉ, …)
– Levenshtein distance between 2 forms: number of extra/missing/inverted characters
  – For word size < 5: Dist <= 1
  – For word size >= 5: Dist <= 2
• Context rules (e.g. “mi chiamo …”, “merci …”)
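The lexical rules can be sketched as follows. This is an illustrative combination, not the authors' implementation: is_variant folds case and accents first and then applies the two distance thresholds, and it measures word size on the known form, which is an assumption. The distance is the standard Levenshtein distance (insertion, deletion, substitution), so an inversion of two characters costs 2.

```python
import unicodedata

def strip_accents(form):
    """Remove diacritics: 'José' -> 'Jose'."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def levenshtein(a, b):
    """Standard edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def is_variant(candidate, known):
    """True if candidate is close enough to known under the slide's thresholds."""
    a = strip_accents(candidate).upper()
    b = strip_accents(known).upper()
    limit = 1 if len(known) < 5 else 2  # assumption: size of the known form
    return levenshtein(a, b) <= limit

for candidate, known in [("Kellly", "Kelly"), ("Seli", "Eli"), ("nei", "Eli")]:
    print(candidate, known, is_variant(candidate, known))
```

A candidate accepted by is_variant is only a suggestion: short common words can sit within distance 1 of a short name, which is why each proposed form still goes through the researcher.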
Lexical variations 1/2
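Known | New | Levenshtein distance (UPPER) | Levenshtein distance (Exact) | nb differences: Case, accents, Add/Sup/Inv (column positions not recoverable)
Adriana | adriana | 0 | 1 | 1
Alèxia | Alexia | 1 | 1 | 1
Anthony | Antonhy | 2 | 2 | 2
Baptiste | baptiste | 0 | 1 | 1
Cleissa | Cleisa | 1 | 1 | 1
Eli | Elô | 1 | 1 | 1
Eli | Ely | 1 | 1 |
Eli | ELY | 1 | 2 | 1
Eli | Seli | 1 | 2 | 1 1
Elouise | Louise | 1 | 2 | 1 1
Emmanuel | MAnuel | 2 | 4 | 2 2
Federica | Federiac | 2 | 2 | 2
Ferran | fran | 2 | 3 | 1 2
Ferran | Fran | 2 | 2 | 2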
Lexical variations 2/2
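Known | New | Levenshtein distance (UPPER) | Levenshtein distance (Exact) | nb differences: Case, accents, Add/Sup/Inv (column positions not recoverable)
Gabriela | GABRIELA | 0 | 7 | 7
Guillem | guillem | 0 | 1 | 1
Iñigo | iñigo | 0 | 1 | 1
Jaqueline | Jacqueline | 1 | 1 | 1
Jean | jean | 0 | 1 | 1
José | Jose | 1 | 1 | 1
Kelly | Kellly | 1 | 1 | 1
Léo | Leo | 1 | 1 | 1
Léo | léo | 0 | 1 | 1
Mariana | MariAna | 0 | 1 | 1 2
Mary | mary | 0 | 1 | 1 1
Mary | May | 1 | 1 | 1
Michela | Miche | 2 | 2 | 2
Michela | michelina | 2 | 3 | 1 2
Monica | moni | 2 | 3 | 1 2
Olalla | olalla | 0 | 1 | 1
Oleguer | oleguer | 0 | 1 | 1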
Some good context rules (1/3)
Context | Total | Known | New | New forms detected | Accuracy
sou <F> | 10 | 2 | | | 20%
appelle <F> | 9 | 4 | 1 | Kelly | 56%
Cara <F> | 7 | 1 | 1 | May | 29%
Ciao <F> | 6 | 1 | | | 17%
Merci <F> | 9 | 1 | 2 | Belle, léo | 44%
soy <F> | 5 | 2 | | | 40%
equipe <F> | 5 | 1 | | | 20%
Hombre <F> | 4 | 1 | | | 25%
dicho <F> | 3 | 1 | | | 33%
llamo <F> | 3 | 2 | 1 | Peimikà | 100%
appel <F> | 3 | 1 | | | 33%
raison <F> | 3 | 1 | | | 33%
choix <F> | 3 | 1 | | | 33%
tampoco <F> | 2 | 1 | | | 50%
chamam <F> | 2 | 1 | 1 | Maria | 100%
Some good context rules (2/3)
Context | Total | Known | New | New forms detected | Accuracy
{BOM} <F>, | 62 | 8 | 1 | Fede | 15%
Accord avec <F> | 9 | 4 | 1 | Bet | 56%
je m’appelle <F> | 5 | 5 | | | 100%
Concordo com a <F> | 3 | 2 | 1 | Line | 100%
meu nome é <F> | 3 | 2 | | | 67%
moi c’est <F> | 2 | 2 | | | 100%
<F>, ho | 8 | 2 | | | 25%
<F>, j’habite | 2 | 2 | | | 100%
<F>, je | 8 | 2 | | | 25%
je m’appel <F> | 1 | 0 | 1 | jean | 100%
suis avec <F> | 2 | 1 | | | 50%
<F> a dit | 1 | 1 | | | 100%
dit el <F> | 1 | 1 | | | 100%
nombre, <F> | 2 | 1 | 1 | Peimikà | 100%
diu el <F> | 1 | 1 | 1 | | 100%
Generic context rules
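Context | Total | Known | New | New forms detected | Accuracy
<F>, <Known> | 15 | 2 | 1 | Regina | 20%
<Known> i <F> | 3 | 1 | | | 33%
<F> i <Known> | 1 | 1 | | | 100%
<Known> et <F> | 6 | 2 | 2 | Antonhy, Leo | 67%
<F> et <Known> | 3 | 2 | 1 | Federiac | 100%
<Known> e <F> | 3 | 1 | | | 33%
<F> e <Known> | 3 | 1 | | | 33%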