Large-Scale Cost-sensitive Online Social Network Profile Linkage

Background & Motivation
Footprints in different social networks.
User identification in social analysis.
Privacy & security
Commercial & government applications

Outline
Problem definition
Related work
Approach
Experiment
Conclusion & future work

Problem Definition
Terminology
  Identity: Person
  Profile/User: Your footprint on social media
  Profile Linkage: Link your footprints together
Input & Output
  Input: profiles of one site as QUERY and profiles of the other site as TARGET.
  Output: all pairs of profiles classified as matched.

Characteristics of profile

Semi-structured schema
  Name (semi-structured vs. structured)
    Structured: {“given name”: “haochen”, “family name”: “zhang”}
    Semi-structured: name: zhang haochen
Incompleteness & missing attributes
  Privacy policy
  Virtual identification
Free text description
  Bio, About me, Tags

Multilingualism

Multilingualism
Top 5 languages in the Facebook dataset
  English, Portuguese, Spanish, Chinese, French
Most frequent tokens in different languages
  chris, john, michael
  chen, wang, lee
  carlos, garcia, daniel
  sergey, olga, alexander
About 70% of users are in English
7.2% of users register with different locales
Transliteration
  昊辰 => Haochen

Feature Acquisition
Network communication costs too much time.
Usage limits of the web service.
  1000 invocations per day for Google Maps API
Computational complexity compared with string similarity.
  Image processing algorithm.

Overview of approach

Classification of potential links: feature representation, supervised learning, cost-sensitive feature acquisition
Pruning with canopy: parameter tuning, canopy construction
Entity-based representation of profiles: mapping, tokenization, entity extraction

Canopy: design

Canopy: efficiency
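The figures for the two canopy slides did not survive, so the following is only a minimal sketch of generic canopy clustering over token sets, not the design used in this work. It assumes Jaccard distance as the cheap metric; the thresholds `t_loose` and `t_tight`, the function names, and the profile ids are hypothetical. Only pairs that share a canopy would be passed on to the classifier.

```python
from typing import Dict, List, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Cheap distance: 1 - |a & b| / |a | b| (1.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def build_canopies(profiles: Dict[str, Set[str]],
                   t_loose: float = 0.8,
                   t_tight: float = 0.4) -> List[Set[str]]:
    """Group profile ids into (possibly overlapping) canopies.

    t_loose and t_tight are hypothetical thresholds with t_tight <= t_loose.
    """
    remaining = sorted(profiles)  # deterministic choice of canopy centers
    canopies: List[Set[str]] = []
    while remaining:
        center = remaining[0]
        canopy = {center}
        still_remaining = []
        for pid in remaining[1:]:
            d = jaccard_distance(profiles[center], profiles[pid])
            if d <= t_loose:
                canopy.add(pid)              # close enough to share this canopy
            if d > t_tight:
                still_remaining.append(pid)  # may still seed or join another canopy
        canopies.append(canopy)
        remaining = still_remaining
    return canopies


def candidate_pairs(canopies: List[Set[str]],
                    query_ids: Set[str],
                    target_ids: Set[str]) -> Set[Tuple[str, str]]:
    """Keep only QUERY/TARGET pairs that fall into a common canopy."""
    pairs: Set[Tuple[str, str]] = set()
    for canopy in canopies:
        for q in canopy & query_ids:
            for t in canopy & target_ids:
                pairs.add((q, t))
    return pairs
```

With four toy profiles (two per site), this pruning would leave two candidate pairs instead of the four in the full cross product.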

Local Features
Username
  Jaro-Winkler similarity
Language
  Jaccard similarity
Description, URL
  Cosine similarity with TF×IDF
Popularity
  Defined as the number of friends a user has.
  Adopt the following metric
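The three string-based local features can be sketched as follows. This is an illustrative implementation of the textbook definitions, not the code behind the slides: the Winkler scaling factor `p = 0.1`, the prefix cap of four characters, and the smoothed IDF variant are common defaults assumed here, and all function names are hypothetical. The popularity metric is not reproduced because its formula is not given.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def jaro(s: str, t: str) -> float:
    """Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets, e.g. the languages of two profiles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1 (one common variant)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_cosine(a: List[str], b: List[str], idf: Dict[str, float]) -> float:
    """Cosine similarity of two token lists under TF x IDF weighting."""
    va = {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(a).items()}
    vb = {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(b).items()}
    dot = sum(w * vb.get(tok, 0.0) for tok, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

For instance, `jaro_winkler("MARTHA", "MARHTA")` evaluates to roughly 0.961, and `jaccard({"en", "zh"}, {"en"})` to 0.5.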

External Features
Geographic Location
  Values are diverse, with different types.
  Google Maps API: string-represented location => geographic information
  Spherical distance between two locations as the feature
Avatar
  χ2 dissimilarity of the avatars’ gray-scale histograms.
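Both external features reduce to short formulas once the expensive acquisition step is done. The sketch below assumes the locations have already been geocoded to latitude/longitude in degrees and the avatars already reduced to gray-scale histograms with equal bin counts; the geocoding call itself is not shown. The haversine formula and the symmetric χ2 form used here are common choices and may differ from the exact variants behind the slides; the function names are hypothetical.

```python
import math
from typing import Sequence

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def spherical_distance_km(lat1: float, lon1: float,
                          lat2: float, lon2: float) -> float:
    """Great-circle distance between two points via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def chi2_dissimilarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Symmetric chi-squared dissimilarity, 0.5 * sum((a - b)^2 / (a + b)),
    computed on histograms normalized to sum to 1."""
    s1, s2 = sum(h1), sum(h2)
    if len(h1) != len(h2) or s1 == 0 or s2 == 0:
        raise ValueError("histograms must be non-empty and have equal length")
    total = 0.0
    for a, b in zip(h1, h2):
        a, b = a / s1, b / s2
        if a + b > 0:  # bins empty in both histograms contribute nothing
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total
```

Identical histograms give a dissimilarity of 0 and histograms with no overlapping mass give 1, so the value can be used directly as a feature in [0, 1].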

Classification: learning
Probabilistic model derived from naïve Bayes
  Independent feature assumption

Classification: learning
Iterative inference
  Terminate if S_n is discriminative.
  Set a threshold from each feature's error rate on the training set to decide whether S_n is discriminative.
Order of the features
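One way to picture iterative inference with early termination is a naïve-Bayes-style update in log-odds form, where features are consumed in a fixed order and costly ones are only acquired if the running score is not yet decisive. This is a sketch under assumptions, not the model from the slides: the fixed `lower`/`upper` posterior thresholds stand in for the per-feature thresholds derived from training error, and the likelihood-ratio callables and feature names are hypothetical.

```python
import math
from typing import Callable, List, Tuple

# (name, acquire, likelihood_ratio): acquire() may be costly (e.g. a web call);
# likelihood_ratio(value) approximates P(value | match) / P(value | non-match).
Feature = Tuple[str, Callable[[], float], Callable[[float], float]]


def iterative_inference(prior: float,
                        features: List[Feature],
                        lower: float = 0.05,
                        upper: float = 0.95) -> Tuple[float, List[str]]:
    """Return (posterior match probability, names of features acquired)."""
    log_odds = math.log(prior / (1.0 - prior))
    used: List[str] = []
    for name, acquire, likelihood_ratio in features:
        posterior = 1.0 / (1.0 + math.exp(-log_odds))
        if posterior <= lower or posterior >= upper:
            break  # already discriminative: skip the remaining acquisitions
        value = acquire()
        used.append(name)
        # Independence assumption: each feature adds its own log-ratio.
        log_odds += math.log(likelihood_ratio(value))
    return 1.0 / (1.0 + math.exp(-log_odds)), used
```

Placing cheap local features first means a strongly matching username can settle a pair before any external call is made, which is where the savings in acquisitions would come from.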

Classification: learning
Initial value
  Estimated from the prior that two profiles sharing rarer tokens are more likely to match.

as the initial value

Dataset of experiment
Data source
  152,294 Twitter users
  154,379 LinkedIn users
Ground truth: 9,750 identities
  4,779 identities with both accounts.
  3,339 identities with only a Twitter account.
  1,632 identities with only a LinkedIn account.

Experiment: Performance on overall linkage

I-Acc (Identity Accuracy) = correctly identified identities / all identities in ground truth
Better than the naïve learning method, owing to the adopted prior.
Performance differs across learning methods.
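I-Acc as defined above can be computed with a few lines. The slide does not spell out what counts as "correctly identified", so this sketch assumes one reading: an identity with both accounts is correct when its two profiles are linked to each other, and an identity with a single account is correct when that profile is left unlinked. The data layout and names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# One ground-truth identity: (twitter_id, linkedin_id); None if no account.
Identity = Tuple[Optional[str], Optional[str]]


def identity_accuracy(identities: List[Identity],
                      links: Dict[str, str]) -> float:
    """I-Acc = correctly identified identities / all identities.

    links maps a Twitter id to the LinkedIn id it was matched with.
    """
    linked_targets = set(links.values())
    correct = 0
    for tw, li in identities:
        if tw is not None and li is not None:
            correct += links.get(tw) == li       # must be linked to each other
        elif tw is not None:
            correct += tw not in links           # must stay unlinked
        else:
            correct += li not in linked_targets  # must stay unlinked
    return correct / len(identities)
```

Under this reading a single wrong link can cost two identities at once: the one whose profile was linked incorrectly and the one whose profile was taken.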

Experiment: Cost-sensitive feature acquisition

5% improvement in F1 from 148,743 external feature acquisitions.
Different orders of external features
  Rank by cost
  Rank by distinguishability
Three sections divided by two inflection points.

Discussion: dataset construction

Dataset construction
Connections
  Cannot correctly reflect the web-scale setting.
  Name is too significant.
People search
  Difficult to construct the ground truth.

Solution?

Discussion: people search task

Query LinkedIn by the Twitter user’s name
Average of 10 results for each query

Method      Pre    Rec    F1
Human       0.643  0.900  0.750
NB_Local    0.369  0.441  0.402
NB_All      0.418  0.493  0.453
C4.5_Local  0.594  0.240  0.342
C4.5_All    0.609  0.380  0.468
CSPL_Local  0.543  0.658  0.595
CSPL_All    0.578  0.713  0.638
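The F1 column follows from the other two as the harmonic mean of precision and recall. The short check below recomputes it from the reported Pre and Rec values; the numbers are copied from the table, while the function and variable names are illustrative. Because the inputs are themselves rounded to three decimals, small differences in the last digit are expected.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Pre, Rec, F1) as reported for the people search task.
reported: Dict[str, Tuple[float, float, float]] = {
    "Human": (0.643, 0.900, 0.750),
    "NB_Local": (0.369, 0.441, 0.402),
    "NB_All": (0.418, 0.493, 0.453),
    "C4.5_Local": (0.594, 0.240, 0.342),
    "C4.5_All": (0.609, 0.380, 0.468),
    "CSPL_Local": (0.543, 0.658, 0.595),
    "CSPL_All": (0.578, 0.713, 0.638),
}


def max_f1_gap(rows: Dict[str, Tuple[float, float, float]]) -> float:
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
```

The recomputed values agree with the reported F1 column to within 0.001 for every row.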

Discussion: feature dependency

Compare features independently.
  2 people at Tsinghua with the same name, Li Peng
  2 people at NUS with the same name, Li Peng
Construct a different IDF table for names in each locale.
  Not general
  Not significantly effective

Conclusion
We proposed a supervised probabilistic approach that solves the identity linkage problem effectively.
The prior that users sharing rarer tokens are more likely to match improves the performance of the approach.
Iterative inference is able to reduce unnecessary feature acquisitions.

Thank you
