A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes

Bilal Hawashin, Farshad Fotouhi
Department of Computer Science, Wayne State University

Traian Marius Truta
Department of Computer Science, Northern Kentucky University
Outline

What is Similarity Join
Long String Values
Our Contribution
Privacy Preserving Protocol for Long String Values
Experiments and Results
Conclusions / Future Work
Contact Information
Motivation
Source A:
Name        | Address         | Major         | …
John Smith  | 4115 Main St.   | Biology       |
Mary Jones  | 2619 Ford Rd.   | Chemical Eng. |

Source B:
Name        | Address          | Monthly Sal. | …
Smith, John | 4115 Main Street | 1645         |
Mary Jons   | 2619 Ford Rd.    | 2100         |

Is natural join always suitable?
Similarity Join

Joins a pair of records if they have SIMILAR values in the join attribute.
Formally, similarity join groups pairs of records whose similarity is greater than a threshold T.
Widely studied in the literature, and also referred to as record linkage, entity matching, duplicate detection, citation resolution, …
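As a minimal illustration of this definition, the sketch below joins two sources whose records are already represented as numeric feature vectors; the names `cosine_similarity` and `similarity_join` are illustrative, not taken from the paper:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_join(records_a, records_b, threshold):
    """Return index pairs (i, j) whose similarity exceeds the threshold T."""
    return [(i, j)
            for i, va in enumerate(records_a)
            for j, vb in enumerate(records_b)
            if cosine_similarity(va, vb) > threshold]

A = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
B = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.2]]
print(similarity_join(A, B, 0.8))  # → [(0, 0), (1, 1)]
```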
Our Previous Contribution: Long String Values (ICDM MMIS10)

The term long string refers to the data type representing any string value with unlimited length.
The term long attribute refers to any attribute of the long string data type.
Most tables contain at least one attribute with long string values. Examples are Paper Abstract, Product Description, Movie Summary, User Comment, …
Most previous work studied similarity join on short fields.
In our previous work, we showed that using long attributes as join attributes under supervised learning can enhance similarity join performance.
Example

P1 Title  | P1 Kwds  | P1 Authrs  | P1 Abstract  | …
P2 Title  | P2 Kwds  | P2 Authrs  | P2 Abstract  | …
P3 Title  | P3 Kwds  | P3 Authrs  | P3 Abstract  | …
…
P10 Title | P10 Kwds | P10 Authrs | P10 Abstract | …
P11 Title | P11 Kwds | P11 Authrs | P11 Abstract | …
…
Our Paper (Motivation)

Some sources may not allow sharing their whole data in the similarity join process. Solution: Privacy Preserved Similarity Join.
Using long attributes as join attributes can increase similarity join accuracy.
To the best of our knowledge, all current Privacy Preserved SJ algorithms use short attributes.
Most current privacy preserved SJ algorithms ignore the semantic similarities among the values.
Problem Formulation

Our goal is to find a Privacy Preserved Similarity Join algorithm for the case where the join attribute is a long attribute, while considering the semantic similarities among such long values.
Our Work Plan

Phase 1: Compare multiple similarity methods for long attributes when similarity thresholds are used.
Phase 2: Use the best method as part of the privacy preserved SJ protocol.
Phase 1: Finding the Best SJ Method for Long Strings with Threshold

Candidate methods:
Diffusion Maps
Latent Semantic Indexing
Locality Preserving Projection
Performance Measurements

F1 measurement: the harmonic mean of recall R and precision P, F1 = 2PR / (P + R).
Recall is the fraction of relevant records that are retrieved, and precision is the fraction of retrieved records that are relevant.
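The F1 computation is a one-liner; the guard for the all-zero case is the usual convention:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.5), 4))  # → 0.6154
```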
Performance Measurements (Cont.)
Preprocessing time is the time needed to read the dataset and generate matrices that could be used later as an input to the semantic operation.
Operation time is the time needed to apply the semantic method.
Matching time is the time required by the third party, C, to find the cosine similarity among the records provided by both A and B in the reduced space and compare the similarities with the predefined similarity threshold.
Datasets

IMDB Internet Movies Dataset: Movie Summary field
Amazon Dataset: Product Title, Product Description
Phase 1 Results

Finding the best dimensionality reduction method using Movie Summary from the IMDB dataset (left) and Product Descriptions from Amazon (right).
Phase 2 Results

Preprocessing time:
Read dataset (1000 movie summaries)      | 12 sec.
TF.IDF weighting                         | 1 sec.
Reduce dimensionality using MeanTF.IDF   | 0.5 sec.
Find shared features                     | Negligible
Phase 2 Results

Operation time for the best performing methods from Phase 1.
Matching time is negligible.
Our Protocol

Both sources A and B share the threshold value T to decide similar pairs later.
Our Protocol

Source A: records P1, P2, P3, …, each with Title, Authors, Abstract, …
Source B: records Px, each with Title, Authors, Abstract, …
Find Term_LSV Frequency Matrix for Each Source

Ma:
            LSV1  LSV2  LSV3
Image         4     0     0
Classify      5     0     0
Similarity    0     6     5
Join          0     6     4
Find TD_Weighted Matrix Using TF.IDF Weighting

WeightedMa:
            LSV1  LSV2  LSV3
Image        0.9    0     0
Classify     0.7    0     0
Similarity    0   0.85   0.9
Join          0   0.7   0.85
TF.IDF Weighting

The TF.IDF weighting of a term w in a long string value x is given as:

TF.IDF(w, x) = tf_{w,x} × idf_w

where tf_{w,x} is the frequency of the term w in the long string value x, and idf_w = log(N / n_w), where N is the number of long string values in the relation and n_w is the number of long string values in the relation that contain the term w.
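The weighting above can be sketched directly, treating each long string value as a token list; the function name `tfidf` is illustrative:

```python
import math

def tfidf(term, value, corpus):
    """TF.IDF of `term` in one long string value (a token list),
    relative to a corpus of N long string values."""
    tf = value.count(term)                      # tf_{w,x}
    n_w = sum(1 for v in corpus if term in v)   # values containing the term
    if tf == 0 or n_w == 0:
        return 0.0
    return tf * math.log(len(corpus) / n_w)     # tf_{w,x} * idf_w

corpus = [["image", "classify"],
          ["similarity", "join"],
          ["similarity", "join", "image"]]
print(round(tfidf("image", corpus[0], corpus), 3))  # → 0.405
```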
MeanTF.IDF Feature Selection

MeanTF.IDF is an unsupervised feature selection method.
Every feature (term) is assigned a value according to its importance.
The value of a term feature w is given as:

MeanTF.IDF(w) = (1 / N) × Σ_x TF.IDF(w, x)

where TF.IDF(w, x) is the weighting of feature w in long string value x, and N is the total number of long string values.
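A self-contained sketch of this selection step, assuming token-list values; `important_features` and its top-k cutoff are illustrative choices, not the paper's exact interface:

```python
import math

def tfidf(term, value, corpus):
    """TF.IDF of `term` in one long string value (a token list)."""
    tf = value.count(term)
    n_w = sum(1 for v in corpus if term in v)
    return tf * math.log(len(corpus) / n_w) if tf and n_w else 0.0

def mean_tfidf(term, corpus):
    """Average TF.IDF of a term over all N long string values."""
    return sum(tfidf(term, v, corpus) for v in corpus) / len(corpus)

def important_features(corpus, k):
    """Top-k terms ranked by MeanTF.IDF (unsupervised selection)."""
    vocab = sorted({t for v in corpus for t in v})
    return sorted(vocab, key=lambda t: mean_tfidf(t, corpus), reverse=True)[:k]
```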
Apply MeanTF.IDF on WeightedMa and get the important features Imp_Fea.
Add random features to Imp_Fea to get Rand_Imp_Fea.
Rand_Imp_Fea and Rand_Imp_Feb are returned to C.
C finds the intersection and returns the shared important features SF to both A and B.
Reduce WeightedM Dimensions in Both Sources Using SF

SFa:
            LSV1  LSV2  LSV3
Image        0.9    0     0
Similarity    0   0.85   0.9
…
Add Random Vectors to SF

Rand_Weighted_a:
            LSV1  LSV2  LSV3  Random Cols
Image        0.9    0     0      0.6
Similarity    0   0.85   0.9     0.2
…
Find Wa (The Kernel)

Wa[i][j] = 1 - Cos_Sim(LSVi, LSVj):

1-Cos_Sim(LSV1,LSV1)=0    1-Cos_Sim(LSV1,LSV2)=0.2   1-Cos_Sim(LSV1,LSV3)=0.3   …
1-Cos_Sim(LSV2,LSV1)=0.2  1-Cos_Sim(LSV2,LSV2)=0     1-Cos_Sim(LSV2,LSV3)=0.87  …
1-Cos_Sim(LSV3,LSV1)=0.3  1-Cos_Sim(LSV3,LSV2)=0.87  1-Cos_Sim(LSV3,LSV3)=0     …
…

|Wa| = D x D, where D is the total number of columns in Rand_Weighted_a.
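The kernel construction above can be sketched as follows; `rows` stands for the LSV column vectors of Rand_Weighted_a, and the function name `distance_kernel` is illustrative:

```python
import math

def distance_kernel(rows):
    """W[i][j] = 1 - Cos_Sim(row_i, row_j); zero on the diagonal."""
    def cos_sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    return [[1.0 - cos_sim(a, b) for b in rows] for a in rows]

# Identical vectors give distance 0; orthogonal vectors give distance 1.
W = distance_kernel([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```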
Use Diffusion Maps to Find Red_Rand_Weighted_a

[Red_Rand_Weighted_a, Sa, Va, Aa] = Diffusion_Map(Wa, 10, 1, red_dim), red_dim < D

Red_Rand_Weighted_a holds the diffusion map representation of each row of Wa:

         Col1  Col2  …  Col_red_dim
Row 1    0.4   0.1
Row 2    0.8   0.6
Row 3    0.75  0.5
…
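A minimal diffusion-maps embedding in the spirit of the Diffusion_Map call above; this is a generic sketch with illustrative normalization and parameter choices, not the authors' exact routine:

```python
import numpy as np

def diffusion_map(W, t=1, red_dim=2):
    """Embed the rows of a symmetric distance kernel W into red_dim
    diffusion coordinates after t diffusion steps (a sketch)."""
    K = np.exp(-np.asarray(W, dtype=float))   # distances -> affinities
    d = K.sum(axis=1)
    # Symmetrized normalization S = D^{-1/2} K D^{-1/2}; S shares its
    # spectrum with the Markov matrix D^{-1} K.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]          # right eigenvectors of D^{-1} K
    # Skip the trivial top eigenvector; scale coordinates by eigenvalue^t.
    return psi[:, 1:red_dim + 1] * (vals[1:red_dim + 1] ** t)
```

Rows of Wa that are close in the diffusion geometry end up close in the reduced space, which is what lets C compare records without seeing the original features.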
C Finds Pairwise Similarity Between Red_Rand_Weighted_a and Red_Rand_Weighted_b

Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
1                    1                    0.77
1                    2                    0.3
…                    …                    …
2                    1                    0.9
If Cos_Sim > T, Insert the Tuple in Matched

Matched:
Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
1                    1                    0.77
2                    1                    0.9
2                    7                    0.85
…                    …                    …
Matched is returned to both A and B.
A and B remove random vectors from Matched and share their matrices.
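The final filtering step can be sketched as follows; the name `strip_random_matches` and the (index in A, index in B, Cos_Sim) tuple layout are illustrative assumptions:

```python
def strip_random_matches(matched, noise_a, noise_b):
    """Drop Matched tuples that involve a random (noise) vector
    added by either source A or source B."""
    return [(i, j, s) for (i, j, s) in matched
            if i not in noise_a and j not in noise_b]

matched = [(1, 1, 0.77), (2, 1, 0.9), (2, 7, 0.85)]
# Suppose index 7 on B's side was one of B's random vectors:
print(strip_random_matches(matched, set(), {7}))  # → [(1, 1, 0.77), (2, 1, 0.9)]
```

Because only A and B know which indices were random, C cannot tell genuine matches apart from decoys, which is the point of adding them.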
Our Protocol (Part 1)

Our Protocol (Part 2)
Phase 2 Results

Effect of adding random columns on the accuracy.
Phase 2 Results

Effect of adding random columns on the number of suggested matches.
Conclusions

An efficient secure semantic SJ protocol for long string attributes is proposed.
Diffusion maps is the best method (among those compared) for semantically joining long string attributes when threshold values are used.
Mapping into the diffusion maps space and adding random records can hide the original data without affecting accuracy.
Future Work

Potential further work:
Compare diffusion maps with more candidate semantic methods for joining long string attributes.
Study the performance of the protocol on huge databases.
Thank You …

Dr. Farshad Fotouhi.
Dr. Traian Marius Truta.