A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes

Bilal Hawashin, Farshad Fotouhi
Department of Computer Science, Wayne State University

Traian Marius Truta
Department of Computer Science, Northern Kentucky University
Outline

What is Similarity Join
Long String Values
Our Contribution
Privacy Preserving Protocol for Long String Values
Experiments and Results
Conclusions / Future Work
Contact Information
Motivation
Source A:
Name        | Address         | Major         | …
John Smith  | 4115 Main St.   | Biology       |
Mary Jones  | 2619 Ford Rd.   | Chemical Eng. |

Source B:
Name        | Address          | Monthly Sal. | …
Smith, John | 4115 Main Street | 1645         |
Mary Jons   | 2619 Ford Rd.    | 2100         |

Is natural join always suitable?
Similarity Join

Joins a pair of records if they have SIMILAR values in the join attribute.
Formally, similarity join groups pairs of records whose similarity is greater than a threshold T.
Widely studied in the literature, and also referred to as record linkage, entity matching, duplicate detection, citation resolution, …
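As a minimal illustration of this definition, the sketch below joins two sources whose records are already represented as numeric feature vectors; the names `cosine_similarity` and `similarity_join` are illustrative, not taken from the paper:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_join(records_a, records_b, threshold):
    """Return index pairs (i, j) whose similarity exceeds the threshold T."""
    return [(i, j)
            for i, va in enumerate(records_a)
            for j, vb in enumerate(records_b)
            if cosine_similarity(va, vb) > threshold]

A = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
B = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.2]]
print(similarity_join(A, B, 0.8))  # → [(0, 0), (1, 1)]
```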
Our Previous Contribution: Long String Values (ICDM MMIS10)

The term long string refers to the data type representing any string value with unlimited length.
The term long attribute refers to any attribute of the long string data type.
Most tables contain at least one attribute with long string values. Examples are Paper Abstract, Product Description, Movie Summary, User Comment, …
Most previous work studied similarity join on short fields.
In our previous work, we showed that using long attributes as join attributes under supervised learning can enhance similarity join performance.
Example

P1 Title  | P1 Kwds  | P1 Authrs  | P1 Abstract  | …
P2 Title  | P2 Kwds  | P2 Authrs  | P2 Abstract  | …
P3 Title  | P3 Kwds  | P3 Authrs  | P3 Abstract  | …
…
P10 Title | P10 Kwds | P10 Authrs | P10 Abstract | …
P11 Title | P11 Kwds | P11 Authrs | P11 Abstract | …
…
Our Paper (Motivation)

Some sources may not allow sharing their whole data in the similarity join process. Solution: Privacy Preserved Similarity Join.
Using long attributes as join attributes can increase similarity join accuracy.
To the best of our knowledge, all current Privacy Preserved SJ algorithms use short attributes.
Most current privacy preserved SJ algorithms ignore the semantic similarities among the values.
Problem Formulation

Our goal is to find a Privacy Preserved Similarity Join algorithm for the case where the join attribute is a long attribute, while considering the semantic similarities among such long values.
Our Work Plan

Phase 1: Compare multiple similarity methods for long attributes when similarity thresholds are used.
Phase 2: Use the best method as part of the privacy preserved SJ protocol.
Phase 1: Finding the Best SJ Method for Long Strings with Threshold

Candidate methods:
Diffusion Maps
Latent Semantic Indexing
Locality Preserving Projection
Performance Measurements

F1 measurement: the harmonic mean of recall R and precision P, F1 = 2PR / (P + R).
Recall is the fraction of relevant records that are retrieved, and precision is the fraction of retrieved records that are relevant.
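The F1 computation is a one-liner; the guard for the all-zero case is the usual convention:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.5), 4))  # → 0.6154
```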
Performance Measurements (Cont.)
Preprocessing time is the time needed to read the dataset and generate matrices that could be used later as an input to the semantic operation.
Operation time is the time needed to apply the semantic method.
Matching time is the time required by the third party, C, to find the cosine similarity among the records provided by both A and B in the reduced space and compare the similarities with the predefined similarity threshold.
Datasets

IMDB Internet Movies Dataset: Movie Summary field
Amazon Dataset: Product Title, Product Description
Phase 1 Results

Finding the best dimensionality reduction method using Movie Summary from the IMDB dataset (left) and Product Descriptions from Amazon (right).
Phase 2 Results

Preprocessing time:
Read dataset (1000 movie summaries)      | 12 sec.
TF.IDF weighting                         | 1 sec.
Reduce dimensionality using MeanTF.IDF   | 0.5 sec.
Find shared features                     | Negligible
Phase 2 Results

Operation time for the best performing methods from Phase 1.
Matching time is negligible.
Our Protocol

Both sources A and B share the threshold value T to decide similar pairs later.
Our Protocol

Source A: records P1, P2, P3, …, each with Title, Authors, Abstract, …
Source B: records Px, each with Title, Authors, Abstract, …
Find Term_LSV Frequency Matrix for Each Source

Ma:
            LSV1  LSV2  LSV3
Image         4     0     0
Classify      5     0     0
Similarity    0     6     5
Join          0     6     4
Find TD_Weighted Matrix Using TF.IDF Weighting

WeightedMa:
            LSV1  LSV2  LSV3
Image        0.9    0     0
Classify     0.7    0     0
Similarity    0   0.85   0.9
Join          0   0.7   0.85
TF.IDF Weighting

The TF.IDF weighting of a term w in a long string value x is given as:

TF.IDF(w, x) = tf_{w,x} × idf_w

where tf_{w,x} is the frequency of the term w in the long string value x, and idf_w = log(N / n_w), where N is the number of long string values in the relation and n_w is the number of long string values in the relation that contain the term w.
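The weighting above can be sketched directly, treating each long string value as a token list; the function name `tfidf` is illustrative:

```python
import math

def tfidf(term, value, corpus):
    """TF.IDF of `term` in one long string value (a token list),
    relative to a corpus of N long string values."""
    tf = value.count(term)                      # tf_{w,x}
    n_w = sum(1 for v in corpus if term in v)   # values containing the term
    if tf == 0 or n_w == 0:
        return 0.0
    return tf * math.log(len(corpus) / n_w)     # tf_{w,x} * idf_w

corpus = [["image", "classify"],
          ["similarity", "join"],
          ["similarity", "join", "image"]]
print(round(tfidf("image", corpus[0], corpus), 3))  # → 0.405
```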
MeanTF.IDF Feature Selection

MeanTF.IDF is an unsupervised feature selection method.
Every feature (term) is assigned a value according to its importance.
The value of a term feature w is given as:

MeanTF.IDF(w) = (1 / N) × Σ_x TF.IDF(w, x)

where TF.IDF(w, x) is the weighting of feature w in long string value x, and N is the total number of long string values.
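A self-contained sketch of this selection step, assuming token-list values; `important_features` and its top-k cutoff are illustrative choices, not the paper's exact interface:

```python
import math

def tfidf(term, value, corpus):
    """TF.IDF of `term` in one long string value (a token list)."""
    tf = value.count(term)
    n_w = sum(1 for v in corpus if term in v)
    return tf * math.log(len(corpus) / n_w) if tf and n_w else 0.0

def mean_tfidf(term, corpus):
    """Average TF.IDF of a term over all N long string values."""
    return sum(tfidf(term, v, corpus) for v in corpus) / len(corpus)

def important_features(corpus, k):
    """Top-k terms ranked by MeanTF.IDF (unsupervised selection)."""
    vocab = sorted({t for v in corpus for t in v})
    return sorted(vocab, key=lambda t: mean_tfidf(t, corpus), reverse=True)[:k]
```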
Apply MeanTF.IDF on WeightedMa and get the important features Imp_Fea.
Add random features to Imp_Fea to get Rand_Imp_Fea.
Rand_Imp_Fea and Rand_Imp_Feb are returned to C.
C finds the intersection and returns the shared important features SF to both A and B.
Reduce WeightedM Dimensions in Both Sources Using SF

SFa:
            LSV1  LSV2  LSV3
Image        0.9    0     0
Similarity    0   0.85   0.9
…
Add Random Vectors to SF

Rand_Weighted_a:
            LSV1  LSV2  LSV3  Random Cols
Image        0.9    0     0      0.6
Similarity    0   0.85   0.9     0.2
…
Find Wa (The Kernel)

Wa[i][j] = 1 - Cos_Sim(LSVi, LSVj):

1-Cos_Sim(LSV1,LSV1)=0    1-Cos_Sim(LSV1,LSV2)=0.2   1-Cos_Sim(LSV1,LSV3)=0.3   …
1-Cos_Sim(LSV2,LSV1)=0.2  1-Cos_Sim(LSV2,LSV2)=0     1-Cos_Sim(LSV2,LSV3)=0.87  …
1-Cos_Sim(LSV3,LSV1)=0.3  1-Cos_Sim(LSV3,LSV2)=0.87  1-Cos_Sim(LSV3,LSV3)=0     …
…

|Wa| = D x D, where D is the total number of columns in Rand_Weighted_a.
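The kernel construction above can be sketched as follows; `rows` stands for the LSV column vectors of Rand_Weighted_a, and the function name `distance_kernel` is illustrative:

```python
import math

def distance_kernel(rows):
    """W[i][j] = 1 - Cos_Sim(row_i, row_j); zero on the diagonal."""
    def cos_sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    return [[1.0 - cos_sim(a, b) for b in rows] for a in rows]

# Identical vectors give distance 0; orthogonal vectors give distance 1.
W = distance_kernel([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```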
Use Diffusion Maps to Find Red_Rand_Weighted_a

[Red_Rand_Weighted_a, Sa, Va, Aa] = Diffusion_Map(Wa, 10, 1, red_dim), red_dim < D

Red_Rand_Weighted_a holds the diffusion map representation of each row of Wa:

         Col1  Col2  …  Col_red_dim
Row 1    0.4   0.1
Row 2    0.8   0.6
Row 3    0.75  0.5
…
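A minimal diffusion-maps embedding in the spirit of the Diffusion_Map call above; this is a generic sketch with illustrative normalization and parameter choices, not the authors' exact routine:

```python
import numpy as np

def diffusion_map(W, t=1, red_dim=2):
    """Embed the rows of a symmetric distance kernel W into red_dim
    diffusion coordinates after t diffusion steps (a sketch)."""
    K = np.exp(-np.asarray(W, dtype=float))   # distances -> affinities
    d = K.sum(axis=1)
    # Symmetrized normalization S = D^{-1/2} K D^{-1/2}; S shares its
    # spectrum with the Markov matrix D^{-1} K.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]          # right eigenvectors of D^{-1} K
    # Skip the trivial top eigenvector; scale coordinates by eigenvalue^t.
    return psi[:, 1:red_dim + 1] * (vals[1:red_dim + 1] ** t)
```

Rows of Wa that are close in the diffusion geometry end up close in the reduced space, which is what lets C compare records without seeing the original features.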
C Finds Pairwise Similarity Between Red_Rand_Weighted_a and Red_Rand_Weighted_b

Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
1                    1                    0.77
1                    2                    0.3
…                    …                    …
2                    1                    0.9
If Cos_Sim > T, Insert the Tuple in Matched

Matched:
Red_Rand_Weighted_a  Red_Rand_Weighted_b  Cos_Sim
1                    1                    0.77
2                    1                    0.9
2                    7                    0.85
…                    …                    …
Matched is returned to both A and B.
A and B remove random vectors from Matched and share their matrices.
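The final filtering step can be sketched as follows; the name `strip_random_matches` and the (index in A, index in B, Cos_Sim) tuple layout are illustrative assumptions:

```python
def strip_random_matches(matched, noise_a, noise_b):
    """Drop Matched tuples that involve a random (noise) vector
    added by either source A or source B."""
    return [(i, j, s) for (i, j, s) in matched
            if i not in noise_a and j not in noise_b]

matched = [(1, 1, 0.77), (2, 1, 0.9), (2, 7, 0.85)]
# Suppose index 7 on B's side was one of B's random vectors:
print(strip_random_matches(matched, set(), {7}))  # → [(1, 1, 0.77), (2, 1, 0.9)]
```

Because only A and B know which indices were random, C cannot tell genuine matches apart from decoys, which is the point of adding them.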
Our Protocol (Part 1)

Our Protocol (Part 2)
Phase 2 Results

Effect of adding random columns on the accuracy.
Phase 2 Results

Effect of adding random columns on the number of suggested matches.
Conclusions

An efficient secure semantic SJ protocol for long string attributes is proposed.
Diffusion maps is the best method (among those compared) for semantically joining long string attributes when threshold values are used.
Mapping into the diffusion maps space and adding random records can hide the original data without affecting accuracy.
Future Work

Potential further work:
Compare diffusion maps with more candidate semantic methods for joining long string attributes.
Study the performance of the protocol on huge databases.
Thank You …

Dr. Farshad Fotouhi.
Dr. Traian Marius Truta.