The Power of Data Mining and Machine Learning Techniques for Network
Construction and Analysis
Reda Alhajj
University of Calgary, Calgary, Alberta, Canada
Global University, Beirut, Lebanon
BYU, Provo, USA, March 2013
General Overview
The network model provides a powerful platform to study a group of entities and their relationships
The semantics of the links in the network is determined by considering the application domain to be investigated
A network can be constructed by considering pairwise correlation between entities or by investigating the correlation between two entities based on a global view of the data
Data mining and machine learning techniques allow for better investigation by taking a global view of the data to derive the strength of pairwise links
The combination of data mining, machine learning and network analysis would lead to a comprehensive and robust framework for data analysis.
2 Reda Alhajj, University of Calgary
Outline of the talk
Background on ARM, Clustering, Network Model, fuzziness
From FPM, ARM and clustering to network
Some application domains: database design, web mining, terror network analysis, outlier detection, disease biomarkers, database search
Conclusions and research directions
Overview of Association Rules Mining
A general model for mining domains where there is a many-to-many relationship between two sets of entities, e.g., baskets and items, documents and words, etc.
Consider a set of items I = {I1, I2, I3, …, Im}
Consider a database of transactions D where each transaction T is a set of items such that T ⊆ I
So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
An association rule is an implication or correlation of the form:
A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
Support and confidence are the measures generally used to filter the rules
Association Rules Mining: Two Steps
In general association rules mining can be reduced to the following two steps:
1. Find all frequent itemsets: each itemset will occur at least as frequently as a minimum support count
2. Generate strong association rules from the frequent itemsets: these rules will satisfy the minimum support and confidence measures
We use the outcome from the first step in part of the research and the outcome from the second step in another part of the research
Association Rules Mining: Apriori Algorithm
Any subset of a frequent itemset must be frequent
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
Minimum support = 2
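The level-wise procedure above can be sketched in a few lines; the transactions below are hypothetical (not the slide's example), and this is an illustrative sketch rather than a production miner:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining with Apriori pruning."""
    frequent = {}
    # Start from candidate 1-itemsets.
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    k = 1
    while level:
        # Count support of every candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Build (k+1)-candidates; Apriori pruning: every k-subset must be frequent.
        level, seen = [], set()
        for a, b in combinations(sorted(survivors, key=sorted), 2):
            cand = a | b
            if len(cand) == k + 1 and cand not in seen:
                seen.add(cand)
                if all(frozenset(s) in survivors for s in combinations(cand, k)):
                    level.append(cand)
        k += 1
    return frequent

freq = apriori([{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'B', 'C'}], min_support=2)
```

With minimum support 2, every pair survives, while {A, B, C} (support 1) is filtered out at the counting step.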
Association Rules Mining: Frequent Closed Itemsets
A frequent itemset X is closed if none of its immediate supersets has the same support as the itemset X
Example
Image Reference: http://www.siam.org/meetings/sdm06/proceedings/038lucchesec.pdf
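Given supports for all frequent itemsets, closedness follows directly from the definition above; since support is anti-monotone, comparing against any proper superset is equivalent to comparing against immediate supersets. A sketch with toy numbers (not the cited example):

```python
def closed_itemsets(freq):
    """freq: dict mapping frozenset -> support.
    Keep only itemsets with no proper superset of equal support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and sy == s for y, sy in freq.items())}

freq = {frozenset({'A'}): 3,
        frozenset({'A', 'B'}): 3,
        frozenset({'A', 'B', 'C'}): 2}
closed = closed_itemsets(freq)  # {'A'} is absorbed by {'A', 'B'}
```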
Clustering
It is an unsupervised learning process
It is the process of distributing a given set of data instances into groups such that the similarity of instances is high within each group and low between the groups.
Similarity within a cluster (intra-cluster) is measured using variance, average variance, or TWCV (total within-cluster variation)
Similarity across clusters (inter-cluster) is measured based on linkage.
For clustering we need to know at least the characteristics of the instances and the similarity measure to be used in the process
Various algorithms exist for clustering, e.g., k-means, DBSCAN, etc.
Each algorithm has its advantages and disadvantages
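The intra-cluster criterion mentioned above (TWCV, total within-cluster variation) can be sketched as follows; the tiny 2-D clusters are invented for illustration:

```python
def twcv(clusters):
    """Total within-cluster variation: sum of squared distances of
    each point to its cluster centroid, summed over all clusters."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))
    return total

# Two tight clusters -> small TWCV; merging them would inflate it.
score = twcv([[(0, 0), (2, 0)], [(10, 0), (10, 2)]])
```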
Clustering
Example 1
Example 2
Overview of Social Network Analysis
A social network is a set of entities called actors and the links connecting them. Examples: students enrolled in the same courses, people and their likes, etc. A social network is mostly represented as a graph called a sociogram
Social Network Analysis (SNA) is powerful because it has foundations in math/graph theory
SNA provides a set of tools to empirically extend our theoretical intuition of the patterns that compose a social structure.
SNA provides a set of relational methods for systematically understanding and identifying connections among actors.
SNA embodies a range of theories relating types of observable social spaces and their relation to individual and group behavior.
Social Network Analysis: Centrality Measures
Degree: sum of connections (sum of the weights of connections in the case of weighted graphs) from or to an actor
Closeness: distance of one actor to all others in the network
Betweenness: the number of shortest paths that pass through an actor
Eigenvector: measures the importance of an actor (a node is central to the extent that its neighbors are central)
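Degree and closeness are easy to compute by hand; a minimal pure-Python sketch on a star graph (betweenness and eigenvector centrality need more machinery and are omitted here):

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest-path (hop) distances from src in an unweighted graph."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj, v):
    return len(adj[v])

def closeness(adj, v):
    """Normalized closeness: (n - 1) / sum of distances to all others."""
    d = bfs_distances(adj, v)
    return (len(adj) - 1) / sum(d[u] for u in adj if u != v)

# Star graph: the hub 'c' is maximally central.
adj = {'c': ['a', 'b', 'd'], 'a': ['c'], 'b': ['c'], 'd': ['c']}
```

On this graph the hub has closeness 1.0 (one hop to everyone), while each leaf must go through the hub to reach the other leaves.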
Social Network Analysis: Centrality Measures (example)
The red nodes have the highest degree centrality
The blue node has the highest Closeness and betweenness centrality
Node 7 has the highest degree centrality
Node 8 has the highest betweenness Centrality
Nodes 4 and 5 have the highest Closeness Centrality
Example 1 Example 2
Image Reference:http://www.biomedcentral.com/
Image Reference:http://mande.co.uk/special-issues/network-models/
Social Network Analysis: Graph Clustering Algorithms
MST based clustering
First finds a Minimum Spanning Tree (MST) of the graph
Removes edges with the highest weight from the MST to form clusters of vertices (actors)
Edge Betweenness clustering
The betweenness of an edge is defined as the extent to which the edge lies along shortest paths
First computes edge betweenness for all edges in the current graph
Removes edges having the highest betweenness from the graph
One Mode versus Two Mode Networks
Queries (users) versus Tables is a two mode network
Folding is used to produce one mode networks from a two mode network
Folding is simply the multiplication of the adjacency matrix of the two mode network by its transpose
Two-mode (incidence) matrix M:
   X Y Z
A  1 0 0
B  1 0 1
C  1 1 0
D  1 0 1

Transpose of M:
   A B C D
X  1 1 1 1
Y  0 0 1 0
Z  0 1 0 1
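Folding the incidence matrix above is literally M · Mᵀ; entry (i, j) counts the events actors i and j share, and the diagonal counts each actor's own memberships:

```python
def fold(m):
    """One-mode projection of a two-mode network: multiply the
    incidence matrix by its transpose."""
    n, cols = len(m), len(m[0])
    return [[sum(m[i][k] * m[j][k] for k in range(cols)) for j in range(n)]
            for i in range(n)]

# Rows A, B, C, D; columns X, Y, Z (the matrix from the slide).
M = [[1, 0, 0],   # A
     [1, 0, 1],   # B
     [1, 1, 0],   # C
     [1, 0, 1]]   # D
one_mode = fold(M)  # 4x4 actor-by-actor matrix
```

For instance, B and D co-occur in both X and Z, so their link weight is 2, while A and C share only X.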
Fuzzy Sets
Generalizes classical set theory by a characteristic membership function.
A membership function introduces a grey area between the black and white areas
Consider fuzzy set A, its domain D, and object x.
Membership function µ specifies the degree of membership of x in A:
µA(x): D → [0, 1].
µA(x)= 0 means x does not belong to A.
µA(x)= 1 means x completely belongs to A.
Intermediate values 0< µA(x)<1 represent varying degree of membership.
Example on Membership

Income      Range       Centroid
Quite poor  10-10-30    -
Poor        10-30-70    30
Moderate    30-70-120   70
Rich        70-120-120  -

The ranges of the fuzzy sets
The membership functions found according to the centroids
[Figure: membership (0.0-1.0) of the quite poor, poor, moderate, and rich sets over income from 10K to 120K ($)]
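The poor and moderate sets from the table can be written as triangular membership functions; the sketch below assumes simple triangles between the tabulated breakpoints (the boundary sets "quite poor" and "rich" would additionally saturate at the edges of the income range):

```python
def tri(x, a, b, c):
    """Triangular membership: 0 outside [a, c], peak 1 at b."""
    if x < a or x > c:
        return 0.0
    if x <= b:
        return 1.0 if a == b else (x - a) / (b - a)
    return 1.0 if b == c else (c - x) / (c - b)

# Breakpoints (in $) taken from the table above.
poor = lambda x: tri(x, 10_000, 30_000, 70_000)
moderate = lambda x: tri(x, 30_000, 70_000, 120_000)
```

An income of $50K falls halfway between the poor and moderate centroids, so it belongs to both sets with degree 0.5.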
From FPM to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for FPM by deciding on the baskets and items. Keep in mind that items are the actors in the network
Apply the FPM algorithm of your choice to find Frequent sets of items; it is possible to narrow down to closed or maximal FP
Construct the network by considering the frequent sets as follows:
Add a link between two actors i and j iff i and j exist together in at least one FP, the weight of the link is set to the number of common FP’s
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
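The edge-building step above can be sketched directly: count, for every pair of actors, the frequent patterns they co-occur in (the patterns below are hypothetical):

```python
from itertools import combinations
from collections import Counter

def network_from_patterns(patterns):
    """Weighted edge list: weight(i, j) = number of frequent
    patterns containing both i and j."""
    weights = Counter()
    for fp in patterns:
        for i, j in combinations(sorted(fp), 2):
            weights[(i, j)] += 1
    return weights

w = network_from_patterns([{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}])
```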
From FPM to Network Construction
From ARM to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for ARM by deciding on the baskets and items. Keep in mind that items are the actors in the network; they will form the antecedents and consequents of the rules
Apply the ARM algorithm of your choice to find all AR’s that satisfy certain criteria
Construct the network by considering the AR's as follows: Add a link between two actors i and j iff i and j exist together in at least one AR; the weight of the link is set to the number of common AR's. It is possible to concentrate on the antecedent, the consequent, or both.
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
From ARM to Network Construction
From Clustering to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for clustering by deciding on the features to consider in computing the similarity measure
Apply either one clustering algorithm several times, varying the required input parameters, or a number of different clustering algorithms, to find one clustering solution per run.
Construct the network by considering the clusters as follows: Add a link between two actors i and j iff i and j exist together in the same cluster in at least one clustering solution; the weight of the link is set to the number of common clusters across the solutions.
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
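Counting co-occurrence across solutions yields a co-association matrix; a sketch with two made-up clustering solutions:

```python
from itertools import combinations
from collections import Counter

def coassociation(solutions):
    """solutions: list of clustering solutions, each a list of clusters
    (sets of actors). weight(i, j) = number of solutions in which
    i and j fall in the same cluster."""
    weights = Counter()
    for clusters in solutions:
        for cluster in clusters:
            for i, j in combinations(sorted(cluster), 2):
                weights[(i, j)] += 1
    return weights

w = coassociation([[{'a', 'b'}, {'c'}],   # solution 1
                   [{'a', 'b', 'c'}]])    # solution 2
```

Here a and b share a cluster in both solutions (weight 2), while a-c and b-c co-occur only in the second (weight 1).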
Network Construction
Multiple clustering solutions
From the Data to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for processing by deciding on the features P to consider in the analysis
Construct an M×P matrix A by considering every instance as a row and every feature as a column
Find the transpose of matrix A
Multiply matrix A by its transpose to get the adjacency matrix for the target network.
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
NetDriller : A Powerful Social Network Analysis Tool*
Negar Koochakzadeh, Atieh Sarraf, Keivan Kianmehr, Jon Rokne, Reda Alhajj
{nkoochak, sarrafsa}@ucalgary.ca, [email protected], {alhajj, rokne}@ucalgary.ca
Social Network Analysis (SNA) is a technique first used in sociology.
Recently computer scientists have realized that this model is general enough to be applied to any domain where the entities and their interconnections can be separated into actors and their links, respectively. Data Mining techniques can strengthen SNA
Searching in the Network: Example 1: Find individuals who could monitor the information flow in an organization better than most others. Example 2: Find individuals who have the best picture of what is happening in the network as a whole.
Closeness centrality reveals how long it takes information to spread from one individual to others in the network. High-scoring individuals in closeness have the shortest paths to all others in the network.
Betweenness centrality indicates the extent to which an individual is a broker of indirect connections among all others in a network. Someone with high betweenness could be thought of as a gatekeeper of information flow. People that occur on many shortest paths among other people have the highest betweenness value.
Degree centrality indicates the extent to which an individual sends or receives information to/from the neighbors.
Eigenvector centrality calculates the principal eigenvector of the network. A node is central to the extent that its neighbors are central.
Fuzzy Query Example: Find individuals with high centralities
Raw Dataset: People and their attributes
Social Network: Based on community detection
Fuzzy Query Result: Color hue shows the degree of membership (DofM)
Fuzzy Sets: Based on multi-objective GA optimization
age work-class education marital-status occupation relationship race sex hours/week native-country
39 State-gov Bachelors Never-married Adm-clerical Not-in-family White Male 40 US
50 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Husband White Male 13 Canada
52 Self-emp-not-inc HS-grad Married-civ-spouse Exec-managerial Husband White Male 45 US
30 State-gov Bachelors Married-civ-spouse Prof-specialty Husband Black Male 40 India
25 Self-emp-not-inc HS-grad Never-married Farming-fishing Own-child White Male 35 Iran
43 Self-emp-not-inc Masters Divorced Exec-managerial Unmarried White Female 45 US
…
1. Network Construction
* ICDM 2011 IEEE International Conference on Data Mining http://cpsc.ucalgary.ca/~nkoochak/NetDriller/
IMPROVING DATABASE PERFORMANCE BY BUILDING AND ANALYZING NETWORK OF TABLES FROM QUERY ACCESS PATTERNS
Problem Definition
Response time in a distributed or parallel database system is largely determined by how data is organized and stored on different machines/sites.
The goal is to place related data on nearby, or preferably the same, sites to minimize the response time.
The study of data distribution requires solving two problems: 1. The partitioning problem 2. The allocation problem
Queries (users) versus Tables
Overview of the analysis process
Three main steps:
1. Considering tables as items and queries as transactions, extract frequent closed itemsets
A kind of fuzzy sets can be built from the closed itemsets in this step
2. Use the extracted itemsets from the previous step to build the network of tables
3. Use network analysis to extract information about the tables from the network of tables
Step 1: Items and Transactions
Sample database EMPLOYEE (Ssn, Fname, Lname, Dno) DEPARTMENT (Dnumber, Dname) PROJECT (Pnumber, Pname, Plocation, Dno)
Sample query (Q1):
SELECT Lname
FROM EMPLOYEE, DEPARTMENT
WHERE Dno = Dnumber AND Dname = 'Research'
Items EMPLOYEE, DEPARTMENT, PROJECT
Transactions Q1: EMPLOYEE, DEPARTMENT
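Turning queries into transactions amounts to reading the FROM clause; a naive sketch that assumes simple comma-separated FROM lists (no joins, aliases, or subqueries):

```python
import re

def query_to_transaction(sql):
    """Extract the set of tables referenced by a simple
    SELECT ... FROM t1, t2 ... [WHERE ...] query."""
    m = re.search(r'\bFROM\b(.*?)(?:\bWHERE\b|$)', sql,
                  re.IGNORECASE | re.DOTALL)
    return {t.strip() for t in m.group(1).split(',') if t.strip()}

q1 = """SELECT Lname
FROM EMPLOYEE, DEPARTMENT
WHERE Dno = Dnumber AND Dname = 'Research'"""
```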
Step 1: Example (Sample Database)
Sample database schema from Fundamentals of Database Systems, Elmasri/Navathe
Step 1: Example (List of Queries)
List of Queries in Transaction Format
Q1 EMPLOYEE DEPARTMENT
Q2 EMPLOYEE DEPARTMENT PROJECT
Q3 EMPLOYEE DEPARTMENT
Q4 EMPLOYEE DEPARTMENT WORKS_ON PROJECT
Q5 EMPLOYEE WORKS_ON PROJECT
Q6 EMPLOYEE DEPARTMENT WORKS_ON PROJECT
Q7 EMPLOYEE DEPENDENT
Q8 EMPLOYEE WORKS_ON PROJECT
Q9 EMPLOYEE DEPENDENT
Q10 EMPLOYEE DEPENDENT
Q11 EMPLOYEE DEPARTMENT
Q12 EMPLOYEE DEPARTMENT
Q13 WORKS_ON PROJECT
Q14 WORKS_ON PROJECT
Q15 EMPLOYEE WORKS_ON PROJECT
Step 1: Example (Closed Itemsets)
List of frequent closed itemsets with min-support-threshold = 2
Note: 1-itemsets are omitted from the results
Itemset Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT 2
EMPLOYEE, WORKS_ON, PROJECT 5
EMPLOYEE, DEPARTMENT, PROJECT 3
EMPLOYEE, PROJECT 6
WORKS_ON, PROJECT 7
EMPLOYEE, DEPARTMENT 7
EMPLOYEE, DEPENDENT 3
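The tabulated frequencies can be re-derived from the fifteen query transactions listed earlier:

```python
queries = [
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q1
    {'EMPLOYEE', 'DEPARTMENT', 'PROJECT'},              # Q2
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q3
    {'EMPLOYEE', 'DEPARTMENT', 'WORKS_ON', 'PROJECT'},  # Q4
    {'EMPLOYEE', 'WORKS_ON', 'PROJECT'},                # Q5
    {'EMPLOYEE', 'DEPARTMENT', 'WORKS_ON', 'PROJECT'},  # Q6
    {'EMPLOYEE', 'DEPENDENT'},                          # Q7
    {'EMPLOYEE', 'WORKS_ON', 'PROJECT'},                # Q8
    {'EMPLOYEE', 'DEPENDENT'},                          # Q9
    {'EMPLOYEE', 'DEPENDENT'},                          # Q10
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q11
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q12
    {'WORKS_ON', 'PROJECT'},                            # Q13
    {'WORKS_ON', 'PROJECT'},                            # Q14
    {'EMPLOYEE', 'WORKS_ON', 'PROJECT'},                # Q15
]

def support(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)
```

For instance, {EMPLOYEE, DEPARTMENT} appears in Q1-Q4, Q6, Q11, and Q12, matching the tabulated frequency of 7.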
Step 1: Example (Fuzzy Sets)
Fuzzy Sets
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
Itemset Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT 2
EMPLOYEE, WORKS_ON, PROJECT 5
EMPLOYEE, DEPARTMENT, PROJECT 3
EMPLOYEE, PROJECT 6
WORKS_ON, PROJECT 7
EMPLOYEE, DEPARTMENT 7
EMPLOYEE, DEPENDENT 3
Example (Fuzzy Sets)
SUGGESTED ALLOCATION, NO REPLICATION CASE
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115}
Fuzzy Sets
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
Example (Fuzzy Sets)
SUGGESTED ALLOCATION, REPLICATION CASE; AT MOST THREE REPLICAS ALLOWED
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261, DEPARTMENT: 0.250}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
Step 2: Building the Network
Each item (table) is a node in the network
An edge exists between two nodes if they appear together in at least one frequent closed itemset
The weight of an edge between two nodes is related to the number of frequent closed itemsets in which corresponding tables appear together
Weight is normalized
Step 2: Example
Network of tables
Note: Table DEPT_LOCATIONS is not included in the graph since this table did not appear in any of the queries
Step 3: Applying Network Analysis
Various network analysis techniques can be used to extract relationships of tables from the social network
Centrality measures can be used to identify the tables that are in relationship with many other tables and consequently play a key role in linking data from different tables together
Graph clustering algorithms can be applied to find groups of tables that are frequently accessed together in queries
Step 3: Example (Centrality Measures)
Table   Degree (unweighted)   Closeness   Betweenness
EMPLOYEE 4 0.40 6
DEPARTMENT 3 0.27 4
WORKS_ON 3 0.25 4
PROJECT 3 0.36 4
DEPENDENT 1 0.18 4
Step 3: Example (Clustering Results)
Edge betweenness clusters C1: EMPLOYEE, PROJECT, DEPARTMENT C2: WORKS_ON C3: DEPENDENT
MST clusters C1: DEPENDENT C2: EMPLOYEE, WORKS_ON, PROJECT C3: DEPARTMENT
The clustering results may seem meaningless since in this example we have 5 highly correlated nodes in the graph
Experiment 1: Centrality Measures
This experiment has been done on a synthetic dataset of 14 tables (T0 to T13) and 20 queries, min-support-threshold = 2
High degree nodes: T10 (6), T14 (4)
High closeness nodes: T10 (0.25), T14 (0.20)
High betweenness nodes: T10 (86), T14 (49)
Experiment 1: Clustering Results
Edge betweenness clusters C1: T11, T12, T13, T14 C2: T1, T0, T2 C3: T4, T5, T10, T8, T3
MST clusters C1: T11 C2: T4, T3 C3: T5, T10, T12, T13, T8, T14, T1, T0, T2
Experiment 2: Centrality Measures
The experiment has been done on a synthetic dataset of 14 tables (T0 to T13) and 30 queries, min-support-threshold = 1
High degree nodes: T7 (12), T10 (11)
High closeness nodes: T10 (0.20), T7 (0.19)
High betweenness nodes: T7 (43), T10 (31)
Experiment 2: Clustering Results
Edge betweenness clusters C1: T6 C2: T8 C3: T4, T5, T3, T2 C4: T1, T0 C5: T7, T10, T11, T12, T13, T14, T9
MST clusters C1: T6, T8 C2: T11 C3: T7, T9 C4: T10, T12, T13, T14, T1, T0, T2 C5: T4, T5, T3
To further demonstrate the effectiveness of the proposed approach in practice, we conducted another experiment using a synthetic query set of 1000 queries on 50 tables.
Finding real data is very hard because this type of data is very sensitive and hence highly confidential.
We generated the data by restricting the number of tables that could appear in the same query to at most 20: one query may require accessing at most 20 different tables, though in practice it is usually not more than four or five tables.
These are four example communities:
{T6, T8, T9, T22, T23, T24, T33}
{T6, T9, T21, T37, T42, T45}
{T5, T6, T11, T13, T14, T16, T19}
{T6, T7, T9, T10, T12, T13, T19}
From Frequent Patterns to Network Construction
Overview
Given a dataset, e.g., emails exchanged between a group of people, like employees in the same company
Partition the dataset into groups based on a certain criterion to be studied. To study the employees, all emails are grouped such that emails of the same employee form one group
Decide on the items to be considered in the analysis. E.g., each email could be a transaction, and words/e-mail addresses within the header/text could be items
Mine FP within each group and globally
Find relevant features for each group based on the entropy
The Proposed Framework
Feature Extraction Model: mine frequent closed patterns; calculate weights of features to create feature vectors; select suitable features based on entropy ranking
Network Creation Model: takes the frequent closed patterns and selected features as input
Statistical Analysis Model
Front End Interface and Visualization Tool
Feature Extraction Model: The Feature Vector
The feature vector related to entity ej with m features is represented as
Fj = ( w(f1), w(f2), …, w(fm) ),
where w(fk) is the weight of the k-th feature fk in entity ej.
Feature Extraction Model: Weight of a Feature
The weight of each feature is calculated using the following formula:
wDj(fk) = supDj(fk) / supD(fk)
where
wDj(fk) is the weight of feature fk for entity ej,
supDj(fk) is the frequency of feature fk across dataset Dj of entity ej, and
supD(fk) is the frequency of fk across dataset D of all entities E.
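A sketch of building one feature vector with this weighting (the counts are invented for illustration); a feature reaches weight 1.0 only when all of its global occurrences fall inside the entity's own dataset:

```python
def feature_vector(entity_counts, global_counts, features):
    """F_j = (w(f_1), ..., w(f_m)) with
    w(f_k) = sup_Dj(f_k) / sup_D(f_k)."""
    return [entity_counts.get(f, 0) / global_counts[f] for f in features]

# Hypothetical stem-word counts for one user's inbox vs. all inboxes.
fv = feature_vector({'meeting': 5, 'gas': 1},
                    {'meeting': 20, 'gas': 4, 'trade': 10},
                    ['meeting', 'gas', 'trade'])
```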
Experimental Results: Enron E-mail dataset description
The dataset contains 500,000 e-mail messages from about 150 Enron employees.
For this analysis, only inboxes with more than 1000 e-mails were considered.
From each user's inbox we chose 1000 e-mails at random; these make up the e-mail dataset for the corresponding user.
Experimental Results: Processing Enron E-mail dataset
Identify itemsets from the e-mail dataset:
The stem words appearing in the body and the subject line of the e-mails are considered as items.
E-mail addresses inside the e-mails are identified as items as well.
The items appearing in a single e-mail are considered as a single transaction.
This way, for each user we build a transactional database of 1000 transactions, one per e-mail in the inbox.
From these transactional databases we identify the globally frequent closed itemsets (corresponding to a support of 10%).
Based on entropy ranking, we chose the top 100 closed itemsets as our feature set.
Experimental Results: Euclidean Distance Matrix for Enron Users
buy dean ermis jones kamiski keavey lokey may sager saibi salisbury shackleton thomas whalley ybarbo
buy 0.00 0.65 0.57 0.26 0.43 0.41 0.43 0.35 0.32 0.36 0.25 0.22 0.65 0.60 0.59
dean 0.65 0.00 0.13 0.50 0.28 0.50 0.27 0.68 0.40 0.44 0.73 0.64 0.08 0.10 0.13
ermis 0.57 0.13 0.00 0.44 0.22 0.44 0.21 0.61 0.33 0.38 0.65 0.56 0.15 0.14 0.16
jones 0.26 0.50 0.44 0.00 0.27 0.35 0.29 0.38 0.19 0.26 0.36 0.21 0.50 0.47 0.44
kamiski 0.43 0.28 0.22 0.27 0.00 0.31 0.16 0.47 0.17 0.28 0.51 0.39 0.28 0.25 0.25
keavey 0.41 0.50 0.44 0.35 0.31 0.00 0.38 0.25 0.30 0.41 0.45 0.38 0.51 0.47 0.50
lokey 0.43 0.27 0.21 0.29 0.16 0.38 0.00 0.50 0.22 0.25 0.52 0.41 0.27 0.25 0.24
may 0.35 0.68 0.61 0.38 0.47 0.25 0.50 0.00 0.40 0.45 0.35 0.33 0.69 0.65 0.67
sager 0.32 0.40 0.33 0.19 0.17 0.30 0.22 0.40 0.00 0.25 0.44 0.28 0.40 0.36 0.36
saibi 0.36 0.44 0.38 0.26 0.28 0.41 0.25 0.45 0.25 0.00 0.45 0.34 0.43 0.41 0.41
salisbury 0.25 0.73 0.65 0.36 0.51 0.45 0.52 0.35 0.44 0.45 0.00 0.30 0.75 0.70 0.70
shackleton 0.22 0.64 0.56 0.21 0.39 0.38 0.41 0.33 0.28 0.34 0.30 0.00 0.63 0.60 0.59
thomas 0.65 0.08 0.15 0.50 0.28 0.51 0.27 0.69 0.40 0.43 0.75 0.63 0.00 0.09 0.13
whalley 0.60 0.10 0.14 0.47 0.25 0.47 0.25 0.65 0.36 0.41 0.70 0.60 0.09 0.00 0.11
ybarbo 0.59 0.13 0.16 0.44 0.25 0.50 0.24 0.67 0.36 0.41 0.70 0.59 0.13 0.11 0.00
Distance cutoff point: 0.30
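Applying the 0.30 cutoff to the matrix yields the network's edges; the excerpt below uses three rows of the table (dean, thomas, whalley), whose mutual distances are all below the cutoff:

```python
def edges_from_distances(names, dist, cutoff):
    """Connect two users when their feature-vector distance
    is below the cutoff."""
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if dist[i][j] < cutoff]

# Excerpt of the Euclidean distance matrix above.
names = ['dean', 'thomas', 'whalley']
dist = [[0.00, 0.08, 0.10],
        [0.08, 0.00, 0.09],
        [0.10, 0.09, 0.00]]
edges = edges_from_distances(names, dist, cutoff=0.30)
```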
Experimental Results: The Enron E-mail users’ social network based on e-mail usage
Five clusters of Enron e-mail users:
1 saibi
2 buy, salisbury, shackleton, jones
3 dean, ermis, jones, kaminski, lokey, sager, thomas, whalley, ybarbo
4 keavey
5 may
From Association rules to Network
Basic Steps
Given a website, the mining process can be applied on three dimensions: content, structure, and log
Actors in the network are the pages.
Construct the adjacency matrix by mining association rules from the transactional database obtained after preprocessing the web log data:
Each transaction is a set of pages accessed together in one session.
An FPM algorithm, e.g., Apriori or FP-growth, is applied on the derived transactional data and association rules are derived.
Basic Steps
Determine frequent itemsets
Find association rules
Add the items in each rule as nodes in the graph and connect items on the left side to items on the right side (directed edges)
Use support and confidence to find a combined weight for each added edge
If the edge already exists, then add the new weight to the existing weight of the edge
Analyze the graph using SNA techniques
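The rule-to-graph steps above can be sketched as follows; the slides do not specify how support and confidence are combined into one weight, so the product used here is an assumption, and the rules are hypothetical:

```python
from collections import defaultdict

def graph_from_rules(rules):
    """rules: (antecedent, consequent, support, confidence) tuples.
    Adds a directed edge from every left-side item to every
    right-side item; weights accumulate when an edge recurs.
    Combined weight = support * confidence (an assumption)."""
    weights = defaultdict(float)
    for antecedent, consequent, sup, conf in rules:
        for a in antecedent:
            for c in consequent:
                weights[(a, c)] += sup * conf
    return weights

w = graph_from_rules([({'p1'}, {'p2'}, 0.4, 0.8),
                      ({'p1', 'p3'}, {'p2'}, 0.2, 0.5)])
```

The p1 -> p2 edge appears in both rules, so its weights add up (0.32 + 0.10).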
From Association Rules to Social Network
From Association Rules to Social Network
Analyze weblog
Determine frequent sets of pages based on frequency of pages accessed together
Determine rules and keep only those satisfying minimum confidence
Construct network of pages based on rules
From Association Rules to Network
Each rule is reflected in the adjacency matrix by incrementing every entry (i, j) such that pages i and j exist in the antecedent and consequent of the rule, respectively.
Entries in the adjacency matrix are normalized by dividing each value by the overall average of the values that exist in the matrix.
The network is analyzed to rank the pages by considering their in-degrees, out-degrees, betweenness, and eigenvector centrality.
Pages with high betweenness centrality are considered important for linking pages from different communities.
From Association Rules to Social Network
The analysis was done using the software Visone (http://visone.info/)
Betweenness Centrality measure
From Association Rules to Social Network
Closeness Centrality measure
From Association Rules to Social Network
Eigenvector Centrality measure
From Multi-objective GA based clustering to Network Construction
The case of Genes/Proteins
Motivation
In most traditional clustering algorithms, the number of clusters is given a priori.
In fact, the clustering criterion depends on more than one objective!
Cluster validation is used to assess the number of clusters.
Multi-objective clustering must work on both small and large data sets.
Objective Functions For Clustering
Three objectives:
F1 : minimize the number of clusters
F2 : maximize the heterogeneity between clusters
F3 : maximize the within cluster homogeneity
Objective functions
Divide and Conquer
Basic Steps:
If the dataset to be clustered is of manageable size then it is clustered as a whole set.
Otherwise
repeat the following steps
Partition the dataset (or set of centroids after the first iteration) into subsets of manageable size
Cluster each subset individually by applying multi-objective GA combined with validity analysis to get the centroids of the obtained clusters
If the set of all centroids is of manageable size then cluster the whole set of centroids and exit the loop
Backtrack to merge clusters that have their centroids ending up in the same final cluster
Unique Solution of Compact Clusters
From Alternative Solutions to Adjacency Matrix
Genes × Genes adjacency matrix: entry (i,j) specifies the number of solutions where Genei and Genej occurred in the same cluster
From Adjacency Matrix to Network
Criminal and Terror Network Analysis
Terror Network Analysis by Clustering
We developed a framework that employs clustering, frequent pattern mining and some social network analysis measures to determine the effectiveness of a network.
The clustering and frequent pattern mining techniques start with the adjacency matrix of the network.
For clustering, we utilize entries in the table by considering each row as an object and each column as a feature.
The features of a network member are his/her direct neighbors. We maintain the weights of links in the case of weighted networks.
Multi-Objective GA based Clustering
We applied multi-objective GA based clustering
Terror Network Analysis by Clustering & FPM
For clustering, we consider each row as an instance and each column as a feature.
We cluster instances to find important groups and individuals within the network.
For frequent pattern mining, we consider each row of the adjacency matrix as a transaction and each column as an item.
We map entries into a 0/1 scale such that every entry whose value is greater than zero is assigned the value one; entries keep the value zero otherwise.
This way we can apply frequent pattern mining algorithms to determine the most influential members in a network as well as the effect of removing some members or even links between members of a network.
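The binarization step, plus a very simple support count, can be sketched as follows. This is only an illustration of the transaction view, not the full frequent pattern mining algorithm; here a member (column) is called frequent if it appears as a neighbor in at least `minsup` rows, a rough influence indicator.

```python
def binarize(A):
    # Map the weighted adjacency matrix onto a 0/1 scale
    return [[1 if v > 0 else 0 for v in row] for row in A]

def frequent_members(A, minsup):
    """Treat each row as a transaction and each column as an item;
    return the columns whose support meets the minimum threshold."""
    B = binarize(A)
    n = len(B)
    support = [sum(B[r][c] for r in range(n)) for c in range(n)]
    return [c for c in range(n) if support[c] >= minsup]
```

Removing a member (zeroing its row and column) and re-running the count then shows the effect of that removal on the remaining members' support.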
Terror Network Analysis
We investigate the effect of adding some links between members.
We are able to study how the various members in the network change role as the network evolves.
This is measured by applying some SNA measures on the network at each stage during the development.
We report some interesting results on various benchmark networks, including the 9/11 and Madrid bombing networks.
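One of the simplest SNA measures used in such stage-by-stage analysis is degree centrality; a minimal sketch, assuming the network is given as an undirected edge list over members 0..n-1, with member removal simulated by dropping that member's links:

```python
def degree_centrality(edges, n):
    # Normalized degree centrality for each of n members
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return [d / (n - 1) for d in deg]

def remove_member(edges, m):
    # Simulate removing member m: drop all of m's links
    return [(i, j) for i, j in edges if m not in (i, j)]
```

Comparing the centrality vectors before and after a removal (or after adding links) shows how the remaining members' roles shift as the network evolves.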
Database Search
Problem Definition
You tell the computer what you want in terms that mean something to you, using fuzzy sets.
You ask your question using those fuzzy terms.
The computer tells you how accurate your results are: the degree of membership.
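Degrees of membership come from fuzzy set membership functions; a minimal sketch using a triangular membership function, with the fuzzy term "young" over an age attribute as a hypothetical example (the breakpoints 0, 25, 40 are illustrative, not from the talk):

```python
def triangular(a, b, c):
    """Triangular fuzzy set with support (a, c) and full membership at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# Hypothetical fuzzy term "young" over an age attribute
young = triangular(0, 25, 40)
```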
Related Work: Database Search
Fuzzy data representation. Disadvantages:
Existing databases need to be re-structured
Traditional users are prevented from executing standard (non-fuzzy) queries
Extending a query language to support fuzzy querying without changing the database itself. Disadvantages:
Commercially available DBMSs need to support a new query language
Users are required to learn the new query language
Motivation
We propose an independent intermediate translation layer to incorporate fuzziness into the interface/querying facility of database systems, so as to retrieve more accurate facts.
Groups within a social network may share the same intermediate layer.
A recommendation system based on SNA helps users build their intermediate layer.
The intermediate layer provides the mapping between the fuzziness expected by the user and the actual crisp values stored in the data repository.
Methodology
Fuzziness can be specified:
Manually: by a human expert
Semi-automatically: a human expert decides on the number of fuzzy sets, and the intermediate layer defines the fuzzy sets
Fully automatically: by the intermediate layer
The intermediate layer uses the fuzzy set specifications to map between the fuzziness expected by the user and the actual crisp values stored in the data repository.
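The translation step can be sketched as follows; an illustrative skeleton, assuming rows are dictionaries of crisp attribute values, a membership function per fuzzy term, and an assumed alpha-cut of 0.5 below which rows are dropped:

```python
def fuzzy_select(rows, attr, mu, alpha=0.5):
    """Intermediate-layer sketch: evaluate a fuzzy predicate over the crisp
    stored values, returning rows whose membership degree reaches alpha,
    each annotated with that degree (the result's reported accuracy)."""
    scored = [(row, mu(row[attr])) for row in rows]
    return sorted([(r, d) for r, d in scored if d >= alpha],
                  key=lambda t: -t[1])  # most accurate matches first
```

A production layer would instead rewrite the fuzzy predicate into a crisp range condition for the underlying DBMS; this in-memory version just shows the mapping and the returned degrees.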
Intelligent Database Search
AskFuzzy: Attractive Visual Fuzzy Query Builder*
The fuzzy layer sits between the fuzzy query and the DBMS and carries out three steps:
1. Data fuzzification
2. Fuzzy query construction
3. Fuzzy query execution
* ICDE 2011, IEEE International Conference on Data Engineering (http://cpsc.ucalgary.ca/~nkoochak/AskFuzzy/)
Transferring numeric values to fuzzy sets:
Manual: number of fuzzy sets and fuzzy set functions both specified by the user
Semi-automated: number of fuzzy sets by the user; fuzzy set functions by the system (initial fuzzy sets based on the clustering result, optimized fuzzy sets based on genetic algorithm optimization)
Fully automated: both by the system (the optimization process minimizes the number of clusters and maximizes cluster quality)
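The clustering-based initialization can be sketched concretely. A minimal illustration, assuming sorted 1-D cluster centroids of a numeric attribute become the peaks of triangular fuzzy sets, with each set's support stretching to the neighboring centroids (or the attribute's range endpoints); the GA refinement step is not shown.

```python
def fuzzy_sets_from_centroids(centroids, lo, hi):
    """Derive one triangular fuzzy set (left, peak, right) per cluster
    centroid of a numeric attribute; neighboring centroids serve as
    each set's support endpoints. lo/hi bound the attribute's range."""
    cs = sorted(centroids)
    anchors = [lo] + cs + [hi]
    return [(anchors[k], b, anchors[k + 2]) for k, b in enumerate(cs)]
```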
Conclusions
Data mining and machine learning techniques can be integrated with network-based analysis.
The combination leads to:
A strong framework for data analysis from various perspectives
Consideration of global correlations within the data, which yields more realistic results
A variety of application domains can benefit from the integrated setup
The End! Thank you for your attention.
Reda Alhajj
[email protected]