The Power of Data Mining and Machine Learning Techniques for Network
Construction and Analysis
Reda Alhajj
University of Calgary, Calgary, Alberta, Canada
Global University, Beirut, Lebanon
BYU, Provo, USA, March 2013
General Overview
The network model provides a powerful platform to study a group of entities and their relationships
The semantics of the links in the network is determined by considering the application domain to be investigated
A network can be constructed by considering pairwise correlation between entities or by investigating the correlation between two entities based on a global view of the data
Data mining and machine learning techniques allow for better investigation by taking a global view of the data to derive the strength of pairwise links
The combination of data mining, machine learning and network analysis would lead to a comprehensive and robust framework for data analysis.
2 Reda Alhajj, University of Calgary
Outline of the talk
Background on ARM, Clustering, Network Model, fuzziness
From FPM, ARM and clustering to network
Some application domains: database design, web mining, terror network analysis, outlier detection, disease biomarkers, database search
Conclusions and research directions
Overview of Association Rules Mining
A general model for mining domains where there is a many-to-many relationship between two sets of entities, e.g., baskets and items, documents and words, etc.
Consider a set of items I = {I1, I2, I3, …, Im}
Consider a database of transactions D where each transaction T is a set of items such that T ⊆ I
So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
An association rule is an implication or correlation of the form:
A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
Support and confidence are the measures generally used to filter the rules
Association Rules Mining: Two Steps
In general association rules mining can be reduced to the following two steps:
1. Find all frequent itemsets: each itemset will occur at least as frequently as a minimum support count
2. Generate strong association rules from the frequent itemsets: these rules will satisfy the minimum support and confidence measures
We use the outcome from the first step in part of the research and the outcome from the second step in another part of the research
Association Rules Mining: Apriori Algorithm
Any subset of a frequent itemset must be frequent
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
Minimum support = 2
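The level-wise procedure above can be sketched in a few lines; the transactions below are hypothetical (not the slide's example), and this is an illustrative sketch rather than a production miner:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining with Apriori pruning."""
    frequent = {}
    # Start from candidate 1-itemsets.
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    k = 1
    while level:
        # Count support of every candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Build (k+1)-candidates; Apriori pruning: every k-subset must be frequent.
        level, seen = [], set()
        for a, b in combinations(sorted(survivors, key=sorted), 2):
            cand = a | b
            if len(cand) == k + 1 and cand not in seen:
                seen.add(cand)
                if all(frozenset(s) in survivors for s in combinations(cand, k)):
                    level.append(cand)
        k += 1
    return frequent

freq = apriori([{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'B', 'C'}], min_support=2)
```

With minimum support 2, every pair survives, while {A, B, C} (support 1) is filtered out at the counting step.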
Association Rules Mining: Frequent Closed Itemsets
A frequent itemset X is closed if none of its immediate supersets has the same support as the itemset X
Example
Image Reference: http://www.siam.org/meetings/sdm06/proceedings/038lucchesec.pdf
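Given supports for all frequent itemsets, closedness follows directly from the definition above; since support is anti-monotone, comparing against any proper superset is equivalent to comparing against immediate supersets. A sketch with toy numbers (not the cited example):

```python
def closed_itemsets(freq):
    """freq: dict mapping frozenset -> support.
    Keep only itemsets with no proper superset of equal support."""
    return {x: s for x, s in freq.items()
            if not any(x < y and sy == s for y, sy in freq.items())}

freq = {frozenset({'A'}): 3,
        frozenset({'A', 'B'}): 3,
        frozenset({'A', 'B', 'C'}): 2}
closed = closed_itemsets(freq)  # {'A'} is absorbed by {'A', 'B'}
```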
Clustering
It is an unsupervised learning process
It is the process of distributing a given set of data instances into groups such that the similarity of instances is high within each group and low between the groups.
Similarity within a cluster (intra-cluster) is measured using variance, average variance, or TWCV (total within-cluster variation)
Similarity across clusters (inter-cluster) is measured based on linkage.
For clustering we need to know at least the characteristics of the instances and the similarity measure to be used in the process
Various algorithms exist for clustering, e.g., k-means, DBSCAN, etc.
Each algorithm has its advantages and disadvantages
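The intra-cluster criterion mentioned above (TWCV, total within-cluster variation) can be sketched as follows; the tiny 2-D clusters are invented for illustration:

```python
def twcv(clusters):
    """Total within-cluster variation: sum of squared distances of
    each point to its cluster centroid, summed over all clusters."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))
    return total

# Two tight clusters -> small TWCV; merging them would inflate it.
score = twcv([[(0, 0), (2, 0)], [(10, 0), (10, 2)]])
```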
Clustering
Example 1
Example 2
Overview of Social Network Analysis
A social network is a set of entities called actors and the links connecting them. Examples: students enrolled in the same courses, people and their likes, etc. A social network is mostly represented as a graph called a sociogram
Social Network Analysis (SNA) is powerful because it has foundations in math/graph theory
SNA provides a set of tools to empirically extend our theoretical intuition of the patterns that compose a social structure.
SNA provides a set of relational methods for systematically understanding and identifying connections among actors.
SNA embodies a range of theories relating types of observable social spaces and their relation to individual and group behavior.
Social Network Analysis: Centrality Measures
Degree: sum of connections (sum of the weights of connections in the case of weighted graphs) from or to an actor
Closeness: distance of one actor to all others in the network
Betweenness: the number of shortest paths that pass through an actor
Eigenvector: measures the importance of an actor (a node is central to the extent that its neighbors are central)
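Degree and closeness are easy to compute by hand; a minimal pure-Python sketch on a star graph (betweenness and eigenvector centrality need more machinery and are omitted here):

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest-path (hop) distances from src in an unweighted graph."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(adj, v):
    return len(adj[v])

def closeness(adj, v):
    """Normalized closeness: (n - 1) / sum of distances to all others."""
    d = bfs_distances(adj, v)
    return (len(adj) - 1) / sum(d[u] for u in adj if u != v)

# Star graph: the hub 'c' is maximally central.
adj = {'c': ['a', 'b', 'd'], 'a': ['c'], 'b': ['c'], 'd': ['c']}
```

On this graph the hub has closeness 1.0 (one hop to everyone), while each leaf must go through the hub to reach the other leaves.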
Social Network Analysis: Centrality Measures (example)
The red nodes have the highest degree centrality
The blue node has the highest Closeness and betweenness centrality
Node 7 has the highest degree centrality
Node 8 has the highest betweenness Centrality
Nodes 4 and 5 have the highest Closeness Centrality
Example 1 Example 2
Image Reference:http://www.biomedcentral.com/
Image Reference:http://mande.co.uk/special-issues/network-models/
Social Network Analysis: Graph Clustering Algorithms
MST based clustering
First finds a Minimum Spanning Tree (MST) of the graph
Removes edges with the highest weight from the MST to form clusters of vertices (actors)
Edge Betweenness clustering
The betweenness of an edge is defined as the extent to which the edge lies along shortest paths
First computes edge betweenness for all edges in the current graph
Removes edges having the highest betweenness from the graph
One Mode versus Two Mode Networks
Queries (users) versus Tables is a two mode network
Folding is used to produce one mode networks from a two mode network
Folding is simply the multiplication of the adjacency matrix of the two mode network by its transpose
Two-mode (incidence) matrix M:
   X Y Z
A  1 0 0
B  1 0 1
C  1 1 0
D  1 0 1

Transpose of M:
   A B C D
X  1 1 1 1
Y  0 0 1 0
Z  0 1 0 1
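Folding the incidence matrix above is literally M · Mᵀ; entry (i, j) counts the events actors i and j share, and the diagonal counts each actor's own memberships:

```python
def fold(m):
    """One-mode projection of a two-mode network: multiply the
    incidence matrix by its transpose."""
    n, cols = len(m), len(m[0])
    return [[sum(m[i][k] * m[j][k] for k in range(cols)) for j in range(n)]
            for i in range(n)]

# Rows A, B, C, D; columns X, Y, Z (the matrix from the slide).
M = [[1, 0, 0],   # A
     [1, 0, 1],   # B
     [1, 1, 0],   # C
     [1, 0, 1]]   # D
one_mode = fold(M)  # 4x4 actor-by-actor matrix
```

For instance, B and D co-occur in both X and Z, so their link weight is 2, while A and C share only X.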
Fuzzy Sets
Generalizes classical set theory by a characteristic membership function.
A membership function introduces a grey area between the black and white areas
Consider fuzzy set A, its domain D, and object x.
Membership function µ specifies the degree of membership of x in A:
µA(x): D → [0, 1].
µA(x)= 0 means x does not belong to A.
µA(x)= 1 means x completely belongs to A.
Intermediate values 0< µA(x)<1 represent varying degree of membership.
Example on Membership

Income      Range       Centroid
Quite poor  10-10-30    -
Poor        10-30-70    30
Moderate    30-70-120   70
Rich        70-120-120  -

The ranges of the fuzzy sets
The membership functions found according to the centroids
[Figure: membership (0.0-1.0) of the quite poor, poor, moderate, and rich sets over income from 10K to 120K ($)]
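The poor and moderate sets from the table can be written as triangular membership functions; the sketch below assumes simple triangles between the tabulated breakpoints (the boundary sets "quite poor" and "rich" would additionally saturate at the edges of the income range):

```python
def tri(x, a, b, c):
    """Triangular membership: 0 outside [a, c], peak 1 at b."""
    if x < a or x > c:
        return 0.0
    if x <= b:
        return 1.0 if a == b else (x - a) / (b - a)
    return 1.0 if b == c else (c - x) / (c - b)

# Breakpoints (in $) taken from the table above.
poor = lambda x: tri(x, 10_000, 30_000, 70_000)
moderate = lambda x: tri(x, 30_000, 70_000, 120_000)
```

An income of $50K falls halfway between the poor and moderate centroids, so it belongs to both sets with degree 0.5.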
From FPM to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for FPM by deciding on the baskets and items. Keep in mind that items are the actors in the network
Apply the FPM algorithm of your choice to find Frequent sets of items; it is possible to narrow down to closed or maximal FP
Construct the network by considering the frequent sets as follows:
Add a link between two actors i and j iff i and j exist together in at least one FP, the weight of the link is set to the number of common FP’s
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
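The edge-building step above can be sketched directly: count, for every pair of actors, the frequent patterns they co-occur in (the patterns below are hypothetical):

```python
from itertools import combinations
from collections import Counter

def network_from_patterns(patterns):
    """Weighted edge list: weight(i, j) = number of frequent
    patterns containing both i and j."""
    weights = Counter()
    for fp in patterns:
        for i, j in combinations(sorted(fp), 2):
            weights[(i, j)] += 1
    return weights

w = network_from_patterns([{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}])
```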
From FPM to Network Construction
From ARM to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for ARM by deciding on the baskets and items. Keep in mind that items are the actors in the network; they will form the antecedents and consequents of the rules
Apply the ARM algorithm of your choice to find all AR’s that satisfy certain criteria
Construct the network by considering the AR's as follows: Add a link between two actors i and j iff i and j exist together in at least one AR; the weight of the link is set to the number of common AR's. It is possible to concentrate on the antecedent, the consequent, or both.
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
From ARM to Network Construction
From Clustering to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for clustering by deciding on the features to consider in computing the similarity measure
Apply either one clustering algorithm several times, varying the required input parameters, or a number of different clustering algorithms, to find one clustering solution per run.
Construct the network by considering the clusters as follows: Add a link between two actors i and j iff i and j exist together in the same cluster in at least one clustering solution; the weight of the link is set to the number of common clusters across the solutions.
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
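Counting co-occurrence across solutions yields a co-association matrix; a sketch with two made-up clustering solutions:

```python
from itertools import combinations
from collections import Counter

def coassociation(solutions):
    """solutions: list of clustering solutions, each a list of clusters
    (sets of actors). weight(i, j) = number of solutions in which
    i and j fall in the same cluster."""
    weights = Counter()
    for clusters in solutions:
        for cluster in clusters:
            for i, j in combinations(sorted(cluster), 2):
                weights[(i, j)] += 1
    return weights

w = coassociation([[{'a', 'b'}, {'c'}],   # solution 1
                   [{'a', 'b', 'c'}]])    # solution 2
```

Here a and b share a cluster in both solutions (weight 2), while a-c and b-c co-occur only in the second (weight 1).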
Network Construction
Multiple clustering solutions
From the Data to Network Construction
Given a data set of M instances and N features per instance
Prepare the data for processing by deciding on the features P to consider in the analysis
Construct an M×P matrix A by considering every instance as a row and every feature as a column
Find the transpose of matrix A
Multiply matrix A by its transpose to get the adjacency matrix for the target network.
It is possible to normalize the weights and/or remove some links based on a certain criterion, e.g., below-average weight or below a certain predefined weight threshold.
NetDriller : A Powerful Social Network Analysis Tool*
Negar Koochakzadeh, Atieh Sarraf, Keivan Kianmehr, Jon Rokne, Reda Alhajj
{nkoochak, sarrafsa}@ucalgary.ca, [email protected], {alhajj, rokne}@ucalgary.ca
Social Network Analysis (SNA) is a technique first used in sociology.
Recently computer scientists have realized that this model is general enough to be applied to any domain where the entities and their interconnections can be separated into actors and their links, respectively. Data Mining techniques can strengthen SNA
Searching in the Network: Example 1: Find individuals who could monitor the information flow in an organization better than most others. Example 2: Find individuals who have the best picture of what is happening in the network as a whole.
Closeness centrality reveals how long it takes information to spread from one individual to others in the network. High-scoring individuals in closeness have the shortest paths to all others in the network.
Betweenness centrality indicates the extent to which an individual is a broker of indirect connections among all others in a network. Someone with high betweenness could be thought of as a gatekeeper of information flow. People that occur on many shortest paths among other people have the highest betweenness value.
Degree centrality indicates the extent to which an individual sends or receives information to/from the neighbors.
Eigenvector centrality calculates the principal eigenvector of the network. A node is central to the extent that its neighbors are central.
Fuzzy Query Example: Find individuals with high centralities
Raw Dataset: People and their attributes
Social Network: Based on community detection
Fuzzy Query Result: Color hue shows the degree of membership (DofM)
Fuzzy Sets: Based on multi-objective GA optimization
age work-class education marital-status occupation relationship race sex hours/week native-country
39 State-gov Bachelors Never-married Adm-clerical Not-in-family White Male 40 US
50 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Husband White Male 13 Canada
52 Self-emp-not-inc HS-grad Married-civ-spouse Exec-managerial Husband White Male 45 US
30 State-gov Bachelors Married-civ-spouse Prof-specialty Husband Black Male 40 India
25 Self-emp-not-inc HS-grad Never-married Farming-fishing Own-child White Male 35 Iran
43 Self-emp-not-inc Masters Divorced Exec-managerial Unmarried White Female 45 US
…
1. Network Construction
* ICDM 2011 IEEE International Conference on Data Mining http://cpsc.ucalgary.ca/~nkoochak/NetDriller/
IMPROVING DATABASE PERFORMANCE BY BUILDING AND ANALYZING NETWORK OF TABLES FROM QUERY ACCESS PATTERNS
Problem Definition
Response time in a distributed or parallel database system is largely determined by how data is organized and stored on different machines/sites.
The goal is to place related data on nearby, or preferably the same, sites to minimize the response time.
The study of data distribution requires solving two problems: 1. The partitioning problem 2. The allocation problem
Queries (users) versus Tables
Overview of the analysis process
Three main steps:
1. Considering tables as items and queries as transactions, extract frequent closed itemsets
A kind of fuzzy sets can be built from the closed itemsets in this step
2. Use the extracted itemsets from the previous step to build the network of tables
3. Use network analysis to extract information about the tables from the network of tables
Step 1: Items and Transactions
Sample database EMPLOYEE (Ssn, Fname, Lname, Dno) DEPARTMENT (Dnumber, Dname) PROJECT (Pnumber, Pname, Plocation, Dno)
Sample query (Q1):
SELECT Lname
FROM EMPLOYEE, DEPARTMENT
WHERE Dno = Dnumber AND Dname = 'Research'
Items EMPLOYEE, DEPARTMENT, PROJECT
Transactions Q1: EMPLOYEE, DEPARTMENT
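Turning queries into transactions amounts to reading the FROM clause; a naive sketch that assumes simple comma-separated FROM lists (no joins, aliases, or subqueries):

```python
import re

def query_to_transaction(sql):
    """Extract the set of tables referenced by a simple
    SELECT ... FROM t1, t2 ... [WHERE ...] query."""
    m = re.search(r'\bFROM\b(.*?)(?:\bWHERE\b|$)', sql,
                  re.IGNORECASE | re.DOTALL)
    return {t.strip() for t in m.group(1).split(',') if t.strip()}

q1 = """SELECT Lname
FROM EMPLOYEE, DEPARTMENT
WHERE Dno = Dnumber AND Dname = 'Research'"""
```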
Step 1: Example (Sample Database)
Sample database schema from Fundamentals of Database Systems, Elmasri/Navathe
Step 1: Example (List of Queries)
List of Queries in Transaction Format
Q1 EMPLOYEE DEPARTMENT
Q2 EMPLOYEE DEPARTMENT PROJECT
Q3 EMPLOYEE DEPARTMENT
Q4 EMPLOYEE DEPARTMENT WORKS_ON PROJECT
Q5 EMPLOYEE WORKS_ON PROJECT
Q6 EMPLOYEE DEPARTMENT WORKS_ON PROJECT
Q7 EMPLOYEE DEPENDENT
Q8 EMPLOYEE WORKS_ON PROJECT
Q9 EMPLOYEE DEPENDENT
Q10 EMPLOYEE DEPENDENT
Q11 EMPLOYEE DEPARTMENT
Q12 EMPLOYEE DEPARTMENT
Q13 WORKS_ON PROJECT
Q14 WORKS_ON PROJECT
Q15 EMPLOYEE WORKS_ON PROJECT
Step 1: Example (Closed Itemsets)
List of frequent closed itemsets with min-support-threshold = 2
Note: 1-itemsets are omitted from the results
Itemset Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT 2
EMPLOYEE, WORKS_ON, PROJECT 5
EMPLOYEE, DEPARTMENT, PROJECT 3
EMPLOYEE, PROJECT 6
WORKS_ON, PROJECT 7
EMPLOYEE, DEPARTMENT 7
EMPLOYEE, DEPENDENT 3
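The tabulated frequencies can be re-derived from the fifteen query transactions listed earlier:

```python
queries = [
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q1
    {'EMPLOYEE', 'DEPARTMENT', 'PROJECT'},              # Q2
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q3
    {'EMPLOYEE', 'DEPARTMENT', 'WORKS_ON', 'PROJECT'},  # Q4
    {'EMPLOYEE', 'WORKS_ON', 'PROJECT'},                # Q5
    {'EMPLOYEE', 'DEPARTMENT', 'WORKS_ON', 'PROJECT'},  # Q6
    {'EMPLOYEE', 'DEPENDENT'},                          # Q7
    {'EMPLOYEE', 'WORKS_ON', 'PROJECT'},                # Q8
    {'EMPLOYEE', 'DEPENDENT'},                          # Q9
    {'EMPLOYEE', 'DEPENDENT'},                          # Q10
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q11
    {'EMPLOYEE', 'DEPARTMENT'},                         # Q12
    {'WORKS_ON', 'PROJECT'},                            # Q13
    {'WORKS_ON', 'PROJECT'},                            # Q14
    {'EMPLOYEE', 'WORKS_ON', 'PROJECT'},                # Q15
]

def support(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)
```

For instance, {EMPLOYEE, DEPARTMENT} appears in Q1-Q4, Q6, Q11, and Q12, matching the tabulated frequency of 7.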
Step 1: Example (Fuzzy Sets)
Fuzzy Sets
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
Itemset Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT 2
EMPLOYEE, WORKS_ON, PROJECT 5
EMPLOYEE, DEPARTMENT, PROJECT 3
EMPLOYEE, PROJECT 6
WORKS_ON, PROJECT 7
EMPLOYEE, DEPARTMENT 7
EMPLOYEE, DEPENDENT 3
Example (Fuzzy Sets)
SUGGESTED ALLOCATION, NO REPLICATION CASE
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115}
Fuzzy Sets
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
Example (Fuzzy Sets)
SUGGESTED ALLOCATION, REPLICATION CASE; AT MOST THREE REPLICAS ALLOWED
{WORKS_ON: 0.500, PROJECT: 0.304}
{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}
{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}
{EMPLOYEE: 0.231, PROJECT: 0.261, DEPARTMENT: 0.250}
{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}
{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}
{EMPLOYEE: 0.115, DEPENDENT: 1.000}
Step 2: Building the Network
Each item (table) is a node in the network
An edge exists between two nodes if they appear together in at least one frequent closed itemset
The weight of an edge between two nodes is related to the number of frequent closed itemsets in which corresponding tables appear together
Weight is normalized
Step 2: Example
Network of tables
Note: Table DEPT_LOCATIONS is not included in the graph since this table did not appear in any of the queries
Step 3: Applying Network Analysis
Various network analysis techniques can be used to extract relationships of tables from the social network
Centrality measures can be used to identify the tables that are in relationship with many other tables and consequently play a key role in linking data from different tables together
Graph clustering algorithms can be applied to find groups of tables that are frequently accessed together in queries
Step 3: Example (Centrality Measures)
Table   Degree (unweighted)   Closeness   Betweenness
EMPLOYEE 4 0.40 6
DEPARTMENT 3 0.27 4
WORKS_ON 3 0.25 4
PROJECT 3 0.36 4
DEPENDENT 1 0.18 4
Step 3: Example (Clustering Results)
Edge betweenness clusters C1: EMPLOYEE, PROJECT, DEPARTMENT C2: WORKS_ON C3: DEPENDENT
MST clusters C1: DEPENDENT C2: EMPLOYEE, WORKS_ON, PROJECT C3: DEPARTMENT
The clustering results may seem meaningless since in this example we have 5 highly correlated nodes in the graph
Experiment 1: Centrality Measures
This experiment has been done on a synthetic dataset of 14 tables (T0 to T13) and 20 queries, min-support-threshold = 2
High degree nodes: T10 (6), T14 (4)
High closeness nodes: T10 (0.25), T14 (0.20)
High betweenness nodes: T10 (86), T14 (49)
Experiment 1: Clustering Results
Edge betweenness clusters C1: T11, T12, T13, T14 C2: T1, T0, T2 C3: T4, T5, T10, T8, T3
MST clusters C1: T11 C2: T4, T3 C3: T5, T10, T12, T13, T8, T14, T1, T0, T2
Experiment 2: Centrality Measures
The experiment has been done on a synthetic dataset of 14 tables (T0 to T13) and 30 queries, min-support-threshold = 1
High degree nodes: T7 (12), T10 (11)
High closeness nodes: T10 (0.20), T7 (0.19)
High betweenness nodes: T7 (43), T10 (31)
Experiment 2: Clustering Results
Edge betweenness clusters C1: T6 C2: T8 C3: T4, T5, T3, T2 C4: T1, T0 C5: T7, T10, T11, T12, T13, T14, T9
MST clusters C1: T6, T8 C2: T11 C3: T7, T9 C4: T10, T12, T13, T14, T1, T0, T2 C5: T4, T5, T3
To further demonstrate the effectiveness of the proposed approach in practice, we conducted another experiment using a synthetic query set of 1000 queries on 50 tables.
Finding real data is very hard because this type of data is very sensitive and hence highly confidential.
We generated the data by restricting the number of tables that could appear in the same query to at most 20: one query may require accessing at most 20 different tables, though in practice it is usually not more than four or five tables.
These are four example communities:
{T6, T8, T9, T22, T23, T24, T33}
{T6, T9, T21, T37, T42, T45}
{T5, T6, T11, T13, T14, T16, T19}
{T6, T7, T9, T10, T12, T13, T19}
From Frequent Patterns to Network Construction
Overview
Given a dataset, e.g., emails exchanged between a group of people, like employees in the same company
Partition the dataset into groups based on a certain criterion to be studied. To study the employees, all emails are grouped such that emails of the same employee form one group
Decide on the items to be considered in the analysis. E.g., each email could be a transaction, and words/e-mail addresses within the header/text could be items
Mine FP within each group and globally
Find relevant features for each group based on the entropy
The Proposed Framework
Feature Extraction Model: mine frequent closed patterns; calculate weights of features to create feature vectors; select suitable features based on entropy ranking
Network Creation Model: takes the frequent closed patterns and selected features as input
Statistical Analysis Model
Front End Interface and Visualization Tool
Feature Extraction Model: The Feature Vector
The feature vector related to entity ej with m features is represented as
Fj = ( w(f1), w(f2), …, w(fm) ),
where w(fk) is the weight of the k-th feature fk in entity ej.
Feature Extraction Model: Weight of a Feature
The weight of each feature is calculated using the following formula:
wDj(fk) = supDj(fk) / supD(fk)
where
wDj(fk) is the weight of feature fk for entity ej,
supDj(fk) is the frequency of feature fk across dataset Dj of entity ej, and
supD(fk) is the frequency of fk across dataset D of all entities E.
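A sketch of building one feature vector with this weighting (the counts are invented for illustration); a feature reaches weight 1.0 only when all of its global occurrences fall inside the entity's own dataset:

```python
def feature_vector(entity_counts, global_counts, features):
    """F_j = (w(f_1), ..., w(f_m)) with
    w(f_k) = sup_Dj(f_k) / sup_D(f_k)."""
    return [entity_counts.get(f, 0) / global_counts[f] for f in features]

# Hypothetical stem-word counts for one user's inbox vs. all inboxes.
fv = feature_vector({'meeting': 5, 'gas': 1},
                    {'meeting': 20, 'gas': 4, 'trade': 10},
                    ['meeting', 'gas', 'trade'])
```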
Experimental Results: Enron E-mail dataset description
The dataset contains 500,000 e-mail messages from about 150 Enron employees.
For this analysis, only inboxes with more than 1000 e-mails were considered.
From each user's inbox we chose 1000 e-mails at random; these make up the e-mail dataset for the corresponding user.
Experimental Results: Processing Enron E-mail dataset
Identify itemsets from the e-mail dataset:
The stem words appearing in the body and the subject line of the e-mails are considered as items.
E-mail addresses inside the e-mails are identified as items as well.
The items appearing in a single e-mail are considered as a single transaction.
This way, for each user we build a transactional database of 1000 transactions, one per e-mail in the inbox.
From these transactional databases we identify the globally frequent closed itemsets (corresponding to a support of 10%).
Based on entropy ranking, we chose the top 100 closed itemsets as our feature set.
Experimental Results: Euclidean Distance Matrix for Enron Users
buy dean ermis jones kamiski keavey lokey may sager saibi salisbury shackleton thomas whalley ybarbo
buy 0.00 0.65 0.57 0.26 0.43 0.41 0.43 0.35 0.32 0.36 0.25 0.22 0.65 0.60 0.59
dean 0.65 0.00 0.13 0.50 0.28 0.50 0.27 0.68 0.40 0.44 0.73 0.64 0.08 0.10 0.13
ermis 0.57 0.13 0.00 0.44 0.22 0.44 0.21 0.61 0.33 0.38 0.65 0.56 0.15 0.14 0.16
jones 0.26 0.50 0.44 0.00 0.27 0.35 0.29 0.38 0.19 0.26 0.36 0.21 0.50 0.47 0.44
kamiski 0.43 0.28 0.22 0.27 0.00 0.31 0.16 0.47 0.17 0.28 0.51 0.39 0.28 0.25 0.25
keavey 0.41 0.50 0.44 0.35 0.31 0.00 0.38 0.25 0.30 0.41 0.45 0.38 0.51 0.47 0.50
lokey 0.43 0.27 0.21 0.29 0.16 0.38 0.00 0.50 0.22 0.25 0.52 0.41 0.27 0.25 0.24
may 0.35 0.68 0.61 0.38 0.47 0.25 0.50 0.00 0.40 0.45 0.35 0.33 0.69 0.65 0.67
sager 0.32 0.40 0.33 0.19 0.17 0.30 0.22 0.40 0.00 0.25 0.44 0.28 0.40 0.36 0.36
saibi 0.36 0.44 0.38 0.26 0.28 0.41 0.25 0.45 0.25 0.00 0.45 0.34 0.43 0.41 0.41
salisbury 0.25 0.73 0.65 0.36 0.51 0.45 0.52 0.35 0.44 0.45 0.00 0.30 0.75 0.70 0.70
shackleton 0.22 0.64 0.56 0.21 0.39 0.38 0.41 0.33 0.28 0.34 0.30 0.00 0.63 0.60 0.59
thomas 0.65 0.08 0.15 0.50 0.28 0.51 0.27 0.69 0.40 0.43 0.75 0.63 0.00 0.09 0.13
whalley 0.60 0.10 0.14 0.47 0.25 0.47 0.25 0.65 0.36 0.41 0.70 0.60 0.09 0.00 0.11
ybarbo 0.59 0.13 0.16 0.44 0.25 0.50 0.24 0.67 0.36 0.41 0.70 0.59 0.13 0.11 0.00
Distance cutoff point: 0.30
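Applying the 0.30 cutoff to the matrix yields the network's edges; the excerpt below uses three rows of the table (dean, thomas, whalley), whose mutual distances are all below the cutoff:

```python
def edges_from_distances(names, dist, cutoff):
    """Connect two users when their feature-vector distance
    is below the cutoff."""
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if dist[i][j] < cutoff]

# Excerpt of the Euclidean distance matrix above.
names = ['dean', 'thomas', 'whalley']
dist = [[0.00, 0.08, 0.10],
        [0.08, 0.00, 0.09],
        [0.10, 0.09, 0.00]]
edges = edges_from_distances(names, dist, cutoff=0.30)
```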
Experimental Results: The Enron E-mail users’ social network based on e-mail usage
Five clusters of Enron e-mail users:
1 saibi
2 buy, salisbury, shackleton, jones
3 dean, ermis, jones, kaminski, lokey, sager, thomas, whalley, ybarbo
4 keavey
5 may
From Association rules to Network
Basic Steps
Given a website, the mining process can be applied on three dimensions: content, structure, and log
Actors in the network are the pages.
Construct the adjacency matrix by mining association rules from the transactional database obtained after preprocessing the web log data:
Each transaction is a set of pages accessed together in one session.
An FPM algorithm, e.g., Apriori or FP-growth, is applied on the derived transactional data and association rules are derived.
Basic Steps
Determine frequent itemsets
Find association rules
Add the items in each rule as nodes in the graph and connect items on the left side to items on the right side (directed edges)
Use support and confidence to find a combined weight for each added edge
If the edge already exists, then add the new weight to the existing weight of the edge
Analyze the graph using SNA techniques
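The rule-to-graph steps above can be sketched as follows; the slides do not specify how support and confidence are combined into one weight, so the product used here is an assumption, and the rules are hypothetical:

```python
from collections import defaultdict

def graph_from_rules(rules):
    """rules: (antecedent, consequent, support, confidence) tuples.
    Adds a directed edge from every left-side item to every
    right-side item; weights accumulate when an edge recurs.
    Combined weight = support * confidence (an assumption)."""
    weights = defaultdict(float)
    for antecedent, consequent, sup, conf in rules:
        for a in antecedent:
            for c in consequent:
                weights[(a, c)] += sup * conf
    return weights

w = graph_from_rules([({'p1'}, {'p2'}, 0.4, 0.8),
                      ({'p1', 'p3'}, {'p2'}, 0.2, 0.5)])
```

The p1 -> p2 edge appears in both rules, so its weights add up (0.32 + 0.10).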
From Association Rules to Social Network
From Association Rules to Social Network
Analyze weblog
Determine frequent sets of pages based on frequency of pages accessed together
Determine rules and keep only those satisfying minimum confidence
Construct network of pages based on rules
From Association Rules to Network
Each rule is reflected in the adjacency matrix by incrementing every entry (i, j) such that pages i and j exist in the antecedent and consequent of the rule, respectively.
Entries in the adjacency matrix are normalized by dividing each value by the overall average of the values that exist in the matrix.
The network is analyzed to rank the pages by considering their in-degrees, out-degrees, betweenness, and eigenvector centrality.
Pages with high betweenness centrality are considered important for linking pages from different communities.
From Association Rules to Social Network
The analysis was done using the software Visone (http://visone.info/)
Betweenness Centrality measure
From Association Rules to Social Network
Closeness Centrality measure
From Association Rules to Social Network
Eigenvector Centrality measure
From Multi-objective GA based clustering to Network Construction
The case of Genes/Proteins
Motivation
In most traditional clustering algorithms, the number of clusters is given a priori.
In fact, the clustering criterion depends on more than one objective!
Cluster validation is used to assess the number of clusters.
Multi-objective clustering must work on both small and large data sets.
Objective Functions For Clustering
Three objectives:
F1 : minimize the number of clusters
F2 : maximize the heterogeneity between clusters
F3 : maximize the within cluster homogeneity
Objective functions
Divide and Conquer
Basic Steps:
If the dataset to be clustered is of manageable size then it is clustered as a whole set.
Otherwise
repeat the following steps
Partition the dataset (or set of centroids after the first iteration) into subsets of manageable size
Cluster each subset individually by applying multi-objective GA combined with validity analysis to get the centroids of the obtained clusters
If the set of all centroids is of manageable size then cluster the whole set of centroids and exit the loop
Backtrack to merge clusters that have their centroids ending up in the same final cluster
Unique Solution of Compact Clusters
From Alternative Solutions to Adjacency Matrix
Genes × Genes adjacency matrix: entry (i,j) specifies the number of solutions where Genei and Genej occurred in the same cluster
From Adjacency Matrix to Network
Criminal and Terror Network Analysis
Terror Network Analysis by Clustering
We developed a framework that employs clustering, frequent pattern mining and some social network analysis measures to determine the effectiveness of a network.
The clustering and frequent pattern mining techniques start with the adjacency matrix of the network.
For clustering, we utilize entries in the table by considering each row as an object and each column as a feature.
The features of a network member are his/her direct neighbors. We maintain the weights of links in the case of weighted networks.
Multi-Objective GA based Clustering
We applied multi-objective GA based clustering
Terror Network Analysis by Clustering & FPM
For clustering, we consider each row as an instance and each column as a feature.
We cluster instances to find important groups and individuals within the network.
For frequent pattern mining, we consider each row of the adjacency matrix as a transaction and each column as an item.
We map entries into a 0/1 scale such that every entry whose value is greater than zero is assigned the value one; entries keep the value zero otherwise.
This way we can apply frequent pattern mining algorithms to determine the most influential members in a network as well as the effect of removing some members or even links between members of a network.
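The binarization step, plus a very simple support count, can be sketched as follows. This is only an illustration of the transaction view, not the full frequent pattern mining algorithm; here a member (column) is called frequent if it appears as a neighbor in at least `minsup` rows, a rough influence indicator.

```python
def binarize(A):
    # Map the weighted adjacency matrix onto a 0/1 scale
    return [[1 if v > 0 else 0 for v in row] for row in A]

def frequent_members(A, minsup):
    """Treat each row as a transaction and each column as an item;
    return the columns whose support meets the minimum threshold."""
    B = binarize(A)
    n = len(B)
    support = [sum(B[r][c] for r in range(n)) for c in range(n)]
    return [c for c in range(n) if support[c] >= minsup]
```

Removing a member (zeroing its row and column) and re-running the count then shows the effect of that removal on the remaining members' support.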
Terror Network Analysis
We investigate the effect of adding some links between members.
We are able to study how the various members in the network change role as the network evolves.
This is measured by applying some SNA measures on the network at each stage during the development.
We report some interesting results on various benchmark networks, including the 9/11 and Madrid bombing networks.
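One of the simplest SNA measures used in such stage-by-stage analysis is degree centrality; a minimal sketch, assuming the network is given as an undirected edge list over members 0..n-1, with member removal simulated by dropping that member's links:

```python
def degree_centrality(edges, n):
    # Normalized degree centrality for each of n members
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return [d / (n - 1) for d in deg]

def remove_member(edges, m):
    # Simulate removing member m: drop all of m's links
    return [(i, j) for i, j in edges if m not in (i, j)]
```

Comparing the centrality vectors before and after a removal (or after adding links) shows how the remaining members' roles shift as the network evolves.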
Database Search
Problem Definition
You tell the computer what you want in terms that mean something to you, using fuzzy sets.
You ask your question using those fuzzy terms.
The computer tells you how accurate your results are: the degree of membership.
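Degrees of membership come from fuzzy set membership functions; a minimal sketch using a triangular membership function, with the fuzzy term "young" over an age attribute as a hypothetical example (the breakpoints 0, 25, 40 are illustrative, not from the talk):

```python
def triangular(a, b, c):
    """Triangular fuzzy set with support (a, c) and full membership at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# Hypothetical fuzzy term "young" over an age attribute
young = triangular(0, 25, 40)
```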
Related Work: Database Search
Fuzzy data representation. Disadvantages:
Existing databases need to be re-structured
Traditional users are prevented from executing standard (non-fuzzy) queries
Extending a query language to support fuzzy querying without changing the database itself. Disadvantages:
Commercially available DBMSs need to support a new query language
Users are required to learn the new query language
Motivation
We propose an independent intermediate translation layer to incorporate fuzziness into the interface/querying facility of database systems, so as to retrieve more accurate facts.
Groups within a social network may share the same intermediate layer.
A recommendation system based on SNA helps users build their intermediate layer.
The intermediate layer provides the mapping between the fuzziness expected by the user and the actual crisp values stored in the data repository.
Methodology
Fuzziness can be specified:
Manually: by a human expert
Semi-automatically: a human expert decides on the number of fuzzy sets, and the intermediate layer defines the fuzzy sets
Fully automatically: by the intermediate layer
The intermediate layer uses the fuzzy set specifications to map between the fuzziness expected by the user and the actual crisp values stored in the data repository.
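The translation step can be sketched as follows; an illustrative skeleton, assuming rows are dictionaries of crisp attribute values, a membership function per fuzzy term, and an assumed alpha-cut of 0.5 below which rows are dropped:

```python
def fuzzy_select(rows, attr, mu, alpha=0.5):
    """Intermediate-layer sketch: evaluate a fuzzy predicate over the crisp
    stored values, returning rows whose membership degree reaches alpha,
    each annotated with that degree (the result's reported accuracy)."""
    scored = [(row, mu(row[attr])) for row in rows]
    return sorted([(r, d) for r, d in scored if d >= alpha],
                  key=lambda t: -t[1])  # most accurate matches first
```

A production layer would instead rewrite the fuzzy predicate into a crisp range condition for the underlying DBMS; this in-memory version just shows the mapping and the returned degrees.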
Intelligent Database Search
AskFuzzy: Attractive Visual Fuzzy Query Builder*
The fuzzy layer sits between the fuzzy query and the DBMS and carries out three steps:
1. Data fuzzification
2. Fuzzy query construction
3. Fuzzy query execution
* ICDE 2011, IEEE International Conference on Data Engineering (http://cpsc.ucalgary.ca/~nkoochak/AskFuzzy/)
Transferring numeric values to fuzzy sets:
Manual: number of fuzzy sets and fuzzy set functions both specified by the user
Semi-automated: number of fuzzy sets by the user; fuzzy set functions by the system (initial fuzzy sets based on the clustering result, optimized fuzzy sets based on genetic algorithm optimization)
Fully automated: both by the system (the optimization process minimizes the number of clusters and maximizes cluster quality)
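The clustering-based initialization can be sketched concretely. A minimal illustration, assuming sorted 1-D cluster centroids of a numeric attribute become the peaks of triangular fuzzy sets, with each set's support stretching to the neighboring centroids (or the attribute's range endpoints); the GA refinement step is not shown.

```python
def fuzzy_sets_from_centroids(centroids, lo, hi):
    """Derive one triangular fuzzy set (left, peak, right) per cluster
    centroid of a numeric attribute; neighboring centroids serve as
    each set's support endpoints. lo/hi bound the attribute's range."""
    cs = sorted(centroids)
    anchors = [lo] + cs + [hi]
    return [(anchors[k], b, anchors[k + 2]) for k, b in enumerate(cs)]
```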
Conclusions
Data mining and machine learning techniques can be integrated with network-based analysis.
The combination leads to:
A strong framework for data analysis from various perspectives
Consideration of global correlations within the data, which yields more realistic results
A variety of application domains can benefit from the integrated setup
The End! Thank you for your attention.
Reda Alhajj
[email protected]