DATA MINING AND VISUAL ANALYTICS RESEARCH GROUPpafkiet.edu.pk/Portals/0/Graduate School of Science and Engineering... · DATA MINING AND VISUAL ANALYTICS RESEARCH GROUP Introduction

DATA MINING AND VISUAL ANALYTICS

RESEARCH GROUP

Introduction

In recent years, the data mining research community has seen a drift towards the utilization of

visual representation of information for analytical reasoning, knowledge extraction and decision

making. This drift has given birth to a new research domain called Visual Analytics. People also

use the terms such as Information Visualization and Visual Data Mining to refer to this new and

exciting field.

The classical visualization pipeline for visual analytics is shown in Figure 1. From Raw Data to

producing a visual representation for a user to interact and acquire knowledge, there are several

steps as shown in the figure. Research in all these areas is actively pursued by scientists around

the world.

Figure 1: Visualization Pipeline for Visual Analytics

The goal of visual analytics is to enable users to interactively search for information,

deduce important facts and identify interesting patterns which, in turn, can be used by domain

experts for decision making. Visualization supports this entire process by letting users

exploit the human capacity to perceive, abstract and understand complex data and information.

The use of visualization is fast becoming a crucial analysis technique in a number of different

areas. These areas include but certainly are not limited to:

• Economics: Stock Market Patterns and Analysis.

• Sociology: Social Network Analysis.

• Technology: Exploration of Information on the Web.

• Transportation: Optimization of Air, Road and Sea travel across the globe.

• Geography: Migration behavior for cities, countries and continents.

• BioInformatics: Analysis and Mining of Biological Networks.

Figure 2 shows visual layouts of data from three of the above-mentioned fields. A recent U.S.

report to the funding agencies NIH and NSF provides strong arguments in favor of the

development of visualization as a research field:

“Visualization is indispensable to the solution of complex problems in every sector, from

traditional medical, science and engineering domains to such key areas as financial markets,

national security, and public health. Advances in visualization enable researchers to analyze and

understand unprecedented amounts of experimental, simulated, and observational data and

through this understanding to address problems previously deemed intractable or beyond

imagination.”

[from the Executive summary of (Johnson, Moorhead et al. 2006)]

Figure 2: Molecular Structure, Social Network of Hollywood Actors and

Metabolic Pathways

Aims and Objectives

The Data Mining and Visual Analytics (DaMiVA) research group aims to develop algorithms,

models and systems for technological advancements in the area of Data Mining and Visual

Analytics.

Our goal is to focus on large relational data sets and to develop fast, efficient algorithms

for extracting knowledge, discovering hidden patterns and supporting interactive data mining

through user interactions.

Often relational data can be represented through graphs and networks. The term ‘network’ has

different meanings for people from different walks of life. The term is used extensively to

represent systems such as social networks, electrical circuits, economic networks, chemical

compounds, transportation systems, epidemic spreading, metabolic pathways, food web, Internet,

world wide web, software classes and so on [6]. Although seemingly diverse, these fields have

strong common methodological foundations and share methods to analyze, model, understand

and organize these networks. We want to focus on these real world data sets and address domain

specific issues pertaining to respective fields.
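
As a minimal sketch of how relational data can be represented as a network, the Python snippet below turns a list of pairwise relations into an undirected graph stored as adjacency sets. The co-authorship rows and the helper name `build_graph` are hypothetical, chosen only for illustration; they are not taken from Tulip or from the group's own code.

```python
from collections import defaultdict


def build_graph(relations):
    """Build an undirected adjacency-set graph from (entity, entity) pairs."""
    adjacency = defaultdict(set)
    for a, b in relations:
        if a == b:
            continue  # this sketch ignores self-relations
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical relational records: who co-authored with whom.
coauthor_rows = [("Ann", "Bob"), ("Bob", "Cid"), ("Ann", "Cid"), ("Cid", "Dee")]

graph = build_graph(coauthor_rows)

# The degree of a node is the number of distinct entities it is related to.
degrees = {node: len(neighbours) for node, neighbours in graph.items()}
print(degrees)
```

The same structure works whether the rows describe people, proteins, ports or web pages, which is one reason these fields can share analysis methods.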

The idea of the research group is to build on the platform provided by Tulip Software. Tulip is

an open source software dedicated to the analysis and visualization of relational data. Tulip aims

to provide the developer with a complete library (in C++), supporting the design of interactive

information visualization applications for relational data that can be tailored for specific

problems. This software is released under the LGPL licence and can be freely downloaded from its website.

Figure 3: Tulip Software with different views for Data Analysis and Mining

Current Projects

Clustering Dynamic Data: Temporal Behavior of Real World Relational Data

Dynamic data is abundant in the real world. Everywhere we look, data is

constantly changing: stock markets, social networks, online transaction systems and the World

Wide Web are all examples. A challenging problem is to study how best to tackle the temporal dimension of data

to extract information and hidden knowledge. There are only a few methods that propose

solutions to this problem and we believe that there is a lot of potential for future research in this

direction. Readers can refer to research papers [3,4,5] for more details.
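
One simple way to look at the temporal dimension, sketched here under stated assumptions, is to treat the data as a sequence of graph snapshots and measure how much the edge set changes between consecutive time steps. The snippet uses the Jaccard similarity of edge sets; this is a generic illustration, not the method of the cited papers, and the snapshot data is made up.

```python
def edge_set(edges):
    """Normalise undirected edges so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def jaccard(first, second):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not first and not second:
        return 1.0  # two empty snapshots are treated as identical
    return len(first & second) / len(first | second)


# Hypothetical snapshots of the same network at three time steps.
snapshots = [
    [("a", "b"), ("b", "c"), ("c", "d")],
    [("a", "b"), ("b", "c"), ("c", "e")],
    [("x", "y"), ("y", "z")],
]

# Similarity between each snapshot and the next one.
similarities = [
    jaccard(edge_set(current), edge_set(following))
    for current, following in zip(snapshots, snapshots[1:])
]
print(similarities)
```

A sharp drop in similarity between two time steps is a cheap signal that the structure has changed and that a clustering computed earlier may need to be revisited.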

Desired Skill Set: C/C++, Relational Databases, Discrete Mathematics, Algorithms and Data

Structures

Statistical Data Mining Techniques and Small World-Scale Free Networks

Statistical Data Mining Techniques have long been used in Data Mining. Due to the recent

interest in the area of Graphs and Networks, the discovery of Small World and Scale Free

properties has revolutionized the field of Network Science. Examples of such networks include

social networks, metabolic networks, world wide web, food web, transportation networks,

chemical reactions, electrical circuits and so on [6]. Readers can refer to research papers [1,2,6]

for more details.
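
Small world networks are commonly characterised by a high clustering coefficient combined with a short average path length. The sketch below computes both quantities for a graph stored as adjacency sets. The toy graph is hypothetical and far too small to exhibit either property; it only shows how the two measures are obtained.

```python
from collections import deque
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, node) for node in graph) / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered pairs (BFS per node)."""
    total, pairs = 0, 0
    for source in graph:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in distance:
                    distance[v] = distance[u] + 1
                    queue.append(v)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

clustering = average_clustering(toy)
path_length = average_path_length(toy)
print(clustering, path_length)
```

For the scale free property, the quantity of interest is instead the degree distribution, which can be tabulated from the same adjacency sets.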


Structures, Probability and Statistics, Sampling Techniques

Embedded Systems: Optimization of Circuits using Graph Drawing Heuristics

In today's electronic world, embedded systems can be found all around us. Optimization of

circuit layouts used in embedded systems is a challenging problem. This project is proposed in

conjunction with the College of Engineering and focuses on graph drawing algorithms

addressing classical problems such as minimizing edge crossings, minimizing edge lengths and

constrained IC placements.
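
To make the edge crossing objective concrete, the following sketch counts crossings in a straight-line drawing, given node coordinates and an edge list. It uses the standard orientation test for segment intersection and checks every pair of edges, which is quadratic in the number of edges. The coordinates and function names are assumptions made for this example.

```python
from itertools import combinations


def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def segments_cross(p1, p2, p3, p4):
    """True when segments p1-p2 and p3-p4 properly cross at an interior point."""
    return (orientation(p1, p2, p3) * orientation(p1, p2, p4) < 0
            and orientation(p3, p4, p1) * orientation(p3, p4, p2) < 0)


def count_crossings(positions, edges):
    """Count pairs of edges that cross in the drawing given by `positions`."""
    crossings = 0
    for (a, b), (c, d) in combinations(edges, 2):
        if len({a, b, c, d}) < 4:
            continue  # edges sharing an endpoint meet there but do not cross
        if segments_cross(positions[a], positions[b], positions[c], positions[d]):
            crossings += 1
    return crossings


# The complete graph on four nodes, drawn in two hypothetical layouts.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c"), ("b", "d")]
square = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)}
nested = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0.75, 0.25)}

square_crossings = count_crossings(square, edges)
nested_crossings = count_crossings(nested, edges)
print(square_crossings, nested_crossings)
```

The two layouts draw the same graph, yet moving one node inside the triangle formed by the other three removes the crossing. A layout heuristic searches for such placements automatically.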

Desired Skill Set: C/C++, Discrete Mathematics, Algorithms and Data Structures, Basic

Electronics

Protein Interaction Networks: Modeling, Organization and Analysis of these Networks

The recent availability of Protein Sequence Data has catalyzed research in this area. The UniProt

database contains approximately 11 million protein sequences and is growing exponentially.

Organizing these protein sequences, modeling and understanding the structure of these sequences

is an active area of research. Protein similarity graphs are used where nodes represent individual

proteins and edges represent pairwise sequence similarities between proteins. There are a number

of ways to process these similarity graphs, and this presents researchers with a challenging problem

with a wide range of applications. Readers can refer to [10,11] for more details.
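
As an illustration of one basic way to process a protein similarity graph, the sketch below keeps only the edges whose similarity score reaches a threshold and then groups proteins by connected component. The protein identifiers, scores and threshold are hypothetical, and real pipelines (see [10,11]) use more refined clustering than plain connected components.

```python
def similarity_graph(scores, threshold):
    """Keep an edge between two proteins when their similarity >= threshold."""
    graph = {}
    for (p, q), score in scores.items():
        graph.setdefault(p, set())
        graph.setdefault(q, set())
        if score >= threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


def connected_components(graph):
    """Group nodes that can reach each other, using an iterative depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Hypothetical pairwise similarity scores between five proteins.
scores = {
    ("P1", "P2"): 0.90,
    ("P2", "P3"): 0.80,
    ("P3", "P4"): 0.20,
    ("P4", "P5"): 0.85,
}

families = connected_components(similarity_graph(scores, threshold=0.5))
print(families)
```

The choice of threshold strongly affects the result: too low and unrelated proteins merge into one group, too high and genuine families fragment.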

Desired Skill Set: C/C++, Discrete Mathematics, Algorithms and Data Structures

Evaluating Clustering Quality for Graph Mining Algorithms

Clustering graphs is an important research area where many researchers have introduced new

algorithms, each claiming better performance and accuracy. Surprisingly, many of these

algorithms use domain dependent knowledge or the presence of ground truth for evaluating

cluster quality. An open area of research is to put in place a model, which can evaluate cluster

quality in the absence of any prior knowledge. Readers can refer to [7,8,9] for more details.
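
As an example of a quality measure that needs no ground truth, the sketch below computes Newman-Girvan modularity, which compares the fraction of edges inside each cluster with the fraction expected if edges were placed at random while preserving node degrees. Modularity is used here only because it is widely known; it is not the metric proposed in [9], and the example graph is hypothetical.

```python
def modularity(edges, membership):
    """Newman-Girvan modularity of a clustering of an undirected, unweighted graph.

    Q = sum over clusters c of (m_c / m) - (d_c / (2 m)) ** 2, where m is the
    number of edges, m_c the number of edges inside c, and d_c the sum of the
    degrees of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(cluster, 0) / m - (degrees / (2 * m)) ** 2
        for cluster, degrees in degree_sum.items()
    )


# Hypothetical graph: two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

q_split = modularity(edges, two_clusters)
q_single = modularity(edges, one_cluster)
print(q_split, q_single)
```

Splitting the graph at the bridge scores higher than putting every node in one cluster, which matches intuition. Measures of this kind have known limitations, which is part of why evaluation without prior knowledge remains an open problem.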


Structures

Team Members

Faraz Ahmed Zaidi

Assistant Professor

Ph.D. Data Mining and Visual Analytics

M.S. Algorithms, Complexity and Networks

M.S. Software Engineering

B.S. Computer Science

International and External Collaborations

Our research team has international as well as national collaborations with many research

departments as well as industrial partners. The objective is to closely interact with the industry

and bridge the existing gap between the academia and the industry. The list of these collaborators

is given below:

• Guy Melançon: University of Bordeaux I and INRIA, France

• Celine Rozenblat: University of Lausanne, Switzerland

• César Ducruet: University of Paris I, France

• Daniel Archambault: University College Dublin, Ireland

• Arnaud Sallaberry: Pikko Software, Montpellier, University of Bordeaux I and INRIA

List of Publications by Members of DaMiVA

1. Ducruet, C.; Rozenblat, C. & Zaidi, F. (2010), 'Ports in multi-level maritime networks: evidence from

the Atlantic (1996-2006)', Journal of Transport Geography 18(4), 508 - 518.

2. Gilbert, F.; Simonetto, P.; Zaidi, F.; Jourdan, F. & Bourqui, R. (2010), 'Communities and hierarchical

structures in dynamic social networks: analysis and visualization', Social Network Analysis and

Mining, 1-13.

3. Koenig, P.; Zaidi, F. & Archambault, D. (2010), Interactive Searching and Visualization of Patterns

in Attributed Graphs., in 'Proceedings of Graphics Interface', pp. 113--120.

4. Zaidi, F.; Archambault, D. & Melançon, G. (2010), Evaluating the Quality of Clustering Algorithms

Using Cluster Path Lengths, in 'Advances in Data Mining. Applications and Theoretical Aspects, 10th

Industrial Conference, ICDM', pp. 42-56.

5. Zaidi, F. & Melançon, G. (2010), Organization of Information for the Web using Hierarchical Fuzzy

Clustering Algorithm based on Co-Occurrence Networks, in 'WI-IAT '10: Proceedings of the 2010

IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology',

pp. 421--424.

6. Zaidi, F. & Melançon, G. (2010), Identifying the Presence of Communities in Complex Networks

Through Topological Decomposition and Component Densities, in 'EGC 2010, Extraction et Gestion

de Connaissance', pp. 163-174.

7. Sallaberry, A.; Zaidi, F.; Pich, C. & Melançon, G. (2010), Interactive Visualization and Navigation of

Web Search Results Revealing Community Structures and Bridges, in 'Proceedings of Graphics

Interface', pp. 105--112.

8. Bourqui, R.; Gilbert, F.; Simonetto, P.; Zaidi, F.; Sharan, U. & Jourdan, F. (2009), Detecting

Structural Changes and Command Hierarchies in Dynamic Social Networks, in 'Social Network

Analysis and Mining, International Conference on Advances in', IEEE Computer Society, Los

Alamitos, CA, USA, pp. 83-88.

9. Zaidi, F.; Sallaberry, A. & Melançon, G. (2009), Revealing Hidden Community Structures and

Identifying Bridges in Complex Networks: An Application to Analyzing Contents of Web Pages for

Browsing, in 'WI-IAT '09: Proceedings of the 2009 IEEE/WIC/ACM International Conference on

Web Intelligence and Intelligent Agent Technology', pp. 198-205.

10. Simonetto, P.; Koenig, P.-Y.; Zaidi, F.; Archambault, D.; Gilbert, F.; Phan Quang, T. T.; Mathiaut,

M.; Lambert, A.; Dubois, J.; Sicre, R.; Brulin, M.; Vieux, R. & Melançon, G. (2009), Solving the

Traffic and Flitter Challenges with Tulip, in 'Proceedings of the IEEE Symposium on Visual

Analytics Science and Technology (VAST 2009)', pp. 247-248.

11. Bourqui, R.; Zaidi, F.; Gilbert, F.; Sharan, U. & Simonetto, P. (2008), VAST 2008 Challenge: Social

network dynamics using cell phone call patterns, in 'IEEE Symposium on Visual Analytics Science

and Technology'.

References

1. Glymour, C.; Madigan, D.; Pregibon, D. & Smyth, P. (1997), 'Statistical Themes and

Lessons for Data Mining', Data Min. Knowl. Discov 1(1), 11--28.

2. Matloff, N. (2005), A Careful Look at the Use of Statistical Methodology in Data Mining, in

Tsau Young Lin; Setsuo Ohsuga; Churn-Jung Liau; Xiaohua Hu & Shusaku Tsumoto, ed.,

'Foundations of Data Mining and Knowledge Discovery', Springer, pp. 101--117.

3. Bourqui, R.; Gilbert, F.; Simonetto, P.; Zaidi, F.; Sharan, U. & Jourdan, F. (2009), Detecting

Structural Changes and Command Hierarchies in Dynamic Social Networks, in 'Social

Network Analysis and Mining, International Conference on Advances in', IEEE Computer

Society, Los Alamitos, CA, USA, pp. 83-88.

4. Bourqui, R.; Zaidi, F.; Gilbert, F.; Sharan, U. & Simonetto, P. (2008), VAST 2008

Challenge: Social network dynamics using cell phone call patterns, in 'IEEE Symposium on

Visual Analytics Science and Technology'.

5. Gilbert, F.; Simonetto, P.; Zaidi, F.; Jourdan, F. & Bourqui, R. (2010), 'Communities and

hierarchical structures in dynamic social networks: analysis and visualization', Social

Network Analysis and Mining, 1-13.

6. Zaidi, F. Ph.D. Thesis, Analysis, Structure and Organization of Complex Networks. Ecole

doctorale de Mathématiques et Informatique, Université de Bordeaux, France. 2010

7. Schaeffer, S. E. Graph clustering, Computer Science Review, 2007, 1, 27-64.

8. Fortunato, S. Community detection in graphs, 2009.

9. Zaidi, F.; Archambault, D. & Melançon, G. (2010), Evaluating the Quality of Clustering

Algorithms Using Cluster Path Lengths, in 'Advances in Data Mining. Applications and

Theoretical Aspects, 10th Industrial Conference, ICDM', pp. 42-56.

10. Apeltsin, L.; Morris, J. H.; Babbitt, P. C. & Ferrin, T. E. Improving the quality of protein

similarity network clustering algorithms using the network edge weight distribution,

Bioinformatics, 2011, 27, 326-333.

11. Hu, H.; Yan, X.; Huang, Y.; Han, J. & Zhou, X. J. Mining coherent dense subgraphs across

massive biological networks for functional discovery, Bioinformatics, 2005, 21, i213-221.
