DATA MINING AND VISUAL ANALYTICS
RESEARCH GROUP
Introduction
In recent years, the data mining research community has drifted towards using visual
representations of information for analytical reasoning, knowledge extraction and decision
making. This drift has given birth to a new research domain called Visual Analytics. The terms
Information Visualization and Visual Data Mining are also used to refer to this new and
exciting field.
The classical visualization pipeline for visual analytics is shown in Figure 1. Several steps lie
between the Raw Data and the visual representation with which a user interacts to acquire
knowledge, as shown in the figure. Research in all these areas is actively pursued by scientists
around the world.
Figure 1: Visualization Pipeline for Visual Analytics
The goal of visual analytics is to help users interactively search for information, deduce
important facts and identify interesting patterns which, in turn, can be used by domain
experts for decision making. Visualization supports this entire process by letting users
exploit the human capacity to perceive, abstract and understand complex data and information.
The use of visualization is fast becoming a crucial analysis technique in a number of different
areas. These areas include but certainly are not limited to:
• Economics: Stock Market Patterns and Analysis.
• Sociology: Social Network Analysis.
• Technology: Exploration of Information on the Web.
• Transportation: Optimization of Air, Road and Sea travel across the globe.
• Geography: Migration behavior for cities, countries and continents.
• BioInformatics: Analysis and Mining of Biological Networks.
Figure 2 shows visual layouts of data from three of the above-mentioned fields. A recent U.S.
report to the funding agencies NIH and NSF provides strong arguments in favor of the
development of visualization as a research field:
“Visualization is indispensable to the solution of complex problems in every sector, from
traditional medical, science and engineering domains to such key areas as financial markets,
national security, and public health. Advances in visualization enable researchers to analyze and
understand unprecedented amounts of experimental, simulated, and observational data and
through this understanding to address problems previously deemed intractable or beyond
imagination.”
[from the Executive summary of (Johnson, Moorhead et al. 2006)]
Figure 2: Molecular Structure, Social Network of Hollywood Actors and
Metabolic Pathways
Aims and Objectives
The Data Mining and Visual Analytics (DaMiVA) research group aims to develop algorithms,
models and systems for technological advancements in the area of Data Mining and Visual
Analytics.
Our goal is to focus on large relational data sets and to develop fast, efficient algorithms
that extract knowledge, discover hidden patterns and support interactive data mining
through user interaction.
Relational data can often be represented through graphs and networks. The term ‘network’ has
different meanings for people from different walks of life. It is used extensively to
represent systems such as social networks, electrical circuits, economic networks, chemical
compounds, transportation systems, epidemic spreading, metabolic pathways, food webs, the
Internet, the World Wide Web, software classes and so on [6]. Although seemingly diverse, these
fields have strong common methodological foundations and share methods to analyze, model,
understand and organize these networks. We want to focus on these real-world data sets and
address domain-specific issues pertaining to the respective fields.
The idea of the research group is to build on the platform provided by Tulip Software. Tulip is
open source software dedicated to the analysis and visualization of relational data. Tulip aims
to provide the developer with a complete library (in C++), supporting the design of interactive
information visualization applications for relational data that can be tailored for specific
problems. The software is released under the LGPL licence and can be freely downloaded from its
website.
Figure 3: Tulip Software with different views for Data Analysis and Mining
Current Projects
Clustering Dynamic Data: Temporal Behavior of Real World Relational Data
Dynamic data occurs in great abundance in the real world. Everywhere we look, data is
constantly changing: stock markets, social networks, online transaction systems and the World
Wide Web are examples. A challenging problem is to study how best to handle the temporal
dimension of data in order to extract information and hidden knowledge. Only a few methods
propose solutions to this problem, and we believe that there is a lot of potential for future
research in this direction. Readers can refer to research papers [3,4,5] for more details.
Desired Skill Set: C/C++, Relational Databases, Discrete Mathematics, Algorithms and Data
Structures
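One simple way to expose the temporal dimension is to compare consecutive snapshots of a network. The following is a minimal Python sketch under stated assumptions, not a method taken from [3,4,5]: the function names and the toy snapshots are hypothetical, edges are treated as undirected, and change is summarized by the number of added and removed edges together with the Jaccard similarity of the two edge sets.

```python
def normalize(edges):
    """Turn (u, v) pairs into frozensets so that (u, v) and (v, u) are the same edge."""
    return {frozenset(edge) for edge in edges}


def snapshot_changes(snapshots):
    """Compare each snapshot with the next one.

    Returns one dict per consecutive pair with the number of added edges,
    the number of removed edges and the Jaccard similarity of the edge sets.
    """
    report = []
    for before, after in zip(snapshots, snapshots[1:]):
        old, new = normalize(before), normalize(after)
        union = old | new
        # Two empty snapshots are considered identical.
        jaccard = len(old & new) / len(union) if union else 1.0
        report.append({
            "added": len(new - old),
            "removed": len(old - new),
            "jaccard": jaccard,
        })
    return report


# Hypothetical toy data: three snapshots of a small undirected network.
TOY_SNAPSHOTS = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("b", "c"), ("c", "d")],
    [("a", "b"), ("c", "d"), ("d", "e")],
]

if __name__ == "__main__":
    for step, change in enumerate(snapshot_changes(TOY_SNAPSHOTS)):
        print(f"t{step} -> t{step + 1}: {change}")
```

A low Jaccard value between two time steps flags a period of strong structural change that may deserve closer inspection.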
Statistical Data Mining Techniques and Small World-Scale Free Networks
Statistical Data Mining Techniques have long been used in Data Mining. With the recent
interest in the area of Graphs and Networks, the discovery of Small World and Scale Free
properties has revolutionized the field of Network Science. Examples of such networks include
social networks, metabolic networks, the World Wide Web, food webs, transportation networks,
chemical reactions, electrical circuits and so on [6]. Readers can refer to research papers
[1,2,6] for more details.
Desired Skill Set: C/C++, Relational Databases, Discrete Mathematics, Algorithms and Data
Structures, Probability and Statistics, Sampling Techniques
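Small World networks are usually characterized by a short average path length combined with a high clustering coefficient, and Scale Free networks by a heavy-tailed (power-law-like) degree distribution. The Python sketch below computes these three quantities for a toy graph. It is an illustration only: the function names and the four-node example are hypothetical, the graph is assumed to be undirected and stored as a dictionary of neighbour sets, and the quadratic-time approach is not meant for the large data sets targeted by the group.

```python
from collections import Counter, deque


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of connected nodes (BFS per node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable nodes other than the source
    return total / pairs if pairs else 0.0


def clustering_coefficient(adj):
    """Average local clustering coefficient; nodes with fewer than two neighbours count as 0."""
    coefficients = []
    for node, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            coefficients.append(0.0)
            continue
        nbrs = list(neighbours)
        # Count the edges that exist among the neighbours of this node.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients) if coefficients else 0.0


def degree_distribution(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
TOY_GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

if __name__ == "__main__":
    print("average path length:", average_path_length(TOY_GRAPH))
    print("clustering coefficient:", clustering_coefficient(TOY_GRAPH))
    print("degree distribution:", dict(degree_distribution(TOY_GRAPH)))
```

In practice these values are compared against those of a random graph with the same number of nodes and edges; that comparison is left out here to keep the sketch short.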
Embedded Systems: Optimization of Circuits using Graph Drawing Heuristics
Embedded systems can be found all around us in today's electronic world. Optimizing the
circuit layouts used in embedded systems is a challenging problem. This project is proposed in
conjunction with the College of Engineering and focuses on graph drawing algorithms that
address classical problems such as minimizing edge crossings, minimizing edge lengths and
constrained IC placement.
Desired Skill Set: C/C++, Discrete Mathematics, Algorithms and Data Structures, Basic
Electronics
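The two layout criteria named above, edge crossings and edge lengths, can be measured directly on a straight-line drawing. The Python sketch below evaluates both for a given node placement. It is a hedged illustration rather than the project's heuristic: the function names and the two example layouts are hypothetical, edges are drawn as straight segments, and node positions are assumed to be in general position (touching or collinear segments are not counted as crossings).

```python
import math
from itertools import combinations


def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def segments_cross(a, b, c, d):
    """True when segments ab and cd intersect at a single interior point."""
    return (
        orientation(a, b, c) * orientation(a, b, d) < 0
        and orientation(c, d, a) * orientation(c, d, b) < 0
    )


def count_crossings(positions, edges):
    """Number of crossing edge pairs in a straight-line drawing."""
    crossings = 0
    for (u1, v1), (u2, v2) in combinations(edges, 2):
        if len({u1, v1, u2, v2}) < 4:
            continue  # edges sharing an endpoint cannot properly cross
        if segments_cross(positions[u1], positions[v1], positions[u2], positions[v2]):
            crossings += 1
    return crossings


def total_edge_length(positions, edges):
    """Sum of the Euclidean lengths of all edges."""
    return sum(math.dist(positions[u], positions[v]) for u, v in edges)


# Hypothetical example: the complete graph on four nodes, drawn in two ways.
K4_EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
SQUARE_LAYOUT = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)}
NESTED_LAYOUT = {"a": (0, 0), "b": (2, 0), "c": (1, 2), "d": (1, 0.5)}

if __name__ == "__main__":
    for name, layout in (("square", SQUARE_LAYOUT), ("nested", NESTED_LAYOUT)):
        print(name, count_crossings(layout, K4_EDGES), total_edge_length(layout, K4_EDGES))
```

Placing the fourth node inside the triangle formed by the other three removes the single crossing of the square layout, which is the kind of improvement a graph drawing heuristic searches for.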
Protein Interaction Networks: Modeling, Organization and Analysis of these Networks
The recent availability of protein sequence data has catalyzed research in this area. The
UniProt database contains approximately 11 million protein sequences and is growing
exponentially. Organizing these protein sequences and modeling and understanding their
structure is an active area of research. Protein similarity graphs are used, where nodes
represent individual proteins and edges represent pairwise sequence similarities between
proteins. There are a number of ways to process these similarity graphs, which presents
researchers with a challenging problem with a wide range of applications. Readers can refer to
[10,11] for more details.
Desired Skill Set: C/C++, Discrete Mathematics, Algorithms and Data Structures
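The protein similarity graph described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the processing used in [10,11]: the protein identifiers and similarity scores are invented placeholders, the threshold of 0.5 is arbitrary, and grouping proteins by connected components is only one simple way to process such a graph.

```python
def build_similarity_graph(scores, threshold):
    """Build an undirected weighted graph from pairwise similarity scores.

    Every protein becomes a node; an edge is kept only when the score of
    the pair is at least `threshold`.
    """
    graph = {}
    for (p, q), score in scores.items():
        graph.setdefault(p, {})
        graph.setdefault(q, {})
        if score >= threshold:
            graph[p][q] = score
            graph[q][p] = score
    return graph


def connected_components(graph):
    """Group nodes that are linked by a chain of retained similarity edges."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Invented placeholder data: identifiers and scores are for illustration only.
TOY_SCORES = {
    ("P1", "P2"): 0.90,
    ("P1", "P3"): 0.80,
    ("P2", "P3"): 0.85,
    ("P3", "P4"): 0.20,
    ("P4", "P5"): 0.70,
}

if __name__ == "__main__":
    toy_graph = build_similarity_graph(TOY_SCORES, threshold=0.5)
    for group in connected_components(toy_graph):
        print(sorted(group))
```

The choice of threshold strongly affects the resulting groups, which is one reason why the processing of similarity graphs remains a research problem.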
Evaluating Clustering Quality for Graph Mining Algorithms
Clustering graphs is an important research area in which many researchers have introduced new
algorithms, each claiming better performance and accuracy. Surprisingly, many of these
algorithms rely on domain-dependent knowledge or the presence of ground truth for evaluating
cluster quality. An open area of research is to put in place a model that can evaluate cluster
quality in the absence of any prior knowledge. Readers can refer to [7,8,9] for more details.
Desired Skill Set: C/C++, Relational Databases, Discrete Mathematics, Algorithms and Data
Structures
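To make the idea of a quality measure that needs no ground truth concrete, the Python sketch below computes the Newman-Girvan modularity of a partition, which depends only on the graph structure. Modularity is used here as a well-known stand-in; it is not the model proposed by this project, nor the cluster path length measure of [9]. The function name and the toy graph are hypothetical, and the graph is assumed to be undirected, unweighted and without self-loops.

```python
def modularity(edges, membership):
    """Newman-Girvan modularity Q of a partition.

    Q = sum over clusters c of (e_c / m) - (d_c / (2 m)) ** 2, where m is the
    number of edges, e_c the number of edges inside c and d_c the sum of the
    degrees of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal_edges = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        # Each edge adds one to the degree of both of its endpoints.
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(
        internal_edges.get(cluster, 0) / m - (degrees / (2 * m)) ** 2
        for cluster, degrees in degree_sum.items()
    )


# Hypothetical toy graph: two triangles joined by a single bridge edge c-d.
TOY_EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
TWO_CLUSTERS = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
ONE_CLUSTER = {node: 0 for node in "abcdef"}

if __name__ == "__main__":
    print("two triangles:", modularity(TOY_EDGES, TWO_CLUSTERS))
    print("single cluster:", modularity(TOY_EDGES, ONE_CLUSTER))
```

The partition into the two triangles scores higher than the trivial single-cluster partition, which matches intuition without consulting any external labels.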
Team Members
Faraz Ahmed Zaidi
Assistant Professor
Ph.D. Data Mining and Visual Analytics
M.S. Algorithms, Complexity and Networks
M.S. Software Engineering
B.S. Computer Science
International and External Collaborations
Our research team has international as well as national collaborations with many research
departments as well as industrial partners. The objective is to closely interact with the industry
and bridge the existing gap between the academia and the industry. The list of these collaborators
is given below:
Guy Melançon: University of Bordeaux I and INRIA, France
Celine Rozenblat: University of Lausanne, Switzerland
César Ducruet: University of Paris I, France
Daniel Archambault: University College Dublin, Ireland
Arnaud Sallaberry: Pikko Software, Montpellier, University of Bordeaux I and INRIA
List of Publications by Members of DaMiVA
1. Ducruet, C.; Rozenblat, C. & Zaidi, F. (2010), 'Ports in multi-level maritime networks: evidence from
the Atlantic (1996-2006)', Journal of Transport Geography 18(4), 508 - 518.
2. Gilbert, F.; Simonetto, P.; Zaidi, F.; Jourdan, F. & Bourqui, R. (2010), 'Communities and hierarchical
structures in dynamic social networks: analysis and visualization', Social Network Analysis and
Mining, 1-13.
3. Koenig, P.; Zaidi, F. & Archambault, D. (2010), Interactive Searching and Visualization of Patterns
in Attributed Graphs, in 'Proceedings of Graphics Interface', pp. 113-120.
4. Zaidi, F.; Archambault, D. & Melançon, G. (2010), Evaluating the Quality of Clustering Algorithms
Using Cluster Path Lengths, in 'Advances in Data Mining. Applications and Theoretical Aspects, 10th
Industrial Conference, ICDM', pp. 42-56.
5. Zaidi, F. & Melançon, G. (2010), Organization of Information for the Web using Hierarchical Fuzzy
Clustering Algorithm based on Co-Occurrence Networks, in 'WI-IAT '10: Proceedings of the 2010
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology',
pp. 421-424.
6. Zaidi, F. & Melançon, G. (2010), Identifying the Presence of Communities in Complex Networks
Through Topological Decomposition and Component Densities, in 'EGC 2010, Extraction et Gestion
de Connaissance', pp. 163-174.
7. Sallaberry, A.; Zaidi, F.; Pich, C. & Melançon, G. (2010), Interactive Visualization and Navigation of
Web Search Results Revealing Community Structures and Bridges, in 'Proceedings of Graphics
Interface', pp. 105-112.
8. Bourqui, R.; Gilbert, F.; Simonetto, P.; Zaidi, F.; Sharan, U. & Jourdan, F. (2009), Detecting
Structural Changes and Command Hierarchies in Dynamic Social Networks, in 'Social Network
Analysis and Mining, International Conference on Advances in', IEEE Computer Society, Los
Alamitos, CA, USA, pp. 83-88.
9. Zaidi, F.; Sallaberry, A. & Melançon, G. (2009), Revealing Hidden Community Structures and
Identifying Bridges in Complex Networks: An Application to Analyzing Contents of Web Pages for
Browsing, in 'WI-IAT '09: Proceedings of the 2009 IEEE/WIC/ACM International Conference on
Web Intelligence and Intelligent Agent Technology', pp. 198-205.
10. Simonetto, P.; Koenig, P.-Y.; Zaidi, F.; Archambault, D.; Gilbert, F.; Phan Quang, T. T.; Mathiaut,
M.; Lambert, A.; Dubois, J.; Sicre, R.; Brulin, M.; Vieux, R. & Melançon, G. (2009), Solving the
Traffic and Flitter Challenges with Tulip, in 'Proceedings of the IEEE Symposium on Visual
Analytics Science and Technology (VAST 2009)', pp. 247-248.
11. Bourqui, R.; Zaidi, F.; Gilbert, F.; Sharan, U. & Simonetto, P. (2008), VAST 2008 Challenge: Social
network dynamics using cell phone call patterns, in 'IEEE Symposium on Visual Analytics Science
and Technology'.
References
1. Glymour, C.; Madigan, D.; Pregibon, D. & Smyth, P. (1997), 'Statistical Themes and
Lessons for Data Mining', Data Min. Knowl. Discov. 1(1), 11-28.
2. Matloff, N. (2005), A Careful Look at the Use of Statistical Methodology in Data Mining, in
Tsau Young Lin; Setsuo Ohsuga; Churn-Jung Liau; Xiaohua Hu & Shusaku Tsumoto, ed.,
'Foundations of Data Mining and Knowledge Discovery', Springer, pp. 101-117.
3. Bourqui, R.; Gilbert, F.; Simonetto, P.; Zaidi, F.; Sharan, U. & Jourdan, F. (2009), Detecting
Structural Changes and Command Hierarchies in Dynamic Social Networks, in 'Social
Network Analysis and Mining, International Conference on Advances in', IEEE Computer
Society, Los Alamitos, CA, USA, pp. 83-88.
4. Bourqui, R.; Zaidi, F.; Gilbert, F.; Sharan, U. & Simonetto, P. (2008), VAST 2008
Challenge: Social network dynamics using cell phone call patterns, in 'IEEE Symposium on
Visual Analytics Science and Technology'.
5. Gilbert, F.; Simonetto, P.; Zaidi, F.; Jourdan, F. & Bourqui, R. (2010), 'Communities and
hierarchical structures in dynamic social networks: analysis and visualization', Social
Network Analysis and Mining, 1-13.
6. Zaidi, F. (2010), Analysis, Structure and Organization of Complex Networks, Ph.D. thesis,
Ecole doctorale de Mathématiques et Informatique, Université de Bordeaux, France.
7. Schaeffer, S. E. (2007), 'Graph clustering', Computer Science Review 1, 27-64.
8. Fortunato, S. (2009), 'Community detection in graphs'.
9. Zaidi, F.; Archambault, D. & Melançon, G. (2010), Evaluating the Quality of Clustering
Algorithms Using Cluster Path Lengths, in 'Advances in Data Mining. Applications and
Theoretical Aspects, 10th Industrial Conference, ICDM', pp. 42-56.
10. Apeltsin, L.; Morris, J. H.; Babbitt, P. C. & Ferrin, T. E. (2011), 'Improving the quality of
protein similarity network clustering algorithms using the network edge weight distribution',
Bioinformatics 27, 326-333.
11. Hu, H.; Yan, X.; Huang, Y.; Han, J. & Zhou, X. J. (2005), 'Mining coherent dense subgraphs
across massive biological networks for functional discovery', Bioinformatics 21, i213-221.