Cluster Graphs


C-TREND: Temporal Cluster Graphs for Identifying and Visualizing Trends in Multiattribute Transactional Data

Gediminas Adomavicius, Member, IEEE, and Jesse Bockstedt

Abstract—Organizations and firms are capturing increasingly more data about their customers, suppliers, competitors, and business environment. Most of this data is multiattribute (multidimensional) and temporal in nature. Data mining and business intelligence techniques are often used to discover patterns in such data; however, mining temporal relationships typically is a complex task. We propose a new data analysis and visualization technique for representing trends in multiattribute temporal data using a clustering-based approach. We introduce Cluster-based Temporal Representation of EveNt Data (C-TREND), a system that implements the temporal cluster graph construct, which maps multiattribute temporal data to a two-dimensional directed graph that identifies trends in dominant data types over time. In this paper, we present our temporal clustering-based technique, discuss its algorithmic implementation and performance, demonstrate applications of the technique by analyzing data on wireless networking technologies and baseball batting statistics, and introduce a set of metrics for further analysis of discovered trends.

Index Terms—Clustering, data and knowledge visualization, data mining, interactive data exploration and discovery, temporal data mining, trend analysis.

1 INTRODUCTION

Business intelligence applications represent an important opportunity for data mining techniques to help firms gather and analyze information about their performance, customers, competitors, and business environment. Knowledge representation and data visualization tools constitute one form of business intelligence techniques that present information to users in a manner that supports business decision-making processes. In this paper, we develop a new data analysis and visualization technique that presents complex multiattribute temporal data in a cohesive graphical manner by building on well-established data mining methods. Business intelligence tools gain their strength by supporting decision-makers, and our technique helps the users leverage their domain expertise to generate knowledge visualization diagrams from complex data and further customize them.

Organizations and firms are capturing increasingly more data, and this data is often transactional in nature, containing multiple attributes and some measure of time. For example, through their websites, e-commerce firms capture the click stream and purchasing behavior of their customers, and manufacturing companies capture logistics data (e.g., on the status of orders in production or shipping information). One of the common analysis tasks for firms is to determine whether trends exist in their transactional data. For example, a retailer may wish to know if the types of its regular customers are changing over time, a financial institution may wish to determine if the major types of credit card fraud transactions change over time, and a website administrator may wish to model changes in website visitors’ behavior over time. Visualizing and analyzing this type of data can be extremely difficult because it can have numerous attributes (dimensions). Additionally, it is often desired to aggregate over the temporal dimension (e.g., by day, month, quarter, year, etc.) to match corporate reporting standards. The approach that we take in the paper for addressing these types of issues is to mine the data according to specific time periods and then compare the data mining results across time periods to discover similarities.

Consider the plot of a retailer’s customers by age and income over three months in Fig. 1. Xs represent customers in the first month, triangles represent customers in the second month, and circles represent customers in the third month. An analyst may be tasked with the job of discovering trends in customer type over these three months. In Fig. 1a, patterns in the data and relationships over time are difficult to identify. However, partitioning the data by time leads to the identification of clusters within each period. Clusters encapsulate similar data points and identify common types of customers. Note that in this example, we used only two dimensions (age and income) for more intuitive visualization. In many real-life applications, the number of dimensions could be much higher, which further emphasizes the need for more advanced trend visualization capabilities. Fig. 1c is a mapping of the multidimensional temporal data into an intuitive analytical construct that we call a temporal cluster graph. As will be discussed in the paper, these graphs contain important information about the relative proportion of common transaction types within each time period, relationships and similarities between common transaction types across time periods, and trends in common transaction types over time.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008 721

The authors are with the Department of Information and Decision Sciences, Carlson School of Management, University of Minnesota, 321 19th Avenue South, Minneapolis, MN 55455. E-mail: [email protected], [email protected].

Manuscript received 21 June 2007; revised 6 Jan. 2008; accepted 17 Jan. 2008; published online 25 Jan. 2008. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2007-06-0281.

Digital Object Identifier no. 10.1109/TKDE.2008.31. 1041-4347/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.

In summary, the main contribution of this paper is the development of a novel and useful approach for visualization and analysis of multiattribute transactional data based on a new temporal cluster graph construct, as well as the implementation of this approach as the Cluster-based Temporal Representation of EveNt Data (C-TREND) system. The rest of the paper is organized as follows: Section 2 provides an overview of related work in the temporal data mining and visualization research streams. Section 3 introduces the temporal cluster graph construct and describes the technique for mapping multiattribute temporal data to these graphs. Section 4 discusses the algorithmic implementation of the proposed technique as the C-TREND system and includes performance analyses. Section 5 presents an evaluation of the technique using real-world data on wireless networking technology certifications. Section 6 gives a discussion of possible applications, trend metrics, and limitations associated with the proposed technique and a brief discussion of future work. The conclusions are provided in Section 7.

2 RELATED WORK

The research field of data mining has developed a number of methods for identifying patterns in data to provide insights and decision support to users. Data mining and business intelligence approaches are often used for class identification and data visualization in knowledge management systems [60]. Increasingly, knowledge discovery in data (KDD) techniques are providing new analytical structures that complement and sometimes replace existing human-expert-based techniques to provide improved support for decision making [5]. Identifying and visualizing temporal relationships (e.g., trends) in data constitutes an important problem that is relevant in many business, scientific, and academic settings. In this section, we provide a brief review of related research in the temporal data mining and visualization streams.

2.1 Temporal Data Mining

Temporal data mining is a growing stream in the knowledge discovery research field, and the technique we propose relates well to this stream. The goal of temporal data mining is to discover relationships among events and their sequences that have some form of temporal dependency [4]. It also has the capability of mining activity rather than just states, which can lead to inferences about relationships and cause-effect association [55].

Antunes and Oliveira [4] note that two fundamental problems that must be addressed in temporal data mining are the representation of data and data preprocessing. Also, Roddick and Spiliopoulou [55] distinguish between two tactical goals of temporal data mining: prediction of the values of a population’s characteristics and the identification of similarities among members of a population. They categorize the temporal data mining research along three dimensions: data type, mining paradigm, and temporal ordering. Recent research on the discovery of patterns in temporal data has made significant progress for studying the evolution of objects within a population [55].

Temporal data mining approaches depend on the nature of the event sequence being studied. Probably the most common form of temporal data mining, time series analysis [15], [38], [55], is used to mine a sequence of continuous real-valued elements and is often regression based, relying on the prespecified definition of a model. Moreover, standard time series analysis techniques typically are examples of supervised learning; in other words, they estimate the effects of a set of independent variables on a dependent variable.

Much research has focused on applying data mining techniques to discover patterns in time series data. Berndt and Clifford [9] use a dynamic programming technique to match time series with predefined templates. Keogh and Smyth [40] use a probabilistic approach to quickly identify patterns in time series by matching known templates to the data. Povinelli and Feng [51] use concepts from data mining and dynamical systems to develop a new framework and method for identifying patterns in time series that are significant for characterizing and predicting events (see also [2], [25], [37], [39], and [50]).

Another common area of temporal data mining research is sequence analysis [49], [67]. Sequence analysis is often used when the sequence is composed of a series of nominal symbols [4]; examples of sequences include genetic codes and the click patterns of website users. Sequence analysis is designed to look for the recurrence of patterns of specific events and typically does not account for events described with multiattribute data [4]. The identification of patterns in sequences of events (e.g., see [14], [20], [42], [44], and [48]) is an important problem that frequently occurs in many disciplines such as molecular biology and telecommunications.

In the business intelligence context, trend discovery may be better addressed using unsupervised learning techniques, because models of trends and specific relationships between variables may not be known. Specifically, clustering is the unsupervised discovery of groups in a data set [28]. The basic clustering strategies can be separated into hierarchical and partitional, and all use some form of a distance or similarity measure to determine cluster membership and boundaries [28]. Research on clustering in time series data extends traditional clustering into the temporal data mining domain. For example, Kakizawa et al. [31] use minimum discrimination information for the classification and clustering of multivariate time series, Oates [47] uses clustering to identify subsequences in multivariate real-valued time series, Kandogan [32] introduces a new visualization technique called Star Coordinates for identifying trends in clusters, and Molinari et al. [46] introduce a new technique for determining single and multiple temporal clustering for populations with varying sizes. For thorough bibliographies of recent work on both the discovery of temporal patterns and temporal clustering, see [54].

Fig. 1. Reducing multiattribute temporal complexity by partitioning data into time periods and producing a temporal cluster graph.

The technique proposed in this paper uses clusters identified in multiple time periods and identifies trends based on similarities between clusters over time. It is a clustering approach for discovering temporal patterns, which builds on temporal clustering methods and complements existing temporal data mining approaches.

2.2 Data Visualization

Data mining requires the inclusion of the human in the data exploration process in order to be effective [35]. Visual data exploration is the process of presenting data in some visual form and allowing the human to interact with the data to create insightful representations [35]. It typically follows the “visual information seeking mantra” [58]: overview, zoom and filter, and details on demand. Most formal models of information visualization are concerned with presentation graphics [10], [43] and scientific visualization [12], [13], [27], [56], [59]. Keim [34], [35] and Keim and Kriegel [36] provide taxonomies of visualization-based data exploration approaches and note that these approaches can be classified by 1) the type of data, 2) the visualization technique, and 3) the interaction techniques.

With the dramatic increase in the amount of data being captured by organizations, multidimensional visualization techniques have become an important area of data mining research. Representing multidimensional data in a two- or three-dimensional visual graphic cannot be achieved through simple mapping, and many data visualization techniques have been developed (e.g., see [16], [19], [61], and [66]).

Two visualization approaches relevant to the present research are hierarchical techniques and graph-based techniques. Hierarchical techniques subdivide the multidimensional space and present the resulting subspaces in a hierarchical fashion [17], [19]. For well-known examples, see the n-Vision system [11], [13], dimensional stacking [41], and treemaps [57]. Graph-based visualization techniques generate large graphs using layout algorithms and abstraction techniques to convey relational meanings clearly and quickly [19]. For examples of applications, see [1], [6], [7], [22], and [26].

Interaction techniques provide the user with the ability to dynamically change visual representations [35] and can empower the user’s perception of information [19]. A comprehensive framework for user interface techniques used in visualization systems can be found in [18]. Interactive filtering involves dynamically partitioning a data set into segments and focusing on interesting subsets by either direct selection or specification of subset properties [23], [34], [35], [64]. Interactive zooming is a common technique that provides the user with a variable display of data at different levels of analysis [34], [35]. Zooming capabilities mean that the data representation can be automatically changed to present more or fewer details on demand. Some examples of applications that use interactive zooming include Table Lens [53] and Pad++ [8].

3 TEMPORAL CLUSTER GRAPHS

In this paper, we present a new data mining technique for identifying and visualizing trends in multiattribute temporal data. We build on both temporal data mining techniques and visual data exploration techniques and develop a tool that provides the user with the ability to interact with temporal cluster graph data visualization. Temporal cluster graphs use hierarchical and graph-based techniques to explore temporal data and provide interactive filtering and zooming capabilities for visualization.

3.1 Temporal Cluster Graph Definition

We define a new analytical construct called the temporal cluster graph. This graph is obtained as a result of several steps. First, transactional data set $D$ is partitioned based on time periods into $t$ data subsets $D_1, \ldots, D_t$ (indexed chronologically), and each $D_i$ is a multiattribute data subset containing records with $m$ attributes. Data within each partition is then clustered using one of the standard clustering techniques such as k-means or hierarchical approaches [21], and $k_i$ represents the number of clusters obtained for the $i$th data partition. The temporal cluster graph $G$ is a directed graph that consists of a set of nodes $V$ and a set of directed edges $E$, i.e., $G = (V, E)$. The total set of graph nodes consists of $t$ subsets of nodes $V = \{V_1, V_2, \ldots, V_t\}$, where each subset corresponds to a data partition and contains $k_i$ nodes (i.e., each graph node represents a different cluster). The node $v_{i,j} \in V_i$ is the $j$th node in the $i$th partition. Nodes are labeled with the size of the cluster they represent (i.e., the number of data points in that cluster). Edges $e(v_{i,j}, v_{i+1,k}) \in E$ connect nodes in adjacent partitions and are labeled with a distance value between the two nodes, thus representing the similarity between the clusters (nodes) connected by the edge. A variety of distance metrics (e.g., Euclidean and Chebyshev) and cluster distance measures such as max link and average link [21] could be used to determine this value. The temporal cluster graph is a general-purpose construct and does not depend on specific choices of the clustering method and distance metrics.

To identify truly temporal trends, edges are directed only from earlier partitions to later partitions; in other words, $e(v_{i,j}, v_{x,y}) \notin E$, where $i \geq x$. The representation of data in Fig. 1c provides an example of a temporal cluster graph that contains three data partitions. The first partition contains three nodes, each representing a data cluster identified in the customer data for time period 1 and labeled with the corresponding cluster size. The second and third partition nodes are determined in the same manner for time periods 2 and 3. The edges connect nodes in adjacent partitions, are directed from earlier time periods to later time periods, and are labeled with distances between nodes.
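The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the C-TREND implementation: it assumes the clustering of each period has already been computed, uses the Euclidean distance between cluster centers as the edge label, and the names `centroid` and `build_temporal_cluster_graph` are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(points):
    """Mean of a non-empty list of equal-length numeric tuples."""
    m = len(points[0])
    return tuple(sum(p[a] for p in points) / len(points) for a in range(m))


def build_temporal_cluster_graph(partitions):
    """partitions[i] is the clustering of period i: a list of clusters,
    each cluster being a list of points. Returns (nodes, edges)."""
    nodes = {}  # (i, j) -> size and center of the j-th cluster of period i
    for i, clusters in enumerate(partitions):
        for j, pts in enumerate(clusters):
            nodes[(i, j)] = {"size": len(pts), "center": centroid(pts)}

    edges = {}  # ((i, j), (i + 1, k)) -> distance; earlier -> later only
    for i in range(len(partitions) - 1):
        for j in range(len(partitions[i])):
            for k in range(len(partitions[i + 1])):
                edges[((i, j), (i + 1, k))] = dist(
                    nodes[(i, j)]["center"], nodes[(i + 1, k)]["center"]
                )
    return nodes, edges


# Toy data (assumed for illustration): two periods, two clusters each.
periods = [
    [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0)]],
    [[(1.0, 4.0)], [(10.0, 12.0), (10.0, 8.0)]],
]
nodes, edges = build_temporal_cluster_graph(periods)
```

Because edges are only ever created from partition `i` to partition `i + 1`, the directedness and adjacency constraints of the definition hold by construction.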

3.2 Graph Parameters

Since one of the main goals of a data visualization technique is to reduce complexity and present information in a useful and intuitive manner, it is necessary to provide a means for displaying information at different levels of analysis (i.e., facilitate interactive zooming), as well as filtering spurious entities from a temporal cluster graph (i.e., facilitate interactive filtering). We next define three graph parameters that are designed to meet this need.

3.2.1 Partition Zoom

We refer to the ability to dynamically change the size of the clustering solution in a data partition as the zoom feature. Specifically, each data partition $D_i$ has a corresponding $k_i$ value, where $k_i$ refers to the number of clusters estimated in the clustering solution for that partition. For example, a value of $k_i = 5$ corresponds to the clustering solution for the $i$th partition that contains exactly five clusters. The $k_i$ value here is analogous to the $k$ in the case of k-means clustering, where the user typically specifies a $k$ value up front, and the data are then partitioned into $k$ clusters.

Temporal cluster graphs provide the user with the control to adjust and visualize the clustering solution for each time partition in real time. For example, in a partition with 100 data points, changing $k_i = 3$ to $k_i = 2$ would recluster the data points from three clusters into two clusters (Fig. 2).

In all clustering methods, there is variability in the solution based on the user’s understanding of the data and his or her interpretation of the output. Many clustering techniques assume that the number of clusters is known ahead of time (e.g., k-means clustering) and, therefore, a common problem in cluster analysis is deciding the optimal number of clusters that are present in a data set. The zoom feature allows the users to apply their domain expertise by adjusting in real time the underlying clustering solution used to build a trend graph and interactively evaluate multiple trend views.

3.2.2 Within-Period Trend Strength

As discussed in the previous section, nodes of the trend graph are created by clustering each data partition to identify common naturally occurring patterns in data. However, it is possible that not all clusters will be large enough to be considered relevant to the analysis at hand. For example, in a data set of 2,000 data points, a cluster of size $s = 2$ (i.e., containing only two data points) would likely be spurious for many practical applications. Additionally, it is possible to have singleton clusters appear in the final cluster solution for some clustering approaches such as agglomerative hierarchical clustering [21].¹ To address these issues and provide more data visualization control to the user of temporal cluster graphs, we introduce a user-specified parameter that can be used to determine if nodes generated by the clustering solution are “strong” enough to be included in the trend analysis. For every data partition $D_i$, the clustering solution contains $k_i$ clusters, and some of these clusters can be filtered out based on the within-period trend strength parameter $\alpha$.

The parameter $\alpha \in [0, 1]$ is global for the entire data set. In other words, each data partition utilizes the same value of $\alpha$. Let $V_i$ be the set of clusters in data partition $D_i$, i.e., $V_i$ contains the $k_i$ clusters identified in the clustering solution for $D_i$. For every cluster $j \in V_i$, if the cluster size $s_j \geq \alpha |D_i|$, then cluster $j$ is included as a node in the output graph. Thus, $\alpha |D_i|$ is the minimum node size threshold for data partition $D_i$, where $|D_i|$ is the number of data points in $D_i$. As an example, consider a partition with 200 data points; changing $\alpha = 0$ to $\alpha = 0.05$ would filter out any clusters with fewer than 10 data points or 5 percent of the total data points in the partition (see Fig. 3).
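The node filter described above amounts to a single comparison per cluster. The sketch below is illustrative only; the function name `filter_nodes` is hypothetical, `alpha` stands for the within-period trend strength, and it assumes every data point belongs to exactly one cluster, so that the cluster sizes sum to the partition size.

```python
def filter_nodes(cluster_sizes, alpha):
    """Return the indices of clusters kept as nodes, i.e., those whose size
    is at least alpha times the number of data points in the partition."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    total = sum(cluster_sizes)      # number of data points in the partition
    threshold = alpha * total       # minimum node size
    return [j for j, s in enumerate(cluster_sizes) if s >= threshold]


# Assumed example sizes for a partition with 200 data points.
sizes = [120, 65, 9, 6]
kept_without_filter = filter_nodes(sizes, 0.0)   # every cluster is kept
kept_with_filter = filter_nodes(sizes, 0.05)     # threshold is 10 points
```

With `alpha = 0.05` the two clusters of sizes 9 and 6 fall below the 10-point threshold and are dropped, which mirrors the 200-point example in the text.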

3.2.3 Cross-Period Trend Strength

In temporal cluster graphs, edges are used to represent relationships between nodes (clusters) in adjacent time partitions. Since an edge is possible between any two nodes in adjacent partitions, it is desirable to limit the edges included in a graph to those that are incident to “very similar” nodes, thus representing a trend over time. Because the concept of what is “very similar” can be domain specific, we introduce a user-specified cross-period trend strength parameter $\beta$ that is used to filter out spurious edges based on their weight.

An edge is included in the output graph if it meets two criteria: 1) the edge is incident to two nodes that are both included in the output graph (as determined by the clustering solution and within-period trend strength $\alpha$), and 2) the edge weight is less than or equal to a threshold $\theta$ that depends on the cross-period trend strength $\beta$. The edge threshold $\theta$ is calculated by taking the average of the weights of all the possible edges among the nodes in two adjacent data partitions (say, partitions $i$ and $i + 1$) and adjusting it by the user-specified parameter $\beta$:


Fig. 2. Changing the partition zoom level.

Fig. 3. Changing the within-period trend strength $\alpha$ for a partition with 200 data points.

1. This could also be addressed to some extent by using outlier detection methods in the data preprocessing phase [52].



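$$\theta_{i,i+1} = \beta \cdot \mathrm{AvgDist}(V_i, V_{i+1}) = \beta \cdot \left( \sum_{p=1}^{k_i} \sum_{q=1}^{k_{i+1}} d(v_{i,p}, v_{i+1,q}) \right) \Big/ \left( k_i k_{i+1} \right).$$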

Here, $v_{i,p}$ is the $p$th node in the $i$th partition, and $d(v_{i,p}, v_{i+1,q})$ is the distance between nodes $v_{i,p}$ and $v_{i+1,q}$. The calculation employs the average edge weight between partitions $i$ and $i + 1$. From the definition of $\theta$ and the relationship between $\theta$ and $\beta$, it is easy to see that lowering $\beta$ will result in more edges being filtered out, leaving only the strongest trends displayed in the graph. We have that $\beta \in [0, 1]$, and similar to $\alpha$, $\beta$ is a global parameter value for the entire data set. When $\beta = 1$, $\theta_{i,i+1}$ is equal to the average edge weight between partitions $i$ and $i + 1$ and, therefore, only edges with weights no greater than the average are included in the graph. This procedure can accommodate a variety of cluster similarity comparison measures such as average link and min link [21], as well as different multidimensional distance metrics $d(x, y)$, including

Euclidean, i.e., $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$;

Manhattan, i.e., $d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$;

or Chebyshev, i.e., $d(x, y) = \max_{i=1,\ldots,m} |x_i - y_i|$.
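For concreteness, the three distance metrics can be written as short Python functions. This is only an illustrative sketch; the function names are hypothetical, and `x` and `y` are assumed to be equal-length sequences of numbers (the $m$ attribute values of two points or cluster centers).

```python
from math import sqrt


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Assumed example points with m = 3 attributes; differences are 3, 4, and 0.
x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
d_euclidean = euclidean(x, y)
d_manhattan = manhattan(x, y)
d_chebyshev = chebyshev(x, y)
```

For these two points the Chebyshev distance is the smallest and the Manhattan distance the largest, so the choice of metric also changes the scale of the edge weights being averaged.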

As an example, consider two partitions $i$ and $j$, where the average edge weight between these partitions is 1.21. Changing $\beta = 1.0$ to $\beta = 0.5$ will filter out any edges with weights greater than $\theta = 0.5 \times 1.21 = 0.605$ (see Fig. 4). Note that if $d(x, y) = 0$, then clusters $x$ and $y$ are identical with respect to the chosen distance metric and cluster similarity measure and, thus, most likely have a very similar makeup.
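The threshold computation and the two inclusion criteria for edges can be sketched as follows. The code is a hedged illustration, not the authors' implementation: the names `edge_threshold` and `filter_edges` are hypothetical, `beta` denotes the cross-period trend strength, and the four distances are made-up values chosen so that their average is 1.21, as in the example above.

```python
def edge_threshold(distances, beta):
    """distances maps (p, q) to the distance between node p of one partition
    and node q of the next, for ALL possible node pairs. Returns beta times
    the average distance."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    average = sum(distances.values()) / len(distances)
    return beta * average


def filter_edges(distances, beta, kept_left, kept_right):
    """Keep an edge only if both endpoints survived node filtering and its
    weight does not exceed the threshold."""
    theta = edge_threshold(distances, beta)
    return {
        (p, q): d
        for (p, q), d in distances.items()
        if p in kept_left and q in kept_right and d <= theta
    }


# Assumed distances between two partitions with two clusters each.
d = {(0, 0): 0.30, (0, 1): 0.90, (1, 0): 1.64, (1, 1): 2.00}
theta_half = edge_threshold(d, 0.5)                  # about 0.605
edges_beta_1 = filter_edges(d, 1.0, {0, 1}, {0, 1})
edges_beta_half = filter_edges(d, 0.5, {0, 1}, {0, 1})
```

Lowering `beta` from 1.0 to 0.5 removes the edge of weight 0.90 and leaves only the edge of weight 0.30, the strongest trend between the two partitions.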

Fig. 5 provides an example temporal cluster graph that was filtered using $\alpha = 0.05$ and $\beta = 0.80$. In the first partition, no clusters were removed; however, one cluster was filtered out of each of the second and third partitions. This is apparent because $k_2 = 4$ and $k_3 = 3$, but only three nodes are displayed in the second partition and only two in the third partition. Specifically, a cluster of size three and a cluster of size four were filtered out of the second and third partitions, respectively. Similarly, in the fourth data partition, two clusters with sizes $s_1$ and $s_2$, where $s_1 + s_2 = 8$ and $s_1, s_2 \leq 6$, were filtered out. The edges displayed represent distance metrics between nodes $x$ and $y$ in adjacent partitions, where $d(x, y) \leq \theta$, and $\theta$ was determined using $\beta = 0.80$ and the average of the edges between adjacent partitions. Note that several clear trends are represented by the output graph. For example, the first row of nodes across the graph maintains a very high level of similarity until the cluster in the third partition splits into two clusters in the fourth partition. In later sections of the paper, we provide examples of temporal cluster graphs generated from real-world data.

4 C-TREND IMPLEMENTATION

4.1 C-TREND Overview

C-TREND is the system implementation of the temporal-cluster-graph-based trend identification and visualization technique; it provides an end user with the ability to generate graphs from data and adjust the graph parameters. C-TREND consists of two main phases: 1) offline preprocessing of the data and 2) online interactive analysis and graph rendering (see Fig. 6).

In the preprocessing phase, the data set is partitioned based on time periods, and each partition is clustered using one of many traditional clustering techniques such as a hierarchical approach [21]. The results of the clustering for each partition are used to generate two data structures: the node list and the edge list. Creating these lists in the preprocessing phase allows for more effective (real-time) visualization updates of the C-TREND output graphs. Based on these data structures, graph entities (nodes and edges) are generated and rendered as a temporal cluster graph in the system output window. In the interactive analysis phase, C-TREND allows the user to modify $k_i$ $(i = 1, \ldots, t)$, $\alpha$, and $\beta$ on demand in real time and, as a result, update the view of the temporal cluster graph.

Note that in this initial implementation, the time partition size is set exogenously by the user and stays constant throughout preprocessing and online interactive analysis. We followed this approach because of the domain-specific nature of time granularity. For example, for analyzing technology evolution, the desired time granularity could be a year, whereas financial market trend analysis may require a much smaller time window (e.g., a day or a week). For this reason, we decided to rely on the domain expert to specify the most appropriate time granularity for a given application. However, an important future extension for this research would be to provide the ability to adjust the time granularity interactively in real time.

Fig. 4. Changing the cross-period trend strength $\beta$ for two adjacent partitions.

Fig. 5. An example temporal cluster graph with four partitions (with parameters $\alpha = 0.05$ and $\beta = 0.80$).

Fig. 6. The C-TREND process.

4.2 Preprocessing

4.2.1 Data Clustering and the Dendrogram Data Structure

An important requirement for real-time graph customization in C-TREND is the precomputation of multiple clustering solutions from the initial data set. Depending on the type of clustering algorithm employed, the cluster solution can be stored in a way that maximizes the efficiency of the output graph customization. Most standard clustering techniques are based on measuring distance between clusters, and there has been extensive research on clustering techniques in the data mining literature (see [21], [28], and [33] for examples). C-TREND can be implemented with multiple different standard clustering algorithms (e.g., agglomerative or divisive hierarchical clustering or partition-based clustering) and could be expanded to include new efficient clustering techniques such as the clustering by messaging between data points technique recently developed by Frey and Dueck [24]. It can also be implemented using different cluster distance metrics, including minimum link (nearest neighbor), maximum link (farthest neighbor), average link, distance between cluster centers (used in this paper), chi-square measure, and statistical entropy, as well as different basic distance measures between individual data points (e.g., Euclidean, Manhattan, and Chebyshev). Specifically, for our experiments we utilized agglomerative hierarchical clustering [29], [30] based on the Euclidean distance between cluster centers, and the clustering is performed separately for each partition of the data. Agglomerative hierarchical clustering procedures start with $n$ singleton clusters (i.e., each cluster is a single data point) and successively merge the two “closest” clusters over $n - 1$ iterations until one comprehensive cluster is assembled. We plan to extend the current implementation to provide flexible options to users for selecting clustering methods, metrics, and distance measures.
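The agglomerative procedure described above (start from $n$ singletons, repeatedly merge the two clusters with the closest centers) can be sketched naively in Python. This is an illustrative, unoptimized version whose running time is cubic in $n$, not the C-TREND code; the function name `agglomerate` and the node numbering (leaves `0..n-1`, the `t`-th merge creating node `n + t`) are assumptions made for the example.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def agglomerate(points):
    """Naive agglomerative clustering on the distance between cluster centers.
    Returns the list of merges as (left_id, right_id, new_id) tuples."""
    n = len(points)
    # Active clusters: node id -> (center, size). Start with n singletons.
    active = {i: (tuple(p), 1) for i, p in enumerate(points)}
    merges = []
    next_id = n
    while len(active) > 1:
        ids = sorted(active)
        # Choose the pair of active clusters whose centers are closest.
        a, b = min(
            ((a, b) for idx, a in enumerate(ids) for b in ids[idx + 1:]),
            key=lambda pair: dist(active[pair[0]][0], active[pair[1]][0]),
        )
        (ca, sa), (cb, sb) = active.pop(a), active.pop(b)
        # Center of the merged cluster is the size-weighted mean of the two.
        center = tuple((x * sa + y * sb) / (sa + sb) for x, y in zip(ca, cb))
        active[next_id] = (center, sa + sb)
        merges.append((a, b, next_id))
        next_id += 1
    return merges


# Assumed one-dimensional toy data with n = 5 points.
toy_points = [(0.0,), (1.0,), (10.0,), (12.0,), (40.0,)]
toy_merges = agglomerate(toy_points)
```

For $n$ points the function performs exactly $n - 1$ merges, and the last node created (id $2n - 2$) is the single comprehensive cluster.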

C-TREND utilizes optimized dendrogram data structures for storing and extracting cluster solutions generated by hierarchical clustering algorithms (see Fig. 7 for a dendrogram example). While C-TREND can be extended to support partition-based clustering methods (e.g., k-means), hierarchical clustering is particularly well suited for real-time updates because the clustering process has to be performed only once to create a complete set of solutions (which also makes the zoom operation very efficient). Other clustering methods may require the computation of a range of solutions based on different numbers of anticipated clusters.

C-TREND produces a dendrogram for each data partition and utilizes a global input value $N$ that represents the maximum-sized cluster solution maintained for each data partition. For all practical purposes, a useful solution will consist of a set of $N \ll n$ clusters ($n$ is the number of data points in partition $i$) and, therefore, C-TREND has to store only $2N - 1$ nodes per partition. We have found that maintaining a maximum solution size consisting of $N = 50$ clusters is more than sufficient for many practical applications of data visualization (i.e., visualizing more than 50 clusters at a time in each partition can be overwhelming and counterproductive for the end user). The value of $N$ can also be set by the user before the preprocessing phase.

A dendrogram data structure allows for quick extraction of any specific clustering solution for each data partition when the user changes partition zoom level $k_i$. To obtain a specific clustering solution from the data structure for data partition $D_i$, C-TREND uses the DENDRO_EXTRACT algorithm (Algorithm 1), which takes the desired number of clusters in the solution $k_i$ as an input and returns the set CurrCl containing the clusters corresponding to the $k_i$-sized solution. Cluster attributes such as center and size are then accessible from the corresponding dendrogram data structure by referencing the clusters in CurrCl.

Algorithm 1. DENDRO_EXTRACT
INPUT: k_i: desired number of clusters
       i: data partition indicator
begin
  if k_i ≤ N then
    CurrCl = {DendrogramRoot_i}
    while |CurrCl| < k_i
      MaxCl = DendrogramRoot_i - |CurrCl| + 1
      CurrCl = (CurrCl \ MaxCl) ∪ {MaxCl.Left} ∪ {MaxCl.Right}
    return CurrCl
  else request new k_i
end


Fig. 7. Example dendrogram with $N = 10$.



DENDRO_EXTRACT starts at the root of the dendrogram and traverses the dendrogram by splitting the highest numbered node (where the nodes are numbered according to how close they are to the root, as numbered in Fig. 7) in the current set of clusters until $k_i$ clusters are included in the set. MaxCl represents the highest element in the current cluster set CurrCl. Because each split removes the highest numbered node and replaces it with two lower numbered ones, every split lowers the highest remaining number by one and adds one cluster to the set; therefore, it is always the case that $\mathit{MaxCl} = \mathit{DendrogramRoot}_i - |\mathit{CurrCl}| + 1$. Furthermore, the dendrogram data array maintains the successive levels of the hierarchical solution in order; therefore, replacing MaxCl by its children MaxCl.Left (left child) and MaxCl.Right (right child) is sufficient for identifying the next solution level in the dendrogram. DENDRO_EXTRACT is linear in time complexity, $O(k_i)$, which provides for the real-time extraction of cluster solutions.
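The extraction procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' code: it assumes that leaves are numbered 1 to N, that the node created by the j-th merge is numbered N + j (so the root is 2N - 1), and that children are stored in two hypothetical dictionaries, `left` and `right`.

```python
def dendro_extract(left, right, root, k, n_max):
    """Return the set of node ids forming the k-cluster solution.

    left/right map each internal node id to its children; root is the id of
    the dendrogram root; n_max is N, the largest solution that is stored.
    """
    if not 1 <= k <= n_max:
        raise ValueError("requested k is outside 1..N; request a new k")
    curr = {root}
    while len(curr) < k:
        max_cl = root - len(curr) + 1       # highest-numbered node in curr
        curr.remove(max_cl)
        curr.update((left[max_cl], right[max_cl]))
    return curr

# Toy dendrogram with N = 4 leaves (ids 1..4) and 3 merges (ids 5..7).
left = {5: 1, 6: 3, 7: 5}
right = {5: 2, 6: 4, 7: 6}
solution = dendro_extract(left, right, root=7, k=3, n_max=4)
```

In this toy dendrogram, the three-cluster solution consists of leaves 3 and 4 together with internal node 5.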

4.2.2 Optimal Number of Clusters

To set an initial value $k_i$ for each data partition $D_i$, C-TREND uses the “optimal” number of clusters based on the so-called “elbow” (or “gap”) criterion for comparing the fitness of different cluster solutions. We use the mean squared error (MSE) as our fitness metric [21]. In agglomerative hierarchical clustering, n solutions are created containing $1, \ldots, n$ clusters, respectively. C-TREND determines the largest “jump” in MSE across these n solutions, which points to the optimal number of clusters. In a plot of the MSE, this jump would be represented by a sharp elbow (i.e., a significant “flattening” in the solution fitness increase).

The “elbow” criterion is discussed in further detail in [3], [62], and [63], and the related “gap” statistic is discussed in [65]. For example, the basic method for determining the optimal number of clusters used by Sugar and James [63] is to 1) calculate a distortion metric for several k-sized clustering solutions (i.e., for different k values), 2) transform the distortion metric based on the data dimensionality, and 3) determine the largest jump in the transformed distortion metric, which indicates the optimal number of clusters. In our experiments, we used MSE as the transformed distortion metric, but this is just one possible solution. Another common method is to use the amalgamation coefficient [3]. Milligan and Cooper [45] provide a review of several other metrics used to determine the optimal number of clusters for a clustering solution, and C-TREND can be extended to support any of these alternative approaches.
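One possible reading of the largest-jump rule is sketched below in Python: given the MSE of each solution size, pick the k that follows the largest single drop in MSE. The function name `elbow_k` and the MSE values are hypothetical, and other interpretations of the “jump” (e.g., on a transformed distortion, as in [63]) would change the selection rule.

```python
def elbow_k(mse):
    """mse[j] is the MSE of the (j + 1)-cluster solution, for j = 0..n-1.

    Returns the k whose solution follows the largest single drop in MSE,
    i.e., the k maximizing mse(k - 1) - mse(k).
    """
    if len(mse) < 2:
        return 1
    drops = [mse[j - 1] - mse[j] for j in range(1, len(mse))]
    return drops.index(max(drops)) + 2

mse = [10.0, 9.0, 3.0, 2.6, 2.4, 2.3]    # hypothetical values for k = 1..6
best_k = elbow_k(mse)
```

With these values, the largest drop occurs when moving from two to three clusters, after which the curve flattens.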

4.2.3 Node Lists and Edge Lists

The last step in preprocessing is the generation of the node list, which contains all possible nodes and their sizes, and the edge list, which contains all possible edges and their weights, for the entire data set. Creating these lists in the preprocessing phase allows for more effective (real-time) visualization updates of the C-TREND graphs.

Each data partition possesses an array-based dendrogram data structure containing all its possible clustering solutions. The node list is simply an aggregate list of all dendrogram data structures indexed for optimal node lookup.

The edge list is generated based on the node list, since an edge is possible between any two nodes in adjacent data partitions. Therefore, the edge list is essentially a list of ordered pairs, where each pair represents adjacent nodes that define an edge. Since each dendrogram contains $2N - 1$ nodes (i.e., N leaves and $N - 1$ internal nodes), there are $(2N - 1)^2$ possible edges between two adjacent data partitions. If the data set possesses t data partitions, then there are $(t - 1)(2N - 1)^2$ possible edges for the entire data set. Note that we use $t - 1$ because the first partition only possesses outgoing edges and the last partition only possesses incoming edges. Therefore, the time complexity of the edge list generation would have an asymptotic upper bound of $O(N^2 t m)$, where m is the number of attributes in the data (needed to calculate edge weights). By calculating all possible edges and their weights in the preprocessing phase, C-TREND can achieve real-time functionality in the analysis phase, as the output graph parameters are being interactively adjusted.
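The edge list construction can be illustrated with a short Python sketch. The data layout (one dictionary of node id to cluster center per partition) and the name `build_edge_list` are assumptions made for this example; the edge weight is taken to be the euclidean distance between cluster centers, as used in this paper.

```python
import math

def build_edge_list(centers_by_partition):
    """centers_by_partition[p] maps node id -> cluster center (a tuple).

    Returns (partition, source_node, target_node, weight) tuples for every
    pair of nodes in adjacent partitions, weighted by center distance.
    """
    edges = []
    for p in range(len(centers_by_partition) - 1):
        for u, cu in centers_by_partition[p].items():
            for v, cv in centers_by_partition[p + 1].items():
                edges.append((p, u, v, math.dist(cu, cv)))
    return edges

# Hypothetical data: t = 3 partitions, N = 2, so 2N - 1 = 3 nodes each.
centers = [
    {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.5, 0.0)},
    {1: (0.0, 0.1), 2: (1.0, 0.2), 3: (0.5, 0.15)},
    {1: (0.1, 0.1), 2: (0.9, 0.3), 3: (0.5, 0.2)},
]
edge_list = build_edge_list(centers)
```

For this input, the list holds $(t - 1)(2N - 1)^2 = 2 \cdot 3^2 = 18$ edges, and each weight costs $O(m)$ to compute.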

Table 1 shows the preprocessing time required to generate edge lists of varying size. It is easy to see that even in some extreme cases where $N = 100$ or 500, edge list preprocessing takes less than 20 seconds. These experiments were performed by implementing the edge list generation procedure in the C programming language, testing on a PC with a Pentium 4 3.4-GHz processor with 1 Gbyte of RAM.

It should be noted that the results reported in Table 1 were calculated holding the number of attributes in the data constant at 10. Since this process requires the calculation of a distance metric for each edge, the time it takes to generate the edge list should increase linearly with the number of attributes in the data. To demonstrate that this is indeed the case, Fig. 8 contains a plot of the increase in edge list generation time as the number of attributes is being increased from 10 to 100, holding N and t constant.

4.3 Interactive Data Visualization

C-TREND utilizes a series of validation flags to maintain and update the displayed state of the output trend graph. Combinations of the validation flags are used to determine whether or not each possible edge and node should be displayed in the graph, and as these flags change, the displayed components of the graph also change.

Each cluster in the node list (dendrogram data structures) possesses two flags: k-pass and $\alpha$-pass. These flags are used to indicate whether the cluster should be included in the output graph based on the $k_i$ value and the $\alpha$ value, respectively. Specifically, when $k_i$ is changed, the dendrogram data structure is updated so that only the clusters that should be extracted for the clustering solution of size $k_i$ have a valid k-pass flag. Similarly, when $\alpha$ is changed, the dendrogram data structure is updated so that only the clusters that are large enough to pass the node filter based on $\alpha$ are assigned a valid $\alpha$-pass flag. The nodes that have both valid k-pass and $\alpha$-pass flags make up the set of nodes that are both large enough and in the desired clustering solution and therefore are included in the output graph.

TABLE 1. Edge List Creation Times

In our implementation, a list of all possible edges and their weights is generated during preprocessing. Each edge in the list possesses a $\beta$-pass flag. When $\beta$ is changed, all edges with a passing weight (based on $\beta$) are assigned a valid $\beta$-pass flag, and all others are assigned an invalid flag. Only edges that have a valid $\beta$-pass flag and are incident to two valid nodes (nodes with valid k-pass and $\alpha$-pass flags) are included in the output graph. Using the implementation described above, C-TREND can update output graphs based on user changes to the $k_i$, $\alpha$, and $\beta$ parameters very efficiently. Changing any one parameter requires only one operation to update the corresponding flag in the data structure for a given node or edge.
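The flag mechanism might be sketched in Python as shown below. The class and function names are hypothetical. Following the filters described elsewhere in this paper, the sketch assumes that a node passes when its size is at least the within-period threshold times the partition size, and that an edge passes when its weight is at most the cross-period threshold times the average edge weight between the two partitions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    size: int
    k_pass: bool = False    # in the currently selected k-sized solution
    a_pass: bool = False    # large enough for the within-period filter

@dataclass
class Edge:
    src: int
    dst: int
    weight: float
    b_pass: bool = False    # weight small enough for the cross-period filter

def set_alpha(nodes, partition_size, alpha):
    for node in nodes.values():
        node.a_pass = node.size >= alpha * partition_size

def set_beta(edges, beta):
    mean_w = sum(e.weight for e in edges) / len(edges)
    for e in edges:
        e.b_pass = e.weight <= beta * mean_w

def visible_edges(edges, src_nodes, dst_nodes):
    ok = lambda n: n.k_pass and n.a_pass
    return [e for e in edges
            if e.b_pass and ok(src_nodes[e.src]) and ok(dst_nodes[e.dst])]

# Hypothetical pair of adjacent partitions with 100 data points each.
src = {1: Node(60, k_pass=True), 2: Node(40, k_pass=True)}
dst = {1: Node(99, k_pass=True), 2: Node(1, k_pass=True)}
edges = [Edge(1, 1, 0.1), Edge(1, 2, 0.9), Edge(2, 1, 0.2), Edge(2, 2, 0.8)]
set_alpha(src, partition_size=100, alpha=0.02)
set_alpha(dst, partition_size=100, alpha=0.02)
set_beta(edges, beta=0.75)
shown = [(e.src, e.dst) for e in visible_edges(edges, src, dst)]
```

Because each parameter touches only its own flag, the three filters can be adjusted in any order, and the rendering step reduces to a conjunction of flags.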

The performance of the interactive data visualization process depends on three basic operations: changing $k_i$ (adjusting the partition zoom level) for any of the data partitions, changing $\alpha$ (filtering based on within-period trend strength) for the entire graph, and changing $\beta$ (filtering based on cross-period trend strength) for the entire graph. Each of these operations requires a set of calculations to be performed on the node and edge lists, and we will show that for most practical purposes, these operations are performed efficiently enough to provide “real-time” graph generation and modification. By implementing the k-pass, $\alpha$-pass, and $\beta$-pass validation flags, we create some independence in the parameters. The parameters can be adjusted independently, and graph elements are rendered only if they are valid based on the status of the flags. As flags are changed and elements become valid, they are rendered in real time.

The computational complexity of the parameter-changing operations can be easily calculated based on the maximal number of clusters N and the number of data partitions t. Changing $\alpha$ requires C-TREND to scan through all possible nodes to determine if each node should have a valid or invalid $\alpha$-pass flag. If each partition contains $2N - 1$ possible nodes and there are t partitions, changing $\alpha$ has an asymptotic upper bound of $O(Nt)$. Similarly, $\beta$ is also a graph-wide parameter, and changing $\beta$ requires C-TREND to scan through all possible edges to determine if each edge should possess a valid or invalid $\beta$-pass flag. Since there are $(t - 1)(2N - 1)^2$ possible edges, changing $\beta$ is $O(N^2 t)$.
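To make these counts concrete, consider a hypothetical configuration with $N = 50$ and $t = 10$ (values chosen only for illustration). The number of flags that each graph-wide operation has to scan is then:

```latex
\[
\text{nodes: } t\,(2N - 1) = 10 \cdot 99 = 990,
\qquad
\text{edges: } (t - 1)(2N - 1)^2 = 9 \cdot 99^2 = 9 \cdot 9801 = 88{,}209 .
\]
```

The edge scan therefore dominates, which is consistent with the node filter being linear in N and the edge filter being quadratic in N.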

Changing $k_i$ is extremely efficient, since each data partition $D_i$ has its own corresponding $k_i$ value and the operation simply updates the k-pass flags for clusters in the corresponding data partition. This operation uses the DENDRO_EXTRACT procedure (Algorithm 1) described in Section 3. Recall that DENDRO_EXTRACT iterates $k_i$ times until it has found the $k_i$ clusters that make up the $k_i$-sized clustering solution, where each iteration needs only constant-time operations; thus, DENDRO_EXTRACT takes $O(k_i)$ time. Changing $k_i$ simply uses DENDRO_EXTRACT to update the k-pass flags of the data structure, and because each partition has at most $2N - 1$ clusters, changing $k_i$ would require at most $2N - 1$ operations. Therefore, in the worst case, changing $k_i$ is linear in the maximum-sized clustering solution, i.e., $O(N)$.

To optimize changing $k_i$ even further, C-TREND uses simple algorithms to increment and decrement the $k_i$ value by one. Recalling the notation from Section 3, to increase the number of clusters included in a partition by one (i.e., $k_i := k_i + 1$), the MaxCl in the set of current clusters, CurrCl, for the $k_i$-sized solution is replaced by its children MaxCl.Left and MaxCl.Right: $\mathit{CurrCl} = (\mathit{CurrCl} \setminus \mathit{MaxCl}) \cup (\mathit{MaxCl.Left} \cup \mathit{MaxCl.Right})$. A very similar approach can be applied to decrement the number of clusters included in a partition: $\mathit{CurrCl} = (\mathit{CurrCl} \cup \mathit{NextCl}) \setminus (\mathit{NextCl.Left} \cup \mathit{NextCl.Right})$, where NextCl is the cluster that is the next highest in the dendrogram hierarchy after MaxCl. These simple optimizations that exploit the dendrogram data structure reduce the complexity of the changing $k_i$ operation to $O(1)$ for increments and decrements of $k_i$.
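A minimal Python sketch of the constant-time increment and decrement steps is given below. It relies on the same illustrative assumptions as before (leaves numbered 1 to N, merge nodes numbered upward to the root, children held in hypothetical `left` and `right` dictionaries) and is not taken from the C-TREND source.

```python
def increment_k(curr, left, right, root):
    """Split the highest-numbered cluster in curr (k := k + 1)."""
    max_cl = root - len(curr) + 1
    if max_cl not in left:
        raise ValueError("already at the maximum stored solution size N")
    curr.remove(max_cl)
    curr.update((left[max_cl], right[max_cl]))

def decrement_k(curr, left, right, root):
    """Undo the most recent split (k := k - 1)."""
    next_cl = root - len(curr) + 2          # node just above the current MaxCl
    if next_cl not in left:
        raise ValueError("already at the single-cluster solution")
    curr.difference_update((left[next_cl], right[next_cl]))
    curr.add(next_cl)

# Toy dendrogram with N = 4 leaves (ids 1..4) and 3 merges (ids 5..7).
left = {5: 1, 6: 3, 7: 5}
right = {5: 2, 6: 4, 7: 6}
curr = {7}
increment_k(curr, left, right, root=7)      # k: 1 -> 2
increment_k(curr, left, right, root=7)      # k: 2 -> 3
after_two_splits = set(curr)
decrement_k(curr, left, right, root=7)      # k: 3 -> 2
```

Each call touches one parent and its two children, so the cost does not depend on N or on the current solution size.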

Fig. 8. Times to produce an edge list based on number of attributes.

To validate the theoretical efficiencies discussed above, we implemented C-TREND parameter adjustment operations in the C programming language and measured the time required to perform several operations on a PC with a Pentium 4 3.4-GHz processor with 1 Gbyte of RAM. Table 2 contains the results of this analysis. We performed the analysis on six parameter configurations, ranging in the number of data partitions (t) and the maximum-sized cluster solution (N) for each partition. For each experiment, we ran the operation for five different randomly generated data sets, and Table 2 displays the average time to perform each operation. Notice that the changing $k_i$ and incrementing/decrementing $k_i$ operations took less than 1 s regardless of the size of the data set used. Additionally, changing $\alpha$ also took less than 1 s for each data set analyzed. Only the changing $\beta$ operation required processing times greater than 1 s. For the data set with a maximum-sized clustering solution of 500, changing $\beta$ took 0.192 seconds. Recall that the changing $\beta$ operation has the computational complexity of $O(N^2 t)$ and that for most practical purposes, it is sufficient to have $N \le 50$. This suggests that changing $\beta$ should take less than 0.084 seconds for typical uses of C-TREND (the last row in Table 2 shows results for an even more complex situation where $N = 100$ and $t = 100$). Additionally, the increase in processing time for these operations from 0.007 seconds to 0.192 seconds corresponding to the increase of $N = 100$ to $N = 500$ agrees with our theoretical calculations of $O(N^2 t)$. The experimental results in Table 2 support our claims that C-TREND output graph generation and modification can be done in real time and can provide instant visual feedback to the user when parameters are changed. We should also note that, unlike the edge list generation procedure in preprocessing, none of the parameter changing operations will be dependent on the number of data attributes. This is because changing parameters requires only simple lookups and comparisons, and no distance metric calculations are required.

To further demonstrate the scalability of the C-TREND technique, we report a second set of experimental results in Figs. 9 and 10. We demonstrate the efficiency of C-TREND in its most computationally costly operation, updating $\beta$, for various large data sets. In Fig. 9, we show that, while increasing the maximum-sized clustering solution N, C-TREND can still update $\beta$ relatively quickly. In fact, for a data set with 150 data partitions and $N = 200$, $\beta$ is updated in less than 0.6 seconds. Fig. 10 demonstrates that a dramatic increase in the number of data partitions can also be handled by C-TREND in a real-time fashion. In fact, for a very large data set containing 10,000 partitions with $N = 50$, $\beta$ is updated in less than 2 seconds.
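The agreement between the Table 2 timings and the quadratic bound can be checked with a short calculation. Assuming that the two reported measurements (0.007 seconds at $N = 100$ and 0.192 seconds at $N = 500$) were taken with the same number of partitions t, which is an assumption because the table itself is not reproduced here, the predicted and observed growth factors are:

```latex
\[
\frac{T(N = 500)}{T(N = 100)} \approx \left(\frac{500}{100}\right)^{2} = 25,
\qquad
\frac{0.192\ \text{s}}{0.007\ \text{s}} \approx 27.4 .
\]
```

The observed factor is close to the predicted one, with the small difference plausibly due to measurement noise and lower-order terms.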

5 EVALUATION: CASE STUDY ON WIRELESS NETWORKING CERTIFICATIONS

C-TREND and temporal cluster graphs provide a versatile technique for identifying and representing trends in temporal multiattribute data. As mentioned earlier, two important parameters are used to filter out spurious graph entities: within-period trend strength $\alpha$ and cross-period trend strength $\beta$. Temporal cluster graphs can be based on multiple clustering methods, distance metrics, and cluster similarity measures. To demonstrate the use of temporal cluster graphs for identifying trends in real-world transactional attribute-value data, we present the analysis of more than 2,400 certifications for new wireless networking technologies based on the IEEE 802.11 standard that are awarded by the Wi-Fi Alliance (wi-fi.org).²

Wi-Fi Alliance certifications are awarded for 10 different technology categories: access points, cellular convergence products, compact flash adapters, embedded clients, Ethernet client devices, external cards, internal cards, PDAs, USB client devices, and wireless printers. Products can be certified based on a number of standards, 15 in all, including the IEEE protocol (802.11a, b, g, d, and h), security protocol (e.g., WPA-personal, WPA-enterprise, and WPA2), the authentication protocol (e.g., EAP, EAPTLS, and PEAP), and quality of service (e.g., WMM and WMM Power Save). Each product certification consists of a date of certification and a set of binary attributes indicating the presence or absence of the standards listed above.

The Wi-Fi certifications data set is a good example for a proof of concept of the temporal cluster graph. The data is multiattribute and temporal, and analysts interested in the evolution of wireless networking technologies can use temporal cluster graphs to identify trends in product types over time. For our analysis, the certification data included all standard-related attributes, as well as the product type (e.g., PDA and internal card) and product category (i.e., whether it is a component, device, or infrastructure technology) attributes. Certifications were coded into product categories based on the similarity of their product type and functionality; for example, compact flash adapters, internal cards, external cards, and USB clients were grouped into the component category, because they all act as components that provide Wi-Fi functionality to other products such as PCs or laptops.

TABLE 2. Experimental Analysis of C-TREND Real-Time Performance

Fig. 9. Time to update graph data structures with a new $\beta$ value (varying the maximum number of clusters N).

Fig. 10. Time to update graph data structures with a new $\beta$ value (varying the number of partitions t).

2. From Wi-Fi.org: “The Wi-Fi Alliance is a global, nonprofit industry trade association with more than 200 member companies devoted to promoting the growth of wireless Local Area Networks (WLAN). Our certification programs ensure the interoperability of WLAN products from different manufacturers, with the objective of enhancing the wireless user experience.”

Figs. 11 and 12 present trend graphs for the Wi-Fi data partitioned into one-year time periods. For each time period, a set of clusters has been identified as nodes. Each node is labeled with the size of the cluster and can be intuitively described by its center. For example, in Fig. 11, the cluster labeled 82 in 2001 contains 82 data points and has a center vector (1, 0, 0, 0, 0, 0.01, 0, 0, 0.46, 0.38, 0, 0.15, 0, 0, 1, 0, 0, 0, 0.04, 0.04, 0, 0, 0.04, 0, 0, 0, 0, 0, 0). It should be noted that all attributes for this data set were binary (1 if the product possessed the attribute and 0 otherwise), and therefore, the center values indicate that this cluster is made up of 100 percent components with 1 percent compact flash cards, 46 percent internal cards, 38 percent external cards, and 15 percent USB client devices. Of these components, 100 percent are 802.11b-certified and 4 percent have WPA-personal, WPA-enterprise, and EAPTLS certifications.³ Based on this information, it is clear that all of the Wi-Fi components (which are mostly adapter cards) are clustered together at this point in the timeline.
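The reading of a cluster center as a vector of proportions follows directly from averaging binary attributes, as the following small Python sketch shows. The four-attribute layout and the sample rows are hypothetical and much smaller than the Wi-Fi attribute set.

```python
def cluster_center(rows):
    """Attribute-wise mean of a list of equal-length binary vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Hypothetical attributes: (component, internal card, external card, 802.11b)
rows = [
    (1, 1, 0, 1),
    (1, 1, 0, 1),
    (1, 0, 1, 1),
    (1, 0, 1, 1),
]
center = cluster_center(rows)
```

Here every product is a component with 802.11b certification, while half are internal cards and half are external cards, so the center is (1.0, 0.5, 0.5, 1.0).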

Edges were rendered between nodes in adjacent time periods to represent similarities between clusters over time. For example, the edge labeled 0.08 in Fig. 11 indicates that the center of the cluster labeled “30” in 2000 is at a distance of 0.08 from the center of the cluster labeled “56” in 2001. This suggests that the clusters are extremely similar. Following this same trend into 2002, the next edge has a weight of 0.03. Therefore, by looking at a temporal cluster graph, the user can see that during 2000-2002, 802.11b access points with very similar technical specifications constituted one dominant technology type in the set of all available Wi-Fi technologies. After 2002, however, we see a larger deviation with an edge weight of 0.26, which indicates a reduction in similarity. The trend continues to a node of access points in 2004, labeled “277.” The average technical characteristics of the technologies making up this node are significantly different from those of the previous generation, as indicated by the similarity measure of 1.6. Even so, this distance is still less than the threshold for edge weights, which is the average of all edge weights between the two partitions multiplied by $\beta = 0.75$. Specifically, the C-TREND tool provides an intuition that the introduction of 802.11g and security technologies led to a significantly different class of wireless access points in 2004.

Some other clear trends in wireless networking technologies are also visible in Fig. 11. A trend of wireless network cards converges in 2001 and continues into 2003 when 802.11g is introduced. The 802.11b-embedded laptop clients first appear in 2003, and in 2004, internal cards enabled with all possible technical attributes appear (802.11b/g/a/h/d and all security/QoS specifications).

By comparing Figs. 11 and 12, we can see the effect of modifying the zoom level and the $\alpha$ and $\beta$ parameters. Fig. 11 presents a trend graph using $\alpha = 0.02$ and $\beta = 0.75$, which excludes clusters smaller than 2 percent of the total number of data points in a data partition and edges with weights above 75 percent of the average edge weight. Fig. 12 has more relaxed parameters with $\alpha = 0.0$ and $\beta = 1.0$ and therefore includes many additional nodes and edges. The advantage of adjusting $\alpha$ and $\beta$ is that it provides the C-TREND user with the ability to show or hide possible trends according to their strength. For example, in Fig. 12, the more relaxed parameters allow the C-TREND users to uncover a new cluster of size six in 2002 and identify trends that were not apparent in Fig. 11. On the other hand, Fig. 12 includes many clusters of size one with either no adjoining edges or edges with high weights. These are likely to be isolated events that do not provide insights on trends and can be filtered out of the graph using more restrictive parameters, as in Fig. 11. The zoom level of the 2001 partition was set to seven clusters in Fig. 12 and five clusters in Fig. 11. Additionally, the zoom level of the 2003 partition was set to five clusters in Fig. 12 and seven clusters in Fig. 11. Adjusting the zoom level allows the user to apply his or her domain expertise to select the optimal granularity for displaying the most relevant clusters.

The Wi-Fi technology analysis demonstrates the ability of temporal cluster graphs to identify and represent trends in multiattribute data. One would expect the Wi-Fi data to contain clear trends based on the versions of the 802.11 standard and the technical features. C-TREND produced temporal cluster graphs that identified these changes in the data, presented them in an intuitive and useful manner, and also provided additional insights. One could use C-TREND to further explore the Wi-Fi trends using different time windows and attribute sets (not provided here because of the space limitations).

3. In this paper, we do not provide such detailed information within the graph figures themselves because of the space limitations; however, this information is readily accessible to C-TREND users (e.g., by clicking on any node in the graph).

Fig. 11. Filtered C-TREND temporal cluster graph of trends in Wi-Fi technologies (with parameters $\alpha = 0.02$ and $\beta = 0.75$).

Fig. 12. Unfiltered C-TREND temporal cluster graph of trends in Wi-Fi technologies (with default parameters $\alpha = 0.0$ and $\beta = 1.0$).

6 DISCUSSION

6.1 Trend Metrics

As demonstrated above, temporal cluster graphs provide a novel approach for identifying and visualizing trends in multiattribute transactional data. These graphs help the user visualize relationships between dominant transaction types in a data set over time. To provide additional analytical power to the users, we next discuss a set of trend metrics that allow the users to analyze the trend directionality, as well as other trend characteristics, and, as a result, better understand the significance and scale of the patterns identified.

Before we define the trend metrics, we first present some notation. As before, let $G = (V, E)$ be a temporal cluster graph, where V is the set of all nodes, and E is the set of all edges in the graph. Trend P of length n is defined as a path of n edges, containing $n + 1$ nodes in graph G. In other words, $P = (v_0, \ldots, v_n)$, where $e(v_{i-1}, v_i) \in E$ for $i = 1, \ldots, n$. The distance between nodes x and y is defined as $d(x, y)$ and, as previously discussed, can be measured using the euclidean distance between x and y (or any other distance metric), where each node x is represented by a corresponding cluster center $x = (x_1, \ldots, x_m)$; here, m denotes the number of attributes (dimensions) in the data. Therefore, the distances between nodes represented by edge weights in the output graph are the distances between the actual clusters in the multidimensional space. Below, we present several trend metrics that are general and can be used with a variety of cluster distance metrics and cluster similarity measures.

Path distance $d_P$ measures the total distance traveled along a trend (path) and is defined as

\[ d_P(v_0, v_n) = \sum_{i=1}^{n} d(v_{i-1}, v_i). \]

In other words, path distance is the sum of the weights of all edges within a path. Direct distance $d_D$ measures the absolute distance between the head (first) and tail (last) nodes in a path and is defined as

\[ d_D(v_0, v_n) = d(v_0, v_n). \]

Transitive distance $d_T$ measures the maximum direct distance between any two nodes in the path and is defined as

\[ d_T(v_0, v_n) = \max_{0 \le i < j \le n} d(v_i, v_j). \]

Note that the transitive distance does not necessarily have to correspond to an edge that exists in the output graph.

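The three distances introduced above have the following property: $0 \le d_D(v_0, v_n) \le d_T(v_0, v_n) \le d_P(v_0, v_n)$. The first inequality holds because the head and tail nodes form one of the pairs considered by the transitive distance; the second holds for any distance that satisfies the triangle inequality (such as the euclidean distance), because no two nodes on a path can be farther apart than the length of the path between them. These distances provide useful information about paths in a graph. The path distance indicates the cumulative amount of change along a trend in a multiattribute space, the direct distance provides a comparison between the start and end points of the trend, and the transitive distance indicates the maximum difference in clusters within a trend.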
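The three distances can be computed for a path of cluster centers with the following Python sketch, which assumes the euclidean distance. The function names and the sample path are made up for illustration.

```python
import math
from itertools import combinations

def path_distance(path):
    """Sum of distances between consecutive cluster centers."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def direct_distance(path):
    """Distance between the first and last cluster centers."""
    return math.dist(path[0], path[-1])

def transitive_distance(path):
    """Largest distance between any two cluster centers on the path."""
    return max(math.dist(a, b) for a, b in combinations(path, 2))

# Hypothetical path of four cluster centers in a two-attribute space.
path = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (3.0, 0.0)]
d_p = path_distance(path)
d_d = direct_distance(path)
d_t = transitive_distance(path)
```

For this path, the direct, transitive, and path distances are 3, 6, and 13, respectively, which respects the ordering stated above.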

The information provided by these distances should prove useful to the analyst, and taking various ratios of these distances provides even more details about the nature of a trend. For instance, the trend directionality ratio $\mathrm{dir}(v_0, v_n)$ is defined as the ratio of the direct distance and the path distance:

\[ \mathrm{dir}(v_0, v_n) = d_D(v_0, v_n) / d_P(v_0, v_n). \]

Because $0 \le d_D(v_0, v_n) \le d_P(v_0, v_n)$, we have that $\mathrm{dir}(v_0, v_n) \in [0, 1]$. Specifically, the directionality ratio measures the continuity of consistent directional change in a trend. A value of $\mathrm{dir}(v_0, v_n) = 1$ means that $d_D(v_0, v_n) = d_P(v_0, v_n)$ and, therefore, the changes in the cluster centers were consistently in the same direction throughout the trend for every attribute. Alternatively, a $\mathrm{dir}(v_0, v_n)$ that is close to zero indicates that $d_D \ll d_P$. In other words, the trend is not moving consistently in the same direction through the multidimensional space. Fig. 13 depicts two scenarios for illustrating different directionality ratios for path $P = (c_1, c_2, c_3, c_4)$, where clusters $c_1$, $c_2$, $c_3$, and $c_4$ would appear as nodes in partitions $t_1$, $t_2$, $t_3$, and $t_4$ of a graph, respectively. From the figure it is clear that $\mathrm{dir}(c_1, c_4)$ is greater in scenario 1 than in scenario 2, and it therefore follows that in scenario 1, the path of cluster centers is moving more consistently in the same direction over time than in scenario 2.⁴
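As a small worked example with made-up cluster centers (not the values behind Fig. 13), consider a straight path A with $c_1 = (0, 0)$, $c_2 = (1, 0)$, $c_3 = (2, 0)$, $c_4 = (3, 0)$ and a back-and-forth path B with $c_1 = (0, 0)$, $c_2 = (1, 0)$, $c_3 = (0, 0)$, $c_4 = (1, 0)$. Both have the same path distance, but their directionality ratios differ:

```latex
\[
\text{Path A: } d_D = 3,\quad d_P = 1 + 1 + 1 = 3,\quad \mathrm{dir} = \tfrac{3}{3} = 1;
\qquad
\text{Path B: } d_D = 1,\quad d_P = 1 + 1 + 1 = 3,\quad \mathrm{dir} = \tfrac{1}{3} \approx 0.33 .
\]
```

Path A moves consistently in one direction, whereas path B reverses direction at every step, and the ratio captures this difference.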

4. Note that Figs. 13 and 14 present distances between clusters within a given trend in a multidimensional attribute space (in this case, two-dimensional, for more intuitive visualization) and are not temporal cluster graphs.

Fig. 13. Two scenarios for calculating the trend directionality ratio.

A second ratio of interest is the trend expansion ratio, $\mathrm{expan}(v_0, v_n)$, which is defined as the ratio of the transitive distance and the path distance:

\[ \mathrm{expan}(v_0, v_n) = d_T(v_0, v_n) / d_P(v_0, v_n). \]

Similar to the directionality ratio, $\mathrm{expan}(v_0, v_n) \in [0, 1]$. The expansion ratio is useful for determining the magnitude of directional change in a trend and, therefore, it complements the directionality ratio. Since the transitive distance measures the maximum absolute distance between any two clusters on a path, a larger transitive distance indicates a longer consistent movement in the same direction within a path. The expansion ratio measures this consistency with respect to the total path distance. Paths with longer periods of directional change will have higher expansion ratio values, while paths that “wind” back and forth will have lower expansion ratios. Fig. 14 provides an example of two scenarios for illustrating different expansion ratios for path $P = (c_1, c_2, c_3, c_4)$. Note that the directionality ratios for scenario 1 and scenario 2 in Fig. 14 would be fairly similar, but comparing expansion ratios captures the difference in path shape. In scenario 1, the transitive distance is about half of the path distance, indicating a longer period of consistent directional change and, therefore, a higher expansion ratio. In scenario 2, the transitive distance is much smaller than in scenario 1, indicating a shorter period of consistent directional change (a winding path) and, therefore, a lower expansion ratio.
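The way the expansion ratio separates paths with the same directionality ratio can be illustrated with a minimal Python sketch. The function `trend_ratios` and both sample paths are hypothetical; they are not the scenarios drawn in Fig. 14.

```python
import math
from itertools import combinations

def trend_ratios(path):
    """Return (directionality, expansion) for a path of cluster centers."""
    d_p = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    if d_p == 0:
        raise ValueError("ratios are undefined when the path distance is zero")
    d_d = math.dist(path[0], path[-1])
    d_t = max(math.dist(a, b) for a, b in combinations(path, 2))
    return d_d / d_p, d_t / d_p

# One long excursion followed by a return toward the start.
out_and_back = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (1.0, 0.0)]
# Seven short steps that alternate between two positions.
winding = [(0.0, 0.0), (1.0, 0.0)] * 4

dir_1, expan_1 = trend_ratios(out_and_back)
dir_2, expan_2 = trend_ratios(winding)
```

Both paths have a path distance of 7 and a direct distance of 1, so their directionality ratios are equal (1/7), but the expansion ratio is 4/7 for the first path and only 1/7 for the winding one.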

The C-TREND implementation provides path ($d_P$), direct ($d_D$), and transitive ($d_T$) distance metrics and trend directionality and expansion ratios for any path selected by the user through a graphical user interface.

6.2 Applications

In Section 5, we provided a detailed case study demonstrating the use of the temporal clustering technique and the C-TREND implementation to analyze the evolution of wireless networking technologies. Although the proposed technique was well suited for the Wi-Fi Alliance certification data, it is designed to be used with any set of multiattribute data that can be partitioned and clustered. To demonstrate this, we next present a brief analysis of Major League Baseball (MLB) batting statistics using temporal cluster graphs. The data used for this analysis was obtained from Sean Lahman’s Baseball Archive, available for download at http://baseball1.com/. Annual batting statistics were collected for every MLB player who played in the years 1967-2006. For the analysis, four attributes were used: hits, home runs, strikeouts, and walks. Each attribute was first normalized by the number of at bats (the number of batting appearances) and then normalized by the range of each variable. Batters with fewer than 100 at bats and zero home runs were removed to reduce the skewness of attribute distributions. The data was partitioned into eight five-year subsets (see Fig. 15).
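The two normalization steps might look as follows in Python. This is a sketch under stated assumptions rather than the authors' procedure: the filter is read literally (a batter is dropped only when both conditions hold), “normalized by the range” is interpreted as min-max scaling, and the field names and sample records are invented.

```python
ATTRS = ("hits", "home_runs", "strikeouts", "walks")

def normalize(players):
    """Return one scaled attribute vector per retained batter."""
    # Assumption: drop batters with fewer than 100 at bats AND zero home runs.
    kept = [p for p in players
            if not (p["at_bats"] < 100 and p["home_runs"] == 0)]
    # Step 1: per-at-bat rates.
    rates = [[p[a] / p["at_bats"] for a in ATTRS] for p in kept]
    # Step 2: scale each attribute by its range (min-max scaling assumed).
    lows = [min(col) for col in zip(*rates)]
    highs = [max(col) for col in zip(*rates)]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rates
    ]

# Hypothetical season records.
players = [
    {"at_bats": 500, "hits": 150, "home_runs": 30, "strikeouts": 100, "walks": 50},
    {"at_bats": 400, "hits": 100, "home_runs": 4, "strikeouts": 60, "walks": 20},
    {"at_bats": 50, "hits": 10, "home_runs": 0, "strikeouts": 20, "walks": 2},
]
vectors = normalize(players)
```

Scaling every attribute to a common range keeps any single statistic from dominating the euclidean distance between cluster centers.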

The analysis of the partitioned data reveals some interesting trends in the baseball batting data. First, there is a strong trend over the years for average hitters. The performance of average hitters has not changed much over the past 40 years, and this is indicated by the very small distances between clusters in adjacent time partitions. Another interesting trend is the periodic appearance of clusters of power hitters (i.e., hitters with significantly more home runs). Additionally, subpar hitters are apparent in the early years of the data set but are either absorbed by other clusters or are not as prevalent in later years.

Fig. 15 also displays trend metrics calculated for all paths between the node labeled “1338” in the first partition and the node labeled “1668” in the fourth partition. Three possible paths exist between these nodes: $P_1$ is 1338 → 1379 → 1560 → 1668, $P_2$ is 1338 → 48 → 1560 → 1668, and $P_3$ is 1338 → 42 → 1560 → 1668. The direct distance is obviously the same for all paths ($d_D = 0.03$); however, the path distance for path 1 is $d_{P_1} = 0.09$, which is significantly smaller than those of the other two paths. This indicates that the total distance traveled along path 1 is shorter than along the other paths. Additionally, the directionality ratio for path 1 is $\mathrm{dir}(P_1) = 0.33$ as compared to $\mathrm{dir}(P_2) = 0.05$ and $\mathrm{dir}(P_3) = 0.04$. This suggests that the changes in cluster center location along $P_1$ are more consistently in the same direction as compared to $P_2$ and $P_3$. The combination of a reasonably large directionality ratio (0.33) for $P_1$ and a very small direct distance (0.03) suggests that there has been little change in the average batter’s statistics over path $P_1$. Further, the expansion ratios for each path are $\mathrm{expan}(P_1) = 0.55$, $\mathrm{expan}(P_2) = 0.48$, and $\mathrm{expan}(P_3) = 0.51$. This suggests that all three paths contain about the same magnitude of directional change.
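The reported ratios can be related back to the underlying distances with a short calculation. Only the first equality uses values stated in the text; the remaining quantities are back-computed from rounded ratios and should be treated as approximate:

```latex
\[
\mathrm{dir}(P_1) = \frac{d_D}{d_{P_1}} = \frac{0.03}{0.09} \approx 0.33,
\qquad
d_{P_2} \approx \frac{0.03}{0.05} = 0.60,
\qquad
d_{P_3} \approx \frac{0.03}{0.04} = 0.75,
\]
\[
d_T(P_1) = \mathrm{expan}(P_1)\, d_{P_1} \approx 0.55 \times 0.09 \approx 0.05 .
\]
```

The implied path distances for $P_2$ and $P_3$ are several times larger than that of $P_1$, which matches the observation that $P_1$ is the shortest route between the two nodes.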


Fig. 14. Two scenarios for calculating the trend expansion ratio.

Fig. 15. C-TREND temporal cluster graph of trends in baseball batting data partitioned by year (with parameters $\alpha = 0.01$ and $\beta = 0.85$) with trend metrics output window (with metrics for several paths).



Finally, all the temporal cluster graphs shown thus far have used data partitioned along the temporal dimension. However, temporal cluster graphs and the C-TREND application can also be used to analyze data partitioned along other dimensions. Any continuous or ordinal categorical attribute can be used as the reference for data partitioning. As a proof of concept, we partitioned the same baseball batting data based on the weight of the player. Eight groups of players were defined based on equally sized (in terms of the number of batters) partitions, four on each side of the mean player weight. Nodes, edges, and trend metrics were derived in the same manner as was shown for the temporal partitioning. The resulting graph allows the user to identify and analyze additional trends in the batting statistics based on players' weight, as shown in Fig. 16. In other words, we can see trends in batting statistics as we move from lightweight players to heavyweight ones. For example, one can notice that there are far fewer above-average and below-average lightweight hitters in terms of batting performance.
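The weight-based partitioning described above can be sketched in Python as follows. This is one possible reading of the procedure, not the C-TREND implementation: values below the mean are sorted and split into four groups of (nearly) equal count, and values at or above the mean are split the same way. The function names and the sample weights are hypothetical.

```python
from statistics import mean


def split_equal(sorted_values, k):
    """Split a sorted list into k contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(sorted_values), k)
    groups, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        groups.append(sorted_values[start:start + size])
        start += size
    return groups


def partition_around_mean(values, per_side=4):
    """Return 2 * per_side ordered groups, per_side on each side of the mean."""
    m = mean(values)
    below = sorted(v for v in values if v < m)
    above = sorted(v for v in values if v >= m)
    return split_equal(below, per_side) + split_equal(above, per_side)


# Hypothetical player weights in pounds, for illustration only.
weights = [150, 155, 160, 165, 170, 175, 180, 185,
           190, 195, 200, 205, 210, 215, 220, 225]
groups = partition_around_mean(weights)
print(groups)
```

Each resulting group plays the role that a time period plays in the temporal case: the records in a group are clustered, and nodes and edges are then derived across adjacent groups.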

6.3 Practical Significance, Limitations, and Future Research Directions

In this paper, we have demonstrated some of the possible applications of temporal cluster graphs using two very different data sets: Wi-Fi data and baseball data. Thus, this technique can be applied in a wide variety of data analysis settings. For example, in business applications, the C-TREND system can also be used to identify changes in customer purchasing behaviors over time, visualize trends in website usage behavior, or identify patterns of credit card use for fraud detection. Practically any scenario in which an analyst wishes to visualize changes in dominant data types over time could utilize temporal cluster graphs. In addition, historical modeling of trends in economic and technological change using temporal cluster graphs could aid in the development of forecasts. Possible future extensions of the technique include hypothesis testing and automated data analytics of temporal data using C-TREND as the analytical engine.

This work provides many directions for future research. Temporal cluster graphs provide a general framework for developing new trend analysis techniques. We plan to develop additional functionality in the C-TREND system by extending the set of metrics for measuring trend and graph characteristics. We will focus on such issues as measuring trend strength, comparing trends, and interpreting the graph structure. Additionally, to enhance visualization and trend identification, we plan to develop more advanced algorithms and data structures in order to provide the user with the ability to dynamically adjust the time partition granularity in real time. At present, data partitions are defined exogenously by the user; however, it may also be advantageous to use data mining techniques to identify optimal data partitions. Also, temporal cluster graphs provide the initial structure for developing predictive models and hypotheses for the existence, birth, death, and continuation of trends in data.

7 CONCLUSION

By harnessing computational techniques of data mining, we have developed a new temporal clustering technique for discovering, analyzing, and visualizing trends in multiattribute temporal data. The proposed technique is versatile, and the implementation of the technique as the C-TREND system gives significant data representation power to the user: domain experts have the ability to adjust parameters and clustering mechanisms to fine-tune trend graphs. We demonstrated that the C-TREND implementation is scalable: the time required to adjust trend parameters is quite low even for larger data sets, which provides for real-time visualization capabilities. Furthermore, the proposed temporal clustering analysis technique is applicable in many different data analysis contexts and can provide insights for analysts performing historical analyses and generating forecasts.

ACKNOWLEDGMENTS

The authors would like to thank the Digital Technology Center and the Carlson School of Management, University of Minnesota, for providing joint financial support of this research. The authors would also like to thank Prasad Sriram for his assistance with the development of the graphical user interface. Also, the research reported in this paper was supported in part by the US National Science Foundation CAREER Grant IIS-0546443.

REFERENCES[1] J. Abello and J. Korn, “MGV: A System of Visualizing Massive

Multi-Digraphs,” IEEE Trans. Visualization and Computer Graphics,vol. 8, no. 1, pp. 21-38, Jan.-Mar. 2001.

[2] R. Agrawal, K.I. Lin, H.S. Sawhney, and K. Shim, “Fast SimilaritySearch in the Presence of Noise, Scaling, and Translation in Time-Series Databases,” Proc. 21st Int’l Conf. Very Large Data Bases(VLDB ’95), pp. 490-501, 1995.

[3] M.S. Aldenderfer and R.K. Blashfield, Cluster Analysis. SagePublications, 1984.

[4] C.M. Antunes and A.L. Oliveira, “Temporal Data Mining: AnOverview,” Proc. ACM SIGKDD Workshop Data Mining, pp. 1-13,Aug. 2001.

[5] C. Apte, B. Liu, E. Pednault, and P. Smyth, “Business Applicationsof Data Mining,” Comm. ACM, vol. 45, no. 8, pp. 49-53, 2002.

[6] G.C. Battista, P. Eades, R. Tamassia, and I.G. Tollis, Graph

Drawing. Prentice Hall, 1999.


Fig. 16. C-TREND temporal cluster graph of trends in baseball batting data partitioned by weight (with parameters α = 0.01 and β = 0.75).



[7] B. Becker, R. Kohavi, and D. Sommerfield, “Visualizing the SimpleBayesian Classifier,” Proc. ACM SIGKDD Workshop Issues on theIntegration of Data Mining and Data Visualization, 1997.

[8] B. Bederson, “Pad++: Advances in Multiscale Interfaces,” Proc.Conf. Human Factors in Computing Systems (CHI ’94), p. 315, 1994.

[9] D.J. Berndt and J. Clifford, “Finding Patterns in Time Series: ADynamic Programming Approach,” Advances in Knowledge Dis-covery and Data Mining, pp. 229-248, 1995.

[10] J. Bertin, Semiology of Graphics: Diagrams, Networks, Maps,

W.J. Berg, translator, Univ. of Wisconsin Press, 1983.[11] C.G. Beshers and S.K. Feiner, “Visualizing n-Dimensional Virtual

Worlds within n-Vision,” Computer Graphics, vol. 24, no. 2, pp. 37-38, 1990.

[12] C.G. Beshers and S.K. Feiner, “AutoVisual: Rule-Based Design of Interactive Multivariate Visualizations,” IEEE Computer Graphicsand Applications, vol. 13, no. 4, pp. 41-49, 1993.

[13] C.G. Beshers and S.K. Feiner, “Automated Design of DataVisualizations,” Scientific Visualization—Advances and Applications,L. Rosemblum et al., eds., pp. 88-102, Academic Press, 1994.

[14] C. Bettini, S. Wang, S. Jajodia, and J.L. Lin, “Discovering FrequentEvent Patterns with Multiple Granularities in Time Sequences,”IEEE Trans. Knowledge and Data Eng., vol. 10, no. 2, pp. 222-237,Mar./Apr. 1998.

[15] P. Brockwell and R. Davis, Time Series: Theory and Methods.Springer, 2001.

[16] S. Card, J. Mackinlay, and B. Schneiderman, Readings in Informa-tion Visualization. Morgan Kaufmann, 1999.

[17] C. Chen, Information Visualization and Virtual Environments.Springer, 1999.

[18] M.C. Chuah and S.F. Roth, “On the Semantics of InteractiveVisualization,” Proc. IEEE Symp. Information Visualization (InfoVis’96), pp. 29-36, 1996.

[19] M.C.F. de Oliveira and H. Levkowitz, “From Visual DataExploration to Visual Data Mining: A Survey,” IEEE Trans.Visualization and Computer Graphics, vol. 9, no. 3, pp. 378-394,

July-Sept. 2003.[20] T.G. Dietterich and R.S. Michalski, “Discovering Patterns in

Sequences of Events,” Artificial Intelligence, vol. 25, no. 2,pp. 187-232, 1985.

[21] R. Duda, P. Hart, and D. Stork, Pattern Classification, second ed.Wiley-Interscience, 2000.

[22] S.G. Eick and G.J. Wills, “Navigating Large Networks with

Hierarchies,” Proc. IEEE Conf. Visualization (VIS ’93), pp. 204-210,1993.[23] K. Fishkin and M.C. Stone, “Enhanced Dynamic Queries via

Movable Filters,” Proc. Conf. Human Factors in Computing Systems(CHI ’95), pp. 415-420, 1995.

[24] B.J. Frey and D. Dueck, “Clustering by Passing Messages betweenData Points,” Science, vol. 315, no. 5814, pp. 972-976, 2007.

[25] V. Guralnik and J. Srivastava, “Event Detection from Time SeriesData,” Proc. ACM SIGKDD ’99, pp. 33-42, 1999.

[26] R.J. Hendley, N.S. Drew, A.M. Wood, and R. Beale, “Narcissus:Visualizing Information,” Proc. Int’l Symp. Information Visualization(InfoVis ’95), pp. 90-96, 1995.

[27] W.L. Hibbard, C.R. Dryer, and B.E. Paul, “A Lattice Model of DataDisplay,” Proc. IEEE Conf. Visualization (VIS ’94), pp. 310-317, 1994.

[28] A. Jain, M. Murty, and P. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[29] N. Jardine and R. Sibson, “The Construction of Hierarchic and

Non-Hierarchic Classifications,” The Computer J., vol. 11, no. 2,pp. 177-184, 1968.

[30] S.C. Johnson, “Hierarchical Clustering Schemes,” Psychometrika,vol. 32, no. 3, pp. 241-254, 1967.

[31] Y. Kakizawa, R.H. Shumway, and M. Taniguchi, “Discriminationand Clustering for Multivariate Time Series,” J. Am. Statistical

Assoc., vol. 93, no. 441, pp. 328-340, 1998.[32] E. Kandogan, “Visualizing Multi-Dimensional Clusters, Trends,

and Outliers Using Star Coordinates,” Proc. ACM SIGKDD ’01,pp. 107-116, 2001.

[33] L. Kaufman and P. Rousseeuw, Finding Groups in Data: AnIntroduction to Cluster Analysis. John Wiley & Sons, 1990.

[34] D.A. Keim, “Visual Database Exploration Techniques,” Proc. ACMSIGKDD Tutorial, 1997.

[35] D.A. Keim, “Information Visualization and Visual Data Mining,”IEEE Trans. Visualization and Computer Graphics, vol. , no. 1, pp. 1-8,

2002.

[36] D.A. Keim and H.P. Kriegel, “Visualization Techniques forMining Large Databases: A Comparison,” IEEE Trans. Knowledgeand Data Eng., vol. 8, no. 6, pp. 923-936, Dec. 1996.

[37] E. Keogh, “A Fast and Robust Method for Pattern Matching inTime Series Databases,” Proc. Ninth Int’l Conf. Tools with ArtificialIntelligence (TAI), 1997.

[38] E. Keogh and S. Kasetty, “On the Need for Time Series DataMining Benchmarks: A Survey and Empirical Demonstration,”Data Mining and Knowledge Discovery, vol. 7, no. 4, pp. 349-371,

2003.[39] E. Keogh and M. Pazzani, “An Enhanced Representation of Time

Series Which Allows Fast and Accurate Classification, Clustering,and Relevance Feedback,” Proc. ACM SIGKDD ’98, R. Agrawal,P. Stolorz, and G. Piatetsky-Shapiro, eds., pp. 239-241, 1998.

[40] E. Keogh and P. Smyth, “A Probabilistic Approach to Fast PatternMatching in Time Series Databases,” Proc. ACM SIGKDD, 1997.

[41] J. LeBlanc, M.O. Ward, and N. Wittels, “Exploring n-DimensionalDatabases,” Proc. IEEE Conf. Visualization (VIS ’90), pp. 230-237,1990.

[42] Y. Li, X.S. Wang, and S. Jajodia, “Discovering Temporal Patternsin Multiple Granularities,” Proc. Int’l Workshop Temporal, Spatialand Spatio-Temporal Data Mining (TSDM), 2000.

[43] J.D. Mackinlay, “Automating the Design of Graphical Presenta-tions of Relational Information,” ACM Trans. Graphics, vol. 5, no. 2,pp. 110-141, 1986.

[44] H. Mannila, H. Toivonen, and A.I. Verkamo, “DiscoveringFrequent Episodes in Sequences,” Proc. ACM SIGKDD ’95,pp. 210-215, 1995.

[45] G.W. Milligan and M.C. Cooper, “An Examination of Proceduresfor Determining the Number of Clusters in a Data Set,”Psychometrika, vol. 50, no. 2, pp. 159-179, 1985.

[46] N. Molinari, C. Bonaldi, and J.P. Daures, “Multiple TemporalCluster Detection,” Biometrics, vol. 57, no. 2, pp. 577-583, 2001.

[47] T. Oates, “Identifying Distinctive Subsequences in MultivariateTime Series by Clustering,” Proc. ACM SIGKDD ’99, pp. 322-326,1999.

[48] B. Padmanabhan and A. Tzuhilin, “Pattern Discovery in TemporalDatabases: A Temporal Logic Approach,” Proc. ACM SIGKDD,1996.

[49] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U.Dayal, and M.-C. Hsu, “Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach,” IEEE Trans. Knowledge andData Eng., vol. 16, no. 10, pp. 1-17, Oct. 2004.

[50] R.J. Povinelli, “Identifying Temporal Patterns for Characterizationand Prediction of Financial Time Series Events,” Proc. Int’lWorkshop Temporal, Spatial and Spatio-Temporal Data Mining (TSDM’00), pp. 46-61, 2000.

[51] R.J. Povinelli and X. Feng, “A New Temporal Pattern Identifica-tion Method for Characterization and Prediction of Complex TimeSeries Events,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 2,pp. 339-352, Mar./Apr. 2003.

[52] D. Pyle, Data Preparation for Data Mining. Morgan Kaufman, 1999.[53] R. Rao and S.K. Card, “The Table Lens: Merging Graphical and

Symbolic Representation in an Interactive Focus-Context Visuali-zation for Tabular Information,” Proc. Conf. Human Factors inComputing Systems (CHI ’94), pp. 318-322, 1994.

[54] J. Roddick, K. Hornsby, and M. Spiliopoulou, “An UpdatedBibliography of Temporal, Spatial, and Spatio-Temporal Data

Mining Research,” Proc. Int’l Workshop Temporal, Spatial and Spatio-Temporal Data Mining (TSDM), 2000.

[55] J. Roddick and M. Spiliopoulou, “A Survey of Temporal Knowl-edge Discovery Paradigms and Methods,” IEEE Trans. Knowledgeand Data Eng., vol. 14, no. 4, pp. 750-767, July/Aug. 2002.

[56] S.F. Roth and J. Mattis, “Data Characterization for IntelligentGraphics Presentations,” Proc. Conf. Human Factors in ComputingSystems (CHI ’90), pp. 193-200, 1990.

[57] B. Schneiderman, “Tree Visualization with Treemaps: A 2D Space-Filling Approach,” ACM Trans. Graphics, vol. 11, no. 1, pp. 92-99,1992.

[58] B. Schneiderman, “The Eyes Have It: A Task by Data TypeTaxonomy for Information Visualizations,” Proc. IEEE Symp.Visual Languages, 1996.

[59] H. Senay and E.A. Ignatius, “Knowledge-Based System forVisualization Design,” IEEE Computer Graphics and Applications,

vol. 14, no. 6, pp. 36-47, 1994.




[60] M. Shaw, C. Subramaniam, G.W. Tan, and M. Welge, “KnowledgeManagement and Data Mining for Marketing,” Decision SupportSystems, vol. 31, no. 1, pp. 127-137, 2001.

[61] B. Spence, Information Visualization. Pearson Education HigherEducation, 2000.

[62] C.A. Sugar, “An Application of Cluster Analysis to HealthServices Research: Empirically Defined Health States for Depres-sion from the SF-12,” technical report, Dept. of Statistics, StanfordUniv., 1999.

[63] C.A. Sugar and G.M. James, “Finding the Number of Clusters in aData Set: An Information Theoretic Approach,” J. Am. Statistical Assoc., vol. 98, pp. 750-763, 2003.

[64] D. Tang, C. Stolte, and P. Hanrahan, “Polaris: A System for Query,Analysis and Visualization of Multi-Dimensional RelationalDatabases,” IEEE Trans. Visualization and Computer Graphics,vol. 8, no. 1, pp. 52-65, 2002.

[65] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the Numberof Clusters in a Data Set via the Gap Statistic,” J. Royal StatisticalSoc., vol. 63, no. 2, pp. 411-423, 2001.

[66] C. Ware, Information Visualization: Perception for Design. MorganKaufmann, 2000.

[67] M. Zaki, “SPADE: An Efficient Algorithm for Mining FrequentSequences,” Machine Learning, vol. 42, no. 1-2, pp. 31-60, 2001.

Gediminas Adomavicius received the PhD degree in computer science from New York University in 2002. He is an associate professor of information and decision sciences in the Carlson School of Management, University of Minnesota. His research focuses on personalization technologies, data mining, and combinatorial auction mechanisms. He has published more than 40 journal and conference papers in these areas, including papers in leading computer science and information systems journals such as the IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, the ACM Transactions on Information Systems, Information Systems Research, and the INFORMS Journal on Computing. He served on the program committees of numerous computer science and information systems conferences. He received the US National Science Foundation CAREER award in 2006 for his research on personalization technologies. He is a member of the IEEE.

Jesse Bockstedt received the MS degree in mechanical engineering in 2004. He is currently working toward the PhD degree in information systems at the University of Minnesota (expected 2008). He has worked in the services research group at IBM Research and as a management and IT consultant with Accenture in the telecommunications industry. He has published research in Communications of the ACM, Information Technology and Management, the International Journal of Electronic Commerce, and MIS Quarterly. His research interests include IT evolution, business intelligence, the impact of new IT on markets and consumers, and data visualization.
