A beginner’s guide to data visualization | Huzefa Johar
CHART IT RIGHT
Table of Contents
Abstract
Why do we need charts?
Understanding the data
Charting homogenous data
  Comparison
  Transition
  Composition
Charting heterogeneous data
  Implied Correlation
  Direct Correlation
Abstract
Since most business-related actions and decisions are based on analysis of data,
professionals are often required to create charts to help their companies select
the best course of action. Charting is therefore an essential skill that finds
application in all areas of business, including marketing, finance, human resource
management and R&D.
Typical school education provides only elementary training in data visualization,
so people are often hard-pressed when they have to put their high school charting
skills into practice. The charting concepts learnt in school are inadequate in the
present technological environment: today data is readily available and software
tools afford myriad forms of visual data analysis. Hence, in order to convey data
effectively, it is essential to understand the implications of the various data
visualization methods.
This tutorial attempts to bridge the gap between charting principles taught at
schools and the real world application of data visualization. It is intended for
anyone who aspires to become proficient in using charts for exploratory analysis of
data. The paper has been written for the benefit of the general public and may be
of interest to students, researchers, executives and even designers of analytics
systems.
Why do we need charts?
It is mildly surprising that data visualization methods developed so recently, when
man has had to deal with data since the very dawn of civilization. It was only in
the 17th century that Descartes devised the coordinate system, which paved the way
for data visualization techniques. Naturally, there must have existed other means
of making sense of data, for how could man have progressed without the insights
that are gained by drawing inferences from data?
Charts have been around for only a couple of centuries, so prior to the advent of
data visualization people used tables to study and analyze data. The humble table
has aided countless generations over many centuries, and it continues to be of
immense value today. In fact, tables are such an integral part of our lives that we
hardly notice that commonplace objects such as calendars, train schedules and bank
statements are all avatars of the humble table.
By definition, data is a structured compilation of related chunks of information.
Tabulation helps in the formulation of data because it imparts structure to an
otherwise meaningless bundle of isolated pieces of information. Data and tables are
therefore two sides of the same coin. Data analysis is all about extracting
information from data, and given the structure of a data table it is easy to pull
out the requisite information using the simple technique of referencing.
January 2011
Su Mo Tu We Th Fr Sa
                   1
 2  3  4  5  6  7  8
 9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31
Extracting information from table
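Referencing a table amounts to a simple lookup. As a minimal sketch in Python (the standard-library calendar and datetime modules are an assumed choice of tooling, not something this tutorial prescribes), the January 2011 calendar can be regenerated and a single date referenced like this:

```python
import calendar
from datetime import date

# English day names, indexed the way date.weekday() counts (Monday = 0).
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Rebuild the month grid with weeks starting on Sunday, as in the table.
january_2011 = calendar.TextCalendar(firstweekday=calendar.SUNDAY).formatmonth(2011, 1)
print(january_2011)

# "Referencing": look up the weekday of one particular date.
weekday_name = DAY_NAMES[date(2011, 1, 14).weekday()]
print("Jan 14th, 2011 falls on a", weekday_name)
```

The lookup returns Friday, which is what one reads off the calendar by following the row containing 14 up to its column heading.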
Additionally, since data is made up of related bits of information, one can combine
closely linked pieces of information to extract what is called derived
information.
Date        Narration          Withdrawals  Deposits  Closing Balance
10/22/2010  ABC Telecom        $60                    $1340
10/30/2010  Smith & Co                      $325      $1665
11/12/2010  Mc Arthur Ltd                   $550      $2215
11/22/2010  ABC Telecom        $90                    $2125
12/05/2010  Clarkson & Smith   $225                   $1900
12/17/2010  Ronald Ballantine               $475      $2375
12/30/2010  ABC Telecom        $65                    $2310
Total:                         $440         $1350
Deriving information from table
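Derived information is obtained by computing over related entries. The following is a minimal Python sketch using the bank statement rows; the split of each amount into the withdrawal or deposit column is inferred from how the closing balance moves, and the variable names are illustrative only.

```python
# (date, narration, withdrawal, deposit) rows taken from the bank statement.
# Which column an amount belongs to is inferred from the closing balance.
transactions = [
    ("10/22/2010", "ABC Telecom", 60, 0),
    ("10/30/2010", "Smith & Co", 0, 325),
    ("11/12/2010", "Mc Arthur Ltd", 0, 550),
    ("11/22/2010", "ABC Telecom", 90, 0),
    ("12/05/2010", "Clarkson & Smith", 225, 0),
    ("12/17/2010", "Ronald Ballantine", 0, 475),
    ("12/30/2010", "ABC Telecom", 65, 0),
]

# Derived information: column totals and the net change over the period.
total_withdrawals = sum(withdrawal for _, _, withdrawal, _ in transactions)
total_deposits = sum(deposit for _, _, _, deposit in transactions)
net_change = total_deposits - total_withdrawals

# Derived information: total paid to one payee.
telecom_spend = sum(w for _, name, w, _ in transactions if name == "ABC Telecom")

print(f"Withdrawals ${total_withdrawals}, deposits ${total_deposits}, net ${net_change}")
print(f"Paid to ABC Telecom: ${telecom_spend}")
```

None of these four figures is stored in any single row; each is derived by combining several linked entries.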
Calendar callout: What day is Jan 14th?
As is apparent from the examples, data in its raw tabular form is well suited to
referencing and computing. However, raw data cannot be grasped in its entirety.
Just by looking at a table of revenue and cost figures, one cannot make out whether
revenue is growing or declining, and this is where charts pitch in. They help in
visualizing the data at many different levels. Just by looking at a chart, one can
comment on the collective properties of all data elements as well as spot anomalies
within the data. In fact, by plotting data in different ways one can uncover many
more facets of the data.
Let us look at some examples to see how charts can be used to extract valuable
information from data.
       2007                       2008                       2009
       Revenue  Cost     Profit   Revenue  Cost     Profit   Revenue  Cost     Profit
Q1     35,000   30,000   5,000    38,000   30,000   8,000    44,000   33,000   11,000
Q2     37,000   31,000   6,000    40,000   30,500   9,500    45,700   33,500   12,200
Q3     39,000   31,500   7,500    41,000   31,500   9,500    47,000   35,000   12,000
Q4     41,000   32,000   9,000    42,000   33,000   9,000    48,300   35,700   12,300
Total  152,000  124,500  27,500   161,000  125,000  36,000   185,000  137,200  47,500
The column plot of quarterly revenue reveals the following:
• Each year, the quarterly revenue increases progressively from the first through
  the fourth quarter.
• Revenue drops steeply in the first quarter of 2008, but then increases gradually,
  and the final-quarter earnings of 2008 overshoot the previous high by a small
  margin.
• Revenue earned in each quarter of 2009 is significantly greater than the
  quarterly revenues earned in previous years.
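The three observations can be cross-checked against the quarterly revenue figures. The sketch below is only an illustration in plain Python: it prints a crude text version of the column plot (one '#' per $1,000, an arbitrary scale chosen here) and tests each observation numerically.

```python
# Quarterly revenue (Q1..Q4) per year, taken from the table.
revenue = {
    2007: [35_000, 37_000, 39_000, 41_000],
    2008: [38_000, 40_000, 41_000, 42_000],
    2009: [44_000, 45_700, 47_000, 48_300],
}

# Crude text column plot: one '#' per $1,000 of revenue.
for year, quarters in revenue.items():
    for number, value in enumerate(quarters, start=1):
        print(f"{year} Q{number} {'#' * (value // 1000)} {value:,}")

# Observation 1: revenue rises from quarter to quarter within every year.
rises_within_year = all(
    all(a < b for a, b in zip(quarters, quarters[1:]))
    for quarters in revenue.values()
)

# Observation 2: a drop entering 2008, then a new high by the end of 2008.
q1_2008_drop = revenue[2007][3] - revenue[2008][0]
q4_2008_above_previous_high = revenue[2008][3] - max(revenue[2007])

# Observation 3: every 2009 quarter beats every earlier quarter.
all_2009_quarters_higher = min(revenue[2009]) > max(revenue[2007] + revenue[2008])
```

What the eye picks up from the chart at a glance takes several explicit comparisons to establish from the table.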
A stacked area chart showing the break-up of annual revenue yields some very
interesting facts:
• The sudden steepening of the upper line indicates an increase in the rate of
  revenue growth
• The gradual widening of the pink region from left to right indicates an increase
  in profit margin
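Both readings of the stacked area chart correspond to simple ratios of the annual totals. A minimal Python sketch, using the annual revenue and profit totals from the table (the variable names are illustrative):

```python
# Annual (revenue, profit) totals taken from the table.
annual = {
    2007: (152_000, 27_500),
    2008: (161_000, 36_000),
    2009: (185_000, 47_500),
}
years = sorted(annual)

# Year-on-year revenue growth: the slope of the upper line of the chart.
revenue_growth = {
    year: (annual[year][0] - annual[previous][0]) / annual[previous][0]
    for previous, year in zip(years, years[1:])
}

# Profit margin: the share of the total height taken up by the profit band.
profit_margin = {year: profit / revenue for year, (revenue, profit) in annual.items()}

for year in years[1:]:
    print(f"{year}: revenue growth {revenue_growth[year]:.1%}")
for year in years:
    print(f"{year}: profit margin {profit_margin[year]:.1%}")
```

Revenue growth rises from about 5.9% in 2008 to about 14.9% in 2009, and the profit margin widens from about 18.1% to 22.4% to 25.7%, in line with the two observations.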
The examples clearly demonstrate the effectiveness of charts as a tool for carrying
out exploratory analysis of data. But in order to create worthwhile charts it is
essential to understand the basic properties of data, because implementation of
data visualization techniques is not possible without a fundamental understanding
of data.
Understanding the data
The task of transforming data into charts may seem like a straightforward affair,
but it actually requires a little thinking and decision making. The process of
charting begins with an evaluation of the data, as this helps in choosing the mode
of visual analysis that conveys the data most effectively.
When one looks at data with the intention of conveying it through charts, it is
imperative to think in terms of categories and series. This is because the act of
plotting data onto a chart involves mapping the data elements to the categorical
axis and the numerical axis of the chart.
Region             Annual Sales    Estimated Size of the Market
North Asia         $20,000,000     $40,000,000
Central Asia       $15,000,000     $35,000,000
South East Asia    $54,000,000     $75,000,000
Equatorial Africa  $7,000,000      $9,000,000
African Peninsula  $15,000,000     $21,000,000
Eastern Europe     $40,000,000     $65,000,000
Western Europe     $80,000,000     $110,000,000
North America      $130,000,000    $170,000,000
South America      $60,000,000     $90,000,000
Australia          $35,000,000     $50,000,000
Original data: Region is the categorical data (yellow); Annual Sales and Estimated
Size of the Market are the series data (green)
[Figure: Regional Market Share & Sales chart, with the categorical axis and the numerical axis labelled]
The nature of the data series plays an important role in chart selection. If the
data series are similar in terms of their base unit, they can be plotted against a
common numerical axis; otherwise they must be plotted against separate numerical
axes. In the case of the Regional Market Share & Sales chart, the two series,
namely Annual Sales and Estimated Size of the Market, are similar, or homogenous,
as both are expressed in terms of a monetary unit.
Now, here is an example of data comprising two dissimilar, or heterogeneous,
series:
Region             Revenue         Profit %
North Asia         $20,000,000     21%
Central Asia       $15,000,000     19%
South East Asia    $54,000,000     20%
Equatorial Africa  $7,000,000      15%
African Peninsula  $15,000,000     16%
Eastern Europe     $40,000,000     21%
Western Europe     $80,000,000     23%
North America      $130,000,000    25%
South America      $60,000,000     19%
Australia          $35,000,000     20%
Here the two series are heterogeneous, because revenue is expressed in terms of
a monetary unit (i.e. $), while profit is expressed as a percentage. If the data
comprises just two heterogeneous series, then a dual Y-axis chart can be used for
representing the data. However, if the data is made up of more than two
heterogeneous series, then each series must be plotted on a separate chart.
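The rule of thumb described so far can be written down as a small decision function. This is only a sketch under stated assumptions: the function name and the unit labels are hypothetical, the unit of each series is supplied by hand, and data mixing several series of one unit with series of another is simply sent to separate charts, a case the text does not spell out.

```python
def plotting_layout(series_units):
    """Suggest how to plot series, given a mapping of series name -> base unit."""
    distinct_units = set(series_units.values())
    if len(distinct_units) == 1:
        # Homogenous series share one numerical axis.
        return "common numerical axis"
    if len(series_units) == 2:
        # Exactly two heterogeneous series.
        return "dual Y-axis chart"
    # More than two series with differing units (assumption: no partial grouping).
    return "separate charts"


sales_and_market = {"Annual Sales": "$", "Estimated Size of the Market": "$"}
revenue_and_profit = {"Revenue": "$", "Profit %": "%"}
three_series = {"Units Sold": "units", "Revenue": "$", "Profit %": "%"}

print(plotting_layout(sales_and_market))
print(plotting_layout(revenue_and_profit))
print(plotting_layout(three_series))
```

The function looks only at units; whether the ranges of two heterogeneous series are compatible enough for a dual Y-axis chart still has to be judged separately.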
The dual Y-axis chart is extremely convenient for a comparative analysis of two
heterogeneous series. However, it must be avoided where the ranges of the two data
series differ significantly, because comparison becomes difficult when the scales
of the two numerical axes are out of context with each other. The Revenue vs Profit
chart is an ideal example of a situation where the use of a dual Y-axis chart is
inappropriate. In situations such as these, it is more prudent to plot the two
series on different charts.
Depending on the nature of the data series (i.e. whether it is homogenous or
heterogeneous), one can plot various forms of charts to facilitate different types
of analysis. Generally, homogenous data is subjected to comparison, composition and
transition analysis, while heterogeneous data is mostly subjected to correlation
analysis.
There are no rules or principles for determining which analysis is right for the
given data. The selection must be based on a conceptual understanding of the data,
as illustrated by the following two examples:
India – Rice Production (plotted as a line chart)
Year  Tons
2006  800,000,000
2007  840,000,000
2008  850,000,000
2009  862,000,000
2010  910,000,000
Annual Rice Production (plotted as a bar chart)
Country    Tons
China      900,000,000
India      700,000,000
Japan      20,000,000
Indonesia  300,000,000
Thailand   400,000,000
In theory, one can modify the above examples by substituting a bar chart for the
line chart and vice versa. However, doing so will either alter the context of the
analysis or render the visualization pointless. In the former example, concerning
the yearly output of rice in India, the line chart enables the audience to see how
rice production changes over time. If this data is plotted on a bar chart, people
will tend to look at it from a different perspective because, unlike the line plot,
which indicates transition, the bar plot fosters comparison. The latter example
uses a bar chart to compare the annual rice yield of various countries. Here there
is no scope for switching the mode of analysis, because the data in question can
only be subjected to comparison analysis. Hence, substituting another type of chart
for the bar chart will only result in something pointless.
The choice of an appropriate mode of visual analysis is influenced by the meaning,
purpose and scope of the data. So, the task of converting data into charts must
begin with a close examination of the data. This helps in determining the best mode
of analysis for each data series, and based on this knowledge one can select
appropriate chart types for the visual representation of the individual series that
constitute the data.
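Once the mode of analysis has been decided for a series, picking candidate chart types is a plain lookup. The sketch below merely collects the pairings made in this tutorial into a Python dictionary; the names are illustrative, and the mode of analysis itself is still chosen by the analyst, not computed.

```python
# Chart types this tutorial associates with each mode of analysis.
CHARTS_BY_ANALYSIS = {
    "comparison": ("bar chart",),
    "transition": ("line chart", "area chart"),
    "composition": ("pie chart", "stacked chart"),
    "implied correlation": ("separate side-by-side charts", "dual Y-axis chart"),
    "direct correlation": ("scatter chart", "bubble chart"),
}


def charts_for(analysis):
    """Return the candidate chart types for an analyst-chosen mode of analysis."""
    return CHARTS_BY_ANALYSIS[analysis.lower()]


# The mode of analysis is decided by examining the data, as in the rice examples.
series_plan = {
    "India rice production by year": "transition",
    "Annual rice production by country": "comparison",
}
for series, analysis in series_plan.items():
    print(f"{series}: {analysis} -> {', '.join(charts_for(analysis))}")
```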
Charting homogenous data
Homogenous data can be subjected to three basic forms of analysis:
• Comparison
• Transition
• Composition
Let’s take a closer look at each one of them.
Comparison
Comparison analysis brings out the anomalies of data. It helps in spotting the
highs and lows of the data, and also brings out hidden exceptions. The bar chart is
the most appropriate form of data visualization for comparison analysis, as it
effectively portrays the differences between plots. The following chart presents a
comparison of equities based on their average prices.
The chart makes it easier to determine which stocks are doing well, but in this
case the focus is on a single data series. If more than one series is plotted on a
bar chart, each category will have multiple sibling plots, which enables parallel
comparison. Here’s an example:
Here the chart displays three data series, so each category (i.e. each country in
this case) holds three data elements, one drawn from each series. A multi-series
bar chart supports two forms of comparison:
• Linear comparison: comparison between data elements of the same series, e.g. rice
  yield of India vs rice yield of China
• Parallel comparison: comparison between data elements of the same category, e.g.
  rice yield of India vs maize yield of India
Bar charts are available in two variations: they can have either horizontal bars or
vertical bars. Vertical bars are more common, and are ideally used when
representing a small set of data, since in a large data set the category names tend
to get squeezed as a result of limited horizontal space.
Horizontal bar charts are more appropriate for large data sets, as category names
can conveniently be placed in line with the bars. However, the horizontal bar chart
is not very conducive to parallel comparison, which makes it unsuitable for
multi-series data.
Transition
Transition analysis is applied to sequential data. It helps in understanding the
nature of the change undergone by a sequenced data series.
Examples of sequential data
Line and area charts are used for transition analysis. These charts connect the
data elements serially, enabling the audience to comment on the nature of change or
transition.
Seeing the above chart, one can make the following comments about the data:
• Profit has grown steadily during 2000 – 2009, without any instance of decline
• Growth rate has been constant throughout, except for a short interval between
  2005 – 2007 when profit grew at an increased rate
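The comments about the profit chart can be cross-checked from the Year and Profit figures listed in this section. The following is a minimal Python sketch with illustrative variable names; it computes the year-on-year change that the slope of the line represents.

```python
# Annual profit in dollars, from the chronological Year / Profit table.
profit = {
    2000: 150_000, 2001: 176_000, 2002: 182_000, 2003: 195_000, 2004: 220_000,
    2005: 235_000, 2006: 292_000, 2007: 315_000, 2008: 327_000, 2009: 362_000,
}
years = sorted(profit)

# Change from each year to the next: what the slope of the line chart shows.
yearly_change = {
    year: profit[year] - profit[previous]
    for previous, year in zip(years, years[1:])
}

# "Without any instance of decline": every change is positive.
never_declines = all(change > 0 for change in yearly_change.values())

# The steepest segment of the line ends in this year.
steepest_year = max(yearly_change, key=yearly_change.get)

for year, change in yearly_change.items():
    print(f"{year - 1} -> {year}: +${change:,}")
```

On these figures profit never declines, and the largest single-year rise ($57,000) falls between 2005 and 2006, inside the interval singled out above.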
Here’s another example:
Chronological
Year  Profit
2000  $150,000
2001  $176,000
2002  $182,000
2003  $195,000
2004  $220,000
2005  $235,000
2006  $292,000
2007  $315,000
2008  $327,000
2009  $362,000
Incremental
Quantity  Cost Per Unit
10,000    $9
20,000    $9.6
30,000    $10.2
40,000    $9.8
50,000    $9.5
60,000    $9.5
70,000    $9.5
This chart enables the audience to compare the transition of two data series. The
following points are revealed by a brief examination of the chart:
• In the case of Product A, the cost of production increases up to 30,000 units and
  then begins to decline, finally attaining consistency at 50,000 units and above
• In the case of Product B, the cost of production remains constant up to 40,000
  units, after which it increases marginally at 50,000 units and remains consistent
  thereafter
Line and area charts are similar in most aspects and can be used interchangeably.
However, the use of the area chart must be restricted to single-series data, since
with multi-series data parts of the chart may get obscured by overlapping data
series.
Composition
Composition analysis helps in understanding the proportions within data. The pie
chart is the simplest form of composition chart, and is used for studying the
composition of single-series data. In a pie chart, each data element is represented
as a percentage of the sum of all data elements.
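The percentage behind each pie slice is the element divided by the sum of all elements. A minimal Python sketch using the Business Capital investment figures from this section (the variable names are illustrative):

```python
# Investment amounts in dollars, from the Business Capital example.
investments = {
    "Resort": 1_500_000,
    "Construction Co.": 3_500_000,
    "NGO": 500_000,
    "Shipping Co.": 1_000_000,
    "Furniture Manufacturing Unit": 1_500_000,
}

total_investment = sum(investments.values())

# Each slice of the pie: the element as a percentage of the total.
slice_percent = {
    name: 100 * amount / total_investment
    for name, amount in investments.items()
}

for name, percent in slice_percent.items():
    print(f"{name:30s} {percent:6.2f}%")
```

The slices always add up to 100%, which is why a pie chart suits a single series whose elements form a meaningful whole.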
Here are some examples:
Business Capital
Investment
Resort                        $1,500,000
Construction Co.              $3,500,000
NGO                           $500,000
Shipping Co.                  $1,000,000
Furniture Manufacturing Unit  $1,500,000
Total Investment              $8,000,000
Cost          Amount
Raw Material  $150,000
Energy        $80,000
Labor         $200,000
Warehousing   $40,000
Total Cost    $470,000
[Figures: pie chart "Business Capital"; a variation of the pie chart called the doughnut chart]
Composition analysis of multi-series data is carried out using stacked charts.
Unlike the pie chart, which shows linear composition (i.e. composition of a
series), stacked charts help in visualizing parallel composition (i.e. composition
of a category). Stacked charts, however, are not meant purely for composition
analysis, because they combine composition analysis with either comparison or
transition analysis.
The stacked bar chart enables comparison of composite data. In this case, it allows
the audience to compare the unit cost of various products by presenting it as the
sum of material cost and processing cost. Here the emphasis is on comparison of
absolute values, which makes it difficult to compare the proportions of the
categories. So if comparison of categories in terms of their proportions is
essential, the data ought to be plotted on a 100% stacked chart.
In a 100% stacked chart, the length of a stacked bar is independent of the absolute
sum of the category, and its stacks indicate the proportions of the category in
terms of percentage.
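The difference between the two stacked charts is a matter of normalization. As a minimal Python sketch (variable names are illustrative), the product cost figures from this section can be expressed both as absolute stacks and as 100% stacks:

```python
# (material cost, processing cost) per unit in dollars, from the product table.
unit_cost = {
    "Product A": (4, 8),
    "Product B": (6, 3),
    "Product C": (12, 8),
    "Product D": (5, 7),
    "Product E": (5, 3),
}

# Stacked bar chart: bar length is the absolute sum of the category.
bar_length = {name: material + processing
              for name, (material, processing) in unit_cost.items()}

# 100% stacked chart: every bar has the same length; stacks are percentages.
percent_stack = {
    name: (100 * material / (material + processing),
           100 * processing / (material + processing))
    for name, (material, processing) in unit_cost.items()
}

for name in unit_cost:
    material_pct, processing_pct = percent_stack[name]
    print(f"{name}: total ${bar_length[name]}, "
          f"material {material_pct:.1f}%, processing {processing_pct:.1f}%")
```

Products A and D have the same unit cost ($12), so their stacked bars are equally long, yet material accounts for about 33% of A's cost and about 42% of D's; only the 100% stacked form makes that difference easy to compare.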
Another form of stacked chart is the stacked area chart, which helps in analyzing
the transition of composite data.
Product    Material Cost  Processing Cost
Product A  $4             $8
Product B  $6             $3
Product C  $12            $8
Product D  $5             $7
Product E  $5             $3
Parallel Composition
In this example, the stacked area chart shows the transition of operational cost,
which is expressed as the sum of fixed and variable cost. The combined presentation
of transition and composition leads to the following discoveries:
• The operational cost has declined gradually over 2000 – 2006
• There has been a gradual increase in fixed cost, accompanied by a gradual decline
  in variable cost
The stacked area chart is very effective in relating the overall trend to the
transition patterns of the constituent data-series.
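The two discoveries can be reproduced from the fixed and variable cost figures given in this section. This is a minimal Python sketch with illustrative variable names; the stacked total corresponds to the upper edge of the area chart.

```python
# (fixed cost, variable cost) in dollars per year, from the cost table.
costs = {
    2000: (26_000, 85_000),
    2001: (28_000, 79_000),
    2002: (32_000, 72_000),
    2003: (36_000, 65_000),
    2004: (38_000, 62_000),
    2005: (43_000, 59_000),
    2006: (47_000, 51_000),
}
years = sorted(costs)

# Upper edge of the stacked area: operational cost = fixed + variable.
operational = {year: fixed + variable for year, (fixed, variable) in costs.items()}
net_change = operational[2006] - operational[2000]

# Transition of each constituent series.
fixed_rising = all(costs[a][0] < costs[b][0] for a, b in zip(years, years[1:]))
variable_falling = all(costs[a][1] > costs[b][1] for a, b in zip(years, years[1:]))

for year in years:
    print(f"{year}: operational cost ${operational[year]:,}")
```

Over the period the operational cost falls from $111,000 to $98,000, while fixed cost rises and variable cost falls in every single year.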
Year  Fixed Cost  Variable Cost
2000  $26,000     $85,000
2001  $28,000     $79,000
2002  $32,000     $72,000
2003  $36,000     $65,000
2004  $38,000     $62,000
2005  $43,000     $59,000
2006  $47,000     $51,000
Charting heterogeneous data
Analysis of heterogeneous data helps in determining the correlation between two or
more data-series. Correlation refers to the degree of connectedness between
data-series, and it gives an idea of the extent to which one data-series depends
upon the other. There are two ways of analyzing heterogeneous data:
• Implied correlation
• Direct correlation
Implied Correlation
Implied correlation is a technique of visual analysis in which correlation is
determined indirectly, by comparing independent plots of two or more heterogeneous
data-series.
The concept is best understood through the following example:
Year  Units Sold  Revenue     Profit %
2000  32,000      $700,000    17.5%
2001  37,000      $900,000    18%
2002  46,000      $1,400,000  19%
2003  50,000      $1,600,000  19.5%
2004  57,000      $1,900,000  21%
Here, the three heterogeneous data-series are graphed separately and placed side
by side to enable comparison of their plot patterns. This gives a crude idea of the
nature of the correlation that exists between the three data-series. In this case
all three charts have rising curves, which suggests a strong degree of correlation
between the data series.
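The visual impression of three rising curves can be cross-checked numerically. The sketch below uses Pearson's correlation coefficient, which is a technique added here for illustration and not one this tutorial relies on; the helper function and variable names are likewise illustrative. The data comes from the Units Sold, Revenue and Profit % table.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# The three heterogeneous series for 2000 to 2004.
units_sold = [32_000, 37_000, 46_000, 50_000, 57_000]
revenue = [700_000, 900_000, 1_400_000, 1_600_000, 1_900_000]
profit_percent = [17.5, 18, 19, 19.5, 21]

r_units_revenue = pearson(units_sold, revenue)
r_units_profit = pearson(units_sold, profit_percent)
r_revenue_profit = pearson(revenue, profit_percent)

print(f"units sold vs revenue:  {r_units_revenue:.3f}")
print(f"units sold vs profit %: {r_units_profit:.3f}")
print(f"revenue vs profit %:    {r_revenue_profit:.3f}")
```

All three coefficients come out close to +1, which agrees with the strong correlation implied by the side-by-side charts. With only five data points, such a figure should be read as a rough indication.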
If the data is bivariate (i.e. comprising only two heterogeneous series), then a
dual Y-axis chart may be used for implied correlation. Here are some examples:
Some caution needs to be observed while using the dual Y-axis chart for
representing bivariate data. As explained earlier in Understanding the data, the
dual Y-axis chart is inappropriate where the ranges of the two data series differ
significantly. So, the dual Y-axis chart should only be used when the range
difference between the two data series is nominal.
Direct Correlation
Direct correlation is a technique of visual analysis that is applied to
uncategorized heterogeneous data. It helps in determining the exact level and
nature of dependency between two or more heterogeneous series.
XY plot charts are used for studying direct correlation of bivariate and trivariate
data. In these charts, the categorical axis is replaced by an extra numerical axis,
so the position of a plot corresponds to two numerical values.
The scatter chart is a simple XY plot chart used for studying bivariate data.
Here’s an example:
Lifestyle Survey
Age  Supermarket visits per month
16   3
18   3
23   8
20   2
24   1
19   3
16   7
22   4
Example of uncategorized data
[Figure: scatter chart, "Age vs Frequency of Supermarket visits"; X-axis: Age (10 to 30), Y-axis: Visits per month (0 to 10)]
The best way to analyze a scatter chart is to draw a line from left to right in
such a way that it passes as close as possible to the maximum number of points.
Here, the downward slope of the line suggests an inverse relationship, meaning that
younger people tend to visit the supermarket more often than older ones.
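One common way to draw such a line is an ordinary least-squares fit, a technique chosen here for illustration; the tutorial itself only asks for a line drawn by eye. The sketch below fits a line to the eight Lifestyle Survey rows of age and supermarket visits per month, which may be only a sample of the points plotted in the chart.

```python
# Age and supermarket visits per month, from the Lifestyle Survey rows.
ages = [16, 18, 23, 20, 24, 19, 16, 22]
visits = [3, 3, 8, 2, 1, 3, 7, 4]

n = len(ages)
mean_age = sum(ages) / n
mean_visits = sum(visits) / n

# Ordinary least-squares fit of: visits = intercept + slope * age
slope = (
    sum((a - mean_age) * (v - mean_visits) for a, v in zip(ages, visits))
    / sum((a - mean_age) ** 2 for a in ages)
)
intercept = mean_visits - slope * mean_age

print(f"visits per month = {intercept:.2f} + ({slope:.3f}) * age")
```

On these eight rows the fitted slope is negative but small (roughly -0.08 visits per month for each additional year of age), so the downward direction matches the reading above, while so small a sample gives only a weak indication of the trend.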
Isolated clusters of data points should also be taken into consideration, as they
indicate trends and tendencies.
[Figure: scatter chart, "No of credit cards per person"; X-axis: age (16 to 40), Y-axis: number of credit cards (0 to 6)]
In this chart, the clusters highlight the following trends:
• People in the age group 25 – 28 own up to 4 credit cards
• People in the age group 28 – 31 own up to 5 credit cards
• People in the age group 37 – 40 own up to 3 credit cards
The other form of XY plot chart is the bubble chart, which is used for analyzing
trivariate data.
Bubble charts are similar to scatter charts, except that in bubble charts the data
points are of variable sizes and appear like bubbles. Here’s an example:
Lifestyle Survey
Age  Supermarket visits per month  Avg expenditure per visit
16   3                             $10
18   3                             $8
23   8                             $50
20   2                             $15
24   1                             $50
19   3                             $9
16   7                             $5
22   4                             $6
Example of trivariate data
[Figure: bubble chart of age vs supermarket visits per month; bubble size indicates average expenditure per visit]
In a bubble chart, the position of a bubble indicates the values of two parameters,
while its size indicates the value of a third parameter. The above chart helps in
correlating age and frequency of supermarket visits with average expenditure per
visit, as indicated by the size of the bubbles. A brief examination of the chart
leads to the discovery that average expenditure per visit rises with age but falls
as the frequency of visits rises. The interpretation is as follows:
• Those who visit the supermarket very often tend to spend less money per visit
  than those who drop in less frequently
• The older the person, the greater the amount spent on each visit
Even if the bubbles are scattered, a lot can be learnt about the data just by
dividing the chart into quadrants. Here’s an example:
In this investment portfolio chart, the quadrants enable segmentation of
investments into categories: good, medium, stagnant and bad.
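Dividing a chart into quadrants amounts to comparing each point with one dividing value per axis. Since the investment portfolio figures are not reproduced here, the sketch below applies the idea to the trivariate Lifestyle Survey rows instead; the dividing values (age 20, 4 visits per month) and the quadrant names are assumptions made purely for illustration.

```python
# (age, visits per month, average expenditure per visit in $) from the survey.
survey = [
    (16, 3, 10), (18, 3, 8), (23, 8, 50), (20, 2, 15),
    (24, 1, 50), (19, 3, 9), (16, 7, 5), (22, 4, 6),
]

AGE_SPLIT = 20     # hypothetical dividing line on the X-axis
VISITS_SPLIT = 4   # hypothetical dividing line on the Y-axis


def quadrant(age, visits):
    """Name the quadrant a bubble falls into."""
    horizontal = "right" if age >= AGE_SPLIT else "left"
    vertical = "upper" if visits >= VISITS_SPLIT else "lower"
    return f"{vertical}-{horizontal}"


# Group the bubble sizes (expenditure) by quadrant.
spend_by_quadrant = {}
for age, visits, spend in survey:
    spend_by_quadrant.setdefault(quadrant(age, visits), []).append(spend)

average_spend = {name: sum(values) / len(values)
                 for name, values in spend_by_quadrant.items()}

for name, value in sorted(average_spend.items()):
    print(f"{name}: {len(spend_by_quadrant[name])} bubbles, average ${value:.2f}")
```

Each quadrant can then be given a label suited to the data, in the same way that the portfolio chart labels its quadrants good, medium, stagnant and bad.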