Chart it Right - A beginner's guide to data visualization | Huzefa Johar



Table of Contents

Abstract
Why do we need charts?
Understanding the data
Charting homogenous data
    Comparison
    Transition
    Composition
Charting heterogeneous data
    Implied Correlation
    Direct Correlation


Abstract

Since most business-related actions and decisions are based on the analysis of data, professionals are often required to create charts to help their companies select the best course of action. Charting is therefore an essential skill, and it finds application in all areas of business, including marketing, finance, human resource management and R&D.

Typical school education provides only elementary training in data visualization, so people are often hard pressed when they have to put their high school charting skills into practice. The charting concepts learnt in school are inadequate in today's technological environment, where data is readily available and software tools offer myriad forms of visual data analysis. To convey data effectively, it is therefore essential to understand the implications of the various data visualization methods.

This tutorial attempts to bridge the gap between the charting principles taught at school and the real-world application of data visualization. It is intended for anyone who aspires to become proficient in using charts for exploratory analysis of data. It has been written for the general public and may be of interest to students, researchers, executives and even designers of analytics systems.


Why do we need charts?

It is mildly surprising that data visualization methods are such a recent development, given that people have had to deal with data since the dawn of civilization. It was only in the 17th century that Descartes devised the coordinate system, which paved the way for data visualization techniques. Naturally, there must have been other means of making sense of data before then, for how could humanity have progressed without the insights gained by interpreting data?

Charts have been around for only a couple of centuries, so before the advent of data visualization people used tables to study and analyze data. The humble table has served countless generations over many centuries, and it continues to be of immense value today. In fact, tables are such an integral part of our lives that we hardly notice that commonplace objects such as calendars, train schedules and bank statements are all forms of the humble table.

By definition, data is a structured compilation of related pieces of information. Tabulation helps in formulating data because it imparts structure to an otherwise meaningless bundle of isolated facts. Data and tables are therefore two sides of the same coin. Data analysis is all about extracting information from data, and given the structure of a data table, it is easy to pull out the required information by simple referencing: finding the cell where the relevant row and column meet.

January 2011

Su   Mo   Tu   We   Th   Fr   Sa
                               1
 2    3    4    5    6    7    8
 9   10   11   12   13   14   15
16   17   18   19   20   21   22
23   24   25   26   27   28   29
30   31

Extracting information from table (callout on the calendar: "What day is Jan 14th?")

Additionally, since data is made up of related pieces of information, one can combine several closely linked entries to extract what is called derived information.

Date         Narration           Withdrawals   Deposits   Closing Balance
10/22/2010   ABC Telecom         $60                      $1,340
10/30/2010   Smith & Co                        $325       $1,665
11/12/2010   Mc Arthur Ltd                     $550       $2,215
11/22/2010   ABC Telecom         $90                      $2,125
12/05/2010   Clarkson & Smith    $225                     $1,900
12/17/2010   Ronald Ballantine                 $475       $2,375
12/30/2010   ABC Telecom         $65                      $2,310
Total:                           $440          $1,350

Deriving information from table
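The two table operations described above, referencing and deriving, can be sketched in a few lines of Python. This is only an illustration: the variable names are invented here, and the assignment of each amount to the withdrawals or deposits column is inferred from the closing balances rather than stated in the statement itself.

```python
from datetime import date

# Referencing: answer the calendar callout by looking up a single date.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
jan_14 = DAY_NAMES[date(2011, 1, 14).weekday()]  # weekday(): Monday is 0

# Deriving: combine related rows of the bank statement into new facts.
# Each row is (date, narration, withdrawal, deposit) in dollars. The split
# into withdrawals and deposits is inferred from the closing balances.
statement = [
    ("10/22/2010", "ABC Telecom", 60, 0),
    ("10/30/2010", "Smith & Co", 0, 325),
    ("11/12/2010", "Mc Arthur Ltd", 0, 550),
    ("11/22/2010", "ABC Telecom", 90, 0),
    ("12/05/2010", "Clarkson & Smith", 225, 0),
    ("12/17/2010", "Ronald Ballantine", 0, 475),
    ("12/30/2010", "ABC Telecom", 65, 0),
]
total_withdrawals = sum(row[2] for row in statement)
total_deposits = sum(row[3] for row in statement)
# A derived fact the table does not print: total paid to one payee.
telecom_spend = sum(row[2] for row in statement if row[1] == "ABC Telecom")

print(jan_14)                                           # Friday
print(total_withdrawals, total_deposits, telecom_spend)  # 440 1350 215
```

The totals match the Total row of the statement, while the telecom figure is an example of information that exists only once related rows are combined.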


As the examples show, data in its raw tabular form lends itself well to referencing and computing. However, raw data is hard to grasp as a whole. Just by looking at a table of revenue and cost figures, one cannot easily make out whether revenue is growing or declining, and this is where charts come in. They help in visualizing the data at many different levels. Just by looking at a chart, one can comment on the collective properties of all data elements as well as spot anomalies within the data. In fact, by plotting the data in different ways one can uncover many more facets of it.

Let us look at some examples to see how charts can be used to extract valuable

information from data.

        2007                          2008                          2009
        Revenue   Cost      Profit    Revenue   Cost      Profit    Revenue   Cost      Profit
Q1      35,000    30,000    5,000     38,000    30,000    8,000     44,000    33,000    11,000
Q2      37,000    31,000    6,000     40,000    30,500    9,500     45,700    33,500    12,200
Q3      39,000    31,500    7,500     41,000    31,500    9,500     47,000    35,000    12,000
Q4      41,000    32,000    9,000     42,000    33,000    9,000     48,300    35,700    12,300
Total   152,000   124,500   27,500    161,000   125,000   36,000    185,000   137,200   47,500

The column plot of quarterly revenue reveals the following:

- Each year, the quarterly revenue increases progressively from the first through the fourth quarter.
- Revenue drops steeply in the first quarter of 2008, but then increases gradually, and the final quarter earnings of 2008 overshoot the previous high by a small margin.
- Revenue earned in each quarter of 2009 is significantly greater than the quarterly revenues earned in previous years.
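The three observations can also be checked against the revenue columns of the table. The sketch below is one way to do so; the variable names are illustrative, and it simply restates what the column plot makes visible at a glance.

```python
# Quarterly revenue (Q1..Q4) copied from the table above.
revenue = {
    2007: [35_000, 37_000, 39_000, 41_000],
    2008: [38_000, 40_000, 41_000, 42_000],
    2009: [44_000, 45_700, 47_000, 48_300],
}

# Observation 1: within every year, revenue rises from Q1 through Q4.
rises_each_year = all(
    all(a < b for a, b in zip(quarters, quarters[1:]))
    for quarters in revenue.values()
)

# Observation 2: Q1 2008 falls below Q4 2007, yet Q4 2008 ends above
# the previous high.
q1_2008_change = revenue[2008][0] - revenue[2007][3]
q4_2008_margin = revenue[2008][3] - max(revenue[2007])

# Observation 3: even the weakest 2009 quarter beats every earlier quarter.
best_before_2009 = max(revenue[2007] + revenue[2008])
all_2009_higher = min(revenue[2009]) > best_before_2009

print(rises_each_year)                  # True
print(q1_2008_change, q4_2008_margin)   # -3000 1000
print(all_2009_higher)                  # True
```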


A stacked area chart showing the break-up of annual revenue yields some very interesting facts:

- The sudden steepening of the upper line indicates an increase in the rate of revenue growth.
- The gradual widening of the pink region from left to right indicates an increase in profit margin.
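Both readings correspond to simple ratios of the annual totals. The following sketch computes them from the revenue and profit totals in the table; the names `growth` and `margin` are chosen here for illustration.

```python
# Annual (revenue, profit) totals taken from the last row of the table.
annual = {
    2007: (152_000, 27_500),
    2008: (161_000, 36_000),
    2009: (185_000, 47_500),
}
years = sorted(annual)

# Steepening of the upper line: year-on-year revenue growth rate.
growth = {
    later: (annual[later][0] - annual[earlier][0]) / annual[earlier][0]
    for earlier, later in zip(years, years[1:])
}

# Widening of the profit region: profit as a share of revenue.
margin = {year: profit / rev for year, (rev, profit) in annual.items()}

for year in years[1:]:
    print(year, f"revenue growth {growth[year]:.1%}")
for year in years:
    print(year, f"profit margin {margin[year]:.1%}")
```

Revenue growth rises from about 5.9% to about 14.9%, and the profit margin widens from about 18.1% to 22.4% and then 25.7%, which is what the steeper line and the wider region convey visually.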

The examples clearly demonstrate the effectiveness of charts as a tool for exploratory analysis of data. To create worthwhile charts, however, it is essential to understand the basic properties of data, because visualization techniques cannot be applied sensibly without that understanding.


Understanding the data

Transforming data into charts may seem like a straightforward affair, but it actually requires some thought and decision making. The process of charting begins with an evaluation of the data, which helps in choosing the mode of visual analysis that conveys the data most effectively.

When one looks at data with the intention of conveying it through charts, it is imperative to think in terms of categories and series. This is because plotting data onto a chart involves mapping data elements to the categorical axis and the numerical axis of the chart.

Region               Annual Sales     Estimated Size of the Market
North Asia           $20,000,000      $40,000,000
Central Asia         $15,000,000      $35,000,000
South East Asia      $54,000,000      $75,000,000
Equatorial Africa    $7,000,000       $9,000,000
African Peninsula    $15,000,000      $21,000,000
Eastern Europe       $40,000,000      $65,000,000
Western Europe       $80,000,000      $110,000,000
North America        $130,000,000     $170,000,000
South America        $60,000,000      $90,000,000
Australia            $35,000,000      $50,000,000

Original data - yellow represents categorical data and green represents series data

[Chart: Regional Market Share & Sales, with the categorical axis and the numerical axis labelled]

The nature of the data series plays an important role in chart selection. If the data series share the same base unit, they can be plotted against a common numerical axis; otherwise they must be plotted against separate numerical axes. In the case of the Regional Market Share & Sales chart (shown above), the two series, Annual sales and Estimated market size, are similar or homogenous, as both are expressed in a monetary unit.

Now, here is an example of data comprising two dissimilar or heterogeneous series:

Region               Revenue          Profit %
North Asia           $20,000,000      21%
Central Asia         $15,000,000      19%
South East Asia      $54,000,000      20%
Equatorial Africa    $7,000,000       15%
African Peninsula    $15,000,000      16%
Eastern Europe       $40,000,000      21%
Western Europe       $80,000,000      23%
North America        $130,000,000     25%
South America        $60,000,000      19%
Australia            $35,000,000      20%

Here the two series are heterogeneous, because revenue is expressed in a monetary unit (i.e. $), while profit is expressed as a percentage. If the data comprises just two heterogeneous series, a dual Y-axis chart can be used to represent it. However, if the data is made up of more than two heterogeneous series, each series must be plotted on a separate chart.
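The axis rule described in this section can be written down as a small decision function. This is a minimal sketch: the function name `suggest_layout`, the unit labels and the returned strings are all invented for illustration, and the handling of three or more series that mix only two units is an assumption, since the text does not cover that case.

```python
def suggest_layout(series_units):
    """Suggest an axis layout from a mapping of series name -> base unit."""
    distinct_units = set(series_units.values())
    if len(distinct_units) == 1:
        # Homogenous series share one numerical axis.
        return "common numerical axis"
    if len(series_units) == 2:
        # Exactly two heterogeneous series can share a dual Y-axis chart.
        return "dual Y-axis chart"
    # Assumption: any larger mix of units gets one chart per series.
    return "separate chart per series"


print(suggest_layout({"Annual Sales": "$", "Estimated Market Size": "$"}))
print(suggest_layout({"Revenue": "$", "Profit": "%"}))
print(suggest_layout({"Units Sold": "units", "Revenue": "$", "Profit": "%"}))
```

The three calls correspond to the homogenous sales table, the revenue and profit table, and a three-series case with three different units.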


The dual Y-axis chart is extremely convenient for comparative analysis of two heterogeneous series. However, it must be avoided when the ranges of the two series differ significantly, because comparison becomes difficult when the scales of the two numerical axes are out of proportion with each other. The Revenue vs Profit chart shown above is an ideal example of a situation where a dual Y-axis chart is inappropriate. In such situations, it is more prudent to plot the two series on separate charts.
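One rough way to quantify "differ significantly" is to compare how widely each series spreads, measured as the ratio of its largest to its smallest value. The measure and the cut-off `MAX_MISMATCH` below are assumptions made for this sketch, not rules from the tutorial.

```python
# Revenue ($) and profit (%) by region, copied from the table above.
revenue = [20_000_000, 15_000_000, 54_000_000, 7_000_000, 15_000_000,
           40_000_000, 80_000_000, 130_000_000, 60_000_000, 35_000_000]
profit_pct = [21, 19, 20, 15, 16, 21, 23, 25, 19, 20]


def spread(values):
    """Ratio of the largest to the smallest value (values must be positive)."""
    return max(values) / min(values)


revenue_spread = spread(revenue)      # 130M / 7M, roughly 18.6
profit_spread = spread(profit_pct)    # 25 / 15, roughly 1.7

# Hypothetical cut-off: how many times wider one series may spread than
# the other before a dual Y-axis chart is considered a poor fit.
MAX_MISMATCH = 3
mismatch = max(revenue_spread, profit_spread) / min(revenue_spread, profit_spread)
dual_axis_ok = mismatch <= MAX_MISMATCH

print(round(revenue_spread, 2), round(profit_spread, 2), dual_axis_ok)
```

With these figures revenue spreads about eleven times more widely than profit, so under the assumed cut-off the sketch agrees with the text that separate charts are the better choice here.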


Depending on the nature of the data series (i.e. whether they are homogenous or heterogeneous), one can plot various forms of charts to facilitate different types of analysis. Generally, homogenous data is subjected to comparison, composition and transition analysis, while heterogeneous data is mostly subjected to correlation analysis.

There are no hard rules for determining which analysis is right for given data. The selection must be based on a conceptual understanding of the data, as illustrated by the following examples:

India - Rice Production (line chart)

Year    Tons
2006    800,000,000
2007    840,000,000
2008    850,000,000
2009    862,000,000
2010    910,000,000

Annual Rice Production (bar chart)

Country      Tons
China        900,000,000
India        700,000,000
Japan        20,000,000
Indonesia    300,000,000
Thailand     400,000,000

In theory, one can modify the above examples by substituting the line chart with a bar chart and vice versa. However, doing so will either alter the context of the analysis or render the visualization pointless. In the former example, concerning the yearly output of rice in India, the line chart enables the audience to see how rice production changes over time. If this data is plotted on a bar chart, people will tend to look at it from a different perspective, because unlike the line plot, which indicates transition, the bar plot fosters comparison. The latter example uses a bar chart to compare the annual rice yield of various countries. Here there is no scope for switching the mode of analysis, because the data in question can only be subjected to comparison analysis. Substituting the bar chart with another type of chart will only result in something pointless.
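The difference between the two modes of analysis also shows up in the arithmetic one would naturally apply to each table. The sketch below, with illustrative names, treats the yearly figures as a sequence and the country figures as an unordered set of categories.

```python
india_by_year = {2006: 800_000_000, 2007: 840_000_000, 2008: 850_000_000,
                 2009: 862_000_000, 2010: 910_000_000}
by_country = {"China": 900_000_000, "India": 700_000_000,
              "Japan": 20_000_000, "Indonesia": 300_000_000,
              "Thailand": 400_000_000}

# Transition: the order of the years matters, so the natural question is
# how much production changed from one year to the next.
years = sorted(india_by_year)
yearly_change = {
    later: india_by_year[later] - india_by_year[earlier]
    for earlier, later in zip(years, years[1:])
}

# Comparison: countries have no inherent order, so the natural question
# is how they rank against each other.
ranking = sorted(by_country, key=by_country.get, reverse=True)

print(yearly_change)
print(ranking)
```

A year-on-year change is meaningless for the country table, because there is no "next" country, which is why that data can only be subjected to comparison.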

The choice of an appropriate mode of visual analysis is influenced by the meaning, purpose and scope of the data. So the task of converting data into charts must begin with a close examination of the data. This helps in determining the best mode of analysis for each data series, and based on this knowledge one can select appropriate chart types for each of the series that constitute the data.


Charting homogenous data

Homogenous data can be subjected to three basic forms of analysis:

- Comparison
- Transition
- Composition

Let's take a closer look at each one of them.

Comparison

Comparison analysis brings out the anomalies in data. It helps in spotting the highs and lows, and also brings out hidden exceptions. The bar chart is the most appropriate form of visualization for comparison analysis, as it effectively portrays the differences between plots. The following chart presents a comparison of equities based on their average prices.

The chart makes it easier to determine which stocks are doing well, but in this case the focus is on a single data series. If more than one series is plotted on a bar chart, each category will have multiple sibling plots that enable parallel comparison. Here's an example:


Here the chart displays three data series, so each category (i.e. country in this case) holds three data elements, one drawn from each series. A multi-series bar chart supports two forms of comparison:

- Linear comparison: comparison between data elements of the same series, e.g. rice yield of India vs rice yield of China
- Parallel comparison: comparison between data elements of the same category, e.g. rice yield of India vs maize yield of India

Bar charts are available in two variations: they can have either horizontal or vertical bars. Vertical bars are more common and are ideally used for small data sets, because with a large data set the category names tend to get squeezed or skewed as a result of limited horizontal space.

Horizontal bar charts are more appropriate for large data sets, as category names can conveniently be placed in line with the bars. However, the horizontal bar chart is not very conducive to parallel comparison, which makes it unsuitable for multi-series data.
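To see why horizontal bars cope well with category names, here is a minimal text-only sketch of a single-series horizontal bar chart. The function `horizontal_bars` and its `width` parameter are invented for this illustration, and the figures are the rice production values from the earlier comparison example.

```python
def horizontal_bars(data, width=40):
    """Return one text line per category: label first, then a scaled bar."""
    longest_label = max(len(name) for name in data)
    largest = max(data.values())
    lines = []
    for name, value in data.items():
        # Scale every bar against the largest value in the series.
        bar = "#" * round(value / largest * width)
        # The label sits in line with its bar, padded to a common width.
        lines.append(f"{name:<{longest_label}} | {bar}")
    return lines


rice = {"China": 900_000_000, "India": 700_000_000, "Japan": 20_000_000,
        "Indonesia": 300_000_000, "Thailand": 400_000_000}
lines = horizontal_bars(rice, width=18)
print("\n".join(lines))
```

Each label occupies its own row, so adding more categories only makes the chart taller and never forces the names to overlap or rotate.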

Transition

Transition analysis is applied to sequential data. It helps in understanding the nature of the change undergone by a sequenced data series.

Chronological

Year    Profit
2000    $150,000
2001    $176,000
2002    $182,000
2003    $195,000
2004    $220,000
2005    $235,000
2006    $292,000
2007    $315,000
2008    $327,000
2009    $362,000

Incremental

Quantity    Cost Per Unit
10,000      $9
20,000      $9.6
30,000      $10.2
40,000      $9.8
50,000      $9.5
60,000      $9.5
70,000      $9.5

Examples of sequential data

Line and area charts are used for transition analysis. These charts connect the data elements serially, enabling the audience to comment on the nature of the change or transition.

Seeing the above chart, one can make the following comments about the data:

- Profit has grown steadily during 2000-2009, without any instance of decline
- The growth rate has been largely constant throughout, except for a short interval between 2005 and 2007 when profit grew at an increased rate

Here's another example:


This chart enables the audience to compare the transition of two data series. The following points are revealed by a brief examination of the chart:

- In the case of Product A, the cost of production increases up to 30,000 units and then begins to decline, finally becoming consistent at 50,000 units and above
- In the case of Product B, the cost of production remains constant up to 40,000 units, after which it increases marginally at 50,000 units and remains consistent thereafter
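The kind of reading a line chart invites can be reproduced numerically from the two sequential tables. In this sketch the names are illustrative, and the incremental cost table is assumed to be the one behind the Product A line, since its values match the description of Product A.

```python
# Chronological series: yearly profit.
profit = {2000: 150_000, 2001: 176_000, 2002: 182_000, 2003: 195_000,
          2004: 220_000, 2005: 235_000, 2006: 292_000, 2007: 315_000,
          2008: 327_000, 2009: 362_000}
years = sorted(profit)
change = {later: profit[later] - profit[earlier]
          for earlier, later in zip(years, years[1:])}
never_declines = all(delta > 0 for delta in change.values())
steepest_year = max(change, key=change.get)  # year with the largest rise

# Incremental series: cost per unit at each production quantity
# (assumed to be Product A).
unit_cost = {10_000: 9.0, 20_000: 9.6, 30_000: 10.2, 40_000: 9.8,
             50_000: 9.5, 60_000: 9.5, 70_000: 9.5}
quantities = sorted(unit_cost)
peak_quantity = max(unit_cost, key=unit_cost.get)
final_cost = unit_cost[quantities[-1]]
# First quantity from which the cost stays at its final level.
stable_from = next(
    q for q in quantities
    if all(unit_cost[r] == final_cost for r in quantities if r >= q)
)

print(never_declines, steepest_year)   # True 2006
print(peak_quantity, stable_from)      # 30000 50000
```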

Line and area charts are similar in most respects and can be used interchangeably. However, the use of area charts should be restricted to single-series data, because with multi-series data parts of the chart may get obscured by overlapping series.

Composition

Composition analysis helps in understanding the proportions within data. The pie chart is the simplest form of composition chart and is used for studying the composition of single-series data. In a pie chart, each data element is represented as a percentage of the sum of all data elements.

Here are some examples:

Business Capital

Investment
Resort                          $1,500,000
Construction Co.                $3,500,000
NGO                             $500,000
Shipping Co.                    $1,000,000
Furniture Manufacturing Unit    $1,500,000
Total Investment                $8,000,000

Cost            Amount
Raw Material    $150,000
Energy          $80,000
Labor           $200,000
Warehousing     $40,000
Total Cost      $470,000

A variation of the pie chart called the doughnut chart

Composition analysis of multi-series data is carried out using stacked charts. Unlike the pie chart, which shows linear composition (i.e. composition of a series), stacked charts help in visualizing parallel composition (i.e. composition of a category). Stacked charts, however, are not purely composition charts, because they combine composition analysis with either comparison or transition analysis.
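The percentage rule behind the pie chart, where each element is shown as a share of the series total, can be sketched with the Business Capital figures. The variable names are illustrative.

```python
investment = {
    "Resort": 1_500_000,
    "Construction Co.": 3_500_000,
    "NGO": 500_000,
    "Shipping Co.": 1_000_000,
    "Furniture Manufacturing Unit": 1_500_000,
}

total = sum(investment.values())
# Each slice of the pie is the element's percentage of the series total.
share = {name: amount / total * 100 for name, amount in investment.items()}

for name, pct in share.items():
    print(f"{name:<30}{pct:6.2f}%")
```

The shares work out to 18.75%, 43.75%, 6.25%, 12.5% and 18.75%, and they add up to 100%, which is why a pie chart only makes sense for a single series whose elements form a meaningful whole.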


Product      Material Cost    Processing Cost
Product A    $4               $8
Product B    $6               $3
Product C    $12              $8
Product D    $5               $7
Product E    $5               $3

Parallel Composition

The stacked bar chart enables comparison of composite data. In this case, it allows the audience to compare the unit cost of various products by presenting it as the sum of material cost and processing cost. Here the emphasis is on comparison of absolute values, which makes it difficult to compare the proportions within categories. So if comparing categories in terms of their proportions is essential, the data ought to be plotted on a 100% stacked chart.

In a 100% stacked chart, the length of a stacked bar is independent of the absolute sum of the category, and its stacks indicate the proportions within the category as percentages.

Another form of stacked chart is the stacked area chart, which helps in analyzing the transition of composite data.


Year    Fixed Cost    Variable Cost
2000    $26,000       $85,000
2001    $28,000       $79,000
2002    $32,000       $72,000
2003    $36,000       $65,000
2004    $38,000       $62,000
2005    $43,000       $59,000
2006    $47,000       $51,000

In this example, the stacked area chart shows the transition of operational cost, which is expressed as the sum of fixed and variable cost. The combined presentation of transition and composition leads to the following discoveries:

- The operational cost has declined at a gradual rate during 2000-2006
- There has been a gradual increase in fixed cost, accompanied by a gradual decline in variable cost

The stacked area chart is very effective in relating the overall trend to the transition patterns of the constituent data series.
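The stacked-chart readings in this section reduce to two calculations: the sum of the stacks, which sets the bar or area height, and the share of each stack, which is what a 100% stacked chart displays. The sketch below applies both to the product and operational cost tables; all names are illustrative.

```python
# (material cost, processing cost) per unit, in dollars.
unit_cost = {"Product A": (4, 8), "Product B": (6, 3), "Product C": (12, 8),
             "Product D": (5, 7), "Product E": (5, 3)}

# Stacked bar: the bar length is the absolute sum of the stacks.
bar_length = {p: m + c for p, (m, c) in unit_cost.items()}
# 100% stacked bar: every bar has equal length, stacks show proportions.
material_share = {p: m / (m + c) * 100 for p, (m, c) in unit_cost.items()}

# (fixed cost, variable cost) per year, in dollars.
operating = {2000: (26_000, 85_000), 2001: (28_000, 79_000),
             2002: (32_000, 72_000), 2003: (36_000, 65_000),
             2004: (38_000, 62_000), 2005: (43_000, 59_000),
             2006: (47_000, 51_000)}

# Stacked area: the upper boundary traces the total operational cost.
total_cost = {year: f + v for year, (f, v) in operating.items()}
# Composition over time: fixed cost as a share of the total.
fixed_share = {year: f / (f + v) * 100 for year, (f, v) in operating.items()}

print(bar_length)
print({p: round(s, 1) for p, s in material_share.items()})
print(total_cost[2000], total_cost[2006])
print(round(fixed_share[2000], 1), round(fixed_share[2006], 1))
```

Products A and D have the same bar length of $12 but material shares of about 33.3% and 41.7%, a difference that only the 100% stacked chart makes obvious. For the operational cost, the total falls from $111,000 to $98,000 over the period while the fixed share roughly doubles.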


Charting heterogeneous data

Analysis of heterogeneous data helps in determining the correlation between two or more data series. Correlation refers to the degree of connectedness between data series, and it gives an idea of the extent to which one data series depends upon another. There are two ways of analyzing heterogeneous data:

- Implied correlation
- Direct correlation

Implied Correlation

Implied correlation is a technique of visual analysis in which correlation is determined indirectly, by comparing independent plots of two or more heterogeneous data series.

The concept is best understood through the following example:

Year    Units Sold    Revenue       Profit %
2000    32,000        $700,000      17.5%
2001    37,000        $900,000      18%
2002    46,000        $1,400,000    19%
2003    50,000        $1,600,000    19.5%
2004    57,000        $1,900,000    21%

Here, the three heterogeneous data series are graphed separately and placed side by side to enable comparison of their plot patterns. This gives a crude idea of the nature of the correlation that exists between the three data series. In this case all three charts have rising curves, which suggests a strong degree of correlation between the data series.
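The visual impression of three curves rising together can be cross-checked with a number. The sketch below uses the Pearson correlation coefficient, a standard statistical measure that the tutorial itself does not use; it is brought in here only as a numeric complement to the side-by-side charts, and the function name `pearson` is chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# The three series from the table above (2000 to 2004).
units_sold = [32_000, 37_000, 46_000, 50_000, 57_000]
revenue = [700_000, 900_000, 1_400_000, 1_600_000, 1_900_000]
profit_pct = [17.5, 18.0, 19.0, 19.5, 21.0]

r_units_revenue = pearson(units_sold, revenue)
r_units_profit = pearson(units_sold, profit_pct)
r_revenue_profit = pearson(revenue, profit_pct)

print(f"{r_units_revenue:.3f} {r_units_profit:.3f} {r_revenue_profit:.3f}")
```

All three coefficients come out close to 1, which is consistent with the strong correlation suggested by the rising curves. With only five data points per series, this is a plausibility check rather than proof of dependence.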

If the data is bivariate (i.e. comprising only two heterogeneous series), a dual Y-axis chart may be used for implied correlation. Here are some examples:

Some caution needs to be observed when using the dual Y-axis chart to represent bivariate data. As explained earlier in Understanding the data, the dual Y-axis chart is inappropriate in situations where the ranges of the two data series differ significantly. So the dual Y-axis chart should only be used when the range difference between the two data series is nominal.


Direct Correlation

Direct correlation is a technique of visual analysis that is applied to uncategorized heterogeneous data. It helps in determining the exact level and nature of the dependency between two or more heterogeneous series.

XY plot charts are used for studying direct correlation in bivariate and trivariate data. In these charts, the categorical axis is replaced by an extra numerical axis, so the position of a plot corresponds to two numerical values.

The scatter chart is a simple XY plot chart used for studying bivariate data. Here's an example:

Lifestyle Survey

Age    Supermarket visits per month
16     3
18     3
23     8
20     2
24     1
19     3
16     7
22     4

Example of uncategorized data

[Scatter chart: Age vs Frequency of Supermarket visits; x-axis: Age, y-axis: Visits per month]

The best way to analyze a scatter chart is to draw a line from left to right in such a way that it passes through or close to the maximum number of points.


Here, the downward slope of the line suggests an inverse relationship, meaning that younger people tend to visit the supermarket more often than older ones.
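The line drawn through a scatter chart by eye has a common numerical counterpart, the least-squares line. The sketch below substitutes that standard technique for the hand-drawn line and applies it to the eight rows of the Lifestyle Survey table; the function name `fit_line` is illustrative, and the table may be only a sample of the data behind the chart.

```python
def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


ages = [16, 18, 23, 20, 24, 19, 16, 22]
visits = [3, 3, 8, 2, 1, 3, 7, 4]

slope, intercept = fit_line(ages, visits)
print(f"visits per month = {slope:.3f} * age + {intercept:.2f}")
```

For these eight rows the slope is slightly negative, about -0.08 visits per month for each additional year of age. That agrees in direction with the downward line in the chart, although a sample this small cannot establish the trend on its own.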

Isolated clusters of data points should also be taken into consideration, as they are indicators of trends and tendencies.

[Scatter chart: No of credit cards per person, plotted against age]

In this chart, the clusters highlight the following trends:

- People in the age group 25-28 own up to 4 credit cards
- People in the age group 28-31 own up to 5 credit cards
- People in the age group 37-40 own up to 3 credit cards

The other form of XY plot chart is the bubble chart, which is used for analyzing trivariate data.


Bubble charts are similar to scatter charts, except that in bubble charts the data points are of variable size and appear like bubbles. Here's an example:

Lifestyle Survey

Age    Supermarket visits per month    Avg expenditure per visit
16     3                               $10
18     3                               $8
23     8                               $50
20     2                               $15
24     1                               $50
19     3                               $9
16     7                               $5
22     4                               $6

Example of trivariate data

[Bubble chart: bubble size indicates average expenditure per visit]

In a bubble chart, the position of a bubble indicates the values of two parameters, while its size indicates the value of a third parameter. The above chart helps in correlating age and frequency of supermarket visits to average expenditure per visit, as indicated by the size of the bubbles. A brief examination of the chart leads to the discovery that average expenditure per visit increases with age but decreases with frequency of visits. The interpretation is as follows:

- Those who visit the supermarket very often tend to spend less money per visit than those who drop in less frequently
- The older the person, the greater the amount spent on each visit

Even if the bubbles are scattered, a lot can be learnt about the data just by dividing the chart into quadrants. Here's an example:


In this investment portfolio chart, the quadrants enable segmentation of

investments into categories: good, medium, stagnant and bad.
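The quadrant technique can be sketched on data that is available in this tutorial. Since the figures behind the investment portfolio chart are not given, the example below applies the same idea to the trivariate Lifestyle Survey table instead. The dividing lines `AGE_SPLIT` and `VISITS_SPLIT` are hypothetical values picked for illustration, as are the function and quadrant names.

```python
# (age, supermarket visits per month, average expenditure per visit in $)
survey = [(16, 3, 10), (18, 3, 8), (23, 8, 50), (20, 2, 15),
          (24, 1, 50), (19, 3, 9), (16, 7, 5), (22, 4, 6)]

# Hypothetical dividing lines for the two numerical axes.
AGE_SPLIT = 20
VISITS_SPLIT = 4


def quadrant(x, y, x_split, y_split):
    """Name the quadrant a point falls into, e.g. 'upper-left'."""
    horizontal = "right" if x >= x_split else "left"
    vertical = "upper" if y >= y_split else "lower"
    return f"{vertical}-{horizontal}"


# Collect the bubble sizes (expenditure) that land in each quadrant.
groups = {}
for age, visits, spend in survey:
    name = quadrant(age, visits, AGE_SPLIT, VISITS_SPLIT)
    groups.setdefault(name, []).append(spend)

average_spend = {name: sum(vals) / len(vals) for name, vals in groups.items()}
for name in sorted(average_spend):
    print(f"{name:<12} {len(groups[name])} bubbles, "
          f"average ${average_spend[name]:.2f}")
```

Once every bubble is assigned to a quadrant, each quadrant can be given a descriptive label and summarised separately, in the same way the portfolio chart labels its quadrants good, medium, stagnant and bad.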