
Management Research: Applying the Principles © 2015 Susan Rose, Nigel Spinks & Ana Isabel Canhoto

Data exploration with Microsoft Excel: univariate analysis

Contents

1 Introduction

2 Exploring a variable’s frequency distribution

3 Calculating measures of central tendency

4 Calculating measures of dispersion (spread)

5 Exploring the shape of a variable’s distribution

6 Generating summary statistics

1 Introduction

This guide covers the use of Microsoft Excel (hereafter: Excel) for univariate data

exploration. It shows how techniques discussed in Chapter 13 can be applied in Excel. Please

refer to Chapter 13 for more details on the specific techniques and their interpretation; the

focus here is on how to carry them out in Excel. It covers five topics:

1. Exploring a variable’s frequency distribution

2. Calculating measures of central tendency

3. Calculating measures of dispersion (spread)

4. Exploring the shape of a variable’s distribution

5. Generating summary statistics

The guide is not written for a specific version of Excel although it includes screenshots for

Excel 2010. Most of the functionality referred to in the guide is also available in earlier and

later versions, although the user interface has changed somewhat.

The guide assumes that you have entered your data and prepared it for analysis as described

in the guide Introduction to using Microsoft Excel for quantitative data analysis. It also

assumes that you are familiar with basic Excel functionality, including creating and editing

charts (for information on how to use functions and the Data Analysis ToolPak see

Introduction to using Microsoft Excel for quantitative data analysis).


2 Exploring a variable’s frequency distribution

As explained in Chapter 13, there are two main ways of exploring a variable’s frequency

distribution: frequency tables and graphical displays.

2.1 Creating frequency tables in Excel using pivot tables

Pivot tables are the most efficient way of creating frequency tables in Excel. They can be

extended to more complex analysis such as contingency tables and can be used as the basis

for graphical outputs. They are widely used in business and management, for example for

financial analysis, so you may already be familiar with how they work. If you are new to

pivot tables, it is worth taking some time to learn how to use them as they are a very flexible

analysis tool which makes them ideally suited for data exploration.

We will demonstrate their use in creating frequency tables for a simple dataset about

customers’ shopping habits (Figure 1) (available on the website as a downloadable file

customer satisfaction.xlsx). One of the nominal variables in the dataset records the store

location where the customer shops (north, central or south). Our aim is to create a simple

frequency table showing how many customers shop in each location and what per cent that

represents of the total. We will also add a cumulative per cent column.

Figure 1 – Customer satisfaction dataset

2.1.1 Creating a table of frequency counts

To create a pivot table, carry out the following steps:


1. Check that the data are ready for analysis. Each column requires a unique header,

there should be no blank rows or columns within the dataset and, if nominal variables are left as text,

spelling should be consistent.

2. Click on any cell in the dataset.

3. Select Insert > PivotTable > PivotTable to open up the Create PivotTable dialogue

box (see Figure 2). Note: Excel writes PivotTable as a single word.

Figure 2 – PivotTable dialogue box

4. In the dialogue box, select the table or range you wish to analyse. If you placed the

cursor in a cell in the dataset before opening up the dialogue box, the dataset should

automatically be selected. If not, select it manually.

5. Choose where you want the PivotTable to be placed. The default is New Worksheet

which is generally the easier option.

6. When ready, click OK. This will insert an empty PivotTable report and a PivotTable

Field List into a new worksheet (Figure 3).

Figure 3 – PivotTable report and Field List


7. The PivotTable Field List consists of two parts. The upper part, the field section, lists

all the field names you can add to the PivotTable. In this case it is all the variable

names (column headers) in the dataset. The lower part, the layout section, contains the

Report Filter area, the Column Labels area, the Row Labels area and the Values area.

You populate the PivotTable report by dragging and dropping fields from the field

section into the appropriate area in the layout section (Figure 4).

Figure 4 – PivotTable Field List

8. To create a frequency table for the variable Store location, start by dragging and

dropping the field ‘Store location’ from the field section into the Row Labels area.


This creates a set of row labels in the table, one for each value in the Store location

variable and a grand total.

9. Next, drag and drop the field ‘Store location’ from the field section into the Values

area. This immediately creates a new column ‘Count of Store location’ in the

PivotTable report that gives the frequency of occurrence of each category in the

variable, as well as a count of the grand total. (Note: count is the default value setting

for nominal data in an Excel pivot table; for metric data, the default is to sum the

values in each category.) The resulting table is shown in Figure 5.

Figure 5 – Frequency table of Store location showing counts only (n = 20)
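As a cross-check outside Excel, the count column of the pivot table can be reproduced in a few lines of Python. This is only a minimal sketch: the list store_location is a hypothetical stand-in for the Store location column (only n = 20 and the North count of 7 are taken from this guide; the Central/South split is invented), and frequency_counts is our own helper name.

```python
from collections import Counter

# Hypothetical stand-in for the 'Store location' column (n = 20).
# Only the North count (7) comes from the guide; the rest is invented.
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5


def frequency_counts(values):
    """Return {category: count} in label order, followed by a grand total."""
    counts = dict(sorted(Counter(values).items()))
    counts["Grand Total"] = len(values)
    return counts


counts = frequency_counts(store_location)
for label, count in counts.items():
    print(f"{label:<12}{count:>4}")
```

The rows are sorted alphabetically by label here; a PivotTable report can be re-sorted via the Row Labels filter.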

2.1.2 Adding per cent columns to a frequency table

Now that the basic frequency table has been created, the next step is to add a column showing

the per cent of the total represented by each category. To do this:

1. Drag and drop another copy of the ‘Store location’ field into the Values box in the

PivotTable Field List. This will add another column to the pivot table, called Count of

Store location 2 (Figure 6). (Hint: if your PivotTable Field List has disappeared, click

on the PivotTable report and it will reappear.)


Figure 6 – Adding an additional column

2. Click on the down arrow of the new field in the Values area of the PivotTable Field

List. This opens up a new menu. Choose Value Field Settings to open the Value Field

Settings dialogue box (Figure 7).

Figure 7 – Value Field Settings dialogue box

3. The Value Field Settings dialogue box allows you to manipulate the values displayed

in the pivot table for that field:

a. The Summarise Values By tab allows you to determine the value displayed in

each cell (e.g. count, average, etc.)

b. The Show Values As tab gives further options for displaying the data.

4. Select the Show Values As tab. From the Show Values As drop down box, select % of

Column Total.


5. At this point you can change the default name at the column head by typing your

choice of name in the Custom Name field. (Note: you can also change column

headers by typing directly into the PivotTable report.)

6. If desired, click on the Number Format button to open up Excel’s number format

dialogue box if you wish to change the format of the numbers (e.g. to set the number

of decimal places). (Note: you can also change number formats directly in the

PivotTable report using the standard commands available under the Home tab.)

7. Click OK. The resulting frequency table is shown in Figure 8, with the column header

changed to Per cent of total and the number of decimal places set to 0 for the per cent

column.

Figure 8 – Frequency table of Store location showing counts and per cent (n = 20)
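The per cent column is simply each category count divided by the grand total. A minimal Python sketch of that arithmetic follows; store_location is hypothetical data (only n = 20 and the North count of 7 come from this guide) and percent_of_total is our own helper name.

```python
from collections import Counter

# Hypothetical stand-in for the 'Store location' column (n = 20).
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5


def percent_of_total(values):
    """Return {category: per cent of all observations} in label order."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in sorted(counts.items())}


percents = percent_of_total(store_location)
for label, pct in percents.items():
    # Zero decimal places, matching the formatting choice described above.
    print(f"{label:<10}{pct:>5.0f}%")
```

Because every observation belongs to exactly one category, the per cent values should sum to 100, which is a quick check on any frequency table.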

2.1.3 Adding a cumulative per cent column to a frequency table

To add a cumulative per cent column to your frequency table, repeat steps 1 to 3 above to

create a new column and open the Value Field Settings dialogue box. In that box, select the

Show Values As tab and in the Show Values As drop down box select % Running Total In.

Select ‘Store location’ in the Base field box. As before, you can change the name and set the

number format (Figure 9).

Figure 9 – Value Field Settings for cumulative per cent column


Once complete the pivot table fields can be renamed and reformatted if required. Figure 10

shows the final result. The finished table can be copied and pasted into a word-processing

package for further editing or be used as the basis for generating graphs (charts).

Figure 10 – Frequency table of Store location showing counts, per cent and cumulative per cent (n = 20)
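A cumulative per cent is a running total of the counts, expressed as a per cent of the grand total. The sketch below illustrates this in Python under the same assumptions as before: store_location is hypothetical data and cumulative_percent is our own helper name.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical stand-in for the 'Store location' column (n = 20).
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5


def cumulative_percent(values):
    """Return [(category, cumulative per cent)] with rows in label order."""
    rows = sorted(Counter(values).items())
    total = len(values)
    running = accumulate(count for _, count in rows)  # running total of counts
    return [(label, 100 * cum / total) for (label, _), cum in zip(rows, running)]


cumulative = cumulative_percent(store_location)
for label, pct in cumulative:
    print(f"{label:<10}{pct:>5.0f}%")
```

Note that the running total depends on the order of the rows: re-sorting the categories changes every cumulative value except the last, which is always 100.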

The contents of a PivotTable report can easily be changed by adding, removing or replacing

fields in the Field List. Similarly, you can change how the values are displayed via the Value

Field Settings dialogue box at any time. The down arrow filter on the Row Labels header

(next to Store location in Figure 10) can be used to sort and filter the rows. In the guide Data

exploration with Excel: analysing more than one variable, we will show how pivot tables can

be used to analyse two or more variables.

2.2 Creating frequency tables in Excel using the COUNTIF function

Another, less flexible, way of creating frequency tables is to use Excel’s COUNTIF function.

An example is shown in Figure 11.

Figure 11 – Frequency table created using COUNTIF function


The cells in the Count column are populated using the COUNTIF function. Choose the

destination cell and then select Formulas > More Functions > Statistical > COUNTIF. In the

Function Argument dialogue box (Figure 12), enter the range of the data and the criteria in

the relevant boxes; these tell Excel where to look and what to count. In this case the word

‘North’ is entered in the Criteria box which gives a count of 7, as expected.

Figure 12 – COUNTIF Function Argument dialogue box

Once the individual cells in the count column have been populated, Excel can be used to

calculate the grand total (Hint: use the SUM function) and additional columns calculated for

per cent and cumulative per cent as shown in Figure 11.
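The logic of COUNTIF with a plain text criterion can be sketched in Python as follows. This is an illustration only: countif is our own helper name, the data are hypothetical, and the sketch covers exact text matches only (Excel's COUNTIF also accepts wildcards and comparison operators, which are not reproduced here). Like COUNTIF, the match ignores upper and lower case.

```python
def countif(cells, criterion):
    """Count the cells whose text equals the criterion, ignoring case."""
    target = criterion.casefold()
    return sum(
        1 for cell in cells
        if isinstance(cell, str) and cell.casefold() == target
    )


# Hypothetical stand-in for the 'Store location' column (n = 20).
store_location = ["North"] * 7 + ["Central"] * 8 + ["South"] * 5

north = countif(store_location, "north")
total = sum(countif(store_location, name) for name in ("North", "Central", "South"))
print(north, total)
```

The case-insensitive match is one reason consistent spelling matters more than consistent capitalisation when nominal variables are left as text.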

2.3 Graphical techniques for exploring frequency distributions

Excel’s suite of chart (graph) tools can be used to explore frequency distributions visually. If

your data are already in a suitable format, for example if you have pre-existing frequency

tables or you have created frequency tables using the COUNTIF function, you can generate

suitable graphs via the Insert tab and select an appropriate chart type, such as a bar or pie

chart. Figure 13 shows a pie chart created from the frequency table in Figure 11. It has been

edited using the Chart Tools in Excel to include per cent labels for the slices, a chart legend

and a suitable title (Hint: click on the chart to activate these tools).


Figure 13 – Pie chart created from pre-existing frequency table

(Note: For convenience of presentation and to make it easier to relate the output to the raw

data, we have created the frequency table and the pie chart in the same worksheet as the main

dataset. For large data sets and to avoid overwriting your data, it is usually better to work in a

separate worksheet when creating output of this kind.)

2.3.1 Using pivot charts to display frequency distributions

Pivot charts provide a very useful way of generating graphical displays of frequency

distributions directly from a dataset. They can be generated either from a pivot table that you

have created or directly using the PivotChart command. We will demonstrate the latter using

the customer satisfaction data and the Store location variable.

To create a pivot chart using the PivotChart command, select Insert > PivotChart. This opens

up the Create PivotChart with PivotTable dialogue box. This is similar to the PivotTable

equivalent (Figure 2) so select the data table/range and the location for the output (New

Worksheet is the default). Click OK.

This opens up a blank PivotChart area, along with a blank PivotTable report and Field List

similar to those you have already seen (Figure 14).


Figure 14 – PivotChart area

To populate the chart area, carry out the following steps:

1. In the PivotTable Field List drag and drop a copy of the ‘Store location’ field into the

Axis Fields (Categories) area.

2. Drag and drop a second copy of the ‘Store location’ field into the Values area (as with

the pivot table, this field contains nominal data so the Excel default value setting is

count).

3. A PivotChart in the form of a bar (Excel: column) chart is created along with a

PivotTable report of the data (Figure 15).

4. This chart can now be formatted using the PivotChart tools if needed.

Figure 15 – Bar chart of shopping by store location (n = 20)

If you want to change the type of chart, for example to a pie chart, you can do this via

PivotChart Tools > Design > Change Chart Type > Pie. Select the type of pie chart you want


and click OK. The resulting chart can then be formatted as desired using the PivotChart

Tools. Figure 16 shows a pie chart created in this way after formatting.

Figure 16 – Pie chart of shopping by store location (n = 20)

2.3.2 Creating histograms in Excel

Histograms are a useful way of inspecting the frequency distribution and the shape of the

distribution of metric variables. Excel’s Data Analysis ToolPak contains a function for

generating histograms from your data. To illustrate how this is done, we will create a

histogram for the Satisfaction variable in the customer satisfaction dataset.

1. Select Data > Data Analysis to open the Data Analysis dialogue box. Select

Histogram and click OK. This opens up the Histogram dialogue box (Figure 17).

2. In the Histogram dialogue box enter the desired range in the Input Range box. If you

have included the column header, tick the Labels in first row box. Confirm where you

want the output to go; New Worksheet Ply is the default and is recommended.

3. Select the Chart Output box (leave the others blank). (Note: the histogram function

can also be used to generate Pareto charts and cumulative per cent outputs by

checking the appropriate box but the result will not be a standard histogram.)


Figure 17 – Histogram dialogue box

4. Click OK. The resulting output is shown in Figure 18. It includes both a summary

table and the histogram graph.

Figure 18 – Histogram output

The resulting chart can be edited in Excel as any normal chart. Conventionally there are no

gaps between the bars in histograms. To remove the gaps, right click on a bar and choose

Format Data Series > Series Options and move the Gap Width slider to No Gap. You can also

add an outline to the bars if desired using the Format Data Series tools. A formatted version of

the histogram is shown in Figure 19.


Figure 19 – Formatted histogram

The histogram function automatically selects a bin range for the histogram. In some cases, for

example if the dataset is small, the resulting bin range may not be very informative. It also

groups high values together under the label ‘more’ which makes it harder to spot outliers or

extreme values. Additionally, if you are working with Likert-scale data it is often useful to set

the bin range so that each interval represents one point on the scale (see footnote 1). You can set your own bin

range intervals as follows:

1. Create a new column in your worksheet (Hint: keep it separate from your main

dataset) showing the bin intervals you want to use. Each number entered sets the upper

limit of that interval (inclusive). Give the column an appropriate title. If using Likert-scale

data, set the bin range 1, 2…n (n = maximum value of scale). See Figure 20.

Customer satisfaction is measured on a 7-point scale so we have set the bin range 1,

2…7.

1 Note that we are treating the Likert data as interval data for the purposes of this illustration (see

Chapter 13 for a discussion).


Figure 20 – Creating intervals for a histogram bin range

2. Open up the Histogram dialogue box (Data > Data Analysis > Histogram > OK).

3. In the Histogram dialogue box enter the desired range in the Input Range box.

4. Now select the bin range (i.e. the new column of bin range intervals that you have

created).

5. If you have included the column header, tick the Labels in first row box (Note: both

the data to be analysed and the bin range must have headers). Confirm where you

want the output to go; New Worksheet Ply is the default and is recommended. Select

the Chart Output box (see Figure 21).

Figure 21 – Histogram dialogue box with bin range specified

6. Click OK.


The resulting output is shown in Figure 22. It can now be formatted as required.

Figure 22 – Histogram with specified bin range
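The way a user-defined bin range allocates values can be sketched in Python. The sketch assumes the rule described above (each bin value is an inclusive upper limit) plus an overflow bin labelled 'More' for anything above the last limit. The satisfaction scores are hypothetical, and histogram_counts is our own helper name.

```python
from bisect import bisect_left


def histogram_counts(values, bin_limits):
    """Count values per bin, where each bin limit is an inclusive upper bound.

    Values above the largest limit are counted in a final 'More' bin.
    """
    limits = sorted(bin_limits)
    counts = [0] * (len(limits) + 1)  # last slot is the overflow bin
    for value in values:
        # Index of the first limit that is >= value (len(limits) if none is).
        counts[bisect_left(limits, value)] += 1
    labels = [str(limit) for limit in limits] + ["More"]
    return list(zip(labels, counts))


# Hypothetical 7-point satisfaction scores (n = 20).
satisfaction = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

table = histogram_counts(satisfaction, range(1, 8))
for label, count in table:
    print(f"{label:<6}{count:>3}")
```

With the bin range set to 1, 2…7 the 'More' bin stays empty, so any non-zero count there would flag values outside the scale, for example data-entry errors.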

3 Calculating measures of central tendency

Measures of central tendency introduced in Chapter 13 can be calculated using Excel’s

statistical functions (select Formulas > More Functions > Statistical and select chosen

function to open the relevant Function Argument dialogue box). These are shown in Table 1.

See Introduction to using Microsoft Excel for quantitative data analysis (Appendix A) for

more details on how to select and use functions.

Table 1 – Measures of central tendency in Excel’s statistical functions

Function name | Description
AVERAGE | Returns the arithmetic mean (average) of the given numbers
MEDIAN | Returns the median of the given numbers
MODE.SNGL | Returns the mode of the given numbers

These can also be calculated using the Descriptive Statistics function in the Data Analysis

ToolPak (see below).
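For comparison, the three measures in Table 1 have direct counterparts in Python's standard statistics module. The sketch below uses hypothetical 7-point satisfaction scores; the data are chosen to have a single mode, so the handling of tied modes (which may differ between tools) does not arise.

```python
import statistics

# Hypothetical 7-point satisfaction scores (n = 20).
satisfaction = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

mean_value = statistics.mean(satisfaction)      # counterpart of AVERAGE
median_value = statistics.median(satisfaction)  # counterpart of MEDIAN
mode_value = statistics.mode(satisfaction)      # counterpart of MODE.SNGL

print(f"mean = {mean_value:.2f}, median = {median_value}, mode = {mode_value}")
```

Here the mean (4.35) lies below the median (4.5) and the mode (5), which hints at a slight negative skew in these invented scores.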

4 Calculating measures of dispersion (spread)

Measures of dispersion (spread) introduced in Chapter 13 can be calculated using Excel’s

statistical functions (select Formulas > More Functions > Statistical and select chosen

function to open the relevant Function Argument dialogue box). These are shown in Table 2;

note that if you are using sample data, you should use STDEV.S and VAR.S for calculating

the standard deviation and variance of your sample. See Introduction to using Microsoft


Excel for quantitative data analysis (Appendix A) for more details on how to select and use

functions.

Table 2 – Measures of dispersion in Excel’s statistical functions

Function name | Description
MAX | Returns the maximum value of the given numbers
MIN | Returns the minimum value of the given numbers
STDEV.P | Returns the standard deviation of the given numbers, based on the population
STDEV.S | Returns the standard deviation of the given numbers, based on a sample
VAR.P | Returns the variance of the given numbers, based on the population
VAR.S | Returns the variance of the given numbers, based on a sample

These can also be calculated using the Descriptive Statistics function in the Data Analysis

ToolPak (see below).
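The difference between the population (.P) and sample (.S) versions is the divisor: the population formulas divide the sum of squared deviations by n, the sample formulas by n - 1. The Python sketch below illustrates this with the standard statistics module, using a small set of hypothetical values chosen so the arithmetic is easy to follow (mean 5, sum of squared deviations 32).

```python
import statistics

# Hypothetical values: mean = 5, sum of squared deviations = 32, n = 8.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)        # MAX minus MIN
pop_variance = statistics.pvariance(values)    # counterpart of VAR.P:   32 / 8
pop_std_dev = statistics.pstdev(values)        # counterpart of STDEV.P
sample_variance = statistics.variance(values)  # counterpart of VAR.S:   32 / 7
sample_std_dev = statistics.stdev(values)      # counterpart of STDEV.S

print(f"range = {value_range}")
print(f"population: variance = {pop_variance:.3f}, sd = {pop_std_dev:.3f}")
print(f"sample:     variance = {sample_variance:.3f}, sd = {sample_std_dev:.3f}")
```

The sample versions always give the larger result, and the gap narrows as n grows.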

5 Exploring the shape of a variable’s distribution

Excel’s Histogram function in the Data Analysis ToolPak described above (select Data >

Data Analysis > Histogram > OK) can be used to generate a histogram for visual evaluation

of the shape of a metric variable’s distribution.

Excel’s statistical functions include functions for calculating skewness and kurtosis (Table 3).

See Introduction to using Microsoft Excel for quantitative data analysis (Appendix A) for more details

on how to select and use functions.

Table 3 – Measures of distribution shape in Excel’s statistical functions

Function name | Description
KURT | Returns the kurtosis of a dataset
SKEW | Returns the skewness of a dataset

These can also be calculated using the Descriptive Statistics function in the Data Analysis

ToolPak (see below).
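To show what these measures compute, the sketch below implements sample-adjusted skewness and excess kurtosis in Python. It assumes the formulas given in Microsoft's function documentation for SKEW and KURT as we understand them (both based on the sample standard deviation, with kurtosis reported relative to a normal distribution); treat it as an illustration rather than a guaranteed reproduction of Excel's output. The function names and data are our own.

```python
import statistics


def sample_skewness(values):
    """Sample-adjusted skewness; needs at least 3 values that are not all equal."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def sample_excess_kurtosis(values):
    """Sample-adjusted excess kurtosis; needs at least 4 values, not all equal."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


symmetric = [1, 2, 3, 4, 5]      # hypothetical, perfectly symmetric
right_skewed = [1, 1, 1, 2, 10]  # hypothetical, long right tail

print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_skewed))
```

A symmetric dataset gives a skewness of zero, a long right tail gives a positive value, and a flat distribution such as 1 to 5 gives a negative excess kurtosis.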

6 Generating summary statistics

Excel’s Descriptive Statistics routine in the Data Analysis ToolPak provides a quick way of

generating summary statistics for metric variables that includes measures of central tendency,

dispersion and skewness/kurtosis. To calculate descriptive statistics (for convenience this is


repeated from Appendix B to Introduction to using Microsoft Excel for quantitative data

analysis):

1. Select Data > Data Analysis to open the Data Analysis menu dialogue box (Figure 23).

Figure 23 – Data Analysis menu dialogue box

2. Select the desired function, in this case Descriptive Statistics, which opens the relevant dialogue box (Figure 24).

3. In the dialogue box, enter the desired range in the Input Range box. If you have included the column header, select the Labels in first row box. Confirm where you want the output to go. The default setting is New Worksheet Ply which creates a new worksheet for the output; since most ToolPak outputs are quite large, this is a sensible option.

4. Select Summary Statistics to get descriptive statistics for your chosen data; you can also select an appropriate confidence interval for the mean if desired (the default is 95%).

Figure 24 – Descriptive Statistics dialogue box

5. Click OK. The output will be shown in a new worksheet (Figure 25). Note that here the column widths have been adjusted to make it easier to read.

Figure 25 – Descriptive Statistics output for variable Age

Note also that this output is not dynamically linked to the original dataset so changes to the

dataset will not automatically be updated in the output. You will need to run a new analysis.

Once created, the output can be cut-and-pasted into word-processing software for further

editing.

(Hint: if using the Descriptive Statistics function, you can select multiple adjacent metric

variables and the function will report the output for each one.)
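A summary of this kind can also be assembled in Python as a cross-check. The sketch below computes a subset of the usual descriptive statistics for one metric variable using only the standard library. The ages are hypothetical, describe is our own helper name, and the row labels are modelled on the ToolPak output as we recall it, so they may differ from what your version of Excel displays. The standard error of the mean is calculated as the sample standard deviation divided by the square root of n.

```python
import math
import statistics


def describe(values):
    """Return a dict of summary statistics for one metric variable."""
    n = len(values)
    std_dev = statistics.stdev(values)  # sample standard deviation
    return {
        "Mean": statistics.mean(values),
        "Standard Error": std_dev / math.sqrt(n),
        "Median": statistics.median(values),
        "Standard Deviation": std_dev,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


# Hypothetical ages for eight respondents.
ages = [21, 25, 25, 30, 34, 38, 42, 49]

summary = describe(ages)
for label, value in summary.items():
    print(f"{label:<20}{value:>10.2f}")
```

Like the ToolPak output, the result is a snapshot: if the data change, the function has to be run again.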