Data Analytics Intermediate 1, Module 2: Fundamentals of Data in Excel

Data Cleaning

Garbage in. Garbage out.

Variable Type: ID, Text, Ordinal, Categorical, Numerical

5 types of data problems that need to be cleaned: Duplicate ID, Type, Missing, Out of range, Outlier

Data Cleaning Best Practice

Guided Practice for cleaning data

1. Copy the data separately before making any changes.
2. Make a note of what you've done.
3. Create a document to record conclusions:
   a. Where does the data come from?
   b. Explain analysis results.
   c. Summary of analysis results.

Summary Statistics

Ordinal and categorical data

1. Central tendency
   a. Mean: A central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.
   b. Median: The middle value of the given list of data, when arranged in order.
   c. Mode: The value that appears most often in a set of data values.
2. Dispersion
   a. Standard deviation: A measure of the amount of variation or dispersion of a set of values.
   b. Range: The difference between the largest and smallest values in a set of data.
3. Statistical dependency
   a. Correlation: Any statistical relationship, whether causal or not, between two random variables or bivariate data.
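The central tendency and dispersion measures can be checked by hand on a small sample. The following is a minimal sketch in Python, assuming the seven Spending values from the sample customer table in this handout; the variable names are illustrative only.

```python
import statistics

# Spending values taken from the sample customer table in this handout.
spending = [20000, 30000, 100000, 20000, 10000, 4000, 40000]

mean_value = statistics.mean(spending)       # sum of the values / number of values
median_value = statistics.median(spending)   # middle value once the list is sorted
mode_value = statistics.mode(spending)       # value that appears most often
stdev_value = statistics.stdev(spending)     # sample standard deviation
range_value = max(spending) - min(spending)  # largest value minus smallest value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("standard deviation:", round(stdev_value, 1))
print("range:", range_value)
```

Note how the single large value (100000) pulls the mean well above the median, which is one reason to look at both.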

Ordinal data
Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known.

Numerical data
Numerical data is a data type expressed in numbers, rather than natural language description. Sometimes called quantitative data, numerical data is always collected in number form.

Data Referencing

VLOOKUP SYNTAX
The syntax is: =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

Lookup_value is the value that will be used to match data. This is usually an identifier (an ID of some kind). It must exist in both worksheets.

Table_array is the table from which you want to retrieve data.

Col_index_num is the number of the column, counted from the left side of the table_array, from which you want to retrieve data. Lock the column with $, e.g. $C4.

Range_lookup defines whether the lookup_value needs an approximate match or an exact match with the values in the left-most column of the table_array. TRUE: An approximate match is needed.* FALSE: An exact match is required.
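To make the four arguments concrete, here is a rough Python sketch of what VLOOKUP does. The `vlookup` helper is hypothetical (it is not an Excel or library function), it returns `None` where Excel would show #N/A, and the `bands` table is invented for the approximate-match case. The customer rows come from the sample table in this handout.

```python
from bisect import bisect_right

def vlookup(lookup_value, table_array, col_index_num, range_lookup=True):
    """Hypothetical helper that mimics VLOOKUP on a list of rows.

    col_index_num is 1-based and counted from the left of table_array.
    """
    keys = [row[0] for row in table_array]  # left-most column
    if range_lookup:
        # Approximate match: assumes the left-most column is sorted ascending
        # and picks the last row whose key is <= lookup_value.
        position = bisect_right(keys, lookup_value) - 1
        if position < 0:
            return None
    else:
        # Exact match: the first row whose key equals lookup_value.
        if lookup_value not in keys:
            return None
        position = keys.index(lookup_value)
    return table_array[position][col_index_num - 1]

# (ID, Age range, Region, Spending, Name) from the sample customer table.
customers = [
    (1, "10-30", "North", 20000, "Anthony"),
    (2, "31-50", "North", 30000, "Brittney"),
    (3, "31-50", "Northeast", 100000, "Christina"),
    (4, "51-90", "North", 20000, "Donald"),
    (5, "51-90", "South", 10000, "Elaine"),
    (6, "10-30", "Northeast", 4000, "Frank"),
    (7, "10-30", "South", 40000, "Gary"),
]

name_for_id_3 = vlookup(3, customers, 5, range_lookup=False)  # exact match on ID
missing_id = vlookup(8, customers, 5, range_lookup=False)     # no such ID

# Invented lookup table for the approximate case: lower bound -> label.
bands = [(0, "Low"), (20000, "Medium"), (50000, "High")]
band_for_30000 = vlookup(30000, bands, 2, range_lookup=True)

print(name_for_id_3, missing_id, band_for_30000)
```

Exact matching suits identifiers such as IDs, while approximate matching is typically used to place a number into a band.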

A NULL is any missing value in your data. There are four primary strategies for handling NULL values:

1. Delete them (only with caution).
2. Ignore them (some may have meaning).
3. Impute values (e.g. median or zeros).
4. Find missing values (using reference resources).

Guideline: If over 15% of a dataset is filled with NULL values, find new data!
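The 15% guideline and median imputation can be sketched in a few lines of Python. The column below is hypothetical, `None` stands in for a NULL, and both helper functions are illustrative names rather than part of any library.

```python
import statistics

# Hypothetical Spending column; None marks a NULL (missing) value.
spending = [20000, None, 100000, 20000, 10000, 4000, 40000, 30000, 25000, 15000]

def null_share(values):
    """Fraction of entries that are NULL."""
    return sum(v is None for v in values) / len(values)

def impute_median(values):
    """Replace each NULL with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = statistics.median(known)
    return [fill if v is None else v for v in values]

share = null_share(spending)
exceeds_guideline = share > 0.15  # the 15% guideline from this handout

# Only impute when the share of NULLs is within the guideline.
cleaned = spending if exceeds_guideline else impute_median(spending)

print("share of NULLs:", share)
print("cleaned column:", cleaned)
```

Imputing the median rather than zero keeps the filled value close to a typical observation, which matters when averages are computed afterwards.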

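| ID | Age range | Region | Spending | Name |
|----|-----------|--------|----------|------|
| 1 | 10-30 | North | 20000 | Anthony |
| 2 | 31-50 | North | 30000 | Brittney |
| 3 | 31-50 | Northeast | 100000 | Christina |
| 4 | 51-90 | North | 20000 | Donald |
| 5 | 51-90 | South | 10000 | Elaine |
| 6 | 10-30 | Northeast | 4000 | Frank |
| 7 | 10-30 | South | 40000 | Gary |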

| ID | Bill | Card type |
|----|------|-----------|
| 1 | 80000 | Primo |
| 3 | 20000 | Super |
| 5 | 10000 | Platinum |
| 6 | 0 | Primo |
| 6 | 3000 | Super |
| 6 | 1000 | Platinum |

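| ID | Age range | Region | Spending | Name |
|----|-----------|--------|----------|------|
| 1 | 10-30 | North | east | Anthony |
| 2 | 31-50 | North | 30000 | Brittney |
| 3 | 31-50 | Northeast | -100000 | Christina |
| 4 | 51-90 | North | 20000 | Donald |
| 5 | 51-90 | South | 10000 | Elaine |
| 6 | 10-30 | Northeast | 4000 | Frank |
| 7 | 10-30 | South | 4000000 | Gary |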
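The problems in this version of the table (a text value in a numerical column, a negative amount, an extreme amount) and the repeated ID in the bill table can be flagged programmatically. This is a minimal Python sketch: the rule that Spending cannot be negative and the "more than 10 times the median" outlier threshold are assumptions chosen for illustration, not rules stated in this handout.

```python
import statistics
from collections import Counter

# (ID, Spending) pairs from the customer table with errors.
spending_by_id = [(1, "east"), (2, 30000), (3, -100000), (4, 20000),
                  (5, 10000), (6, 4000), (7, 4000000)]
# ID column of the bill table.
bill_ids = [1, 3, 5, 6, 6, 6]

def is_number(value):
    return isinstance(value, (int, float))

# Type problem: a numerical column should contain only numbers.
wrong_type = [i for i, v in spending_by_id if not is_number(v)]

# Out of range: assumes Spending can never be negative.
out_of_range = [i for i, v in spending_by_id if is_number(v) and v < 0]

# Outlier: assumed rule of thumb, more than 10 times the median of valid values.
valid = [v for _, v in spending_by_id if is_number(v) and v >= 0]
threshold = 10 * statistics.median(valid)
outliers = [i for i, v in spending_by_id if is_number(v) and v > threshold]

# Duplicate IDs: any ID that appears more than once.
duplicate_ids = sorted(k for k, n in Counter(bill_ids).items() if n > 1)

print("wrong type:", wrong_type)
print("out of range:", out_of_range)
print("outliers:", outliers)
print("duplicate IDs:", duplicate_ids)
```

Flagging rows rather than deleting them keeps the original data intact, in line with the practice of copying the data before making changes.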

| Field | Data Type | Notes | Action to take |
|-------|-----------|-------|----------------|
| Accident_Index | Alphanumeric | Unique identifier | Check for and handle duplicates. |
| Accident_Severity | Text | Slight, Serious, or Fatal | Check for and handle blanks. |
| Date | Date | Expect duplicates | None |
| Region | Text | Likely entered by a human; critical to our analysis | Check for incorrect and/or missing values. |
| Number_of_Included_Parties | Numerical | The spread of numbers looks concerning at first glance | Check for and handle outliers. |
| Number_of_Vehicles | Numerical | The spread of numbers looks concerning at first glance | Check for and handle outliers. |

2021 © TRUE DIGITAL ACADEMY Data Analytics Intermediate I Module 2: Fundamentals of Data in Excel

Example: Thai Demographics and Income data

[Figure labels: Histogram; Strong positive correlation; Strong negative correlation; Correlation; grouping; aggregation]

Positive correlation: The greater the population density, the greater the income. The lower the population density, the lower the income.

Negative correlation: The greater the population density, the lower the income. The lower the population density, the greater the income.

Statistics in Excel

Name Range: Highlight the selected table and fill in a name for the range.

Mean: =AVERAGE(name range)
Median: =MEDIAN(name range)
Standard Deviation: =STDEV(name range)
Correlation: =CORREL(name range 1, name range 2)

The Data Analysis add-on can help you do data analysis in many ways.
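For readers who want to see what the correlation formula computes, the sketch below works out the Pearson correlation coefficient, which is the statistic CORREL returns, step by step in Python. The density and income numbers are invented for illustration and are not taken from the Thai dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    squares_x = sum((x - mean_x) ** 2 for x in xs)
    squares_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(squares_x * squares_y)

# Hypothetical values, for illustration only.
density = [100, 200, 300, 400, 500]
income_rising = [10, 20, 30, 40, 50]   # income rises with density
income_falling = [50, 40, 30, 20, 10]  # income falls as density rises

r_positive = pearson(density, income_rising)
r_negative = pearson(density, income_falling)

print("rising income:", r_positive)
print("falling income:", r_negative)
```

The coefficient always lies between -1 and 1; the sign gives the direction of the relationship and the magnitude gives its strength.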

Example
How many customers are there in EACH region?
What's the average spending in EACH region?
What's the average spending in EACH age group?

1. Click 'Insert' > 'Pivot Table'.
2. Select the table/data source.
3. Select variables into column, row, filter, and value.

Crosstabulation is a technique for finding the relationship between multiple variables.
Categorical x Numerical = Aggregate statistics
Categorical x Categorical = Counts
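The three example questions can also be answered outside a pivot table. The following Python sketch groups the sample customer table by a categorical column and then aggregates; the helper names `group_values` and `mean_by` are illustrative, not library functions.

```python
from collections import Counter, defaultdict

# (ID, Age range, Region, Spending, Name) from the sample customer table.
customers = [
    (1, "10-30", "North", 20000, "Anthony"),
    (2, "31-50", "North", 30000, "Brittney"),
    (3, "31-50", "Northeast", 100000, "Christina"),
    (4, "51-90", "North", 20000, "Donald"),
    (5, "51-90", "South", 10000, "Elaine"),
    (6, "10-30", "Northeast", 4000, "Frank"),
    (7, "10-30", "South", 40000, "Gary"),
]
AGE, REGION, SPENDING = 1, 2, 3  # column positions within each row

def group_values(rows, key_col, value_col):
    """Grouping: collect the value column separately for each category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_col]].append(row[value_col])
    return groups

def mean_by(rows, key_col, value_col):
    """Aggregation: average of the value column within each category."""
    groups = group_values(rows, key_col, value_col)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Categorical x Categorical = counts
customers_per_region = Counter(row[REGION] for row in customers)
region_by_age = Counter((row[REGION], row[AGE]) for row in customers)

# Categorical x Numerical = aggregate statistics
mean_spending_by_region = mean_by(customers, REGION, SPENDING)
mean_spending_by_age = mean_by(customers, AGE, SPENDING)

print(dict(customers_per_region))
print(mean_spending_by_region)
print(mean_spending_by_age)
```

Separating the grouping step from the aggregation step mirrors how a pivot table works: rows or columns define the groups, and the value field defines the statistic.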

| Statistic | Value |
|-----------|-------|
| count | 77 |
| mean | 23542 |
| std | 6209 |
| min | 13497 |
| 25% | 19590 |
| 50% | 22101 |
| 75% | 26376 |
| max | 45572 |

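Grouping: rows split by Region

| Region | Spending |
|--------|----------|
| North | 20000 |
| North | 30000 |
| North | 20000 |
| Northeast | 100000 |
| Northeast | 4000 |
| South | 10000 |
| South | 40000 |

Aggregation: average Spending per Region

| Region | Spending |
|--------|----------|
| North | 23333 |
| Northeast | 52000 |
| South | 25000 |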

[Figure: scatter plots of Income and Density. Positive correlation = 0.6 and 0.9; negative correlation = -0.6 and -0.9.]
