
Defining Exploratory Data Analysis



Exploratory Data Analysis (EDA) plays a critical role in understanding the what, why, and how of the problem statement. Exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics with visualizations.

First and foremost, EDA provides a stage for breaking down problem statements into smaller experiments which can help understand the dataset.

EDA provides relevant insights which help analysts make key business decisions.

The EDA step provides a platform to run all thought experiments and ultimately guides us towards making a critical decision.

In [1]:

Agenda = '''Real Time Case Study using numpy, pandas, matplotlib
• Descriptive Statistics
• Inferential Statistics
• Data Cleaning
• Mean Value
• Standard deviation
• Correlation
• Outlier'''

The main objective is to cover how to:

• Read and examine a dataset and classify variables by their type: quantitative vs. categorical
• Handle categorical variables with numerically coded values
• Perform univariate and bivariate analysis and derive meaningful insights about the dataset
• Identify and treat missing values and remove dataset outliers
• Build a correlation matrix to identify relevant variables


In [13]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt

Let’s get started by reading the dataset we’ll be working with and deciphering its variables. We’ll be analyzing a Kaggle data set on a company’s sales and inventory patterns. Kaggle is a great community of data scientists analyzing data together – it’s a great place to find data to practice the skills covered in this post.

The dataset contains a detailed set of products in an inventory and the main problem statement here is to determine the products that should continue to sell, and which products to remove from the inventory. The file contains the observations of both historical sales and active inventory data. The end solution here is to create a model that will predict which products to keep and which to remove from the inventory – we’ll perform EDA on this data to understand the data better.

Context

Attached is a set of products for which we are trying to determine which products we should continue to sell, and which products to remove from our inventory. The file contains BOTH historical sales data AND active inventory, which can be discerned with the column titled File_Type. We suspect that data science applied to the set, such as a decision tree analysis, logistic regression, or some other machine learning model, can help us generate a value (i.e., a probability score) for each product that can be used as the main determinant for evaluating the inventory. Each row in the file represents one product. It is important to note that we have MANY products in our inventory, and very few of them tend to sell (only about 10% sell each year); many of the products only have a single sale in the course of a year.

Content

The file contains historical sales data (identified with the column titled File_Type) along with current active inventory that is in need of evaluation (i.e., File_Type = "Active").


The historical data shows sales for the past 6 months. The binary target (1 = sale, 0 = no sale in the past six months) is likely the primary target that should drive the analysis. The other columns contain numeric and categorical attributes that we deem relevant to sales. Note that some of the historical sales SKUs are ALSO included in the active inventory.

A few comments about the attributes included, as we realize we may have some attributes that are unnecessary or may need to be explained:

• SKU_number: the unique identifier for each product.
• Order: just a sequential counter; can be ignored.
• SoldFlag: 1 = sold in past 6 months, 0 = not sold.
• MarketingType: two categories of how we market the product. This should probably be ignored, or better yet, each type should be considered independently.
• New_Release_Flag: any product that has had a future release (i.e., ReleaseNumber > 1).

Inspiration

(1) What is the best model to use that will provide us with a probability estimate of a sale for each SKU? We are mainly interested in a relative unit that we can continuously update based on these attributes (and others that we add, as we are able).
(2) Is it possible to provide a scored file (i.e., a probability score for each SKU in the file), and to provide an evaluation of the accuracy of the selected model?
(3) What are the next steps we should take?
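Question (1) points toward models such as logistic regression; purely as an illustration of how a per-SKU probability score could be produced, here is a minimal sketch with scikit-learn (the library, the feature subset, and the assumption that these columns are complete are all ours, not part of the original):

import pandas as pd
from sklearn.linear_model import LogisticRegression

sales_data = pd.read_csv("SalesKaggle3.csv")
hist = sales_data[sales_data['File_Type'] == 'Historical']

# Illustrative feature subset; any relevant numeric attributes could be used
features = ['StrengthFactor', 'PriceReg', 'ItemCount']
model = LogisticRegression(max_iter=1000)
model.fit(hist[features], hist['SoldFlag'])

# Score the active inventory: predict_proba returns [P(no sale), P(sale)] per row
active = sales_data[sales_data['File_Type'] == 'Active']
active_scores = model.predict_proba(active[features])[:, 1]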


In [19]:

# source to get data: https://www.kaggle.com/flenderson/sales-analysis
sales_data = pd.read_csv("SalesKaggle3.csv")
sales_data.head()

Out[19]:

   Order  File_Type  SKU_number  SoldFlag  SoldCount MarketingType  ReleaseNumber  New_Release_Flag
0      2 Historical     1737127       0.0        0.0             D             15
1      3 Historical     3255963       0.0        0.0             D              7
2      4 Historical      612701       0.0        0.0             D              0
3      6 Historical      115883       1.0        1.0             D              4
4      7 Historical      863939       1.0        1.0             D              2

Types of variables and descriptive statistics

Once we have loaded the dataset into the Python environment, our next step is understanding what these columns actually contain with respect to the range of values, and learning which ones are categorical in nature. To get a little more context it is necessary to understand what the columns mean with respect to the business; this helps establish rules for the potential transformations that can be applied to the column values. Here are the definitions for a few of the columns:

• File_Type: the value "Active" means that the particular product needs investigation.
• SoldFlag: 1 = sale, 0 = no sale in the past six months.
• SKU_number: the unique identifier for each product.
• Order: just a sequential counter; can be ignored.
• MarketingType: two categories of how we market the product.
• New_Release_Flag: any product that has had a future release (i.e., ReleaseNumber > 1).
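As a quick first pass at the quantitative vs. categorical split, pandas can group columns by dtype; a minimal sketch (keep in mind that numerically coded categoricals such as SoldFlag will still show up as numeric and need manual reclassification):

# Columns pandas stores as numbers vs. as strings/objects
numeric_cols = sales_data.select_dtypes(include='number').columns.tolist()
categorical_cols = sales_data.select_dtypes(include='object').columns.tolist()
print(numeric_cols)
print(categorical_cols)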


Mode, Mean and Median

The describe function returns a pandas DataFrame with descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

The three main numerical measures for the center of a distribution are the mode, the mean (µ), and the median (M).

• Mode: the most frequently occurring value.
• Mean: the average value.
• Median: the middle value.

In [20]:

sales_data.describe()

Out[20]:

Order SKU_number SoldFlag SoldCount ReleaseNumber New_Release_Flag

count 198917.000000 1.989170e+05 75996.000000 75996.000000 198917.000000 198917.000000

mean 106483.543242 8.613626e+05 0.171009 0.322306 3.412202 0.642248

std 60136.716784 8.699794e+05 0.376519 1.168615 3.864243 0.479340

min 2.000000 5.000100e+04 0.000000 0.000000 0.000000 0.000000

25% 55665.000000 2.172520e+05 0.000000 0.000000 1.000000 0.000000

50% 108569.000000 6.122080e+05 0.000000 0.000000 2.000000 1.000000

75% 158298.000000 9.047510e+05 0.000000 0.000000 5.000000 1.000000

max 208027.000000 3.960788e+06 1.000000 73.000000 99.000000 1.000000

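The three measures can also be computed directly on a single column; a quick sketch on SoldCount (Series.mode() can return several values, so we take the first):

print(sales_data['SoldCount'].mean())     # mean: the average value
print(sales_data['SoldCount'].median())   # median: the middle value
print(sales_data['SoldCount'].mode()[0])  # mode: the most frequent value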


include='all'

When we call the describe function with the include='all' argument, it displays the descriptive statistics for all the columns, including the categorical columns.

In [61]:

sales_data.describe(include='all')

Out[61]:

Order File_Type SKU_number SoldFlag SoldCount MarketingType

count 198917.000000 198917 1.989170e+05 75996.000000 75996.000000 198917

unique NaN 2 NaN NaN NaN 2

top NaN Active NaN NaN NaN S

freq NaN 122921 NaN NaN NaN 100946

mean 106483.543242 NaN 8.613626e+05 0.171009 0.322306 NaN

std 60136.716784 NaN 8.699794e+05 0.376519 1.168615 NaN

min 2.000000 NaN 5.000100e+04 0.000000 0.000000 NaN

25% 55665.000000 NaN 2.172520e+05 0.000000 0.000000 NaN

50% 108569.000000 NaN 6.122080e+05 0.000000 0.000000 NaN

75% 158298.000000 NaN 9.047510e+05 0.000000 0.000000 NaN

max 208027.000000 NaN 3.960788e+06 1.000000 73.000000 NaN

Next, we address some of the fundamental questions. The number of entries in the dataset:

In [22]:

print(sales_data.shape)

(198917, 14)


In [23]:

print(sales_data.nunique())

Order               198917
File_Type                2
SKU_number          133360
SoldFlag                 2
SoldCount               37
MarketingType            2
ReleaseNumber           71
New_Release_Flag         2
StrengthFactor      197424
PriceReg             11627
ReleaseYear             85
ItemCount              501
LowUserPrice         12102
LowNetPrice          15403
dtype: int64

In [24]:

print(sales_data[sales_data['File_Type'] == 'Historical']['SKU_number'].count())
print(sales_data[sales_data['File_Type'] == 'Active']['SKU_number'].count())

75996
122921

We use the count function to find the number of active and historical cases: we have 122,921 active cases that need to be analyzed. We then split the dataset into two parts based on the flag type. To do this, we pass the required condition in square brackets to the sales_data object, which evaluates the condition against all the entries and creates a new object with only the matching rows.

In [25]:

sales_data_hist = sales_data[sales_data['File_Type'] == 'Historical']
sales_data_act = sales_data[sales_data['File_Type'] == 'Active']


To summarize all the operations so far: the dataset contains 198,917 rows and 14 columns, with 12 numerical and 2 categorical columns. There are 122,921 actively sold products in the dataset, which is where we'll focus our analysis.

In [26]:

sales_data['MarketingType'].value_counts().plot.bar(title="Freq dist of Marketing Type")

Out[26]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a1542c358>

In [53]:

col_names = ['StrengthFactor', 'PriceReg', 'ReleaseYear', 'ItemCount', 'LowUserPrice', 'LowNetPrice']
fig, ax = plt.subplots(len(col_names), figsize=(12,18))
for i, col_val in enumerate(col_names):
    sns.distplot(sales_data_hist[col_val], hist=True, ax=ax[i])
    ax[i].set_title('Freq dist ' + col_val, fontsize=12)
    ax[i].set_xlabel(col_val, fontsize=12)
    ax[i].set_ylabel('Count', fontsize=12)
plt.show()

/Users/surendra/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

In [54]:

sales_data_hist = sales_data_hist.drop(['Order', 'File_Type', 'SKU_number', 'SoldFlag', 'MarketingType', 'ReleaseNumber'], axis=1)
sns.pairplot(sales_data_hist)

Out[54]:

<seaborn.axisgrid.PairGrid at 0x1a1862de80>

Missing value analysis

Missing values in the dataset are fields which are empty or have no value assigned to them. They usually occur due to data entry errors, faults in the data collection process, or conditions that arise while joining multiple columns from different tables. There are numerous ways to treat missing values; the easiest are to replace the missing value with the mean, median, mode, or a constant value (arrived at using domain knowledge), and an alternative is to remove the entry from the dataset itself.

In our dataset we don't have missing values, so we are not performing any such operations. That said, here are a few sample code snippets that will help you perform missing value treatment in Python.

To check if there are any null values in the dataset:

In [57]:

# Check whether the dataset contains any null values at all
data_frame.isnull().values.any()

# Count the null values per column
data_frame.isnull().sum()

# Replace the missing values in a column with a constant (here, 0)
data_frame['col_name'].fillna(0, inplace=True)
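The mean/median/mode replacements and row removal mentioned above look like this, using the same data_frame and col_name placeholders as the snippets above:

# Impute with a statistic of the column instead of a constant
data_frame['col_name'].fillna(data_frame['col_name'].mean(), inplace=True)
data_frame['col_name'].fillna(data_frame['col_name'].median(), inplace=True)
data_frame['col_name'].fillna(data_frame['col_name'].mode()[0], inplace=True)

# Or drop the rows that contain missing values entirely
data_frame = data_frame.dropna()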


In [58]:

col_names = ['StrengthFactor', 'PriceReg', 'ReleaseYear', 'ItemCount', 'LowUserPrice', 'LowNetPrice']
fig, ax = plt.subplots(len(col_names), figsize=(8,40))
for i, col_val in enumerate(col_names):
    sns.boxplot(y=sales_data_hist[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)
plt.show()

Based on the above definition of how we identify outliers, the black dots are outliers in the StrengthFactor attribute and the red colored box is the IQR range.

Percentile based outlier removal

The next step is to look at the ways we can remove these outliers. One of the most popular techniques is percentile based outlier removal, where we filter out outliers based on fixed percentile values. The other techniques in this category include removal based on z-score, constant values, etc.; a z-score variant is sketched after the percentile code below.

In [ ]:

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

col_names = ['StrengthFactor', 'PriceReg', 'ReleaseYear', 'ItemCount', 'LowUserPrice', 'LowNetPrice']
fig, ax = plt.subplots(len(col_names), figsize=(8,40))
for i, col_val in enumerate(col_names):
    x = sales_data_hist[col_val][:1000]
    sns.distplot(x, ax=ax[i], rug=True, hist=False)
    outliers = x[percentile_based_outlier(x)]
    ax[i].plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)
    ax[i].set_title('Outlier detection - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)
plt.show()
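As a rough sketch of the z-score alternative mentioned above (the threshold of 3 standard deviations is a common convention we are assuming, not something the original specifies):

def zscore_based_outlier(data, threshold=3.0):
    # Standardize, then flag points more than `threshold` std devs from the mean
    z = (data - data.mean()) / data.std()
    return z.abs() > threshold

outliers = zscore_based_outlier(sales_data_hist['StrengthFactor'])
print(outliers.sum(), "potential outliers in StrengthFactor")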


The correlation matrix

A correlation matrix is a table showing the value of the correlation coefficient (correlation coefficients are used in statistics to measure how strong a relationship is between two variables) between sets of variables. Each attribute of the dataset is compared with the other attributes to find the correlation coefficient. This analysis allows you to see which pairs have the highest correlation; pairs which are highly correlated represent the same variance of the dataset, so we can analyze them further to understand which attribute among the pair is most significant for building the model.

In [60]:

f, ax = plt.subplots(figsize=(10, 8))
corr = sales_data_hist.corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)

Out[60]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a1cb7b1d0>

Above you can see the correlation matrix of all the selected variables; the correlation value lies between -1 and +1. Strongly correlated variables have a correlation value close to -1 or +1, while weakly correlated variables have a value close to 0. In this dataset we don't see any strongly correlated attribute pairs, and the diagonal elements of the matrix are always 1 because each column is correlated with itself. The inference here is that all the numerical attributes are important and need to be considered for building the model.
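To make the heatmap's message explicit, one could also rank the attribute pairs by absolute correlation; a small sketch (the variable corr comes from the cell above):

corr_pairs = corr.abs().unstack().sort_values(ascending=False)
# Drop the diagonal entries (each column's correlation with itself is 1);
# note each remaining pair still appears twice, as (A, B) and (B, A)
corr_pairs = corr_pairs[corr_pairs < 1.0]
print(corr_pairs.head(10))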


Descriptive statistics

When we have a set of observations, it is useful to summarize features of our data in a single statement called a descriptive statistic.

As their name suggests, descriptive statistics describe a particular quality of the data they summarize.

A large number of methods collectively compute descriptive statistics and other related operations on a DataFrame. Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, and the axis can be specified by name or integer: for a DataFrame, "index" (axis=0, the default) or "columns" (axis=1).


In [3]:

import pandas as pd
import numpy as np

# Create a Dictionary of series
d = {'Name': pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
                        'Lee','David','Gasper','Betina','Andres']),
     'Age': pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating': pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

# Create a DataFrame
df = pd.DataFrame(d)
print(df)

      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Smith   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65


In [7]:

# sum()
# Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
print(df.sum())

Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object


In [5]:

# axis=1
# Sums across the columns for each row, giving the output shown below.
print(df.sum(1))

0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64


In [6]:

# mean()
# Returns the average value of each numeric column.
print(df.mean())

Age       31.833333
Rating     3.743333
dtype: float64

In [8]:

# std()
# Returns the standard deviation of the numerical columns, using Bessel's correction.
print(df.std())

Age       9.232682
Rating    0.661628
dtype: float64
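A quick way to see Bessel's correction at work: std() divides by n-1 by default (ddof=1), and passing ddof=0 switches to the population formula:

print(df['Age'].std())        # sample std, divides by n - 1 (Bessel's correction)
print(df['Age'].std(ddof=0))  # population std, divides by n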


Functions & Description

Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table lists the important functions:

Sr.No.  Function    Description
1       count()     Number of non-null observations
2       sum()       Sum of values
3       mean()      Mean of values
4       median()    Median of values
5       mode()      Mode of values
6       std()       Standard deviation of the values
7       min()       Minimum value
8       max()       Maximum value
9       abs()       Absolute value
10      prod()      Product of values
11      cumsum()    Cumulative sum
12      cumprod()   Cumulative product

In [9]:

# Summarizing Data
# The describe() function computes a summary of statistics pertaining to the DataFrame columns.
print(df.describe())

             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
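A few of the other functions from the table, applied to the same DataFrame (the column choices are ours; Name is skipped for mode() since all its values are unique):

print(df['Rating'].median())  # middle Rating value
print(df['Age'].mode())       # most frequent age(s)
print(df['Age'].cumsum())     # running total of the ages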


This function gives the mean, std, and IQR values, and it excludes the character columns, summarizing only the numeric ones. 'include' is the argument used to specify which columns need to be considered for summarizing. It takes a list of values; by default it is 'number'.

• object − summarizes string columns
• number − summarizes numeric columns
• all − summarizes all columns together (should not be passed as a list value)

Now, use the following statement in the program and check the output:

In [10]:

print(df.describe(include=['object']))

       Name
count    12
unique   12
top     Lee
freq      1


In [11]:

print(df.describe(include='all'))

          Name        Age     Rating
count       12  12.000000  12.000000
unique      12        NaN        NaN
top        Lee        NaN        NaN
freq         1        NaN        NaN
mean       NaN  31.833333   3.743333
std        NaN   9.232682   0.661628
min        NaN  23.000000   2.560000
25%        NaN  25.000000   3.230000
50%        NaN  29.500000   3.790000
75%        NaN  35.500000   4.132500
max        NaN  51.000000   4.800000

Descriptive Statistics

Descriptive Statistics refers to a discipline that quantitatively describes the important characteristics of the dataset. For the purpose of describing properties, it uses measures of central tendency, i.e. mean, median, mode, and measures of dispersion, i.e. range, standard deviation, quartile deviation, variance, etc.

The data is summarised by the researcher in a useful way, with the help of numerical and graphical tools such as charts, tables, and graphs, to represent the data accurately. Moreover, text is presented in support of the diagrams, to explain what they represent.


Inferential Statistics

Inferential Statistics is all about generalising from the sample to the population, i.e. the results of the analysis of the sample can be deduced for the larger population from which the sample is taken. It is a convenient way to draw conclusions about the population when it is not possible to query each and every member of the universe. The sample chosen should be representative of the entire population; therefore, it should contain the important features of the population.

Inferential Statistics is used to determine the probability of properties of the population on the basis of the properties of the sample, by employing probability theory. The major inferential statistics are based on statistical models such as Analysis of Variance, the chi-square test, Student's t distribution, regression analysis, etc. The methods of inferential statistics are:

• Estimation of parameters
• Testing of hypothesis (sketched below)
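As a minimal sketch of hypothesis testing, here is a one-sample t-test on the sample DataFrame built earlier, using scipy.stats (the library choice and the hypothesised mean of 3.5 are our assumptions; the source names the t distribution but shows no code):

from scipy import stats

# H0: the population mean Rating is 3.5; a small p-value argues against H0
t_stat, p_value = stats.ttest_1samp(df['Rating'], popmean=3.5)
print(t_stat, p_value)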

The difference between descriptive and inferential statistics

Descriptive Statistics is a discipline concerned with describing the population under study. Inferential Statistics is a type of statistics that focuses on drawing conclusions about the population on the basis of sample analysis and observation.

Descriptive Statistics collects, organises, analyzes and presents data in a meaningful way. On the contrary, Inferential Statistics compares data, tests hypotheses, and makes predictions of future outcomes.

There is a diagrammatic or tabular representation of the final result in descriptive statistics, whereas in inferential statistics the final result is displayed in the form of a probability.

Descriptive statistics describes a situation, while inferential statistics explains the likelihood of the occurrence of an event. Descriptive statistics explains data that is already known, to summarise the sample. Conversely, inferential statistics attempts to reach conclusions about the population that extend beyond the data available.

Correlation

Correlation is a measure of the relationship between variables, measured on a -1 to 1 scale. The closer the correlation value is to -1 or 1, the stronger the relationship; the closer to 0, the weaker the relationship. It measures how change in one variable is associated with change in another variable.

There are a few common types of tests to measure correlation: Pearson, Spearman rank, and Kendall Tau. Each has its own assumptions about the data that need to be met in order for the test to accurately measure the level of correlation. Each type of correlation test tests the following hypotheses:

H0 hypothesis: there is not a relationship between variable 1 and variable 2
HA hypothesis: there is a relationship between variable 1 and variable 2

• Positive correlation: as one variable increases, so does the other
• Negative (inverse) correlation: as one variable increases, the other variable decreases
• No correlation: there is no association between the changes in the two variables

The strength of the correlation matters. The closer the absolute value is to -1 or 1, the stronger the correlation.

r value     Strength
0.0 – 0.2   Weak correlation
0.3 – 0.6   Moderate correlation
0.7 – 1.0   Strong correlation
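All three tests are available in scipy.stats (an assumption on our part; the source names the tests but not a library). Each returns the coefficient and a p-value for the H0 above, here applied to the Age and Rating columns of the sample DataFrame:

from scipy import stats

r, p_pearson = stats.pearsonr(df['Age'], df['Rating'])      # linear relationship
rho, p_spearman = stats.spearmanr(df['Age'], df['Rating'])  # monotonic, rank-based
tau, p_kendall = stats.kendalltau(df['Age'], df['Rating'])  # rank concordance
print(r, rho, tau)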


An outlier is any data point which differs greatly from the rest of the observations in a dataset. Let's see some real-life examples to understand outlier detection:

• When one student averages over 90% while the rest of the class is at 70% – a clear outlier
• While analyzing a certain customer's purchase patterns, it turns out there's suddenly an entry for a very high value. While most of his/her transactions fall below Rs. 10,000, this entry is for Rs. 1,00,000. It could be an electronic item purchase – whatever the reason, it's an outlier in the overall data
• How about Usain Bolt? Those record-breaking sprints are definitely outliers when you factor in the majority of athletes

“Outliers are not necessarily a bad thing. These are just observations that are not following the same pattern as the other ones. But it can be the case that an outlier is very interesting. For example, if in a biological experiment, a rat is not dead whereas all others are, then it would be very interesting to understand why. This could lead to new scientific discoveries. So, it is important to detect outliers.” – Pierre Lafaye de Micheaux, Author and Statistician
