Deriving insights from data using "R"ight way

Deriving insights from Data – The “R”ight way

Colin Shearer once remarked “Find me something interesting in my data is a question from hell”. A lot of

literature is being published today about Big Data and Predictive analytics. In this mad rush of finding

insights from data, many times organization forget the basic paradigm – “Analysis should be guided by

business goals”. Huh, enough of Gyan!! Let me walk the talk.

In this blog, I wish to demonstrate a simple exercise of deriving value from the data for an Automobile

business. The intended audience could be any one ,who brain cells starts to play soccer on hearing

words like “Big Data”, “R”,”Analytics”,”Insights”, in fact anything which makes sense out of data. Nerds,

Geeks and Dodo like me, all can benefit from this blog.

By the time you reach the end of blog and manage to be awake, here is what you will learn:

First thing first – Be clear of what do you want to dig from the Data mine?

Identify the attributes(independent), which are useful and how do they relate to the Objective

of your analysis

Once the above point is answered, get a sense of the data – What is the type of data attributes,

nature, sample values etc

Is the data workable – are there duplicate values, missing values? If so clean the data

Once the data is cleansed, Try to fit a model between the independent variables(“means”) and

the dependent variable( the objective or the “end”)

http://insights.wired.com/profile/ColinShearer

http://bigdatanalytics.files.wordpress.com/2013/10/dataasttro.png

Validate the model, make it accurate so that the values predicted are close to actual data; Run

Diagnostics to validate that model is well-built.

You may use a smaller test data( subset of the actual data set) set to validate your model, before

applying the model on the actual dataset

Automobile industry in India is going through a down turn due to sluggish economy. Pricing a new car in

today’s scenario is one of the most challenging problems for any car manufacturer. So, the Car

manufacturer has collected some the data and wishes to use it to predict the price of new car model or

may be, perform a price correction of the existing models, which have lower sales, possibly due to

pricing issue.

Problem statement:

Given the available data, can predictive analytics be used to establish relationships between how the

various features of car impact the Car price and more importantly how strong is this relationship?

A snap shot of the Automobile price and various attributes data is shown below:

Data Dictionary of the Automobile data:

Price: Suggested retail price of the car

Mileage: Number of miles the car has been driven

Make: Manufacturer of the car such as Saturn, Pontiac, and Chevrolet

Model: Specific models for each car manufacturer such as Ion, Vibe, Cavalier

Trim (of car): Specific type of car model such as SE Sedan 4D, Quad Coupe 2D

Type: Body type such as sedan, coupe, etc.

Cylinder: Number of cylinders in the engine

http://bigdatanalytics.files.wordpress.com/2013/10/1.png



Liter: A more specific measure of engine size

Doors: Number of doors

Cruise: Indicator variable representing whether the car has cruise control (1 = cruise)

Sound: Indicator variable representing whether the car has upgraded speakers (1 = upgraded)

Leather: Indicator variable representing whether the car has leather seats (1 = leather)

I have used here open source platform – Language R to perform the analysis. I would be happy to share

the source code.(mail me @ [email protected])

Data Exploration:

So the first step is to get a sense of how data looks. Here is snapshot of the data:

This is how data looks like. There are 804 rows of data with various columns like Price, Mileage, Make

and so on. Make has sample value like Buick and so on.

This does not gives a clear picture of how Mileage governs the Car Price, but a dense region in the graph

is indicative of the fact that several Cars have close Price~Mileage relationship. In fact the scatter in

http://bigdatanalytics.wordpress.com/[email protected]



some way suggests that apart from mileage, there are some other related variables/attributes ( like

number of Cylinders, Size of engine in Liters and so on), which also influence the price of the Car.

The above graph is indicative of increasing price with increase in number of Cylinders ( With 4 cylinders

the price range is from 1000 to 4000, with 6 cylinders the Price range of cars is from 2000 to 4500 and so

on).

This below graph, is again suggestive of increase in Price with increase in the Engine size (in Liter).

Now connecting the dots, I have just taken a few attributes from the data set to see their relationship

with Price and also how they are interrelated.



A Data Scientists needs to similar Data exploration to better understand the nuisance of the data set

before getting in to further analysis.

Here we added a few more dummy columns to the data set. These dummy columns are required as

there are a few attributes which are Boolean in nature or are discrete with numeric values ( 1,2,3..).

Using such attributes for further analysis, requires transforming them in to dummy variables to so that

0, 1, 2 can be converted in to discrete values, which if not done, will led to bias in the results.

Mathematically 1 is greater than 0, and 2 is greater than 1, but for our analysis case both 0 , 1 or 2 are

just discrete cases where “1” does not have more significance than”0”; they are just different values of

an attribute like Apples and oranges in case of fruits and not small Apple, Big Apple and even Bigger

Apple .

Data Cleansing:

Fortunately there were not any missing values in the data, otherwise the missing values have to be

plugged in one of the many methods. Easiest is to ignore the data tuples, which have missing values or

using average of the remaining values or some more scientific method based on the need.

If there are any duplicate records, even they have to be removed.

Fitting the Model:

Looking at the data, the Price and Mileage are numeric values of high order, where as all other

attributes like Model, trim, type etc also have discrete numeric values, but not of same order as Price

and Mileage. Hence instead of linear model, a Logarithmic model will be a better fit.

Here is the sample code in R Language to generate the model.



The result seems to show some correlation between the various attributes, which have some impact on

the Car price.

R-Squared value close to 1 is indicative of the accuracy of the model.

However at 95% significance level, we see only a few variable (attributes like Mileage, Cylinder and so

on) to be significant. So we have to revisit the model.

So the model can be described as:

Price = Function ( Mileage, Cylinder, Doors and some specific make like Buick, Type like Convertible,

Hatchback and so on)

Now the result indicates that we have a better model, with all the attributes having significant (even

greater than at 95% Significance Level) impact on the car Price.



Model Validation:

To test our model on how well it can predict the Price of the car, we plot the Actual price of the care

(available in the Data set) against the Predicted price (derived from our model).The Green line is the

Actual Price and the Red line is the Predicted Price.

The model seems to be doing good job in predicting the price!!

There are some other tests which are performed to check the soundness of the model. I am just putting

here just a few of them for the reader’s benefit ( to avoid headache;) )



As seen in graph below, the distribution above and below the line is quite the same, implying that the

model is free from Hetroscedasticity (Tongue twister indeed!!) è Error Term variance is constant.

The below graph checks for Normality, Independence, Linearity and Homoscedasticity [ again a tongue

twister ]

I do not intend to go in to depth of these graphs as is the subject is quite dry and honestly speaking, I

too am learning the tricks of this trade!!

Last but not the least, if these test run fine and give good result on the smaller test set, you may run this

on the much bigger actual data set to realize the outcome of your model.

Nevertheless, for those who have started to snooze, it’s time to say “Enjoy your Sleep”!



Technology

Deriving insights from data using "R"ight way