IE 485 - Introduction to Data Mining - R Tutorial
Mustafa Hekimoglu, PhD
Monday 20th February, 2017
Contents
1 Introduction to Data Mining with R 3
1.1 Databases To be Used In the Course . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 SuperSale Grocery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Savers Bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Visualizing Data 7
2.1 Scatter and Line Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Programming with R 10
3.1 Some Data Types and Classes In R . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Numeric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.2 Character . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.3 Not Available (NA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.4 Data Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.5 Other Data Types In R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 If Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Loops in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 For Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.2 While Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Descriptive Modeling Using R 19
4.1 Measures of Dissimilarity Between Objects . . . . . . . . . . . . . . . . . . . . . . 19
4.2 K-Means Algorithm In R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5 Predictive Modeling 29
5.1 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4 Nearest Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 1
Introduction to Data Mining with R
This document includes the R code and brief discussions that take place in IE 485. I believe having such a document at your disposal will enhance your performance in your homeworks and projects.
As we proceed in the course, I will keep updating the document with new discussions and code. Hence the reader should keep in mind that this document is still a work in progress, and all your comments and contributions are more than welcome.
The document starts with discussions of the databases that we use in our course. Later chapters cover visualization, descriptive modeling, fundamental statistics, and predictive modeling.
1.1 Databases To be Used In the Course
The course assumes an integrated approach to data mining applications. In each homework, you
will receive a database consisting of several tables with many records. Hence, the first task of each
homework question will be retrieving data in the correct form before starting analysis.
As data is stored in databases, data miners should be able to compose appropriate queries
for their analysis purposes. The most widely used query language is Structured Query Language
(SQL). Hence, a data analyst should be able to build good SQL queries to get the right data at the
beginning of his/her analysis. To mimic this process, class participants will consider the following
databases throughout the course.
1.1.1 SuperSale Grocery
The SuperSale grocery market chain sells fresh and frozen food to its customers 24/7. They have big stores located in Adana, Ankara, Istanbul, and Izmir. Thanks to their hardworking marketing department's CRM division, they have detailed information about their customers, and they aim to exploit it as much as possible to increase their gross income.
In order to make a data analysis of their sales, customers and products, they share a portion of their databases with our university. Data is stored in the file Lecture3.db. The database provides three tables (two master tables and one transactional table).
The first table (M_CUSTOMERS) stores customer identity numbers, names, genders, cities and dates of birth. Due to confidentiality issues, customer names are not provided. The second table (M_PRODUCTS) provides product information; the company only provides its product identities and their list prices. The third table (TR_SALES) is for sales transactions, which include the date of sale, receipt id, product id, customer id, total sold quantity and total amount of sales. Due to an unknown reason, each customer purchases only one product from all stores of the company. The relations between the fields of the tables are depicted using MS Access Database Tools in Figure 1.1. Note that we use SQLite for storing and retrieving data; however, MS Access Database Tools provides a nice visualization of the relations between fields of different tables.
Figure 1.1: Database Relation Diagram for Supersale Grocery Store
In this figure, matching fields of the tables are linked together. For instance, the CUSTOMER_ID field of M_CUSTOMERS matches the CUSTOMER_ID column of the TR_SALES table. Similarly, the PRODUCT_ID columns in the M_PRODUCTS and TR_SALES tables match.
1.1.2 Savers Bank
Savers Bank is a small bank that has operations in major cities of Turkey. Its main customer base consists of upper-class people who are willing to receive premium service as well as good interest on their investments. In addition to savings accounts, Savers Bank also issues credit cards for its customers in Istanbul, Ankara, Izmir, Gaziantep, Manisa, Adana, and Bursa.
In order to analyze their customer base, spending behaviors and expenses, they are willing to employ data mining techniques. The main objective of this study is to increase customer satisfaction by proposing well-calibrated services. To this end, the Chief Operations Manager of the bank shares a small part of its database with our university.
The database consists of four tables (two master data and two transactional data): M_CUSTOMERS, M_CREDIT_CARDS, CARD_TRANS, and SAVINGS_TRANS. The M_CUSTOMERS table stores customer information of the bank, whereas the M_CREDIT_CARDS table includes details about customers' credit cards. The CARD_TRANS and SAVINGS_TRANS tables include transactional data, e.g. expenditures and payments, for credit card and savings accounts respectively. The relationships between the tables and their fields are given in Figure 1.2.
Figure 1.2: Database Relation Diagram for Savers Bank
In the M_CUSTOMERS table, customer information is stored in 10 columns. Each customer is designated with a unique customer identity number, given in the column CUSTOMER_ID. In order to protect the confidentiality of their customers, they provide the databases without any name or identity information. However, we are provided with gender, city and marital status information for the customers of the bank.
Each customer can open two types of accounts: a generic account and a savings account. All customers have a generic account defined with an account number stored in the column ACCOUNT_ID. However, 55% of customers have credit card and/or savings accounts. Credit card information is stored in the CREDIT_CARD_HOLD and CREDIT_CARD_ID columns. The former is a binary variable for the existence of a credit card for that customer (1 stands for a customer with a credit card), whereas the latter stores credit card numbers.
In the M_CREDIT_CARDS table, customer ids and credit card numbers are stored in the columns CUSTOMER_ID and CREDIT_CARD_ID. Monthly incomes of the customers are also provided in the MONTHLY_INCOME column. Issue and expiration dates of each credit card, as well as CVV numbers, are stored in related columns (fields). Also, the bank allows its customers to choose one of three international credit card companies: Visa, Mastercard, and American Express. This information is stored in the column CARD_TYPE.
The CARD_TRANS table stores credit card expenditures and payments between January 1st, 2013 and 2016. For customers who use credit cards, their monthly expenditure is deducted from their monthly salaries. Each salary and expenditure record is stored with a unique transaction id (TRANS_ID), a type (expenditure (-) or payment (+)), a transaction date, and a payment destination number. The last column includes identity numbers for the recipients of credit card payments.
The SAVINGS_TRANS table provides customers' balances and transactions. Customers who can make some savings transfer their excess amounts from their generic accounts to their savings accounts from time to time; these transactions are recorded with a positive amount. Customers who spend more than their incomes can withdraw money from their savings accounts to pay their credit card debt. The column BALANCE stores the total amount of money deposited in the savings account by that date.
Chapter 2
Visualizing Data
Data visualization is an important subject for various reasons. First, an appropriate graphic can convey a message much more efficiently than verbal statements or mathematical terms; in journalism this principle is also known as "a picture is worth a thousand words". As today's competitive world forces professionals to explain their ideas as well as possible, using graphical tools efficiently is critical for success in business life.
Furthermore, visualizing data might also be helpful in understanding and validating the results of an analysis. As the human eye is the best graphical chip in the world, analyzing simple scatter plots can often reveal a lot of information about the subject matter. For instance, the residual plots of a linear regression model, which will be discussed in later chapters, often reveal the most reliable information about how well the model fits the data.
Naturally, it is important to decide on a plot type appropriate for the aim of the visualization. Some common plot types and their usage in R are presented in the following sections.
2.1 Scatter and Line Plots
Scatter plots are the most common and most primitive way of visualizing data. Often the variable of interest is placed on the y-axis, while an index or another variable is placed on the x-axis of the plot. In R both types of scatter plots are created with the plot() function. If a single variable is given as the parameter, R places an index variable on the x-axis of the plot. To plot two variables on the two axes, one should give the x-axis variable as the first, and the y-axis variable as the second parameter of the plot function. Below we provide two plots of data collected for Black Cherry Trees by Ryan et al. (1976).
Example: Scatter Plots for Black Cherry Trees
The Black Cherry Trees data set is one of the built-in data sets in R that can be reached from the datasets package of R. Documentation for this package can be checked from this link.
[Figure 2.1: Scatter Plots for Black Cherry Trees. (a) Plot with Single Variable: Height (ft) of Black Cherry Trees against an index. (b) Plot with Two Variables: volume plotted against Girth (inch).]
As can be seen in Figure 2.1a, the heights of the trees are plotted against an index on the x-axis. This plot provides some idea about the variability and mean of the variable. The dependency between girth and volume is obvious in Figure 2.1b. Obviously a polynomial (quadratic) regression line would yield a good fit for this relationship. Recall the quadratic relation between the diameter and the volume of a cylinder.
1 library(datasets) #the datasets package ships with base R
2
3 plot(trees[,2],main="Height of Black Cherry Trees",ylab="Height (ft)")
4 windows()
5 plot(trees[,1],trees[,3],main="Volume & Girth of Black Cherry Trees",
6   ylab="Volume (cubic ft)",xlab="Girth (inch)")
In this code, we also put axis labels using the xlab and ylab parameters of the plot function. To add a graphic title, one should enter a string (in quotation marks "") for the main parameter. Furthermore, the command windows() opens a new window to plot another scatter diagram, which is a convenient way to open multiple plots at the same time. (windows() is specific to Windows; dev.new() is the platform-independent equivalent.)
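The quadratic relation noted above can also be checked numerically. The sketch below fits a quadratic polynomial to the built-in trees data; the model choice here is illustrative, not part of the course code:

```r
# Fit Volume as a quadratic function of Girth using the built-in trees data set
fit <- lm(Volume ~ poly(Girth, 2), data = trees)
summary(fit)$r.squared  # a high R-squared indicates a good quadratic fit
```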
2.2 Histograms
Histograms are nice representations of data sets which convey very valuable information about the mode or median of a distribution as well as its tails and mean. To obtain a single-variable histogram from a data set, we execute the function hist() with the array of values as the parameter.
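For instance, the tree heights from the Black Cherry Trees data of Section 2.1 can be summarized with a single call:

```r
# Single-variable histogram of tree heights from the built-in trees data set
hist(trees$Height, main = "Height of Black Cherry Trees", xlab = "Height (ft)")
```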
Chapter 3
Programming with R
R is a medium-level programming language that operates on vectors. One can define loops and user-defined functions, and conduct statistical tests and simulations in R. To use all of this flexibility, we should understand the working mechanics of programming in R.
Almost every programming book on a specific language starts with a chapter on the data types recognized by the compiler. I will follow this rich tradition by introducing basic data types and classes that may help you with the initial learning curve of R. However, I should note that the concepts introduced here are far from exhaustive, as the classes and data types recognized by R depend on the libraries in use, and there is a vast number of libraries and classes out there.
3.1 Some Data Types and Classes In R
3.1.1 Numeric
Numeric is the default computational type in R (Becker et al., 1988). When we assign a number to a variable, R defines that variable as numeric. To see this, you can execute the code in the first two lines of Code Block 3.1. The function class() returns the class of an object defined in the memory of R.
In some cases, it might be useful to transform an object of a different format into the numeric type. For instance, we define a variable of character type, which is explained below, in the third line of Code Block 3.1. You cannot execute mathematical operations on this variable, and if you check its class you will see that it is in the character format. Transforming the variable into the numeric format by means of as.numeric() allows you to use the object a in mathematical computations; the fourth line of Code Block 3.1 returns 22. Furthermore, you can check whether an object is of numeric type by executing is.numeric(). This function yields TRUE if the variable's type is numeric.
Code Block 3.1: Histogram for Normal Variables
1 x <- 2;  #ASSIGN A VALUE TO THE VARIABLE x
2 class(x) #CHECK THE CLASS OF x
3 a <- "11";
4 as.numeric(a)*2
5 ptitle = "Histogram for Normal Random Variables";
6 hist(rnorm(10000,0,5),main=ptitle);
In R, another less common data format for numbers is double. Practically numeric and double
types can be used interchangeably.
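A minimal sketch of this interchangeability:

```r
x <- 2
class(x)       # "numeric"
typeof(x)      # "double" - numeric values are stored as doubles internally
is.numeric(x)  # TRUE
```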
3.1.2 Character
When you enter a string of textual characters between quotation marks, R treats them as character format. This format is useful for displaying messages in the terminal, or for modifying plot labels or the names of columns.
In the R language, strings are expressed using quotation marks ("). Assigning a character array to a variable might be helpful for displaying your program output on the console screen or for setting the main or axis titles of plots. In the last two lines of Code Block 3.1, we set a character array to a variable and assign the main title of a plot using this variable. The output of the code block is given in Figure 3.1. The same method can be applied for assigning titles to the horizontal and vertical axes using the parameters xlab and ylab of the plot function. Such an example is given in Chapter 2.
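Character values can also be combined with paste(), which is handy when building plot titles programmatically; a small sketch (the column name is illustrative):

```r
# Build a plot title from character pieces; paste() separates them with a space
colname <- "MONTHLY_INCOME"
ptitle  <- paste("Histogram of", colname)
ptitle  # returns "Histogram of MONTHLY_INCOME"
```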
3.1.3 Not Available (NA)
The R compiler uses NA for missing values. All mathematical operations with NA values return NA. The existence of this data type is checked with the function is.na(). In Code Block 3.2, we provide an example of a vector with missing elements. The first two lines exemplify our explanations of NA and the is.na() function. The fifth line of the code block shows one of the most convenient uses of vectors in R.
Elements of data frames, vectors and matrices can be reached with [] in R, as explained in Subsection 3.1.4. In the fifth line, we remove the NA element from the vector by negating the logical vector obtained from is.na() and using it as an indexing variable. This usage will appear in the following chapters of our course.
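Besides indexing with !is.na(), many base functions accept an na.rm argument that removes NA values before computing; a minimal sketch:

```r
f <- c(1, 3, 4, NA, 6)
mean(f)               # returns NA - arithmetic with NA propagates NA
mean(f, na.rm = TRUE) # returns 3.5 - the NA is dropped before averaging
```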
[Figure 3.1: Histogram of Normally Distributed Numbers. Frequency histogram of rnorm(10000, 0, 5), with main title "Histogram for Normal Random Variables".]
Code Block 3.2: NA Example
1 f = c(1,3,4,NA,6)
2 is.na(f)
3 #returns FALSE FALSE FALSE TRUE FALSE
4
5 f[!is.na(f)]*5
6 #returns 5 15 20 30
Code Block 3.3: Savers Bank Data Retrieval
1 library("RSQLite")
2 library(DBI)
3 db <- dbConnect(SQLite(), dbname="SaversBank.db")
4 dbListTables(db)
5 dbListFields(db,"M_CREDIT_CARDS")
6 creditcarddata <- dbGetQuery(db,"SELECT CUSTOMER_ID, MONTHLY_INCOME
7   FROM M_CREDIT_CARDS WHERE MONTHLY_INCOME > 3000")
8 hist(creditcarddata$MONTHLY_INCOME)
9 is.data.frame(creditcarddata) #RETURNS TRUE
10 is.data.frame(creditcarddata$MONTHLY_INCOME) #RETURNS FALSE
3.1.4 Data Frame
Data frames are "tightly coupled collections of variables" stored in the memory of R. When a data set is read from a file or retrieved from a database, it is stored as a data frame in memory by default. While data frames are similar to matrices (or arrays), they also have useful functionalities that are commonly employed by R programmers.
Data frames are actually tables in which each column has a specific name stored as an attribute of the object. This property is exploited by programmers to reach specific columns of data frames using the $ operator. An example of such usage is presented in the following example, for which the R code is given in Code Block 3.3.
Example: Distribution of Savers Bank Credit Card Holders
The managers of Savers Bank decide to adopt an aggressive promotion program in order to increase the monthly spending rate of their customers, since credit cards are the primary source of profit for the company. The first step is finding the distribution of customers who can afford to spend more, and they aim to reach this customer sample by looking at monthly incomes.
In Code Block 3.3, we first retrieve the data from the database provided by the bank and execute the necessary SQL query, which filters the customers with monthly incomes larger than 3000. Then we obtain the histogram of those customers' monthly incomes using the code in line 8 by reaching the column MONTHLY_INCOME of the data frame creditcarddata. The resulting histogram is given in Figure 3.2. Explanations of lines 9 and 10 are provided below.
Arrays and matrices can be transformed into a data frame by executing the function data.frame(). The arrays to be collected in a single data frame are passed to the function separated by commas. Naturally, one important point when collecting individual arrays into a data frame is that the arrays' lengths must be equal.
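A minimal sketch of building a data frame from equal-length arrays (the column names here are illustrative):

```r
# Collect two equal-length arrays into a data frame with named columns
ids    <- c(101, 102, 103)
cities <- c("IST", "ANK", "IZM")
customers <- data.frame(CUSTOMER_ID = ids, CITY = cities)
is.data.frame(customers)  # returns TRUE
customers$CITY            # reach a single column with the $ operator
```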
[Figure 3.2: Histogram of Savers Bank's Credit Card Holders with Monthly Income Larger Than 3000. Frequency histogram of creditcarddata$MONTHLY_INCOME, ranging from 3000 to 10000.]
You can also check whether an object's type is data frame by executing the is.data.frame() function. Line 9 of Code Block 3.3 returns TRUE as the data set is stored in data frame format. On the other hand, Line 10 returns FALSE since the R compiler returns an array when a column of a data frame is reached using $.
3.1.5 Other Data Types In R
In the previous sections, I provided brief explanations of the main data types in R. However, as noted at the beginning of the chapter, there are many more types in R. Hence, even years of programming experience with R do not guarantee complete knowledge of its data types. The good news is that a vast amount of information is easily accessible on the internet. Personally, I don't believe you need to know all data types and classes by heart; instead, you should know how to use the documentation of R and search engines to find solutions for your project.
Furthermore, the most helpful source in developing your own R program is the documentation. You can reach detailed explanations of the inputs, outputs, and parameters of a function by putting a question mark before the name of the function and hitting enter. For instance, typing ?rpois and hitting enter yields an HTML page explaining a function that simulates a given number of Poisson random variables for a given distribution parameter λ. Also keep in mind that the majority of R documentation pages include examples at the bottom which explain the usage of the function in different settings.
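As a quick sketch of the function mentioned above (the lambda value is chosen purely for illustration):

```r
# Simulate 10000 Poisson random variables with mean lambda = 3
set.seed(1)  # fix the seed for reproducibility
x <- rpois(10000, lambda = 3)
mean(x)  # close to 3, since the sample mean estimates lambda
```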
3.2 If Statement
Conditional statements are one of the most commonly used code structures of the R language. The structure consists of a conditional statement and an executable piece of program to be run if the condition is satisfied. The syntax starts with an "if" followed by the conditional expression given in parentheses. Then the executable lines should be given within curly brackets. The R compiler checks the condition and, if it holds, executes starting from the first executable line. If the executable part of the statement is a single line, then the if statement can be given without any curly brackets. Such a usage of the if statement is given in the first three lines of Code Block 3.4. In the first line, we set x to 2, which satisfies the condition in line 2. Hence the code writes "x is greater-than zero" to the console.
Another variant of the if statement includes else. This code structure checks the condition statement and executes the code that follows. If the condition is not satisfied, then the compiler executes the lines following "else". An example of such a usage is given between lines 6 and 10. The condition in line 6 holds since x = 2, as set in the first line of Code Block 3.4, so y becomes 0.5. If x were equal to 0, the compiler would execute lines 9 and 10 instead.
In R, four fundamental conditional operators can be expressed as follows: >= and <= check greater-than-or-equal-to and smaller-than-or-equal-to conditions respectively; removing the equality sign leads to strict inequality conditions (> and <). == and != stand for "equal to" and "not equal to" conditions. These fundamental condition statements are used to build composite conditional statements using the two logical operators "and" and "or", which are expressed with && and ||. An example of composite logical statements together with the if-else if-else structure is given in lines 14 to 17 of Code Block 3.4. Set x = NA: the combined condition in line 14 then evaluates to FALSE (x != 0 itself returns NA, but NA && FALSE is FALSE). The condition in line 16 is TRUE and line 16 writes "Data type is NA!" to the console. If x = 0, the code returns "Division by Zero Becomes Infinity!" and 0 is assigned to y.
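For completeness, R also offers a vectorized conditional, ifelse(), which applies the test elementwise rather than once; a minimal sketch:

```r
# ifelse() evaluates the condition for every element of the vector
x <- c(-2, 0, 3)
ifelse(x > 0, "positive", "non-positive")
# returns "non-positive" "non-positive" "positive"
```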
3.3 Loops in R
In any programming environment, loops are among the most commonly used structures, as the majority of calculations proceed in an iterative fashion. In R, you can employ different types of loops using the appropriate syntax. In this section, we present those loop structures.
Code Block 3.4: Three Different If Statements
1 x=2;
2 if(x>0)
3   print("x is greater-than zero");
4 #The program returns "x is greater-than zero"
5
6 if(x!=0)
7 { y=1/x;
8 } else {
9   print("Division by Zero Becomes Infinity!");
10   y=0;}
11 #For x=0, the program returns "Division by Zero Becomes Infinity!"
12
13 x=NA
14 if((x!=0)&&(!is.na(x)))
15 { y=1/x;
16 } else if(is.na(x)){ print("Data type is NA!");
17 } else {
18   print("Division by Zero Becomes Infinity!");
19   y=0;}
20 #For x=NA, the program returns "Data type is NA!"
21 #For x=0, the program returns "Division by Zero Becomes Infinity!"
3.3.1 For Loop
The for loop is a structure that executes a block of code a finite number of times. The loop consists of an (iteration) index with start and end values. The compiler assumes that the end value is larger than or equal to the start value and executes the code provided between the brackets end − start + 1 times; in each iteration the index is increased by one. If there is no bracket after the for statement, then the compiler only iterates the line following the for statement. Such a situation is provided in Code Block 3.5.
The code given in lines 2-6 iterates i from 1 to 5, calculates j by multiplying i by 2, and prints the value assigned to j to the screen. On the other hand, the code in lines 8-10 prints only 10: it iterates i from 1 to 5, multiplying by 2 and assigning the result to j, but only the assignment line is inside the loop. After the iteration is complete, it prints 10, the result of the last multiplication.
In our second example, we set up a for loop running through each element of a given vector containing the names of 8 students in an international class. A curious (and maybe a bit 'nerdy') computer programmer is interested in developing an R program that counts the number of students whose name starts with 'a'. This program is given in Code Block 3.6.
Code Block 3.5: For Loop in R
1 #WITH BRACKETS
2 for(i in 1:5)
3 {
4   j=i*2
5   print(j); #RETURNS VALUES 2,4,6,8,10
6 }
7 #WITHOUT BRACKETS
8 for(i in 1:5)
9   j=i*2
10 print(j); #RETURNS 10
Code Block 3.6: Name Checking Example
1 namesofstudents <- c("ali","veli","huseyin","bekir","john","ahmet","micheal","ayse");
2 counter=0;
3 for(i in 1:length(namesofstudents))
4 {
5   nm <- namesofstudents[i]
6   if(substr(nm,1,1)=="a")
7     counter=counter+1;
8 }
9 print(counter)
10 #CONSOLE OUTPUT: 3
11
12 counter=0;
13 for(i in namesofstudents)
14 {
15   if(substr(i,1,1)=="a")
16     counter=counter+1;
17 }
18 print(counter)
19 #CONSOLE OUTPUT: 3
The first line of Code Block 3.6 sets a vector containing the names of the students, and a counter variable is initialized to zero. In the for loop running from 1 to the length of the vector namesofstudents, each element of the vector is assigned to a variable nm. The if statement checks whether the first character of each name is 'a'; if so, the counter is increased by one. A shorter version of the same for loop is given in lines 12-18 of Code Block 3.6, which shows that for loops can be built using the vector itself instead of reaching each element individually via an index. In this version, the for loop assigns each element of the vector namesofstudents directly to i, which renders the fifth line redundant.
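Since substr() and comparison are themselves vectorized, the whole count can even be done without an explicit loop; a sketch equivalent to Code Block 3.6:

```r
namesofstudents <- c("ali","veli","huseyin","bekir","john","ahmet","micheal","ayse")
# Compare the first character of every name at once and sum the TRUE values
counter <- sum(substr(namesofstudents, 1, 1) == "a")
counter  # returns 3
```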
Chapter 4
Descriptive Modeling Using R
4.1 Measures of Dissimilarity Between Objects
The distance between different objects in a data set can be measured using different criteria. One common criterion for this purpose is the Euclidean distance, denoted d(i, j):

d(i, j) = ( Σ_k (x_i(k) − x_j(k))^2 )^(1/2),    (4.1)

where x_i(k) is the k-th element of the vector x_i, which represents the i-th object in the dataset.
Similarly, one can define the Manhattan distance d_M(i, j) to measure dissimilarity as follows:

d_M(i, j) = Σ_k |x_i(k) − x_j(k)|.    (4.2)
Different distance definitions might yield different results in the clustering algorithms presented in the following sections.
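A minimal sketch comparing the two measures on two points, (0, 0) and (3, 4):

```r
# dist() computes Euclidean distances by default; method selects the metric
x <- rbind(c(0, 0), c(3, 4))
dist(x)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")  # Manhattan: |3| + |4| = 7
```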
The distances of a set of objects can best be expressed with a symmetric matrix, specifically called the distance matrix. In the i-th row and j-th column of the matrix, there exists the distance between the i-th and j-th elements of the set.
    | 0     d_12  d_13  ...  d_1n |
D = | d_21  0     d_23  ...  d_2n |    (4.3)
    | ...                    ...  |
    | d_n1  d_n2  d_n3  ...  0    |
In R, we calculate the distance matrix (in a special compact form) using the function dist(). In the example below, we calculate and plot the distances between the first 20 products of SuperSale Grocery.

Example: Distance Between Products of Supersale Groceries

Code Block 4.1: Get Data For Products of SuperSale Groceries
1 library("RSQLite")
2 library(DBI)
3 db <- dbConnect(SQLite(), dbname="Supersale.db")
4 prod = dbGetQuery(db,"SELECT T1.PRODUCT_ID, T1.UNIT_PRICE, T2.SALES
5   FROM M_PRODUCTS T1, (SELECT PRODUCT_ID, SUM(SALE_QUANT) AS SALES FROM
6   TR_SALES GROUP BY PRODUCT_ID) T2 WHERE T1.PRODUCT_ID = T2.PRODUCT_ID")

The first three lines of Code Block 4.1 load the libraries and connect to the database. In lines 4-6 we execute the query and retrieve the data from the database.
The second line of Code Block 4.2 defines an array from 1 to 20, and line 4 plots the first 20 products, stored in the data frame prod, retrieved from the database. Line 7 calculates the distances between the first 20 products and assigns them to U. The distances are turned into matrix form with data.matrix(U), and we plot them using the function image(). In that function, we suppress the x and y axis labels by setting the parameters xaxt and yaxt to 'n'. Lines 15-17 create strings with product names; line 18 labels the points of the scatter plot, and lines 20 and 21 put the product names on the x and y axes.
4.2 K-Means Algorithm In R
In order to cluster data sets with multiple dimensions, we use the K-means algorithm as implemented in R. As indicated in the documentation, the K-means algorithm is called with kmeans(data, centers, iter.max), and by default it runs the algorithm developed by Hartigan and Wong (1979). Note that there are multiple algorithms for partition-based clustering in the literature.
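Before applying it to the course data, a minimal sketch on synthetic data shows the call and the main fields of the returned object (the data here are invented for illustration):

```r
# Two well-separated synthetic groups of 10 two-dimensional points each
set.seed(42)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
kclus <- kmeans(pts, centers = 2, iter.max = 100)
kclus$cluster  # cluster label of each observation
kclus$centers  # coordinates of the cluster centers
```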
Let’s consider the data set from Supersale Grocery store example to illustrate the mechanics
of K-means algorithm.
Example: Bivariate K-Means Algorithm
The Supersale Grocery store is willing to cluster its customers based on their total sales and their cities. To illustrate the solution with better graphics, we will present results for the first 20 products in their database.
Code Block 4.2: Plot Distances
1 #PLOT DATA ONTO TWO AXES
2 v=c(1:20);
3 windows()
4 plot(prod[v,c(2,3)],main="PRODUCT PRICE AND TOTAL SALES")
5
6 #CALCULATE DISTANCE
7 U=dist(prod[c(1:20),c(2,3)])
8
9 #PLOT DISTANCES
10 windows()
11 image(v,v,data.matrix(U),main="Distance Matrix",
12   xlab="Products",ylab="Products",xaxt='n',yaxt='n')
13
14 #LABELS FOR AXES
15 txtvct=array(0,20)
16 for(i in v)
17   txtvct[i]=sprintf("prd-%d",i);
18 text(prod[v,2],prod[v,3],txtvct,pos=3)
19
20 axis(1,at=v,labels=txtvct,cex.axis=0.5,las=3)
21 axis(2,at=v,labels=txtvct,cex.axis=0.5,las=1)
[Figure 4.1: First 20 Products of Supersale Grocery. Scatter plot titled "PRODUCT PRICE AND TOTAL SALES" with SALES on the y-axis, UNIT_PRICE on the x-axis, and points labeled prd-1 through prd-20.]
[Figure 4.2: Distance Matrix Between Products of Supersale Grocery. Image plot of the distance matrix with both axes labeled prd-1 through prd-20.]
We retrieve data from the database using a subquery in lines 1-7 of Code Block 4.3. The data
come with categorical text, the names of cities, which we transform into numeric codes so that they
can be processed by the K-means algorithm in R.
We take the second and third columns of the data set and feed them into the kmeans function,
setting the required number of clusters to 4. The results are stored in the variable kclus. To
obtain the cluster of each customer, we use kclus$cluster, as presented in the fifth line of Code Block
4.4. By combining the cluster of each observation with the total sales and city information, we
form a new data frame with the function data.frame() (line 5). Lines 8-11 plot the customers without
any indication of clusters. This plot is given in Figure 4.3.
In Code Block 4.5, we plot the clustered customers using different colors. The plot() function in R sets
the main title, axis titles and axis limits; however, each call to plot() starts a new plot. To add the
remaining clusters in different colors, we therefore use the points() function, which must
follow a plot() call to work properly. Axis ticks are labeled with the function axis(), and we
add a legend to the plot in lines 11-12. The resulting plot is presented in Figure 4.4, where
clusters are indicated with different colors: cluster 1 is red, cluster 2 is blue, cluster
3 is green, and cluster 4 is black.
Code Block 4.3: Data Retrieval
1  library("RSQLite");
2  library("DBI");
3  #GET PRODUCT DATA FROM THE DATABASE
4  prod=dbGetQuery(db," SELECT T1.CUSTOMER_ID,T1.SALES, T2.CITY
5  FROM (SELECT CUSTOMER_ID,SUM(SALE_QUANT) AS SALES FROM TR_SALES
6  GROUP BY CUSTOMER_ID) T1, M_CUSTOMERS T2 WHERE
7  T1.CUSTOMER_ID=T2.CUSTOMER_ID")
8  #CHANGE CATEGORICAL TEXT DATA INTO NUMERIC (CITY is the third column)
9  prod[(prod[,3]=="IST"),3]=0
10 prod[(prod[,3]=="ANK"),3]=1
11 prod[(prod[,3]=="IZM"),3]=2
12 prod[(prod[,3]=="ADN"),3]=3
Code Block 4.4: Cluster Products With K-means
1  #GET TOTAL SALES AND CITY INFO
2  cls=prod[,c(2,3)]
3  kclus=kmeans(cls,4)
4  #WE COLLECT CLUSTER AND OTHER DATA OF EACH CUSTOMER INTO A DATA FRAME
5  custcluster=data.frame(kclus$cluster,cls$SALES,cls$CITY)
6  #LETS PLOT UNCLUSTERED DATA
7  windows()
8  plot(custcluster[,c(2,3)],ylab="CITY",xlab="TOTAL SALES",yaxt='n',
9  main="SALES & CITIES OF CUSTOMERS")
10 #PLOT WITHOUT ANY AXIS TICKS WHICH ARE ADDED LATER!
11 axis(2,at=c(1:4),labels=c("IST","ANK","IZM","ADN"))
Code Block 4.5: Plotting Product Clusters
1  #NOW PLOT CLUSTERED DATA USING DIFFERENT COLORS...
2  maxlim1=max(custcluster[,2]);minlim1=min(custcluster[,2]);
3  plot(custcluster[(custcluster[,1]==1),c(2,3)],col="red",
4  xlim=c(minlim1,maxlim1), pch=19, main="Clusters of Customers",
5  xlab="TOTAL SALES",yaxt='n',ylab="CITY")
6  points(custcluster[(custcluster[,1]==2),c(2,3)],col="blue",pch=19)
7  points(custcluster[(custcluster[,1]==3),c(2,3)],col="green",pch=19)
8  points(custcluster[(custcluster[,1]==4),c(2,3)],col="black",pch=19)
9  axis(2,at=c(1:4),labels=c("IST","ANK","IZM","ADN"))
10 txt1="Cluster1"; txt2="Cluster2"; txt3="Cluster3"; txt4="Cluster4";
11 legend(800,4,c(txt1,txt2,txt3,txt4),col=c("red","blue","green","black")
12 ,pch=19)
[Scatter plot "SALES & CITIES OF CUSTOMERS": TOTAL SALES on the x-axis, CITY (IST, ANK, IZM, ADN) on the y-axis.]
Figure 4.3: Customers in Different Cities
[Scatter plot "Clusters of Customers": TOTAL SALES vs. CITY, with legend entries Cluster1 to Cluster4.]
Figure 4.4: Clustered Customers in Different Colors
4.3 Hierarchical Clustering
Hierarchical clustering is a widely used method to obtain nested clusters for a data set. Unlike
the K-means algorithm, hierarchical clustering methods do not require a prespecified number of
clusters. Working either bottom-up or top-down, the algorithms group clusters by comparing the
distances between them.
The general search scheme of hierarchical clustering methods relies on comparing the distances
between all pairs of clusters and iteratively joining the closest ones.
One important concern in hierarchical clustering is finding a representative point for each
cluster that contains multiple observations. There are several possible approaches to this issue. The
most straightforward is to calculate centroids (cluster centers), which can be formulated as
follows. Let C be a cluster containing observations {x | x ∈ C}, where x = (x_1, x_2, ..., x_p) is a
p-dimensional vector. Then the centroid of the cluster C is denoted by

x̄ = (x̄_1, x̄_2, ..., x̄_p),

where x̄_k is the average of the k-th elements of all observations in C. Using centroids of clusters has a
nice interpretation and provides a balanced representation of all elements in a cluster. However,
the computer has to re-calculate the distances between the centroids of the clusters in each iteration,
as centroids change when elements are added to or removed from clusters. Yet for small or medium size
data sets this is not a pressing concern, and many hierarchical clustering algorithms use centroids.
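In R, the centroid formula amounts to a column-wise average, i.e. colMeans(); a tiny sketch with a hypothetical three-observation cluster:

```r
# A hypothetical cluster C of three observations in p = 2 dimensions
C <- rbind(c(1, 2),
           c(3, 4),
           c(5, 6))

# The centroid is the vector of column averages (x1-bar, x2-bar)
centroid <- colMeans(C)
centroid   # 3 4
```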
In the following example, we present agglomerative (bottom-up) hierarchical clustering of a
sample of black cherry trees.
Example: Hierarchical Clustering of Black Cherry Trees
A built-in data set containing girth, height and volume measurements of black cherry trees is consid-
ered in this example. The data set can be accessed directly from R by loading the package datasets,
which ships with base R, as in the first line of Code Block 4.6. We then calculate the
distances between observations (line 2) and feed them into the function hclust() (line 3), which
clusters the observations using a bottom-up algorithm. Plotting the result of a hierarchical clustering
algorithm generates a dendrogram. For the black cherry data set, the dendrogram is given in
Figure 4.5.
For large data sets, it may be more useful to extract the clusters themselves than to analyze the dendrogram,
which can become messy and complex. To obtain the clusters, we execute the code in the fifth line
Code Block 4.6: Hierarchical Clustering for Black Cherry Trees
1 library(datasets)
2 dist_tree=dist(trees)
3 mdist_tree=hclust(dist_tree)
4 plot(mdist_tree)
5 mdist_tree$merge
of Code Block 4.6. The fifth line prints a matrix with two columns whose rows represent merges.
Negative elements in this matrix stand for singletons (clusters containing one observation),
whereas positive elements refer to clusters formed in earlier rows, i.e. clusters with more than one object.
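As a sketch of how to read the merge matrix, consider a tiny synthetic example (any data set works; the structure of $merge is the same):

```r
# Four 1-D points: 1 and 1.1 are closest, then 5 and 5.2, then the two pairs
x <- c(1, 1.1, 5, 5.2)
hc <- hclust(dist(x))
hc$merge
# Row 1 merges singletons 1 and 2 (negative entries),
# row 2 merges singletons 3 and 4,
# row 3 joins the clusters formed in rows 1 and 2 (positive entries).

# cutree() extracts a flat clustering from the hierarchy
cutree(hc, k = 2)   # observations 1-2 in one cluster, 3-4 in the other
```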
[Dendrogram of the 31 trees, produced by hclust(*, "complete") on dist_tree; heights range from 0 to about 60.]
Figure 4.5: Black Cherry Tree Sample Clustered Hierarchically (Agglomerative)
In the following example, we present a hierarchical clustering for a large data set retrieved
from the SuperSale Grocery database.
Hierarchical Clustering Customers of SuperSale Grocery
In Code Block 4.7, we retrieve data from the database of SuperSale Grocery.
Code Block 4.7: Data Retrieval from SuperSale Grocery
1 db <- dbConnect(SQLite(),dbname="Lecture3.db")
2 xx=dbGetQuery(db," SELECT T1.CUSTOMER_ID, SUM(T1.SALE_QUANT),
3 SUM(T1.TOTAL_AMOUNT), T2.GENDER FROM TR_SALES T1, M_CUSTOMERS T2
4 WHERE T1.CUSTOMER_ID=T2.CUSTOMER_ID GROUP BY T1.CUSTOMER_ID");
Code Block 4.8: Hierarchical Clustering for Customers of SuperSale Grocery
1  for(i in 1:length(xx[,1]))
2  {
3    xx[i,1]=sprintf("Cust.ID=%d",i)
4  }
5
6  #CHANGE CATEGORICAL TEXT INTO NUMERIC
7  xx[(xx[,4]=="F"),4]=1
8  xx[(xx[,4]=="M"),4]=0
9
10 #CALCULATE DISTANCE
11 distCust=dist(xx[,c(2:4)])
12 #HIERARCHICAL CLUSTER
13 hclust(distCust)
14 clusterd_cust=hclust(distCust)
15 #PLOT
16 plot(clusterd_cust)
In Code Block 4.8, we first build strings for the customer ids in lines 1-4. Then we change the
categorical text into numeric codes as we did before. Distances are calculated in line 11, and the dendrogram
in Figure 4.6 is generated in line 16. This dendrogram illustrates the weakness
of dendrograms for large data sets: they become unreadable due to the limited space
on the x-axis.
[Dendrogram of the 1000 customers, produced by hclust(*, "complete") on distCust; heights range from 0 to about 8000, with leaf labels too dense to read.]
Figure 4.6: Customers of SuperSale Groceries Clustered Hierarchically (Agglomerative)
Chapter 5
Predictive Modeling
5.1 Perceptrons
The perceptron is one of the earliest classification techniques. It relies on the idea of finding a threshold
function (of the input factors) by checking all objects in the data set iteratively. In each iteration, the
threshold function is adjusted using the misclassified objects in the data set. Depending on the form of the
threshold function, e.g. linear, quadratic etc., denoted by s(.), the adjustment due to misclassification
is applied in different forms. Each misclassification adjustment takes place with a learning
rate λ.
The pseudocode for a generic perceptron algorithm with a threshold function s() and threshold
value thr.val can be given as follows:
Algorithm 1 Generic Algorithm for Perceptron
1: misclassflag = TRUE
2: while misclassflag is TRUE do
3:   misclassflag = FALSE
4:   for all x_i ∈ X do
5:     Estimate the class of x_i with s(x_i) and assign it to Ĉ_i.
6:     if Ĉ_i ≠ C_i then
7:       Update s(.) using x_i and λ
8:       misclassflag = TRUE
9:     end if
10:  end for
11: end while
In this chapter, we consider a linear threshold function, which is expressed as a weighted sum of the
input columns: s(x_i) = Σ_j w_j x_ij, where x_ij is the j-th element of input vector i. The algorithm
starts with an initial value of w. For each input vector x_i, we calculate the estimated class, Ĉ_i,
and compare it with the actual class, C_i, of object i. If there is a misclassification, then
the threshold function s(.) is updated with learning rate λ. The update mechanism is taken as follows:
29
5.1. PERCEPTRONS Monday 20th February, 2017
Code Block 5.1: Linear Perceptron
1  perceptron <-
2  function(y,x,thr,lambda=0.01,winit=array(0,length(x[1,])),max.iter=10^6)
3  {
4    w=winit;
5    iteration=1;
6    incorrect.class=TRUE;
7    while(incorrect.class)
8    {
9      incorrect.class=FALSE;
10     for(i in 1:length(y))
11     {
12       C.hat=ifelse((as.numeric(x[i,])%*%as.numeric(w)>=thr),1,-1)
13       if(y[i]!=C.hat)
14       {
15         w=w + lambda*y[i]*x[i,];  #signed update: move w toward the correct side
16         incorrect.class=TRUE;
17       }
18     }
19     iteration=iteration+1;
20     if(iteration>max.iter)
21       incorrect.class=FALSE;
22     print(paste("Iteration#",iteration))
23   }
24   w
25 }
Whenever the element x_i is misclassified, the weight vector w = (w_1, w_2, ..., w_p) is updated
using the formula below, where y_i ∈ {−1, +1} is the actual class of x_i (the sign ensures the
weights move toward the correct side of the threshold):

w = w + λ y_i (x_i1, x_i2, ..., x_ip).
The application of Algorithm 1 in R is given in Code Block 5.1, in which we
define a perceptron function that takes the actual class variable as y and the input data frame as x.
λ is the learning rate, and the initial weight values are set to zero. In order to avoid an infinite loop, we
introduce another parameter, max.iter, after which we stop the perceptron algorithm
and return the resulting weight vector.
In the R program, the estimation of the class of the dependent variable is done in line 12 and the
update is executed in line 15; these are the two core operations of a perceptron algorithm. In each
iteration, the algorithm goes through the entire data set and checks for misclassification. The program ends
either when no misclassification occurs in the for loop (lines 10-18) or when the number of iterations
exceeds the allowed maximum given as a parameter.
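As a usage sketch, the function can be exercised on a small synthetic, linearly separable sample. The data and names below are our own; the function body is a condensed restatement of Code Block 5.1 (with the class-signed update and a matrix input for simplicity):

```r
# Condensed perceptron with the signed update w = w + lambda * y_i * x_i
perceptron <- function(y, x, thr, lambda = 0.01,
                       winit = array(0, length(x[1, ])), max.iter = 10^4) {
  w <- winit
  iteration <- 1
  incorrect.class <- TRUE
  while (incorrect.class) {
    incorrect.class <- FALSE
    for (i in 1:length(y)) {
      C.hat <- ifelse(as.numeric(x[i, ]) %*% as.numeric(w) >= thr, 1, -1)
      if (y[i] != C.hat) {
        w <- w + lambda * y[i] * x[i, ]
        incorrect.class <- TRUE
      }
    }
    iteration <- iteration + 1
    if (iteration > max.iter) incorrect.class <- FALSE
  }
  w
}

# Toy separable data: class +1 around (3,3), class -1 around (-3,-3)
set.seed(2)
x <- rbind(cbind(rnorm(20, 3), rnorm(20, 3)),
           cbind(rnorm(20, -3), rnorm(20, -3)))
y <- c(rep(1, 20), rep(-1, 20))

w <- perceptron(y, x, thr = 0)
pred <- ifelse(x %*% as.numeric(w) >= 0, 1, -1)
mean(pred == y)   # should equal 1 on this separable sample
```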
We should note that the linear threshold function of the perceptron assumes that the two classes
are linearly separable, i.e. there exists a hyperplane w^T x that separates the two classes perfectly
(Alpaydin, 2014). If, on the other hand, the classes cannot be separated, the perceptron algorithm
may fail to converge to a good weight vector. Examples of inseparable and separable classes are
given in Figures 5.1a and 5.1b.
[Two scatter plots of AVG.TRANSACTION vs. MONTHLY_INCOME: (a) Inseparable Classes, (b) Separable Classes.]
Figure 5.1: Separable and Inseparable Classes
In Figures 5.1a and 5.1b, a data set containing the customers of SaversBank is plotted. In the left-
hand-side graph, customers are classified according to their credit card types. Blue circles indicate
customers using Visa, whereas the red ones are for other types of credit cards (American Express
and MasterCard). If, on the other hand, the classification is based on the cities of the customers,
then we end up with separable classes (Figure 5.1b). The R program that retrieves the data set and
produces the plots is given in Code Block 5.2.
Before closing our section on the perceptron, we should make one more comment on the threshold
function s(.). Alpaydin (2014) suggests that the threshold function should be taken as in
Equation 5.1, in order to cover the possibility of a nonzero intercept for the separating
hyperplane:

s(x_i) = w_0 + Σ_{j≥1} w_j x_ij.    (5.1)
For such a threshold function, the weight vector w is updated using a (p+1)-dimensional
vector having 1 as its first element. The update equation, again with y_i the actual class, is given below:

w = w + λ y_i (1, x_i1, x_i2, ..., x_ip).
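In practice, this is implemented by prepending a column of ones to the input matrix, so that w_0 is learned like any other weight; a small sketch with illustrative numbers of our own:

```r
# Hypothetical 3-observation input matrix with p = 2 columns
x <- matrix(c(1.0, 2.0,
              0.5, 1.5,
              2.0, 0.5), ncol = 2, byrow = TRUE)

# Prepend a column of ones; the first weight now plays the role of w0
x.aug <- cbind(x0 = 1, x)

# The augmented score is then simply x.aug %*% c(w0, w1, w2)
w <- c(-1, 2, 3)        # illustrative weights (w0 = -1)
scores <- x.aug %*% w   # equals w0 + w1*x1 + w2*x2 for each row
scores                  # 7, 4.5, 4.5
```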
Code Block 5.2: Data Set For (In)Seperable Classes
1  dataset=dbGetQuery(db,
2  "SELECT T4.CCARDNUM,T4.CTYPE,T4.CTY, AVG(T4.AVG_TRANS),AVG(T4.INCME)
3  FROM (SELECT T2.CARD_TYPE CTYPE,T.CREDT_CARD_NUM CCARDNUM,
4  (T.TRANS_AMOUNT) AVG_TRANS,(T2.MONTHLY_INCOME) INCME,T3.CITY CTY
5  FROM CARD_TRANS T,M_CREDIT_CARDS T2, (select CITY,CREDIT_CARD_ID,
6  CUSTOMER_ID FROM M_CUSTOMERS WHERE CREDIT_CARD_ID IS NOT NULL) T3
7  WHERE T.CREDT_CARD_NUM=T2.CREDIT_CARD_ID AND T.TRANS_TYPE='-' AND
8  T3.CREDIT_CARD_ID=T2.CREDIT_CARD_ID AND T3.CUSTOMER_ID=T2.CUSTOMER_ID )
9  T4 GROUP BY T4.CCARDNUM,T4.CTYPE,T4.CTY")
10
11 avg_expense=dataset[,4]
12 income=dataset[,5]
13
14 #Inseparable Classes
15 ydep=ifelse(dataset[,2]=="Visa",1,-1)
16 plot(avg_expense,income,col=ifelse(ydep>0,"blue","red"),
17 main="Inseparable Classes",xlab="AVG.TRANSACTION",ylab="MONTHLY_INCOME")
18
19 #Separable Classes
20 ycity=ifelse(dataset[,3]=="IST",1,-1)
21 plot(avg_expense,income,col=ifelse(ycity==1,"red","blue"),
22 main="Separable Classes",xlab="AVG.TRANSACTION",ylab="MONTHLY_INCOME")
5.2 Linear Discriminant Analysis
Another classification method is discriminant analysis, which relies on the idea of finding a hyper-
plane that best separates the classes. From this perspective the discriminant method is similar
to the perceptron (Hand et al., 2001). The discriminant function can be assumed in different forms, such as
linear, quadratic, exponential etc. In this chapter we only consider linear discriminant analysis.
For further reading on nonlinear discriminant functions, the reader is referred to Alpaydin (2014).
As stated above, the linear discriminant method relies on the idea of finding a weight vector
w that best separates the classes in the data set. Hence our discriminant function will be of the following
form:

g(x) = w^T x + w_0,    (5.2)

where w is a p-dimensional vector.
In order to find the best w vector, Fisher (1936) suggests the following treatment for the two-class
case. Let C_i be the covariance matrix of class i = 1, 2, and let C be the pooled sample covariance
matrix, formulated as follows:

C = (1 / (n_1 + n_2)) (n_1 C_1 + n_2 C_2).

The score function to be maximized for calculating w is

S(w) = (w^T µ_1 − w^T µ_2)^2 / (w^T C w),

where µ_i, i = 1, 2, are the class means.
In R, we call the linear discriminant method using the function lda(), whose usage is very similar
to that of lm(). The function returns

• the prior proportions of classes in the data set (π_i, i ∈ {1, 2, ...}),

• the class averages (µ_i, i ∈ {1, 2, ...}),

• the resulting weight vector w.
In order to predict the classes of new objects using the linear discriminant model, we call the predict
function with a data set containing the new observations. Note that predict is also useful for predicting
with other statistical models, such as linear regression, exponential smoothing etc.
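A minimal sketch of this lda()/predict() workflow on synthetic two-class data (all variable names below are our own, not from the SaversBank database):

```r
library(MASS)

# Two Gaussian classes in two dimensions
set.seed(3)
df <- data.frame(x1 = c(rnorm(30, 0), rnorm(30, 3)),
                 x2 = c(rnorm(30, 0), rnorm(30, 3)),
                 cls = factor(rep(c(0, 1), each = 30)))

fit <- lda(cls ~ x1 + x2, data = df)
fit$prior     # class proportions (pi_i)
fit$means     # class averages (mu_i)
fit$scaling   # the weight vector w

# predict() returns the estimated class of each observation
pred <- predict(fit)$class
mean(pred == df$cls)   # in-sample accuracy
```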
For two-dimensional problems, the results of the linear discriminant can be turned into a decision
boundary using Equation 5.3, suggested by Hand et al. (2001). An R program that employs this equation
is given in Code Block 5.3 for the classification of SaversBank's customers. The resulting plot of
classified customers is given in Figure 5.2.
Code Block 5.3: Linear Discriminant Analysis for Customers of SaversBank
1  library(MASS)
2  ycity=dataset[,3]
3  ycity[(ycity!="IST")]=0
4  ycity[(ycity=="IST")]=1
5
6  saversbank=data.frame(dataset[,c(4,5)])
7  ld_city=lda(ycity~.,saversbank)
8  city.lda=predict(ld_city)$class
9
10 plot(saversbank[,1],saversbank[,2],col=ifelse(city.lda==1,"red","blue"),
11 main="Linear Discriminant",xlab="AVG.TRANSACTION",ylab="MONTHLY_INCOME")
12
13 a=-1*ld_city$scaling[1]/ld_city$scaling[2]
14 b=t(ld_city$scaling)%*%(ld_city$means[2,]-ld_city$means[1,])*0.5 +
15 log(ld_city$prior)[2]-log(ld_city$prior)[1]
16
17 abline(a,b,lwd=2)
18 text(300,2000,"Threshold Function")
w^T ( x − (1/2)(µ_1 + µ_2) ) − log(π_1 / π_2) = 0.    (5.3)
The R program in Code Block 5.3 starts by loading the required library, MASS, for the linear
discriminant function. The dependent variable, ycity, is constructed in lines 2-4. The linear discrim-
inant model is fitted in line 7, and the classified customers are plotted in lines 10-11. In the rest of the
program, we compute and draw the threshold line separating them.
5.3 Decision Tree
A decision tree is a hierarchical classification technique that aims to divide the input space into subregions
in an iterative manner. A decision tree is composed of internal nodes and terminal leaves, where
internal nodes represent splits and terminal leaves indicate final decisions.
Decision tree algorithms search for threshold values that split the observations (in
the training data set) into two groups. For a continuous variable x_j, decision tree algorithms
search for a threshold t such that observations with x_j ≤ t and those with x_j > t are grouped separately.
To this end, the algorithms evaluate all possible threshold values for an internal node and select the
value that maximizes a purity criterion. The process starts at the root node and proceeds iteratively
towards the leaf nodes by selecting a threshold value for each internal node.
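This threshold search can be sketched by brute force (our own illustrative code, not the internals of any R package): try every observed value of the variable as a candidate t and keep the one whose size-weighted child impurity (here the Gini index, introduced below) is lowest.

```r
# Gini index of a vector of class labels
gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Brute-force search for the best split threshold on one variable x:
# try every observed value as candidate t, keep the lowest weighted impurity
best.split <- function(x, y) {
  best <- list(t = NA, impurity = Inf)
  for (t in sort(unique(x))) {
    left  <- y[x <= t]
    right <- y[x > t]
    if (length(left) == 0 || length(right) == 0) next
    imp <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
    if (imp < best$impurity) best <- list(t = t, impurity = imp)
  }
  best
}

# Toy data: class 0 below 5, class 1 above, so the best threshold is t = 4
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
best.split(x, y)   # t = 4, impurity = 0
```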
The simplest purity criterion is the ratio of the number of observations from class i, denoted by
[Scatter plot "Linear Discriminant": AVG.TRANSACTION vs. MONTHLY_INCOME, with the separating line labeled "Threshold Function".]
Figure 5.2: Classified Customers with Linear Discriminant Method
N_ik, to the total number of observations in the leaf node k, N_k. This ratio is denoted by

p_ik = N_ik / N_k.

Note that if p_ik is either 1 or 0 for a leaf node k, that is, all observations come from the same class, then
the split is pure (Alpaydin, 2014).
In R, we call the function rpart(), which uses the Gini index, a common impurity measure, for
decision trees (Venables and Ripley, 2002). The Gini index is formulated as follows for a node k:

G_k = 1 − Σ_{i∈C} p_ik²,

where C is the set of all classes in the data set. Minimizing the Gini index for a node k
amounts to finding a threshold t that pushes the purity ratios p_ik towards 0 or 1. Minimization of
the Gini index by rpart therefore increases the purity of the classes at the leaf nodes (Figure 5.3). Also
note that, for two classes, the worst case is a purity ratio equal to 0.5, which means there are equal
numbers of elements from each class at the node.
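The claim that 0.5 is the worst case can be checked numerically for two classes, where G_k reduces to 1 − p² − (1 − p)²:

```r
# Two-class Gini index as a function of the purity ratio p
gini2 <- function(p) 1 - (p^2 + (1 - p)^2)

p <- seq(0, 1, by = 0.25)
g <- gini2(p)

gini2(0)          # 0   : pure node
gini2(1)          # 0   : pure node
gini2(0.5)        # 0.5 : equal class mix, the worst case
p[which.max(g)]   # the maximum is attained at p = 0.5
```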
Example: Decision Tree Classifier for Customers of SaversBank
Managers of SaversBank want to understand the customer base of the bank in order to
design better services and promotions that can be executed with simple rules. Also, in their
latest campaigns, they particularly want to focus on the customers in Izmir. To this end, they
want to employ a decision tree classification using the city, monthly income and maximum amount of savings of their
[Curve of the Gini index against p(ik): zero at p(ik) = 0 and 1, with a maximum of 0.5 at p(ik) = 0.5.]
Figure 5.3: Gini Index for Node k
customers.
The R code that extracts the data, fits a decision tree and plots it is given in Code Block 5.4. The first 8 lines
get the data from the database into R. Lines 10-13 encode the city information as a numeric variable.
After loading the required library, we fit the decision tree in line 18 and plot it in lines 20-21. The
rest of the code block draws some of the splits of the decision tree over the data, as shown in Figure 5.5.
5.4 Nearest Neighbor Methods
The nearest neighbor method is a classification algorithm which iteratively estimates the class of each
object in the test set using the objects in the training set. The idea of the algorithm is as follows:
to estimate the class of an object with input vector x in the test set, we find its k nearest
neighbors in the training set. Then we check the classes of these k objects and assign the most com-
mon class as the class estimate, ŷ. As we consider the k nearest neighbors, the algorithm is also
known as kNN in the literature.
For instance, suppose we want to estimate the class of the eighth object (in the test set) using
the seven objects in the training set given in Figure 5.6. Among them there are three class 1 and
four class 2 objects; hence the class of the eighth object is estimated as class 2 (by majority).
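The majority-vote idea can also be sketched directly, without the class package; the coordinates below are our own toy values, not the data behind Figure 5.6:

```r
# Seven training points (2-D) with known classes
train <- rbind(c(0.10, 0.10), c(0.15, 0.20), c(0.20, 0.15),              # class 1
               c(0.35, 0.40), c(0.40, 0.35), c(0.40, 0.45), c(0.45, 0.40)) # class 2
cls <- c(1, 1, 1, 2, 2, 2, 2)

# A new point to classify
newpt <- c(0.17, 0.15)

# Euclidean distance from the new point to each training point
d <- sqrt(rowSums((train - matrix(newpt, nrow(train), 2, byrow = TRUE))^2))

# Classes of the k = 3 nearest neighbors, then majority vote
k <- 3
nn <- order(d)[1:k]
votes <- table(cls[nn])
as.numeric(names(which.max(votes)))   # 1
```

Here all three nearest neighbors belong to class 1, so the new point is assigned class 1.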
Code Block 5.4: Decision Tree for Customers of SaversBank
1  library("RSQLite")
2  library(DBI)
3  db <- dbConnect(SQLite(),dbname="SaversBank.db")
4  dataset=dbGetQuery(db,"
5  select T1.CUST_ID,MAX(T1.BALANCE) max_save ,T2.MONTHLY_INCOME,T3.CITY
6  from SAVINGS_TRANS T1, M_CREDIT_CARDS T2, M_CUSTOMERS T3 WHERE
7  T1.CUST_ID=T2.CUSTOMER_ID AND T3.CUSTOMER_ID=T2.CUSTOMER_ID
8  GROUP BY T1.CUST_ID")
9
10 varcity=dataset[,4]
11 varcity[(varcity!="IZM")]=0;
12 varcity[(varcity=="IZM")]=1;
13 varcity=as.numeric(varcity)
14
15 library(rpart)
16 dataregress=dataset[,c(2,3)]
17 names(dataregress) <- c("max_save","M_INC")
18 fit=rpart(varcity~.,data=dataregress)
19
20 plot(fit, uniform=TRUE, main="Classification Tree for SaversBank")
21 text(fit,cex=0.8)
22
23 windows()
24 plot(dataregress,main="Decision Tree Split",
25 col=ifelse((varcity==1),"red","blue"))
26 abline(v=93590,col="red")
27 abline(v=38190,col="red")
28 segments(93590,3800,400000,3800,col="red")
29 segments(93590,3350,400000,3350,col="red")
30 segments(93590,4650,400000,4650,col="red")
31 segments(93590,3100,400000,3100,col="red")
32 segments(93590,4200,400000,4200,col="red")
33 segments(-500,900,38190,900,col="red")
34 legend(250000,2000,c("IZMIR","Others"),
35 col=c("red","blue"),pch=21)
[Decision tree "Classification Tree for SaversBank": root split max_save < 9.359e+04, further splits on max_save (3.819e+04, 1.491e+04) and M_INC (thresholds between 900 and 4650); leaves show the estimated proportion of Izmir customers.]
Figure 5.4: Classifying Customers of SaversBank
[Scatter plot "Decision Tree Split": max_save vs. M_INC, colored by class, with legend IZMIR / Others.]
Figure 5.5: Partition of Input Space by Decision Tree
[Scatter plot "Classify the 8th Point": Var1 vs. Var2, with legend Class1 / Class2.]
Figure 5.6: Example 1. Classify the Eighth Point
The kNN algorithm is called with knn() in R (from the package class). To apply the method properly, we
split the data set (both the input data and the class variable) into two parts, a training set and a test set.
The knn() function takes the training and test sets of the input data together with the training set of the
class variable. The estimated classes are then compared with the test set of the class variable. Such an
application of the kNN method is given in the following example.
Example: kNN Classification for Products of Supersale Groceries
SuperSale Grocery is willing to make an assessment of their products sold in different cities.
They consider their top-sale products to be the ones that generate a daily revenue larger than 150 TL.
They want a prediction tool for their new product development department, which aims to increase
the market share of the company by introducing more top-sale products to the market.
To this end, they apply to our university for a predictive classifier that works on their product
database, which they provided to us. An application of the kNN algorithm to this small case is provided
in Code Block 5.5.
The application in Code Block 5.5 starts by loading the necessary libraries and connecting to
an SQLite database (lines 1-3). Then we execute an SQL query retrieving a data
frame that includes all columns of M_PRODUCTS, the average sale quantity and the cities of the customers.
Code Block 5.5: Nearest Neighbor
1  library("RSQLite")
2  library(DBI); library(class)   #knn() lives in the class package
3  db <- dbConnect(SQLite(),dbname="Lecture3.db")
4  dataset=dbGetQuery(db,"SELECT T1.*,AVG(T2.SALE_QUANT),T3.CITY ,
5  date(datetime(T2.DATE_OF_SALE,'unixepoch')) FROM M_PRODUCTS T1,
6  TR_SALES T2, M_CUSTOMERS T3 WHERE T1.PRODUCT_ID=T2.PRODUCT_ID
7  AND T3.CUSTOMER_ID=T2.CUSTOMER_ID GROUP BY T2.PRODUCT_ID,T3.CITY,
8  date(datetime(T2.DATE_OF_SALE,'unixepoch'))")
9
10 revenue=dataset[,2]*dataset[,3]
11 classifier=(revenue>150)
12 classifier=as.numeric(classifier)
13
14 dataset[(dataset[,4]=="ADN"),4]=1; dataset[(dataset[,4]=="ANK"),4]=2;
15 dataset[(dataset[,4]=="IST"),4]=3; dataset[(dataset[,4]=="IZM"),4]=4;
16 dataset[,4]=as.numeric(dataset[,4])
17
18 n=length(dataset[,1])
19 z=c(1:n);Dev4k<-0
20 for(s in 1:7){
21 MADvect<-0
22 for(i in 1:100){
23 trainset=sample(z,n*0.7)
24 dataset.train=dataset[trainset,]
25 dataset.test=dataset[-trainset,]
26 dataset.train2=dataset.train[,c(2,3,4)]
27 dataset.test2=dataset.test[,c(2,3,4)]
28
29 classifier.train=classifier[trainset]
30 classifier.test=classifier[-trainset]
31 est=knn(dataset.train2,dataset.test2,classifier.train,k=s)
32 est=as.numeric(as.character(est))   #factor levels back to 0/1
33 MADvect[i]=mean(abs(classifier.test-est))
34 print(paste(s,i))
35 }
36 Dev4k[s]=mean(MADvect)}
We calculate the revenue by multiplying the second and third columns of the data frame. We then
define a top-seller classifier which assigns 1 to products with a revenue larger than 150 TL (lines
10-13).
In the fourth column of the data frame dataset, we have city names as strings. We transform
the city names into a numeric categorical variable and store these values back into the fourth column of
dataset (lines 14-16). In the rest of Code Block 5.5, we aim to decide on the best k value by conducting
out-of-sample tests repeatedly in the inner for loop. Each time the code enters the inner loop,
it selects a random training set (70% of the data) and assigns the rest to the test set (lines
23-34). By executing this out-of-sample test 100 times (it could be even larger, depending on the
standard deviation of the MAD), we calculate the average MAD for each k value.
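Once Dev4k has been filled by the loop above, choosing k is a one-liner; the sketch below uses hypothetical MAD values (placeholders, not output of Code Block 5.5):

```r
# Hypothetical average MAD values for k = 1..7 (placeholders, not real output)
Dev4k <- c(0.21, 0.18, 0.15, 0.14, 0.16, 0.17, 0.19)

best.k <- which.min(Dev4k)
best.k          # 4
Dev4k[best.k]   # 0.14
```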
Bibliography
Alpaydin, Ethem. 2014. Introduction to Machine Learning. MIT Press.

Becker, Richard A., John M. Chambers, Allan R. Wilks. 1988. The New S Language. Pacific Grove,
CA: Wadsworth & Brooks.

Fisher, Ronald A. 1936. The use of multiple measurements in taxonomic problems. Annals of
Eugenics 7(2) 179–188.

Hand, David J., Heikki Mannila, Padhraic Smyth. 2001. Principles of Data Mining. MIT Press.

Hartigan, John A., Manchek A. Wong. 1979. Algorithm AS 136: A k-means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics) 28(1) 100–108.

Ryan, Thomas A., Brian L. Joiner, Barbara F. Ryan, et al. 1976. Minitab Student Handbook. Duxbury
Press.

Venables, W.N., B.D. Ripley. 2002. Modern Applied Statistics with S. Springer Science & Business
Media.