IE 485 - Introduction to Data Mining - R Tutorial
Mustafa Hekimoglu, PhD
Monday 20th February, 2017
Contents
1 Introduction to Data Mining with R 3
1.1 Databases To be Used In the Course . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 SuperSale Grocery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Savers Bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Visualizing Data 7
2.1 Scatter and Line Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Programming with R 10
3.1 Some Data Types and Classes In R . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Numeric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.2 Character . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.3 Not Available (NA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.4 Data Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.5 Other Data Types In R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 If Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Loops in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 For Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.2 While Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Descriptive Modeling Using R 19
4.1 Measures of Dissimilarity Between Objects . . . . . . . . . . . . . . . . . . . . . . 19
4.2 K-Means Algorithm In R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5 Predictive Modeling 29
5.1 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4 Nearest Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 1
Introduction to Data Mining with R
This document includes the R code and brief discussions that take place in IE 485. I believe having such a document at your disposal will enhance your performance in your homeworks and projects.
As we proceed in the course, I will keep updating the document with new discussions and code. Hence the reader should keep in mind that this document is still a work in progress, and all your comments and contributions are more than welcome.
The document starts with discussions of the databases that we use in our course. Later chapters cover visualization, descriptive modeling, fundamental statistics, and predictive modeling.
1.1 Databases To be Used In the Course
The course assumes an integrated approach to data mining applications. In each homework, you
will receive a database consisting of several tables with many records. Hence, the first task of each
homework question will be retrieving data in the correct form before starting analysis.
As data is stored in databases, data miners should be able to compose appropriate queries
for their analysis purposes. The most widely used query language is Structured Query Language
(SQL). Hence, a data analyst should be able to build good SQL queries to get the right data at the
beginning of his/her analysis. To mimic this process, class participants will consider the following
databases throughout the course.
1.1.1 SuperSale Grocery
The SuperSale grocery market chain sells fresh and frozen food to its customers 24/7. They have big stores located in Adana, Ankara, Istanbul, and Izmir. Thanks to their hardworking marketing department's CRM division, they have detailed information about their customers, and they aim to exploit it as much as possible to increase their gross income.
In order to make a data analysis of their sales, customers and products, they share a portion of their databases with our university. Data is stored in the file Lecture3.db. The database provides three tables (two master tables and one transactional table).
The first table (M_CUSTOMERS) stores customer identity numbers, names, genders, cities and dates of birth. Due to confidentiality issues, customer names are not provided. The second table (M_PRODUCTS) provides product information; the company only provides its product identities and their list prices. The third table (TR_SALES) is for sales transactions, which include the date of sale, receipt id, product id, customer id, total sold quantity and total amount of sales. Due to an unknown reason, each customer purchases only one product from all stores of the company. The relations between the fields of the tables are depicted using MS Access Database Tools in Figure 1.1. Note that we use SQLite for storing and retrieving data; however, MS Access Database Tools provides a nice visualization of the relations between fields of different tables.
Figure 1.1: Database Relation Diagram for Supersale Grocery Store
In this figure, matching fields of the tables are linked together. For instance, the CUSTOMER_ID field of M_CUSTOMERS matches the CUSTOMER_ID column of the TR_SALES table. Similarly, the PRODUCT_ID columns in the M_PRODUCTS and TR_SALES tables match.
1.1.2 Savers Bank
Savers Bank is a small bank that has operations in major cities of Turkey. Its main customer base consists of upper-class people who are willing to receive premium service as well as good interest on their investments. In addition to savings accounts, Savers Bank also issues credit cards for its customers in Istanbul, Ankara, Izmir, Gaziantep, Manisa, Adana, and Bursa.
In order to analyze their customer base, spending behaviors and expenses, they are willing to employ data mining techniques. The main objective of this study is to increase customer satisfaction by proposing well-calibrated services. To this end, the Chief Operations Manager of the bank shares a small part of its database with our university.
The database consists of four tables (two master data and two transactional data): M_CUSTOMERS, M_CREDIT_CARDS, CARD_TRANS, and SAVINGS_TRANS. The M_CUSTOMERS table stores customer information of the bank, whereas the M_CREDIT_CARDS table includes details about customers' credit cards. The CARD_TRANS and SAVINGS_TRANS tables include transactional data, e.g. expenditures and payments, for credit card and savings accounts respectively. The relationships between the tables and their fields are given in Figure 1.2.
Figure 1.2: Database Relation Diagram for Savers Bank
In the M_CUSTOMERS table, customer information is stored in 10 columns. Each customer is designated with a unique customer identity number, given in the column CUSTOMER_ID. In order to protect the confidentiality of their customers, they provide the databases without any name or identity information. However, we are provided with gender, city and marital status information for the customers of the bank.
Each customer can open two types of accounts: a generic account and a savings account. All customers have a generic account defined with an account number stored in the column ACCOUNT_ID. However, 55% of customers have credit card and/or savings accounts. Credit card information is stored in the CREDIT_CARD_HOLD and CREDIT_CARD_ID columns. The former is a binary variable for the existence of a credit card for that customer (1 stands for a customer with a credit card), whereas the latter stores credit card numbers.
In the M_CREDIT_CARDS table, customer ids and credit card numbers are stored in the columns CUSTOMER_ID and CREDIT_CARD_ID. Monthly incomes of the customers are also provided in the MONTHLY_INCOME column. Issue and expiration dates of each credit card, as well as CVV numbers, are stored in related columns (fields). Also, the bank allows its customers to choose one of three international credit card companies: Visa, Mastercard, and American Express. This information is stored in the column CARD_TYPE.
The CARD_TRANS table stores credit card expenditures and payments between January 1st, 2013 and 2016. For customers who use credit cards, their monthly expenditure is deducted from their monthly salaries. Each salary and expenditure record is stored with a unique transaction id (TRANS_ID), a type (expenditure (-) or payment (+)), a transaction date, and a payment destination number. The last column includes identity numbers for the recipients of credit card payments.
The SAVINGS_TRANS table provides customers' balances and transactions. Customers who can make some savings transfer their excess amounts from their generic accounts to their savings accounts from time to time; these transactions are recorded with a positive amount. Customers who spend more than their incomes can withdraw money from their savings accounts to pay their credit card debt. The column BALANCE stores the total amount of money deposited in the savings account by that date.
Chapter 2
Visualizing Data
Data visualization is an important subject for various reasons. First, an appropriate graphic can convey a message much more efficiently than verbal statements or mathematical terms; in journalism this principle is also known as "a picture is worth a thousand words". As today's competitive world forces professionals to explain their ideas as well as possible, using graphical tools efficiently is critical for success in business life.
Furthermore, visualizing data might also be helpful in understanding and validating the results of an analysis. As the human eye is the best graphical chip in the world, analyzing simple scatter plots can often reveal a lot of information about the subject matter. For instance, the residual plots of a linear regression model, which will be discussed in later chapters, often reveal the most reliable information about how well the model fits the data.
Naturally, it is important to decide on a plot type appropriate for the aim of the visualization. Some common plot types and their usage in R are presented in the following sections.
2.1 Scatter and Line Plots
Scatter plots are the most common and most primitive way of visualizing data. Often the variable of interest is placed on the y-axis, while an index or another variable is placed on the x-axis of the plot. In R both types of scatter plots are created with the plot() function. If a single variable is given as the parameter, R places an index variable on the x-axis of the plot. To plot two variables on the two axes, one should give the x-axis variable as the first, and the y-axis variable as the second parameter of the plot function. Below we provide two plots of data collected for Black Cherry Trees by Ryan et al. (1976).
Example: Scatter Plots for Black Cherry Trees
The Black Cherry Trees data set is one of the built-in data sets in R that can be reached from the datasets package of R. Documentation for this package can be checked from this link.
[Figure 2.1: Scatter Plots for Black Cherry Trees. (a) Plot with Single Variable: Height (ft) of Black Cherry Trees against an index. (b) Plot with Two Variables: volume plotted against Girth (inch).]
As can be seen in Figure 2.1a, the heights of the trees are plotted against an index on the x-axis. This plot provides some idea about the variability and mean of the variable. The dependency between girth and volume is obvious in Figure 2.1b. Obviously a polynomial (quadratic) regression line would yield a good fit for this relationship. Recall the quadratic relation between the diameter and the volume of a cylinder.
1 library(datasets) #the datasets package ships with base R
2
3 plot(trees[,2],main="Height of Black Cherry Trees",ylab="Height (ft)")
4 windows()
5 plot(trees[,1],trees[,3],main="Volume & Girth of Black Cherry Trees",
6   ylab="Volume (cubic ft)",xlab="Girth (inch)")
In this code, we also put axis labels using the xlab and ylab parameters of the plot function. To add a graphic title, one should enter a string (in quotation marks "") for the main parameter. Furthermore, the command windows() opens a new window to plot another scatter diagram, which is a convenient way to open multiple plots at the same time. (windows() is specific to Windows; dev.new() is the platform-independent equivalent.)
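The quadratic relation noted above can also be checked numerically. The sketch below fits a quadratic polynomial to the built-in trees data; the model choice here is illustrative, not part of the course code:

```r
# Fit Volume as a quadratic function of Girth using the built-in trees data set
fit <- lm(Volume ~ poly(Girth, 2), data = trees)
summary(fit)$r.squared  # a high R-squared indicates a good quadratic fit
```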
2.2 Histograms
Histograms are nice representations of data sets which convey very valuable information about the mode or median of a distribution as well as its tails and mean. To obtain a single-variable histogram from a data set, we execute the function hist() with the array of values as the parameter.
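For instance, the tree heights from the Black Cherry Trees data of Section 2.1 can be summarized with a single call:

```r
# Single-variable histogram of tree heights from the built-in trees data set
hist(trees$Height, main = "Height of Black Cherry Trees", xlab = "Height (ft)")
```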
Chapter 3
Programming with R
R is a medium-level programming language that operates on vectors. One can define loops and user-defined functions, and conduct statistical tests and simulations in R. To use all of this flexibility, we should understand the working mechanics of programming in R.
Almost every programming book on a specific language starts with a chapter on the data types recognized by the compiler. I will follow this rich tradition by introducing basic data types and classes that may help you with the initial learning curve of R. However, I should note that the concepts introduced here are far from exhaustive, as the classes and data types recognized by R depend on the libraries in use, and there is a vast number of libraries and classes out there.
3.1 Some Data Types and Classes In R
3.1.1 Numeric
Numeric is the default computational type in R (Becker et al., 1988). When we assign a number to a variable, R defines that variable as numeric. To see this, you can execute the code in the first two lines of Code Block 3.1. The function class() returns the class of an object defined in the memory of R.
In some cases, it might be useful to transform an object of a different format into the numeric type. For instance, we define a variable of character type, which is explained below, in the third line of Code Block 3.1. You cannot execute mathematical operations on this variable, and if you check its class you will see that it is in the character format. Transforming the variable into the numeric format by means of as.numeric() allows you to use the object a in mathematical computations; the fourth line of Code Block 3.1 returns 22. Furthermore, you can check whether an object is of numeric type by executing is.numeric(). This function yields TRUE if the variable's type is numeric.
Code Block 3.1: Histogram for Normal Variables
1 x <- 2;  #ASSIGN A VALUE TO THE VARIABLE x
2 class(x) #CHECK THE CLASS OF x
3 a <- "11";
4 as.numeric(a)*2
5 ptitle = "Histogram for Normal Random Variables";
6 hist(rnorm(10000,0,5),main=ptitle);
In R, another less common data format for numbers is double. Practically numeric and double
types can be used interchangeably.
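A minimal sketch of this interchangeability:

```r
x <- 2
class(x)       # "numeric"
typeof(x)      # "double" - numeric values are stored as doubles internally
is.numeric(x)  # TRUE
```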
3.1.2 Character
When you enter a string of textual characters between quotation marks, R treats them as character format. This format is useful for displaying messages in the terminal, or for modifying plot labels or the names of columns.
In the R language, strings are expressed using quotation marks ("). Assigning a character array to a variable might be helpful for displaying your program output on the console screen or for setting the main or axis titles of plots. In the last two lines of Code Block 3.1, we set a character array to a variable and assign the main title of a plot using this variable. The output of the code block is given in Figure 3.1. The same method can be applied for assigning titles to the horizontal and vertical axes using the parameters xlab and ylab of the plot function. Such an example is given in Chapter 2.
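Character values can also be combined with paste(), which is handy when building plot titles programmatically; a small sketch (the column name is illustrative):

```r
# Build a plot title from character pieces; paste() separates them with a space
colname <- "MONTHLY_INCOME"
ptitle  <- paste("Histogram of", colname)
ptitle  # returns "Histogram of MONTHLY_INCOME"
```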
3.1.3 Not Available (NA)
The R compiler uses NA for missing values. All mathematical operations with NA values return NA. The existence of this data type is checked with the function is.na(). In Code Block 3.2, we provide an example of a vector with missing elements. The first two lines exemplify our explanations of NA and the is.na() function. The fifth line of the code block shows one of the most convenient uses of vectors in R.
Elements of data frames, vectors and matrices can be reached with [] in R, as explained in Subsection 3.1.4. In the fifth line, we remove the NA element from the vector by negating the logical vector obtained from is.na() and using it as an indexing variable. This usage will appear in the following chapters of our course.
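Besides indexing with !is.na(), many base functions accept an na.rm argument that removes NA values before computing; a minimal sketch:

```r
f <- c(1, 3, 4, NA, 6)
mean(f)               # returns NA - arithmetic with NA propagates NA
mean(f, na.rm = TRUE) # returns 3.5 - the NA is dropped before averaging
```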
[Figure 3.1: Histogram of Normally Distributed Numbers. Frequency histogram of rnorm(10000, 0, 5), with main title "Histogram for Normal Random Variables".]
Code Block 3.2: NA Example
1 f = c(1,3,4,NA,6)
2 is.na(f)
3 #returns FALSE FALSE FALSE TRUE FALSE
4
5 f[!is.na(f)]*5
6 #returns 5 15 20 30
Code Block 3.3: Savers Bank Data Retrieval
1 library("RSQLite")
2 library(DBI)
3 db <- dbConnect(SQLite(), dbname="SaversBank.db")
4 dbListTables(db)
5 dbListFields(db,"M_CREDIT_CARDS")
6 creditcarddata <- dbGetQuery(db,"SELECT CUSTOMER_ID, MONTHLY_INCOME
7   FROM M_CREDIT_CARDS WHERE MONTHLY_INCOME > 3000")
8 hist(creditcarddata$MONTHLY_INCOME)
9 is.data.frame(creditcarddata) #RETURNS TRUE
10 is.data.frame(creditcarddata$MONTHLY_INCOME) #RETURNS FALSE
3.1.4 Data Frame
Data frames are "tightly coupled collections of variables" stored in the memory of R. When a data set is read from a file or retrieved from a database, it is stored as a data frame in memory by default. While data frames are similar to matrices (or arrays), they also have useful functionalities that are commonly employed by R programmers.
Data frames are actually tables in which each column has a specific name stored as an attribute of the object. This property is exploited by programmers to reach specific columns of data frames using the $ operator. An example of such usage is presented in the following example, for which the R code is given in Code Block 3.3.
Example: Distribution of Savers Bank Credit Card Holders
The managers of Savers Bank decide to adopt an aggressive promotion program in order to increase the monthly spending rate of their customers, since credit cards are the primary source of profit for the company. The first step is finding the distribution of customers who can afford to spend more, and they aim to reach this customer sample by looking at monthly incomes.
In Code Block 3.3, we first retrieve the data from the database provided by the bank and execute the necessary SQL query, which filters the customers with monthly incomes larger than 3000. Then we obtain the histogram of those customers' monthly incomes using the code in line 8 by reaching the column MONTHLY_INCOME of the data frame creditcarddata. The resulting histogram is given in Figure 3.2. Explanations of lines 9 and 10 are provided below.
Arrays and matrices can be transformed into a data frame by executing the function data.frame(). The arrays to be collected in a single data frame are passed to the function separated by commas. Naturally, one important point when collecting individual arrays into a data frame is that the arrays' lengths must be equal.
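A minimal sketch of building a data frame from equal-length arrays (the column names here are illustrative):

```r
# Collect two equal-length arrays into a data frame with named columns
ids    <- c(101, 102, 103)
cities <- c("IST", "ANK", "IZM")
customers <- data.frame(CUSTOMER_ID = ids, CITY = cities)
is.data.frame(customers)  # returns TRUE
customers$CITY            # reach a single column with the $ operator
```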
[Figure 3.2: Histogram of Savers Bank's Credit Card Holders with Monthly Income Larger Than 3000. Frequency histogram of creditcarddata$MONTHLY_INCOME, ranging from 3000 to 10000.]
You can also check whether an object's type is data frame by executing the is.data.frame() function. Line 9 of Code Block 3.3 returns TRUE as the data set is stored in data frame format. On the other hand, Line 10 returns FALSE since the R compiler returns an array when a column of a data frame is reached using $.
3.1.5 Other Data Types In R
In the previous sections, I provided brief explanations of the main data types in R. However, as noted at the beginning of the chapter, there are many more types in R. Hence, even years of programming experience with R do not guarantee complete knowledge of its data types. The good news is that a vast amount of information is easily accessible on the internet. Personally, I don't believe you need to know all data types and classes by heart; instead, you should know how to use the documentation of R and search engines to find solutions for your project.
Furthermore, the most helpful source in developing your own R program is the documentation. You can reach detailed explanations of the inputs, outputs, and parameters of a function by putting a question mark before the name of the function and hitting enter. For instance, typing ?rpois and hitting enter yields an HTML page explaining a function that simulates a given number of Poisson random variables for a given distribution parameter λ. Also keep in mind that the majority of R documentation pages include examples at the bottom which explain the usage of the function in different settings.
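As a quick sketch of the function mentioned above (the lambda value is chosen purely for illustration):

```r
# Simulate 10000 Poisson random variables with mean lambda = 3
set.seed(1)  # fix the seed for reproducibility
x <- rpois(10000, lambda = 3)
mean(x)  # close to 3, since the sample mean estimates lambda
```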
3.2 If Statement
Conditional statements are one of the most commonly used code structures of the R language. The structure consists of a conditional statement and an executable piece of program to be run if the condition is satisfied. The syntax starts with an "if" followed by the conditional expression given in parentheses. Then the executable lines should be given within curly brackets. The R compiler checks the condition and, if it holds, executes starting from the first executable line. If the executable part of the statement is a single line, then the if statement can be given without any curly brackets. Such a usage of the if statement is given in the first three lines of Code Block 3.4. In the first line, we set x to 2, which satisfies the condition in line 2. Hence the code writes "x is greater-than zero" to the console.
Another variant of the if statement includes else. This code structure checks the condition statement and executes the code that follows. If the condition is not satisfied, then the compiler executes the lines following "else". An example of such a usage is given between lines 6 and 10. The condition in line 6 holds since x = 2, as set in the first line of Code Block 3.4, so y becomes 0.5. If x were equal to 0, the compiler would execute lines 9 and 10 instead.
In R, four fundamental conditional operators can be expressed as follows: >= and <= check greater-than-or-equal-to and smaller-than-or-equal-to conditions respectively; removing the equality sign leads to strict inequality conditions (> and <). == and != stand for "equal to" and "not equal to" conditions. These fundamental condition statements are used to build composite conditional statements using the two logical operators "and" and "or", which are expressed with && and ||. An example of composite logical statements together with the if-else if-else structure is given in lines 14 to 17 of Code Block 3.4. Set x = NA: the combined condition in line 14 then evaluates to FALSE (x != 0 itself returns NA, but NA && FALSE is FALSE). The condition in line 16 is TRUE and line 16 writes "Data type is NA!" to the console. If x = 0, the code returns "Division by Zero Becomes Infinity!" and 0 is assigned to y.
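For completeness, R also offers a vectorized conditional, ifelse(), which applies the test elementwise rather than once; a minimal sketch:

```r
# ifelse() evaluates the condition for every element of the vector
x <- c(-2, 0, 3)
ifelse(x > 0, "positive", "non-positive")
# returns "non-positive" "non-positive" "positive"
```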
3.3 Loops in R
In any programming environment, loops are among the most commonly used structures, as the majority of calculations proceed in an iterative fashion. In R, you can employ different types of loops using the appropriate syntax. In this section, we present those loop structures.
Code Block 3.4: Three Different If Statements
1 x=2;
2 if(x>0)
3   print("x is greater-than zero");
4 #The program returns "x is greater-than zero"
5
6 if(x!=0)
7 { y=1/x;
8 } else {
9   print("Division by Zero Becomes Infinity!");
10   y=0;}
11 #For x=0, the program returns "Division by Zero Becomes Infinity!"
12
13 x=NA
14 if((x!=0)&&(!is.na(x)))
15 { y=1/x;
16 } else if(is.na(x)){ print("Data type is NA!");
17 } else {
18   print("Division by Zero Becomes Infinity!");
19   y=0;}
20 #For x=NA, the program returns "Data type is NA!"
21 #For x=0, the program returns "Division by Zero Becomes Infinity!"
3.3.1 For Loop
The for loop is a structure that executes a block of code a finite number of times. The loop consists of an (iteration) index with start and end values. The compiler assumes that the end value is larger than or equal to the start value and executes the code provided between the brackets end − start + 1 times; in each iteration the index is increased by one. If there is no bracket after the for statement, then the compiler only iterates the line following the for statement. Such a situation is provided in Code Block 3.5.
The code given in lines 2-6 iterates i from 1 to 5, calculates j by multiplying i by 2, and prints the value assigned to j to the screen. On the other hand, the code in lines 8-10 prints only 10: it iterates i from 1 to 5, multiplying by 2 and assigning the result to j, but only the assignment line is inside the loop. After the iteration is complete, it prints 10, the result of the last multiplication.
In our second example, we set up a for loop running through each element of a given vector containing the names of 8 students in an international class. A curious (and maybe a bit 'nerdy') computer programmer is interested in developing an R program that counts the number of students whose name starts with 'a'. This program is given in Code Block 3.6.
Code Block 3.5: For Loop in R
1 #WITH BRACKETS
2 for(i in 1:5)
3 {
4   j=i*2
5   print(j); #RETURNS VALUES 2,4,6,8,10
6 }
7 #WITHOUT BRACKETS
8 for(i in 1:5)
9   j=i*2
10 print(j); #RETURNS 10
Code Block 3.6: Name Checking Example
1 namesofstudents <- c("ali","veli","huseyin","bekir","john","ahmet","micheal","ayse");
2 counter=0;
3 for(i in 1:length(namesofstudents))
4 {
5   nm <- namesofstudents[i]
6   if(substr(nm,1,1)=="a")
7     counter=counter+1;
8 }
9 print(counter)
10 #CONSOLE OUTPUT: 3
11
12 counter=0;
13 for(i in namesofstudents)
14 {
15   if(substr(i,1,1)=="a")
16     counter=counter+1;
17 }
18 print(counter)
19 #CONSOLE OUTPUT: 3
The first line of Code Block 3.6 sets a vector containing the names of the students, and a counter variable is initialized to zero. In the for loop running from 1 to the length of the vector namesofstudents, each element of the vector is assigned to a variable nm. The if statement checks whether the first character of each name is 'a'; if so, the counter is increased by one. A shorter version of the same for loop is given in lines 12-18 of Code Block 3.6, which shows that for loops can be built using the vector itself instead of reaching each element individually via an index. In this version, the for loop assigns each element of the vector namesofstudents directly to i, which renders the fifth line redundant.
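Since substr() and comparison are themselves vectorized, the whole count can even be done without an explicit loop; a sketch equivalent to Code Block 3.6:

```r
namesofstudents <- c("ali","veli","huseyin","bekir","john","ahmet","micheal","ayse")
# Compare the first character of every name at once and sum the TRUE values
counter <- sum(substr(namesofstudents, 1, 1) == "a")
counter  # returns 3
```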
Chapter 4
Descriptive Modeling Using R
4.1 Measures of Dissimilarity Between Objects
The distance between different objects in a data set can be measured using different criteria. One common criterion for this purpose is the Euclidean distance, denoted d(i, j):

d(i, j) = ( Σ_k (x_i(k) − x_j(k))^2 )^(1/2),    (4.1)

where x_i(k) is the k-th element of the vector x_i, which represents the i-th object in the dataset.
Similarly, one can define the Manhattan distance d_M(i, j) to measure dissimilarity as follows:

d_M(i, j) = Σ_k |x_i(k) − x_j(k)|.    (4.2)
Different distance definitions might yield different results in the clustering algorithms presented in the following sections.
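A minimal sketch comparing the two measures on two points, (0, 0) and (3, 4):

```r
# dist() computes Euclidean distances by default; method selects the metric
x <- rbind(c(0, 0), c(3, 4))
dist(x)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")  # Manhattan: |3| + |4| = 7
```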
The distances of a set of objects can best be expressed with a symmetric matrix, specifically called the distance matrix. In the i-th row and j-th column of the matrix, there exists the distance between the i-th and j-th elements of the set.
    | 0     d_12  d_13  ...  d_1n |
D = | d_21  0     d_23  ...  d_2n |    (4.3)
    | ...                    ...  |
    | d_n1  d_n2  d_n3  ...  0    |
In R, we calculate the distance matrix (in a special compact form) using the function dist(). In the example below, we calculate and plot the distances between the first 20 products of SuperSale Grocery.

Example: Distance Between Products of Supersale Groceries

Code Block 4.1: Get Data For Products of SuperSale Groceries
1 library("RSQLite")
2 library(DBI)
3 db <- dbConnect(SQLite(), dbname="Supersale.db")
4 prod = dbGetQuery(db,"SELECT T1.PRODUCT_ID, T1.UNIT_PRICE, T2.SALES
5   FROM M_PRODUCTS T1, (SELECT PRODUCT_ID, SUM(SALE_QUANT) AS SALES FROM
6   TR_SALES GROUP BY PRODUCT_ID) T2 WHERE T1.PRODUCT_ID = T2.PRODUCT_ID")

The first three lines of Code Block 4.1 load the libraries and connect to the database. In lines 4-6 we execute the query and retrieve the data from the database.
The second line of Code Block 4.2 defines an array from 1 to 20, and line 4 plots the first 20 products, stored in the data frame prod, retrieved from the database. Line 7 calculates the distances between the first 20 products and assigns them to U. The distances are turned into matrix form with data.matrix(U), and we plot them using the function image(). In that function, we suppress the x and y axis labels by setting the parameters xaxt and yaxt to 'n'. Lines 15-17 create strings with product names; line 18 labels the points of the scatter plot, and lines 20 and 21 put the product names on the x and y axes.
4.2 K-Means Algorithm In R
In order to cluster data sets with multiple dimensions, we use the K-means algorithm as implemented in R. As indicated in the documentation, the K-means algorithm is called with kmeans(data, centers, iter.max), and by default it runs the algorithm developed by Hartigan and Wong (1979). Note that there are multiple algorithms for partition-based clustering in the literature.
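Before applying it to the course data, a minimal sketch on synthetic data shows the call and the main fields of the returned object (the data here are invented for illustration):

```r
# Two well-separated synthetic groups of 10 two-dimensional points each
set.seed(42)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
kclus <- kmeans(pts, centers = 2, iter.max = 100)
kclus$cluster  # cluster label of each observation
kclus$centers  # coordinates of the cluster centers
```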
Let’s consider the data set from Supersale Grocery store example to illustrate the mechanics
of K-means algorithm.
Example: Bivariate K-Means Algorithm
The Supersale Grocery store is willing to cluster its customers based on their total sales and their cities. To illustrate the solution with better graphics, we will present results for the first 20 products in their database.
Code Block 4.2: Plot Distances
1 #PLOT DATA ONTO TWO AXES
2 v=c(1:20);
3 windows()
4 plot(prod[v,c(2,3)],main="PRODUCT PRICE AND TOTAL SALES")
5
6 #CALCULATE DISTANCE
7 U=dist(prod[c(1:20),c(2,3)])
8
9 #PLOT DISTANCES
10 windows()
11 image(v,v,data.matrix(U),main="Distance Matrix",
12   xlab="Products",ylab="Products",xaxt='n',yaxt='n')
13
14 #LABELS FOR AXES
15 txtvct=array(0,20)
16 for(i in v)
17   txtvct[i]=sprintf("prd-%d",i);
18 text(prod[v,2],prod[v,3],txtvct,pos=3)
19
20 axis(1,at=v,labels=txtvct,cex.axis=0.5,las=3)
21 axis(2,at=v,labels=txtvct,cex.axis=0.5,las=1)
[Figure 4.1: First 20 Products of Supersale Grocery. Scatter plot titled "PRODUCT PRICE AND TOTAL SALES" with SALES on the y-axis, UNIT_PRICE on the x-axis, and points labeled prd-1 through prd-20.]
[Figure 4.2: Distance Matrix Between Products of Supersale Grocery. Image plot of the distance matrix with both axes labeled prd-1 through prd-20.]
We retrieve data from the database using a subquery in lines 1-7 of Code Block 4.3. The data
come with categorical text, the names of cities, which we transform into numeric codes so that they
can be processed by the K-means algorithm in R.
We take the second and third columns of the data set and feed them into the kmeans function,
setting the required number of clusters to 4. The results are stored in the variable kclus. To
obtain the cluster of each customer, we use kclus$cluster, as presented in the fifth line of Code Block
4.4. By combining the cluster of each observation with the total sales and city information, we
form a new data frame with the function data.frame() (line 5). Lines 8-11 plot the customers without
any indication of clusters. This plot is given in Figure 4.3.
In Code Block 4.5, we plot the clustered customers using different colors. The plot() function in R sets
the main title, axis titles and axis limits; however, each call to plot() starts a new plot. To add the
remaining clusters in different colors, we therefore use the points() function, which must
follow a plot() call to work properly. Axis ticks are labeled with the function axis(), and we
add a legend to the plot in lines 11-12. The resulting plot is presented in Figure 4.4, where
clusters are indicated with different colors: cluster 1 is red, cluster 2 is blue, cluster
3 is green, and cluster 4 is black.
Code Block 4.3: Data Retrieval
1  library("RSQLite");
2  library("DBI");
3  #GET PRODUCT DATA FROM THE DATABASE
4  prod=dbGetQuery(db," SELECT T1.CUSTOMER_ID,T1.SALES, T2.CITY
5  FROM (SELECT CUSTOMER_ID,SUM(SALE_QUANT) AS SALES FROM TR_SALES
6  GROUP BY CUSTOMER_ID) T1, M_CUSTOMERS T2 WHERE
7  T1.CUSTOMER_ID=T2.CUSTOMER_ID")
8  #CHANGE CATEGORICAL TEXT DATA INTO NUMERIC (CITY is the third column)
9  prod[(prod[,3]=="IST"),3]=0
10 prod[(prod[,3]=="ANK"),3]=1
11 prod[(prod[,3]=="IZM"),3]=2
12 prod[(prod[,3]=="ADN"),3]=3
Code Block 4.4: Cluster Products With K-means
1  #GET TOTAL SALES AND CITY INFO
2  cls=prod[,c(2,3)]
3  kclus=kmeans(cls,4)
4  #WE COLLECT CLUSTER AND OTHER DATA OF EACH CUSTOMER INTO A DATA FRAME
5  custcluster=data.frame(kclus$cluster,cls$SALES,cls$CITY)
6  #LETS PLOT UNCLUSTERED DATA
7  windows()
8  plot(custcluster[,c(2,3)],ylab="CITY",xlab="TOTAL SALES",yaxt='n',
9  main="SALES & CITIES OF CUSTOMERS")
10 #PLOT WITHOUT ANY AXIS TICKS WHICH ARE ADDED LATER!
11 axis(2,at=c(1:4),labels=c("IST","ANK","IZM","ADN"))
Code Block 4.5: Plotting Product Clusters
1  #NOW PLOT CLUSTERED DATA USING DIFFERENT COLORS...
2  maxlim1=max(custcluster[,2]);minlim1=min(custcluster[,2]);
3  plot(custcluster[(custcluster[,1]==1),c(2,3)],col="red",
4  xlim=c(minlim1,maxlim1), pch=19, main="Clusters of Customers",
5  xlab="TOTAL SALES",yaxt='n',ylab="CITY")
6  points(custcluster[(custcluster[,1]==2),c(2,3)],col="blue",pch=19)
7  points(custcluster[(custcluster[,1]==3),c(2,3)],col="green",pch=19)
8  points(custcluster[(custcluster[,1]==4),c(2,3)],col="black",pch=19)
9  axis(2,at=c(1:4),labels=c("IST","ANK","IZM","ADN"))
10 txt1="Cluster1"; txt2="Cluster2"; txt3="Cluster3"; txt4="Cluster4";
11 legend(800,4,c(txt1,txt2,txt3,txt4),col=c("red","blue","green","black")
12 ,pch=19)
[Scatter plot "SALES & CITIES OF CUSTOMERS": TOTAL SALES on the x-axis, CITY (IST, ANK, IZM, ADN) on the y-axis.]
Figure 4.3: Customers in Different Cities
[Scatter plot "Clusters of Customers": TOTAL SALES vs. CITY, with legend entries Cluster1 to Cluster4.]
Figure 4.4: Clustered Customers in Different Colors
4.3 Hierarchical Clustering
Hierarchical clustering is a widely used method to obtain nested clusters for a data set. Unlike
the K-means algorithm, hierarchical clustering methods do not require a prespecified number of
clusters. Working either bottom-up or top-down, the algorithms group clusters by comparing the
distances between them.
The general search scheme of hierarchical clustering methods relies on comparing the distances
between all pairs of clusters and iteratively joining the closest ones.
One important concern in hierarchical clustering is finding a representative point for each
cluster that contains multiple observations. There are several possible approaches to this issue. The
most straightforward is to calculate centroids (cluster centers), which can be formulated as
follows. Let C be a cluster containing observations {x | x ∈ C}, where x = (x_1, x_2, ..., x_p) is a
p-dimensional vector. Then the centroid of the cluster C is denoted by

x̄ = (x̄_1, x̄_2, ..., x̄_p),

where x̄_k is the average of the k-th elements of all observations in C. Using centroids of clusters has a
nice interpretation and provides a balanced representation of all elements in a cluster. However,
the computer has to re-calculate the distances between the centroids of the clusters in each iteration,
as centroids change when elements are added to or removed from clusters. Yet for small or medium size
data sets this is not a pressing concern, and many hierarchical clustering algorithms use centroids.
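In R, the centroid formula amounts to a column-wise average, i.e. colMeans(); a tiny sketch with a hypothetical three-observation cluster:

```r
# A hypothetical cluster C of three observations in p = 2 dimensions
C <- rbind(c(1, 2),
           c(3, 4),
           c(5, 6))

# The centroid is the vector of column averages (x1-bar, x2-bar)
centroid <- colMeans(C)
centroid   # 3 4
```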
In the following example, we present agglomerative (bottom-up) hierarchical clustering of a
sample of black cherry trees.
Example: Hierarchical Clustering of Black Cherry Trees
A built-in data set containing girth, height and volume measurements of black cherry trees is consid-
ered in this example. The data set can be accessed directly from R by loading the package datasets,
which ships with base R, as in the first line of Code Block 4.6. We then calculate the
distances between observations (line 2) and feed them into the function hclust() (line 3), which
clusters the observations using a bottom-up algorithm. Plotting the result of a hierarchical clustering
algorithm generates a dendrogram. For the black cherry data set, the dendrogram is given in
Figure 4.5.
For large data sets, it may be more useful to extract the clusters themselves than to analyze the dendrogram,
which can become messy and complex. To obtain the clusters, we execute the code in the fifth line
Code Block 4.6: Hierarchical Clustering for Black Cherry Trees
1 library(datasets)
2 dist_tree=dist(trees)
3 mdist_tree=hclust(dist_tree)
4 plot(mdist_tree)
5 mdist_tree$merge
of Code Block 4.6. The fifth line prints a matrix with two columns whose rows represent merges.
Negative elements in this matrix stand for singletons (clusters containing one observation),
whereas positive elements refer to clusters formed in earlier rows, i.e. clusters with more than one object.
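As a sketch of how to read the merge matrix, consider a tiny synthetic example (any data set works; the structure of $merge is the same):

```r
# Four 1-D points: 1 and 1.1 are closest, then 5 and 5.2, then the two pairs
x <- c(1, 1.1, 5, 5.2)
hc <- hclust(dist(x))
hc$merge
# Row 1 merges singletons 1 and 2 (negative entries),
# row 2 merges singletons 3 and 4,
# row 3 joins the clusters formed in rows 1 and 2 (positive entries).

# cutree() extracts a flat clustering from the hierarchy
cutree(hc, k = 2)   # observations 1-2 in one cluster, 3-4 in the other
```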
[Dendrogram of the 31 trees, produced by hclust(*, "complete") on dist_tree; heights range from 0 to about 60.]
Figure 4.5: Black Cherry Tree Sample Clustered Hierarchically (Agglomerative)
In the following example, we present a hierarchical clustering for a large data set retrieved
from the SuperSale Grocery database.
Hierarchical Clustering Customers of SuperSale Grocery
In Code Block 4.7, we retrieve data from the database of SuperSale Grocery.
Code Block 4.7: Data Retrieval from SuperSale Grocery
1 db <- dbConnect(SQLite(),dbname="Lecture3.db")
2 xx=dbGetQuery(db," SELECT T1.CUSTOMER_ID, SUM(T1.SALE_QUANT),
3 SUM(T1.TOTAL_AMOUNT), T2.GENDER FROM TR_SALES T1, M_CUSTOMERS T2
4 WHERE T1.CUSTOMER_ID=T2.CUSTOMER_ID GROUP BY T1.CUSTOMER_ID");
Code Block 4.8: Hierarchical Clustering for Customers of SuperSale Grocery
1  for(i in 1:length(xx[,1]))
2  {
3    xx[i,1]=sprintf("Cust.ID=%d",i)
4  }
5
6  #CHANGE CATEGORICAL TEXT INTO NUMERIC
7  xx[(xx[,4]=="F"),4]=1
8  xx[(xx[,4]=="M"),4]=0
9
10 #CALCULATE DISTANCE
11 distCust=dist(xx[,c(2:4)])
12 #HIERARCHICAL CLUSTER
13 hclust(distCust)
14 clusterd_cust=hclust(distCust)
15 #PLOT
16 plot(clusterd_cust)
In Code Block 4.8, we first build strings for the customer ids in lines 1-4. Then we change the
categorical text into numeric codes as we did before. Distances are calculated in line 11, and the dendrogram
in Figure 4.6 is generated in line 16. This dendrogram illustrates the weakness
of dendrograms for large data sets: they become unreadable due to the limited space
on the x-axis.
[Dendrogram of the 1000 customers, produced by hclust(*, "complete") on distCust; heights range from 0 to about 8000, with leaf labels too dense to read.]
Figure 4.6: Customers of SuperSale Groceries Clustered Hierarchically (Agglomerative)
Chapter 5
Predictive Modeling
5.1 Perceptrons
The perceptron is one of the earliest classification techniques. It relies on the idea of finding a threshold
function (of the input factors) by checking all objects in the data set iteratively. In each iteration, the
threshold function is adjusted using the misclassified objects in the data set. Depending on the form of the
threshold function, e.g. linear, quadratic etc., denoted by s(.), the adjustment due to misclassification
is applied in different forms. Each misclassification adjustment takes place with a learning
rate λ.
The pseudocode for a generic perceptron algorithm with a threshold function s() and threshold
value thr.val can be given as follows:
Algorithm 1 Generic Algorithm for Perceptron
1: misclassflag = TRUE
2: while misclassflag is TRUE do
3:   misclassflag = FALSE
4:   for all x_i ∈ X do
5:     Estimate the class of x_i with s(x_i) and assign it to Ĉ_i.
6:     if Ĉ_i ≠ C_i then
7:       Update s(.) using x_i and λ
8:       misclassflag = TRUE
9:     end if
10:  end for
11: end while
In this chapter, we consider a linear threshold function, which is expressed as a weighted sum of the
input columns: s(x_i) = Σ_j w_j x_ij, where x_ij is the j-th element of input vector i. The algorithm
starts with an initial value of w. For each input vector x_i, we calculate the estimated class, Ĉ_i,
and compare it with the actual class, C_i, of object i. If there is a misclassification, then
the threshold function s(.) is updated with learning rate λ. The update mechanism is taken as follows:
29
5.1. PERCEPTRONS Monday 20th February, 2017
Code Block 5.1: Linear Perceptron
1  perceptron <-
2  function(y,x,thr,lambda=0.01,winit=array(0,length(x[1,])),max.iter=10^6)
3  {
4    w=winit;
5    iteration=1;
6    incorrect.class=TRUE;
7    while(incorrect.class)
8    {
9      incorrect.class=FALSE;
10     for(i in 1:length(y))
11     {
12       C.hat=ifelse((as.numeric(x[i,])%*%as.numeric(w)>=thr),1,-1)
13       if(y[i]!=C.hat)
14       {
15         w=w + lambda*y[i]*x[i,];  #signed update: move w toward the correct side
16         incorrect.class=TRUE;
17       }
18     }
19     iteration=iteration+1;
20     if(iteration>max.iter)
21       incorrect.class=FALSE;
22     print(paste("Iteration#",iteration))
23   }
24   w
25 }
Whenever the element x_i is misclassified, the weight vector w = (w_1, w_2, ..., w_p) is updated
using the formula below, where y_i ∈ {−1, +1} is the actual class of x_i (the sign ensures the
weights move toward the correct side of the threshold):

w = w + λ y_i (x_i1, x_i2, ..., x_ip).
The application of Algorithm 1 in R is given in Code Block 5.1, in which we
define a perceptron function that takes the actual class variable as y and the input data frame as x.
λ is the learning rate, and the initial weight values are set to zero. In order to avoid an infinite loop, we
introduce another parameter, max.iter, after which we stop the perceptron algorithm
and return the resulting weight vector.
In the R program, the estimation of the class of the dependent variable is done in line 12 and the
update is executed in line 15; these are the two core operations of a perceptron algorithm. In each
iteration, the algorithm goes through the entire data set and checks for misclassification. The program ends
either when no misclassification occurs in the for loop (lines 10-18) or when the number of iterations
exceeds the allowed maximum given as a parameter.
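As a usage sketch, the function can be exercised on a small synthetic, linearly separable sample. The data and names below are our own; the function body is a condensed restatement of Code Block 5.1 (with the class-signed update and a matrix input for simplicity):

```r
# Condensed perceptron with the signed update w = w + lambda * y_i * x_i
perceptron <- function(y, x, thr, lambda = 0.01,
                       winit = array(0, length(x[1, ])), max.iter = 10^4) {
  w <- winit
  iteration <- 1
  incorrect.class <- TRUE
  while (incorrect.class) {
    incorrect.class <- FALSE
    for (i in 1:length(y)) {
      C.hat <- ifelse(as.numeric(x[i, ]) %*% as.numeric(w) >= thr, 1, -1)
      if (y[i] != C.hat) {
        w <- w + lambda * y[i] * x[i, ]
        incorrect.class <- TRUE
      }
    }
    iteration <- iteration + 1
    if (iteration > max.iter) incorrect.class <- FALSE
  }
  w
}

# Toy separable data: class +1 around (3,3), class -1 around (-3,-3)
set.seed(2)
x <- rbind(cbind(rnorm(20, 3), rnorm(20, 3)),
           cbind(rnorm(20, -3), rnorm(20, -3)))
y <- c(rep(1, 20), rep(-1, 20))

w <- perceptron(y, x, thr = 0)
pred <- ifelse(x %*% as.numeric(w) >= 0, 1, -1)
mean(pred == y)   # should equal 1 on this separable sample
```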
We should note that the linear threshold function of the perceptron assumes that the two classes
are linearly separable, i.e. there exists a hyperplane w^T x that separates the two classes perfectly
(Alpaydin, 2014). If, on the other hand, the classes cannot be separated, the perceptron algorithm
may fail to converge to a good weight vector. Examples of inseparable and separable classes are
given in Figures 5.1a and 5.1b.
[Two scatter plots of AVG.TRANSACTION vs. MONTHLY_INCOME: (a) Inseparable Classes, (b) Separable Classes.]
Figure 5.1: Separable and Inseparable Classes
In Figures 5.1a and 5.1b, a data set containing the customers of SaversBank is plotted. In the left-
hand-side graph, customers are classified according to their credit card types. Blue circles indicate
customers using Visa, whereas the red ones are for other types of credit cards (American Express
and MasterCard). If, on the other hand, the classification is based on the cities of the customers,
then we end up with separable classes (Figure 5.1b). The R program that retrieves the data set and
produces the plots is given in Code Block 5.2.
Before closing our section on the perceptron, we should make one more comment on the threshold
function s(.). Alpaydin (2014) suggests that the threshold function should be taken as in
Equation 5.1, in order to cover the possibility of a nonzero intercept for the separating
hyperplane:

s(x_i) = w_0 + Σ_{j≥1} w_j x_ij.    (5.1)
For such a threshold function, the weight vector w is updated using a (p+1)-dimensional
vector having 1 as its first element. The update equation, again with y_i the actual class, is given below:

w = w + λ y_i (1, x_i1, x_i2, ..., x_ip).
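In practice, this is implemented by prepending a column of ones to the input matrix, so that w_0 is learned like any other weight; a small sketch with illustrative numbers of our own:

```r
# Hypothetical 3-observation input matrix with p = 2 columns
x <- matrix(c(1.0, 2.0,
              0.5, 1.5,
              2.0, 0.5), ncol = 2, byrow = TRUE)

# Prepend a column of ones; the first weight now plays the role of w0
x.aug <- cbind(x0 = 1, x)

# The augmented score is then simply x.aug %*% c(w0, w1, w2)
w <- c(-1, 2, 3)        # illustrative weights (w0 = -1)
scores <- x.aug %*% w   # equals w0 + w1*x1 + w2*x2 for each row
scores                  # 7, 4.5, 4.5
```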
Code Block 5.2: Data Set For (In)Seperable Classes
1  dataset=dbGetQuery(db,
2  "SELECT T4.CCARDNUM,T4.CTYPE,T4.CTY, AVG(T4.AVG_TRANS),AVG(T4.INCME)
3  FROM (SELECT T2.CARD_TYPE CTYPE,T.CREDT_CARD_NUM CCARDNUM,
4  (T.TRANS_AMOUNT) AVG_TRANS,(T2.MONTHLY_INCOME) INCME,T3.CITY CTY
5  FROM CARD_TRANS T,M_CREDIT_CARDS T2, (select CITY,CREDIT_CARD_ID,
6  CUSTOMER_ID FROM M_CUSTOMERS WHERE CREDIT_CARD_ID IS NOT NULL) T3
7  WHERE T.CREDT_CARD_NUM=T2.CREDIT_CARD_ID AND T.TRANS_TYPE='-' AND
8  T3.CREDIT_CARD_ID=T2.CREDIT_CARD_ID AND T3.CUSTOMER_ID=T2.CUSTOMER_ID )
9  T4 GROUP BY T4.CCARDNUM,T4.CTYPE,T4.CTY")
10
11 avg_expense=dataset[,4]
12 income=dataset[,5]
13
14 #Inseparable Classes
15 ydep=ifelse(dataset[,2]=="Visa",1,-1)
16 plot(avg_expense,income,col=ifelse(ydep>0,"blue","red"),
17 main="Inseparable Classes",xlab="AVG.TRANSACTION",ylab="MONTHLY_INCOME")
18
19 #Separable Classes
20 ycity=ifelse(dataset[,3]=="IST",1,-1)
21 plot(avg_expense,income,col=ifelse(ycity==1,"red","blue"),
22 main="Separable Classes",xlab="AVG.TRANSACTION",ylab="MONTHLY_INCOME")
5.2 Linear Discriminant Analysis
Another classification method is discriminant analysis, which relies on the idea of finding a hyper-
plane that best separates the classes. From this perspective the discriminant method is similar
to the perceptron (Hand et al., 2001). The discriminant function can be assumed in different forms, such as
linear, quadratic, exponential etc. In this chapter we only consider linear discriminant analysis.
For further reading on nonlinear discriminant functions, the reader is referred to Alpaydin (2014).
As stated above, the linear discriminant method relies on the idea of finding a weight vector
w that best separates the classes in the data set. Hence our discriminant function will be of the following
form:

g(x) = w^T x + w_0,    (5.2)

where w is a p-dimensional vector.
In order to find the best w vector, Fisher (1936) suggests the following treatment for the two-class
case. Let C_i be the covariance matrix of class i = 1, 2, and let C be the pooled sample covariance
matrix, formulated as follows:

C = (1 / (n_1 + n_2)) (n_1 C_1 + n_2 C_2).

The score function to be maximized for calculating w is

S(w) = (w^T µ_1 − w^T µ_2)^2 / (w^T C w),

where µ_i, i = 1, 2, are the class means.
In R, we call the linear discriminant method using the function lda(), whose usage is very similar
to that of lm(). The function returns

• the prior proportions of classes in the data set (π_i, i ∈ {1, 2, ...}),

• the class averages (µ_i, i ∈ {1, 2, ...}),

• the resulting weight vector w.
In order to predict the classes of new objects using the linear discriminant model, we call the predict
function with a data set containing the new observations. Note that predict is also useful for predicting
with other statistical models, such as linear regression, exponential smoothing etc.
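A minimal sketch of this lda()/predict() workflow on synthetic two-class data (all variable names below are our own, not from the SaversBank database):

```r
library(MASS)

# Two Gaussian classes in two dimensions
set.seed(3)
df <- data.frame(x1 = c(rnorm(30, 0), rnorm(30, 3)),
                 x2 = c(rnorm(30, 0), rnorm(30, 3)),
                 cls = factor(rep(c(0, 1), each = 30)))

fit <- lda(cls ~ x1 + x2, data = df)
fit$prior     # class proportions (pi_i)
fit$means     # class averages (mu_i)
fit$scaling   # the weight vector w

# predict() returns the estimated class of each observation
pred <- predict(fit)$class
mean(pred == df$cls)   # in-sample accuracy
```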
For two-dimensional problems, the results of the linear discriminant can be turned into a decision
boundary using Equation 5.3, suggested by Hand et al. (2001). An R program that employs this equation
is given in Code Block 5.3 for the classification of SaversBank's customers. The resulting plot of
classified customers is given in Figure 5.2.
Code Block 5.3: Linear Discriminant Analysis for Customers of SaversBank
1  library(MASS)
2  ycity=dataset[,3]
3  ycity[(ycity!="IST")]=0
4  ycity[(ycity=="IST")]=1
5
6  saversbank=data.frame(dataset[,c(4,5)])
7  ld_city=lda(ycity~.,saversbank)
8  city.lda=predict(ld_city)$class
9
10 plot(saversbank[,1],saversbank[,2],col=ifelse(city.lda==1,"red","blue"),
11 main="Linear Discriminant",xlab="AVG.TRANSACTION",ylab="MONTHLY_INCOME")
12
13 a=-1*ld_city$scaling[1]/ld_city$scaling[2]
14 b=t(ld_city$scaling)%*%(ld_city$means[2,]-ld_city$means[1,])*0.5 +
15 log(ld_city$prior)[2]-log(ld_city$prior)[1]
16
17 abline(a,b,lwd=2)
18 text(300,2000,"Threshold Function")
w^T ( x − (1/2)(µ_1 + µ_2) ) − log(π_1 / π_2) = 0.    (5.3)
The R program in Code Block 5.3 starts by loading the required library, MASS, for the linear
discriminant function. The dependent variable, ycity, is constructed in lines 2-4. The linear discrim-
inant model is fitted in line 7, and the classified customers are plotted in lines 10-11. In the rest of the
program, we compute and draw the threshold line separating them.
5.3 Decision Tree
A decision tree is a hierarchical classification technique that aims to divide the input space into subregions
in an iterative manner. A decision tree is composed of internal nodes and terminal leaves, where
internal nodes represent splits and terminal leaves indicate final decisions.
Decision tree algorithms search for threshold values that split the observations (in
the training data set) into two groups. For a continuous variable x_j, decision tree algorithms
search for a threshold t such that observations with x_j ≤ t and those with x_j > t are grouped separately.
To this end, the algorithms evaluate all possible threshold values for an internal node and select the
value that maximizes a purity criterion. The process starts at the root node and proceeds iteratively
towards the leaf nodes by selecting a threshold value for each internal node.
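This threshold search can be sketched by brute force (our own illustrative code, not the internals of any R package): try every observed value of the variable as a candidate t and keep the one whose size-weighted child impurity (here the Gini index, introduced below) is lowest.

```r
# Gini index of a vector of class labels
gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Brute-force search for the best split threshold on one variable x:
# try every observed value as candidate t, keep the lowest weighted impurity
best.split <- function(x, y) {
  best <- list(t = NA, impurity = Inf)
  for (t in sort(unique(x))) {
    left  <- y[x <= t]
    right <- y[x > t]
    if (length(left) == 0 || length(right) == 0) next
    imp <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
    if (imp < best$impurity) best <- list(t = t, impurity = imp)
  }
  best
}

# Toy data: class 0 below 5, class 1 above, so the best threshold is t = 4
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
best.split(x, y)   # t = 4, impurity = 0
```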
The simplest purity criterion is the ratio of the number of observations from class i, denoted by
[Scatter plot "Linear Discriminant": AVG.TRANSACTION vs. MONTHLY_INCOME, with the separating line labeled "Threshold Function".]
Figure 5.2: Classified Customers with Linear Discriminant Method
N_ik, to the total number of observations in the leaf node k, N_k. This ratio is denoted by

p_ik = N_ik / N_k.

Note that if p_ik is either 1 or 0 for a leaf node k, that is, all observations come from the same class, then
the split is pure (Alpaydin, 2014).
In R, we call the function rpart(), which uses the Gini index, a common impurity measure, for
decision trees (Venables and Ripley, 2002). The Gini index is formulated as follows for a node k:

G_k = 1 − Σ_{i∈C} p_ik²,

where C is the set of all classes in the data set. Minimizing the Gini index for a node k
amounts to finding a threshold t that pushes the purity ratios p_ik towards 0 or 1. Minimization of
the Gini index by rpart therefore increases the purity of the classes at the leaf nodes (Figure 5.3). Also
note that, for two classes, the worst case is a purity ratio equal to 0.5, which means there are equal
numbers of elements from each class at the node.
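The claim that 0.5 is the worst case can be checked numerically for two classes, where G_k reduces to 1 − p² − (1 − p)²:

```r
# Two-class Gini index as a function of the purity ratio p
gini2 <- function(p) 1 - (p^2 + (1 - p)^2)

p <- seq(0, 1, by = 0.25)
g <- gini2(p)

gini2(0)          # 0   : pure node
gini2(1)          # 0   : pure node
gini2(0.5)        # 0.5 : equal class mix, the worst case
p[which.max(g)]   # the maximum is attained at p = 0.5
```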
Example: Decision Tree Classifier for Customers of SaversBank
Managers of SaversBank want to understand the customer base of the bank in order to
design better services and promotions that can be executed with simple rules. Also, in their
latest campaigns, they particularly want to focus on the customers in Izmir. To this end, they
want to employ a decision tree classification using the city, monthly income and maximum amount of savings of their
[Curve of the Gini index against p(ik): zero at p(ik) = 0 and 1, with a maximum of 0.5 at p(ik) = 0.5.]
Figure 5.3: Gini Index for Node k
customers.
The R code that extracts the data, fits a decision tree and plots it is given in Code Block 5.4. The first 8 lines
get the data from the database into R. Lines 10-13 encode the city information as a numeric variable.
After loading the required library, we fit the decision tree in line 18 and plot it in lines 20-21. The
rest of the code block draws some of the splits of the decision tree over the data, as shown in Figure 5.5.
5.4 Nearest Neighbor Methods
The nearest neighbor method is a classification algorithm which iteratively estimates the class of each
object in the test set using the objects in the training set. The idea of the algorithm is as follows:
to estimate the class of an object with input vector x in the test set, we find its k nearest
neighbors in the training set. Then we check the classes of these k objects and assign the most com-
mon class as the class estimate, ŷ. As we consider the k nearest neighbors, the algorithm is also
known as kNN in the literature.
For instance, suppose we want to estimate the class of the eighth object (in the test set) using
the seven objects in the training set given in Figure 5.6. Among them there are three class 1 and
four class 2 objects; hence the class of the eighth object is estimated as class 2 (by majority).
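The majority-vote idea can also be sketched directly, without the class package; the coordinates below are our own toy values, not the data behind Figure 5.6:

```r
# Seven training points (2-D) with known classes
train <- rbind(c(0.10, 0.10), c(0.15, 0.20), c(0.20, 0.15),              # class 1
               c(0.35, 0.40), c(0.40, 0.35), c(0.40, 0.45), c(0.45, 0.40)) # class 2
cls <- c(1, 1, 1, 2, 2, 2, 2)

# A new point to classify
newpt <- c(0.17, 0.15)

# Euclidean distance from the new point to each training point
d <- sqrt(rowSums((train - matrix(newpt, nrow(train), 2, byrow = TRUE))^2))

# Classes of the k = 3 nearest neighbors, then majority vote
k <- 3
nn <- order(d)[1:k]
votes <- table(cls[nn])
as.numeric(names(which.max(votes)))   # 1
```

Here all three nearest neighbors belong to class 1, so the new point is assigned class 1.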
Code Block 5.4: Decision Tree for Customers of SaversBank
1  library("RSQLite")
2  library(DBI)
3  db <- dbConnect(SQLite(),dbname="SaversBank.db")
4  dataset=dbGetQuery(db,"
5  select T1.CUST_ID,MAX(T1.BALANCE) max_save ,T2.MONTHLY_INCOME,T3.CITY
6  from SAVINGS_TRANS T1, M_CREDIT_CARDS T2, M_CUSTOMERS T3 WHERE
7  T1.CUST_ID=T2.CUSTOMER_ID AND T3.CUSTOMER_ID=T2.CUSTOMER_ID
8  GROUP BY T1.CUST_ID")
9
10 varcity=dataset[,4]
11 varcity[(varcity!="IZM")]=0;
12 varcity[(varcity=="IZM")]=1;
13 varcity=as.numeric(varcity)
14
15 library(rpart)
16 dataregress=dataset[,c(2,3)]
17 names(dataregress) <- c("max_save","M_INC")
18 fit=rpart(varcity~.,data=dataregress)
19
20 plot(fit, uniform=TRUE, main="Classification Tree for SaversBank")
21 text(fit,cex=0.8)
22
23 windows()
24 plot(dataregress,main="Decision Tree Split",
25 col=ifelse((varcity==1),"red","blue"))
26 abline(v=93590,col="red")
27 abline(v=38190,col="red")
28 segments(93590,3800,400000,3800,col="red")
29 segments(93590,3350,400000,3350,col="red")
30 segments(93590,4650,400000,4650,col="red")
31 segments(93590,3100,400000,3100,col="red")
32 segments(93590,4200,400000,4200,col="red")
33 segments(-500,900,38190,900,col="red")
34 legend(250000,2000,c("IZMIR","Others"),
35 col=c("red","blue"),pch=21)
[Decision tree "Classification Tree for SaversBank": root split max_save < 9.359e+04, further splits on max_save (3.819e+04, 1.491e+04) and M_INC (thresholds between 900 and 4650); leaves show the estimated proportion of Izmir customers.]
Figure 5.4: Classifying Customers of SaversBank
[Scatter plot "Decision Tree Split": max_save vs. M_INC, colored by class, with legend IZMIR / Others.]
Figure 5.5: Partition of Input Space by Decision Tree
[Scatter plot "Classify the 8th Point": Var1 vs. Var2, with legend Class1 / Class2.]
Figure 5.6: Example 1. Classify the Eighth Point
The kNN algorithm is called with knn() in R (from the package class). To apply the method properly, we
split the data set (both the input data and the class variable) into two parts, a training set and a test set.
The knn() function takes the training and test sets of the input data together with the training set of the
class variable. The estimated classes are then compared with the test set of the class variable. Such an
application of the kNN method is given in the following example.
Example: kNN Classification for Products of Supersale Groceries
SuperSale Grocery is willing to make an assessment of their products sold in different cities.
They consider their top-sale products to be the ones that generate a daily revenue larger than 150 TL.
They want a prediction tool for their new product development department, which aims to increase
the market share of the company by introducing more top-sale products to the market.
To this end, they apply to our university for a predictive classifier that works on their product
database, which they provided to us. An application of the kNN algorithm to this small case is provided
in Code Block 5.5.
The application in Code Block 5.5 starts by loading the necessary libraries and connecting to
an SQLite database (lines 1-3). Then we execute an SQL query retrieving a data
frame that includes all columns of M_PRODUCTS, the average sale quantity and the cities of the customers.
Code Block 5.5: Nearest Neighbor
1  library("RSQLite")
2  library(DBI); library(class)   #knn() lives in the class package
3  db <- dbConnect(SQLite(),dbname="Lecture3.db")
4  dataset=dbGetQuery(db,"SELECT T1.*,AVG(T2.SALE_QUANT),T3.CITY ,
5  date(datetime(T2.DATE_OF_SALE,'unixepoch')) FROM M_PRODUCTS T1,
6  TR_SALES T2, M_CUSTOMERS T3 WHERE T1.PRODUCT_ID=T2.PRODUCT_ID
7  AND T3.CUSTOMER_ID=T2.CUSTOMER_ID GROUP BY T2.PRODUCT_ID,T3.CITY,
8  date(datetime(T2.DATE_OF_SALE,'unixepoch'))")
9
10 revenue=dataset[,2]*dataset[,3]
11 classifier=(revenue>150)
12 classifier=as.numeric(classifier)
13
14 dataset[(dataset[,4]=="ADN"),4]=1; dataset[(dataset[,4]=="ANK"),4]=2;
15 dataset[(dataset[,4]=="IST"),4]=3; dataset[(dataset[,4]=="IZM"),4]=4;
16 dataset[,4]=as.numeric(dataset[,4])
17
18 n=length(dataset[,1])
19 z=c(1:n);Dev4k<-0
20 for(s in 1:7){
21 MADvect<-0
22 for(i in 1:100){
23 trainset=sample(z,n*0.7)
24 dataset.train=dataset[trainset,]
25 dataset.test=dataset[-trainset,]
26 dataset.train2=dataset.train[,c(2,3,4)]
27 dataset.test2=dataset.test[,c(2,3,4)]
28
29 classifier.train=classifier[trainset]
30 classifier.test=classifier[-trainset]
31 est=knn(dataset.train2,dataset.test2,classifier.train,k=s)
32 est=as.numeric(as.character(est))   #factor levels back to 0/1
33 MADvect[i]=mean(abs(classifier.test-est))
34 print(paste(s,i))
35 }
36 Dev4k[s]=mean(MADvect)}
We calculate the revenue by multiplying the second and third columns of the data frame. We then
define a top-seller classifier which assigns 1 to products with a revenue larger than 150 TL (lines
10-13).
In the fourth column of the data frame dataset, we have city names as strings. We transform
the city names into a numeric categorical variable and store these values back into the fourth column of
dataset (lines 14-16). In the rest of Code Block 5.5, we aim to decide on the best k value by conducting
out-of-sample tests repeatedly in the inner for loop. Each time the code enters the inner loop,
it selects a random training set (70% of the data) and assigns the rest to the test set (lines
23-34). By executing this out-of-sample test 100 times (it could be even larger, depending on the
standard deviation of the MAD), we calculate the average MAD for each k value.
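Once Dev4k has been filled by the loop above, choosing k is a one-liner; the sketch below uses hypothetical MAD values (placeholders, not output of Code Block 5.5):

```r
# Hypothetical average MAD values for k = 1..7 (placeholders, not real output)
Dev4k <- c(0.21, 0.18, 0.15, 0.14, 0.16, 0.17, 0.19)

best.k <- which.min(Dev4k)
best.k          # 4
Dev4k[best.k]   # 0.14
```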
Bibliography
Alpaydin, Ethem. 2014. Introduction to Machine Learning. MIT Press.

Becker, Richard A., John M. Chambers, Allan R. Wilks. 1988. The New S Language. Pacific Grove,
CA: Wadsworth & Brooks.

Fisher, Ronald A. 1936. The use of multiple measurements in taxonomic problems. Annals of
Eugenics 7(2) 179–188.

Hand, David J., Heikki Mannila, Padhraic Smyth. 2001. Principles of Data Mining. MIT Press.

Hartigan, John A., Manchek A. Wong. 1979. Algorithm AS 136: A k-means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics) 28(1) 100–108.

Ryan, Thomas A., Brian L. Joiner, Barbara F. Ryan, et al. 1976. Minitab Student Handbook. Duxbury
Press.

Venables, W.N., B.D. Ripley. 2002. Modern Applied Statistics with S. Springer Science & Business
Media.