Lazy Learning: Classification Using Nearest Neighbors
The principle behind this machine learning approach is that objects that are alike are likely to have similar properties. We can use this principle to classify data by placing it in the category of its most similar, or "nearest," neighbors.
We will cover:
- The key concepts that define nearest neighbor classifiers, and why they are called "lazy"
- Methods to measure the similarity between two examples using distance
- How to use the R implementation of the k-Nearest Neighbors (kNN) algorithm to diagnose breast cancer
Classification using nearest neighbors:
Nearest neighbor classification is well suited to machine learning applications in which the relationships between the features and the target class are numerous, complicated, or difficult to understand. Put another way: if a concept is difficult to define, but you know it when you see it, nearest neighbor classification might be appropriate.
On the other hand, if there is no clear distinction among the groups, the algorithm may not be well suited for identifying the boundaries.
The kNN algorithm

Strengths:
- Simple and effective
- Makes no assumptions about the underlying data distribution
- Fast training phase

Weaknesses:
- Does not produce a model, which limits the ability to find insights in the relationships between features
- Slow classification phase
- Requires a large amount of memory
- Nominal features and missing data require additional processing
The kNN algorithm begins with a training dataset made up of
examples that are classified into several categories, as labeled by a
nominal variable. Assume that we have a test dataset containing
unlabeled examples that otherwise have the same features as the
training data. For each record in the test dataset, kNN identifies k
records in the training data that are the “nearest” in similarity,
where k is an integer specified in advance. The unlabeled test
instance is assigned the class of the majority of the k nearest
neighbors.
Let us assume we have a blind tasting experiment in which we need to classify the food we tasted as a fruit, protein, or vegetable. Suppose that prior to eating the mystery item, we created a taste dataset recording two features of each ingredient: a score from 1 to 10 for how crunchy the food is, and a score from 1 to 10 for how sweet it tastes. We then labeled each ingredient as one of three food types: fruit, vegetable, or protein. We could have a table such as:
ingredient   sweetness   crunchiness   food type
apple            10           9        fruit
banana           10           1        fruit
carrot            7          10        vegetable
celery            3          10        vegetable
cheese            1           1        protein
nuts              3           6        protein
The kNN algorithm treats the features as coordinates in a
multidimensional feature space. As our dataset includes only two
features, the feature space is 2 dimensional, with the x dimension
indicating the ingredient sweetness and the y dimension indicating
the crunchiness.
After constructing the dataset, we decide to use it to answer the question: is a tomato a fruit or a vegetable? We can use the nearest neighbor approach to determine which class is a better fit.
Calculating distance:
Locating an object's nearest neighbors requires a distance function, that is, a formula that measures the similarity between two instances.
Traditionally, the kNN algorithm uses Euclidean distance, specified by the following formula, where p and q are the instances to be compared, each having n features. The term p1 refers to the value of the first feature of p, while q1 refers to the first feature of q:

dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
The distance formula involves comparing the values of each feature. For example, to calculate the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7), the formula becomes:

dist(tomato, green bean) = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(18) ≈ 4.2
In a similar vein, we can calculate the distance between the tomato and several other ingredients as follows:

ingredient   sweetness   crunchiness   food type   distance to tomato
grape            8            5        fruit       sqrt((6-8)^2 + (4-5)^2) ≈ 2.2
green bean       3            7        vegetable   sqrt((6-3)^2 + (4-7)^2) ≈ 4.2
nuts             3            6        protein     sqrt((6-3)^2 + (4-6)^2) ≈ 3.6
orange           7            3        fruit       sqrt((6-7)^2 + (4-3)^2) ≈ 1.4
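
These distances are easy to reproduce in R. Below is a minimal sketch; the euclidean() helper is our own illustration, not part of any package:

# helper implementing the Euclidean distance formula above
euclidean <- function(p, q) sqrt(sum((p - q)^2))

tomato <- c(sweetness = 6, crunchiness = 4)
euclidean(tomato, c(8, 5))  # grape:      ~2.2
euclidean(tomato, c(3, 7))  # green bean: ~4.2
euclidean(tomato, c(3, 6))  # nuts:       ~3.6
euclidean(tomato, c(7, 3))  # orange:     ~1.4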
To classify the tomato as a vegetable, fruit, or protein, we begin by assigning the tomato the food type of its single nearest neighbor. This is called 1NN because k = 1. The orange is the nearest neighbor to the tomato, with a distance of 1.4. As the orange is a fruit, the 1NN algorithm classifies the tomato as a fruit.
If we use the kNN algorithm with k=3 instead, it performs a vote
among the three nearest neighbors: orange, grape, and nuts.
Because the majority class among these neighbors is fruit (2 out of
3 votes), the tomato again is classified as a fruit.
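
This vote can be sketched with the knn() function from the class package (introduced formally later in this section), using the ingredient values from the tables above:

# the nine labeled ingredients from the tables above
library(class)
train <- data.frame(
  sweetness   = c(10, 10,  7,  3,  1,  3,  8,  3,  7),
  crunchiness = c( 9,  1, 10, 10,  1,  6,  5,  7,  3))
labels <- factor(c("fruit", "fruit", "vegetable", "vegetable", "protein",
                   "protein", "fruit", "vegetable", "fruit"))
tomato <- data.frame(sweetness = 6, crunchiness = 4)

knn(train, tomato, cl = labels, k = 1)  # 1NN: fruit (nearest is the orange)
knn(train, tomato, cl = labels, k = 3)  # 3NN: fruit (2 of the 3 nearest are fruits)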
Choosing an appropriate k
Deciding how many neighbors to use for kNN determines how well the model will generalize to future data. The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff. Choosing a large k reduces the variance caused by noisy data, but it biases the learner and runs the risk of ignoring small but important patterns. Smaller values of k allow more complex decision boundaries that fit the training data more closely, at the risk of overfitting.
In practice, the choice of k depends on the difficulty of the concept to be learned and the number of records in the training data. Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples. In the food classifier, we might set k = 4, assuming there were 15 examples in the training dataset.
An alternative approach is to test several values of k on a variety of test datasets and choose the one that delivers the best classification performance, as sketched below.
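
A minimal sketch of this search, assuming labeled training and test sets such as the wbcd_train, wbcd_test, wbcd_train_labels, and wbcd_test_labels objects constructed in the breast cancer example below:

# try several values of k and report the accuracy of each
library(class)
for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test,
              cl = wbcd_train_labels, k = k)
  cat("k =", k, "accuracy =", mean(pred == wbcd_test_labels), "\n")
}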
Preparing data for use with kNN
Before applying the kNN algorithm, we need to transform the features onto a standard range. This is necessary because the distance formula depends on how the features are measured: if certain features have a much larger range of values than others, the distance measurements will be dominated by those larger-valued features.
In our example, this was not an issue, since both sweetness and
crunchiness were measured on a scale from 1 to 10. Suppose we
added an additional feature indicating spiciness, which we measure
using the Scoville scale. This scale is a standardized measure of
spice heat, ranging from 0 (not spicy) to over one million (for the
hottest chili peppers). Because the difference between spicy and
non-spicy foods can be over one million while the difference
between sweet and non-sweet food is at most 10, we might find
that our distance measure only takes spiciness into account.
Consequently, the impact of the other features will be negligible
compared to spiciness.
For this reason, we need to rescale the various features such that
each one contributes relatively equally to the distance formula. For
example, if crunchiness and sweetness are measured on a scale of 1
to 10, then we would like spiciness to have the same scale.
Types of normalization:
Min-max normalization:
This method rescales a feature such that all of its values fall in a range between 0 and 1. The formula for normalizing a feature is:

X_new = (X - min(X)) / (max(X) - min(X))

The normalized value indicates how far, from 0% to 100%, the original value fell along the range between the original minimum and maximum.
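
A quick sketch of the effect, using hypothetical Scoville ratings for the spiciness example (a reusable normalize() function is developed in the breast cancer example below):

# min-max rescaling of hypothetical Scoville ratings
spiciness <- c(0, 100, 500, 1500000)
(spiciness - min(spiciness)) / (max(spiciness) - min(spiciness))
# the rescaled values all fall between 0 and 1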
z-score standardization:
This method subtracts the mean of feature X from each value and divides the result by the standard deviation of X:

X_new = (X - μ) / σ = (X - Mean(X)) / StdDev(X)

This formula rescales each feature value in terms of how many standard deviations it falls above or below the mean. The resulting value is called a z-score. Unlike min-max normalized values, z-scores fall in an unbounded range and can be negative or positive.
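
Applying the formula to the sweetness column of the food table above gives a quick sketch of the result:

# z-score standardize the sweetness values from the food table
sweetness <- c(10, 10, 7, 3, 1, 3)
round((sweetness - mean(sweetness)) / sd(sweetness), 2)
#  1.12  1.12  0.34 -0.69 -1.20 -0.69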
The Euclidean distance formula is not defined for nominal data. Therefore, to calculate the distance between nominal features, we need to convert them into a numeric format.
A typical solution is to use dummy coding, where a value of 1 indicates one category and 0 indicates the other. For instance, dummy coding for a gender variable could be constructed as:

female = 1 if x = female, 0 otherwise

Dummy coding the two-category gender variable results in a single feature named female. There is no need to construct a separate feature for male, since the two categories are mutually exclusive.
An n-category nominal feature can be dummy coded by creating binary indicator variables for (n - 1) levels of the category. For example, dummy coding of a three-category temperature variable (hot, warm, cold) could be set up as (3 - 1) = 2 features:

hot = 1 if x = hot, 0 otherwise
warm = 1 if x = warm, 0 otherwise

Here, knowing that hot and warm are both 0 is enough to conclude that the temperature is cold, so we do not need a third feature for cold.
The distance between dummy-coded features is always 1 (if the values are different) or 0 (if the values are the same), so there is no need to normalize them.
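
A minimal R sketch of this temperature coding, using a hypothetical vector of observations:

# hypothetical temperature observations
temperature <- c("hot", "warm", "cold", "warm", "hot")

# (n - 1) = 2 binary indicators encode the 3-category feature
hot  <- ifelse(temperature == "hot",  1, 0)
warm <- ifelse(temperature == "warm", 1, 0)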
If the nominal values are ordinal, an alternative to dummy coding is to number the categories and apply normalization. For instance, cold, warm, and hot can be numbered 1, 2, 3, which normalizes to 0, 0.5, and 1. A caveat: this approach should only be used if the categories are equally spaced from one another.
Why is the kNN algorithm lazy?
Classification algorithms based on nearest neighbor methods are considered lazy because no abstraction occurs; the abstraction and generalization processes are skipped entirely. Under a strict definition of learning, in which the learner summarizes raw input into a model (equations, decision trees, clusters, if-then rules), a lazy learner is not really learning anything. Instead, it merely stores the training data, which takes very little time. Classification, however, is slow. This is the opposite of most classifiers, for which training takes a long time but classification is fast.
Due to this heavy reliance on the training instances, lazy learning is also known as instance-based learning or rote learning. Because instance-based learners do not build a model, the method is said to belong to the class of non-parametric learning methods: no parameters are learned about the data.
Without generating a theory about the underlying data, non-parametric methods limit our ability to understand how the classifier is using the data. On the other hand, they allow the learner to find natural patterns rather than forcing the data into a preconceived form.
Although kNN classifiers may be considered lazy, they are still
quite powerful.
Diagnosing breast cancer with the kNN algorithm
Routine breast cancer screening allows the disease to be diagnosed
and treated prior to it causing irreparable damage. The process of
early detection involves examining the breast tissue for abnormal
lumps or masses. If a lump is found, a fine needle aspiration biopsy
is performed, which utilizes a hollow needle to extract a small
portion of cells from the mass. A clinician then examines the cells
under a microscope to determine whether the mass is likely to be
malignant or benign.
If machine learning could automate the identification of the
cancerous cells, it would greatly benefit the healthcare system and
improve the efficiency of the detection process, allowing
physicians to focus on treatment rather than diagnosis.
We will investigate the utility of machine learning for detecting
cancerous cells by applying the kNN algorithm to measurements of
biopsied cells from women with abnormal breast masses.
Step 1: Collecting Data
We will use the "Breast Cancer Wisconsin (Diagnostic)" dataset from the UCI Machine Learning Repository. This data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of fine-needle aspirates of breast masses.
The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis (M for malignant, B for benign), and the remaining 30 features are numeric laboratory measurements. A few example rows:
id        diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean
87139402  B          12.32        12.39         78.85           464.1      0.1028           0.06981
8910251   B          10.6         18.95         69.28           346.4      0.09688          0.1147
905520    B          11.04        16.83         70.92           373.2      0.1077           0.07804
868871    B          11.28        13.39         73              384.8      0.1164           0.1136
9012568   B          15.19        13.21         97.65           711.8      0.07963          0.06934
906539    B          11.57        19.04         74.2            409.7      0.08546          0.07722
925291    B          11.51        23.93         74.52           403.5      0.09261          0.1021
87880     M          13.81        23.75         91.56           597.8      0.1323           0.1768
862989    B          10.49        19.29         67.41           336.1      0.09989          0.08578
The 30 numeric measurements comprise the mean, standard error, and worst (largest) values of 10 different characteristics of the digitized cell nuclei: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
Based on their names, these features seem to relate to the size,
shape and density of the cell nuclei.
Step 2: Exploring and preparing the data:
- Import the CSV data file from Dropbox, saving the breast cancer data to the wbcd data frame:
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
Using the command str(wbcd), we see that the data is structured with 569 examples and 32 features.
The first variable is an integer named id. This feature identifies the patient from whom the sample was taken; it provides no useful information for prediction and needs to be excluded from the model.
To drop the id feature, located in column 1, we will copy the wbcd
data frame by excluding this column:
wbcd <- wbcd[-1]
The next variable diagnosis is of particular interest, as it is the
outcome we hope to predict.
The command
> table(wbcd$diagnosis) indicates that 357 masses are benign and
212 are malignant.
Note that R vectors and matrices require all of their elements to share a single data type; mixing character and numeric data in one of them causes the numeric values to be coerced to strings, and numerical operations will probably not give what you expect. A data frame avoids this by letting each column have its own type.
Additionally, many R machine learning classifiers require that the target feature is coded as a factor.
Factors in R are stored as a vector of integer values with a corresponding set of character values that are used when the factor is displayed. The factor() function creates a factor; its only required argument is a vector of values, which is returned as a vector of factor values.
# recode diagnosis as a factor
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
# table of proportions with more informative labels
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

   Benign Malignant
     62.7      37.3
The remaining 30 features are all numeric and consist of 3 different measurements of 10 characteristics. We will examine three of them to illustrate a point:
> summary(wbcd[c("radius_mean", "area_mean", "smoothness_m
ean")])
radius_mean area_mean smoothness_mean
Min. : 6.981 Min. : 143.5 Min. :0.05263
1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
Median :13.370 Median : 551.1 Median :0.09587
Mean :14.127 Mean : 654.9 Mean :0.09636
3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
Max. :28.110 Max. :2501.0 Max. :0.16340
Do you notice anything problematic about these values? Recall that the distance calculation for kNN is heavily dependent on the measurement scale of the input features. Smoothness ranges from 0.05 to 0.16, while area_mean ranges from 143.5 to 2501.0, so the impact of area on the distance calculation will be much larger than that of smoothness. This could cause problems for our classifier, so we need to apply normalization to rescale the features to a standard range of values.
Transformation: normalizing numeric data
To normalize these features, we create a normalize() function in R. This function takes a vector x of numeric values and, for each value in x, subtracts the minimum value of x and divides by the range of x. The resulting vector is returned.
# create normalization function
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
Test the normalize() function on a couple of vectors:

normalize(c(1, 2, 3, 4, 5))
normalize(c(10, 20, 30, 40, 50))

The function returns the same values for both, even though the values in the second vector are ten times larger than those in the first.
Now we need to apply the normalize() function to every feature in the dataset except the diagnosis. Instead of normalizing each feature individually, we will use the R function lapply() to automate the process. The lapply() function takes a list and applies a function to each element of that list. As a data frame is a list of equal-length vectors, we can use lapply() to apply normalize() to each feature of the data frame. The resulting list then needs to be converted back into a data frame, so we will use the as.data.frame() function to that effect.
# normalize the wbcd data
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
The command above applies the normalize() function to features 2 to 31 of the dataset and stores the result in a data frame named wbcd_n, the _n suffix indicating that the data has been normalized. It also excludes the diagnosis feature, because we do not want to normalize it.
Let's repeat the summary() command to check that the three features radius_mean, area_mean, and smoothness_mean are now on the same scale.
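
For example (mirroring the earlier summary() call, now applied to the normalized data frame; each feature should now span 0 to 1):

# confirm the normalization: min-max guarantees a 0-to-1 range
summary(wbcd_n[c("radius_mean", "area_mean", "smoothness_mean")])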
Data Preparation: Creating training and test datasets
Although all 569 biopsies are labeled with a benign or malignant diagnosis, in order to evaluate the performance of the kNN algorithm we need to divide our data into two portions: a training dataset on which the algorithm can train, and a test dataset used to assess performance on data it has not seen before. The training dataset will be used to build the kNN model. The test dataset will let us determine how well the algorithm performs on unlabeled data and estimate the predictive accuracy of the model.
We will use 469 records for training and the remaining 100 to simulate new patients of unknown cancer status.
We will split the wbcd_n dataset into wbcd_train and wbcd_test:

# create training and test data
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

The first command takes rows 1 to 469 with all of their columns (the blank after the comma selects every column) and stores them in wbcd_train. The second command populates wbcd_test with rows 470 to 569.
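
This simple index split assumes the records appear in random order; if the data were sorted (for example, by diagnosis), a random partition would be needed instead. A sketch, keeping the same 469/100 split:

# draw a random 469/100 partition instead of taking rows in order
set.seed(123)  # for reproducibility
train_idx <- sample(nrow(wbcd_n), 469)
wbcd_train <- wbcd_n[train_idx, ]
wbcd_test  <- wbcd_n[-train_idx, ]
# the label vectors created below would then use the same indices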
Label or target variable:
The diagnosis label was excluded from the training and test datasets; however, we need it for training and evaluating the kNN model. We will store those values in two vectors, wbcd_train_labels and wbcd_test_labels:

# create labels for training and test data
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
Step 3: Training a model on the data:
After all these steps, we are ready to classify our unknown records. For kNN, the training phase involves no model building; it only stores the input data in a structured format.
To classify our test instances, we will use the kNN implementation from the class package, which provides a set of basic R functions for classification.
# load the "class" library
library(class)
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k=21)
The knn() function in the class package provides a standard, classic implementation of the kNN algorithm. For each instance in the test data, the function identifies the k nearest neighbors using Euclidean distance, where k is a user-specified number. The test instance is classified by a majority vote among the k nearest neighbors; that is, it is assigned the class held by the majority of those neighbors. A tie is broken at random.
Training and classification with the knn() function are performed in a single function call using four parameters:
kNN classification syntax
using the knn() function in the class package
Building the classifier and making predictions:
p <- knn(train, test, cl, k)
- train is the data frame containing the numeric training data
- test is the data frame containing the numeric test data
- cl is a factor vector with the class for each row in the training data
- k is an integer indicating the number of nearest neighbors to consider
The function returns a factor vector of predicted classes, one for each row of the test data frame.
Example:
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k=21)
We have already prepared our data for processing; the only thing left is to specify the number of neighbors to include in the vote. As our training dataset has 469 instances, we might try k = 21, an odd number roughly equal to the square root of 469 (21.66). Using an odd number reduces the chance of ending with a tied vote between the two classes.
The knn() function returns a factor vector wbcd_test_pred of
predicted labels for each example in the test dataset.
Step 4: Evaluating model performance:
The next step is to evaluate how well the predicted classes in the wbcd_test_pred vector match the known values in the wbcd_test_labels vector. To do this, we can use the CrossTable() function from the gmodels package:

# load the "gmodels" library and cross-tabulate predicted vs. actual values
library(gmodels)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
The resulting table looks like:

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                 | wbcd_test_pred
wbcd_test_labels |    Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 |
                 |     1.000 |     0.000 |     0.610 |
                 |     0.968 |     0.000 |           |
                 |(TN) 0.610 |(FP) 0.000 |           |
-----------------|-----------|-----------|-----------|
       Malignant |         2 |        37 |        39 |
                 |     0.051 |     0.949 |     0.390 |
                 |     0.032 |     1.000 |           |
                 |(FN) 0.020 |(TP) 0.370 |           |
-----------------|-----------|-----------|-----------|
    Column Total |        63 |        37 |       100 |
                 |     0.630 |     0.370 |           |
-----------------|-----------|-----------|-----------|
The cell percentages in the table indicate the proportion of values
that fall into four categories.
The top-left cell contains the true negative (TN) results: in 61 of 100 cases, the mass was benign and the kNN algorithm correctly classified it as such. The bottom-right cell contains the true positive (TP) results, where the classifier and the clinically determined label agree that the mass is malignant; a total of 37 predictions were true positives.
The cells labeled false negative (FN, a malignant mass classified as benign) and false positive (FP, a benign mass classified as malignant) count the examples where the kNN prediction disagreed with the true label. False negatives are the more dangerous errors, as they may lead a patient to believe she is cancer free when in reality the disease may continue to spread. False positives are less dangerous, but they come at a cost to the patient and the healthcare system, as additional tests or treatments may be provided unnecessarily.
In total, 2 of the 100 masses were incorrectly classified, giving the model an accuracy of 98%. It might be worthwhile to try another iteration of the model to see if we can improve the performance and reduce the number of dangerous false negatives.
Improving Model Performance
We will attempt two simple variations on our previous classifier:
1) We will employ an alternative method for rescaling our numeric features.
2) We will try several different values of k.

Transformation: z-score standardization
Although min-max normalization is traditionally used for kNN classification, it may not always be the most appropriate way to rescale features, because it compresses outliers toward the center of the range. Since malignant tumors can grow uncontrollably, extreme values may carry real diagnostic signal, so there is reason to let outliers weigh more heavily in the distance calculation. The z-score standardization, which does not squeeze values into a fixed range, may therefore improve our predictive accuracy.
To standardize a vector, we can use R's built-in scale() function, which by default rescales values using the z-score standardization. To apply scale() to the wbcd data, we use the following command, which rescales all features except the diagnosis:
## Step 5: Improving model performance ----
# use the scale() function to z-score standardize a data frame
wbcd_z <- as.data.frame(scale(wbcd[-1]))
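
To complete the comparison, the remaining steps mirror the earlier ones: re-split the standardized data, retrain, and re-evaluate. A sketch using the same 469/100 split and k = 21; whether accuracy actually improves must be judged from the resulting cross table:

# re-create training and test sets from the z-score standardized data
wbcd_train <- wbcd_z[1:469, ]
wbcd_test  <- wbcd_z[470:569, ]

# retrain and re-evaluate; the label vectors are unchanged
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)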