Lazy Learning: Classification Using Nearest Neighbors
The principle behind this machine learning approach is that objects that are alike are likely to have similar properties. We can use this principle to classify data by placing it in the category of its most similar, or "nearest," neighbors.
We will cover:
- The key concepts that define nearest neighbor classifiers, and why they are called "lazy"
- Methods to measure the similarity between two examples using distance
- How to use the R implementation of the k-Nearest Neighbors (kNN) algorithm to diagnose breast cancer
Classification using nearest neighbors:
Nearest neighbor classification is well suited to machine learning applications in which the relationships between the features and the target class are numerous, complicated, or difficult to understand. Put another way: if a concept is difficult to define, but you know it when you see it, nearest neighbor classification might be appropriate.
On the other hand, if there is no clear distinction among the groups, the algorithm may not be well suited for identifying the boundaries.
The kNN algorithm

Strengths:
- Simple and effective
- Makes no assumptions about the underlying data distribution
- Fast training phase

Weaknesses:
- Does not produce a model, which limits the ability to find insights in the relationships between features
- Slow classification phase
- Requires a large amount of memory
- Nominal features and missing data require additional processing
The kNN algorithm begins with a training dataset made up of
examples that are classified into several categories, as labeled by a
nominal variable. Assume that we have a test dataset containing
unlabeled examples that otherwise have the same features as the
training data. For each record in the test dataset, kNN identifies k
records in the training data that are the “nearest” in similarity,
where k is an integer specified in advance. The unlabeled test
instance is assigned the class of the majority of the k nearest
neighbors.
Let us assume we have a blind tasting experiment in which we need to classify the food we tasted as a fruit, protein, or vegetable. Suppose that prior to eating the mystery item, we created a taste dataset recording two features of each ingredient: a score from 1 to 10 for how crunchy the food is, and a score from 1 to 10 for how sweet it tastes. We then labeled each ingredient as one of three food types: fruit, vegetable, or protein. We could have a table such as:
ingredient   sweetness   crunchiness   food type
apple            10           9        fruit
banana           10           1        fruit
carrot            7          10        vegetable
celery            3          10        vegetable
cheese            1           1        protein
nuts              3           6        protein
The kNN algorithm treats the features as coordinates in a
multidimensional feature space. As our dataset includes only two
features, the feature space is 2 dimensional, with the x dimension
indicating the ingredient sweetness and the y dimension indicating
the crunchiness.
After constructing the dataset, we decide to use it to answer the question: is a tomato a fruit or a vegetable? We can use the nearest neighbor approach to determine which class is a better fit.
Calculating distance:
Locating an object's nearest neighbors requires a distance function, that is, a formula that measures the similarity between two instances.
Traditionally, the kNN algorithm uses Euclidean distance, specified by the following formula, where p and q are the instances to be compared, each having n features. The term p1 refers to the value of the first feature of p, while q1 refers to the first feature of q:

dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
The distance formula involves comparing the values of each feature. For example, to calculate the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7), the formula becomes:

dist(tomato, green bean) = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(18) ≈ 4.2
In a similar vein, we can calculate the distance between the tomato and several other ingredients as follows:

ingredient   sweetness   crunchiness   food type   distance to tomato
grape            8            5        fruit       sqrt((6-8)^2 + (4-5)^2) ≈ 2.2
green bean       3            7        vegetable   sqrt((6-3)^2 + (4-7)^2) ≈ 4.2
nuts             3            6        protein     sqrt((6-3)^2 + (4-6)^2) ≈ 3.6
orange           7            3        fruit       sqrt((6-7)^2 + (4-3)^2) ≈ 1.4
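
These distances are easy to reproduce in R. Below is a minimal sketch; the euclidean() helper is our own illustration, not part of any package:

# helper implementing the Euclidean distance formula above
euclidean <- function(p, q) sqrt(sum((p - q)^2))

tomato <- c(sweetness = 6, crunchiness = 4)
euclidean(tomato, c(8, 5))  # grape:      ~2.2
euclidean(tomato, c(3, 7))  # green bean: ~4.2
euclidean(tomato, c(3, 6))  # nuts:       ~3.6
euclidean(tomato, c(7, 3))  # orange:     ~1.4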
To classify the tomato as a vegetable, fruit, or protein, we begin by assigning the tomato the food type of its single nearest neighbor. This is called 1NN because k = 1. The orange is the nearest neighbor to the tomato, with a distance of 1.4. As the orange is a fruit, the 1NN algorithm classifies the tomato as a fruit.
If we use the kNN algorithm with k=3 instead, it performs a vote
among the three nearest neighbors: orange, grape, and nuts.
Because the majority class among these neighbors is fruit (2 out of
3 votes), the tomato again is classified as a fruit.
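
This vote can be sketched with the knn() function from the class package (introduced formally later in this section), using the ingredient values from the tables above:

# the nine labeled ingredients from the tables above
library(class)
train <- data.frame(
  sweetness   = c(10, 10,  7,  3,  1,  3,  8,  3,  7),
  crunchiness = c( 9,  1, 10, 10,  1,  6,  5,  7,  3))
labels <- factor(c("fruit", "fruit", "vegetable", "vegetable", "protein",
                   "protein", "fruit", "vegetable", "fruit"))
tomato <- data.frame(sweetness = 6, crunchiness = 4)

knn(train, tomato, cl = labels, k = 1)  # 1NN: fruit (nearest is the orange)
knn(train, tomato, cl = labels, k = 3)  # 3NN: fruit (2 of the 3 nearest are fruits)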
Choosing an appropriate k
Deciding how many neighbors to use for kNN determines how well the model will generalize to future data. The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff. Choosing a large k reduces the variance caused by noisy data, but it biases the learner and runs the risk of ignoring small but important patterns. Smaller values of k allow more complex decision boundaries that fit the training data more closely, at the risk of overfitting.
In practice, the choice of k depends on the difficulty of the concept to be learned and the number of records in the training data. Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples. In the food classifier, we might set k = 4, assuming there were 15 examples in the training dataset.
An alternative approach is to test several values of k on a variety of test datasets and choose the one that delivers the best classification performance, as sketched below.
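
A minimal sketch of this search, assuming labeled training and test sets such as the wbcd_train, wbcd_test, wbcd_train_labels, and wbcd_test_labels objects constructed in the breast cancer example below:

# try several values of k and report the accuracy of each
library(class)
for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test,
              cl = wbcd_train_labels, k = k)
  cat("k =", k, "accuracy =", mean(pred == wbcd_test_labels), "\n")
}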
Preparing data for use with kNN
Before applying the kNN algorithm, we need to transform the features onto a standard range. This is necessary because the distance formula depends on how the features are measured: if certain features have a much larger range of values than others, the distance measurements will be dominated by those larger-valued features.
In our example, this was not an issue, since both sweetness and
crunchiness were measured on a scale from 1 to 10. Suppose we
added an additional feature indicating spiciness, which we measure
using the Scoville scale. This scale is a standardized measure of
spice heat, ranging from 0 (not spicy) to over one million (for the
hottest chili peppers). Because the difference between spicy and
non-spicy foods can be over one million while the difference
between sweet and non-sweet food is at most 10, we might find
that our distance measure only takes spiciness into account.
Consequently, the impact of the other features will be negligible
compared to spiciness.
For this reason, we need to rescale the various features such that
each one contributes relatively equally to the distance formula. For
example, if crunchiness and sweetness are measured on a scale of 1
to 10, then we would like spiciness to have the same scale.
Types of normalization:
Min-max normalization:
This method rescales a feature such that all of its values fall in a range between 0 and 1. The formula for normalizing a feature is:

X_new = (X - min(X)) / (max(X) - min(X))

The normalized value indicates how far, from 0% to 100%, the original value fell along the range between the original minimum and maximum.
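
A quick sketch of the effect, using hypothetical Scoville ratings for the spiciness example (a reusable normalize() function is developed in the breast cancer example below):

# min-max rescaling of hypothetical Scoville ratings
spiciness <- c(0, 100, 500, 1500000)
(spiciness - min(spiciness)) / (max(spiciness) - min(spiciness))
# the rescaled values all fall between 0 and 1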
z-score standardization:
This method subtracts the mean of feature X from each value and divides the result by the standard deviation of X:

X_new = (X - μ) / σ = (X - Mean(X)) / StdDev(X)

This formula rescales each feature value in terms of how many standard deviations it falls above or below the mean. The resulting value is called a z-score. Unlike min-max normalized values, z-scores fall in an unbounded range and can be negative or positive.
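
Applying the formula to the sweetness column of the food table above gives a quick sketch of the result:

# z-score standardize the sweetness values from the food table
sweetness <- c(10, 10, 7, 3, 1, 3)
round((sweetness - mean(sweetness)) / sd(sweetness), 2)
#  1.12  1.12  0.34 -0.69 -1.20 -0.69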
The Euclidean distance formula is not defined for nominal data. Therefore, to calculate the distance between nominal features, we need to convert them into a numeric format.
A typical solution is to use dummy coding, where a value of 1 indicates one category and 0 indicates the other. For instance, dummy coding for a gender variable could be constructed as:

female = 1 if x = female, 0 otherwise

Dummy coding the two-category gender variable results in a single feature named female. There is no need to construct a separate feature for male, since the two categories are mutually exclusive.
An n-category nominal feature can be dummy coded by creating binary indicator variables for (n - 1) levels of the category. For example, dummy coding of a three-category temperature variable (hot, warm, cold) could be set up as (3 - 1) = 2 features:

hot = 1 if x = hot, 0 otherwise
warm = 1 if x = warm, 0 otherwise

Here, knowing that hot and warm are both 0 is enough to conclude that the temperature is cold, so we do not need a third feature for cold.
The distance between dummy-coded features is always 1 (if the values are different) or 0 (if the values are the same), so there is no need to normalize them.
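
A minimal R sketch of this temperature coding, using a hypothetical vector of observations:

# hypothetical temperature observations
temperature <- c("hot", "warm", "cold", "warm", "hot")

# (n - 1) = 2 binary indicators encode the 3-category feature
hot  <- ifelse(temperature == "hot",  1, 0)
warm <- ifelse(temperature == "warm", 1, 0)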
If the nominal values are ordinal, an alternative to dummy coding is to number the categories and apply normalization. For instance, cold, warm, and hot can be numbered 1, 2, 3, which normalizes to 0, 0.5, and 1. A caveat: this approach should only be used if the categories are equally spaced from one another.
Why is the kNN algorithm lazy?
Classification algorithms based on nearest neighbor methods are considered lazy because no abstraction occurs; the abstraction and generalization processes are skipped entirely. Under a strict definition of learning, in which the learner summarizes raw input into a model (equations, decision trees, clusters, if-then rules), a lazy learner is not really learning anything. Instead, it merely stores the training data, which takes very little time. Classification, however, is slow. This is the opposite of most classifiers, for which training takes a long time but classification is fast.
Due to this heavy reliance on the training instances, lazy learning is also known as instance-based learning or rote learning. Because instance-based learners do not build a model, the method is said to belong to the class of non-parametric learning methods: no parameters are learned about the data.
Without generating a theory about the underlying data, non-parametric methods limit our ability to understand how the classifier is using the data. On the other hand, they allow the learner to find natural patterns rather than forcing the data into a preconceived form.
Although kNN classifiers may be considered lazy, they are still
quite powerful.
Diagnosing breast cancer with the kNN algorithm
Routine breast cancer screening allows the disease to be diagnosed
and treated prior to it causing irreparable damage. The process of
early detection involves examining the breast tissue for abnormal
lumps or masses. If a lump is found, a fine needle aspiration biopsy
is performed, which utilizes a hollow needle to extract a small
portion of cells from the mass. A clinician then examines the cells
under a microscope to determine whether the mass is likely to be
malignant or benign.
If machine learning could automate the identification of the
cancerous cells, it would greatly benefit the healthcare system and
improve the efficiency of the detection process, allowing
physicians to focus on treatment rather than diagnosis.
We will investigate the utility of machine learning for detecting
cancerous cells by applying the kNN algorithm to measurements of
biopsied cells from women with abnormal breast masses.
Step 1: Collecting Data
We will use the "Breast Cancer Wisconsin (Diagnostic)" dataset from the UCI Machine Learning Repository. This data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of fine-needle aspirates of breast masses.
The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis (M for malignant, B for benign), and the remaining 30 features are numeric laboratory measurements. A few example rows:
id        diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean
87139402  B          12.32        12.39         78.85           464.1      0.1028           0.06981
8910251   B          10.6         18.95         69.28           346.4      0.09688          0.1147
905520    B          11.04        16.83         70.92           373.2      0.1077           0.07804
868871    B          11.28        13.39         73              384.8      0.1164           0.1136
9012568   B          15.19        13.21         97.65           711.8      0.07963          0.06934
906539    B          11.57        19.04         74.2            409.7      0.08546          0.07722
925291    B          11.51        23.93         74.52           403.5      0.09261          0.1021
87880     M          13.81        23.75         91.56           597.8      0.1323           0.1768
862989    B          10.49        19.29         67.41           336.1      0.09989          0.08578
The 30 numeric measurements comprise the mean, standard error, and worst (largest) values of 10 different characteristics of the digitized cell nuclei: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
Based on their names, these features seem to relate to the size,
shape and density of the cell nuclei.
Step 2: Exploring and preparing the data:
- Import the CSV data file from Dropbox, saving the breast cancer data to the wbcd data frame:
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
Using the command str(wbcd), we see that the data is structured with 569 examples and 32 features.
The first variable is an integer named id. This feature identifies the patient from whom the sample was taken; it provides no useful information for prediction and needs to be excluded from the model.
To drop the id feature, located in column 1, we will copy the wbcd
data frame by excluding this column:
wbcd <- wbcd[-1]
The next variable diagnosis is of particular interest, as it is the
outcome we hope to predict.
The command
> table(wbcd$diagnosis) indicates that 357 masses are benign and
212 are malignant.
Note that R vectors and matrices require all of their elements to share a single data type; mixing character and numeric data in one of them causes the numeric values to be coerced to strings, and numerical operations will probably not give what you expect. A data frame avoids this by letting each column have its own type.
Additionally, many R machine learning classifiers require that the target feature is coded as a factor.
Factors in R are stored as a vector of integer values with a corresponding set of character values that are used when the factor is displayed. The factor() function creates a factor; its only required argument is a vector of values, which is returned as a vector of factor values.
# recode diagnosis as a factor
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
# table of proportions with more informative labels
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

   Benign Malignant
     62.7      37.3
The remaining 30 features are all numeric and consist of 3 different measurements of 10 characteristics. We will examine three of them to illustrate a point:
> summary(wbcd[c("radius_mean", "area_mean", "smoothness_m
ean")])
radius_mean area_mean smoothness_mean
Min. : 6.981 Min. : 143.5 Min. :0.05263
1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
Median :13.370 Median : 551.1 Median :0.09587
Mean :14.127 Mean : 654.9 Mean :0.09636
3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
Max. :28.110 Max. :2501.0 Max. :0.16340
Do you notice anything problematic about these values? Recall that the distance calculation for kNN is heavily dependent on the measurement scale of the input features. Smoothness ranges from 0.05 to 0.16, while area_mean ranges from 143.5 to 2501.0, so the impact of area on the distance calculation will be much larger than that of smoothness. This could cause problems for our classifier, so we need to apply normalization to rescale the features to a standard range of values.
Transformation: normalizing numeric data
To normalize these features, we create a normalize() function in R. This function takes a vector x of numeric values and, for each value in x, subtracts the minimum value of x and divides by the range of x. The resulting vector is returned.
# create normalization function
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
Test the normalize() function on a couple of vectors:

normalize(c(1, 2, 3, 4, 5))
normalize(c(10, 20, 30, 40, 50))

The function returns the same values for both, even though the values in the second vector are ten times larger than those in the first.
Now we need to apply the normalize() function to every feature in the dataset except the diagnosis. Instead of normalizing each feature individually, we will use the R function lapply() to automate the process. The lapply() function takes a list and applies a function to each element of that list. As a data frame is a list of equal-length vectors, we can use lapply() to apply normalize() to each feature of the data frame. The resulting list then needs to be converted back into a data frame, so we will use the as.data.frame() function to that effect.
# normalize the wbcd data
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
The command above applies the normalize() function to features 2 to 31 of the dataset and stores the result in a data frame named wbcd_n, the _n suffix indicating that the data has been normalized. It also excludes the diagnosis feature, because we do not want to normalize it.
Let's repeat the summary() command to check that the three features radius_mean, area_mean, and smoothness_mean are now on the same scale.
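
For example (mirroring the earlier summary() call, now applied to the normalized data frame; each feature should now span 0 to 1):

# confirm the normalization: min-max guarantees a 0-to-1 range
summary(wbcd_n[c("radius_mean", "area_mean", "smoothness_mean")])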
Data Preparation: Creating training and test datasets
Although all 569 biopsies are labeled with a benign or malignant diagnosis, in order to evaluate the performance of the kNN algorithm we need to divide our data into two portions: a training dataset on which the algorithm can train, and a test dataset used to assess performance on data it has not seen before. The training dataset will be used to build the kNN model. The test dataset will let us determine how well the algorithm performs on unlabeled data and estimate the predictive accuracy of the model.
We will use 469 records for training and the remaining 100 to simulate new patients of unknown cancer status.
We will split the wbcd_n dataset into wbcd_train and wbcd_test:

# create training and test data
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

The first command takes rows 1 to 469 with all of their columns (the blank after the comma selects every column) and stores them in wbcd_train. The second command populates wbcd_test with rows 470 to 569.
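
This simple index split assumes the records appear in random order; if the data were sorted (for example, by diagnosis), a random partition would be needed instead. A sketch, keeping the same 469/100 split:

# draw a random 469/100 partition instead of taking rows in order
set.seed(123)  # for reproducibility
train_idx <- sample(nrow(wbcd_n), 469)
wbcd_train <- wbcd_n[train_idx, ]
wbcd_test  <- wbcd_n[-train_idx, ]
# the label vectors created below would then use the same indices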
Label or target variable:
The diagnosis label was excluded from the training and test datasets; however, we need it for training and evaluating the kNN model. We will store those values in two vectors, wbcd_train_labels and wbcd_test_labels:

# create labels for training and test data
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
Step 3: Training a model on the data:
After all these steps, we are ready to classify our unknown records. For kNN, the training phase involves no model building; it only stores the input data in a structured format.
To classify our test instances, we will use the kNN implementation from the class package, which provides a set of basic R functions for classification.
# load the "class" library
library(class)
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k=21)
The knn() function in the class package provides a standard, classic implementation of the kNN algorithm. For each instance in the test data, the function identifies the k nearest neighbors using Euclidean distance, where k is a user-specified number. The test instance is classified by a majority vote among the k nearest neighbors; that is, it is assigned the class held by the majority of those neighbors. A tie is broken at random.
Training and classification with the knn() function are performed in a single function call using four parameters:
kNN classification syntax
using the knn() function in the class package
Building the classifier and making predictions:
p <- knn(train, test, cl, k)
- train is the data frame containing the numeric training data
- test is the data frame containing the numeric test data
- cl is a factor vector with the class for each row in the training data
- k is an integer indicating the number of nearest neighbors to consider
The function returns a factor vector of predicted classes, one for each row of the test data frame.
Example:
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k=21)
We have already prepared our data for processing; the only thing left is to specify the number of neighbors to include in the vote. As our training dataset has 469 instances, we might try k = 21, an odd number roughly equal to the square root of 469 (21.66). Using an odd number reduces the chance of ending with a tied vote between the two classes.
The knn() function returns a factor vector wbcd_test_pred of
predicted labels for each example in the test dataset.
Step 4: Evaluating model performance:
The next step is to evaluate how well the predicted classes in the wbcd_test_pred vector match the known values in the wbcd_test_labels vector. To do this, we can use the CrossTable() function from the gmodels package:

# load the "gmodels" library and cross-tabulate predicted vs. actual values
library(gmodels)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
The resulting table looks like:

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                 | wbcd_test_pred
wbcd_test_labels |    Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 |
                 |     1.000 |     0.000 |     0.610 |
                 |     0.968 |     0.000 |           |
                 |(TN) 0.610 |(FP) 0.000 |           |
-----------------|-----------|-----------|-----------|
       Malignant |         2 |        37 |        39 |
                 |     0.051 |     0.949 |     0.390 |
                 |     0.032 |     1.000 |           |
                 |(FN) 0.020 |(TP) 0.370 |           |
-----------------|-----------|-----------|-----------|
    Column Total |        63 |        37 |       100 |
                 |     0.630 |     0.370 |           |
-----------------|-----------|-----------|-----------|
The cell percentages in the table indicate the proportion of values
that fall into four categories.
The top-left cell contains the true negative (TN) results: in 61 of 100 cases, the mass was benign and the kNN algorithm correctly classified it as such. The bottom-right cell contains the true positive (TP) results, where the classifier and the clinically determined label agree that the mass is malignant; a total of 37 predictions were true positives.
The cells labeled false negative (FN, a malignant mass classified as benign) and false positive (FP, a benign mass classified as malignant) count the examples where the kNN prediction disagreed with the true label. False negatives are the more dangerous errors, as they may lead a patient to believe she is cancer free when in reality the disease may continue to spread. False positives are less dangerous, but they come at a cost to the patient and the healthcare system, as additional tests or treatments may be provided unnecessarily.
In total, 2 of the 100 masses were incorrectly classified, giving the model an accuracy of 98%. It might be worthwhile to try another iteration of the model to see if we can improve the performance and reduce the number of dangerous false negatives.
Improving Model Performance
We will attempt two simple variations on our previous classifier:
1) We will employ an alternative method for rescaling our numeric features.
2) We will try several different values of k.

Transformation: z-score standardization
Although min-max normalization is traditionally used for kNN classification, it may not always be the most appropriate way to rescale features, because it compresses outliers toward the center of the range. Since malignant tumors can grow uncontrollably, extreme values may carry real diagnostic signal, so there is reason to let outliers weigh more heavily in the distance calculation. The z-score standardization, which does not squeeze values into a fixed range, may therefore improve our predictive accuracy.
To standardize a vector, we can use R's built-in scale() function, which by default rescales values using the z-score standardization. To apply scale() to the wbcd data, we use the following command, which rescales all features except the diagnosis:
## Step 5: Improving model performance ----
# use the scale() function to z-score standardize a data frame
wbcd_z <- as.data.frame(scale(wbcd[-1]))
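
To complete the comparison, the remaining steps mirror the earlier ones: re-split the standardized data, retrain, and re-evaluate. A sketch using the same 469/100 split and k = 21; whether accuracy actually improves must be judged from the resulting cross table:

# re-create training and test sets from the z-score standardized data
wbcd_train <- wbcd_z[1:469, ]
wbcd_test  <- wbcd_z[470:569, ]

# retrain and re-evaluate; the label vectors are unchanged
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)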