Breast Cancer Detection IE594 Project Report

DEPARTMENT OF MECHANICAL AND INDUSTRIAL ENGINEERING

Course: IE-594 Data Science - II

A report on

Breast Cancer Detection Using Proteomes.

Under Guidance of

Dr. Ashkan Sharabiani

By,

Ashish Menkudale (UIN: 656130575)

Amol Deshpande (UIN: 655587196)

Geet Deogaonkar (UIN: 675017667)

Kunal Lall (UIN: 669676558)

BREAST CANCER DETECTION USING PROTEOMES

CONTENTS

Introduction

Data Set

Data Preprocessing: Data visualization

Data Preprocessing: Normality checks

Data Preprocessing: Missing values

Data Preprocessing: Data transformation (PCA implementation)

Supervised Learning: Classification with Support vector machines

Supervised Learning: Classification with Multinomial logistic regression

Supervised Learning: Classification with K nearest neighbors

Supervised Learning: Classification with Decision trees

Supervised Learning: Classification with Gradient boosting (Ensemble Techniques)

Supervised Learning: Classification with Random forest (Ensemble Techniques)

Supervised Learning: Classification with Neural net

Unsupervised Learning: K Means clustering

Unsupervised Learning: Hierarchical clustering

Application development: Graphical user interface (Tkinter)

Conclusion

Future scope

Reference

Code

LIST OF FIGURES

Figure 1: high-resolution accurate-mass tandem mass spectrometry (MS/MS) analysis of 105 samples for peptide fractionation and phosphopeptide enrichment.

Figure 2: Selection of 12,553 proteins (10,062 genes)

Figure 3: initial distribution of the sample intensities. X axis: Intensity by mass spectrometry. Y Axis: Number of proteins (12553 in total).

Figure 4: Types of tumor samples and respective count.

Figure 5: Distribution of age when breast cancer was diagnosed.

Figure 6: Gender distribution

Figure 7: Estrogen-receptor status. A tumor is positive (ER+) if it has receptors for estrogen. This suggests that the cancer cells, like normal breast cells, may receive signals from estrogen that could promote their growth.

Figure 8: Progesterone-receptor status. A tumor is positive (PR+) if it has progesterone receptors. Again, this means that the cancer cells may receive signals from progesterone that could promote their growth.

Figure 9: HER2 (human epidermal growth factor receptor 2) status, which tells whether or not HER2 is playing a role in the.

Figure 10: American Joint Committee on Cancer status: tumor size

Figure 11: Vital Status of collected samples

Figure 12: Distribution for intensity after scaling.

Figure 13: entire data set.

Figure 14: DataFrame after converting into principal components. Shape 89 X 6.

Figure 15: PCA transformation

Figure 16: SVM working principle

Figure 17: Confusion matrix for classification by support vector machine

Figure 18: Confusion matrix for classification by multinomial logistic regression

Figure 19: Confusion matrix for classification by K nearest neighbors

Figure 20: Confusion matrix for classification by decision tree classifier (gini).

Figure 21: Confusion matrix for classification by decision tree classifier (entropy).

Figure 22: Confusion matrix for classification using gradient boosting.

Figure 23: Confusion matrix for classification using random forest.

Figure 24: PCA visualization

Figure 25: Cluster visualization

Figure 26: Hierarchical clustering with different methodologies.

Figure 27: Hierarchical clustering.

Figure 28: GUI interface.

Figure 29: Sample result for classification using logistic regression.

INTRODUCTION

Approximately 7.6 million people die of cancer every year, of whom 10.3% are breast cancer patients. The chances of survival are high only if the cancer is detected at an early stage and classified into the appropriate type. A system that accurately classifies types of cancer can therefore be highly beneficial in tackling the disease. Our objective is to develop clustering as well as classification algorithms that can effectively classify types of breast cancer.

The idea is to find a different set of proteins (they all have unique NP_/XP_ identifiers) that would divide the data set even more robustly, perhaps, into a large number of clusters with very distinct protein expression signatures. The data needs to be used to assess how the mutations in the DNA are affecting the protein expression landscape in breast cancer.

Genes in our DNA are first transcribed into RNA molecules, which are then translated into proteins. Changing the information content of DNA has an impact on the behavior of the proteome, which is the main functional unit of cells, taking care of cell division, DNA repair, enzymatic reactions, signaling, etc. The associated data set contains published iTRAQ proteome profiling of 77 breast cancer samples generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).

It contains expression values for ~12,000 proteins for each sample, with missing values present when a given protein could not be quantified in a given sample. Initially, we planned to use two clustering algorithms, namely K-means clustering and hierarchical clustering. As we studied the data set further, we used classification algorithms as well.

For the scope of this project, it was decided to analyze the data to form distinct clusters based on tumor type, using unsupervised learning (K-means clustering and hierarchical clustering). It was also decided to treat the analysis as a classification task: based on different attributes (primarily mass spectrometry results for 12,553 proteins), several classification algorithms were implemented to see whether a model can predict the correct cancer type label.

DATA SET

105 breast tumors previously characterized by the TCGA were selected for proteomic analysis after histopathological documentation. The cohort included a balanced representation of PAM50-defined intrinsic subtypes including 25 basal-like, 29 luminal A, 33 luminal B, and 18 HER2 (ERBB2)-enriched tumors, along with 3 normal breast tissue samples.

Samples were analyzed by high-resolution accurate-mass tandem mass spectrometry (MS/MS) that included extensive peptide fractionation and phosphopeptide enrichment (Extended Data Fig. 1). Each square is color scaled and provides the result for the mass-to-charge ratio of different protein types.

Figure 1: high-resolution accurate-mass tandem mass spectrometry (MS/MS) analysis of 105 samples for peptide fractionation and phosphopeptide enrichment.

An isobaric peptide labelling approach (iTRAQ) was employed to quantify protein and phosphosite levels across samples, with 37 iTRAQ 4-plexes analysed in total. A total of 15,369 proteins (12,405 genes) and 62,679 phosphosites were confidently identified with 11,632 proteins per tumour and 26,310 phosphosites per tumour on average.

After filtering for observation in at least a quarter of the samples (Supplementary Methods, Extended Data Fig. 2), 12,553 proteins (10,062 genes) and 33,239 phosphosites, with their relative abundances quantified across tumors, were used in subsequent analyses in this study.

Figure 2: Selection of 12,553 proteins (10,062 genes)

Stable longitudinal performance and low technical noise were demonstrated by repeated interspersed analyses of a single batch of patient-derived luminal and basal breast cancer xenograft samples [10]. Owing to the heterogeneous nature of breast tumors [11, 12, 13], and because proteomic analyses were performed on tumor fragments that were different from those used in the genomic analyses, rigorous pre-specified sample and data quality control metrics were implemented [14, 15]. Extensive analyses concluded that 28 of the 105 samples were compromised by protein degradation. These samples were excluded from further analysis with subsequent informatics focused on the 77 tumor samples and three biological replicates.

The consolidated data is provided in three files, described below.

File 1: 77_cancer_proteomes_CPTAC_itraq.csv

RefSeq_accession_number: RefSeq protein ID (each protein has a unique ID in a RefSeq database)

gene_symbol: a symbol unique to each gene (every protein is encoded by some gene)

gene_name: a full name of that gene

Remaining columns: log2 iTRAQ ratios for each sample (protein expression data, most important); the last three columns are from healthy individuals

File 2: clinical_data_breast_cancer.csv

First column "Complete TCGA ID" is used to match the sample IDs in the main cancer proteomes file (see example script).

All other columns have self-explanatory names, contain data about the cancer classification of a given sample using different methods.

File 3: PAM50_proteins.csv

Contains the list of genes and proteins used by the PAM50 classification system.

The column RefSeqProteinID contains the protein IDs that can be matched with the IDs in the main protein expression data set.

DATA PREPROCESSING: DATA VISUALIZATION

Every data analysis requires initial data cleaning; otherwise, outliers in the data set produce poor prediction models and can seriously reduce model accuracy. Initially, it was decided to analyze the data through visualization to get an overview. A few plots are shown below to give an overview of the distribution of some features in the data set.

Figure 3: initial distribution of the sample intensities. X axis: Intensity by mass spectrometry. Y Axis: Number of proteins (12553 in total).

A few other features were visualized to get the essence of the data; these are plotted below. Visualization also helped in making sense of the relations between variables, as the problem statement and the three files were initially difficult to relate to each other.

Figure 4: Types of tumor samples and respective count.

Figure 5: Distribution of age when breast cancer was diagnosed.

Figure 6: Gender distribution

Figure 7: Estrogen-receptor status. A tumor is positive (ER+) if it has receptors for estrogen. This suggests that the cancer cells, like normal breast cells, may receive signals from estrogen that could promote their growth.

Figure 8: Progesterone-receptor status. A tumor is positive (PR+) if it has progesterone receptors. Again, this means that the cancer cells may receive signals from progesterone that could promote their growth.

Figure 9: HER2 (human epidermal growth factor receptor 2) status, which tells whether or not HER2 is playing a role in the.

Figure 10: American Joint Committee on Cancer status: tumor size

Figure 11: Vital Status of collected samples

Data was preprocessed further to build models.

DATA PREPROCESSING: NORMALITY CHECKS

Many statistical methods rely on the assumption that the data are normally distributed. Hence, before applying such a method, it should be confirmed whether the data is normally distributed or not.

From Figure 3 (initial distribution of the mass spectrometry results), it was seen that the values were not normally distributed. The values were scaled with respect to the mean to bring the data closer to a normal distribution. The resulting distribution after scaling is shown below.

Figure 12: Distribution for intensity after scaling.
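The report does not spell out the exact scaling formula. The following is a minimal sketch, assuming that "scaled with respect to mean" means a z-score (subtract the mean, divide by the standard deviation) applied to the intensities of one sample. The function name `standardize` and the example values are hypothetical.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # Constant column: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Hypothetical log2 iTRAQ ratios for one sample (not taken from the data set).
intensities = [1.2, -0.4, 0.3, 2.1, -1.7]
scaled = standardize(intensities)
```

After this step the scaled values have mean 0 and standard deviation 1, which puts all samples on a common scale before further processing.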

DATA PREPROCESSING: MISSING VALUES

Before building the models, the data was categorized according to feature type. Because the data was incomplete and consisted of different types of attributes, different strategies were laid out for handling missing values.

Rows with missing values were identified. Missing values account for less than 5% of the total observations. In these rows, data points with missing categorical values were eliminated from consideration. For continuous and discrete data, missing values were replaced by the mean value of that variable over the entire data set.

Nominal variables: Missing values taken as 0

Ordinal variables: Data points with missing values were deleted.

Discrete variables: Missing values taken as Mean value.

Continuous variables: Missing values taken as Mean value.
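The four rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's actual code: rows are dictionaries with `None` marking a missing value, the column names (`er_status`, `tumor_stage`, `age`) are hypothetical, and the mean is taken over all observed values of a column before any row is dropped.

```python
def impute(rows, nominal, ordinal, numeric):
    """Apply the per-type missing-value rules to a list of row dicts (None = missing)."""
    # Column means are taken over every observed value in the original data.
    means = {}
    for col in numeric:
        observed = [r[col] for r in rows if r[col] is not None]
        means[col] = sum(observed) / len(observed)
    cleaned = []
    for row in rows:
        if any(row[col] is None for col in ordinal):
            continue  # ordinal value missing: drop the data point
        row = dict(row)  # copy so the input is left untouched
        for col in nominal:
            if row[col] is None:
                row[col] = 0  # nominal value missing: taken as 0
        for col in numeric:
            if row[col] is None:
                row[col] = means[col]  # discrete/continuous: taken as the mean
        cleaned.append(row)
    return cleaned

# Hypothetical clinical rows, for illustration only.
rows = [
    {"er_status": "Positive", "tumor_stage": "T2", "age": 60.0},
    {"er_status": None, "tumor_stage": "T1", "age": None},
    {"er_status": "Negative", "tumor_stage": None, "age": 40.0},
]
cleaned = impute(rows, nominal=["er_status"], ordinal=["tumor_stage"], numeric=["age"])
```

Here the third row is dropped because its ordinal value is missing, while the second row keeps its place with the nominal gap set to 0 and the age set to the column mean.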

DATA PREPROCESSING: DATA TRANSFORMATION (PCA IMPLEMENTATION)

The resulting data after normalizing is shown below. There are 89 samples studied. For each sample, the mass spectrometry result (mass-to-charge ratio) is recorded for 12,553 different proteins. Each individual sample has a specific tag number.

For each tag number, the tumor stage is documented in another file named ‘clinical_data_breast_cancer.csv’.

Figure 13: entire data set.

The data frame was transposed and the respective tumor types were appended to it. The resulting data frame was of shape 12583 X 89. We used principal components to represent these features. By trial and error, we decided to reduce the entire data set to 5 principal components. The resulting data frame, shown below, is used for further analysis.

Figure 14: DataFrame after converting into principal components. Shape 89 X 6.

The pictorial representation of the principal components representing the data set is shown below.

Figure 15: PCA transformation
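The reduction to 5 principal components can be sketched with NumPy as below. This is a minimal illustration, assuming an SVD-based PCA on mean-centered data; the report does not state which PCA implementation was used. Random numbers stand in for the real intensity matrix, and 200 features are used instead of 12,553 to keep the example small.

```python
import numpy as np

def pca_transform(X, n_components=5):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # equals Xc @ Vt[:n].T
    explained = (S ** 2) / np.sum(S ** 2)    # variance ratio per component
    return scores, explained[:n_components]

# Placeholder data: 89 samples, 200 features (assumed sizes for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(89, 200))
scores, ratio = pca_transform(X, n_components=5)
```

Appending the tumor-type label to `scores` would give the 89 X 6 shape mentioned in the caption of Figure 14.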

The data preprocessing is completed at this stage. Supervised and unsupervised learning algorithms are used and different associated parameters are studied.

SUPERVISED LEARNING: CLASSIFICATION WITH SUPPORT VECTOR MACHINES

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Figure 16: SVM working principle

Results of classification carried out using support vector machines are summarized in a confusion matrix.

Predicted \ True | Healthy | T1 | T2 | T3 | T4
Healthy | 0 | 0 | 0 | 0 | 0
T1 | 0 | 2 | 0 | 0 | 0
T2 | 1 | 1 | 19 | 2 | 2
T3 | 0 | 0 | 0 | 0 | 0
T4 | 0 | 0 | 0 | 0 | 0

Figure 17: Confusion matrix for classification by support vector machine

The parameters used are:

(C=1.0, cache_size = 200, class_weight = None, coef0 = 0.0, decision_function_shape='ovo', degree=3, gamma = 'auto', kernel = 'rbf', max_iter = -1, probability = False, random_state = None, shrinking = True, tol = 0.001, verbose = False)

With these parameters, the accuracy is observed to be optimum. Accuracy for test data is equal to 77.78%.
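The reported accuracy can be checked directly from the SVM confusion matrix: accuracy is the sum of the diagonal (correct predictions) divided by the total number of test samples. A short sketch, with the matrix values copied from this section and the helper name `accuracy` chosen for illustration:

```python
# Rows: predicted class, columns: true class (Healthy, T1, T2, T3, T4).
svm_matrix = [
    [0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [1, 1, 19, 2, 2],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

def accuracy(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

svm_accuracy = round(100 * accuracy(svm_matrix), 2)  # 21 correct out of 27
```

The same helper applies to the confusion matrices of the other classifiers in this report.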

SUPERVISED LEARNING: CLASSIFICATION WITH MULTINOMIAL LOGISTIC REGRESSION

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility.

This allows the choice of K alternatives to be modeled as a set of K-1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K-1 compared against it, one at a time.
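In the standard textbook formulation (not taken from the project code), choosing class K as the pivot gives K-1 log-odds equations, where x is the feature vector and \beta_k is an assumed coefficient vector for class k:

```latex
\ln\frac{P(Y=k \mid x)}{P(Y=K \mid x)} = \beta_k \cdot x, \qquad k = 1, \dots, K-1

P(Y=k \mid x) = \frac{e^{\beta_k \cdot x}}{1 + \sum_{j=1}^{K-1} e^{\beta_j \cdot x}}, \qquad
P(Y=K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\beta_j \cdot x}}
```

With the five classes used in this report (Healthy and T1 to T4), this amounts to four coefficient vectors.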

The results obtained are represented in a confusion matrix. The accuracy obtained is 66.67%.

Predicted \ True | Healthy | T1 | T2 | T3 | T4
Healthy | 1 | 0 | 0 | 0 | 0
T1 | 0 | 0 | 2 | 0 | 0
T2 | 0 | 3 | 17 | 2 | 2
T3 | 0 | 0 | 0 | 0 | 0
T4 | 0 | 0 | 0 | 0 | 0

Figure 18: Confusion matrix for classification by multinomial logistic regression

The parameters used are:

(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1).

SUPERVISED LEARNING: CLASSIFICATION WITH K NEAREST NEIGHBORS

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. Both for classification and regression, it can be useful to assign weight to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. A commonly used distance metric for continuous variables is Euclidean distance.
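The voting rule described above, including the optional 1/d weighting, can be sketched in plain Python. This is an illustrative implementation rather than the classifier used in the project; the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=5, weighted=False):
    """Classify `query` by a vote among its k nearest training points.

    train: list of (features, label) pairs; distance is Euclidean.
    weighted: if True, each neighbor votes with weight 1/d instead of 1.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = math.dist(features, query)
        if weighted:
            if d == 0:
                return label  # exact match: its label wins outright
            votes[label] += 1.0 / d
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Toy data: one close T1 point and two more distant T2 points.
train = [((0.5, 0.0), "T1"), ((2.0, 0.0), "T2"), ((0.0, 2.0), "T2")]
uniform_label = knn_predict(train, (0.0, 0.0), k=3)
weighted_label = knn_predict(train, (0.0, 0.0), k=3, weighted=True)
```

With uniform weights the two distant T2 points outvote the single T1 point; with 1/d weights the close T1 point carries weight 2 against a combined weight of 1 for T2, so the prediction changes.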

With the KNN classifier, the accuracy obtained is 77.78%. The confusion matrix is shown below.

Predicted \ True | Healthy | T1 | T2 | T3 | T4
Healthy | 0 | 0 | 0 | 0 | 0
T1 | 0 | 2 | 0 | 0 | 0
T2 | 1 | 0 | 19 | 2 | 2
T3 | 0 | 1 | 0 | 0 | 0
T4 | 0 | 0 | 0 | 0 | 0

Figure 19: Confusion matrix for classification by K nearest neighbors

The parameters used are:

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)

SUPERVISED LEARNING: CLASSIFICATION WITH DECISION TREES

Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node. There are many specific decision-tree algorithms. Notable ones include:

ID3 (Iterative Dichotomiser 3)

C4.5 (successor of ID3)

CART (Classification And Regression Tree)

CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.

MARS: extends decision trees to handle numerical data better.

The results of classification using decision trees are as follows.

For gini,

Predicted \ True | Healthy | T1 | T2 | T3 | T4
Healthy | 0 | 0 | 0 | 0 | 0
T1 | 0 | 2 | 6 | 0 | 0
T2 | 1 | 0 | 10 | 2 | 1
T3 | 0 | 1 | 2 | 0 | 1
T4 | 0 | 0 | 1 | 0 | 0

Figure 20: Confusion matrix for classification by decision tree classifier (gini).

For entropy,

Predicted \ True | Healthy | T1 | T2 | T3 | T4
Healthy | 1 | 0 | 2 | 0 | 0
T1 | 0 | 2 | 2 | 0 | 0
T2 | 0 | 1 | 13 | 2 | 0
T3 | 0 | 0 | 0 | 0 | 1
T4 | 0 | 0 | 2 | 0 | 1

Figure 21: Confusion matrix for classification by decision tree classifier (entropy).

Accuracy score for Decision tree classifier (gini) with test data set is 48.14 %, whereas accuracy score for Decision tree classifier (entropy) with test data set is 59.25 %.

The parameters used are,

(criterion='gini' / 'entropy', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, class_weight=None, presort=False)
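The two values of the `criterion` parameter correspond to two standard node impurity measures. As a small illustration (textbook definitions, with hypothetical helper names and toy labels), both can be computed from the class labels that reach a node:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

# Toy node contents, not taken from the data set.
mixed_node = ["T1", "T1", "T2", "T2"]
pure_node = ["T2", "T2", "T2", "T2"]
```

Both measures are zero for a pure node and largest for an evenly mixed one; a tree chooses the split that reduces the chosen impurity the most.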

SUPERVISED LEARNING: CLASSIFICATION WITH GRADIENT BOOSTING (ENSEMBLE TECHNIQUES)

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.

Subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration.
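The stage-wise procedure and the subsample fraction f can be illustrated with a deliberately small sketch. It assumes squared-error loss, one-dimensional inputs and one-split regression stumps as base learners, none of which is stated in the report; the function names are hypothetical and this is not the classifier used in the project.

```python
import random

def fit_stump(xs, residuals):
    """Fit a one-split regression stump on 1-D inputs by least squares."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:
            continue  # a split needs points on both sides
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:
        # All inputs identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1:]

def boost(xs, ys, n_rounds=50, learning_rate=0.1, f=1.0, seed=0):
    """Gradient boosting for squared loss; each round fits on a fraction f of the data."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    k = max(1, int(f * len(xs)))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(xs)), k)          # drawn without replacement
        residuals = [ys[i] - preds[i] for i in idx]  # negative gradient of squared loss
        t, lm, rm = fit_stump([xs[i] for i in idx], residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)

# Toy step function: 0 for x < 5, 1 otherwise.
xs = list(range(10))
ys = [0.0] * 5 + [1.0] * 5
full_model = boost(xs, ys, f=1.0)        # deterministic variant
stochastic_model = boost(xs, ys, f=0.5)  # each stump sees half of the data
```

With f = 1.0 every round sees the full training set, so the run is deterministic; with f = 0.5 each stump is fit on a random half, which is the regularizing variant described above.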


Predicted \ True | Healthy | T1 | T2 | T3 | T4
Healthy | 0 | 0 | 0 | 0 | 0
T1 | 0 | 2 | 1 | 0 | 0
T2 | 1 | 0 | 17 | 2 | 2
T3 | 0 | 1 | 1 | 0 | 0
T4 | 0 | 0 | 0 | 0 | 0

Figure 22: Confusion matrix for classification using gradient boosting.

Accuracy score for Gradient boosting classifier with test data set is: 70.37 %

The parameters used are,

(criterion = 'friedman_mse', init = None, learning_rate = 1, loss = 'deviance', max_depth = 3, max_features = None, max_leaf_nodes = None, min_impurity_split = 1e-07, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 100, presort = 'auto', random_state = None, subsample = 1.0, verbose = 0, warm_start = False).

SUPERVISED LEARNING: CLASSIFICATION WITH RANDOM FOREST (ENSEMBLE TECHNIQUES)

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
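The aggregation step, the mode of the tree outputs for classification and their mean for regression, can be shown in a few lines. The helper names and the toy tree outputs are hypothetical; training of the individual trees is not shown.

```python
from collections import Counter
from statistics import fmean

def aggregate_classification(tree_votes):
    """Return the most common class among the individual tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the mean of the individual tree predictions."""
    return fmean(tree_predictions)

# Toy outputs from three trees for a single sample.
forest_class = aggregate_classification(["T2", "T1", "T2"])
forest_value = aggregate_regression([1.0, 2.0, 3.0])
```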

Accuracy score for Random Forest classifier with test data set is 77.78 %.

Predicted \ True | Healthy | T1 | T2 | T3 | T4
Healthy | 0 | 0 | 1 | 0 | 0
T1 | 0 | 2 | 0 | 1 | 0
T2 | 0 | 1 | 17 | 1 | 0
T3 | 0 | 0 | 2 | 0 | 0
T4 | 0 | 0 | 2 | 0 | 0

Figure 23: Confusion matrix for classification using random forest.

The parameters used are,

(n_estimators = 10, criterion = 'gini', max_depth = None, min_samples_split = 2, min_samples_leaf = 1, min_weight_fraction_leaf = 0.0, max_features = 'auto', max_leaf_nodes = None, min_impurity_split = 1e-07, bootstrap = True, oob_score = False, n_jobs = 1, random_state = None, verbose = 0, warm_start = False, class_weight = None)

SUPERVISED LEARNING: CLASSIFICATION WITH NEURAL NET

Artificial neural networks (ANNs) or connectionist systems are a computational model used in computer science and other research disciplines, which is based on a large collection of simple neural units (artificial neurons), loosely analogous to the observed behavior of a biological brain's axons. Each neural unit is connected with many others, and links can enhance or inhibit the activation state of adjoining neural units. Each individual neural unit computes its output using a summation function. There may be a threshold function or limiting function on each connection and on the unit itself, such that the signal must surpass the limit before propagating to other neurons. These systems are self-learning and trained, rather than explicitly programmed, and excel in areas where the solution or feature detection is difficult to express in a traditional computer program.
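The summation-and-threshold behavior of a single neural unit can be sketched as follows. This is a generic illustration with hypothetical weights, not a layer of the network configured below.

```python
def neuron(inputs, weights, bias=0.0, threshold=0.0):
    """Fire (return 1) only if the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > threshold else 0

# Positive weights enhance the activation, negative weights inhibit it.
both_active = neuron([1, 1], [0.6, 0.6], threshold=1.0)   # 1.2 > 1.0
one_active = neuron([1, 0], [0.6, 0.6], threshold=1.0)    # 0.6 <= 1.0
inhibited = neuron([1, 1], [0.6, -0.6], threshold=0.0)    # 0.0 is not above 0.0
```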

The neural net is configured as follows:

n_nodes_hl1 = 100

n_nodes_hl2 = 200

n_nodes_hl3 = 100

n_classes = 5

batch_size = 3

hm_epochs = 300

The accuracy obtained is 66.67%.

UNSUPERVISED LEARNING: K MEANS CLUSTERING

k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple way to partition a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations lead to different results. A good choice is therefore to place them as far away from each other as possible.

The next step is to take each point belonging to the data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point, k new centroids are re-calculated as the barycenters of the clusters resulting from the previous step. With these k new centroids, a new binding is made between the same data set points and the nearest new center. This generates a loop. As a result of this loop, the k centers change their location step by step until no more changes occur; in other words, the centers do not move any more.
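The assign-and-recompute loop described above can be written compactly in plain Python. This is a minimal sketch in which the initial centers are passed in explicitly (the parameters listed in this section use 'k-means++' initialization instead); the function name and the toy points are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=300):
    """Alternate nearest-center assignment and centroid update until centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the barycenter of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # centers no longer move
        centers = new_centers
    return centers, labels

# Toy data with two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(points, centers=[(0, 0), (10, 10)])
```

On this toy data the loop stops after the second pass, once the centers have settled on the mean of each group.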

We generated 5 principal components but used only 3 of them to display the data points. The result is shown below:

Figure 24: PCA visualization

Applying the K-means algorithm to the data set gave the following result.

Figure 25: Cluster visualization

The parameters used are:

class sklearn.cluster.KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

With these parameters, the clustering obtained is shown in the above figure.

UNSUPERVISED LEARNING: HIERARCHICAL CLUSTERING

Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Different types of hierarchical clustering algorithm were applied to the data set; the linkage methods used are described below.

In complete-linkage clustering, the link between two clusters contains all element pairs, and the

distance between clusters equals the distance between those two elements (one in each cluster)

that are farthest away from each other. The shortest of these links that remains at any step

causes the fusion of the two clusters whose elements are involved. The method is also known

as farthest neighbour clustering. The result of the clustering can be visualized as a dendrogram.

In single-linkage clustering is one of several methods of hierarchical clustering. It is based on

grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two

clusters that contain the closest pair of elements not yet belonging to the same cluster as each

other. A drawback of this method is that it tends to produce long thin clusters in which nearby

elements of the same cluster have small distances, but elements at opposite ends of a cluster

may be much farther from each other than to elements of other clusters. This may lead to

difficulties in defining classes that could usefully subdivide the data.

In average-linkage clustering, at each step, the nearest two clusters are combined into a higher-level cluster. The distance between two clusters A and B, of sizes |A| and |B| respectively, is taken to be the average of the distances between all pairs of points (a, b) with a in A and b in B, that is, the mean distance between the elements of the two clusters.
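To make the difference between the three linkage criteria concrete, the following sketch computes all of them for one pair of clusters. The helper name `linkage_distances` and the sample coordinates are assumptions chosen for illustration; they do not come from the proteome data.

```python
from math import dist


def linkage_distances(A, B):
    """Return the single, complete and average linkage distances between A and B."""
    pairwise = [dist(a, b) for a in A for b in B]
    return {
        "single": min(pairwise),                        # closest pair
        "complete": max(pairwise),                      # farthest pair
        "average": sum(pairwise) / (len(A) * len(B)),   # mean over all |A|*|B| pairs
    }


# Hypothetical clusters on a line: A = {0, 1}, B = {4, 6}.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
d = linkage_distances(A, B)
```

Here the pairwise distances are 4, 6, 3 and 5, so the single, complete and average linkage distances are 3, 6 and 4.5 respectively. The same two clusters can therefore look close or far apart depending on the criterion, which is why the three dendrograms differ.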



Figure 26: Hierarchical clustering with different methodologies.



An enlarged view of the complete-linkage dendrogram is shown in the figure below:

Figure 27: Hierarchical clustering.



APPLICATION DEVELOPMENT: GRAPHICAL USER INTERFACE (TKINTER)

To make this project more accessible and usable, we implemented a graphical user interface (GUI) that lets the user view the results of the different classification techniques.

The interface is shown below.

Figure 28: GUI interface.

Figure 28 shows the graphical user interface, created with the Tkinter library in Python, for presenting the results of the breast cancer detection problem. The GUI uses widgets such as labels, radio buttons, a drop-down menu, buttons and a canvas.



The GUI shows the user the results of a selected machine learning algorithm and model in the form of a confusion matrix and the model's accuracy. It gives the user a simple, intuitive way to compare the results of all the models.
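As an illustration of how an accuracy figure relates to a confusion matrix, the sketch below recomputes it from the SVM matrix printed in the code listing of this report, where rows are predicted classes and columns are true classes. The function names are hypothetical and are not part of the project code.

```python
LABELS = ["Healthy", "T1", "T2", "T3", "T4"]

# SVM confusion matrix as printed in the code listing:
# rows = predicted class, columns = true class.
SVM_MATRIX = [
    [0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [1, 1, 19, 2, 2],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def accuracy(matrix):
    """Fraction of samples on the diagonal (predicted class == true class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix, labels):
    """Share of each true class (column) that was predicted correctly."""
    recalls = {}
    for j, name in enumerate(labels):
        true_count = sum(row[j] for row in matrix)
        recalls[name] = matrix[j][j] / true_count if true_count else None
    return recalls


svm_accuracy = accuracy(SVM_MATRIX)
svm_recall = recall_per_class(SVM_MATRIX, LABELS)
```

On this matrix the accuracy is 21/27, about 77.8%, and only T1 and T2 samples are ever classified correctly. The per-class view shows what the single accuracy number hides: almost every test sample is assigned to T2, the most frequent class.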

To generate the results of a specific model, the user first selects the type of machine learning algorithm and then the model belonging to that type. The model is set by pressing the ‘Set Model’ button, and the results are displayed as soon as the user clicks the ‘Calculate Result’ button.

For instance, if a user selects ‘Supervised Learning’ and then ‘Logistic Regression’ as the model, the GUI shows the confusion matrix and the model’s accuracy once the user sets the model and presses the ‘Calculate Result’ button.
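The two-level selection described above can be sketched without any GUI code. The menu entries and the status text are taken from the Tkinter listing in this report; the names `MODELS_BY_TYPE`, `menu_options` and `status_message` are hypothetical and only illustrate the logic, they are not the project's implementation.

```python
# Drop-down entries offered for each radio-button choice.
MODELS_BY_TYPE = {
    "Supervised Learning": [
        "Logistic Regression", "KNN", "Random Forest", "Decision Tree-Gini",
        "Decision Tree-Entropy", "SVM", "Neural Network", "Gradient Boosting",
    ],
    "Unsupervised Learning": ["K-Clustering", "Hierarchial Clustering"],
}


def menu_options(algorithm_type):
    """Models to list in the drop-down menu; a single empty entry by default."""
    return list(MODELS_BY_TYPE.get(algorithm_type, [""]))


def status_message(model):
    """Text shown after the user presses the button that sets the model."""
    return "The Selected Model is " + model


# Choosing a learning type repopulates the menu and preselects its first entry.
options = menu_options("Supervised Learning")
message = status_message(options[0])
```

Keeping the mapping in one dictionary means a new model only has to be added in a single place, whatever widget toolkit renders the menu.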

Figure 29: Sample result for classification using logistic regression.



CONCLUSION

Tumor Types are clustered from the breast cancer proteome data set by successful implementation of unsupervised learning techniques like K-Means & Hierarchical Clustering.

Tumor Types are successfully classified into their respective types using various supervised learning techniques like Neural Network, Support Vector Machines, Logistic Regression, Decision Trees, Gradient Boosting, Random Forest & K-NN.

FUTURE SCOPE

The supervised learning models can be iterated further to achieve higher accuracy.

The parameters of these classification techniques can be tuned to optimize performance.

Overall, the small number of samples made it difficult to build models with higher accuracy. As more data becomes available, these models can be retrained.

The GUI can also be made more comprehensive, giving the end user more flexibility to adjust the model parameters and visualize the results.

REFERENCE

• Stages of cancer: https://cancerstaging.org/references-tools/quickreferences/Documents/BreastMedium.pdf

• Random Forest parameters http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

• SVM parameters http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

• KNN parameters http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

• Logistic regression parameters http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

• Decision tree gini/entropy parameters http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

• gradient boosting parameters http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

• Breast Cancer Proteomes retrieved from: https://www.kaggle.com/piotrgrabo/breastcancerproteomes



CODE

#Pre processing

#importing required libraries.

import sklearn, re

import numpy as np

import pandas as pd

import matplotlib

import matplotlib.pyplot as plt

from sklearn.preprocessing import Imputer

from sklearn import preprocessing

from sklearn.decomposition import PCA

import tensorflow as tf

#Read in the data - keep in mind last 3 columns are healthy individuals

Proteomics=pd.read_csv('C://Users//Dell//Desktop//DS_2_project//77_cancer_proteomes_CPTAC_itraq.csv')

#checking head and shape

Proteomics.head()

#IDs.shape

IDs=pd.read_csv('C://Users//Dell//Desktop//DS_2_project//clinical_data_breast_cancer.csv')



#Edit some names for potential matching later (TCGA)

Proteomics.columns=[re.sub('\.[0-9][0-9]TCGA','',x) for x in Proteomics.columns]

IDs['Complete_TCGA_ID']=[re.sub('TCGA\-','',x) for x in IDs['Complete_TCGA_ID']]

#Code the tumor type to the patient ID

IDDict=dict(zip(IDs['Complete_TCGA_ID'],IDs['Tumor']))

#Add the healthy subjects

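#The last 3 columns are the healthy individuals
IDDict[Proteomics.columns[-3]]='Healthy'
IDDict[Proteomics.columns[-2]]='Healthy'
IDDict[Proteomics.columns[-1]]='Healthy'
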
#Get the X variables separate

#Excluding 1st three columns

ProteomicsXRaw=Proteomics[Proteomics.columns[3:len(Proteomics.columns)]].T

ProteomicsXRaw.head()

ProteomicsXRaw.shape



#Impute missing values, scale before PCA

impute=Imputer(missing_values='NaN',strategy='mean',axis=0)

impute.fit(ProteomicsXRaw)

ProteomicsX=impute.transform(ProteomicsXRaw)

#How is the distribution of the sample intensities?

SampleIntensities=ProteomicsXRaw.sum(axis=0)

SampleDist=plt.hist(SampleIntensities.values)

plt.title('Sample Intensity Distribution')

plt.show()



#Scaling

for inputs in range(len(ProteomicsX.T)):
    ProteomicsX.T[inputs]=preprocessing.scale(ProteomicsX.T[inputs])

#How is the distribution of the sample intensities after imputing and transforming? More suitable for PCA?

SampleIntensities=ProteomicsX.sum(axis=0)

SampleDist=plt.hist(SampleIntensities)

plt.title('Sample Intensity Distribution')

plt.show()

IntermedSet=[IDDict[x] for x in list(ProteomicsXRaw.index)]

#PCA

pca=PCA(n_components=5)

ProteomicsX_pca=pca.fit(ProteomicsX)

ProteomicsX_pca2=ProteomicsX_pca.transform(ProteomicsX)



#Merging two arrays, one with PCAs and another with tumor types.

a = pd.DataFrame(ProteomicsX_pca2)

b = pd.DataFrame(IntermedSet)

frames = [a,b]

result = pd.concat(frames, axis = 1)

#rename columns

result.columns = ['pca_1','pca_2','pca_3','pca_4','pca_5','tumor']

result.head()

result.shape

#split into test train.
#The columns were renamed above, so they are selected by name.

X1 = result[['pca_1','pca_2','pca_3','pca_4','pca_5']]

y1 = result[['tumor']]

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=45)

X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.3, random_state=45)

#make y dummy variable for train1 and test1

i = ['tumor']

y_train1 = pd.get_dummies(y_train1, columns = i, drop_first = False)

y_test1 = pd.get_dummies(y_test1, columns = i, drop_first = False)



#visualization

#see the tumor types and their frequency.

import plotly.plotly as py

import plotly.graph_objs as go

import numpy as np

x = y1.tumor

data = [go.Histogram(x=x)]

layout = go.Layout(

title='Type of tumor count',

xaxis=dict(

title='Types of tumor'

),

yaxis=dict(

title='Count'

),

bargap=0.2,

bargroupgap=0.1

)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='styled histogram')



#see the age distribution of samples.
import plotly.plotly as py
import plotly.graph_objs as go
import numpy as np

x = IDs.Age_at_Initial_Pathologic_Diagnosis
data = [go.Histogram(x=x)]
layout = go.Layout(
    title='Distribution of age when breast cancer was diagnosed',
    xaxis=dict(title='Age at initial pathological diagnosis'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='styled histogram')



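# Estrogen receptor (ER) status.
x = IDs.ER_Status
data = [go.Histogram(x=x)]
layout = go.Layout(
    title='ER_Status',
    xaxis=dict(title='ER_status'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='styled histogram')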





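#Progesterone receptor (PR) status
x = IDs.PR_Status
data = [go.Histogram(x=x)]
layout = go.Layout(
    title='PR_Status',
    xaxis=dict(title='PR_status'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='styled histogram')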





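#HER2 Status
x = IDs.HER2_Final_Status
data = [go.Histogram(x=x)]
layout = go.Layout(
    title='HER2_Final_Status',
    xaxis=dict(title='HER2_Final_Status'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='styled histogram')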





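#AJCC stage of samples.
x = IDs.AJCC_Stage
data = [go.Histogram(x=x)]
layout = go.Layout(
    title='AJCC_Stage',
    xaxis=dict(title='AJCC_Stage'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='styled histogram')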





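#Vital status of samples.
x = IDs.Vital_Status
data = [go.Histogram(x=x)]
layout = go.Layout(
    title='Vital_status',
    xaxis=dict(title='Vital Status'),
    yaxis=dict(title='Count'),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='styled histogram')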



# Classification models_Supervised learning



#Support vector machines

from sklearn.preprocessing import label_binarize

from sklearn.grid_search import GridSearchCV

from sklearn.svm import SVC, LinearSVC

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report

from sklearn import svm

clf = svm.SVC(decision_function_shape = "ovo")

clf.fit(X_train, y_train)

#getting accuracy

from sklearn.metrics import accuracy_score

predicted_svm = clf.predict(X_train)

#print ("Accuracy score for training data is ", (accuracy_score(y_train,predicted_svm)*100),"%")

predicted_svm_y = clf.predict(X_test)

print ("Accuracy score for testing data with SVM is ", (accuracy_score(y_test,predicted_svm_y)*100),"%")

#confusion matrix

cm_svm = confusion_matrix(y_test, clf.predict(X_test))

cm_df = pd.DataFrame(cm_svm.T, index=clf.classes_, columns=clf.classes_)

cm_df.index.name = 'Predicted'

cm_df.columns.name = 'True'

print(cm_df)

#output
True       Healthy  T1  T2  T3  T4
Predicted
Healthy          0   0   0   0   0
T1               0   2   0   0   0
T2               1   1  19   2   2
T3               0   0   0   0   0
T4               0   0   0   0   0

#confusion matrix visualization.


from sklearn.preprocessing import label_binarize

cm_svm = confusion_matrix(y_test, clf.predict(X_test))

fig = plt.figure()

ax = fig.add_subplot(111)

cax = ax.matshow(cm_svm)

plt.title('Confusion matrix of the classifier for Support Vector machines')

fig.colorbar(cax)

labels = ['Healthy','T1','T2','T3','T4']

ax.set_xticklabels([''] + labels)

ax.set_yticklabels([''] + labels)

plt.xlabel('Predicted')

plt.ylabel('True')

plt.show()

#Logistic regression

from sklearn.linear_model import LogisticRegression

from sklearn import metrics

# estimator is intiated

log = LogisticRegression()

# model fitting

log.fit(X_train, y_train)



# prediction

y_pred_log = log.predict(X_test)

# model score on accuracy

pred_logreg = metrics.accuracy_score(y_test, y_pred_log)

print ("Accuracy for Logistic Regression: {}%".format(pred_logreg * 100))

#Confusion matrix
cm_log = confusion_matrix(y_test, log.predict(X_test))
cm_df = pd.DataFrame(cm_log.T, index=clf.classes_, columns=clf.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)

#output
True       Healthy  T1  T2  T3  T4
Predicted
Healthy          1   0   0   0   0
T1               0   0   2   0   0
T2               0   3  17   2   2
T3               0   0   0   0   0
T4               0   0   0   0   0

#Confusion matrix visualization
cm_log = confusion_matrix(y_test, y_pred_log)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm_log)
plt.title('Confusion matrix of the classifier logistic regression')
fig.colorbar(cax)
labels = ['Healthy','T1','T2','T3','T4']
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()


#KNN classification

# import the model

from sklearn.neighbors import KNeighborsClassifier

# instantiate the estimator

knn = KNeighborsClassifier()

# fit the model

knn.fit(X_train, y_train)

# predict the response

y_pred_knn = knn.predict(X_test)

from sklearn import metrics

# accuracy score

pred_knn = metrics.accuracy_score(y_test, y_pred_knn)

print ("Accuracy for KNN: {}%".format(pred_knn * 100))

#Confusion matrix for KNN



cm_knn = confusion_matrix(y_test, knn.predict(X_test))
cm_df = pd.DataFrame(cm_knn.T, index=clf.classes_, columns=clf.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)

#output
True       Healthy  T1  T2  T3  T4
Predicted
Healthy          0   0   0   0   0
T1               0   2   0   0   0
T2               1   0  19   2   2
T3               0   1   0   0   0
T4               0   0   0   0   0

#Confusion matrix _ KNN visualization
cm_knn = confusion_matrix(y_test, y_pred_knn)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm_knn)
plt.title('Confusion matrix of the classifier KNN')
fig.colorbar(cax)
labels = ['Healthy','T1','T2','T3','T4']
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()


#Decision trees with gini and entropy

from sklearn import tree

# Creating tree object

# model = tree.DecisionTreeRegressor() for regression

model_gini = tree.DecisionTreeClassifier(criterion='gini') # for gini, by default.

model_entropy = tree.DecisionTreeClassifier(criterion='entropy') # for entropy.

# Train the model using the training sets and check score

model_gini.fit(X_train, y_train)

model_entropy.fit(X_train, y_train)

#print ("Score with gini (training):", model_gini.score(X_train, y_train)*100,"%")

#print ("Score with entropy (training):", model_entropy.score(X_train, y_train)*100,"%")

#Prediction with decision tree classifier.

predicted_dt_gini = model_gini.predict(X_test)

predicted_dt_entropy = model_entropy.predict(X_test)

print ("Accuracy score for Decision tree classifier (gini) with test data set is:", (accuracy_score(y_test,predicted_dt_gini)*100),"%")

print ("Accuracy score for Decision tree classifier (entropy) with test data set is:", (accuracy_score(y_test,predicted_dt_entropy)*100),"%")



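#confusion matrix Gini
cm_dt_gini = confusion_matrix(y_test, model_gini.predict(X_test))
cm_df = pd.DataFrame(cm_dt_gini.T, index=clf.classes_, columns=clf.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)

#output
True       Healthy  T1  T2  T3  T4
Predicted
Healthy          0   0   0   0   0
T1               0   2   6   0   0
T2               1   0  10   2   1
T3               0   1   2   0   1
T4               0   0   1   0   0

#confusion matrix Gini visualization.
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm_dt_gini)
plt.title('Confusion matrix of the decision tree classifier (gini)')
fig.colorbar(cax)
labels = ['Healthy','T1','T2','T3','T4']
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()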


#Confusion matrix entropy
cm_dt_entropy = confusion_matrix(y_test, model_entropy.predict(X_test))
cm_df = pd.DataFrame(cm_dt_entropy.T, index=clf.classes_, columns=clf.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)

#output
True       Healthy  T1  T2  T3  T4
Predicted
Healthy          1   0   2   0   0
T1               0   2   2   0   0
T2               0   1  13   2   0
T3               0   0   0   0   1
T4               0   0   2   0   1

#Confusion matrix entropy visualization.
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm_dt_entropy)
plt.title('Confusion matrix of the decision tree classifier (entropy)')
fig.colorbar(cax)
labels = ['Healthy','T1','T2','T3','T4']
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

#Gradient boosting.

from sklearn.ensemble import GradientBoostingClassifier

gbf = GradientBoostingClassifier(n_estimators=100, learning_rate=1, max_depth=3)

gbf.fit(X_train, y_train)

#getting accuracy of gradient boosting.



predicted_train_gbf = gbf.predict(X_train)

#print ("Accuracy score for Gradient boosting classifier with train data set is:", (accuracy_score(y_train,predicted_train_gbf))*100,"%")

predicted_test_gbf = gbf.predict(X_test)

print ("Accuracy score for Gradient boosting classifier with test data set is:", ((accuracy_score(y_test,predicted_test_gbf))*100),"%")

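#confusion matrix gradient boosting.
cm_gb = confusion_matrix(y_test, gbf.predict(X_test))
cm_df = pd.DataFrame(cm_gb.T, index=clf.classes_, columns=clf.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)

#output
True       Healthy  T1  T2  T3  T4
Predicted
Healthy          0   0   0   0   0
T1               0   2   1   0   0
T2               1   0  17   2   2
T3               0   1   1   0   0
T4               0   0   0   0   0

#confusion matrix gradient boosting visualization
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm_gb)
plt.title('Confusion matrix of the gradient boosting classifier')
fig.colorbar(cax)
labels = ['Healthy','T1','T2','T3','T4']
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()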


#Random Forest

from sklearn.ensemble import RandomForestClassifier #use RandomForestRegressor for regression problem

# Creating Random Forest object for classifier.

model= RandomForestClassifier(n_estimators=100)

model.fit(X_train, y_train)

predictedRF= model.predict(X_test)

predictedRF1 = model.predict(X_train)

print ("Accuracy score for Random Forest classifier with train data set is:", (accuracy_score(y_train, predictedRF1)*100),"%")

print ("Accuracy score for Random Forest classifier with test data set is:", ((accuracy_score(y_test, predictedRF))*100),"%")

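#confusion matrix for random forest.
#The original listing passed gbf.predict(X_test) here and did not transpose the matrix,
#so the output printed below is the gradient boosting matrix with true classes in the rows.
cm_rf = confusion_matrix(y_test, model.predict(X_test))
cm_df = pd.DataFrame(cm_rf.T, index=clf.classes_, columns=clf.classes_)
cm_df.index.name = 'Predicted'
cm_df.columns.name = 'True'
print(cm_df)

#output (from the original run)
True       Healthy  T1  T2  T3  T4
Predicted
Healthy          0   0   1   0   0
T1               0   2   0   1   0
T2               0   1  17   1   0
T3               0   0   2   0   0
T4               0   0   2   0   0

#confusion matrix for random forest visualization.
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm_rf)
plt.title('Confusion matrix of the random forest classifier')
fig.colorbar(cax)
labels = ['Healthy','T1','T2','T3','T4']
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()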


#Tensor flow.

n_nodes_hl1 = 30

n_nodes_hl2 = 30

n_nodes_hl3 = 30

n_nodes_hl4 = 30

n_nodes_hl5 = 30

n_classes = 5

batch_size = 20

hm_epochs = 500

x = tf.placeholder('float')

y = tf.placeholder('float')

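hidden_1_layer = {'f_fum':n_nodes_hl1,
                  'weight':tf.Variable(tf.random_normal([5, n_nodes_hl1])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl1]))}

# Layers 2 to 5 follow the same pattern; the lines lost from the listing are
# reconstructed from the layer sizes above and their use in neural_network_model.
hidden_2_layer = {'f_fum':n_nodes_hl2,
                  'weight':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl2]))}

hidden_3_layer = {'f_fum':n_nodes_hl3,
                  'weight':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl3]))}

hidden_4_layer = {'f_fum':n_nodes_hl4,
                  'weight':tf.Variable(tf.random_normal([n_nodes_hl3, n_nodes_hl4])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl4]))}

hidden_5_layer = {'f_fum':n_nodes_hl5,
                  'weight':tf.Variable(tf.random_normal([n_nodes_hl4, n_nodes_hl5])),
                  'bias':tf.Variable(tf.random_normal([n_nodes_hl5]))}

output_layer = {'f_fum':None,
                'weight':tf.Variable(tf.random_normal([n_nodes_hl5, n_classes])),
                'bias':tf.Variable(tf.random_normal([n_classes])),}

# Nothing changes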


def neural_network_model(data):
    # Five fully connected hidden layers with ReLU activations.
    l1 = tf.add(tf.matmul(data,hidden_1_layer['weight']), hidden_1_layer['bias'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1,hidden_2_layer['weight']), hidden_2_layer['bias'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2,hidden_3_layer['weight']), hidden_3_layer['bias'])
    l3 = tf.nn.relu(l3)
    l4 = tf.add(tf.matmul(l3,hidden_4_layer['weight']), hidden_4_layer['bias'])
    l4 = tf.nn.relu(l4)
    l5 = tf.add(tf.matmul(l4,hidden_5_layer['weight']), hidden_5_layer['bias'])
    l5 = tf.nn.relu(l5)
    output = tf.matmul(l5,output_layer['weight']) + output_layer['bias']
    return output

def train_neural_network(x):
    prediction = neural_network_model(x)
    # Keyword arguments avoid mixing up the order of logits and labels.
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y) )
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            epoch_loss = 0
            i=0
            # Iterate over the training set in mini-batches.
            while i < len(X_train1):
                start = i
                end = i+batch_size
                batch_x = np.array(X_train1[start:end])
                batch_y = np.array(y_train1[start:end])
                _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
                epoch_loss += c
                i+=batch_size
            print('Epoch', epoch+1, 'completed out of',hm_epochs,'loss:',epoch_loss)
        # Evaluate on the held-out test set with one-hot encoded labels.
        correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
        print('Accuracy:', (accuracy.eval({x:X_test1, y:y_test1})*100))

train_neural_network(x)

Accuracy: 78.4444447756

# Unsupervised learning Hierarchical Clustering

import seaborn as sns

from sklearn.preprocessing import scale


from sklearn.cluster import KMeans

from scipy.cluster import hierarchy

%matplotlib inline

plt.style.use('seaborn-white')

df2 = pd.DataFrame(ProteomicsX_pca2)

df2.columns = np.arange(df2.columns.size)

df2.info()

#output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 5 columns):
0    89 non-null float64
1    89 non-null float64
2    89 non-null float64
3    89 non-null float64
4    89 non-null float64
dtypes: float64(5)
memory usage: 3.6 KB

import matplotlib as mpl
# Assumption: the listing never defines X. It is taken here to be df2 (the PCA scores),
# and the labels come from y1, because y was rebound to a TensorFlow placeholder above.
X = df2
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6))
color_idx = pd.factorize(y1.tumor)[0]
cmap = plt.cm.hsv
# Left plot
ax1.scatter(X.iloc[:,0], X.iloc[:,1], c=color_idx, cmap=cmap, alpha=0.5, s=50)
ax1.set_ylabel('Principal Component 2')
# Right plot
ax2.scatter(X.iloc[:,0], X.iloc[:,2], c=color_idx, cmap=cmap, alpha=0.5, s=50)
ax2.set_ylabel('Principal Component 3')
# Custom legend for the classes (y) since we do not create scatter plots per class (which could have their own labels).
handles = []
labels = pd.factorize(y1.tumor.unique())
norm = mpl.colors.Normalize(vmin=0.0, vmax=14.0)
for i, v in zip(labels[0], labels[1]):
    handles.append(mpl.patches.Patch(color=cmap(norm(i)), label=v, alpha=0.5))
ax2.legend(handles=handles, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# xlabel for both plots
for ax in fig.axes:
    ax.set_xlabel('Principal Component 1')


pd.DataFrame([X.iloc[:,:5].std(axis=0, ddof=0).as_matrix(),

pca.explained_variance_ratio_[:5],

np.cumsum(pca.explained_variance_ratio_[:5])],

index=['Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion'],

columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])

# Hierarchical Clustering


X.iloc[:,:10].var(axis=0, ddof=0).plot(kind='bar', rot=0)

plt.ylabel('Variances');



fig, (ax1,ax2,ax3) = plt.subplots(1,3, figsize=(20,20))

# One dendrogram per linkage method, drawn side by side.
for linkage, cluster, ax in zip([hierarchy.complete(X), hierarchy.average(X), hierarchy.single(X)],
                                ['c1','c2','c3'],
                                [ax1,ax2,ax3]):
    cluster = hierarchy.dendrogram(linkage, labels=X.index, orientation='right',
                                   color_threshold=0, leaf_font_size=10, ax=ax)

ax1.set_title('Complete Linkage')
ax2.set_title('Average Linkage')
ax3.set_title('Single Linkage');


plt.figure(figsize=(10,20))

cut4 = hierarchy.dendrogram(hierarchy.complete(X),

labels=X.index, orientation='right', color_threshold=140, leaf_font_size=10)

plt.vlines(140,0,plt.gca().yaxis.get_data_interval()[1], colors='r', linestyles='dashed');



#K-Means Clustering

list(ProteomicsXRaw.index)

#Plotting the first 3 components
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(1,figsize=(9,6))
ax = fig.gca(projection='3d')
from collections import OrderedDict
TumorCode={'Healthy':'green','T1':'yellow','T2':'orange','T3':'red','T4':'darkviolet'}
ColorSet=[TumorCode[x] for x in IntermedSet]

# One scatter call per sample so that each point carries its tumor label.
for i,c,ID in zip(range(len(ProteomicsX_pca2)),ColorSet,IntermedSet):
    ax.scatter3D(xs=ProteomicsX_pca2[:,0][i],
                 ys=ProteomicsX_pca2[:,1][i],
                 zs=ProteomicsX_pca2[:,2][i],
                 c=c,
                 label=ID,
                 s=90,zorder=1)

ax.set_xlabel(str.format('1st Component'+'\n'+str(ProteomicsX_pca.explained_variance_ratio_[0])[0:5])+'%')
ax.set_ylabel(str.format('2nd Component'+'\n'+str(ProteomicsX_pca.explained_variance_ratio_[1])[0:5])+'%')
ax.set_zlabel(str.format('3rd Component'+'\n'+str(ProteomicsX_pca.explained_variance_ratio_[2])[0:5])+'%')
plt.title('PCA of Breast Cancer Proteomics')


# Separate names keep ColorSet and IntermedSet intact for the plots that follow.
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()

by_label = OrderedDict(zip(legend_labels, legend_handles))

plt.legend(by_label.values(), by_label.keys(),loc=3)

plt.show()

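#Plotting the first 3 components from a different viewing angle
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(1,figsize=(9,6))
ax = fig.gca(projection='3d')
from collections import OrderedDict
TumorCode={'Healthy':'green','T1':'yellow','T2':'orange','T3':'red','T4':'darkviolet'}
ColorSet=[TumorCode[x] for x in IntermedSet]

for i,c,ID in zip(range(len(ProteomicsX_pca2)),ColorSet,IntermedSet):
    ax.scatter3D(xs=ProteomicsX_pca2[:,0][i],
                 ys=ProteomicsX_pca2[:,1][i],
                 zs=ProteomicsX_pca2[:,2][i],
                 c=c,
                 label=ID,
                 s=90,zorder=1)

ax.set_xlabel(str.format('1st Component'+'\n'+str(ProteomicsX_pca.explained_variance_ratio_[0])[0:5])+'%')
ax.set_ylabel(str.format('2nd Component'+'\n'+str(ProteomicsX_pca.explained_variance_ratio_[1])[0:5])+'%')
ax.set_zlabel(str.format('3rd Component'+'\n'+str(ProteomicsX_pca.explained_variance_ratio_[2])[0:5])+'%')
ax.view_init(azim=30)
plt.title('PCA of Breast Cancer Proteomics')

# Separate names keep ColorSet and IntermedSet intact for the K-Means plot.
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
by_label = OrderedDict(zip(legend_labels, legend_handles))
plt.legend(by_label.values(), by_label.keys(),loc=3)
plt.show()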


#Clustering attemps - try 5 clusters on the PCA-reduced data


clusternum=5

#Fit KMeans on the PCA Proteomics dataset

KMeansModel=KMeans(n_clusters=clusternum, init='k-means++')

KMeanData=ProteomicsX_pca2

KMeansModel.fit(KMeanData)

#Get the labels of the cluster predictions, and location of cluster centroids

labels=KMeansModel.labels_



centroids=KMeansModel.cluster_centers_

##############################################################################

#Plot the clusters and the observations with respect to the cluster boundaries

#Some of these commands are adapted from the scikit-learn example for KMeans

#... which is found at http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py (thanks!)

##############################################################################



#Further reduce to 2 components for the decision boundary plot

TwoCompReduced = PCA(n_components=2).fit_transform(ProteomicsX)

KMeansSub=KMeans(n_clusters=clusternum, init='k-means++')

KMeansSub.fit(TwoCompReduced)

# Step size - adjusted for speed here

h = .05

# Plot the decision boundary. For that, we will assign a color to each

x_min, x_max = TwoCompReduced[:, 0].min() - 1, TwoCompReduced[:, 0].max() + 1

y_min, y_max = TwoCompReduced[:, 1].min() - 1, TwoCompReduced[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.

Z = KMeansSub.predict(np.c_[xx.ravel(), yy.ravel()])



# Put the result into a color plot

Z = Z.reshape(xx.shape)

plt.figure(1)

plt.clf()

plt.imshow(Z, interpolation='nearest',

extent=(xx.min(), xx.max(), yy.min(), yy.max()),

aspect='auto', origin='lower',alpha=0.2)

#Get the colors of the tumor type as used in the PCA above

TumorCode=OrderedDict([('Healthy','green'),('T1','yellow'),('T2','orange'),('T3','red'),('T4','darkviolet')])



for i in range(clusternum):
    # select only data observations with cluster label == i
    DataSubset = KMeanData[np.where(labels==i)]
    #Get the matching list of colors by tumor type filtered by the cluster label
    MatchList=[x for x in np.where(labels==i)[0]]
    ColorList=[ColorSet[x] for x in MatchList]
    #Cluster IDs
    ClusterID=np.repeat(i,len(KMeanData[np.where(labels==i)]))
    # The inner index is named j so that it does not shadow the cluster index i.
    for j,c,ID in zip(range(len(DataSubset)),ColorList,ClusterID):
        plt.scatter(x=DataSubset[:,0][j],
                    y=DataSubset[:,1][j],
                    c=c,
                    label=ID,s=90)

#Plot positions of centroids

plt.scatter(centroids[:, 0], centroids[:, 1],

marker='x', s=169, linewidths=3,

color='black', zorder=10)

markers = [plt.Line2D([0,0],[0,0],color=color, marker='o', linestyle='',markersize=16) for color in

TumorCode.values()]

plt.legend(markers, TumorCode.keys(), numpoints=1,fontsize=16)

plt.title('K-Means of Breast Cancer Proteomics',size=20)

plt.show()



#GUI

from Tkinter import *

from PIL import Image, ImageTk

root = Tk()

root.title("Breast Cancer Detection using Proteomes")

root.geometry("800x800")

#label1

var = StringVar()

label1 = Label( root, textvariable=var, width=35 , font="Helvetica 20 bold")

var.set("Select the Machine Learning Algorithm")

label1.grid(row=0, sticky = W, padx=10, pady=20)

var = StringVar()

label2 = Label( root, textvariable=var, width=20 , font="Helvetica 20 bold")

var.set("Select the Model")

label2.grid(row=0,column=1,pady=20)

#label2

var2 = StringVar()

label3 = Label( root, width=50, textvariable=var2)

label3.grid(row=3,column=1)

# Image of Results

image = Image.open("abc.png")

resized =image.resize((700, 500),Image.ANTIALIAS)



photo = ImageTk.PhotoImage(resized)

label = Label(root, image=photo)

label.image = photo # keep a reference!

label.grid(row = 20 , columnspan=4, sticky = W, padx = 10, pady = 10)

mb_var = StringVar()

#mb_var.set("Model Selection")

mb = OptionMenu(root, mb_var, ())

mb.configure(width=20)

mb.grid(row = 1,column = 1)

def reset_option_menu(options, index=None):
    # Replace the entries of the drop-down menu and optionally preselect one.
    menu = mb["menu"]
    menu.delete(0, "end")
    for string in options:
        menu.add_command(label=string, command=lambda value=string:mb_var.set(value))
    if index is not None:
        mb_var.set(options[index])

def a():
    # Models offered when 'Supervised Learning' is selected.
    reset_option_menu(["Logistic Regression","KNN","Random Forest","Decision Tree-Gini","Decision Tree-Entropy","SVM","Neural Network","Gradient Boosting"], 0)

def b():
    # Models offered when 'Unsupervised Learning' is selected.
    reset_option_menu(["K-Clustering", "Hierarchial Clustering"], 0)

def default():
    reset_option_menu([""], 0)

def d():
    # Show which model is currently set.
    var = "The Selected Model is " + mb_var.get()
    var2.set(var)

# Result image shown for each model name in the drop-down menu.
RESULT_IMAGES = {
    "SVM": "SVM_Results.png",
    "KNN": "KNN_Results.png",
    "Logistic Regression": "Logistic_Regression_Results.png",
    "Random Forest": "Random_Forest_Results.png",
    "Decision Tree-Gini": "Decesion_Tree(Gini).png",
    "Decision Tree-Entropy": "Decision_Tree_(Entropy).png",
    "Gradient Boosting": "Gradient_Boosting_Results.png",
    "K-Clustering": "K-Clustering_Results.png",
    "Hierarchial Clustering": "Hierarchial_results.png",
    "Neural Network": "Neural_nets_results.png",
}

def e():
    # Load the stored result image of the selected model and display it.
    var = mb_var.get()
    if var in RESULT_IMAGES:
        image1 = Image.open(RESULT_IMAGES[var])
        resized = image1.resize((700, 500),Image.ANTIALIAS)
        photo1 = ImageTk.PhotoImage(resized)
        label.configure(image = photo1)
        label.image = photo1 # keep a reference!


#Radio Button

var1 = IntVar()

R1 = Radiobutton(root, text="Supervised Learning", variable=var1, value=1, command = a )

R1.grid(row=1,sticky = W , padx = 20)

R2 = Radiobutton(root, text="Unsupervised Learning", variable=var1, value=2, command = b)

R2.grid(row=2,sticky = W , padx = 20)

default()

B = Button(root, text ="Set Model",font="Helvetica 14 ", relief=RAISED, command=d)

B.grid(row = 3,sticky = W, padx = 10, pady=10)

B = Button(root, text ="Calculate Result", font="Helvetica 14 ", command=e)

B.grid(row = 4, column= 1,pady=10)

root.mainloop()
