
Company Confidential - For Internal Use Only

Copyright © 2013, SAS Institute Inc. All rights reserved.

MACHINE LEARNING AND BREAST CANCER PREDICTION

USING SEER DATA 1973-2012

ISF 2015 - JUNE 24, 2015

Tae Yoon Lee

Youngme Seo



AGENDA: MACHINE LEARNING AND BREAST CANCER PREDICTION

1. Machine Learning – Data Mining – Big Data Analytics – Data Scientist

2. Breast Cancer Prediction and Prognosis

3. Machine Learning Methods

4. Comparison of Machine Learning methods

5. Summary and Future Research




WHAT IS MACHINE LEARNING? AUTOMATION OF DATA MINING

Machine learning

• is a branch of artificial intelligence
• employs a variety of statistical, probabilistic and optimization techniques
• allows computers to “learn” from past examples
• detects hard-to-discern patterns in large, noisy or complex data sets

Source: Cruz and Wishart (2006), “Applications of Machine Learning in Cancer Prediction and Prognosis”



WHAT IS MACHINE LEARNING? DATA MINING, BIG DATA ANALYTICS, DATA SCIENTISTS

Differences between Machine Learning, Data Mining, and Big Data Analytics

• Machine Learning – the machine acquires knowledge from data
• Data Mining – human and machine together acquire knowledge from data

Note that the terms Data Mining and Machine Learning are often used interchangeably and overlap in many ways.

• Big Data Analytics – uses Machine Learning and Data Mining on Hadoop technology and develops new techniques under new technical architectures



WHAT IS MACHINE LEARNING? DATA MINING, BIG DATA ANALYTICS, DATA SCIENTISTS

Data Size | Analytics | Sampling | Architecture | Analysts' title | Our Demo
Big Data (unstructured data) | Big Data Analytics: uses Data Mining / Machine Learning, developing new techniques | Mostly not | Hadoop HDFS, In-Memory Statistics & fast processors, automation process | Data Scientists |
Large data | Data Mining / Machine Learning | Usually | Large-disk-space computers | Data Miners | Here
Small data | Statistical Analysis | Occasionally | Small-disk-space computers | Statisticians |



WHAT IS DATA MINING? MACHINE LEARNING >>>>> BIG DATA ANALYTICS



DATA MINING: MODEL DEVELOPMENT PROCESS

Sample Explore Modify Model Assess

Utility



BREAST CANCER: PREDICTION (DIAGNOSIS & PROGNOSIS)

Three foci of Breast Cancer Prediction and Prognosis:

1. The prediction of breast cancer susceptibility – risk assessment prior to occurrence (Diagnosis)
2. The prediction of breast cancer recurrence – likelihood of the cancer redeveloping (Prognosis)
3. The prediction of breast cancer survivability – life expectancy, survival, progression, tumor-drug sensitivity (Prognosis)

The success of prognosis prediction depends on the quality of the diagnosis.



MACHINE LEARNING METHODS: PREDICTION PROGNOSIS

1. Logistic Regression – predicts the probability of the target event
2. Decision Tree – a segmentation of the data created by applying a series of simple rules (English rules)
3. Random Forest – multiple Decision Trees built from random samples and random attributes (ensemble method, hard to interpret)
4. Neural Networks – detect complex nonlinear relationships in data
5. Support Vector Machines – construct a set of hyperplanes that maximize the margin between two classes for classification
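The "random samples and random attributes" idea behind Random Forest can be sketched as below. This is a minimal illustration of the sampling step only, not the implementation used for the results in this deck. The function name `draw_tree_inputs` and the choice of roughly sqrt(k) attributes are assumptions. For brevity the attributes are drawn once per tree, whereas Breiman's random forest re-draws them at every split.

```python
import math
import random

def draw_tree_inputs(n_rows, attributes, rng):
    """Return (row indices, attribute subset) for one tree of the ensemble."""
    # Random samples: bootstrap n_rows row indices with replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random attributes: keep about sqrt(k) of the k candidate inputs.
    k = max(1, round(math.sqrt(len(attributes))))
    chosen = rng.sample(attributes, k)
    return rows, chosen

rng = random.Random(0)
# A few of the input variable names listed later in the deck.
attributes = ["AGE_DX", "EOD10_SZ", "EOD10_PN", "EOD10_NE", "NUMPRIMS", "GRADE"]
for tree in range(3):
    rows, chosen = draw_tree_inputs(10, attributes, rng)
    print(tree, sorted(set(rows)), chosen)
```

Each tree sees a different subset of rows and inputs, which is what makes the ensemble's votes less correlated than those of identical trees.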



MACHINE LEARNING METHODS: DECISION TREE






MACHINE LEARNING METHODS: RANDOM FOREST



MACHINE LEARNING METHODS: NEURAL NETWORKS

Multilayer Perceptron, 3 Hidden Units






MACHINE LEARNING METHODS: LOGISTIC REGRESSION MODEL

$$\log\left(\frac{p}{1-p}\right) = a + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k$$
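Solving the logit equation above for p gives the predicted probability, p = 1 / (1 + exp(-(a + β1 x1 + ... + βk xk))). The sketch below shows that inversion. The function name and the coefficient values are made up for illustration and are not estimates from the SEER model.

```python
import math

def predict_probability(intercept, betas, xs):
    """Invert the logit: p = 1 / (1 + exp(-(a + sum(beta_i * x_i))))."""
    linear = intercept + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients and inputs, for illustration only.
p = predict_probability(0.5, [-0.02, -0.1], [60.0, 2.0])

# Taking the log-odds of p recovers the linear predictor a + sum(beta_i * x_i).
log_odds = math.log(p / (1.0 - p))
print(round(p, 4), round(log_odds, 4))
```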



MACHINE LEARNING METHODS: SUPPORT VECTOR MACHINE



DATA CLEANING: TREND OF SURVIVAL PERCENTAGES

Trend of 5-year, 10-year, and 15-year survival percentages, by year of diagnosis from 1973 to 2007

SEER Data 1973-2012, Breast Cancer Incidence (740,506 obs / 149 var)



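DATA CLEANING: FILTERING

• Target: surviving 15 years after diagnosis
• The data end in 2012, and 2012 − 15 = 1997, so only cases diagnosed in 1997 or earlier are kept (Date_yr <= 1997)
• Drop cases with missing tumor size information
• Remove cases whose cause of death was other than breast cancer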

Year of Diagnosis | # Patients | # Survivors after 15 years | % Survival after 15 years
1988 | 10,998 | 5,975 | 54%
1989 | 11,021 | 6,075 | 55%
1990 | 11,871 | 6,718 | 57%
1991 | 12,518 | 7,285 | 58%
1992 | 13,050 | 7,914 | 61%
1993 | 13,328 | 8,189 | 61%
1994 | 14,243 | 8,938 | 63%
1995 | 15,147 | 9,741 | 64%
1996 | 16,008 | 10,534 | 66%
1997 | 17,302 | 11,800 | 68%
Total | 135,486 | 83,169 | 61%

SEER Data: 1988-1997 Year of Diagnosis (reduced to 135,486 obs from 740,506 obs)
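The filtering rules on this slide can be sketched as a simple record filter. This is a minimal sketch, not the SAS code behind the slide. `Date_yr` and `EOD10_SZ` are names that appear in the deck, while the `died_other_cause` flag and the use of `None` for a missing tumor size are assumptions made for illustration. The 1988 lower bound visible in the table is not stated as a rule on the slide and is not enforced here.

```python
LAST_DATA_YEAR = 2012
FOLLOW_UP_YEARS = 15

def filter_cases(records):
    """Keep cases that can be labeled with a 15-year survival target."""
    cutoff = LAST_DATA_YEAR - FOLLOW_UP_YEARS  # 2012 - 15 = 1997
    kept = []
    for rec in records:
        if rec["Date_yr"] > cutoff:
            continue  # fewer than 15 years of follow-up available
        if rec["EOD10_SZ"] is None:
            continue  # missing tumor size (None is an assumed encoding)
        if rec["died_other_cause"]:
            continue  # hypothetical flag: death from another cause
        kept.append(rec)
    return kept

# Toy records, for illustration only.
sample = [
    {"Date_yr": 1990, "EOD10_SZ": 20, "died_other_cause": False},
    {"Date_yr": 2003, "EOD10_SZ": 15, "died_other_cause": False},
    {"Date_yr": 1995, "EOD10_SZ": None, "died_other_cause": False},
    {"Date_yr": 1996, "EOD10_SZ": 30, "died_other_cause": True},
]
print(len(filter_cases(sample)))
```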



VARIABLE SELECTION: 9 CATEGORICAL VARIABLES + 5 CONTINUOUS VARIABLES

Categorical Variables:

1. Primary site code – SITE02V
2. Histology – HISTO2V
3. Behavior – BEHO2V
4. Grade – GRADE
5. Extension of disease – EOD10_EX
6. Lymph node involvement – EOD10_ND
7. Radiation – RADIATN
8. Stage of Cancer – D_AJCC_M
9. Site specific surgery code – SS_SURG

Continuous Variables:

1. Age at diagnosis – AGE_DX
2. Tumor size – EOD10_SZ
3. Number of positive nodes – EOD10_PN
4. Number of nodes examined – EOD10_NE
5. Number of primaries – NUMPRIMS





DATA CLEANING: TREND OF SURVIVAL PERCENTAGES

[Figure: 15-year survival %]



MACHINE LEARNING / DATA MINING: MODELING PROCESS

Data Partitioned:

• 40% Training data
• 30% Validation data
• 30% Test data

Model selection criterion: Validation ROC Index
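The 40/30/30 partition can be sketched as a shuffled split. This is a minimal sketch using simple random sampling; the slide does not say whether the original split was stratified on the target, and the function name and seed are assumptions.

```python
import random

def partition(rows, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

# 135,486 is the number of observations left after data cleaning.
train, valid, test = partition(range(135486))
print(len(train), len(valid), len(test))
```

Models are fitted on the training set, compared on the validation set, and the test set is held back for a final, unbiased check.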



MODEL COMPARISON: NEURAL NETWORKS SELECTED WITH ROC INDEX 0.851
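Assuming the ROC index denotes the area under the ROC curve, it can be read as the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The sketch below computes it by direct pairwise comparison on toy data. The function name and the toy scores are illustrative, and the quadratic loop is only practical for small samples.

```python
def roc_index(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy validation set: 1 = survived 15 years, 0 = did not.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
print(roc_index(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported 0.851 sits well above chance.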



SUMMARY UPDATE

[Figure: True Positive vs. False Positive]



FALSE POSITIVE: INCORRECTLY PREDICTED SURVIVALS

Target | Output | Outcome
1 = survival | 1 = survival | true positive
0 = die | 1 = survival | false positive
1 = survival | 0 = die | false negative
0 = die | 0 = die | true negative

Decision Tree has the fewest false positives.

Event Classification Table
Model selection based on validation data: ROC Index

Model | Data Role | False Negative | True Negative | False Positive | True Positive | ROC Index
Neural Network | TRAIN | 3955 | 13349 | 7578 | 29313 |
Neural Network | VALIDATE | 2980 | 9918 | 5777 | 21970 | 0.851
LogisticReg | TRAIN | 4073 | 12384 | 8543 | 29195 |
LogisticReg | VALIDATE | 3066 | 9290 | 6405 | 21884 | 0.831
RandomForest | TRAIN | 3664 | 12163 | 8764 | 29604 |
RandomForest | VALIDATE | 2783 | 9095 | 6600 | 22167 | 0.83
SVM | TRAIN | 3523 | 11436 | 9491 | 29745 |
SVM | VALIDATE | 2640 | 8514 | 7181 | 22310 | 0.827
DecisionTree | TRAIN | 6548 | 14344 | 6583 | 26720 |
DecisionTree | VALIDATE | 4899 | 10757 | 4938 | 20051 | 0.811
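The validation counts in the event classification table can be turned into rates for easier comparison. The sketch below assumes the four count columns are, in order, false negative, true negative, false positive and true positive. That order is inferred from the slide's header fragments and agrees with the statement that Decision Tree has the fewest false positives. The function name `rates` is illustrative.

```python
# Validation counts copied from the event classification table:
# (false negative, true negative, false positive, true positive)
validation = {
    "Neural Network": (2980, 9918, 5777, 21970),
    "LogisticReg": (3066, 9290, 6405, 21884),
    "RandomForest": (2783, 9095, 6600, 22167),
    "SVM": (2640, 8514, 7181, 22310),
    "DecisionTree": (4899, 10757, 4938, 20051),
}

def rates(fn, tn, fp, tp):
    """Return accuracy, sensitivity and false positive rate."""
    total = fn + tn + fp + tp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)          # survivors correctly predicted
    false_positive_rate = fp / (fp + tn)  # deaths predicted as survival
    return accuracy, sensitivity, false_positive_rate

for model, counts in validation.items():
    acc, sens, fpr = rates(*counts)
    print(f"{model:15s} acc={acc:.3f} sens={sens:.3f} fpr={fpr:.3f}")

# Model with the smallest false positive count on validation data.
fewest_fp = min(validation, key=lambda m: validation[m][2])
print("Fewest false positives:", fewest_fp)
```

Every validation row sums to 40,645 cases, which matches a 30% share of the 135,486 cleaned observations.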



MACHINE LEARNING METHODS: PROS AND CONS

Logistic Regression – causal effect

Decision Tree – English rules, segmentation, variable selection; uses both categorical and interval inputs, including ones with missing values

Random Forest – reduces overfitting

Neural Networks – nonlinear; training may stop at a local optimum

Support Vector Machine – nonlinear; training finds a global optimum



SUMMARY AND FUTURE RESEARCH:

Machine learning / data mining is a key technique for automating medical disease classification on much improved architectures. Further tests and research are needed.

Further specification: SVM kernels (linear, polynomial, RBF, sigmoid)

Methods: clustering, segmentation, two-stage modeling, cross-validation

Data: subsets (HER2+/-), different cancers, unstructured data

Architecture: HDFS LASR server, In-Memory statistics, results visualization



THANK YOU!

ISF 2015 – 24 JUNE 2015

Tae Yoon Lee & Youngme Seo



DATA MINING PROCESS EXAMPLE:

Step 1: Data Checks, Segmentation Suites
Description: Cleaned data for modeling
Analysis: Data checks, feasibility analysis
EM nodes: Input Data, StatExplore, MultiPlot, Impute
Output: Clean data for modeling

Step 2: Variable Reduction
Description: Identify patient risk drivers
Analysis: Fine/coarse classing, variable clustering, correlation, VIF analysis
EM nodes: Interactive Grouping, Variable Clustering, SAS Code
Output: Candidate variables and shortlisted variables

Step 3: Segmentation Variable Selection
Description: Discrete & continuous segmentation variables
Analysis: Decision tree, segment profiling analysis
EM nodes: Decision Tree, Clustering, Logistic Regression, Segment Profiling
Output: Optimum segmentation split on segmentation variables

Step 4: Model Build
Description: Build both a whole-patient-level model and segment-level models
Analysis: Fine/coarse classing, correlation, VIF, logistic regression
EM nodes: Interactive Grouping, Scorecard, SAS Code, Variable Clustering, Segment Profiling
Output: Model estimates, Parent Gini & System Ginis

Step 5: Model Comparison
Description: Compare the best System Gini with the Parent Gini
Analysis: Compare Ginis, consider benefits of segmentation
EM nodes: SAS Code, Model Comparison
Output: Identify whether the segmentation is effective and required



MACHINE LEARNING: DECISION TREE


