Class 12-13 Notes - Rensselaer Polytechnic Institute, homepages.rpi.edu/~bennek/class/compopt2015/Day13.pdf

Class 12-13 Notes

S1

S2

x ∙ 𝑤 = 𝑏 + 1

x ∙ 𝑤 = 𝑏 − 1

x ∙ 𝑤 = 𝑏

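S3: Another proof of the margin, for the curious: the margin is the distance between points u and v, which lie on their respective planes along the vector w

[Figure: the planes $x \cdot w = b + 1$ and $x \cdot w = b - 1$, the + and - classes, the vector w, and the points u and v.]

Write $u = \alpha w$ and $v = \beta w$. Then

$$u \cdot w = b + 1 \;\Rightarrow\; \alpha\, w \cdot w = b + 1 \;\Rightarrow\; \alpha = \frac{1 + b}{\|w\|^2}$$

$$\text{similarly}\quad \beta = \frac{-1 + b}{\|w\|^2}$$

$$\text{margin} = \|u - v\| = \left\| \frac{(1 + b)}{\|w\|^2}\, w - \frac{(-1 + b)}{\|w\|^2}\, w \right\| = \frac{2\|w\|}{\|w\|^2} = \frac{2}{\|w\|}$$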
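As a quick numerical check of the margin formula 2/||w||, the sketch below places u and v on the planes x·w = b + 1 and x·w = b - 1 along w and measures the distance between them. The values of w and b are hypothetical, chosen only so that ||w|| = 5.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Euclidean norm."""
    return math.sqrt(dot(a, a))

def margin_via_points(w, b):
    """Distance between u = alpha*w and v = beta*w on the two planes."""
    sq = dot(w, w)
    alpha = (b + 1) / sq          # u lies on x.w = b + 1
    beta = (b - 1) / sq           # v lies on x.w = b - 1
    u = [alpha * wi for wi in w]
    v = [beta * wi for wi in w]
    return norm([ui - vi for ui, vi in zip(u, v)])

w = [3.0, 4.0]   # hypothetical weights, ||w|| = 5
b = 2.0          # hypothetical bias
margin = margin_via_points(w, b)
closed_form = 2 / norm(w)
print(margin, closed_form)
```

The two numbers agree for any nonzero w, and b drops out of the difference, which is why the margin depends only on ||w||.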

S4

S5

x ∙ 𝑤 = 𝑏 + 1

x ∙ 𝑤 = 𝑏 − 1

x ∙ 𝑤 = 𝑏

S6: Approximate Misclassification error with a nicer convex function

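[Figure: misclassification error (step function) and SVM error, a.k.a. hinge loss, plotted against y f(x); horizontal axis ticks at -1, 0, 1, vertical tick at 1.]

Misclassification error for point (x, y): step(-y(x·w - b))

SVM error for point (x, y): max(1 - y(x·w - b), 0)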
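To make the two error measures concrete, here is a minimal sketch that evaluates both for a few points, using the convention f(x) = x·w - b implied by the separating plane x·w = b. The weights, bias, and points are made up for illustration, and counting y f(x) = 0 as an error is an assumption about the step function.

```python
def decision_value(x, w, b):
    """f(x) = x.w - b, matching the separating plane x.w = b."""
    return sum(xj * wj for xj, wj in zip(x, w)) - b

def misclassification_error(x, y, w, b):
    """step(-y f(x)); treating y f(x) = 0 as an error is an assumption."""
    return 1 if y * decision_value(x, w, b) <= 0 else 0

def hinge_loss(x, y, w, b):
    """SVM error: max(1 - y f(x), 0)."""
    return max(1 - y * decision_value(x, w, b), 0)

# Hypothetical model and points, all with label y = +1.
w, b = [1.0, 0.0], 0.0
points = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], 1)]

losses = [(misclassification_error(x, y, w, b), hinge_loss(x, y, w, b))
          for x, y in points]
for (x, y), (err, hinge) in zip(points, losses):
    print(x, y, "misclassified:", err, "hinge:", hinge)
```

The second point is classified correctly but sits inside the margin, so it has zero misclassification error and a positive hinge loss. In every case the hinge loss is at least as large as the misclassification error, which is what makes it a convex upper bound.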

S7: AMPL Model File HW3_QSARSVM_mod.txt

#CONSTANTS
param p; # number of predictor variables
param n; # number of observations
param c; # SVM model parameter

set P:={1..p}; # indices of predictor variables
set N:={1..n}; # indices of observations

param X{N,P}; # observation data
param Y{N}; # classes of each observation (1 or -1)

#VARIABLES of model
var w{P};
var b;
var z{N};

#OBJECTIVE FUNCTION for SVM classification model
minimize objective: c*sum{i in N} z[i] + sum{j in P} w[j]*w[j];

#CONSTRAINTS
subject to trainerror {i in N}: Y[i]*( sum {j in P} X[i,j]*w[j] - b ) + z[i] >= 1;

subject to slack {i in N}: z[i] >= 0;

S8: AMPL Data File sampleSVM_dat2.txt

#PARAMETERS
param p := 2;  # number of features
param n := 7;  # total number of observations
param c := 10; # constant used in SVM model

# All X Data
param X :
     1     2 :=
1    1     0
2   -1    -2
3   -1     1
4    1     2
5    2     3
6    0     4
7    0    -0.5;

# Y Data (must be -1 or 1)
param Y :=
1    1
2    1
3    1
4   -1
5   -1
6   -1
7   -1;
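For a fixed w and b, the smallest slack that satisfies both constraints of the model is z[i] = max(0, 1 - Y[i]*(X[i]·w - b)), so the objective can be evaluated by hand for any candidate. The sketch below does this in Python on the seven sample points; the two candidate (w, b) pairs are hypothetical guesses, not the MINOS solution.

```python
# Sample data transcribed from the AMPL data file (p = 2, n = 7, c = 10).
X = [[1, 0], [-1, -2], [-1, 1], [1, 2], [2, 3], [0, 4], [0, -0.5]]
Y = [1, 1, 1, -1, -1, -1, -1]
C = 10

def smallest_slacks(X, Y, w, b):
    """Smallest z[i] >= 0 with Y[i]*(X[i].w - b) + z[i] >= 1 for fixed w, b."""
    slacks = []
    for x, y in zip(X, Y):
        score = sum(xj * wj for xj, wj in zip(x, w)) - b
        slacks.append(max(0.0, 1.0 - y * score))
    return slacks

def svm_objective(X, Y, w, b, c):
    """Value of c*sum(z) + sum(w[j]^2), as written in the model file."""
    z = smallest_slacks(X, Y, w, b)
    return c * sum(z) + sum(wj * wj for wj in w)

# Hypothetical candidates for comparison only.
obj_zero = svm_objective(X, Y, [0.0, 0.0], 0.0, C)
obj_guess = svm_objective(X, Y, [1.0, -1.0], 0.0, C)
print(obj_zero, obj_guess)
```

Comparing a few candidates this way is a cheap sanity check on the solver output: the objective value NEOS reports should be no larger than the value at any candidate tried here.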

S9: AMPL Command File HW3_QSARSVM_cmd.txt

# Command file to train and test SVM
# Created by Kristin Bennett, March 2015.

#Solve SVM Model
solve;

# Display Solutions weights, bias, and objective
display w;
display b;

S8: Calling from NEOS using Minos solver

• http://www.neos-server.org/neos/solvers/nco:MINOS/AMPL.html

S9: Minos Results

S11: Estimating Generalization Error of a Model

S12: Classification Homework: Drug Discovery Problem

• Cost about $1 billion to develop a drug
• First-year sales > $1 billion/drug
• 1 drug approved/1000 compounds tested
• 1 out of 100 drugs succeeds to market
• Trend: more drug failures in late-stage clinical trials or even after on market!
• Pressure to reduce drug costs/development time
• Data-driven models to predict bioactivities of drugs have potential to accelerate the drug discovery process
  • Efficacy
  • Toxicity
  • Absorption
  • Distribution
  • Metabolism
  • Excretion
  • Biodegradability

S13: Classification Task

The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time REACH encourages the use of alternatives to animal testing which includes predictions from quantitative structure–activity relationship (QSAR) models.

Your assignment is to build a classification QSAR model to predict ready biodegradation of chemicals based on the provided training set, estimate the generalization error using the testing set, and estimate the biodegradability of a new point.

S14: Data Available

Experimental values for the chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE). The training data set consists of these 833 chemicals, whether they were readily biodegradable (1 = yes, -1 = no), and 41 molecular descriptors.

A testing set of 42 molecules with known biodegradability
Mystery point with unknown biodegradability

Data files are in the form IDNUM, descriptor 1, ..., descriptor 41, class
Molecules and molecular descriptors are proprietary. No details are provided except cryptic names in a separate file.

S15: Overall Experimental Design

Data provided
Training set of 833 41-dimensional points
Test set of 50 41-dimensional points

Train the classification model on the training set to determine model parameters using NEOS.
Calculate error on the training set.
Test the classification model, with the model parameters found, on the test set to see how well it works on future data.
+ some other stuff
Metric of success: small number of points misclassified on test set
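Once NEOS returns w and b, the training and testing errors are just counts of misclassified points. The sketch below shows one way to compute them, assuming the prediction rule sign(x·w - b) with ties sent to +1. The weights and the tiny 2-dimensional data sets are made up; the homework data has 41 descriptors per point.

```python
def predict(x, w, b):
    """Return +1 or -1 from the sign of x.w - b (ties go to +1 by assumption)."""
    score = sum(xj * wj for xj, wj in zip(x, w)) - b
    return 1 if score >= 0 else -1

def error_rate(X, Y, w, b):
    """Fraction of points whose predicted class differs from the label."""
    wrong = sum(1 for x, y in zip(X, Y) if predict(x, w, b) != y)
    return wrong / len(Y)

# Hypothetical model parameters, standing in for the solver output.
w, b = [1.0, -1.0], 0.0

# Hypothetical training and testing sets.
X_train, Y_train = [[2, 0], [0, 2], [1, 3], [3, 1]], [1, -1, -1, 1]
X_test, Y_test = [[1, 0], [0, 1], [2, 3], [1, 2]], [1, -1, 1, -1]

train_error = error_rate(X_train, Y_train, w, b)
test_error = error_rate(X_test, Y_test, w, b)
mystery_class = predict([0.5, 0.2], w, b)   # hypothetical "mystery point"
print(train_error, test_error, mystery_class)
```

The test error is the estimate of how the model behaves on future data; the training error alone tends to be optimistic because the parameters were fit to those same points.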

S16: Many Data Sets are Published with Associated Papers.

Reference: Mansouri et al., Quantitative structure-activity relationship models for ready biodegradability of chemicals, Journal of Chemical Information and Modeling, 2013

Location: http://pubs.acs.org/doi/abs/10.1021/ci4000213


S18: Regression Data Set: Energy Efficiency Data Set

We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the afore-mentioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real valued responses.

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.

X1 Relative Compactness
X2 Surface Area
X3 Wall Area
X4 Roof Area
X5 Overall Height
X6 Orientation
X7 Glazing Area
X8 Glazing Area Distribution
y1 Heating Load
y2 Cooling Load

Reference: A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools', Energy and Buildings, Vol. 49, pp. 560-567, 2012

Location: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency

http://people.maths.ox.ac.uk/tsanas/publications.html


S20: Homework 3 Extra Credit

• Design an experiment to explore the effect of the tradeoff parameter λ in regression and C in classification on the training and testing errors.

• Describe what you observed in terms of training and testing errors. Indicate which model is best for each problem and why.

3 pts for Regression 3 pts for Classification
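One possible shape for such an experiment is a sweep over a grid of parameter values, recording training and testing error at each. The sketch below does this for a one-feature ridge regression, assuming the objective sum((y - w*x)^2) + λ*w^2 (an assumption; the homework's regression model may be written differently). The data and the λ grid are made up. The same loop applies to C in the SVM by re-solving the model for each value.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam*w^2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mean_squared_error(xs, ys, w):
    """Average squared residual of the model y = w*x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data for illustration only.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]
x_test, y_test = [5.0, 6.0], [5.1, 5.9]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]   # hypothetical grid
results = []
for lam in lambdas:
    w = ridge_weight(x_train, y_train, lam)
    results.append((lam, w,
                    mean_squared_error(x_train, y_train, w),
                    mean_squared_error(x_test, y_test, w)))

for lam, w, train_mse, test_mse in results:
    print(f"lambda={lam:>6}: w={w:.4f} train={train_mse:.4f} test={test_mse:.4f}")

# Following the homework's metric, pick the value with the lowest test error.
best_lam = min(results, key=lambda r: r[3])[0]
```

In this setup the training error can only grow as λ increases, while the testing error may go either way; that contrast is the thing to describe in the write-up.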

S21: Project

Project Description and Checklist will be available on Prof. K’s web page in DM folder.

Deliverables per group:

• Project Proposal - April 3

• Project Progress Report – April 24

• Project Final Report - May 12

• Project Poster Presentation (Grad Only) - May 22 during exam slot

S22: Sample Project Like homework

Find a problem that can be solved using data and a classification or regression model (can be SVM or other optimization based model).
• Pick one or more data problems from sources like UCI, public papers
• Make train and test sets for each problem.
• Create one or more predictive models using NEOS (easy to make many if you change parameters like C)
• Run models on the training sets
• Evaluate training set errors
• Test models on the testing sets, calculate testing set errors
• Discuss your findings. How well did the models work? Which models were preferable? Compare the performance of the models on the different data sets in terms of efficiency, error, and interpretability of solutions.
• Extra credit for creativity and exploration
• See project description for more details.
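The "make train and test sets" step can be as simple as a seeded random shuffle followed by a cut. The sketch below is one way to do it, assuming each row is a tuple (IDNUM, descriptors, class) like the homework data files; the rows, the 30% test fraction, and the seed are all made up.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed and cut off a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical rows: (IDNUM, descriptors, class).
rows = [(i, [float(i), float(i % 3)], 1 if i % 2 == 0 else -1)
        for i in range(10)]

train_rows, test_rows = train_test_split(rows, test_fraction=0.3, seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, so different models (for example different values of C) are compared on exactly the same training and testing points.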

S23: Finding Problems or Methods

1) scholar.google.com or pubmed

Find a paper and topic that interests you. Try to duplicate it

2) Start with a machine learning task and look up some relevant papers

Many learning algorithms have an optimization problem as their engine

3) Textbook, Wikipedia, or a problem in another class

4) Ask your favorite professor

S24: Finding Data

1) Machine learning repositories
DELVE: http://www.cs.toronto.edu/~delve/data/datasets.html
UCI: http://archive.ics.uci.edu/ml/
2) Pubmed frequently has data in supplementary material: http://www.ncbi.nlm.nih.gov/pubmed
3) Domain specific data repositories: finance, astronomy, biology, ???
4) Your local research project
5) Simulated data


S25: Help available for formulating project

Regular office hours

M 2:30 to 3:30

F 9 to 11

Give a call (6899) or email [email protected] if you want to come by or set up an appointment
