Class 12-13 Notes

Class 12-13 Notes - Rensselaer Polytechnic Institute (homepages.rpi.edu/~bennek/class/compopt2015/Day13.pdf)




S1


S2

x ∙ 𝑤 = 𝑏 + 1

x ∙ 𝑤 = 𝑏 − 1

x ∙ 𝑤 = 𝑏


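S3: Another proof of the margin, for the curious: the margin is the distance between the points u and v on their respective planes, both lying along the vector w.

Planes: $x \cdot w = b + 1$ (class +) and $x \cdot w = b - 1$ (class -).

Since $u$ lies along $w$ and $u \cdot w = b + 1$:

$$u = \frac{(1 + b)}{\|w\|^2}\, w$$

Similarly, since $v$ lies along $w$ and $v \cdot w = b - 1$:

$$v = \frac{(-1 + b)}{\|w\|^2}\, w$$

Therefore:

$$\text{margin} = \|u - v\| = \left\| \frac{(1 + b)}{\|w\|^2}\, w - \frac{(-1 + b)}{\|w\|^2}\, w \right\| = \frac{2\,\|w\|}{\|w\|^2} = \frac{2}{\|w\|}$$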
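The margin formula can be checked numerically. The sketch below is only an illustration: the weight vector and bias are made-up values, and the helper names are hypothetical. It places u and v along w on the planes x ∙ w = b + 1 and x ∙ w = b − 1 and compares their distance with 2/||w||.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def point_along_w(w, offset):
    """Return the point alpha*w that satisfies (alpha*w) . w = offset."""
    scale = offset / dot(w, w)
    return [scale * wi for wi in w]

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical values, chosen only so the arithmetic is easy to follow.
w = [3.0, 4.0]   # ||w|| = 5
b = 2.0

u = point_along_w(w, b + 1)   # lies on x . w = b + 1
v = point_along_w(w, b - 1)   # lies on x . w = b - 1

gap = distance(u, v)                   # distance measured directly
margin = 2.0 / math.sqrt(dot(w, w))    # closed form 2 / ||w||
print(gap, margin)
```

Both numbers agree up to floating-point rounding, and neither depends on b.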


S4


S5

x ∙ 𝑤 = 𝑏 + 1

x ∙ 𝑤 = 𝑏 − 1

x ∙ 𝑤 = 𝑏


S6: Approximate Misclassification error with a nicer convex function

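[Figure: misclassification error (step function) and SVM error, a.k.a. hinge loss, plotted against yf(x); horizontal axis ticks at -1, 0, 1, vertical axis tick at 1.]

Misclassification error for point (x,y): step(-y(x ∙ w - b))

SVM error for point (x,y): max(1 - y(x ∙ w - b), 0)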
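A minimal sketch of the two error measures, assuming the decision function f(x) = x ∙ w − b used in the model file. The function names and the sample numbers are illustrative, and counting a point with y f(x) = 0 as an error is an assumption, since the slide does not say how the step function behaves at zero.

```python
def decision_value(x, w, b):
    """f(x) = x . w - b, the signed score of point x."""
    return sum(xj * wj for xj, wj in zip(x, w)) - b

def misclassification_error(x, y, w, b):
    """0-1 error: 1 if the point is on the wrong side (or on the plane)."""
    return 1 if y * decision_value(x, w, b) <= 0 else 0

def hinge_loss(x, y, w, b):
    """SVM error: zero beyond the margin, linear inside it or on the wrong side."""
    return max(1.0 - y * decision_value(x, w, b), 0.0)

# Hypothetical classifier and points, all with label y = +1.
w, b = [1.0, 0.0], 0.0
points = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]

for x in points:
    print(x, misclassification_error(x, 1, w, b), hinge_loss(x, 1, w, b))
```

The second point is classified correctly but still pays a hinge loss of 0.5 because it lies inside the margin. In every case the hinge loss is at least as large as the 0-1 error, which is what makes it a convex upper approximation.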


S7: AMPL Model File HW3_QSARSVM_mod.txt

#CONSTANTS
param p;   # number of predictor variables
param n;   # number of observations
param c;   # SVM model parameter

set P := {1..p};   # indices of predictor variables
set N := {1..n};   # indices of observations

param X{N,P};   # observation data
param Y{N};     # classes of each observation (1 or -1)

#VARIABLES of model
var w{P};
var b;
var z{N};

#OBJECTIVE FUNCTION for SVM classification model
minimize objective: c*sum{i in N} z[i] + sum{j in P} w[j]*w[j];

#CONSTRAINTS
subject to trainerror {i in N}: Y[i]*(sum{j in P} X[i,j]*w[j] - b) + z[i] >= 1;

subject to slack {i in N}: z[i] >= 0;
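For a fixed w and b, the smallest slack that satisfies both constraints is z[i] = max(0, 1 − Y[i](X[i] ∙ w − b)), which is the hinge loss of point i. The sketch below evaluates the model's objective that way. It is an illustration in Python rather than AMPL, the function names are hypothetical, the data are the seven points from the sample data file, and the candidate (w, b) values are arbitrary, not solver output.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def optimal_slacks(X, Y, w, b):
    """Smallest z[i] >= 0 with Y[i]*(X[i].w - b) + z[i] >= 1."""
    return [max(0.0, 1.0 - y * (dot(x, w) - b)) for x, y in zip(X, Y)]

def svm_objective(X, Y, w, b, c):
    """c * sum(z) + sum(w[j]^2), the objective in the AMPL model."""
    return c * sum(optimal_slacks(X, Y, w, b)) + dot(w, w)

# Sample data (p = 2, n = 7, c = 10) as given in sampleSVM_dat2.txt.
X = [[1, 0], [-1, -2], [-1, 1], [1, 2], [2, 3], [0, 4], [0, -0.5]]
Y = [1, 1, 1, -1, -1, -1, -1]
c = 10

# Arbitrary candidate solutions, for illustration only.
obj_zero = svm_objective(X, Y, [0.0, 0.0], 0.0, c)    # every z[i] = 1
obj_guess = svm_objective(X, Y, [1.0, -1.0], 0.0, c)  # slacks 3 and 1.5 remain
print(obj_zero, obj_guess)
```

A solver such as MINOS searches over w, b, and z simultaneously; this function only scores a given (w, b).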


S8: AMPL Data File sampleSVM_dat2.txt

#PARAMETERS
param p := 2;    # number of features
param n := 7;    # total number of observations
param c := 10;   # constant used in SVM model

# All X Data
param X :
     1    2 :=
1    1    0
2   -1   -2
3   -1    1
4    1    2
5    2    3
6    0    4
7    0   -0.5;

# Y Data (must be -1 or 1)
param Y :=
1    1
2    1
3    1
4   -1
5   -1
6   -1
7   -1;


S9: AMPL Command File HW3_QSARSVM_cmd.txt

# Command file to train and test SVM
# Created by Kristin Bennett, March 2015.

# Solve SVM model
solve;

# Display solution weights, bias, and objective
display w;
display b;


S8: Calling from NEOS using Minos solver

• http://www.neos-server.org/neos/solvers/nco:MINOS/AMPL.html


S9: Minos Results


S11: Estimating Generalization Error of a Model


S12: Classification Homework: Drug Discovery Problem

• Cost about $1 billion to develop a drug
• First-year sales > $1 billion/drug
• 1 drug approved per 1000 compounds tested
• 1 out of 100 drugs succeeds to market
• Trend: more drug failures in late-stage clinical trials or even after on market!
• Pressure to reduce drug costs/development time
• Data-driven models to predict bioactivities of drugs have the potential to accelerate the drug discovery process:
  • Efficacy
  • Toxicity
  • Absorption
  • Distribution
  • Metabolism
  • Excretion
  • Biodegradability


S13: Classification Task

The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time REACH encourages the use of alternatives to animal testing which includes predictions from quantitative structure–activity relationship (QSAR) models.

Your assignment is to build a classification QSAR model that predicts ready biodegradation of chemicals from the provided training set, estimate its generalization error using the testing set, and estimate the biodegradability of a new point.


S14: Data Available

Experimental values for the chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE). The training data set consists of these 833 chemicals, whether they were readily biodegradable (1 = yes, -1 = no), and 41 molecular descriptors.

A testing set of 42 molecules with known biodegradability
Mystery point with unknown biodegradability

Data files are in the form IDNUM, descriptor 1, ..., descriptor 41, class
Molecules and molecular descriptors are proprietary. No details provided except cryptic names in a separate file.
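A record in the stated layout can be read as follows. This is a sketch under assumptions: it takes the fields to be comma-separated with one molecule per line, which the slide suggests but does not spell out, and `parse_record` is a hypothetical helper, not part of the course materials. The example line uses three made-up descriptors instead of 41 to stay short.

```python
def parse_record(line, n_descriptors=41):
    """Split 'IDNUM, d1, ..., dn, class' into (id, descriptors, label)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != n_descriptors + 2:
        raise ValueError(
            "expected %d fields, got %d" % (n_descriptors + 2, len(fields))
        )
    id_num = fields[0]
    descriptors = [float(value) for value in fields[1:-1]]
    label = int(float(fields[-1]))
    if label not in (1, -1):
        raise ValueError("class must be 1 or -1, got %r" % fields[-1])
    return id_num, descriptors, label

# Made-up record with 3 descriptors, for illustration only.
example = parse_record("17, 0.5, 1.25, -2, -1", n_descriptors=3)
print(example)
```

The mystery point has no known class, so its line would need separate handling, for example by parsing only the ID and the descriptors.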


S15: Overall Experimental Design

Data provided:
• Training set of 833 41-dimensional points
• Test set of 50 41-dimensional points

Train classification model on training set to determine model parameters using NEOS.
Calculate error on the training set.
Test classification model with the model parameters found on the test set to see how well it works on future data.
+ some other stuff

Metric of success: small number of points misclassified on test set
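Once w and b come back from the solver, the training and testing errors are just misclassification rates. The sketch below assumes the prediction rule sign(x ∙ w − b), with ties assigned to class 1. The weights, bias, and the two test points are hypothetical stand-ins rather than NEOS output; the training points are the seven from the sample data file.

```python
def predict(x, w, b):
    """Predicted class: 1 if x . w - b >= 0, otherwise -1."""
    score = sum(xj * wj for xj, wj in zip(x, w)) - b
    return 1 if score >= 0 else -1

def error_rate(X, Y, w, b):
    """Fraction of points whose predicted class differs from the label."""
    wrong = sum(1 for x, y in zip(X, Y) if predict(x, w, b) != y)
    return wrong / len(Y)

# Training points from the sample data file.
X_train = [[1, 0], [-1, -2], [-1, 1], [1, 2], [2, 3], [0, 4], [0, -0.5]]
Y_train = [1, 1, 1, -1, -1, -1, -1]

# Hypothetical held-out points and labels, for illustration only.
X_test = [[3, 0], [0, 3]]
Y_test = [1, -1]

# Hypothetical model parameters; in the homework these come from the solver.
w, b = [1.0, -1.0], 0.0

train_error = error_rate(X_train, Y_train, w, b)
test_error = error_rate(X_test, Y_test, w, b)
print(train_error, test_error)
```

The test error is the estimate of generalization error; the training error alone tends to be optimistic.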


S16: Many Data Sets are Published with Associated Papers.

Reference: Mansouri et al., Training and testing a linear SVM model to predict biodegradability of chemicals based on data, 2013

Location: http://pubs.acs.org/doi/abs/10.1021/ci4000213


S18: Regression Data Set: Energy Efficiency Data Set

We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the afore-mentioned characteristics to obtain 499 building shapes. The dataset comprises 80 samples and 8 features, aiming to predict two real valued responses.

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.

X1 Relative Compactness
X2 Surface Area
X3 Wall Area
X4 Roof Area
X5 Overall Height
X6 Orientation
X7 Glazing Area
X8 Glazing Area Distribution
y1 Heating Load
y2 Cooling Load

Reference: A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools', Energy and Buildings, Vol. 49, pp. 560-567, 2012

Location: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency


UC IRVINE


S20: Homework 3 Extra Credit

• Design an experiment to explore the effect of the tradeoff parameter λ in regression and C in classification on the training and testing errors.

• Describe what you observed in terms of training and testing errors. Indicate which model is best for each problem and why.

3 pts for Regression
3 pts for Classification
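One possible shape for the classification half of such an experiment is sketched below. It is not the course's method: instead of solving the model on NEOS, it swaps in a crude subgradient descent on the same objective, C*sum(hinge) + ||w||^2, so the numbers it prints are approximate. The function names, the grid of C values, the iteration count, and the four test points are all assumptions; the training points are those of the sample data file.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_svm(X, Y, c, iterations=2000):
    """Crude subgradient descent on c*sum(hinge) + ||w||^2 (stand-in for a solver)."""
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    for t in range(iterations):
        eta = 1.0 / (2.0 * (t + 1))          # diminishing step size
        grad_w = [2.0 * wj for wj in w]      # gradient of ||w||^2
        grad_b = 0.0
        for x, y in zip(X, Y):
            if y * (dot(x, w) - b) < 1.0:    # point violates the margin
                for j in range(p):
                    grad_w[j] -= c * y * x[j]
                grad_b += c * y
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b

def error_rate(X, Y, w, b):
    """Fraction misclassified by the rule: class 1 if x . w - b >= 0, else -1."""
    wrong = sum(1 for x, y in zip(X, Y)
                if (1 if dot(x, w) - b >= 0 else -1) != y)
    return wrong / len(Y)

# Training points from the sample data file.
X_train = [[1, 0], [-1, -2], [-1, 1], [1, 2], [2, 3], [0, 4], [0, -0.5]]
Y_train = [1, 1, 1, -1, -1, -1, -1]
# Hypothetical test points and labels, for illustration only.
X_test = [[-1, 0], [0, -1.5], [1, 3], [2, 4]]
Y_test = [1, 1, -1, -1]

results = {}
for c in [0.01, 0.1, 1, 10, 100]:            # assumed grid of C values
    w, b = train_svm(X_train, Y_train, c)
    results[c] = (error_rate(X_train, Y_train, w, b),
                  error_rate(X_test, Y_test, w, b))

for c, (train_err, test_err) in results.items():
    print(c, train_err, test_err)
```

In the sample data the point (0, -0.5) of class -1 lies inside the triangle formed by the three class 1 points, so no linear model can reach zero training error. The same loop structure applies to λ in regression, with the regression model and a squared or absolute error in place of the misclassification rate.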


S21: Project

Project Description and Checklist will be available on Prof. K’s web page in DM folder.

Deliverables per group:

• Project Proposal - April 3

• Project Progress Report – April 24

• Project Final Report - May 12

• Project Poster Presentation (Grad Only) - May 22 during exam slot


S22: Sample Project Like homework

Find a problem that can be solved using data and a classification or regression model (can be SVM or other optimization-based model).

• Pick one or more data problems from sources like UCI, public papers
• Make train and test sets for each problem.
• Create one or more predictive models using NEOS (easy to make many if you change parameters like C)
• Run models on the training sets
• Evaluate training set errors
• Test models on the testing sets, calculate testing set errors
• Discuss your findings. How well did the models work? Which models were preferable? Compare the performance of the models on the different data sets in terms of efficiency, error, and interpretability of solutions.
• Extra credit for creativity and exploration
• See project description for more details.
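For the "make train and test sets" step, a random split with a fixed seed is one simple option. The sketch below is an assumption about how a split might be done, not a prescribed procedure: the helper name, the 20% test fraction, and the seed are all illustrative choices.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten placeholder record IDs, for illustration only.
records = list(range(10))
train, test = train_test_split(records)
print(len(train), len(test))
```

The test set should be set aside before any model parameters such as C are tuned, so that the test error remains a fair estimate of performance on future data.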


S23: Finding Problems or Methods

1) scholar.google.com or pubmed

Find a paper and topic that interests you. Try to duplicate it.

2) Start with a machine learning task and look up some relevant papers

Many learning algorithms have an optimization problem as their engine.

3) Textbook, Wikipedia, problem in another class

4) Ask your favorite professor


S24: Finding Data

1) Machine learning repositories
   DELVE: http://www.cs.toronto.edu/~delve/data/datasets.html
   UCI: http://archive.ics.uci.edu/ml/
2) Pubmed frequently has data in supplementary material: http://www.ncbi.nlm.nih.gov/pubmed
3) Domain specific data repositories: finance, astronomy, biology, ???
4) Your local research project
5) Simulated data


S25: Help available for formulating project

Regular office hours

M 2:30 to 3:30

F 9 to 11

Call 6899 or email [email protected] if you want to come by or set up an appointment.