Class 12-13 Notes
S1
S2
x ∙ 𝑤 = 𝑏 + 1
x ∙ 𝑤 = 𝑏 − 1
x ∙ 𝑤 = 𝑏
S3: Another Proof of margin for the curious
The margin is the distance between the points u and v, which lie on their respective planes along the vector w.
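The two planes are
\[ x \cdot w = b + 1 \qquad \text{and} \qquad x \cdot w = b - 1 . \]
Since u lies along w, write u = \alpha w. Then
\[ u \cdot w = b + 1 \;\Rightarrow\; \alpha\, (w \cdot w) = b + 1 \;\Rightarrow\; u = \frac{(1 + b)\, w}{\|w\|_2^2} , \]
and similarly
\[ v = \frac{(-1 + b)\, w}{\|w\|_2^2} . \]
Therefore
\[ \text{margin} = \|u - v\|_2 = \left\| \frac{(1 + b)\, w}{\|w\|_2^2} - \frac{(-1 + b)\, w}{\|w\|_2^2} \right\|_2 = \frac{2\,\|w\|_2}{\|w\|_2^2} = \frac{2}{\|w\|_2} . \]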
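The margin result can be checked numerically. The sketch below builds the points u and v along w on the planes x ∙ w = b + 1 and x ∙ w = b − 1 and compares their distance with 2/||w||. The function names and the example values of w and b are assumptions chosen for illustration.

```python
import math

def margin_points(w, b):
    """Return the points u and v that lie along w on the planes
    x.w = b + 1 and x.w = b - 1."""
    norm_sq = sum(wi * wi for wi in w)
    u = [(b + 1) * wi / norm_sq for wi in w]
    v = [(b - 1) * wi / norm_sq for wi in w]
    return u, v

def margin(w, b):
    """Euclidean distance between u and v."""
    u, v = margin_points(w, b)
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# Hypothetical example values (not from the slides): ||w|| = 5.
w_example = [3.0, 4.0]
b_example = 2.0

print(margin(w_example, b_example))   # distance between u and v
print(2 / math.hypot(*w_example))     # closed form 2 / ||w||
```

Both printed values should agree up to rounding, and the result does not depend on b.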
S4
S5
x ∙ 𝑤 = 𝑏 + 1
x ∙ 𝑤 = 𝑏 − 1
x ∙ 𝑤 = 𝑏
S6: Approximate Misclassification error with a nicer convex function
[Figure: misclassification error and SVM error (a.k.a. hinge loss) plotted against yf(x); horizontal axis ticks at -1, 0, 1; vertical axis tick at 1.]
Misclassification error for point (x,y): step(-y(x ∙ w - b))
SVM error for point (x,y): max(1 - y(x ∙ w - b), 0)
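A minimal sketch of the two error functions, assuming f(x) = x ∙ w − b as in the AMPL model on slide S7. The function names are illustrative, and treating a point exactly on the plane (y f(x) = 0) as correctly classified is an assumption about the step function.

```python
def decision_value(x, w, b):
    """f(x) = x.w - b, the signed score used by the classifier."""
    return sum(xj * wj for xj, wj in zip(x, w)) - b

def step_loss(x, y, w, b):
    """Misclassification error: 1 if y * f(x) < 0, else 0.
    (Assumption: y * f(x) == 0 counts as correct.)"""
    return 1.0 if y * decision_value(x, w, b) < 0 else 0.0

def hinge_loss(x, y, w, b):
    """SVM error: max(1 - y * f(x), 0)."""
    return max(1.0 - y * decision_value(x, w, b), 0.0)

# Hypothetical weights and points, for illustration only.
w, b = [1.0, 0.0], 0.0
for x in ([2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]):
    print(x, step_loss(x, 1, w, b), hinge_loss(x, 1, w, b))
```

The hinge loss is never smaller than the step loss, which is why minimizing it also pushes the misclassification count down while keeping the problem convex.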
S7: AMPL Model File HW3_QSARSVM_mod.txt
#CONSTANTS
param p; # number of predictor variables
param n; # number of observations
param c; # SVM model parameter

set P:={1..p}; # indices of predictor variables
set N:={1..n}; # indices of observations

param X{N,P}; # observation data
param Y{N}; # classes of each observation (1 or -1)

#VARIABLES of model
var w{P};
var b;
var z{N};

#OBJECTIVE FUNCTION for SVM classification model
minimize objective:
c*sum{i in N} z[i] + sum{j in P} w[j]*w[j];

#CONSTRAINTS
subject to trainerror {i in N}:
Y[i]*( sum {j in P} X[i,j]*w[j] - b ) + z[i] >= 1;

subject to slack {i in N}:
z[i] >= 0;
S8: AMPL Data File sampleSVM_dat2.txt
#PARAMETERS
param p:= 2; # number of features
param n:= 7; # total number of observations
param c:=10; # constant used in SVM model

# All X Data
param X :
    1     2 :=
1   1     0
2  -1    -2
3  -1     1
4   1     2
5   2     3
6   0     4
7   0    -0.5;

# Y Data (must be -1 or 1)
param Y :=
1  1
2  1
3  1
4 -1
5 -1
6 -1
7 -1;
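As a cross-check outside AMPL, the sketch below evaluates the S7 objective on the seven sample observations for a candidate (w, b). It relies on the fact that, for fixed w and b, the smallest feasible slack is z[i] = max(0, 1 − Y[i](X[i] ∙ w − b)). The function name `svm_objective` is an assumption, and this only evaluates the objective; it does not solve the model.

```python
# Sample data as read from the data file on slide S8.
X = [[1, 0], [-1, -2], [-1, 1], [1, 2], [2, 3], [0, 4], [0, -0.5]]
Y = [1, 1, 1, -1, -1, -1, -1]
C = 10

def svm_objective(w, b, X, Y, c):
    """Return (objective, slacks) for the model
    minimize c * sum(z) + sum(w[j]^2), with z[i] set to its
    smallest feasible value for the given w and b."""
    slacks = []
    for x, y in zip(X, Y):
        f = sum(xj * wj for xj, wj in zip(x, w)) - b
        slacks.append(max(0.0, 1.0 - y * f))
    value = c * sum(slacks) + sum(wj * wj for wj in w)
    return value, slacks

# With w = 0 and b = 0 every slack equals 1, so the objective is c * n.
value, slacks = svm_objective([0.0, 0.0], 0.0, X, Y, C)
print(value)   # 70.0
print(slacks)
```

Any solution reported by the solver should give an objective value no larger than this trivial starting point.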
S9: AMPL Command File HW3_QSARSVM_cmd.txt
# Command file to train and test SVM
# Created by Kristin Bennett, March 2015.
# Solve SVM Model
solve;

# Display Solutions weights, bias, and objective
display w;
display b;
S8: Calling from NEOS using Minos solver
• http://www.neos-server.org/neos/solvers/nco:MINOS/AMPL.html
S9: Minos Results
S11: Estimating Generalization Error of a Model
S12: Classification Homework: Drug Discovery Problem
• Cost about $1 billion to develop a drug
• First-year sales > $1 billion/drug
• 1 drug approved/1000 compounds tested
• 1 out of 100 drugs succeeds to market
• Trend – More drug failures in late-stage clinical trials or even after on market!
• Pressure to reduce drug costs/development time
• Data-driven models to predict bioactivities of drugs have potential to accelerate the drug discovery process
• Efficacy
• Toxicity
• Absorption
• Distribution
• Metabolism
• Excretion
• Biodegradability
S13: Classification Task
The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time REACH encourages the use of alternatives to animal testing which includes predictions from quantitative structure–activity relationship (QSAR) models.
Your assignment is to build a classification QSAR model to predict ready biodegradation of chemicals based on the provided training set, and to estimate the generalization error using the testing set. Then estimate the biodegradability of a new point.
S14: Data Available
Experimental values for the chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE). The training data set consists of these 833 chemicals, whether they were readily biodegradable (1 = yes, -1 = no), and 41 molecular descriptors.
A testing set of 42 molecules with known biodegradability
Mystery point with unknown biodegradability
Data files are in the form IDNUM, descriptor 1, ….., descriptor 41, class
Molecules and molecular descriptors are proprietary. No details provided except cryptic names in a separate file.
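A minimal sketch of reading one record in the stated layout (IDNUM, descriptors, class). The comma delimiter, the function name `parse_record`, and the example line are assumptions; the actual homework files may use a different separator.

```python
def parse_record(line, n_descriptors=41):
    """Split one line of the form
    IDNUM, descriptor 1, ..., descriptor n, class
    into (idnum, descriptors, label).
    Assumption: fields are comma-separated."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != n_descriptors + 2:
        raise ValueError(f"expected {n_descriptors + 2} fields, got {len(fields)}")
    idnum = fields[0]
    descriptors = [float(value) for value in fields[1:-1]]
    label = int(float(fields[-1]))
    if label not in (1, -1):
        raise ValueError(f"class must be 1 or -1, got {label}")
    return idnum, descriptors, label

# Hypothetical record with only 3 descriptors, for illustration.
print(parse_record("17, 0.5, -1.2, 3, -1", n_descriptors=3))
```

The mystery point has no known class, so it would need a variant that skips the last field.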
S15: Overall Experimental Design
Data provided
Training set of 833 41-dimensional points
Test set of 50 41-dimensional points
Train the classification model on the training set using NEOS to determine the model parameters.
Calculate the error on the training set.
Test the classification model, with the model parameters found, on the test set to see how well it works on future data.
+ some other stuff
Metric of success: small number of points misclassified on the test set
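Once NEOS returns w and b, the success metric can be computed as sketched below. The classification rule sign(x ∙ w − b) follows the model on slide S7; the function names, the tie rule at f(x) = 0, and the toy points are assumptions for illustration.

```python
def predict(x, w, b):
    """Predict class 1 if x.w - b >= 0, else -1.
    (Assumption: ties at exactly 0 go to class 1.)"""
    f = sum(xj * wj for xj, wj in zip(x, w)) - b
    return 1 if f >= 0 else -1

def misclassified(X, Y, w, b):
    """Return (number of misclassified points, error rate)."""
    wrong = sum(1 for x, y in zip(X, Y) if predict(x, w, b) != y)
    return wrong, wrong / len(Y)

# Hypothetical parameters and test points (not the homework data).
w, b = [1.0, -1.0], 0.5
X_test = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
Y_test = [1, -1, 1, 1]

print(misclassified(X_test, Y_test, w, b))   # (1, 0.25)
```

Running the same function on the training set gives the training error, so the two can be compared directly.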
S16: Many Data Sets are Published with Associated Papers.
Reference: Mansour et al, Training and testing linear SVM models to predict biodegradability of chemicals based on data, 2013
Location: http://pubs.acs.org/doi/abs/10.1021/ci4000213
S18: Regression Data Set: Energy Efficiency Data Set
We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the aforementioned characteristics to obtain 499 building shapes. The dataset comprises 80 samples and 8 features, aiming to predict two real-valued responses.
The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.
X1 Relative Compactness
X2 Surface Area
X3 Wall Area
X4 Roof Area
X5 Overall Height
X6 Orientation
X7 Glazing Area
X8 Glazing Area Distribution
y1 Heating Load
y2 Cooling Load
Reference A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools', Energy and Buildings, Vol. 49, pp. 560-567, 2012
Location http://archive.ics.uci.edu/ml/datasets/Energy+efficiency
S20: Homework 3 Extra Credit
• Design an experiment to explore the effect of the tradeoff parameter λ in regression and C in classification on the training and testing errors.
• Describe what you observed in terms of training and testing errors. Indicate which model is best for each problem and why.
3 pts for Regression
3 pts for Classification
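One possible shape for the classification half of such an experiment is sketched below: train at several values of C and record training and testing errors for each. Everything here is an assumption for illustration: the toy data is made up, and `train_svm` is a simple subgradient-descent stand-in for the NEOS/MINOS solve, not the solver used in class. The regression experiment with λ would follow the same loop with a different model.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def objective(w, b, X, Y, c):
    """c * sum of hinge losses + sum of squared weights (S7 model)."""
    hinge = sum(max(0.0, 1.0 - y * (dot(x, w) - b)) for x, y in zip(X, Y))
    return c * hinge + dot(w, w)

def train_svm(X, Y, c, iterations=500):
    """Subgradient descent on the S7 objective.
    Stand-in for the NEOS solve; returns the best iterate seen."""
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    best_value, best_w, best_b = objective(w, b, X, Y, c), list(w), b
    for t in range(1, iterations + 1):
        grad_w = [2.0 * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(X, Y):
            if y * (dot(x, w) - b) < 1.0:   # point violates the margin
                for j in range(p):
                    grad_w[j] -= c * y * x[j]
                grad_b += c * y
        step = 1.0 / (2.0 * t)              # diminishing step size
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
        value = objective(w, b, X, Y, c)
        if value < best_value:
            best_value, best_w, best_b = value, list(w), b
    return best_w, best_b

def error_rate(X, Y, w, b):
    wrong = sum(1 for x, y in zip(X, Y)
                if (1 if dot(x, w) - b >= 0 else -1) != y)
    return wrong / len(Y)

# Hypothetical toy data, made up for illustration only.
X_train = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
           [-1.0, -1.0], [-2.0, 0.0], [0.5, -2.0]]
Y_train = [1, 1, 1, -1, -1, -1]
X_test = [[1.5, 2.5], [3.0, 3.0], [-1.5, -0.5], [0.0, -1.0]]
Y_test = [1, 1, -1, -1]

C_VALUES = [0.01, 0.1, 1.0, 10.0, 100.0]    # log-spaced grid
results = []
for c in C_VALUES:
    w, b = train_svm(X_train, Y_train, c)
    results.append((c,
                    error_rate(X_train, Y_train, w, b),
                    error_rate(X_test, Y_test, w, b)))

for c, train_err, test_err in results:
    print(f"C={c:<6} train error={train_err:.2f} test error={test_err:.2f}")
```

A log-spaced grid is a common choice because the useful range of C often spans several orders of magnitude; the table of results is what the write-up would then discuss.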
S21: Project
Project Description and Checklist will be available on Prof. K’s web page in DM folder.
Deliverables per group:
• Project Proposal - April 3
• Project Progress Report – April 24
• Project Final Report - May 12
• Project Poster Presentation (Grad Only) - May 22 during exam slot
S22: Sample Project Like homework
Find a problem that can be solved using data and a classification or regression model (can be SVM or other optimization-based model).
• Pick one or more data problems from sources like UCI, public papers
• Make train and test sets for each problem.
• Create one or more predictive models using NEOS (easy to make many if you change parameters like C)
• Run models on the training sets
• Evaluate training set errors
• Test models on the testing sets, calculate testing set errors
• Discuss your findings. How well did the models work? Which models were preferable? Compare the performance of the models on the different data sets in terms of efficiency, error, interpretability of solutions.
• Extra credit for creativity and exploration
• See project description for more details.
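For the "make train and test sets" step, a minimal sketch of a random split is shown below. The function name, the 20% test fraction, and the fixed seed are assumptions; any reproducible split that keeps the test set untouched during training serves the same purpose.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing.
    Returns (train, test). A fixed seed makes the split reproducible."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: integers standing in for data rows.
records = list(range(10))
train, test = train_test_split(records)
print(len(train), len(test))   # 8 2
```

The held-out records should only be used to calculate testing errors after the model parameters have been fixed.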
S23: Finding Problems or Methods
1) scholar.google.com or pubmed
Find a paper and topic that interests you. Try to duplicate it
2) Start with a machine learning task and look up some relevant papers
Many learning algorithms have an optimization problem as their engine
3) Textbook, Wikipedia, a problem in another class
4) Ask your favorite professor
S24: Finding Data
1) Machine learning repositories
DELVE http://www.cs.toronto.edu/~delve/data/datasets.html
UCI http://archive.ics.uci.edu/ml/
2) Pubmed frequently has data in supplementary material
http://www.ncbi.nlm.nih.gov/pubmed
3) Domain specific data repositories: finance, astronomy, biology, ???
4) Your local research project
5) Simulated data
S25: Help available for formulating project
Regular office hours
M 2:30 to 3:30
F 9 to 11
Call 6899 or email [email protected]
if you want to come by or set up an appointment