CSE 591: Machine learning and Applications
Jieping Ye
Department of Computer Science & Engineering
Arizona State University
Brief Introduction
Dr. Jieping Ye
Assistant Professor at CSE Dept.
Affiliated with the Center for Evolutionary Functional Genomics at the Biodesign Institute
Research interests: machine learning, data mining and their applications to bioinformatics
Dimensionality reduction
Semi-supervised learning
Kernel learning
Biological image analysis
Outline of lecture
Course information
Project
Introduction to ML
Course schedule
Survey
Course Information
Instructor: Dr. Jieping Ye
Office: BY 568
Phone: 727-7451
Email: [email protected]
Web: http://www.public.asu.edu/~jye02/CLASSES/Spring-2007/
Time: TTh 4:40 pm - 5:55 pm
Office hours: TTh 10:00 am - 11:45 am
Location: BYAC 270
TA: Jianhui Chen
TA office hours: Th 3:30 pm - 4:30 pm
Course information (Cont’d)
Prerequisite: Basics of linear algebra, algorithm design and analysis.
Course textbook: No textbook is required. (Papers and other materials are available at the class web page)
Objective: An in-depth understanding of some of the important machine learning methods and their applications in bioinformatics and other domains.
Topics: Clustering, regression, classification, semi-supervised learning, feature reduction, manifold learning, ranking, and kernel learning.
Reference books
Pattern Classification. Duda, et al., 2000.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie, et al., 2001.
Kernel Methods in Computational Biology. Schölkopf, et al., editors. 2004.
Kernel Methods for Pattern Analysis. Shawe-Taylor and Cristianini, 2004.
Introduction to Data Mining. Tan, et al., 2005.
Grading
Homework (3): 30%
Project: 40%. Two to three students form a group to carry out a small research project.
Possible project types:
A survey of the state of the art in an area related to this course.
Machine learning techniques for specific applications.
A comparative study of several well-known algorithms.
Design of a novel algorithm related to this course.
Exam (1): 20%. There will be one open-book exam on 3/22/07.
Class participation: 10%. Students are required to attend the lecture and participate in the class discussion.
A: 90—100, A-: 85—89, B+: 80—84, B: 70—79, C: 60—70
Project
Project proposal is due on 2/08/07: one half to one page; topics, references, and plan
The intermediate project report is due on 4/05/07: five to ten pages
The final project report is due on 4/26/07: fifteen to twenty pages
Project presentation: about 5 minutes
Programming languages
Matlab
Tutorials: http://www.math.ufl.edu/help/matlab-tutorial/ and http://www.math.mtu.edu/~msgocken/intro/node1.html
R (Statistics): http://www.r-project.org/
Or other languages
What is machine learning?
Machine learning is the study of computer systems that improve their performance through experience.
Learn existing and known structures and rules.
Discover new findings and structures.
Face recognition
Bioinformatics
Supervised learning vs. unsupervised learning
Semi-supervised learning
Machine learning versus data mining
A lot of common topics: clustering, classification, many others
Different focuses: ML focuses more on theory (statistics); DM focuses more on applications
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
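The grouping idea above can be sketched with k-means, one common clustering method (the slides do not name a specific algorithm, so this choice is an assumption). The sketch is minimal: it uses the first k points as initial centroids for reproducibility, and the toy points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=20):
    """Return (cluster index per point, list of centroids).

    Assumption for this sketch: the first k points seed the centroids.
    """
    centroids = list(points[:k])
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid,
        # which keeps intra-cluster distances small.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignment, centroids


# Hypothetical toy data: two well-separated groups, interleaved.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

In practice the initial centroids are usually chosen at random and the run is repeated several times, since k-means can stop at a poor local optimum.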
Applications of Cluster Analysis
Understanding: Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
Summarization: Reduce the size of large data sets
Clustering precipitation in Australia
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
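The train/test procedure described above can be sketched as follows. The nearest-centroid classifier, the 70/30 split, and the synthetic records are all assumptions chosen to keep the example short; any classifier could take their place.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Build the model: one mean feature vector per class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Assign the class whose centroid is closest to the record."""
    return min(
        model,
        key=lambda label: sum((f - c) ** 2 for f, c in zip(features, model[label])),
    )


def accuracy(model, test):
    """Fraction of previously unseen records that are classified correctly."""
    correct = sum(1 for features, label in test if predict(model, features) == label)
    return correct / len(test)


# Hypothetical data set: 10 records per class, two numeric attributes.
records = []
for i in range(10):
    records.append(((i * 0.1, i * 0.2), "a"))
    records.append(((10 + i * 0.1, 10 - i * 0.2), "b"))

train, test = split_train_test(records)
model = fit_centroids(train)      # built from the training set only
test_accuracy = accuracy(model, test)  # validated on the held-out test set
print(len(train), len(test), test_accuracy)
```

The key point is that the test records never influence the model, so the measured accuracy estimates performance on unseen data.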
Classification Example

Training Set
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model <- Test Set
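As an illustration of what a learned model for this example might look like, the sketch below encodes the training and test records and applies a small hand-written rule set. The rules (including the 80K income threshold) are a hypothetical model chosen because it fits the training set; they are not given in the slides, and other models fit the same data.

```python
# (refund, marital_status, taxable_income_in_K, cheat)
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# (refund, marital_status, taxable_income_in_K); the class is unknown
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def classify(refund, marital_status, income_k):
    """Hypothetical rule-based model (an assumption, not from the slides)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumed threshold: unmarried, no refund, income of at least 80K.
    return "Yes" if income_k >= 80 else "No"


# Check the model against the labeled training records.
correct = sum(
    1 for refund, status, income, cheat in training_set
    if classify(refund, status, income) == cheat
)
training_accuracy = correct / len(training_set)

# Apply the model to the unlabeled test records.
predictions = [classify(*record) for record in test_set]
print(training_accuracy)
print(predictions)
```

Perfect training accuracy does not guarantee accuracy on the test set; that can only be measured once the true test labels are known.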
Classification: Application
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes.
When does a customer buy, what do they buy, how often do they pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Character Recognition
Given a digit representation.
What is its class?
AT&T have used: Neural Networks, Support Vector Machines
Error rates ~1.4%
Inputs are 28x28 greyscale images.
Other applications
Face recognition
Protein function prediction
Cancer detection
Document categorization
Data representation
Traditional algorithms work on vectors.
Images can be represented as matrices or vectors.
Abstract data: graphs, sequences, 3D structures
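A minimal sketch of the matrix-to-vector representation for images, assuming row-major order (the slides do not specify an ordering). The tiny 2x3 image is hypothetical; a 28x28 greyscale digit would become a vector of length 784 in the same way.

```python
def flatten(image):
    """Turn a matrix (list of rows) into one vector, row by row."""
    return [pixel for row in image for pixel in row]


def unflatten(vector, n_cols):
    """Rebuild the matrix from a row-major vector."""
    return [vector[i:i + n_cols] for i in range(0, len(vector), n_cols)]


# Hypothetical 2x3 greyscale image with pixel values in 0..255.
image = [
    [0, 128, 255],
    [64, 32, 16],
]
vector = flatten(image)
print(vector)  # [0, 128, 255, 64, 32, 16]

# A blank 28x28 image flattens to 784 features.
digit = [[0] * 28 for _ in range(28)]
print(len(flatten(digit)))  # 784
```

Flattening lets vector-based algorithms process images, at the cost of discarding the explicit row and column neighborhood structure.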
Kernel Methods: Basic ideas
Original Space Feature Space
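One way to make the original-space versus feature-space idea concrete is the degree-2 polynomial kernel, used here as an assumed example (the slides do not name a kernel). For 2D inputs, the kernel value computed in the original space equals an ordinary inner product after an explicit mapping into a 3D feature space.

```python
import math


def dot(u, v):
    """Standard inner product."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for the kernel k(x, z) = (x . z)^2 on 2D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Kernel evaluated directly in the original space."""
    return dot(x, z) ** 2


# Hypothetical input points.
x = (1.0, 2.0)
z = (3.0, 0.5)

explicit = dot(phi(x), phi(z))   # inner product in the feature space
implicit = poly_kernel(x, z)     # same value, without ever computing phi
print(explicit, implicit)
```

Because the algorithm only needs inner products, it can work in the feature space through the kernel alone, which matters when the feature space is very high-dimensional.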
Applications in bioinformatics
Protein sequence
Protein structure
Data integration
Genome-wide data: mRNA expression data, protein-protein interaction data, hydrophobicity data, sequence data (gene, protein)
Curse of dimensionality
Large sample size is required for high-dimensional data.
Query accuracy and efficiency degrade rapidly as the dimension increases.
Strategies: feature reduction, feature selection, manifold learning, kernel learning
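One reason nearest-neighbor queries degrade in high dimensions is that distances concentrate: the nearest and farthest points end up almost equally far from the query. The sketch below measures this on uniform random data; the sample size, the seed, and the "relative contrast" measure are assumptions made for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query
    to n_points random points in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(
            math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        )
    return (max(distances) - min(distances)) / min(distances)


contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrast.items():
    print(dim, round(value, 3))
```

A contrast close to zero means the nearest neighbor is barely closer than any other point, which is why the strategies listed above aim to reduce the effective dimension.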
Manifold learning
A manifold is a topological space which is locally Euclidean.
Intuition: how does your brain store these pictures?
Model selection
Choose the best model from a set of different models to fit to the data
Support Vector Machines (SVM), Linear Discriminant Analysis (LDA)
Models are specified by certain parameters.
How to choose the best parameters?
Cross-validation (leave one out, k-fold CV)
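Cross-validation for parameter choice can be sketched as below. The model is a k-nearest-neighbor classifier on 1D data, used as an assumed stand-in for SVM or LDA, and the parameter being selected is the number of neighbors. The data set is hypothetical and contains one deliberately noisy label at x = 2.2.

```python
def fold_indices(n, n_folds):
    """Yield (train_idx, test_idx); every index is tested exactly once.
    With n_folds == n this is leave-one-out cross-validation."""
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        yield train_idx, test_idx


def knn_predict(train, x, k):
    """Majority vote among the k training records nearest to x."""
    neighbors = sorted(train, key=lambda record: abs(record[0] - x))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def cv_correct(data, k, n_folds):
    """Number of records classified correctly while held out."""
    correct = 0
    for train_idx, test_idx in fold_indices(len(data), n_folds):
        train = [data[i] for i in train_idx]
        for i in test_idx:
            x, label = data[i]
            if knn_predict(train, x, k) == label:
                correct += 1
    return correct


# Hypothetical 1D data; (2.2, "high") is a noisy label.
data = [
    (1.0, "low"), (1.4, "low"), (1.9, "low"), (2.2, "high"),
    (2.5, "low"), (3.2, "low"), (6.0, "high"), (6.3, "high"),
    (6.9, "high"), (7.4, "high"), (8.2, "high"),
]

candidates = [1, 3, 5]
# Leave-one-out: one fold per record.
scores = {k: cv_correct(data, k, n_folds=len(data)) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])
print(scores, best_k)
```

Here k = 1 is misled by the noisy record, while larger k smooths it out; ties between candidates are broken in favor of the smaller k by the order of the candidate list.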
Machine learning applications
Bioinformatics: Huge amount of biological data from the human genome project and human proteomics initiative.
Goal: Understanding of biological systems at the molecular level from diverse sources of biological data.
Challenge: Scalability, multiple sources, abstract data.
Applications: Microarray data analysis, protein classification, mass spectrometry data analysis, protein-protein interaction.
Others: Computer vision, information retrieval, image processing, text mining, web mining, etc.
Course schedule
Survey
Why are you taking this course?
What would you like to gain from this course?
What topics are you most interested in learning about from this course?
Any other suggestions?
Next class
Topics: Basics of linear algebra, basics of probability
Readings (available at the class webpage): Mini tutorial on the Singular Value Decomposition