Lec1-Into

CSE 591: Machine learning and Applications

Jieping Ye

Department of Computer Science & Engineering

Arizona State University

Brief Introduction

Dr. Jieping Ye

Assistant Professor at CSE Dept.

Affiliated with the Center for Evolutionary Functional Genomics at the Biodesign Institute

Research interests: machine learning, data mining and their applications to bioinformatics

Dimensionality reduction

Semi-supervised learning

Kernel learning

Biological image analysis

Outline of lecture

Course information

Project

Introduction to ML

Course schedule

Survey

Course Information

Instructor: Dr. Jieping Ye

Office: BY 568

Phone: 727-7451

Email: [email protected]

Web: http://www.public.asu.edu/~jye02/CLASSES/Spring-2007/

Time: TTh 4:40 pm - 5:55 pm

Office hours: TTh 10:00 am - 11:45 am

Location: BYAC 270

TA: Jianhui Chen

Office hours: Th 3:30 pm - 4:30 pm

Course information (Cont’d)

Prerequisite: Basics of linear algebra, a, algorithm design and analysis.

Course textbook: No textbook is required. (Papers and other materials are available at the class web page)

Objective: An in-depth understanding of some of the important machine learning methods and their applications in bioinformatics and other domains.

Topics: Clustering, regression, classification, semi-supervised learning, feature reduction, manifold learning, ranking, and kernel learning.

Reference books

Pattern Classification. Duda, et al., 2000.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie, et al., 2001.

Kernel Methods in Computational Biology. Schölkopf, et al., editors. 2004.

Kernel Methods for Pattern Analysis. Shawe-Taylor and Cristianini, 2004.

Introduction to Data Mining. Tan, et al., 2005.

Grading

Homework (3): 30%

Project: 40%. Two to three students form a group to carry out a small research project.

A survey of the state of the art in an area related to this course.

Machine learning techniques for specific applications.

A comparative study of several well-known algorithms.

Design of a novel algorithm related to this course.

Exam (1): 20%. There will be one open-book exam on 3/22/07.

Class participation: 10%. Students are required to attend the lecture and participate in the class discussion.

A: 90—100, A-: 85—89, B+: 80—84, B: 70—79, C: 60—70

Project

Project proposal is due on 2/08/07

One half to one page

Topics, references, and plan

The intermediate project report is due on 4/05/07

Five to ten pages

The final project report is due on 4/26/07

Fifteen to twenty pages

Project presentation

About 5 minutes

Programming languages

Matlab

Tutorials

http://www.math.ufl.edu/help/matlab-tutorial/

http://www.math.mtu.edu/~msgocken/intro/node1.html

R (Statistics)

http://www.r-project.org/

Or other languages

What is machine learning?

Machine learning is the study of computer systems that improve their performance through experience.

Learn existing and known structures and rules.

Discover new findings and structures.

Face recognition

Bioinformatics

Supervised learning vs. unsupervised learning

Semi-supervised learning

Machine learning versus data mining

A lot of common topics

Clustering

Classification

Many others

Different focuses

ML focuses more on theory (statistics)

DM focuses more on applications

Clustering

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are minimized
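The two criteria above can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy data set of two hand-assigned groups of 2-D points and Euclidean distance; the function names are illustrative and do not come from the course material.

```python
import math
from itertools import combinations, product

# Hypothetical toy data: two hand-assigned groups of 2-D points.
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}


def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    dists = [
        math.dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(dists) / len(dists)


def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    dists = [
        math.dist(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p, q in product(a, b)
    ]
    return sum(dists) / len(dists)


intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

A good grouping should give a small intra-cluster value relative to the inter-cluster value; a clustering algorithm searches for the assignment that achieves this rather than taking it as given.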

Applications of Cluster Analysis

Understanding

Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Summarization

Reduce the size of large data sets

Clustering precipitation in Australia

Classification: Definition

Given a collection of records (training set)

Each record contains a set of attributes; one of the attributes is the class.

Find a model for class attribute as a function of the values of other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
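The train/test procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the one-dimensional toy records, the 30% test fraction, the fixed seed, and the midpoint-threshold model are all hypothetical choices, not part of the course material.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_midpoint_threshold(train):
    """Build a model from the training set only.

    Assumes class 1 has larger attribute values than class 0 and places the
    decision threshold halfway between the two class means.
    """
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > threshold)


def accuracy(model, records):
    """Fraction of records whose predicted class equals the true class."""
    return sum(model(x) == y for x, y in records) / len(records)


# Hypothetical records: (attribute value, class label).
records = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(100, 110)]

train, test = holdout_split(records)
model = fit_midpoint_threshold(train)      # the test set is never seen here
test_accuracy = accuracy(model, test)      # estimate on previously unseen records
print(len(train), len(test), test_accuracy)
```

The important design point is that the model is fitted on the training set alone, so the test-set accuracy is an estimate of how well it handles unseen records.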

Classification Example

Training Set (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Diagram: Training Set -> Learn Classifier -> Model <- Test Set
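To make the example concrete, here is a sketch of one possible model for the records above. The decision rules are hand-written for illustration, not learned by an algorithm and not taken from the slides; in particular the 80K income threshold is an assumption chosen so that the rules agree with all ten training records.

```python
# Training records from the table: (refund, marital status, income in K, cheat).
TRAINING_SET = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the table; the class is unknown.
TEST_SET = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict_cheat(refund, marital_status, income_k):
    """Hand-written decision rules (hypothetical, for illustration only)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumed threshold: single or divorced, no refund, income of 80K or more.
    return "Yes" if income_k >= 80 else "No"


training_accuracy = sum(
    predict_cheat(refund, status, income) == cheat
    for refund, status, income, cheat in TRAINING_SET
) / len(TRAINING_SET)

test_predictions = [predict_cheat(*record) for record in TEST_SET]
print(training_accuracy, test_predictions)
```

Fitting the training records perfectly says nothing about accuracy on the test records, whose true classes are not given here.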

Classification: Application

Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.

Approach:

Use credit card transactions and the information on the account holder as attributes.

When does a customer buy, what does he buy, how often does he pay on time, etc.

Label past transactions as fraud or fair transactions. This forms the class attribute.

Learn a model for the class of the transactions.

Use this model to detect fraud by observing credit card transactions on an account.

Character Recognition

Given a digit representation, what is its class?

AT&T have used neural networks and support vector machines

Error rates ~1.4%

Inputs are 28x28 greyscale images.

Other applications

Face recognition

Protein function prediction

Cancer detection

Document categorization

Data representation

Traditional algorithms work on vectors.

Images can be represented as matrices or vectors.

Abstract data

Graphs

Sequences

3D structures
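As a small illustration of the image case, the sketch below turns a matrix of pixel values into a vector and back. The 28x28 size matches the greyscale digit images mentioned earlier; the pixel values themselves are a made-up pattern, and the row-by-row ordering is one common convention, assumed here.

```python
ROWS, COLS = 28, 28

# Hypothetical greyscale image: a matrix (list of rows) of values in 0..255.
image = [[(r * COLS + c) % 256 for c in range(COLS)] for r in range(ROWS)]


def flatten(matrix):
    """Concatenate the rows of a matrix into one long vector (row-major order)."""
    return [pixel for row in matrix for pixel in row]


def unflatten(vector, cols):
    """Rebuild the matrix from a row-major vector with the given row length."""
    return [vector[i:i + cols] for i in range(0, len(vector), cols)]


vector = flatten(image)
print(len(vector))  # number of features seen by a vector-based algorithm
```

Flattening discards the explicit 2-D neighbourhood structure, which is one reason matrix representations are sometimes preferred.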

Kernel Methods: Basic ideas

Original Space Feature Space
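One way to see the relation between the original space and the feature space is a kernel whose feature map can be written out explicitly. The sketch below uses the degree-2 homogeneous polynomial kernel on 2-D inputs; this particular kernel and the function names are chosen for illustration and are not taken from the slides.

```python
import math


def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Kernel evaluated in the original 2-D space: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_feature_map(x):
    """Explicit map into a 3-D feature space where k is an ordinary inner product."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
k_original = poly2_kernel(x, y)
k_feature = dot(poly2_feature_map(x), poly2_feature_map(y))
print(k_original, k_feature)
```

Both computations give the same value, so an algorithm that only needs inner products can work in the feature space without ever constructing the mapped vectors.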

Applications in bioinformatics

Protein sequence

Protein structure

Data integration

mRNA expression data

protein-protein interaction data

hydrophobicity data

sequence data (gene, protein)

Genome-wide data

Curse of dimensionality

Large sample size is required for high-dimensional data.

Query accuracy and efficiency degrade rapidly as the dimension increases.

Strategies

Feature reduction

Feature selection

Manifold learning

Kernel learning
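Two simple calculations illustrate why sample size becomes a problem in high dimensions. This is a sketch under assumptions of my own choosing: data spread over a unit hypercube, a fixed resolution of 10 grid points per axis, and the standard formula for the volume of a d-dimensional ball.

```python
import math


def samples_for_grid(dim, points_per_axis=10):
    """Number of samples needed to cover a unit hypercube with a regular grid."""
    return points_per_axis ** dim


def inscribed_ball_fraction(dim):
    """Fraction of the unit hypercube's volume inside its inscribed ball.

    Uses V_d(r) = pi^(d/2) / Gamma(d/2 + 1) * r^d with radius r = 0.5.
    """
    radius = 0.5
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


for dim in (1, 2, 3, 10):
    print(dim, samples_for_grid(dim), inscribed_ball_fraction(dim))
```

The grid count grows exponentially with the dimension, and the shrinking ball fraction shows that uniformly spread points end up far from the centre, in the corners of the cube, which is part of why distance-based queries become less informative.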

Manifold learning

A manifold is a topological space which is locally Euclidean.

Intuition: how does your brain store these pictures?

Model selection

Choose the best model from a set of different models to fit to the data

Support Vector Machines (SVM), Linear Discriminant Analysis (LDA)

Models are specified by certain parameters.

How to choose the best parameters?

Cross-validation (leave-one-out, k-fold CV)
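Cross-validation can be sketched in a few lines. The example below is illustrative only: the toy data set, the two candidate models (a class-mean midpoint threshold and a majority-class baseline), and all function names are assumptions, standing in for real candidates such as SVM or LDA with different parameter settings.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cv_accuracy(fit, data, k):
    """Average held-out accuracy, where fit(train_records) returns a predictor."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = fit([data[i] for i in train_idx])
        hits = sum(model(data[i][0]) == data[i][1] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def fit_midpoint(train):
    """Threshold halfway between the class means (assumes class 1 is larger)."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > threshold)


def fit_majority(train):
    """Baseline that always predicts the most frequent training class."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


# Hypothetical records (attribute, class), interleaved so each fold has both classes.
data = [(1, 0), (9, 1), (2, 0), (8, 1), (3, 0), (7, 1), (4, 0), (6, 1)]

candidates = {"midpoint": fit_midpoint, "majority": fit_majority}
scores = {name: cv_accuracy(fit, data, k=4) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Setting k equal to the number of records, for example `cv_accuracy(fit_midpoint, data, k=len(data))`, gives leave-one-out cross-validation.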

Machine learning applications

Bioinformatics: Huge amount of biological data from the human genome project and human proteomics initiative.

Goal: Understanding of biological systems at the molecular level from diverse sources of biological data.

Challenge: Scalability, multiple sources, abstract data.

Applications: Microarray data analysis, protein classification, mass spectrometry data analysis, protein-protein interaction.

Others: Computer vision, information retrieval, image processing, text mining, web mining, etc.

Course schedule

Survey

Why are you taking this course?

What would you like to gain from this course?

What topics are you most interested in learning about from this course?

Any other suggestions?

Next class

Topics

Basics of linear algebra

Basics of probability

Readings (available at the class webpage)

Mini tutorial on the Singular Value Decomposition
