CS598 Machine Learning in Computational Biology (Lecture 1: Introduction)
Professor: Jian Peng
Teaching Assistant: Rongda Zhu
Introduction
Instructor:
• Jian Peng
• Office: 2118 SC
• Office hours: Thursday, 3:15pm-4:45pm
• Email: [email protected]
• My own research: Computational Biology and Graphical Models
Teaching Assistant:
• Rongda Zhu, PhD student, Department of Computer Science ([email protected])
• Rongda’s research: Machine Learning and Probabilistic Inference
Course website: http://web.engr.illinois.edu/~jianpeng/teaching/CS598_Fall15/index.htm
Course Information
Schedule (tentative)
• Introductory lectures (Aug 25 to Sep 8)
  • Biological data analysis
  • Probabilistic models
• Student presentations (Sep 8 to Dec 3)
  • Research survey
  • Research article
• Course projects
  • Proposal presentation (Oct 6 & 8)
  • Final presentation (Dec 8 & 10)
Objectives
Introduction to computational biology
• Important problems in computational biology
• Machine learning techniques for data analysis
• Understand how methods work
Learning to do research
• Paper presentation
  • Ability to present key ideas to other people
  • Ability to ask critical questions
• Course project experience
  • Hands-on practice with real datasets
  • Propose and perform independent research
  • Active participation in the field
Prerequisites
Biology:
• Basic concepts in molecular biology
• Reference: Molecular Biology for Computer Scientists by Lawrence Hunter
Machine Learning:
• Probability and statistics
• Optimization
• Textbook: Pattern Recognition and Machine Learning by Christopher Bishop
Grading
• Class attendance: 10%
• Presentation: 30%
• Course Project: 60%
  • Proposal
  • Report
  • Presentation
Presentation
• Discuss papers you would like to present with me at least one week before your presentation
• Research survey (at least five papers)
  • Methodology: applications to different problems
  • Research problem: the state-of-the-art methods
• Research article (preferred)
  • Background: what is the problem? why is it important?
  • Methodology: how does it work?
  • Results: what are the findings? any conclusions?
• Open-ended Q & A and debate
Questions about the presentation?
Course Project
Computational techniques
• Novel machine learning algorithms
• Efficient algorithms that scale to large datasets
• New probabilistic models for biological data
Biological problems
• New biological findings
• Improvements over existing methods
• New computational biology problems
The goal is to have something publishable or presentable in conferences or journals.
Course Project
• Proposal presentation (Oct 6 & 8)
  • Written proposal due by Oct 4
  • At least four pages
  • Discuss your project with me in September
  • 15-min presentation in class
  • I will also give you a list of potential projects if you don’t have one by Sep 20
• Final presentation (Dec 8 & 10)
  • Report due by Dec 12
  • At least eight pages
  • 15-min oral presentation and poster
Course Project
• Team size
  • One or two
  • Make your contribution clear in the project report
• Implementation
  • Put your code/data on GitHub
  • Get your hands dirty and work on real-world datasets
  • Your contribution should be original
Questions about the course project?
Introduce yourself
Why is computational biology hard?
• High-dimensional
• Noisy
• Huge
• Sparse
Biological Data
Sequence data
• Protein/DNA sequence
• Generative and discriminative models for sequences
• Deep learning
Matrix data
• Gene expression
• Dimensionality reduction and feature selection
• Low-rank approximation
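As a rough illustration of low-rank approximation on matrix data, the sketch below computes a rank-1 approximation of a tiny genes-by-samples matrix with power iteration. The function name `rank_one_approximation`, the toy values, and the choice of power iteration are assumptions made for illustration; they are not taken from the course materials, and a library SVD routine would be the usual choice in practice.

```python
import math


def rank_one_approximation(matrix, n_iter=100):
    """Best rank-1 approximation of `matrix` (list of rows) via power iteration.

    Assumes the matrix is non-zero and that the uniform starting vector is not
    orthogonal to the leading right singular vector.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-length starting vector
    for _ in range(n_iter):
        # u = M v
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        # v = M^T u, renormalized to unit length
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # With v of unit length, (M v) v^T is the rank-1 approximation.
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical toy "expression" matrix: 3 genes (rows) x 2 samples (columns).
# It has rank 1 by construction, so the approximation should reproduce it.
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
approx = rank_one_approximation(expression)
```

Real expression matrices are not exactly low rank; the approximation then keeps only the dominant pattern shared across genes and samples.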
Biological Data
Network data
• Molecular network
• Random walk algorithms
• Graphical models and approximate inference
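One common random walk algorithm on networks is random walk with restart, which scores every node by its proximity to a seed node. The sketch below is a minimal version under stated assumptions: the network is undirected, unweighted, and has no isolated nodes, and the function name, the four-node toy graph, and the restart probability of 0.5 are all hypothetical choices for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that restarts at `seed`.

    `adj` maps each node to a list of its neighbors; every node is assumed
    to have at least one neighbor.
    """
    nodes = sorted(adj)
    p = {v: (1.0 if v == seed else 0.0) for v in nodes}
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to the seed.
        nxt = {v: (restart if v == seed else 0.0) for v in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        diff = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Hypothetical toy network: a 4-cycle A - B - D - C - A.
toy_network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
scores = random_walk_with_restart(toy_network, seed="A")
```

Nodes closer to the seed receive higher scores; in this toy graph the two neighbors of A tie and the opposite node D scores lowest.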
Heterogeneous data
• Dimensionality reduction
• Probabilistic models for data integration
• Network-based data integration
Machine Learning
Supervised learning
• Prediction:
  • classification: SVM, logistic regression, random forest
  • structured output: CRF, structured SVM
• Feature finding:
  • sparse learning: LASSO and elastic nets
Unsupervised learning
• Dimensionality reduction and embedding:
  • manifold learning: Isomap, LLE, t-SNE
  • component analysis: PCA, ICA
• Probabilistic modeling:
  • graphical model: HMM, Bayesian networks, RBM
  • methodology: variational inference, sampling
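To make the "sparse learning" item concrete, here is a minimal sketch of LASSO fitted by cyclic coordinate descent, minimizing (1/(2n))·||y − Xw||² + λ·||w||₁. The function names, the toy data, and the value λ = 0.5 are assumptions for illustration only, and the code assumes no feature column is all zeros. It is meant to show how the L1 penalty sets the weights of irrelevant features exactly to zero, not to replace a library implementation.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/(2n))*||y - Xw||^2 + lam*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            # Mean squared value of feature j (assumed non-zero).
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w


# Hypothetical toy data: y depends only on the first feature (y = 2 * x0).
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]
weights = lasso_coordinate_descent(X, y, lam=0.5)
```

On this toy problem the second weight comes out exactly zero, while the first is shrunk from 2 to 1.5 by the penalty; this is the feature-selection behavior the slide refers to.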
TODO after this class
Please read “Molecular Biology for Computer Scientists” by Lawrence Hunter
Examples of my research projects
Protein sequence, structure and function
ACDEEEFGHIKL----MPQRSTVWY
ACDE--FGHIKLRMQP----STVWY
[Figure labels: sequence, structure, function]
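The two gapped rows on this slide form a pairwise alignment, where "-" marks a gap. As a small sketch of how such an alignment is read column by column, the hypothetical helper below counts matching, mismatching, and gapped columns; the function name and the returned dictionary keys are assumptions for illustration.

```python
def alignment_summary(a, b, gap="-"):
    """Count matching, mismatching, and gap-containing columns of an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            gaps += 1          # at least one row has a gap in this column
        elif x == y:
            matches += 1       # identical residues
        else:
            mismatches += 1    # different residues, no gap
    return {"matches": matches, "mismatches": mismatches, "gap_columns": gaps}


# The two aligned rows shown on the slide.
seq_a = "ACDEEEFGHIKL----MPQRSTVWY"
seq_b = "ACDE--FGHIKLRMQP----STVWY"
summary = alignment_summary(seq_a, seq_b)
```

For these two rows the helper reports 15 matching columns, 0 mismatches, and 10 gap columns out of 25.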
Network analysis for disease modeling
human disease network
network analysis
new disease biology (potential drug targets)
Pharmacogenomics and cancer genomics
Figure from the DREAM challenge website
Integration of heterogeneous data
“Search” engine for drug discovery
[Figure: heterogeneous network with node types Drug, Protein, Disease, Side effect, Pathway, Cell type, and Mutation; edge labels: perturbation, association, membership, on/off, interaction]
Diffusion Component Analysis
Network embedding
Variational inference
• Discriminance sampling for partition function estimation
• Combining variational inference and sampling approaches
[Figure labels: Restricted Boltzmann Machine, Deep Boltzmann Machine, Sampling, Classification]
Approximate inference