Final Report (30% final score) Bin Liu, PhD, Associate Professor

Contents
There are two parts: project + report.
Project (remote homology detection)
Report
  Review the methods for remote homology detection. Point out their advantages and disadvantages.
  How did you do the experiments? Give information for each step.
  What are your results?
  What are the advantages, disadvantages, and novelty of your methods?

Protein Remote Homology Detection
Background
Problem definition: a classification problem.
Figure: the schematic plot of the hierarchy of the SCOP database. Sequence similarity runs from high to low.

Overview
Dataset: pairwise/pairwise/
54 families and 4352 proteins.
For more information about the dataset, refer to: Li Liao and William Stafford Noble. "Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships." Journal of Computational Biology, 10(6), 2003.

Data set
Tab-delimited table: 0 = not present; 1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test.

Feature extraction
Extract features from the protein sequences, which can be found in the Sequence file in the supplementary material.
Use your imagination to extract features that capture the characteristics of the protein sequences.

Dataset construction
The training sets and test sets can be constructed from the supplementary files Tab-delimited table and Sequence file.
There are 54 datasets in total.

Classifiers
You are free to choose any classifier, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Random Forest (RF), etc.

Performance measure
ROC score (AUC)
The average ROC score over all 54 families should be given.

Scoring function for the project and report
Novelty and completeness: new features, new machine learning models, etc. Write down what makes your method different from others in this field. Does your method work? (40%)
Mid results and source code (20%)
Results (based on average ROC score) (10%)
Report (30%)

Important information
This is individual work, not team work, so do it alone, but you are free to discuss with others.
Due date: 30th April, 2015 (1 month later). All the data should be stored in one ZIP or RAR file and sent to the TA via email or QQ.
The title of the email and your data: your name + student ID. (If your data is too large, contact the TA directly.)
The slides of your presentation should be attached too.

Other topics you can choose
DNA-binding protein identification. Dataset is available at Prot_dis/data.jsp
Fold recognition
Enhancer prediction

Problem description
DNA-binding proteins are very important components of both eukaryotic and prokaryotic proteomes. At least 2% of prokaryotic and 3% of eukaryotic proteins are estimated to bind DNA, and these proteins are important for various cellular processes.
Therefore, developing an efficient model for distinguishing DNA-binding proteins from non-DNA-binding proteins is an urgent research problem. Although many efforts have been made in this regard, further work is needed to improve prediction power.

Dataset description
There are two datasets in this project, a benchmark dataset and an independent dataset, which are available at the course website: Prot_dis/data.jsp
For more information, see the following paper:

Task and evaluation
Task: Identify DNA-binding proteins among non-DNA-binding proteins.
Evaluation scheme:
1. Use validation techniques to optimize the parameters of your methods (if any), and obtain the results on the benchmark dataset.
2. Train your classifiers on the benchmark dataset, and predict the proteins in the independent dataset.
3. Analyze the features, and find some interesting patterns.
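Step 1 of the evaluation scheme (using validation to optimize parameters on the benchmark dataset) could be sketched as below. This is only a minimal illustration using plain k-fold cross-validation, which is one possible validation technique; the slides do not prescribe one. The names `k_fold_indices`, `select_parameter`, and the callback `score_fn` are hypothetical and not part of the course material.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(valid)


def select_parameter(candidates, score_fn, n_samples, k=5, seed=0):
    """Return (best_param, best_mean_score) over the candidate values.

    score_fn(param, train_idx, valid_idx) is a hypothetical callback that
    trains on train_idx and returns a validation score (higher is better).
    """
    best_param, best_score = None, float("-inf")
    for param in candidates:
        scores = [score_fn(param, tr, va)
                  for tr, va in k_fold_indices(n_samples, k, seed)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_param, best_score = param, mean_score
    return best_param, best_score
```

The independent dataset is never touched here: the parameter is chosen on the benchmark dataset only, and the final classifier is then retrained on the whole benchmark dataset before predicting the independent proteins (step 2).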
Task and evaluation
TP is the number of positive samples that are classified correctly.
FP is the number of negative samples that are classified as positive.
TN is the number of negative samples that are classified correctly.
FN is the number of positive samples that are classified as negative.

Students from other majors
If you are not in the CS department, please select one computational task in the field of bioinformatics.
Write a review of the state-of-the-art predictors for this task. Discuss their advantages and disadvantages.
Discuss the relationship between bioinformatics and your major. Can you apply ideas from bioinformatics to your own project?
At least 4000 words.

Data Driven Machine Learning Approaches for Bioinformatics
Case study: protein remote homology detection

Outline
Overview
Feature extraction
  Sequence-based features
  Profile-based features
  Other features
Classifiers
Feature analysis

Data Driven Machine Learning Approaches for Bioinformatics
Key idea: learn from known data and generalize to unseen data.
Input: sequence features
Output: function category
Classifier: maps input to output
Training: build a classifier
Test: test the model
Figure: protein function data is split into training data and test data; the trained classifier is then used for prediction on new data.

Several important components in this model
Feature extraction. Given a protein, how can features be extracted based only on the primary sequence?
Brainstorming?

A case study: remote homology detection and protein-protein interaction
Features derived from the primary sequence only:
N-grams (all possible subsequences of amino acids of a fixed length N). Leslie et al
SVM-npeptide (reduced amino acid alphabets). Ogul et al
Mismatch kernel and Pattern (TEIRESIAS algorithm). Leslie CS et al and Dong et al 2005

Feature extraction
Distance-based approach. Lingner et al 2006
Word correlation matrices. Lingner et al 2008
SVM-pairwise: the feature vector is a list of pairwise sequence similarity scores. Liao et al. 2002

Profile-based features
Profiles
Figure: a profile matrix whose columns are the 20 amino acids (ACDEFGHIKLMNPQRSTVWY) and whose rows are the positions of an example sequence (I V E G Q D A E V G L S P W).
Brainstorming. How can the profile features be used?
Binary profile. Dong et al. 2007
N-profile. Liu et al. 2008
Order profile. Liu et al. 2009
Top-n-grams. Liu et al. 2008
ACC (AC, ACC). Dong et al

Other features (AAindex-based features)
Physicochemical Distance Transformation (PDT). Liu et al. 2012
LSA (latent semantic analysis). Dong et al. 2006

Classifiers
SVM
Kernel combination methodology VBKC. Damoulas et al. 2008

Summary
To establish a really useful statistical predictor for a biological system:
(i) Benchmark dataset;
(ii) Feature extraction;
(iii) Machine learning algorithm;
(iv) Web server or stand-alone tools
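As a rough illustration of steps (i) and (ii) of the summary and of the ROC score used as the performance measure, the sketch below splits one family's proteins using the 0 to 4 label codes of the tab-delimited table, computes simple N-gram frequency features from a primary sequence, and evaluates AUC. It is a minimal sketch under stated assumptions, not the method of any cited paper: the function names `split_by_label`, `kmer_features`, and `roc_auc` are hypothetical, the input is assumed to be already parsed into Python lists, and the N-gram features are plain normalized counts.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_by_label(names, codes):
    """Split proteins of one family into train and test lists.

    Codes follow the table: 1 = positive train, 2 = negative train,
    3 = positive test, 4 = negative test; 0 (not present) is skipped.
    Each returned item is (name, label) with label 1 = positive, 0 = negative.
    """
    train, test = [], []
    for name, code in zip(names, codes):
        if code in (1, 2):
            train.append((name, 1 if code == 1 else 0))
        elif code in (3, 4):
            test.append((name, 1 if code == 3 else 0))
    return train, test


def kmer_features(sequence, k=2):
    """Return normalized N-gram (k-mer) frequencies in a fixed order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in counts:  # skip words containing non-standard residues
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With a classifier of your choice producing one score per test protein, `roc_auc` would be computed once per family and the results averaged over the 54 families. The pairwise AUC computation is quadratic in the number of test samples, which is acceptable for a sketch but slower than a rank-based implementation.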
