29
LOGO iDNA-Prot|dis: Identifying DNA- Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition Name WangShanyi ID 14S051067

LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Embed Size (px)

Citation preview

Page 1: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

LOGO

iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-

Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition

Name : WangShanyi ID : 14S051067

Page 2: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

DNA-binding protein play crucial roles in various cellular process

Method: iDNA-Prot|dis incorporating the amino acid distance-pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition

Verification: Jackknife and Independent dataset tests

Web-Server: http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/.

Abstract

Page 3: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Introduction

DNA-binding proteins are essential ingredients for both eukaryotic and prokaryotic cell. They can interact with DNA, and play crucial role in various cellular processes.

Page 4: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Introduction

computational methods: 1. the structure-based method major defection: Unavailable for

huge amount

2. the sequence-based method

Page 5: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Introduction

Procedure:(1)Construct or select a valid benchmark dataset to train

and test the predictor; (2)Formulate the samples with an effective mathematical

expression that can truly reflect their intrinsic correlation with the target to be predicted;

(3)Introduce or develop a powerful algorithm (or engine) to operate the prediction;

(4)properly perform cross-validation tests to objectively evaluate its anticipated accuracy;

(5)establish a user-friendly web-server for the predictor that is accessible to the public.

Page 6: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Page 7: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

PseAAC of Distance-Pairs and Reduced Alphabet Scheme

how to effectively formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence order information ???

1 the number of biological sequences with different sequence-orders is extremely high and their lengths vary widely.

2 SVM, neural network, random forest etc can only handle vector but not length-different sequences.

Page 8: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Page 9: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Page 10: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Page 11: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Reduced Amino Acid Alphabet Scheme

high-dimension disaster !!!

(1)unnecessarily increasing the computational time;(2)misrepresentation due to information redundancy or

noise that will lead to poor prediction accuracy; (3)the overfitting problem that will make the predictor with a

serious bias and extremely low capacity for generalization.

Page 12: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Reduced Amino Acid Alphabet Scheme

Page 13: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Reduced Amino Acid Alphabet Scheme

Page 14: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Reduced Amino Acid Alphabet Scheme

Advantages:

cut down the dimension of the PseAAC vector and improve the predictive performance.

Page 15: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Support Vector Machine

SVM is based on the structural risk minimization principle from statistical learning theory.

Page 16: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Evaluation Method of Performance

How to properly examine the prediction quality is a key for developing a new predictor and estimating its potential application value.

cross-validation : independent dataset test, subsampling or K-fold (such as 5-fold, 7-fold, or 10-fold) test, and jackknife

test.

Page 17: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Materials and Method

Evaluation Method of Performance

in literature a set of four metrics called the sensitivity (Sn),specificity (Sp), accuracy (Acc), and Mathew’s correlation coefficient (MCC)

Page 18: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Impact of the Pairwise Distance on the iDNA-Prot|dis

pairwise distance d=???

Page 19: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Discriminant Visualization and Interpretation

Page 20: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Discriminant Visualization and Interpretation

The discriminative power of all the 400 distance amino acid pairs in iDNA-Prot|dis is depicted in Fig. 3A.

Top three most discriminative amino acid pairs are R-R, K-R, and R-K according to the three darkest spots.

Page 21: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Discriminant Visualization and Interpretation

A specific DNA-binding protein 1HLV chain A was selected to investigate if the most discriminative distance amino acid pairs R-R reflect the characteristics of this DNA-binding protein.

For d=3,including RR, R*R, and R**R with distance 1, 2,

3, respectively.

Page 22: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Discriminant Visualization and Interpretation

Page 23: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Reduced Amino Acid Alphabet Scheme

A reduced alphabet is any clustering of amino acids based on some measure of their relative similarity

Question: Which is best ???

Page 24: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Comparison with Other Related Methods

Page 25: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Comparison with Other Related Methods

provide a graphic illustration to show the performances of the four predictors, the corresponding ROC (receiver operating characteristic) curves

Page 26: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Independent Test we also extended the comparison with other methods via

an independent dataset test

Page 27: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Results and Discussion

Web-Server GuideStep 1. Open the web-server by clicking the link at http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/ and you will see

its top page. Click on the Read Me button to see a brief introduction about the server.

Step 2. Check the open circle to select which alphabet profile you are to use for conduct prediction.

Step 3. Either type or copy and paste the query proteinsequence into the input box at the center, or you can also

upload your input data by the Browse button. The input sequence should be in the FASTA format.

Step 4. Click on the Submit button to see the predicted result.

Page 28: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

Conclusions

It is anticipated that the iDNA-Prot|dis predictor will become a high throughput tool for both basic research and drug development.

Page 29: LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino

THANKS