sMOL Explorer 1.1
sMOL Explorer*: User’s Guide
Copyright © 2007
* Supawadee Ingsriswang; Eakasit Pacharawongsakda (2007), "sMOL Explorer: an open source, web-enabled database and exploration tool for Small MOLecules datasets", Bioinformatics, Vol 23(18), September, pp. 2498-2500 http://bioinformatics.oxfordjournals.org/cgi/reprint/23/18/2498
© 2007 ISL/BIOTEC. All Rights Reserved.
Information Systems Laboratory (ISL), National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand
Table of Contents
1. Getting Started with sMOL Explorer
    1.1. User Registration
    1.2. Menus
2. Structure Data Management
    2.1. Direct Entry
    2.2. Batch Upload
    2.3. Data Workspace
3. Structural Similarity and Text Search
    3.1. Structure Search
    3.2. Text Search
4. Clustering Analysis
    4.1. Loading or Selecting Data
    4.2. Selecting a Clustering Method
    4.3. The Clustering Output
5. Finding Frequent Substructure
    5.1. Loading or Selecting Data
    5.2. Specifying the Minimum Support Threshold
    5.3. The List of Frequent Substructures
6. Feature Selection
    6.1. Loading or Selecting Data
    6.2. Selecting a Feature Selection Method
    6.3. The Output
7. Classification
    7.1. Loading or Selecting Data
    7.2. Testing Options
    7.3. Selecting a Classifier
    7.4. The Classification Output
8. Utilities
    8.1. Data Preparation
    8.2. File Conversion
    8.3. Computing Molecular Descriptors
    8.4. Administrator Tasks
1. Getting started with sMOL Explorer
1.1 User Registration
Click Sign Up on the Login page to register a user account for sMOL Explorer. On the Sign-Up page, follow these two steps (Figure 1-1):
• Enter the required information: username, password, first name, last name, email address and telephone number.
• Click the Submit button to send your request to the system administrator. After the administrator grants permission, you can sign in to sMOL Explorer.
Note: The default username is administrator and the default password is 1q2w3e4r.
Figure 1-1 Sign-Up page
1.2 Menus
Once you sign in, you can begin using sMOL Explorer. The menu bar always appears at the top of the screen, as shown in Figure 1-2. It contains, from left to right, the Structure Registration Menu, the Search Menu, the Data Analysis Menu, the Data Workspace Menu, the Utility Menu and Logout. To navigate the menus, move the mouse pointer over a menu title, then left-click (or just click with a single-button mouse) the item you want.
Figure 1-2 Menu bar
Figure 1-3 Structure Registration Menu
The Structure Registration Menu is in the top-left corner of the screen and, when clicked, shows a menu containing three items: (Figure 1-3)
• Direct Entry: Register molecules one at a time into the database
• Batch upload: Prepare data for multiple molecules in a data file and upload it into the database
• Edit/Delete Compound: Edit or remove molecules
For more information, see details on how to manage the structure database with sMOL Explorer in Section 2.
Figure 1-4 Search Menu
The Search Menu consists of two items: (Figure 1-4)
• Structure Search: Find all the compounds in the database that have the given structure or substructure. sMOL Explorer supports three basic categories of structure search: exact structure, substructure and structural similarity.
• Text Search: Find the compounds that have information relevant to the query text.
See details in Section 3.
Figure 1-5 Data Analysis Menu
The Data Analysis Menu lists the tools for exploring data within sMOL Explorer. Click on the following items when you want to: (Figure 1-5)
• Clustering: Cluster the selected molecules based on molecular fingerprints.
• Molecular Substructure Miner: Find the list of frequent substructures that occur in molecules above the minimum support in the dataset.
• Feature Selection: Remove irrelevant features from the dataset before attempting to train a classifier.
• Classification: Train and test a model for classifying the compounds.
Detailed information on how to analyze or explore the structure data with sMOL Explorer is in Sections 4 to 7.
Figure 1-6 Data Workspace menu
The Data Workspace Menu consists of three workspace categories relating to three common operations: (Figure 1-6)
• Upload Workspace: Manage the previously uploaded data in the Data Workspace.
• Search Workspace: Work with the search results saved in the Data Workspace.
• Analysis Workspace: Keep the saved analysis results in the Data Workspace.
To use the Data Workspace efficiently, go to Section 2.
Figure 1-7 Utility menu
The Utility Menu gathers utilities for data preparation, file format conversion, computing molecular descriptors of a structure, and administrator tasks. It has four items: (Figure 1-7)
• Data Preparation: Prepare data in the sMOL-defined tab-delimited format. This file will be used as input for data analysis.
• File Conversion: Convert chemical data file formats.
• Calculate Descriptor: Compute molecular descriptors of the query structure.
• Administrator Tasks: Manage user accounts, set up the system configuration and update the URLs to external databases.
See details in Section 8.
Logout: Click Logout when you want to close the program and log out of the system.
2. Structure Data Management
2.1 Direct Entry
In direct entry mode, users can add the structure of a small molecule to the database via the web in one of the following ways:
• Interactively draw the 2D structure of the molecule or paste a SMILES string in the JChemPaint editor.
• Upload a Mol file directly into the database.
After submitting the structure data, the user can enter the associated screening data.
2.2 Batch Upload
In batch mode, users can prepare structure and screening data in either an SDF file or an sMOL-defined XML file and upload it into the database.

SDF File
The SDF file format is defined by MDL (Molecular Design Ltd). An SDF file can contain multiple compounds together with properties and references. The SDF file for sMOL Explorer must contain the following SD fields.

Required SD Fields
- Datasource
  The name of the data source.
- CompoundID
  The number of compounds contained in this file.
- SMILES
  The SMILES string used to represent the chemical structure of the compound being registered. It is ignored if a chemical structure with atoms is also provided in the SD file format.
- CompoundName
  The name of the compound.
- Category
  The category of the compound. It can be:
  (1) Natural Product, or
  (2) Commercially Available, or
  (3) Semi-synthesis
- CompoundType
  The type of compound. It can be:
  (1) Terpenes / Steroids, or
  (2) Alkaloids, or
  (3) Polyketides, or
  (4) Fatty acids, or
  (5) Unknown
- Available
  This field determines which users are permitted to view this compound. It should be “Only registed user” or “Everyone”.

Allowed SD Fields
- CASNumber
  Chemical Abstracts Service identification number (if the molecule is in CAS).
- PubChemID
  NCBI's PubChem database identification number (if the molecule is in PubChem).
- KEGGID
  Kyoto Encyclopedia of Genes and Genomes compound identification number (if the molecule is in KEGG).
- IUPACName
  IUPAC or standard chemical name for the compound.
- MeltingPoint
  Melting point (if solid) or boiling point (if liquid) in degrees Celsius.
- SpecimenVoucher
  A specimen voucher is the remainder from which this compound has been isolated. Currently sMOL Explorer supports only one specimen voucher per compound.
- Type
  The type of organism. It should be “Microbe” or “Plant”.
- Phylum
  The systematic name that represents the biological Phylum of this organism.
- Order
  The systematic name that represents the biological Order of this organism.
- Family
  The systematic name that represents the biological Family of this organism.
- Genus
  The systematic name that represents the biological Genus of this organism.
- Species
  The systematic name that represents the biological Species of this organism.
- NumberOfActivities
  The number of biological activities of this compound.
- ActivityName
  The name of the biological activity.
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example, ActivityName1, ActivityName2.
- ActivityMeasure
  The toxicity and cell viability assessment. It can be:
  - IC50: for the concentration likely to cause a 50% reduction in light output from the population.
  - EC50: for the effective concentration that inhibits growth in 50% of the tested population.
  - MIC: for determining the minimal inhibitory concentration.
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example, ActivityMeasure1, ActivityMeasure2.
- ActivityValue
  The value of the bioassay test.
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example, ActivityValue1, ActivityValue2.
- ActivityConfidence
  The confidence level of the biological activity. It can be one of the following:
  (1) Weakly Active
  (2) Moderately Active
  (3) Strongly Active
  (4) Unknown
  Note: If NumberOfActivities is greater than one, append an order number to the field name, for example, ActivityConfidence1, ActivityConfidence2.
- Application
  The utilization of this organism.
- NumberOfReferences
  The number of publications related to this compound.
- ReferenceTitle
  The title of the publication.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example, ReferenceTitle1, ReferenceTitle2.
- ReferenceAuthor
  All authors of the publication. If the publication has several authors, separate them with ; (semicolon).
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example, ReferenceAuthor1, ReferenceAuthor2.
- ReferenceYear
  The year of publication.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example, ReferenceYear1, ReferenceYear2.
- ReferenceJournal
  The name of the journal in which the publication appeared.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example, ReferenceJournal1, ReferenceJournal2.
- ReferenceVolume
  The volume of the journal.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example, ReferenceVolume1, ReferenceVolume2.
- ReferencePage
  The pages of the publication in the journal. Use – (dash) to separate the start page and the end page.
  Note: If NumberOfReferences is greater than one, append an order number to the field name, for example, ReferencePage1, ReferencePage2.
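The numbering rule for repeated fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names numbered_fields and sd_data_items are hypothetical, the activity values are made up, and the sketch assumes that a single value keeps the plain field name. The "> <FieldName>" layout followed by the value and a blank line is the usual SD file data item form.

```python
def numbered_fields(base, values):
    """Return (field_name, value) pairs, numbering the names when there are several values.

    Assumption: with exactly one value the plain field name is used.
    """
    if len(values) == 1:
        return [(base, values[0])]
    return [(f"{base}{i}", v) for i, v in enumerate(values, start=1)]


def sd_data_items(fields):
    """Format (name, value) pairs as SD data items: '> <Name>', the value, a blank line."""
    lines = []
    for name, value in fields:
        lines.append(f"> <{name}>")
        lines.append(str(value))
        lines.append("")
    return "\n".join(lines)


# Hypothetical activity data for one compound (illustrative values only).
activity_names = ["Antimalarial", "Cytotoxicity"]
activity_measures = ["IC50", "IC50"]

fields = [("NumberOfActivities", len(activity_names))]
fields += numbered_fields("ActivityName", activity_names)
fields += numbered_fields("ActivityMeasure", activity_measures)

print(sd_data_items(fields))
```

The same helper applies to the Reference fields, with NumberOfReferences giving the count.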
sMOL-defined XML
The sMOL-defined XML format is the alternative to the SDF format for batch upload. Its DTD is shown below.
<!DOCTYPE DataSet [
<!ELEMENT DataSet ( Compound+ ) >
<!ELEMENT Compound ( Structure, Characteristic, Bioresource, Activities, Application, References ) >
<!ATTLIST Compound number NMTOKEN #REQUIRED >

<!ELEMENT Structure ( MOL?, SMILES? ) >
<!ELEMENT MOL ( #PCDATA ) >
<!ELEMENT SMILES ( #PCDATA ) >

<!ELEMENT Characteristic ( Datasource, CASNumber?, PubChemID?, KEGGID?, CompoundName, IUPACName?, MeltingPoint?, Category, CompoundType, Available ) >
<!ELEMENT Datasource ( #PCDATA ) >
<!ELEMENT CASNumber ( #PCDATA ) >
<!ELEMENT PubChemID ( #PCDATA ) >
<!ELEMENT KEGGID ( #PCDATA ) >
<!ELEMENT CompoundName ( #PCDATA ) >
<!ELEMENT IUPACName ( #PCDATA ) >
<!ELEMENT MeltingPoint ( #PCDATA ) >
<!ELEMENT Category ( #PCDATA ) >
<!ELEMENT CompoundType ( #PCDATA ) >
<!ELEMENT Available ( #PCDATA ) >

<!ELEMENT Bioresource ( SpecimenVoucher?, Type?, Phylum?, Order?, Family?, Genus?, Species ) >
<!ELEMENT SpecimenVoucher ( #PCDATA ) >
<!ELEMENT Type ( #PCDATA ) >
<!ELEMENT Phylum ( #PCDATA ) >
<!ELEMENT Order ( #PCDATA ) >
<!ELEMENT Family ( #PCDATA ) >
<!ELEMENT Genus ( #PCDATA ) >
<!ELEMENT Species ( #PCDATA ) >

<!ELEMENT Activities ( Activity+ ) >
<!ELEMENT Activity ( Name, Measure, Value, Confidence ) >
<!ATTLIST Activity number NMTOKEN #REQUIRED >
<!ELEMENT Name ( #PCDATA ) >
<!ELEMENT Measure ( #PCDATA ) >
<!ELEMENT Value ( #PCDATA ) >
<!ELEMENT Confidence ( #PCDATA ) >

<!ELEMENT Application ( #PCDATA ) >

<!ELEMENT References ( Reference+ ) >
<!ELEMENT Reference ( Title, Authors, Year, Journal, Volume?, Page? ) >
<!ATTLIST Reference no NMTOKEN #REQUIRED >
<!ELEMENT Title ( #PCDATA ) >
<!ELEMENT Authors ( Author+ ) >
<!ELEMENT Author ( FirstName, MiddleName, LastName ) >
<!ATTLIST Author no NMTOKEN #REQUIRED >
<!ELEMENT FirstName ( #PCDATA ) >
<!ELEMENT MiddleName ( #PCDATA ) >
<!ELEMENT LastName ( #PCDATA ) >
<!ELEMENT Year ( #PCDATA ) >
<!ELEMENT Journal ( #PCDATA ) >
<!ELEMENT Volume ( #PCDATA ) >
<!ELEMENT Page ( Start, End ) >
<!ELEMENT Start ( #PCDATA ) >
<!ELEMENT End ( #PCDATA ) >
]>
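As an illustration of the element order the DTD requires, the following Python sketch builds a minimal one-compound document with the standard library module xml.etree.ElementTree. All values (compound, source, reference, author) are placeholders invented for the example, and text_child is a hypothetical helper. Note that ElementTree does not validate against a DTD, so the sketch only follows the declared order by construction.

```python
import xml.etree.ElementTree as ET


def text_child(parent, tag, text):
    """Append a child element that holds only character data."""
    child = ET.SubElement(parent, tag)
    child.text = text
    return child


dataset = ET.Element("DataSet")
compound = ET.SubElement(dataset, "Compound", number="1")

# Structure: MOL and SMILES are both optional; only SMILES is given here.
structure = ET.SubElement(compound, "Structure")
text_child(structure, "SMILES", "CCO")  # ethanol, used only as a placeholder

# Characteristic: required children in DTD order, optional ones omitted.
characteristic = ET.SubElement(compound, "Characteristic")
for tag, value in [
    ("Datasource", "ExampleSource"),
    ("CompoundName", "Ethanol"),
    ("Category", "Commercially Available"),
    ("CompoundType", "Unknown"),
    ("Available", "Everyone"),
]:
    text_child(characteristic, tag, value)

# Bioresource: only Species is required.
bioresource = ET.SubElement(compound, "Bioresource")
text_child(bioresource, "Species", "unknown")

activities = ET.SubElement(compound, "Activities")
activity = ET.SubElement(activities, "Activity", number="1")
for tag, value in [("Name", "Example activity"), ("Measure", "IC50"),
                   ("Value", "10"), ("Confidence", "Unknown")]:
    text_child(activity, tag, value)

text_child(compound, "Application", "none")

references = ET.SubElement(compound, "References")
reference = ET.SubElement(references, "Reference", no="1")
text_child(reference, "Title", "Example title")
authors = ET.SubElement(reference, "Authors")
author = ET.SubElement(authors, "Author", no="1")
for tag, value in [("FirstName", "A"), ("MiddleName", ""), ("LastName", "Author")]:
    text_child(author, tag, value)
text_child(reference, "Year", "2007")
text_child(reference, "Journal", "Example Journal")

xml_text = ET.tostring(dataset, encoding="unicode")
print(xml_text)
```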
2.3 Data Workspace
The Data Workspace is an environment for handling the data used by each user. It contains three types of working space, corresponding to three common operations in sMOL Explorer.
• Click Upload Workspace to manage the previously uploaded data in the Upload Workspace.
• Click Search Workspace to work with the saved search results in the Search Workspace.
• Click Analysis Workspace to process the saved analysis results in the Analysis Workspace.
Upload Workspace:
Once a dataset has been uploaded into sMOL Explorer for analysis, it is stored in the Upload Workspace.
Figure 2-1 Upload Workspace page.
In the Upload Workspace page, you can perform the following tasks (Figure 2-1):
• Enter the dataset name, click the browse button to select a data file, and then click the upload button to upload a new dataset.
• Click Delete or Download in the row of the dataset you want to remove or download from the database, respectively.
• Click the Clear All link to delete all datasets.
• Click the dataset name to select and view the data (Figure 2-2). At the bottom of the data page, you can
    o Click the Export link to export the dataset.
    o Click the Analyze link to transfer the dataset to the analysis main page.
    o Click Clear All to delete the dataset.
Figure 2-2 Compounds in each data set.
Search Workspace:
Each time you search the database in sMOL Explorer, you can select molecules from the search result to add to the Search Workspace. In the Search Workspace page, you can (Figure 2-3)
• Set the number of records to display on the page.
• Click Delete in the row of the molecule you want to remove from the Search Workspace.
• Click the generic name in the row of the molecule whose data you want to view.
• At the bottom of the Search Workspace page, you can
    o Click the Export link to export all the data from the Search Workspace.
    o Click the Analyze link to transfer all the data from the Search Workspace to the analysis main page.
    o Click Clear All to delete all the data in the Search Workspace.
Figure 2-3 Search workspace page.
Analysis Workspace:
The Analysis Workspace keeps the saved results from data analysis for each user. When a data analysis finishes and displays its result, the user can click the Save to WorkSpace button to save the result into the Analysis Workspace. The Analysis Workspace page lists the saved analysis results by type of analysis. In the Analysis Workspace page, you can perform the following tasks. (Figure 2-4)
• Click on Delete at the row corresponding to the saved result you want to remove from the workspace.
• Click on the dataset name at the row corresponding to the saved result you want to select and view the result.
Figure 2-4 Analysis workspace page.
3. Structural Similarity and Text Search
3.1 Structure Search
sMOL Explorer supports structure search in three basic categories:
• Exact Search
• Substructure Search
• Similarity Search
To use structure search in sMOL Explorer, a chemist can paste a molecular structure into the JChemPaint interface or upload a data file, then select the database and search type to find similar compounds in the database. sMOL Explorer also allows users to search molecules against publicly accessible databases, including PubChem, KEGG, DrugBank and eMolecules, via the Internet. For a similarity search, users must specify the similarity measure, such as Tanimoto, Cosine or Simpson, and the similarity threshold. (Figure 3-1)
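To make the three similarity measures concrete, here is a small Python sketch that computes them on binary fingerprints represented as sets of "on" bit positions. These are the textbook definitions of the Tanimoto, cosine and Simpson (overlap) coefficients; whether sMOL Explorer computes them in exactly this way is an assumption, and the fingerprints and threshold are made up.

```python
import math


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def cosine(a, b):
    """Shared bits divided by the geometric mean of the two bit counts."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def simpson(a, b):
    """Shared bits divided by the smaller of the two bit counts."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical fingerprints: positions of the bits that are set.
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12, 15}

threshold = 0.5  # illustrative similarity threshold
for name, measure in [("Tanimoto", tanimoto), ("Cosine", cosine), ("Simpson", simpson)]:
    score = measure(query, candidate)
    print(f"{name}: {score:.3f} hit={score >= threshold}")
```

The same pair of molecules can pass the threshold under one measure and fail under another, which is why the measure and the threshold are chosen together.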
3.2 Text Search
In text search, sMOL Explorer allows users to specify search terms to find the compounds that have information relevant to the query text, as shown in Figure 3-2.
Figure 3-1 Structure search page.
Figure 3-2 Text search page.
4. Clustering Analysis
4.1 Loading or Selecting Data
Users can directly upload a new dataset or select a dataset from the workspace for clustering analysis. You can perform the following tasks.
• Click New Data Set, then enter the filename or click the browse button to select a data file to upload. Then enter a name for the data set. (Figure 4-1)
• Otherwise, click Data Workspace. You can select one of the following workspace types:
    o Upload Workspace for the previously uploaded datasets
    o Search Workspace for the saved search results
    o Analysis Workspace for the previously saved results from data analysis
  When a workspace is selected, you can choose a dataset to restore from the list of datasets previously uploaded or saved in that workspace. (Figure 4-2)
• Type a name for the clustering result in the Result Name textbox.
Figure 4-1 Clustering page with new data set as input.
Figure 4-2 Clustering page with upload data set as input.
4.2 Selecting a Clustering Method
Presently, the clustering methods in sMOL Explorer fall into two types: partitioning and hierarchical algorithms. The hierarchical methods are Agglomerative Nesting Clustering (R: AGNES) and Hierarchical Clustering (R: HClust) from R packages, while the partitioning methods are K-Centroids Cluster Analysis from R packages and Minimum Entropy Clustering (Figure 4-3).
Figure 4-3 Clustering algorithm.
• Partitioning methods
    o K-Centroids
        - Number of Clusters: Specify the initial number of clusters
        - Family: Select a clustering method such as K-Means, K-Medians, Angle, Expectation-based Jaccard
    o Minimum Entropy clustering
        - Number of Clusters: Specify the initial number of clusters
        - Alpha
        - Kernel: Hypercube / Gaussian
        - Bandwidth
        - Number of K-Means Iterations
• Agglomerative Nesting and Hierarchical Clustering
    o Similarity Metric: You can choose one of the following methods for measuring the similarity between a pair of samples in the dataset.
        - Euclidean
        - Maximum
        - Binary
        - Canberra
        - Manhattan
        - Minkowski
    o Clustering Methods: To combine or separate two clusters of data, you need to measure the distance between groups or clusters. Based on the different inter-group distance measures, the following clustering methods are available.
        - Group Average Method
        - Single Linkage Method
        - Complete Linkage Method
        - Ward’s Method
        - Weighted Average Method
The other three methods: Mcquitty, Median and Centroid are available only for Hierarchical Clustering.
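The difference between a sample-to-sample metric and an inter-group (linkage) distance can be sketched as follows. This plain-Python example is not the R implementation used by sMOL Explorer; it only shows the textbook Euclidean, Manhattan and Maximum metrics and the single, complete and group average linkage rules on two made-up clusters. The function names are hypothetical.

```python
import math
from itertools import product


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def linkage_distance(cluster_a, cluster_b, method, dist=euclidean):
    """Distance between two clusters under a given linkage rule."""
    pair_distances = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of samples
        return min(pair_distances)
    if method == "complete":    # farthest pair of samples
        return max(pair_distances)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


# Two made-up clusters of 2D samples.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

for method in ("single", "complete", "average"):
    print(method, round(linkage_distance(cluster_a, cluster_b, method), 3))
```

An agglomerative method repeatedly merges the two clusters with the smallest linkage distance, so the choice of rule changes which clusters are joined first.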
4.3 The Clustering Output
The outputs are available for online inspection, download, and your own analysis. (Figure 4-4 and Figure 4-5)
• You can download the solutions and visualizations in PDF format.
• To save the clustering result, click the Save to WorkSpace button.
Figure 4-4 Clustering result of K-Centroids Cluster Analysis algorithm.
Figure 4-5 Clustering result of Agglomerative Nesting Clustering.
5. Finding Frequent Substructure
5.1 Loading or Selecting Data
Users can directly upload a new dataset or select a dataset from the workspace. (Figure 5-1)
Figure 5-1 Molecular Substructure Miner.
In the page shown in Figure 5-1, you can perform the following tasks.
• Click New Data Set, then enter the filename or click the browse button to select a data file to upload.
• Otherwise, click Data Workspace. You can select one of the following workspace types:
    o Upload Workspace contains the previously uploaded datasets
    o Search Workspace keeps the current search result
    o Analysis Workspace includes the previously saved results from data analysis
  When a workspace is selected, you can choose a dataset to restore from the list of datasets previously uploaded or saved in that workspace.
• Type a name for the frequent substructures result in the Result Name textbox.
5.2 Specifying the Minimum Support Threshold
The minimum support is the frequency of small molecules containing the same substructure. Users must specify a minimum support threshold and then click the button provided to find the frequent substructures in the dataset.
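The idea of a support threshold can be illustrated with a toy Python sketch. It is not the substructure miner used by sMOL Explorer: each molecule is simply represented as a set of hypothetical fragment labels, support is taken as the fraction of molecules containing a fragment, and a fragment is kept when its support is at least the threshold. Whether the tool expects a fraction or a count, and whether the comparison is strict, are assumptions.

```python
from collections import Counter


def frequent_fragments(molecules, min_support):
    """Return {fragment: support} for fragments found in at least min_support of the molecules."""
    counts = Counter()
    for fragments in molecules:
        counts.update(set(fragments))  # count each fragment once per molecule
    n = len(molecules)
    return {frag: count / n for frag, count in counts.items() if count / n >= min_support}


# Made-up dataset: each molecule is described by the fragments it contains.
molecules = [
    {"benzene", "hydroxyl"},
    {"benzene", "carboxyl"},
    {"benzene", "hydroxyl", "amine"},
    {"carboxyl"},
]

result = frequent_fragments(molecules, min_support=0.5)
for fragment, support in sorted(result.items()):
    print(fragment, support)
```

Raising the threshold shortens the list: with a minimum support of 0.75 only the benzene fragment would remain in this toy dataset.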
5.3 The List of Frequent Substructures
The output from this analysis is a list of the frequent substructures that occur in molecules above the specified minimum support in the dataset. Users can export the result into a file (XML or tab-delimited format) by clicking the export button, or click the Save to WorkSpace button to save it into the Analysis Workspace. (Figure 5-2) You can also view previously saved results by clicking a result name in the Previous Saved Results box at the right corner of the screen (Figure 5-1).
Figure 5-2 List of frequent substructure.
6. Feature Selection
6.1 Loading or Selecting Data
Users can directly upload a new dataset or select a dataset from the workspace for identifying the important attributes. In the page shown in Figure 6-1, you can perform the following tasks.
• Click New Data Set, then enter the filename or click the browse button to select a data file to upload.
• Otherwise, click Data Workspace. You can select one of the following workspace types:
    o Upload Workspace contains the previously uploaded datasets
    o Search Workspace keeps the current search result
    o Analysis Workspace includes the previously saved results from data analysis
  When a workspace is selected, you can choose a dataset to restore from the list of datasets previously uploaded or saved in that workspace.
• Type a name for the feature selection result in the Result Name textbox.
Figure 6-1 Feature Selection.
6.2 Selecting a Feature Selection Method
sMOL Explorer provides two feature selection techniques: Variable Selection from Random Forest and Regression Subset Selection. Users can select an algorithm, specify its parameters and click the Run button to start the feature selection process.
• Variable Selection from Random Forest (R: varSelRF)
  Random forest, a classification algorithm developed by Breiman, is an ensemble of individual tree predictors. Each unpruned classification tree is built using a bootstrap sample of the data, and each node is split using the best split among a random sample of the variables. To classify new data, the predicted values from the trees are combined into a vote on class identity. During the bootstrap iterations, the OOB (out-of-bag) prediction, which predicts the data not in the bootstrap sample, lets random forests estimate the error rate and return several measures of variable importance, which can be used to perform variable or feature selection. The randomForest and varSelRF packages implemented in R are integrated in sMOL Explorer for comparing the importance of the features in classification. sMOL Explorer allows users to tune three parameters of varSelRF as follows.
    o mtryFactor: Enter the multiplication factor of sqrt(number of variables) that gives the number of variables to use for the mtry argument of randomForest.
    o ntree: Enter the number of trees to be generated for the first forest.
    o ntreeIterat: Enter the number of trees to use (ntree of randomForest) for all additional forests.
• Regression Subset Selection (R: regsubsets)
  sMOL Explorer includes the leaps package implemented in R for regression subset selection. It searches for the best subsets of the variables in x for predicting y in linear regression. There are two parameters:
    o Search Method: Choose exhaustive search, forward selection, backward selection or sequential replacement.
    o nvmax: Specify the maximum size of the feature subsets to examine.
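As a worked example of the mtryFactor parameter, the following Python sketch computes the number of variables tried at each split as mtryFactor times the square root of the number of variables. The function name is hypothetical, and rounding down with a minimum of one variable is an assumption about how the value is turned into an integer.

```python
import math


def mtry_from_factor(n_variables, mtry_factor=1.0):
    """Number of variables tried at each split: mtry_factor * sqrt(n_variables).

    Assumption: the product is rounded down and never falls below 1.
    """
    return max(1, math.floor(math.sqrt(n_variables) * mtry_factor))


# With 100 descriptors, a factor of 1 tries 10 variables per split.
for factor in (0.5, 1.0, 2.0):
    print(factor, mtry_from_factor(100, factor))
```

A larger factor lets each tree consider more descriptors per split, at the cost of more similar trees and longer run times.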
6.3 The Output of Feature Selection
The output from this analysis contains two parts: (Figure 6-2)
• The list of feature subsets that were selected to evaluate their predictive ability
• The final set of selected features from the dataset
Figure 6-2 Feature Selection result.
As with the other analyses, users can export the result into a file (XML or tab-delimited format) by clicking the export button, or click the Save to WorkSpace button to save it into the Analysis Workspace. You can also view previously saved results by clicking a result name in the Previous Saved Results box at the right corner of the screen.
7. Classification
7.1 Loading or Selecting Data
Users can directly upload a new dataset or select a dataset from the workspace to build the classification model. In the page shown in Figure 7-1, you can perform the following tasks.
• Click New Data Set, then enter the filename or click the browse button to select a data file to upload.
• Otherwise, click Data Workspace. You can select one of the following workspace types:
    o Upload Workspace contains the previously uploaded datasets
    o Search Workspace keeps the current search result
    o Analysis Workspace includes the previously saved results from data analysis
  When a workspace is selected, you can choose a dataset to restore from the list of datasets previously uploaded or saved in that workspace.
• Type a name for the classification result in the Result Name textbox.
Figure 7-1 Classification.
7.2 Testing Options
Users can specify a separate data set or use the training data for testing the classification model. There are two options for classifier evaluation: k-fold cross-validation and leave-one-out (LOO) cross-validation.
• K-fold cross-validation: The dataset is divided into K subsets. One of the K subsets is used as the testing data, and the remaining K − 1 subsets are put together to form the training dataset. The model evaluation is then repeated K times, with each of the K subsets used exactly once as the testing data. For this option, users must enter the number of folds (K) to use in the cross-validation process.
• LOO cross-validation: The training set consists of the whole dataset except one sample, and the testing set consists of that single sample. The evaluation is repeated so that each sample is used once as the testing data.
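The two evaluation schemes described above can be sketched in Python as index splits. This is an illustration only, not the Weka or R code that sMOL Explorer calls: the function names are hypothetical, samples are taken in their original order without shuffling, and leave-one-out is expressed as K-fold with K equal to the number of samples.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_splits(n_samples, n_samples)


for train, test in k_fold_splits(7, 3):
    print("train:", train, "test:", test)
```

Every sample appears in exactly one testing fold, so the K evaluations together cover the whole dataset once.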
7.3 Selecting a Classifier
sMOL Explorer provides five classification algorithms from Weka and R packages: Naïve Bayes, C4.5 Decision Tree, Random Forest, Neural Network and Support Vector Machine. Users can select an algorithm with its default parameter settings and click the Classify button to train and test the data. To change the parameter values for a classifier, select the Advanced Setup checkbox before clicking the Classify button.
The parameters for each algorithm are set as follows.
• Naïve Bayes (weka.classifiers.bayes.NaiveBayes)
    o Use kernel density: Select True if you want to use a kernel estimator for numeric attributes rather than a normal distribution, and False otherwise.
    o Use supervised discretization: Select True if you want to use supervised discretization to convert numeric attributes to nominal ones, and False otherwise.
• C4.5 Decision Tree (weka.classifiers.trees.J48)
    o Use Unpruned Tree: Select False if pruning is to be performed; otherwise select True.
    o Confidence Threshold: Enter smaller values if more pruning is required.
    o The Minimum Number of Instances per Leaf
    o Use Reduced Error Pruning: Select True if you want to use reduced-error pruning instead of C4.5 pruning, and False otherwise.
    o The Number of Folds for Reduced Error: Set the number of folds, K, used for reduced-error pruning. The dataset is divided into K subsets. One subset is used for pruning and the remaining K − 1 subsets for growing the tree.
    o Use Binary Splits only: Select True if you want to use binary splits when building the tree, and False otherwise.
    o Seed for Random Data Shuffling: Specify the seed used for randomizing the data when reduced-error pruning is used.
• Random Forest (weka.classifiers.trees.RandomForest)
    o Number of Trees: Enter the number of trees to be generated.
    o Number of Features to consider: Specify the number of randomly chosen attributes.
    o Seed for Random Number Generator: Set the random number seed to be used.
    o The Maximum Depth of the Trees: Specify the maximum depth of the trees, 0 for unlimited.
• Neural Network (weka.classifiers.functions.MultilayerPerceptron)
    o Learning Rate
    o Momentum Rate
o Number of Epochs
• Support Vector Machine (R: e1071 package)
o Kernel Function: Specify the kernel type used in training and predicting.
o Degree: A parameter needed for kernels of type polynomial (default: 3).
o Gamma: A parameter needed for all kernels except linear (default: 1/(data dimension)).
o Coef0: A parameter needed for kernels of type polynomial and sigmoid (default: 0).
o Cost: The cost of constraint violation (default: 1); it is the 'C' constant of the regularization term in the Lagrange formulation.
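To show where the Degree, Gamma and Coef0 parameters listed above enter the computation, the following Python sketch evaluates the four kernel functions as they are defined in the e1071 documentation (linear, polynomial, radial and sigmoid). It is an illustration only and not the code that sMOL Explorer runs; the function name svm_kernel is hypothetical. The Cost parameter does not appear in the kernel itself, since it weights the penalty for misclassified training samples during optimization.

```python
import math


def dot(u, v):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_kernel(u, v, kernel="radial", degree=3, gamma=None, coef0=0.0):
    """Evaluate one kernel value K(u, v) (hypothetical helper for illustration)."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    if gamma is None:
        gamma = 1.0 / len(u)  # default: 1/(data dimension)
    if kernel == "linear":
        return dot(u, v)
    if kernel == "polynomial":
        return (gamma * dot(u, v) + coef0) ** degree
    if kernel == "radial":
        squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-gamma * squared_distance)
    if kernel == "sigmoid":
        return math.tanh(gamma * dot(u, v) + coef0)
    raise ValueError("unknown kernel: " + kernel)


u, v = [1.0, 2.0], [3.0, 4.0]
for name in ("linear", "polynomial", "radial", "sigmoid"):
    print(name, svm_kernel(u, v, kernel=name))
```

Degree is used only by the polynomial kernel, Coef0 by the polynomial and sigmoid kernels, and Gamma by every kernel except the linear one, which matches the parameter descriptions above.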
7.4 The Classification output
The classification output consists of three parts:
• Summary: This part summarizes the classification performance of the model.
• Detailed Accuracy by Class: This part indicates how accurately the model classifies or predicts each data class.
• Prediction Result: This part lists the prediction for each individual sample in the validation dataset.
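As an illustration of how a Prediction Result list relates to the Detailed Accuracy by Class part, the sketch below derives per-class precision, recall and F-measure from actual and predicted labels. This guide does not list the exact measures reported, so the choice of these three (which Weka commonly reports) is an assumption, and the function name and class labels are hypothetical.

```python
def per_class_accuracy(actual, predicted):
    """Return precision, recall and F-measure for every class label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    report = {}
    for label in sorted(set(actual) | set(predicted)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        # Guard against division by zero when a class is never predicted or never occurs.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f_measure": f_measure}
    return report


# Hypothetical labels for five samples.
actual = ["active", "active", "inactive", "inactive", "active"]
predicted = ["active", "inactive", "inactive", "active", "active"]
for label, scores in per_class_accuracy(actual, predicted).items():
    print(label, scores)
```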
Figure 7-2 Classification result.
You can also export the classification result to a file (XML or tab-delimited format) by clicking on the export button, or save it into the analysis workspace by clicking on the save button. To view previously saved results, click on a result name in the Previous Saved Results box at the right corner of the screen.
8. Utilities
8.1 Data Preparation
To upload data for analysis, you need to prepare a data file in the sMOL-defined tab-delimited format. This utility assumes that your original file is in the sMOL-defined XML format. Enter the filename or browse for the original SML file, and click on the button to get the sMOL-defined tab-delimited file (Figure 8-1).
Figure 8-1 Data preparation.
8.2 File Conversion
In the conversion page, you can convert a chemical data file into another format with the following steps (Figure 8-2):
• Input or browse for the molecule files you want to convert
• Select the format of the input file
• Select the output format
• Specify additional options such as addition and deletion of hydrogens
• Click the conversion button
Figure 8-2 File conversion.
8.3 Computing Molecular Descriptors
Upload a Mol file, select a chemical structure level (Bond, Atomic or Molecule) and click on the Calculate button. sMOL Explorer will then generate the molecular descriptors corresponding to the selected chemical structure level (Figure 8-3).
Figure 8-3 Calculate Molecular Descriptors.
8.4 Administrator Tasks
Only users with administration permissions can perform the following tasks:
• User Management: This part describes the basic functions for controlling user data (Figure 8-4).
o At the bottom of the User Management page:
    Add New Users: To add a new user to sMOL Explorer, click on Add New User to open the Add New User page.
    Click on Check All or Uncheck All to select or de-select all the users.
    Click on Register or Delete after With Selected to register or remove all the selected/checked users.
o Edit User: Click on Edit in the row corresponding to the user whose information you want to modify.
o Delete User: Click on Delete in the row corresponding to the user that you want to remove from the system.
Figure 8-4 User management.
• Setup Configuration: This section allows you to edit the following system configurations (Figure 8-5):
o The Number of Data/Thread: sMOL Explorer speeds up the search process using multi-threading, so the data is divided among the threads. This parameter defines the maximum number of data samples per thread.
o Database Configurations: You can change the database parameters, including the server name, database name, database user account and password.
o The R Home Directory: Specify the directory path to R.
o The sMOL Home Directory: Specify the directory path to sMOL Explorer.
Figure 8-5 Edit configuration file.
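The effect of the Number of Data/Thread setting can be pictured with the following Python sketch, which divides a dataset into chunks of at most a given size and searches each chunk in its own thread. It is only an assumption about the general approach, not sMOL Explorer's implementation: all function names are hypothetical, and the plain substring test merely stands in for the real structure or text search.

```python
from concurrent.futures import ThreadPoolExecutor


def split_for_threads(samples, max_per_thread):
    """Divide the samples into chunks holding at most max_per_thread items."""
    if max_per_thread < 1:
        raise ValueError("max_per_thread must be at least 1")
    return [samples[i:i + max_per_thread]
            for i in range(0, len(samples), max_per_thread)]


def search_chunk(chunk, query):
    """Hypothetical stand-in for the real search: keep samples containing the query text."""
    return [sample for sample in chunk if query in sample]


def threaded_search(samples, query, max_per_thread):
    """Search every chunk in its own thread and merge the hits in the original order."""
    chunks = split_for_threads(samples, max_per_thread)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partial_results = pool.map(search_chunk, chunks, [query] * len(chunks))
        return [hit for part in partial_results for hit in part]


# Hypothetical records; with max_per_thread=2 they are split into three chunks.
records = ["CCO", "CCN", "c1ccccc1", "CCCl", "OCC"]
print(threaded_search(records, "CC", max_per_thread=2))
```

A smaller value of the parameter produces more, shorter chunks and therefore more threads; a larger value produces fewer threads that each handle more samples.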
• Setup Public Database Links: Input or change the URL (web address) of the public databases, including PubChem, KEGG, DrugBank and eMolecules (Figure 8-6).
Figure 8-6 Edit public database link.