INTEGRATIVE CLUSTER ANALYSIS IN BIOINFORMATICS
Basel Abu-Jamous, Rui Fa and Asoke K. Nandi
Brunel University London, UK
This edition first published 2015
© 2015 John Wiley & Sons, Ltd
Registered Office
John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication Data
Abu-Jamous, Basel.
Integrative cluster analysis in bioinformatics / Basel Abu-Jamous, Dr Rui Fa and Prof. Asoke K. Nandi.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-90653-8 (cloth)
1. Bioinformatics–Mathematics. 2. Cluster analysis. I. Fa, Rui. II. Nandi, Asoke Kumar. III. Title.
QH324.2.A24 2015
519.5′3–dc23
2014032428
A catalogue record for this book is available from the British Library.
Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India
1 2015
To
Wael Abu Jamous and Eman Arafat
Hui Jin, Yvonne and Molly Fa
Marion, Robin, David and Anita Nandi
Brief Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics
Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression
Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion
Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)
Appendix
Index
Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
1.1 Introduction
1.2 The "Omics" Era
1.3 The Scope of Bioinformatics
1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
1.3.2 Data Storage, Retrieval and Organisation
1.3.3 Data Analysis
1.3.4 Statistical Analysis
1.3.5 Presentation
1.4 What Do Information Engineers and Biologists Need to Know?
1.5 Discussion and Summary
References
2 Computational Methods in Bioinformatics
2.1 Introduction
2.2 Machine Learning and Data Mining
2.2.1 Supervised Learning
2.2.2 Unsupervised Learning
2.3 Optimisation
2.4 Image Processing: Bioimage Informatics
2.5 Network Analysis
2.6 Statistical Analysis
2.7 Software Tools and Technologies
2.8 Discussion and Summary
References
Part Two Introduction to Molecular Biology
3 The Living Cell
3.1 Introduction
3.2 Prokaryotes and Eukaryotes
3.3 Multicellularity
3.3.1 Unicellular and Multicellular Organisms
3.3.2 Stem Cells and Cell Differentiation
3.4 Cell Components
3.4.1 Plasma Membrane and Transport Proteins
3.4.2 Cytoplasm
3.4.3 Extracellular Matrix
3.4.4 Centrosome and Microtubules
3.4.5 Actin Filaments and the Cytoskeleton
3.4.6 Nucleus
3.4.7 Vesicles
3.4.8 Ribosomes
3.4.9 Endoplasmic Reticulum
3.4.10 Golgi Apparatus
3.4.11 Mitochondrion and the Energy of the Cell
3.4.12 Lysosome
3.4.13 Peroxisome
3.5 Discussion and Summary
References
4 Central Dogma of Molecular Biology
4.1 Introduction
4.2 Central Dogma of Molecular Biology Overview
4.3 Proteins
4.4 DNA
4.5 RNA
4.6 Genes
4.7 Transcription and Post-transcriptional Processes
4.7.1 Post-transcriptional Processes
4.7.2 Gene-specific TFs
4.7.3 Post-transcriptional Regulation
4.8 Translation and Post-translational Processes
4.8.1 The Genetic Code
4.8.2 tRNA and Ribosomes
4.8.3 The Steps of Translation
4.8.4 Polyribosomes (Polysomes)
4.8.5 Post-translational Processes
4.9 Discussion and Summary
References
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
5.1 Introduction
5.2 Microarrays
5.2.1 DNA Microarrays
5.2.2 Protein Microarrays
5.2.3 Carbohydrate Microarrays (Glycoarrays)
5.2.4 Other Types of Microarrays
5.3 Next-generation Sequencing (NGS)
5.3.1 DNA Sequencing
5.3.2 RNA Sequencing (Transcriptome Analysis)
5.3.3 Metagenomics
5.3.4 Other Applications of Sequencing
5.4 ChIP on Microarrays and Sequencing
5.5 Discussion and Summary
References
6 Databases, Standards and Annotation
6.1 Introduction
6.2 NCBI Databases
6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
6.2.2 GenBank (Nucleotide Database)
6.2.3 Reference Sequences (RefSeq) Database
6.2.4 Gene Database
6.2.5 Protein Database
6.2.6 Gene Expression Omnibus
6.2.7 Taxonomy and HomoloGene Databases
6.2.8 Sequence Read Archive
6.2.9 Genomic and Epigenomic Variations
6.2.10 Other NCBI Databases
6.3 The EBI Databases
6.4 Species-specific Databases
6.4.1 Animals
6.4.2 Plants
6.4.3 Fungi
6.4.4 Archaea and Bacteria
6.4.5 Viruses
6.5 Discussion and Summary
References
7 Normalisation
7.1 Introduction
7.2 Issues Tackled by Normalisation
7.2.1 Within-slide and Between-slides Normalisation
7.2.2 Normalisation Based on Non-differentially Expressed Genes
7.2.3 Background Correction
7.2.4 Logarithmic Transformation
7.2.5 Intensity-dependent Bias – (MA) Plots
7.2.6 Replicates and Summarisation
7.3 Normalisation Methods
7.3.1 Microarray Suite 5 (MAS 5.0)
7.3.2 Robust Multi-array Average (RMA)
7.3.3 Quantile Normalisation
7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
7.3.5 Scaling Methods
7.3.6 Model-based Expression Index (MBEI)
7.3.7 Other Normalisation Methods
7.4 Discussion and Summary
References
8 Feature Selection
8.1 Introduction
8.2 FS and FG – Problem Definition
8.3 Consecutive Ranking
8.3.1 Forward Search (Most Informative First Admitted)
8.3.2 Backward Elimination (Least Useful First Eliminated)
8.4 Individual Ranking
8.4.1 Information Content
8.4.2 SNR Criteria
8.5 Principal Component Analysis
8.6 Genetic Algorithms and Genetic Programming
8.7 Discussion and Summary
References
9 Differential Expression
9.1 Introduction
9.2 Fold Change
9.3 Statistical Hypothesis Testing – Overview
9.3.1 p-Values and Volcano Plots
9.3.2 The Multiple-hypothesis Testing Problem
9.4 Statistical Hypothesis Testing – Methods
9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
9.4.2 B-Statistic
9.4.3 Fisher's Exact Test
9.4.4 Likelihood Ratio Test
9.4.5 Methods for Over-dispersed Poisson Distribution
9.5 Discussion and Summary
References
Part Four Clustering Methods
10 Clustering Forms
10.1 Introduction
10.2 Proximity Measures
10.2.1 Distance Metrics for Discrete Feature Objects
10.2.2 Distance Metrics for Continuous Feature Objects
10.3 Clustering Families
10.3.1 Partitional Clustering
10.3.2 Hierarchical Clustering
10.3.3 Fuzzy Clustering
10.3.4 Neural Network-based Clustering
10.3.5 Mixture Model Clustering
10.3.6 Graph-based Clustering
10.3.7 Consensus Clustering
10.3.8 Biclustering
10.4 Clusters and Partitions
10.5 Discussion and Summary
References
11 Partitional Clustering
11.1 Introduction
11.2 k-Means and its Applications
11.2.1 Principles
11.2.2 Variations
11.2.3 Applications
11.3 k-Medoids and its Applications
11.3.1 Principles
11.3.2 Variations
11.3.3 Applications
11.4 Discussion and Summary
References
12 Hierarchical Clustering
12.1 Introduction
12.2 Principles
12.2.1 Agglomerative Methods
12.2.2 Divisive Methods
12.3 Discussion and Summary
References
13 Fuzzy Clustering
13.1 Introduction
13.2 Principles
13.2.1 Fuzzy c-Means
13.2.2 Probabilistic c-Means
13.2.3 Hybrid c-Means
13.2.4 Gustafson–Kessel Algorithm
13.2.5 Gath–Geva Algorithm
13.2.6 Fuzzy c-Shell
13.2.7 FANNY
13.2.8 Other Fuzzy Clustering Algorithms
13.3 Discussion
References
14 Neural Network-based Clustering
14.1 Introduction
14.2 Algorithms
14.2.1 SOM
14.2.2 GLVQ
14.2.3 Neural-gas
14.2.4 ART
14.2.5 OPTOC
14.2.6 SOON
14.3 Discussion
References
15 Mixture Model Clustering
15.1 Introduction
15.2 Finite Mixture Models
15.2.1 Various Mixture Models
15.2.2 Non-Bayesian Methods
15.2.3 Bayesian Methods
15.3 Infinite Mixture Models
15.3.1 DPM Model
15.3.2 CRP Mixture Model
15.3.3 SBP Mixture Model
15.4 Discussion
References
16 Graph Clustering
16.1 Introduction
16.2 Basic Definitions
16.2.1 Graph and Adjacency Matrix
16.2.2 Measures and Metrics
16.2.3 Similarity Matrices
16.3 Graph Clustering
16.3.1 Graph Cut Clustering
16.3.2 Spectral Clustering
16.3.3 AP Clustering
16.3.4 Modularity-based Clustering
16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
16.3.6 Markov Cluster Algorithm
16.4 Resources
16.5 Discussion
References
17 Consensus Clustering
17.1 Introduction
17.2 Overview
17.3 Consensus Functions
17.3.1 P–P Comparison
17.3.2 C–C Comparison
17.3.3 MIC Voting
17.3.4 M–M Co-occurrence
17.4 Discussion
References
18 Biclustering
18.1 Introduction
18.2 Overview
18.2.1 Statement of the Biclustering Problem
18.2.2 Types of Biclusters
18.2.3 Classification of Biclustering
18.3 Biclustering Methods
18.3.1 Variance-minimisation Biclustering Methods
18.3.2 Correlation-maximisation Biclustering Methods
18.3.3 Two-way Clustering Methods
18.3.4 Probabilistic and Generative Methods
18.4 Discussion
References
19 Clustering Methods Discussion
19.1 Introduction
19.2 Hierarchical Clustering
19.2.1 Yeast Cell Cycle Data
19.2.2 Breast Cancer
19.2.3 Diffuse Large B-Cell Lymphoma
19.3 Fuzzy Clustering
19.3.1 DNA Motifs Clustering
19.3.2 Microarray Gene Expression
19.4 Neural Network-based Clustering
19.5 Mixture Model-based Clustering
19.5.1 Examples of Finite Mixture Models
19.5.2 Examples of Infinite Mixture Models
19.6 Graph-based Clustering
19.7 Consensus Clustering
19.8 Biclustering
19.9 Summary
References
Part Five Validation and Visualisation
20 Numerical Validation
20.1 Introduction
20.2 External Criteria
20.2.1 Rand Index
20.2.2 Adjusted Rand Index
20.2.3 Jaccard Index
20.2.4 Normalised Mutual Information
20.3 Internal Criteria
20.3.1 Adjusted Figure of Merit
20.3.2 CLEST
20.4 Relative Criteria
20.4.1 Minimum Description Length
20.4.2 Minimum Message Length
20.4.3 Bayesian Information Criterion
20.4.4 Akaike's Information Criterion
20.4.5 Partition Coefficient
20.4.6 Partition Entropy
20.4.7 Fukuyama–Sugeno Index
20.4.8 Xie–Beni Index
20.4.9 Calinski–Harabasz Index
20.4.10 Dunn's Index
20.4.11 Davies–Bouldin Index
20.4.12 I Index
20.4.13 Silhouette
20.4.14 Object-based Validation
20.4.15 Geometrical Index
20.4.16 Validity Index
20.4.17 Generalised Parametric Validity
20.5 Discussion and Summary
References
21 Biological Validation
21.1 Introduction
21.2 GO Analysis
21.2.1 GO Term Enrichment
21.3 Upstream Sequence Analysis
21.4 Gene-network Analysis
21.5 Discussion and Summary
References
22 Visualisations and Presentations
22.1 Introduction
22.2 Methods and Examples
22.2.1 Profile Patterns
22.2.2 Bar Chart
22.2.3 Error Bar
22.2.4 Pie Chart
22.2.5 Box Plot
22.2.6 Histogram
22.2.7 Scatter Plot
22.2.8 Venn Diagram
22.2.9 Tree View
22.2.10 Heat Map
22.2.11 Network Graph
22.2.12 Low-dimension Display
22.2.13 Receiver Operating Characteristics Curves
22.2.14 Kaplan–Meier Plot
22.2.15 Block Diagram
22.3 Summary
References
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
23.1 Introduction
23.2 Related Work
23.2.1 SSCL
23.2.2 SSMCL
23.2.3 ULFMM
23.2.4 VBGM
23.2.5 PFClust
23.2.6 DBSCAN
23.3 SMART Framework
23.4 Implementations
23.4.1 SMART-CL
23.4.2 SMART-FMM
23.4.3 SMART-MFA
23.5 Enhanced SMART
23.6 Examples
23.7 Discussion
References
24 Tightness-tunable Clustering (UNCLES)
24.1 Introduction
24.2 Bi-CoPaM Method
24.2.1 Partition Generation
24.2.2 Relabelling
24.2.3 Fuzzy Consensus Partition Matrix Generation
24.2.4 Binarisation
24.3 UNCLES Method - Other Types of External Specifications
24.4 M–N Scatter Plots Technique
24.5 Parameter-free UNCLES with M–N Plots
24.6 Discussion and Summary
References
Appendix
Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.
Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.
This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing steps such as normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, receive much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.
There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.
As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.
Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics/
A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X              Data matrix
x_n            The nth data object
S              Similarity matrix
N              Number of data objects
M              Number of features or samples
K              Number of clusters
D              Dissimilarity between two input data objects
S              Similarity between two input data objects
Z              Partition (vector sequence with the length of N)
U              Partition matrix (with N rows and K columns)
u_kn           Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C              Cluster set containing K clusters C_1, …, C_K
C_k            The kth cluster, containing the indices of those data objects belonging to it
c_k            The centroid of the kth cluster
κ              Kernel function
m              The fuzzifier
μ              The mean vector
Σ              The covariance matrix
Σ              The within-cluster covariance matrix
T              The matrix transpose operator
−1             The matrix inverse
det            The determinant operator
exp            The exponential function
               The cardinality of a set
               The Euclidean norm
Θ              The total parameter set
G              Number of groups
τ              The mixing parameter
               The likelihood function
A              The adjacency matrix
B              The incident matrix
D              The degree matrix
L              The Laplacian matrix
G              A graph
V              The set of vertices
E              The set of edges
M              The modularity matrix
Q              The modularity value
N_V            Number of vertices
N_E            Number of edges
θ              A smaller parameter set
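As a reading aid for the notation above, the following minimal sketch shows how a partition vector Z relates to a partition matrix U, and how the degree matrix D and a Laplacian matrix L can be obtained from an adjacency matrix A. It rests on assumptions that are not stated in the list itself: memberships are taken to be crisp (0 or 1), cluster labels are 0-based, and L is taken to be the unnormalised Laplacian L = D − A, which is one common definition among several. The function names and the toy data are hypothetical and chosen only for illustration.

```python
def partition_matrix(z, num_clusters):
    """Turn a partition vector Z (length N) into a crisp N-by-K partition matrix U.

    Assumption: labels in z are integers in 0..K-1, and membership is binary.
    """
    return [[1 if label == k else 0 for k in range(num_clusters)] for label in z]


def degree_and_laplacian(adjacency):
    """Return the degree matrix D and the unnormalised Laplacian L = D - A.

    Assumption: `adjacency` is a square, symmetric list of lists (undirected graph).
    """
    n = len(adjacency)
    degree = [[sum(adjacency[i]) if i == j else 0 for j in range(n)]
              for i in range(n)]
    laplacian = [[degree[i][j] - adjacency[i][j] for j in range(n)]
                 for i in range(n)]
    return degree, laplacian


# Hypothetical toy data: N = 4 objects assigned to K = 2 clusters.
Z = [0, 1, 0, 1]
U = partition_matrix(Z, 2)

# Hypothetical undirected path graph on N_V = 3 vertices: 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
D, L = degree_and_laplacian(A)

print(U)
print(D)
print(L)
```

Each row of U contains a single 1 because every object belongs to exactly one cluster in a crisp partition; a fuzzy partition matrix would instead hold graded memberships in each row.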
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.
Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.
Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.
In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One
Introduction
1
Introduction to Bioinformatics
1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.
The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.
The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.
Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The "Omics" Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is "-omics", and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.
The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.
All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvement, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.
More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication Data
Abu-Jamous, Basel.
Integrative cluster analysis in bioinformatics / Basel Abu-Jamous, Dr Rui Fa, and Prof. Asoke K. Nandi.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-90653-8 (cloth)
1. Bioinformatics–Mathematics. 2. Cluster analysis. I. Fa, Rui. II. Nandi, Asoke Kumar. III. Title.
QH324.2.A24 2015
519.5'3–dc23
2014032428
A catalogue record for this book is available from the British Library.
Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India
1 2015
To
Wael Abu Jamous and Eman Arafat
Hui Jin, Yvonne and Molly Fa
Marion, Robin, David and Anita Nandi
Brief Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics
Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression
Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion
Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)
Appendix
Index
Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 The "Omics" Era
  1.3 The Scope of Bioinformatics
    1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
    1.3.2 Data Storage, Retrieval and Organisation
    1.3.3 Data Analysis
    1.3.4 Statistical Analysis
    1.3.5 Presentation
  1.4 What Do Information Engineers and Biologists Need to Know?
  1.5 Discussion and Summary
  References
2 Computational Methods in Bioinformatics
  2.1 Introduction
  2.2 Machine Learning and Data Mining
    2.2.1 Supervised Learning
    2.2.2 Unsupervised Learning
  2.3 Optimisation
  2.4 Image Processing: Bioimage Informatics
  2.5 Network Analysis
  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References
Part Two Introduction to Molecular Biology
3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References
4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References
6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References
7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References
8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References
9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher's Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References
Part Four Clustering Methods
10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References
11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References
12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References
13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References
14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References
15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References
16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References
17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References
18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References
19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References
Part Five Validation and Visualisation
20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike's Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn's Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References
21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References
22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References
24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References
Appendix
Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.
Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.
This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part on molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.
There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.
As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.
Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics
A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X         Data matrix
x_n       The nth data object
S         Similarity matrix
N         Number of data objects
M         Number of features or samples
K         Number of clusters
D         Dissimilarity between two input data objects
S         Similarity between two input data objects
Z         Partition (vector sequence with the length of N)
U         Partition matrix (with N rows and K columns)
u_kn      Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C         Cluster set containing K clusters C_1, …, C_K
C_k       The kth cluster, containing the indices of those data objects belonging to it
c_k       The centroid of the kth cluster
κ         Kernel function
m         The fuzzifier
μ         The mean vector
Σ         The covariance matrix
Σ         The within-cluster covariance matrix
^T        The matrix transpose operator
^−1       The matrix inverse
det       The determinant operator
exp       The exponential function
|·|       The cardinality of a set
‖·‖       The Euclidean norm
Θ         The total parameter set
G         Number of groups
τ         The mixing parameter
          The likelihood function
A         The adjacency matrix
B         The incidence matrix
D         The degree matrix
L         The Laplacian matrix
G         A graph
V         The set of vertices
E         The set of edges
M         The modularity matrix
Q         The modularity value
N_V       Number of vertices
N_E       Number of edges
θ         A smaller parameter set
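The clustering notation in this list maps onto simple data structures. The following is a minimal Python sketch, not taken from the book, assuming a toy dataset with N = 4 objects, M = 2 features and K = 2 clusters. The data values and the helper names `partition_matrix` and `centroids` are hypothetical, and a crisp (0/1) partition with no empty clusters is assumed.

```python
# Toy illustration of the clustering notation; all values are hypothetical.
# Indices are 0-based here, whereas the symbol list counts from 1.

X = [
    [1.0, 2.0],  # first data object
    [1.2, 1.8],  # second data object
    [8.0, 9.0],  # third data object
    [7.8, 9.4],  # fourth data object
]
N = len(X)       # number of data objects
M = len(X[0])    # number of features
K = 2            # number of clusters

# Partition Z: a length-N sequence giving the cluster index of each object.
Z = [0, 0, 1, 1]


def partition_matrix(Z, K):
    """Build a crisp partition matrix U with N rows and K columns."""
    return [[1 if z == k else 0 for k in range(K)] for z in Z]


def centroids(X, U):
    """Compute each centroid c_k as the mean of the objects assigned to cluster k.

    Assumes every cluster has at least one member (no division by zero).
    """
    num_objects = len(X)
    num_features = len(X[0])
    num_clusters = len(U[0])
    result = []
    for k in range(num_clusters):
        weight = sum(U[n][k] for n in range(num_objects))
        result.append([
            sum(U[n][k] * X[n][m] for n in range(num_objects)) / weight
            for m in range(num_features)
        ])
    return result


U = partition_matrix(Z, K)                                  # U[n][k] holds u_kn
C = [[n for n in range(N) if Z[n] == k] for k in range(K)]  # index sets C_k
c = centroids(X, U)                                         # centroids c_k
```

Here `U[n][k]` plays the role of the entry u_kn, and each element of `C` is the index set C_k of one cluster. A fuzzy partition would store fractional memberships in the same N-by-K layout.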
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011 and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.
Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing, and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.
Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.
In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK), and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One
Introduction
1 Introduction to Bioinformatics
1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.
The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.
The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.
1.2 The "Omics" Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is "-omics", and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.
The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil, or the human gut (Kembel et al. 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.
All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvement and therefore reside at the core of bioinformatics research. An even higher level of omics analysis involves the integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.
More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
INTEGRATIVE CLUSTERANALYSIS INBIOINFORMATICS
Basel Abu-Jamous Rui Fa and Asoke K NandiBrunel University London UK
This edition first published 2015copy 2015 John Wiley amp Sons Ltd
Registered OfficeJohn Wiley amp Sons Ltd The Atrium Southern Gate Chichester West Sussex PO19 8SQ United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication Data
Abu-Jamous, Basel.
Integrative cluster analysis in bioinformatics / Basel Abu-Jamous, Dr Rui Fa, and Prof. Asoke K. Nandi.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-90653-8 (cloth)
1. Bioinformatics–Mathematics. 2. Cluster analysis. I. Fa, Rui. II. Nandi, Asoke Kumar. III. Title.
QH324.2.A24 2015
519.5′3–dc23
2014032428
A catalogue record for this book is available from the British Library.
Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India
1 2015
To
Wael Abu Jamous and Eman Arafat
Hui Jin, Yvonne and Molly Fa
Marion, Robin, David and Anita Nandi
Brief Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics
Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression
Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion
Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)
Appendix
Index
Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
1.1 Introduction
1.2 The “Omics” Era
1.3 The Scope of Bioinformatics
1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
1.3.2 Data Storage, Retrieval and Organisation
1.3.3 Data Analysis
1.3.4 Statistical Analysis
1.3.5 Presentation
1.4 What Do Information Engineers and Biologists Need to Know?
1.5 Discussion and Summary
References
2 Computational Methods in Bioinformatics
2.1 Introduction
2.2 Machine Learning and Data Mining
2.2.1 Supervised Learning
2.2.2 Unsupervised Learning
2.3 Optimisation
2.4 Image Processing: Bioimage Informatics
2.5 Network Analysis
2.6 Statistical Analysis
2.7 Software Tools and Technologies
2.8 Discussion and Summary
References
Part Two Introduction to Molecular Biology
3 The Living Cell
3.1 Introduction
3.2 Prokaryotes and Eukaryotes
3.3 Multicellularity
3.3.1 Unicellular and Multicellular Organisms
3.3.2 Stem Cells and Cell Differentiation
3.4 Cell Components
3.4.1 Plasma Membrane and Transport Proteins
3.4.2 Cytoplasm
3.4.3 Extracellular Matrix
3.4.4 Centrosome and Microtubules
3.4.5 Actin Filaments and the Cytoskeleton
3.4.6 Nucleus
3.4.7 Vesicles
3.4.8 Ribosomes
3.4.9 Endoplasmic Reticulum
3.4.10 Golgi Apparatus
3.4.11 Mitochondrion and the Energy of the Cell
3.4.12 Lysosome
3.4.13 Peroxisome
3.5 Discussion and Summary
References
4 Central Dogma of Molecular Biology
4.1 Introduction
4.2 Central Dogma of Molecular Biology Overview
4.3 Proteins
4.4 DNA
4.5 RNA
4.6 Genes
4.7 Transcription and Post-transcriptional Processes
4.7.1 Post-transcriptional Processes
4.7.2 Gene-specific TFs
4.7.3 Post-transcriptional Regulation
4.8 Translation and Post-translational Processes
4.8.1 The Genetic Code
4.8.2 tRNA and Ribosomes
4.8.3 The Steps of Translation
4.8.4 Polyribosomes (Polysomes)
4.8.5 Post-translational Processes
4.9 Discussion and Summary
References
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
5.1 Introduction
5.2 Microarrays
5.2.1 DNA Microarrays
5.2.2 Protein Microarrays
5.2.3 Carbohydrate Microarrays (Glycoarrays)
5.2.4 Other Types of Microarrays
5.3 Next-generation Sequencing (NGS)
5.3.1 DNA Sequencing
5.3.2 RNA Sequencing (Transcriptome Analysis)
5.3.3 Metagenomics
5.3.4 Other Applications of Sequencing
5.4 ChIP on Microarrays and Sequencing
5.5 Discussion and Summary
References
6 Databases, Standards and Annotation
6.1 Introduction
6.2 NCBI Databases
6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
6.2.2 GenBank (Nucleotide Database)
6.2.3 Reference Sequences (RefSeq) Database
6.2.4 Gene Database
6.2.5 Protein Database
6.2.6 Gene Expression Omnibus
6.2.7 Taxonomy and HomoloGene Databases
6.2.8 Sequence Read Archive
6.2.9 Genomic and Epigenomic Variations
6.2.10 Other NCBI Databases
6.3 The EBI Databases
6.4 Species-specific Databases
6.4.1 Animals
6.4.2 Plants
6.4.3 Fungi
6.4.4 Archaea and Bacteria
6.4.5 Viruses
6.5 Discussion and Summary
References
7 Normalisation
7.1 Introduction
7.2 Issues Tackled by Normalisation
7.2.1 Within-slide and Between-slides Normalisation
7.2.2 Normalisation Based on Non-differentially Expressed Genes
7.2.3 Background Correction
7.2.4 Logarithmic Transformation
7.2.5 Intensity-dependent Bias – (MA) Plots
7.2.6 Replicates and Summarisation
7.3 Normalisation Methods
7.3.1 Microarray Suite 5 (MAS 5.0)
7.3.2 Robust Multi-array Average (RMA)
7.3.3 Quantile Normalisation
7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
7.3.5 Scaling Methods
7.3.6 Model-based Expression Index (MBEI)
7.3.7 Other Normalisation Methods
7.4 Discussion and Summary
References
8 Feature Selection
8.1 Introduction
8.2 FS and FG – Problem Definition
8.3 Consecutive Ranking
8.3.1 Forward Search (Most Informative First Admitted)
8.3.2 Backward Elimination (Least Useful First Eliminated)
8.4 Individual Ranking
8.4.1 Information Content
8.4.2 SNR Criteria
8.5 Principal Component Analysis
8.6 Genetic Algorithms and Genetic Programming
8.7 Discussion and Summary
References
9 Differential Expression
9.1 Introduction
9.2 Fold Change
9.3 Statistical Hypothesis Testing – Overview
9.3.1 p-Values and Volcano Plots
9.3.2 The Multiple-hypothesis Testing Problem
9.4 Statistical Hypothesis Testing – Methods
9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
9.4.2 B-Statistic
9.4.3 Fisher’s Exact Test
9.4.4 Likelihood Ratio Test
9.4.5 Methods for Over-dispersed Poisson Distribution
9.5 Discussion and Summary
References
Part Four Clustering Methods
10 Clustering Forms
10.1 Introduction
10.2 Proximity Measures
10.2.1 Distance Metrics for Discrete Feature Objects
10.2.2 Distance Metrics for Continuous Feature Objects
10.3 Clustering Families
10.3.1 Partitional Clustering
10.3.2 Hierarchical Clustering
10.3.3 Fuzzy Clustering
10.3.4 Neural Network-based Clustering
10.3.5 Mixture Model Clustering
10.3.6 Graph-based Clustering
10.3.7 Consensus Clustering
10.3.8 Biclustering
10.4 Clusters and Partitions
10.5 Discussion and Summary
References
11 Partitional Clustering
11.1 Introduction
11.2 k-Means and its Applications
11.2.1 Principles
11.2.2 Variations
11.2.3 Applications
11.3 k-Medoids and its Applications
11.3.1 Principles
11.3.2 Variations
11.3.3 Applications
11.4 Discussion and Summary
References
12 Hierarchical Clustering
12.1 Introduction
12.2 Principles
12.2.1 Agglomerative Methods
12.2.2 Divisive Methods
12.3 Discussion and Summary
References
13 Fuzzy Clustering
13.1 Introduction
13.2 Principles
13.2.1 Fuzzy c-Means
13.2.2 Probabilistic c-Means
13.2.3 Hybrid c-Means
13.2.4 Gustafson–Kessel Algorithm
13.2.5 Gath–Geva Algorithm
13.2.6 Fuzzy c-Shell
13.2.7 FANNY
13.2.8 Other Fuzzy Clustering Algorithms
13.3 Discussion
References
14 Neural Network-based Clustering
14.1 Introduction
14.2 Algorithms
14.2.1 SOM
14.2.2 GLVQ
14.2.3 Neural-gas
14.2.4 ART
14.2.5 OPTOC
14.2.6 SOON
14.3 Discussion
References
15 Mixture Model Clustering
15.1 Introduction
15.2 Finite Mixture Models
15.2.1 Various Mixture Models
15.2.2 Non-Bayesian Methods
15.2.3 Bayesian Methods
15.3 Infinite Mixture Models
15.3.1 DPM Model
15.3.2 CRP Mixture Model
15.3.3 SBP Mixture Model
15.4 Discussion
References
16 Graph Clustering
16.1 Introduction
16.2 Basic Definitions
16.2.1 Graph and Adjacency Matrix
16.2.2 Measures and Metrics
16.2.3 Similarity Matrices
16.3 Graph Clustering
16.3.1 Graph Cut Clustering
16.3.2 Spectral Clustering
16.3.3 AP Clustering
16.3.4 Modularity-based Clustering
16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
16.3.6 Markov Cluster Algorithm
16.4 Resources
16.5 Discussion
References
17 Consensus Clustering
17.1 Introduction
17.2 Overview
17.3 Consensus Functions
17.3.1 P–P Comparison
17.3.2 C–C Comparison
17.3.3 MIC Voting
17.3.4 M–M Co-occurrence
17.4 Discussion
References
18 Biclustering
18.1 Introduction
18.2 Overview
18.2.1 Statement of the Biclustering Problem
18.2.2 Types of Biclusters
18.2.3 Classification of Biclustering
18.3 Biclustering Methods
18.3.1 Variance-minimisation Biclustering Methods
18.3.2 Correlation-maximisation Biclustering Methods
18.3.3 Two-way Clustering Methods
18.3.4 Probabilistic and Generative Methods
18.4 Discussion
References
19 Clustering Methods Discussion
19.1 Introduction
19.2 Hierarchical Clustering
19.2.1 Yeast Cell Cycle Data
19.2.2 Breast Cancer
19.2.3 Diffuse Large B-Cell Lymphoma
19.3 Fuzzy Clustering
19.3.1 DNA Motifs Clustering
19.3.2 Microarray Gene Expression
19.4 Neural Network-based Clustering
19.5 Mixture Model-based Clustering
19.5.1 Examples of Finite Mixture Models
19.5.2 Examples of Infinite Mixture Models
19.6 Graph-based Clustering
19.7 Consensus Clustering
19.8 Biclustering
19.9 Summary
References
Part Five Validation and Visualisation
20 Numerical Validation
20.1 Introduction
20.2 External Criteria
20.2.1 Rand Index
20.2.2 Adjusted Rand Index
20.2.3 Jaccard Index
20.2.4 Normalised Mutual Information
20.3 Internal Criteria
20.3.1 Adjusted Figure of Merit
20.3.2 CLEST
20.4 Relative Criteria
20.4.1 Minimum Description Length
20.4.2 Minimum Message Length
20.4.3 Bayesian Information Criterion
20.4.4 Akaike’s Information Criterion
20.4.5 Partition Coefficient
20.4.6 Partition Entropy
20.4.7 Fukuyama–Sugeno Index
20.4.8 Xie–Beni Index
20.4.9 Calinski–Harabasz Index
20.4.10 Dunn’s Index
20.4.11 Davies–Bouldin Index
20.4.12 I Index
20.4.13 Silhouette
20.4.14 Object-based Validation
20.4.15 Geometrical Index
20.4.16 Validity Index
20.4.17 Generalised Parametric Validity
20.5 Discussion and Summary
References
21 Biological Validation
21.1 Introduction
21.2 GO Analysis
21.2.1 GO Term Enrichment
21.3 Upstream Sequence Analysis
21.4 Gene-network Analysis
21.5 Discussion and Summary
References
22 Visualisations and Presentations
22.1 Introduction
22.2 Methods and Examples
22.2.1 Profile Patterns
22.2.2 Bar Chart
22.2.3 Error Bar
22.2.4 Pie Chart
22.2.5 Box Plot
22.2.6 Histogram
22.2.7 Scatter Plot
22.2.8 Venn Diagram
22.2.9 Tree View
22.2.10 Heat Map
22.2.11 Network Graph
22.2.12 Low-dimension Display
22.2.13 Receiver Operating Characteristics Curves
22.2.14 Kaplan–Meier Plot
22.2.15 Block Diagram
22.3 Summary
References
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
23.1 Introduction
23.2 Related Work
23.2.1 SSCL
23.2.2 SSMCL
23.2.3 ULFMM
23.2.4 VBGM
23.2.5 PFClust
23.2.6 DBSCAN
23.3 SMART Framework
23.4 Implementations
23.4.1 SMART-CL
23.4.2 SMART-FMM
23.4.3 SMART-MFA
23.5 Enhanced SMART
23.6 Examples
23.7 Discussion
References
24 Tightness-tunable Clustering (UNCLES)
24.1 Introduction
24.2 Bi-CoPaM Method
24.2.1 Partition Generation
24.2.2 Relabelling
24.2.3 Fuzzy Consensus Partition Matrix Generation
24.2.4 Binarisation
24.3 UNCLES Method - Other Types of External Specifications
24.4 M–N Scatter Plots Technique
24.5 Parameter-free UNCLES with M–N Plots
24.6 Discussion and Summary
References
Appendix
Index
Preface
Bioinformatics is a new and interdisciplinary field concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.
Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.
This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part on molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing such as normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are given much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.
There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.
As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.
Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics
A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X Data matrix
x_n The nth data object
S Similarity matrix
N Number of data objects
M Number of features or samples
K Number of clusters
D Dissimilarity between two input data objects
S Similarity between two input data objects
Z Partition (vector sequence with the length of N)
U Partition matrix (with N rows and K columns)
u_kn Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C Cluster set containing K clusters C_1, …, C_K
C_k The kth cluster, containing the indices of those data objects belonging to it
c_k The centroid of the kth cluster
κ Kernel function
m The fuzzifier
μ The mean vector
Σ The covariance matrix
Σ The within-cluster covariance matrix
T The matrix transpose operator
−1 The matrix inverse
det The determinant operator
exp The exponential function
|·| The cardinality of a set
‖·‖ The Euclidean norm
Θ The total parameter set
G Number of groups
τ The mixing parameter
The likelihood function
A The adjacency matrix
B The incidence matrix
D The degree matrix
L The Laplacian matrix
G A graph
V The set of vertices
E The set of edges
M The modularity matrix
Q The modularity value
N_V Number of vertices
N_E Number of edges
θ A smaller parameter set
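As a small illustration of how some of the clustering symbols above relate to one another, the following Python sketch builds a crisp partition matrix U, the cluster set C and the centroids c_k from a toy data matrix X and partition vector Z. The toy values, the function names and the 0-based indexing are assumptions made for this example only; they are not taken from the book, and fuzzy partitions (memberships between 0 and 1) are not covered.

```python
# Illustrative only: the toy values and helper names below are assumptions,
# not taken from the book. Indices are 0-based.

def partition_matrix(z, k):
    """Build the N-by-K crisp partition matrix U from a partition vector Z.

    z[n] is the cluster index of the nth data object;
    U[n][j] is 1 if object n belongs to cluster j, and 0 otherwise.
    """
    return [[1 if label == j else 0 for j in range(k)] for label in z]


def clusters(z, k):
    """Return C = [C_1, ..., C_K]; each C_k lists the indices of its members."""
    return [[n for n, label in enumerate(z) if label == j] for j in range(k)]


def centroids(x, z, k):
    """Return c_k, the mean of the data objects x_n assigned to cluster k.

    Assumes every cluster has at least one member.
    """
    m = len(x[0])  # M: number of features
    return [
        [sum(x[n][f] for n in members) / len(members) for f in range(m)]
        for members in clusters(z, k)
    ]


# Toy data matrix X with N = 4 objects and M = 2 features, in K = 2 clusters.
X = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
Z = [0, 0, 1, 1]
K = 2

U = partition_matrix(Z, K)
C = clusters(Z, K)
c = centroids(X, Z, K)
print(U)  # [[1, 0], [1, 0], [0, 1], [0, 1]]
print(C)  # [[0, 1], [2, 3]]
print(c)  # [[2.0, 3.0], [11.0, 12.0]]
```

In a crisp partition each row of U sums to one, since every data object belongs to exactly one cluster.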
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.
Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.
Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.
In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One Introduction
1 Introduction to Bioinformatics
1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.
The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.
The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.
1.2 The “Omics” Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many well-known research journals carry the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.
The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.
All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvement and therefore reside at the core focus of bioinformatics research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.
More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
4 Integrative Cluster Analysis in Bioinformatics
This edition first published 2015copy 2015 John Wiley amp Sons Ltd
Registered OfficeJohn Wiley amp Sons Ltd The Atrium Southern Gate Chichester West Sussex PO19 8SQ United Kingdom
For details of our global editorial offices for customer services and for information about how to applyfor permission to reuse the copyright material in this book please see our website at wwwwileycom
The right of the author to be identified as the author of this work has been asserted in accordance with theCopyright Designs and Patents Act 1988
All rights reserved No part of this publication may be reproduced stored in a retrieval system or transmittedin any form or by any means electronic mechanical photocopying recording or otherwise except as permittedby the UK Copyright Designs and Patents Act 1988 without the prior permission of the publisher
Wiley also publishes its books in a variety of electronic formats Some content that appears in print may notbe available in electronic books
Designations used by companies to distinguish their products are often claimed as trademarks All brand namesand product names used in this book are trade names service marks trademarks or registered trademarks of theirrespective owners The publisher is not associated with any product or vendor mentioned in this book
Limit of LiabilityDisclaimer of Warranty While the publisher and author have used their best efforts in preparingthis book they make no representations or warranties with respect to the accuracy or completeness of the contentsof this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purposeIt is sold on the understanding that the publisher is not engaged in rendering professional services and neither thepublisher nor the author shall be liable for damages arising herefrom If professional advice or other expertassistance is required the services of a competent professional should be sought
The advice and strategies contained herein may not be suitable for every situation In view of ongoing researchequipment modifications changes in governmental regulations and the constant flow of information relating tothe use of experimental reagents equipment and devices the reader is urged to review and evaluate theinformation provided in the package insert or instructions for each chemical piece of equipment reagent ordevice for among other things any changes in the instructions or indication of usage and for added warningsand precautions The fact that an organization or Website is referred to in this work as a citation andor a potentialsource of further information does not mean that the author or the publisher endorses the information theorganization or Website may provide or recommendations it may make Further readers should be aware thatInternet Websites listed in this work may have changed or disappeared between when this work was written andwhen it is read No warranty may be created or extended by any promotional statements for this work Neither thepublisher nor the author shall be liable for any damages arising herefrom
Library of Congress Cataloging-in-Publication Data
Abu-Jamous, Basel.
Integrative cluster analysis in bioinformatics / Basel Abu-Jamous, Dr Rui Fa and Prof. Asoke K. Nandi.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-90653-8 (cloth)
1. Bioinformatics–Mathematics. 2. Cluster analysis. I. Fa, Rui. II. Nandi, Asoke Kumar. III. Title.
QH324.2.A24 2015
519.5'3–dc23
2014032428
A catalogue record for this book is available from the British Library.
Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India
1 2015
To
Wael Abu Jamous and Eman Arafat
Hui Jin, Yvonne and Molly Fa
Marion, Robin, David and Anita Nandi
Brief Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics
Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression
Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion
Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)
Appendix
Index
Contents
Preface
List of Symbols
About the Authors
Part One Introduction
1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 The "Omics" Era
  1.3 The Scope of Bioinformatics
    1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
    1.3.2 Data Storage, Retrieval and Organisation
    1.3.3 Data Analysis
    1.3.4 Statistical Analysis
    1.3.5 Presentation
  1.4 What Do Information Engineers and Biologists Need to Know?
  1.5 Discussion and Summary
  References
2 Computational Methods in Bioinformatics
  2.1 Introduction
  2.2 Machine Learning and Data Mining
    2.2.1 Supervised Learning
    2.2.2 Unsupervised Learning
  2.3 Optimisation
  2.4 Image Processing: Bioimage Informatics
  2.5 Network Analysis
  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References
Part Two Introduction to Molecular Biology
3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References
4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References
6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References
7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References
8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References
9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher's Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References
Part Four Clustering Methods
10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References
11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References
12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References
13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References
14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References
15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References
16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References
17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References
18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References
19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References
Part Five Validation and Visualisation
20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike's Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn's Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References
21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References
22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References
24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References
Appendix
Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X     Data matrix
x_n   The nth data object
S     Similarity matrix
N     Number of data objects
M     Number of features or samples
K     Number of clusters
D     Dissimilarity between two input data objects
S     Similarity between two input data objects
Z     Partition (vector sequence with the length of N)
U     Partition matrix (with N rows and K columns)
u_kn  Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C     Cluster set containing K clusters C_1, …, C_K
C_k   The kth cluster, containing the indices of those data objects belonging to it
c_k   The centroid of the kth cluster
κ     Kernel function
m     The fuzzifier
μ     The mean vector
Σ     The covariance matrix
Σ     The within-cluster covariance matrix
T     The matrix transpose operator
−1    The matrix inverse
det   The determinant operator
exp   The exponential function
|·|   The cardinality of a set
‖·‖   The Euclidean norm
Θ     The total parameter set
G     Number of groups
τ     The mixing parameter
ℒ     The likelihood function
A     The adjacency matrix
B     The incidence matrix
D     The degree matrix
L     The Laplacian matrix
G     A graph
V     The set of vertices
E     The set of edges
M     The modularity matrix
Q     The modularity value
N_V   Number of vertices
N_E   Number of edges
θ     A smaller parameter set
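To make the relationship between the partition vector Z, the partition matrix U, the cluster sets C_k and the centroids c_k more concrete, here is a minimal sketch in plain Python. It is only an illustration of the notation, not code from the book: the function names, the 0-based cluster labels, the crisp (0/1) memberships and the use of the arithmetic mean as the centroid are all assumptions made for this example.

```python
from typing import List, Optional, Sequence


def _check_labels(z: Sequence[int], k: int) -> None:
    """Reject labels outside 0..K-1 (0-based labels are an assumption here)."""
    for n, label in enumerate(z):
        if not 0 <= label < k:
            raise ValueError(f"label {label} of object {n} is outside 0..{k - 1}")


def partition_matrix(z: Sequence[int], k: int) -> List[List[int]]:
    """Build a crisp N-by-K partition matrix U from a partition vector Z."""
    _check_labels(z, k)
    u = [[0] * k for _ in z]          # N rows, K columns, all zeros
    for n, label in enumerate(z):
        u[n][label] = 1               # object n is a member of cluster `label`
    return u


def cluster_sets(z: Sequence[int], k: int) -> List[List[int]]:
    """Return the K clusters, each listing the indices of its data objects."""
    _check_labels(z, k)
    c: List[List[int]] = [[] for _ in range(k)]
    for n, label in enumerate(z):
        c[label].append(n)
    return c


def centroids(x: Sequence[Sequence[float]], z: Sequence[int],
              k: int) -> List[Optional[List[float]]]:
    """Mean vector of each cluster's objects; None for an empty cluster."""
    result: List[Optional[List[float]]] = []
    for members in cluster_sets(z, k):
        if not members:
            result.append(None)
            continue
        m = len(x[members[0]])        # number of features M
        result.append([sum(x[n][j] for n in members) / len(members)
                       for j in range(m)])
    return result


# Hypothetical toy data: N = 4 objects, M = 2 features, K = 2 clusters.
X = [[1.0, 2.0], [5.0, 6.0], [3.0, 4.0], [7.0, 8.0]]
Z = [0, 1, 0, 1]
K = 2
U = partition_matrix(Z, K)
C = cluster_sets(Z, K)
c = centroids(X, Z, K)
```

In a fuzzy setting the entries of U would be real-valued memberships rather than 0/1 flags; the crisp form is used here only because it is the simplest case to read.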
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient, given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed
in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.
1.2 The "Omics" Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, namely "-omics" and its relatives "-ome" and "-omic". This started in the 1930s, when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
4 Integrative Cluster Analysis in Bioinformatics
ToWael Abu Jamous and Eman Arafat
Hui Jin Yvonne and Molly FaMarion Robin David and Anita Nandi
Brief Contents
Preface xix
List of Symbols xxi
About the Authors xxiii
Part One Introduction 1
1 Introduction to Bioinformatics 32 Computational Methods in Bioinformatics 9
Part Two Introduction to Molecular Biology 19
3 The Living Cell 214 Central Dogma of Molecular Biology 33
Part Three Data Acquisition and Pre-processing 53
5 High-throughput Technologies 556 Databases Standards and Annotation 677 Normalisation 878 Feature Selection 1099 Differential Expression 119
Part Four Clustering Methods 133
10 Clustering Forms 13511 Partitional Clustering 14312 Hierarchical Clustering 15713 Fuzzy Clustering 16714 Neural Network-based Clustering 181
15 Mixture Model Clustering 19716 Graph Clustering 22717 Consensus Clustering 24718 Biclustering 26519 Clustering Methods Discussion 283
Part Five Validation and Visualisation 303
20 Numerical Validation 30521 Biological Validation 32322 Visualisations and Presentations 339
Part Six New Clustering Frameworks Designed for Bioinformatics 363
23 Splitting-Merging Awareness Tactics (SMART) 36524 Tightness-tunable Clustering (UNCLES) 385
Appendix 395
Index 409
viii Brief Contents
Contents
Preface xix
List of Symbols xxi
About the Authors xxiii
Part One Introduction 1
1 Introduction to Bioinformatics 311 Introduction 312 The ldquoOmicsrdquo Era 413 The Scope of Bioinformatics 4
131 Areas of Molecular Biology Subject to Bioinformatics Analysis 5132 Data Storage Retrieval and Organisation 5133 Data Analysis 5134 Statistical Analysis 6135 Presentation 6
14 What Do Information Engineers and Biologists Need to Know 715 Discussion and Summary 8
References 8
2 Computational Methods in Bioinformatics 921 Introduction 922 Machine Learning and Data Mining 10
221 Supervised Learning 10222 Unsupervised Learning 11
23 Optimisation 1124 Image Processing Bioimage Informatics 1325 Network Analysis 14
26 Statistical Analysis 1527 Software Tools and Technologies 1528 Discussion and Summary 16
References 17
Part Two Introduction to Molecular Biology 19
3 The Living Cell 2131 Introduction 2132 Prokaryotes and Eukaryotes 2133 Multicellularity 22
331 Unicellular and Multicellular Organisms 22332 Stem Cells and Cell Differentiation 23
34 Cell Components 24341 Plasma Membrane and Transport Proteins 25342 Cytoplasm 25343 Extracellular Matrix 26344 Centrosome and Microtubules 26345 Actin Filaments and the Cytoskeleton 27346 Nucleus 27347 Vesicles 27348 Ribosomes 28349 Endoplasmic Reticulum 283410 Golgi Apparatus 293411 Mitochondrion and the Energy of the Cell 293412 Lysosome 303413 Peroxisome 31
35 Discussion and Summary 31References 32
4 Central Dogma of Molecular Biology 3341 Introduction 3342 Central Dogma of Molecular Biology Overview 3343 Proteins 3444 DNA 3745 RNA 3946 Genes 4047 Transcription and Post-transcriptional Processes 41
471 Post-transcriptional Processes 43472 Gene-specific TFs 44473 Post-transcriptional Regulation 44
48 Translation and Post-translational Processes 45481 The Genetic Code 45482 tRNA and Ribosomes 46483 The Steps of Translation 47484 Polyribosomes (Polysomes) 49485 Post-translational Processes 50
x Contents
49 Discussion and Summary 50References 51
Part Three Data Acquisition and Pre-processing 53
5 High-throughput Technologies 5551 Introduction 5552 Microarrays 56
521 DNAMicroarrays 56522 Protein Microarrays 58523 Carbohydrate Microarrays (Glycoarrays) 59524 Other Types of Microarrays 59
53 Next-generation Sequencing (NGS) 60531 DNA Sequencing 60532 RNA Sequencing (Transcripome Analysis) 62533 Metagenomics 63534 Other Applications of Sequencing 63
54 ChIP on Microarrays and Sequencing 6355 Discussion and Summary 64
References 65
6 Databases, Standards and Annotation
6.1 Introduction
6.2 NCBI Databases
6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
6.2.2 GenBank (Nucleotide Database)
6.2.3 Reference Sequences (RefSeq) Database
6.2.4 Gene Database
6.2.5 Protein Database
6.2.6 Gene Expression Omnibus
6.2.7 Taxonomy and HomoloGene Databases
6.2.8 Sequence Read Archive
6.2.9 Genomic and Epigenomic Variations
6.2.10 Other NCBI Databases
6.3 The EBI Databases
6.4 Species-specific Databases
6.4.1 Animals
6.4.2 Plants
6.4.3 Fungi
6.4.4 Archaea and Bacteria
6.4.5 Viruses
6.5 Discussion and Summary
References
7 Normalisation
7.1 Introduction
7.2 Issues Tackled by Normalisation
7.2.1 Within-slide and Between-slides Normalisation
7.2.2 Normalisation Based on Non-differentially Expressed Genes
7.2.3 Background Correction
7.2.4 Logarithmic Transformation
7.2.5 Intensity-dependent Bias – (MA) Plots
7.2.6 Replicates and Summarisation
7.3 Normalisation Methods
7.3.1 Microarray Suite 5 (MAS 5.0)
7.3.2 Robust Multi-array Average (RMA)
7.3.3 Quantile Normalisation
7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
7.3.5 Scaling Methods
7.3.6 Model-based Expression Index (MBEI)
7.3.7 Other Normalisation Methods
7.4 Discussion and Summary
References
8 Feature Selection
8.1 Introduction
8.2 FS and FG – Problem Definition
8.3 Consecutive Ranking
8.3.1 Forward Search (Most Informative First Admitted)
8.3.2 Backward Elimination (Least Useful First Eliminated)
8.4 Individual Ranking
8.4.1 Information Content
8.4.2 SNR Criteria
8.5 Principal Component Analysis
8.6 Genetic Algorithms and Genetic Programming
8.7 Discussion and Summary
References
9 Differential Expression
9.1 Introduction
9.2 Fold Change
9.3 Statistical Hypothesis Testing – Overview
9.3.1 p-Values and Volcano Plots
9.3.2 The Multiple-hypothesis Testing Problem
9.4 Statistical Hypothesis Testing – Methods
9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
9.4.2 B-Statistic
9.4.3 Fisher’s Exact Test
9.4.4 Likelihood Ratio Test
9.4.5 Methods for Over-dispersed Poisson Distribution
9.5 Discussion and Summary
References
Part Four Clustering Methods
10 Clustering Forms
10.1 Introduction
10.2 Proximity Measures
10.2.1 Distance Metrics for Discrete Feature Objects
10.2.2 Distance Metrics for Continuous Feature Objects
10.3 Clustering Families
10.3.1 Partitional Clustering
10.3.2 Hierarchical Clustering
10.3.3 Fuzzy Clustering
10.3.4 Neural Network-based Clustering
10.3.5 Mixture Model Clustering
10.3.6 Graph-based Clustering
10.3.7 Consensus Clustering
10.3.8 Biclustering
10.4 Clusters and Partitions
10.5 Discussion and Summary
References
11 Partitional Clustering
11.1 Introduction
11.2 k-Means and its Applications
11.2.1 Principles
11.2.2 Variations
11.2.3 Applications
11.3 k-Medoids and its Applications
11.3.1 Principles
11.3.2 Variations
11.3.3 Applications
11.4 Discussion and Summary
References
12 Hierarchical Clustering
12.1 Introduction
12.2 Principles
12.2.1 Agglomerative Methods
12.2.2 Divisive Methods
12.3 Discussion and Summary
References
13 Fuzzy Clustering
13.1 Introduction
13.2 Principles
13.2.1 Fuzzy c-Means
13.2.2 Probabilistic c-Means
13.2.3 Hybrid c-Means
13.2.4 Gustafson–Kessel Algorithm
13.2.5 Gath–Geva Algorithm
13.2.6 Fuzzy c-Shell
13.2.7 FANNY
13.2.8 Other Fuzzy Clustering Algorithms
13.3 Discussion
References
14 Neural Network-based Clustering
14.1 Introduction
14.2 Algorithms
14.2.1 SOM
14.2.2 GLVQ
14.2.3 Neural-gas
14.2.4 ART
14.2.5 OPTOC
14.2.6 SOON
14.3 Discussion
References
15 Mixture Model Clustering
15.1 Introduction
15.2 Finite Mixture Models
15.2.1 Various Mixture Models
15.2.2 Non-Bayesian Methods
15.2.3 Bayesian Methods
15.3 Infinite Mixture Models
15.3.1 DPM Model
15.3.2 CRP Mixture Model
15.3.3 SBP Mixture Model
15.4 Discussion
References
16 Graph Clustering
16.1 Introduction
16.2 Basic Definitions
16.2.1 Graph and Adjacency Matrix
16.2.2 Measures and Metrics
16.2.3 Similarity Matrices
16.3 Graph Clustering
16.3.1 Graph Cut Clustering
16.3.2 Spectral Clustering
16.3.3 AP Clustering
16.3.4 Modularity-based Clustering
16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
16.3.6 Markov Cluster Algorithm
16.4 Resources
16.5 Discussion
References
17 Consensus Clustering
17.1 Introduction
17.2 Overview
17.3 Consensus Functions
17.3.1 P–P Comparison
17.3.2 C–C Comparison
17.3.3 MIC Voting
17.3.4 M–M Co-occurrence
17.4 Discussion
References
18 Biclustering
18.1 Introduction
18.2 Overview
18.2.1 Statement of the Biclustering Problem
18.2.2 Types of Biclusters
18.2.3 Classification of Biclustering
18.3 Biclustering Methods
18.3.1 Variance-minimisation Biclustering Methods
18.3.2 Correlation-maximisation Biclustering Methods
18.3.3 Two-way Clustering Methods
18.3.4 Probabilistic and Generative Methods
18.4 Discussion
References
19 Clustering Methods Discussion
19.1 Introduction
19.2 Hierarchical Clustering
19.2.1 Yeast Cell Cycle Data
19.2.2 Breast Cancer
19.2.3 Diffuse Large B-Cell Lymphoma
19.3 Fuzzy Clustering
19.3.1 DNA Motifs Clustering
19.3.2 Microarray Gene Expression
19.4 Neural Network-based Clustering
19.5 Mixture Model-based Clustering
19.5.1 Examples of Finite Mixture Models
19.5.2 Examples of Infinite Mixture Models
19.6 Graph-based Clustering
19.7 Consensus Clustering
19.8 Biclustering
19.9 Summary
References
Part Five Validation and Visualisation
20 Numerical Validation
20.1 Introduction
20.2 External Criteria
20.2.1 Rand Index
20.2.2 Adjusted Rand Index
20.2.3 Jaccard Index
20.2.4 Normalised Mutual Information
20.3 Internal Criteria
20.3.1 Adjusted Figure of Merit
20.3.2 CLEST
20.4 Relative Criteria
20.4.1 Minimum Description Length
20.4.2 Minimum Message Length
20.4.3 Bayesian Information Criterion
20.4.4 Akaike’s Information Criterion
20.4.5 Partition Coefficient
20.4.6 Partition Entropy
20.4.7 Fukuyama–Sugeno Index
20.4.8 Xie–Beni Index
20.4.9 Calinski–Harabasz Index
20.4.10 Dunn’s Index
20.4.11 Davies–Bouldin Index
20.4.12 I Index
20.4.13 Silhouette
20.4.14 Object-based Validation
20.4.15 Geometrical Index
20.4.16 Validity Index
20.4.17 Generalised Parametric Validity
20.5 Discussion and Summary
References
21 Biological Validation
21.1 Introduction
21.2 GO Analysis
21.2.1 GO Term Enrichment
21.3 Upstream Sequence Analysis
21.4 Gene-network Analysis
21.5 Discussion and Summary
References
22 Visualisations and Presentations
22.1 Introduction
22.2 Methods and Examples
22.2.1 Profile Patterns
22.2.2 Bar Chart
22.2.3 Error Bar
22.2.4 Pie Chart
22.2.5 Box Plot
22.2.6 Histogram
22.2.7 Scatter Plot
22.2.8 Venn Diagram
22.2.9 Tree View
22.2.10 Heat Map
22.2.11 Network Graph
22.2.12 Low-dimension Display
22.2.13 Receiver Operating Characteristics Curves
22.2.14 Kaplan–Meier Plot
22.2.15 Block Diagram
22.3 Summary
References
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
23.1 Introduction
23.2 Related Work
23.2.1 SSCL
23.2.2 SSMCL
23.2.3 ULFMM
23.2.4 VBGM
23.2.5 PFClust
23.2.6 DBSCAN
23.3 SMART Framework
23.4 Implementations
23.4.1 SMART-CL
23.4.2 SMART-FMM
23.4.3 SMART-MFA
23.5 Enhanced SMART
23.6 Examples
23.7 Discussion
References
24 Tightness-tunable Clustering (UNCLES)
24.1 Introduction
24.2 Bi-CoPaM Method
24.2.1 Partition Generation
24.2.2 Relabelling
24.2.3 Fuzzy Consensus Partition Matrix Generation
24.2.4 Binarisation
24.3 UNCLES Method - Other Types of External Specifications
24.4 M–N Scatter Plots Technique
24.5 Parameter-free UNCLES with M–N Plots
24.6 Discussion and Summary
References
Appendix

Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers; researchers in related fields should be able to get a broad perspective on what has been achieved; and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected
collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015
List of Symbols
X    Data matrix
x_n    The nth data object
S    Similarity matrix
N    Number of data objects
M    Number of features or samples
K    Number of clusters
D    Dissimilarity between two input data objects
S    Similarity between two input data objects
Z    Partition (vector sequence with the length of N)
U    Partition matrix (with N rows and K columns)
u_kn    Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C    Cluster set containing K clusters C_1, …, C_K
C_k    The kth cluster, containing the indices of those data objects belonging to it
c_k    The centroid of the kth cluster
κ    Kernel function
m    The fuzzifier
μ    The mean vector
Σ    The covariance matrix
Σ    The within-cluster covariance matrix
T    The matrix transpose operator
−1    The matrix inverse
det    The determinant operator
exp    The exponential function
The cardinality of a set
The Euclidean norm
Θ    The total parameter set
G    Number of groups
τ    The mixing parameter
The likelihood function
A    The adjacency matrix
B    The incident matrix
D    The degree matrix
L    The Laplacian matrix
G    A graph
V    The set of vertices
E    The set of edges
M    The modularity matrix
Q    The modularity value
N_V    Number of vertices
N_E    Number of edges
θ    A smaller parameter set
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.
Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.
Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion; that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has been rather generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly are the transcriptome, metabolome, glycome and lipidome for the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids. In a respective order, large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
Brief Contents
Preface
List of Symbols
About the Authors

Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics

Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology

Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression

Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion

Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations

Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)

Appendix
Index
Contents
Preface
List of Symbols
About the Authors

Part One Introduction
1 Introduction to Bioinformatics
1.1 Introduction
1.2 The “Omics” Era
1.3 The Scope of Bioinformatics
1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
1.3.2 Data Storage, Retrieval and Organisation
1.3.3 Data Analysis
1.3.4 Statistical Analysis
1.3.5 Presentation
1.4 What Do Information Engineers and Biologists Need to Know?
1.5 Discussion and Summary
References
2 Computational Methods in Bioinformatics
2.1 Introduction
2.2 Machine Learning and Data Mining
2.2.1 Supervised Learning
2.2.2 Unsupervised Learning
2.3 Optimisation
2.4 Image Processing: Bioimage Informatics
2.5 Network Analysis
2.6 Statistical Analysis
2.7 Software Tools and Technologies
2.8 Discussion and Summary
References

Part Two Introduction to Molecular Biology
223 Summary 359References 361
Part Six New Clustering Frameworks Designed for Bioinformatics 363
23 Splitting-Merging Awareness Tactics (SMART) 365231 Introduction 365232 Related Work 366
2321 SSCL 3662322 SSMCL 3672323 ULFMM 3672324 VBGM 3672325 PFClust 3682326 DBSCAN 368
233 SMART Framework 369234 Implementations 370
2341 SMART-CL 3702342 SMART-FMM 3722343 SMART-MFA 374
235 Enhanced SMART 377236 Examples 378237 Discussion 383
References 383
24 Tightness-tunable Clustering (UNCLES) 385241 Introduction 385242 Bi-CoPaM Method 385
2421 Partition Generation 386
xviiContents
2422 Relabelling 3862423 Fuzzy Consensus Partition Matrix Generation 3882424 Binarisation 388
243 UNCLES Method - Other Types of External Specifications 390244 MndashN Scatter Plots Technique 391245 Parameter-free UNCLES with MndashN Plots 393246 Discussion and Summary 393
References 394
Appendix 395
Index 409
xviii Contents
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part on molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected
collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X       Data matrix
x_n     The nth data object
S       Similarity matrix
N       Number of data objects
M       Number of features or samples
K       Number of clusters
D       Dissimilarity between two input data objects
S       Similarity between two input data objects
Z       Partition (vector sequence with the length of N)
U       Partition matrix (with N rows and K columns)
u_kn    Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C       Cluster set containing K clusters C_1, …, C_K
C_k     The kth cluster, containing the indices of those data objects belonging to it
c_k     The centroid of the kth cluster
κ       Kernel function
m       The fuzzifier
μ       The mean vector
Σ       The covariance matrix
Σ       The within-cluster covariance matrix
T       The matrix transpose operator
−1      The matrix inverse
det     The determinant operator
exp     The exponential function
|·|     The cardinality of a set
‖·‖     The Euclidean norm
Θ       The total parameter set
G       Number of groups
τ       The mixing parameter
ℒ       The likelihood function
A       The adjacency matrix
B       The incidence matrix
D       The degree matrix
L       The Laplacian matrix
G       A graph
V       The set of vertices
E       The set of edges
M       The modularity matrix
Q       The modularity value
N_V     Number of vertices
N_E     Number of edges
θ       A smaller parameter set
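As a small illustration of how a few of these symbols relate to one another, the sketch below builds a crisp partition matrix U from a partition vector Z, and a degree matrix D and Laplacian matrix L from an adjacency matrix A. It is not taken from the book: the toy values, the helper names, the 0-based cluster labels and the choice of the unnormalised Laplacian L = D − A are all assumptions made for illustration.

```python
# Illustrative sketch only; toy data and helper names are hypothetical.

def partition_matrix(z, k):
    """Build a crisp N-by-K partition matrix U from a partition vector Z.

    z[n] is the (0-based) cluster label of the nth data object, so
    U[n][j] is 1 if object n belongs to cluster j and 0 otherwise.
    """
    return [[1 if z[n] == j else 0 for j in range(k)] for n in range(len(z))]


def degree_and_laplacian(a):
    """Return the degree matrix D and the unnormalised Laplacian L = D - A.

    Assumes `a` is a square adjacency matrix given as a list of rows.
    """
    n = len(a)
    d = [[sum(a[i]) if i == j else 0 for j in range(n)] for i in range(n)]
    lap = [[d[i][j] - a[i][j] for j in range(n)] for i in range(n)]
    return d, lap


# N = 5 data objects assigned to K = 3 clusters.
Z = [0, 0, 1, 2, 1]
K = 3
U = partition_matrix(Z, K)

# A small undirected graph: vertex 0 is connected to vertices 1 and 2.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
D, L = degree_and_laplacian(A)
```

In a crisp partition every row of U sums to one; a fuzzy partition matrix would instead hold membership degrees between 0 and 1 in the same N-by-K layout.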
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The "Omics" Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is "-omics", and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics, in the same order. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
collection of software and links to publicly available datasets can be accessed by using thefollowing URL httpscodegooglecompintegrative-cluster-bioinformaticsA work of this magnitude will unfortunately contain errors and omissions We would like to
take this opportunity to apologise unreservedly for all such indiscretions in advance We wouldwelcome comments and corrections please send them by email to aknandiieeeorg or byany other means
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
T        The matrix transpose operator
−1       The matrix inverse
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
ℒ        The likelihood function
A        The adjacency matrix
B        The incidence matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set
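As an informal illustration of how several of the symbols listed here relate to one another, the following Python sketch builds a partition matrix U and a cluster set C from a partition vector Z, and computes a centroid c_k from the data matrix X. It is only a sketch under stated assumptions: the function names, the example values and the zero-based cluster labels are hypothetical and do not come from the book, and the entry written u_kn in the list is stored as `u[n][k]` to follow the "N rows and K columns" description.

```python
# Illustrative only: names, data and 0-based labels are assumptions, not the book's code.

def partition_vector_to_matrix(z, k):
    """Turn a crisp partition vector Z (length N, labels 0..K-1) into a
    binary partition matrix U with N rows and K columns."""
    u = [[0] * k for _ in range(len(z))]
    for n, label in enumerate(z):
        if not 0 <= label < k:
            raise ValueError("cluster label out of range")
        u[n][label] = 1  # object n is a member of cluster `label`
    return u


def clusters_from_partition(z, k):
    """Build the cluster set C: each C_k holds the indices of its members."""
    clusters = [[] for _ in range(k)]
    for n, label in enumerate(z):
        clusters[label].append(n)
    return clusters


def centroid(x, members):
    """Centroid c_k: feature-wise mean of the rows of X indexed by `members`."""
    num_features = len(x[0])
    return [sum(x[n][j] for n in members) / len(members)
            for j in range(num_features)]


# Hypothetical data matrix X with N = 4 objects and M = 2 features.
X = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
Z = [0, 0, 1, 1]   # partition vector of length N
K = 2              # number of clusters

U = partition_vector_to_matrix(Z, K)
C = clusters_from_partition(Z, K)
centroids = [centroid(X, members) for members in C]
```

A fuzzy partition would use the same N-by-K layout for U, with real-valued memberships in place of the 0/1 entries.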
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, becoming involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK), and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One
Introduction

1
Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology, and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has been rather generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
Contents
Preface
List of Symbols
About the Authors

Part One Introduction

1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 The “Omics” Era
  1.3 The Scope of Bioinformatics
    1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
    1.3.2 Data Storage, Retrieval and Organisation
    1.3.3 Data Analysis
    1.3.4 Statistical Analysis
    1.3.5 Presentation
  1.4 What Do Information Engineers and Biologists Need to Know?
  1.5 Discussion and Summary
  References

2 Computational Methods in Bioinformatics
  2.1 Introduction
  2.2 Machine Learning and Data Mining
    2.2.1 Supervised Learning
    2.2.2 Unsupervised Learning
  2.3 Optimisation
  2.4 Image Processing: Bioimage Informatics
  2.5 Network Analysis
  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References

Part Two Introduction to Molecular Biology

3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References

4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology: Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References

Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References

6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References

7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References

8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References

9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher’s Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References

Part Four Clustering Methods

10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References

11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References

12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References

13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike’s Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn’s Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method – Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index
List of Symbols
X Data matrixxn The nth data objectS Similarity matrixN Number of data objectsM Number of features or samplesK Number of clustersD Dissimilarity between two input data objectsS Similarity between two input data objectsZ Partition (vector sequence with the length of N)U Partition matrix (with N rows and K columns)ukn Entry in the partition matrix indicating the membership of the nth data object in the
kth clusterC Cluster set containing K clusters C1hellip CKCk The kth cluster containing the indices of those data objects belonging to itck The centroid of the kth clusterκ Kernel functionm The fuzzifierμ The mean vectorΣΣ
The covariance matrixThe within-cluster covariance matrix
T The matrix transpose operatorminus1 The matrix inverse
det The determinant operatorexp The exponential function
The cardinality of a setThe Euclidean norm
Θ The total parameter set
G Number of groupsτ The mixing parameter
The likelihood functionA The adjacency matrixB The incident matrixD The degree matrixL The Laplacian matrixG A graphV The set of verticesE The set of edgesM The modularity matrixQ The modularity valueNV Number of verticesNE Number of edgesθ A smaller parameter set
xxii List of Symbols
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University ofJordan Amman Jordan in 2010 He received his MSc degree in information and intelligenceengineering from the University of Liverpool Liverpool UK in 2011 He was awarded the SirRobin Saxby Prize for the 20102011 academic year based on his performance in his MScdegree Mr Abu-Jamous started his PhD degree in electrical engineering and electronics atthe University of Liverpool in 2011 and then transferred with his supervisor Professor AsokeK Nandi to complete his PhD degree at Brunel University London UK He was awarded aprize in the posters section of the 6th Annual Student Research Conference 2013 at BrunelUniversity London Currently he is a research assistant at the Department of Electronic andComputer Engineering at Brunel University London a position to which he was appointedin January 2015 Mr Abu-Jamous has authored or co-authored several journal and internationalconference papers His research interests include bioinformatics computational biology andthe broader areas of information engineering and machine learning
Dr Rui Fa received his PhD degree in electrical engineering and information systems fromthe University of Newcastle UK in 2007 From January 2008 to September 2010 he heldresearch positions in the University of York and the University of Leeds working in radarsignal processing and wireless communication projects From October 2010 Dr Fa extendedhis research fields to bioinformatics and computational biology and joined the University ofLiverpool involving in a collaborative research project with the Universities of OxfordCambridge and Bristol which is funded by the National Institute for Health Research (NIHR)Dr Fa joined Brunel University London as a senior research fellow in 2013 His current researchinterests include bioinformatics and computational biology systems biology machine learn-ing Bayesian statistics statistical signal processing and network science Dr Fa has authoredand co-authored more than 60 peer-reviewed journal and conference papers
Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.
In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One Introduction
1 Introduction to Bioinformatics
1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.
The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.
The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.
Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The "Omics" Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is "-omics", and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.
The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology) and others.
All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.
More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References
Part Two Introduction to Molecular Biology
3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References
4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References
Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References
6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References
7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References
8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References
9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher's Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References
Part Four Clustering Methods
10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References
11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References
12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References
13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References
14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References
15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References
16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References
17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References
18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References
19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References
Part Five Validation and Visualisation
20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike's Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn's Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References
21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References
22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References
Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References
24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method – Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References
Appendix
Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.
Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.
This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to acquire the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, receive particular attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.
There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.
As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.
Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics
A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X  Data matrix
x_n  The nth data object
S  Similarity matrix
N  Number of data objects
M  Number of features or samples
K  Number of clusters
D  Dissimilarity between two input data objects
S  Similarity between two input data objects
Z  Partition (vector sequence with the length of N)
U  Partition matrix (with N rows and K columns)
u_kn  Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C  Cluster set containing K clusters C_1, …, C_K
C_k  The kth cluster, containing the indices of those data objects belonging to it
c_k  The centroid of the kth cluster
κ  Kernel function
m  The fuzzifier
μ  The mean vector
Σ  The covariance matrix
Σ  The within-cluster covariance matrix
T  The matrix transpose operator
−1  The matrix inverse
det  The determinant operator
exp  The exponential function
|·|  The cardinality of a set
‖·‖  The Euclidean norm
Θ  The total parameter set
G  Number of groups
τ  The mixing parameter
The likelihood function
A  The adjacency matrix
B  The incidence matrix
D  The degree matrix
L  The Laplacian matrix
G  A graph
V  The set of vertices
E  The set of edges
M  The modularity matrix
Q  The modularity value
N_V  Number of vertices
N_E  Number of edges
θ  A smaller parameter set
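To make the partition notation concrete, the following minimal sketch (not taken from the book) builds a crisp partition matrix U with N rows and K columns from a partition vector Z of length N, and then recovers the index sets of the K clusters. It assumes NumPy, zero-based integer cluster labels and a hypothetical helper named partition_matrix; the book's own indexing conventions may well differ.

```python
import numpy as np


def partition_matrix(z, k):
    """Return the crisp N-by-K partition matrix U for a label vector Z.

    U[n, j] is 1 when data object n is assigned to cluster j, otherwise 0.
    Assumes integer labels in 0..k-1 (a convention chosen for this sketch).
    """
    z = np.asarray(z, dtype=int)
    if z.ndim != 1 or z.size == 0 or z.min() < 0 or z.max() >= k:
        raise ValueError("z must be a non-empty 1-D vector of labels in 0..k-1")
    u = np.zeros((z.size, k), dtype=int)
    u[np.arange(z.size), z] = 1  # exactly one membership per data object
    return u


Z = [0, 2, 1, 0, 2]          # partition vector for N = 5 data objects
K = 3                        # number of clusters
U = partition_matrix(Z, K)   # N rows, K columns

# Index sets of the clusters: the data objects belonging to each cluster
clusters = [np.flatnonzero(U[:, j]).tolist() for j in range(K)]
print(U)
print(clusters)  # [[0, 3], [2], [1, 4]]
```

In a crisp partition each row of U sums to one; a fuzzy partition matrix would instead hold membership degrees between 0 and 1 in the same N-by-K layout.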
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University ofJordan Amman Jordan in 2010 He received his MSc degree in information and intelligenceengineering from the University of Liverpool Liverpool UK in 2011 He was awarded the SirRobin Saxby Prize for the 20102011 academic year based on his performance in his MScdegree Mr Abu-Jamous started his PhD degree in electrical engineering and electronics atthe University of Liverpool in 2011 and then transferred with his supervisor Professor AsokeK Nandi to complete his PhD degree at Brunel University London UK He was awarded aprize in the posters section of the 6th Annual Student Research Conference 2013 at BrunelUniversity London Currently he is a research assistant at the Department of Electronic andComputer Engineering at Brunel University London a position to which he was appointedin January 2015 Mr Abu-Jamous has authored or co-authored several journal and internationalconference papers His research interests include bioinformatics computational biology andthe broader areas of information engineering and machine learning
Dr Rui Fa received his PhD degree in electrical engineering and information systems fromthe University of Newcastle UK in 2007 From January 2008 to September 2010 he heldresearch positions in the University of York and the University of Leeds working in radarsignal processing and wireless communication projects From October 2010 Dr Fa extendedhis research fields to bioinformatics and computational biology and joined the University ofLiverpool involving in a collaborative research project with the Universities of OxfordCambridge and Bristol which is funded by the National Institute for Health Research (NIHR)Dr Fa joined Brunel University London as a senior research fellow in 2013 His current researchinterests include bioinformatics and computational biology systems biology machine learn-ing Bayesian statistics statistical signal processing and network science Dr Fa has authoredand co-authored more than 60 peer-reviewed journal and conference papers
Professor Asoke K Nandi joined Brunel University London in April 2013 as the Head ofElectronic and Computer Engineering He received a PhD from the University of Cambridge
and since then has worked in many institutions including CERN Geneva SwitzerlandUniversity of Oxford Oxford UK Imperial College London UK University of StrathclydeGlasgow UK and University of Liverpool Liverpool UK His research spans many differenttopics including bioinformatics communications machine learning (feature selection featuregeneration classification clustering and pattern recognition) and signal processingIn 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three
fundamental particles known as W+ Wminus and Z0 providing the evidence for the unificationof the electromagnetic and weak forces which was recognized by the Nobel Committee forPhysics in 1984 He has been honoured with the Fellowships of the Royal Academy ofEngineering (UK) and the Institute of Electrical and Electronics Engineers (USA) He is aFellow of five other professional institutions including the Institute of Physics (UK) theInstitute of Mathematics and its Applications (UK) the British Computer Society (UK) theInstitution of Mechanical Engineers (UK) and the Institution of Engineering and Technology(UK) His publications have been cited more than 16 000 times and his h-index is 60 (GoogleScholar)
xxiv About the Authors
Part OneIntroduction
1Introduction to Bioinformatics
11 Introduction
Interesting research fields emerge through the collaboration of researchers from differentsometimes distant disciplines Examples include biochemistry biophysics quantum informa-tion science systems engineering mechatronics business information systems managementinformation systems geophysics biomedical engineering cybernetics art history media tech-nology and others This marriage between disciplines yields findings which blend the views ofdifferent areas over the same subject or set of dataThe stimuli leading to such collaborations are numerous For example one discipline may
develop tools that generate types of data that require another discipline to analyse In othercases one field scratches a layer of unknowns to discover that significant parts of its scopeare actually based on the principles of another field such as the low-level biological studiesof the chemical interactions in the cells which delivered biochemistry as an interdisciplinaryfield Other interdisciplinary fields emerged because of their complementary involvement inbuilding different parts of the same target system or in understanding different sides of the sameresearch question for example mechatronics engineering aims at building systems which haveboth mechanical and electronic parts such as all modern automobiles Interdisciplinary areaslike business information systems and management information systems have emerged due tothe high demand for information systems which target business and management aspectsalthough generic information systems would meet many of those requirements a customisedfield focusing on such applications is indeed more efficient given such high demandThe interdisciplinary field of this bookrsquos focus is bioinformatics The motive behind this
fieldrsquos emergence is the increasingly expanding generation of massive raw biological data fol-lowing the developments in high-throughput techniques in the last couple of decades The scaleof this high-throughput data is orders of magnitude higher than what can be efficiently analysed
Integrative Cluster Analysis in Bioinformatics First Edition Basel Abu-Jamous Rui Fa and Asoke K Nandicopy 2015 John Wiley amp Sons Ltd Published 2015 by John Wiley amp Sons Ltd
in a manual fashion Consequently information engineers were recruited in order to contributeto data analysis by employing their computational methods Cycles of computational analysissharing of results interdisciplinary discussions and abstractions have led and are still leadingto many key discoveries in biology and medicine This success has attracted many informationengineers towards biology and many biologists towards information engineering to meet ina potentially rich intersection area which itself has grown in size to establish the field ofbioinformatics
12 The ldquoOmicsrdquo Era
  4.9 Discussion and Summary
  References
Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References

6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References

7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References

8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References

9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher's Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References
Part Four Clustering Methods

10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References

11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References

12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References

13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References
Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike's Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn's Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References
Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation by many research groups around the world. Indeed, there has been much activity in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part on molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing such as normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods which can deal with specific problems that appear in biological datasets are given much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected
collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
T        The matrix transpose operator
−1       The matrix inverse
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
ℒ        The likelihood function
A        The adjacency matrix
B        The incidence matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set
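As a rough illustration of how a few of the symbols above relate to one another, the following Python sketch builds a crisp partition matrix U from a partition vector Z, and a Laplacian matrix L from an adjacency matrix A. The toy values, the 0-based cluster labels, the helper names partition_matrix and laplacian, and the unnormalised definition L = D − A are assumptions made for this example only; the book may define or normalise these quantities differently in later chapters.

```python
# Illustrative sketch only: toy data and hypothetical helper names.

def partition_matrix(z, k):
    """Convert a partition vector Z (length N, labels 0..K-1) into a crisp
    N-by-K partition matrix U with exactly one 1 per row."""
    return [[1 if label == j else 0 for j in range(k)] for label in z]


def laplacian(a):
    """Unnormalised graph Laplacian L = D - A (an assumed, common definition),
    where D is the diagonal degree matrix of the adjacency matrix A."""
    n = len(a)
    degrees = [sum(row) for row in a]  # diagonal entries of D
    return [
        [(degrees[i] if i == j else 0) - a[i][j] for j in range(n)]
        for i in range(n)
    ]


# N = 4 data objects assigned to K = 2 clusters.
Z = [0, 1, 1, 0]
U = partition_matrix(Z, 2)

# Undirected path graph on N_V = 3 vertices with N_E = 2 edges: 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
L = laplacian(A)

for row in U:
    print(row)
for row in L:
    print(row)
```

Each row of U sums to 1 because every object belongs to exactly one cluster in a crisp partition, and each row of L sums to 0 because the degree on the diagonal cancels the adjacency entries in that row.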
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011 and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK), and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One
Introduction

1
Introduction to Bioinformatics

1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The "Omics" Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is "-omics", and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics, respectively. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets, which call for the involvement of information engineering and therefore lie at the core of bioinformatics research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
721 Within-slide and Between-slides Normalisation 88722 Normalisation Based on Non-differentially Expressed Genes 88723 Background Correction 89724 Logarithmic Transformation 90725 Intensity-dependent Bias ndash (MA) Plots 91726 Replicates and Summarisation 91
73 Normalisation Methods 92731 Microarray Suite 5 (MAS 50) 92732 Robust Multi-array Average (RMA) 92733 Quantile Normalisation 95734 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation 96735 Scaling Methods 98736 Model-based Expression Index (MBEI) 99737 Other Normalisation Methods 100
74 Discussion and Summary 103References 104
8 Feature Selection 10981 Introduction 10982 FS and FG ndash Problem Definition 11083 Consecutive Ranking 111
831 Forward Search (Most Informative First Admitted) 111832 Backward Elimination (Least Useful First Eliminated) 112
84 Individual Ranking 112841 Information Content 112842 SNR Criteria 113
85 Principal Component Analysis 11586 Genetic Algorithms and Genetic Programming 11587 Discussion and Summary 115
References 117
9 Differential Expression 11991 Introduction 11992 Fold Change 12093 Statistical Hypothesis Testing ndash Overview 120
931 p-Values and Volcano Plots 121932 The Multiple-hypothesis Testing Problem 121
94 Statistical Hypothesis Testing ndash Methods 123941 t-Statistic Modified t-Statistics and the Analysis
of Variance (ANOVA) 123942 B-Statistic 125943 Fisherrsquos Exact Test 126944 Likelihood Ratio Test 128945 Methods for Over-dispersed Poisson Distribution 129
95 Discussion and Summary 129References 131
xii Contents
Part Four Clustering Methods 133
10 Clustering Forms 135101 Introduction 135102 Proximity Measures 136
1021 Distance Metrics for Discrete Feature Objects 1361022 Distance Metrics for Continuous Feature Objects 137
103 Clustering Families 1371031 Partitional Clustering 1371032 Hierarchical Clustering 1381033 Fuzzy Clustering 1391034 Neural Network-based Clustering 1391035 Mixture Model Clustering 1391036 Graph-based Clustering 1391037 Consensus Clustering 1401038 Biclustering 140
104 Clusters and Partitions 140105 Discussion and Summary 140
References 141
11 Partitional Clustering 143111 Introduction 143112 k-Means and its Applications 144
1121 Principles 1441122 Variations 1461123 Applications 150
113 k-Medoids and its Applications 1511131 Principles 1511132 Variations 1511133 Applications 152
114 Discussion and Summary 153References 154
12 Hierarchical Clustering 157121 Introduction 157122 Principles 158
1221 Agglomerative Methods 1581222 Divisive Methods 162
123 Discussion and Summary 164References 165
13 Fuzzy Clustering 167131 Introduction 167132 Principles 168
1321 Fuzzy c-Means 1681322 Probabilistic c-Means 170
xiiiContents
1323 Hybrid c-Means 1711324 GustafsonndashKessel Algorithm 1741325 GathndashGeva Algorithm 1741326 Fuzzy c-Shell 1751327 FANNY 1761328 Other Fuzzy Clustering Algorithms 176
133 Discussion 177References 177
14 Neural Network-based Clustering 181141 Introduction 181142 Algorithms 182
1421 SOM 1821422 GLVQ 1841423 Neural-gas 1851424 ART 1861425 OPTOC 1891426 SOON 190
143 Discussion 193References 194
15 Mixture Model Clustering 197151 Introduction 197152 Finite Mixture Models 199
1521 Various Mixture Models 1991522 Non-Bayesian Methods 2041523 Bayesian Methods 212
153 Infinite Mixture Models 2201531 DPM Model 2201532 CRP Mixture Model 2221533 SBP Mixture Model 222
154 Discussion 223References 224
16 Graph Clustering 227161 Introduction 227162 Basic Definitions 228
1621 Graph and Adjacency Matrix 2281622 Measures and Metrics 2291623 Similarity Matrices 232
163 Graph Clustering 2331631 Graph Cut Clustering 2331632 Spectral Clustering 2341633 AP Clustering 2361634 Modularity-based Clustering 238
xiv Contents
1635 Multilevel Graph Partitioning and Hypergraph Partitioning 2421636 Markov Cluster Algorithm 243
164 Resources 243165 Discussion 244
References 245
17 Consensus Clustering 247171 Introduction 247172 Overview 248173 Consensus Functions 249
1731 PndashP Comparison 2491732 CndashC Comparison 2531733 MIC Voting 2541734 MndashM Co-occurrence 257
174 Discussion 261References 262
18 Biclustering 265181 Introduction 265182 Overview 266
1821 Statement of the Biclustering Problem 2661822 Types of Biclusters 2661823 Classification of Biclustering 267
183 Biclustering Methods 2681831 Variance-minimisation Biclustering Methods 2681832 Correlation-maximisation Biclustering Methods 2701833 Two-way Clustering Methods 2731834 Probabilistic and Generative Methods 274
184 Discussion 278References 280
19 Clustering Methods Discussion 283191 Introduction 283192 Hierarchical Clustering 283
1921 Yeast Cell Cycle Data 2841922 Breast Cancer 2841923 Diffuse Large B-Cell Lymphoma 286
193 Fuzzy Clustering 2871931 DNA Motifs Clustering 2871932 Microarray Gene Expression 287
194 Neural Network-based Clustering 289195 Mixture Model-based Clustering 290
1951 Examples of Finite Mixture Models 2911952 Examples of Infinite Mixture Models 292
196 Graph-based Clustering 293
xvContents
197 Consensus Clustering 295198 Biclustering 296199 Summary 297
References 298
Part Five Validation and Visualisation 303
20 Numerical Validation 305201 Introduction 305202 External Criteria 306
2021 Rand Index 3062022 Adjusted Rand Index 3072023 Jaccard Index 3082024 Normalised Mutual Information 308
203 Internal Criteria 3082031 Adjusted Figure of Merit 3082032 CLEST 309
204 Relative Criteria 3102041 Minimum Description Length 3102042 Minimum Message Length 3112043 Bayesian Information Criterion 3112044 Akaikersquos Information Criterion 3122045 Partition Coefficient 3122046 Partition Entropy 3122047 FukuyamandashSugeno Index 3122048 XiendashBeni Index 3132049 CalinskindashHarabasz Index 31320410 Dunnrsquos Index 31320411 DaviesndashBouldin Index 31420412 I Index 31420413 Silhouette 31420414 Object-based Validation 31520415 Geometrical Index 31620416 Validity Index 31620417 Generalised Parametric Validity 317
205 Discussion and Summary 318References 320
21 Biological Validation 323211 Introduction 323212 GOAnalysis 323
2121 GO Term Enrichment 326213 Upstream Sequence Analysis 331214 Gene-network Analysis 333215 Discussion and Summary 337
References 338
xvi Contents
22 Visualisations and Presentations 339221 Introduction 339222 Methods and Examples 339
2221 Profile Patterns 3392222 Bar Chart 3412223 Error Bar 3422224 Pie Chart 3452225 Box Plot 3452226 Histogram 3462227 Scatter Plot 3472228 Venn Diagram 3482229 Tree View 35022210 Heat Map 35322211 Network Graph 35422212 Low-dimension Display 35522213 Receiver Operating Characteristics Curves 35522214 KaplanndashMeier Plot 35622215 Block Diagram 358
223 Summary 359References 361
Part Six New Clustering Frameworks Designed for Bioinformatics 363
23 Splitting-Merging Awareness Tactics (SMART) 365231 Introduction 365232 Related Work 366
2321 SSCL 3662322 SSMCL 3672323 ULFMM 3672324 VBGM 3672325 PFClust 3682326 DBSCAN 368
233 SMART Framework 369234 Implementations 370
2341 SMART-CL 3702342 SMART-FMM 3722343 SMART-MFA 374
235 Enhanced SMART 377236 Examples 378237 Discussion 383
References 383
24 Tightness-tunable Clustering (UNCLES) 385241 Introduction 385242 Bi-CoPaM Method 385
2421 Partition Generation 386
xviiContents
2422 Relabelling 3862423 Fuzzy Consensus Partition Matrix Generation 3882424 Binarisation 388
243 UNCLES Method - Other Types of External Specifications 390244 MndashN Scatter Plots Technique 391245 Parameter-free UNCLES with MndashN Plots 393246 Discussion and Summary 393
References 394
Appendix 395
Index 409
xviii Contents
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected
collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X  Data matrix
x_n  The nth data object
S  Similarity matrix
N  Number of data objects
M  Number of features or samples
K  Number of clusters
D  Dissimilarity between two input data objects
S  Similarity between two input data objects
Z  Partition (vector sequence with the length of N)
U  Partition matrix (with N rows and K columns)
u_kn  Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C  Cluster set containing K clusters C_1, …, C_K
C_k  The kth cluster, containing the indices of those data objects belonging to it
c_k  The centroid of the kth cluster
κ  Kernel function
m  The fuzzifier
μ  The mean vector
Σ  The covariance matrix
Σ  The within-cluster covariance matrix
T  The matrix transpose operator
−1  The matrix inverse
det  The determinant operator
exp  The exponential function
|·|  The cardinality of a set
‖·‖  The Euclidean norm
Θ  The total parameter set
G  Number of groups
τ  The mixing parameter
The likelihood function
A  The adjacency matrix
B  The incidence matrix
D  The degree matrix
L  The Laplacian matrix
G  A graph
V  The set of vertices
E  The set of edges
M  The modularity matrix
Q  The modularity value
N_V  Number of vertices
N_E  Number of edges
θ  A smaller parameter set
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.
Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.
Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion: “-omics”, along with its relatives “-ome” and “-omic”. This started in the 1930s, when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
204 Relative Criteria 3102041 Minimum Description Length 3102042 Minimum Message Length 3112043 Bayesian Information Criterion 3112044 Akaikersquos Information Criterion 3122045 Partition Coefficient 3122046 Partition Entropy 3122047 FukuyamandashSugeno Index 3122048 XiendashBeni Index 3132049 CalinskindashHarabasz Index 31320410 Dunnrsquos Index 31320411 DaviesndashBouldin Index 31420412 I Index 31420413 Silhouette 31420414 Object-based Validation 31520415 Geometrical Index 31620416 Validity Index 31620417 Generalised Parametric Validity 317
205 Discussion and Summary 318References 320
21 Biological Validation 323211 Introduction 323212 GOAnalysis 323
2121 GO Term Enrichment 326213 Upstream Sequence Analysis 331214 Gene-network Analysis 333215 Discussion and Summary 337
References 338
xvi Contents
22 Visualisations and Presentations 339221 Introduction 339222 Methods and Examples 339
2221 Profile Patterns 3392222 Bar Chart 3412223 Error Bar 3422224 Pie Chart 3452225 Box Plot 3452226 Histogram 3462227 Scatter Plot 3472228 Venn Diagram 3482229 Tree View 35022210 Heat Map 35322211 Network Graph 35422212 Low-dimension Display 35522213 Receiver Operating Characteristics Curves 35522214 KaplanndashMeier Plot 35622215 Block Diagram 358
223 Summary 359References 361
Part Six New Clustering Frameworks Designed for Bioinformatics 363
23 Splitting-Merging Awareness Tactics (SMART) 365231 Introduction 365232 Related Work 366
2321 SSCL 3662322 SSMCL 3672323 ULFMM 3672324 VBGM 3672325 PFClust 3682326 DBSCAN 368
233 SMART Framework 369234 Implementations 370
2341 SMART-CL 3702342 SMART-FMM 3722343 SMART-MFA 374
235 Enhanced SMART 377236 Examples 378237 Discussion 383
References 383
24 Tightness-tunable Clustering (UNCLES) 385241 Introduction 385242 Bi-CoPaM Method 385
2421 Partition Generation 386
xviiContents
2422 Relabelling 3862423 Fuzzy Consensus Partition Matrix Generation 3882424 Binarisation 388
243 UNCLES Method - Other Types of External Specifications 390244 MndashN Scatter Plots Technique 391245 Parameter-free UNCLES with MndashN Plots 393246 Discussion and Summary 393
References 394
Appendix 395
Index 409
xviii Contents
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development ofmethods for recording organising and analysing biological data With the advent of humangenome-sequencing microarrays and high-performance computing developing software toolsto generate useful biological knowledge has been a major activityClustering techniques have been increasingly used in the analysis of high-throughput
biological datasets Although many generic clustering methods have been used successfullyto analyse biological datasets many specific properties of those datasets require customisedmethods which are specifically designed to meet such properties Therefore both aspects ndashthe design and the application of clustering methods in this field ndash have been under activeinvestigations and considerations by many research groups around the world Indeed therehave been many activities in assimilating the work in this field especially by designingstate-of-the-art methods which uniquely address various issues that are particularly relevantin biological dataThis book attempts to outline the complete pathway from the basics of molecular biology to
the generation of biological knowledge It is supplied with an introductory part to molecularbiology at a level which can be understood by researchers coming from a numeric backgroundsuch as computer scientists and information engineers The introductory part helps those read-ers to get introduced to the basic biological knowledge needed to appreciate the specific appli-cations of the methods in this book The book also explains the structure and properties of manytypes of high-throughput datasets commonly found in biological studies including public repo-sitoriesdatabases pre-processing like normalisation and the identification of differentiallyexpressed genes A major part of the book will cover various clustering methods and cluster-ing-validation techniques as well as their applications to biological datasets representing anintegrative analysis It should be remarked that not all clustering methods have been utilisedin bioinformatics yet Some of the most recent state-of-the-art clustering methods whichcan deal with specific problems that appear in biological datasets are paid much attentionespecially in how they and their possible successors could be used to enhance the pace ofbiological discoveries in the future Although proposed in the context of bioinformatics some
specialised sophisticated methods can also be used by other researchers to apply them to otheranalogous problems Therefore the general community of researchers in the field of machinelearning are also targeted by most of the contents of this bookThere are books mainly focusing on various aspects related to microarrays such as
biological experimental design image processing identification of differentially expressedgenes supervised classification etc while covering clustering analysis at a less thorough levelIn any case these tend to focus on clustering of microarray datasets rather than considering awider range of biological datasets Yet other books provide more thorough coverage ofclustering than the aforementioned books but they do not provide sufficient background inthe field of molecular biology for researchers coming from numeric backgrounds to appreciateand understand the origins of the available datasets These kinds of book tend to be mainlydata-clustering books and naturally belong to the computational side of bioinformatics ratherthan to the interface between the computational and the biological sides This book does sit atthis interface and goes far beyond those currently available in that it presents some biologicalpreliminaries as well as some state-of-the-art clustering methods that are specifically designedto suit specific issues which appear in biological datasetsAs the field is still developing such a book cannot be definitive or complete This book is
designed to target researchers in bioinformatics ranging from entry level researchers (eg seniorbachelor students and master students) to the most senior researchers (eg heads of researchgroups) It is hoped that graduate students should be able to learn enough basics before studyingjournal papers researchers in related fields should be able to get a broad perspective on whathas been achieved and current researchers in this field should be able to use it as a referenceFurther to the material provided in the book a companion website hosting a selected
collection of software and links to publicly available datasets can be accessed by using thefollowing URL httpscodegooglecompintegrative-cluster-bioinformaticsA work of this magnitude will unfortunately contain errors and omissions We would like to
take this opportunity to apologise unreservedly for all such indiscretions in advance We wouldwelcome comments and corrections please send them by email to aknandiieeeorg or byany other means
BASEL ABU-JAMOUS RUI FA AND ASOKE K NANDI
London UKFeb 2015
xx Preface
List of Symbols
X Data matrixxn The nth data objectS Similarity matrixN Number of data objectsM Number of features or samplesK Number of clustersD Dissimilarity between two input data objectsS Similarity between two input data objectsZ Partition (vector sequence with the length of N)U Partition matrix (with N rows and K columns)ukn Entry in the partition matrix indicating the membership of the nth data object in the
kth clusterC Cluster set containing K clusters C1hellip CKCk The kth cluster containing the indices of those data objects belonging to itck The centroid of the kth clusterκ Kernel functionm The fuzzifierμ The mean vectorΣΣ
The covariance matrixThe within-cluster covariance matrix
T The matrix transpose operatorminus1 The matrix inverse
det The determinant operatorexp The exponential function
The cardinality of a setThe Euclidean norm
Θ The total parameter set
G Number of groupsτ The mixing parameter
The likelihood functionA The adjacency matrixB The incident matrixD The degree matrixL The Laplacian matrixG A graphV The set of verticesE The set of edgesM The modularity matrixQ The modularity valueNV Number of verticesNE Number of edgesθ A smaller parameter set
xxii List of Symbols
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University ofJordan Amman Jordan in 2010 He received his MSc degree in information and intelligenceengineering from the University of Liverpool Liverpool UK in 2011 He was awarded the SirRobin Saxby Prize for the 20102011 academic year based on his performance in his MScdegree Mr Abu-Jamous started his PhD degree in electrical engineering and electronics atthe University of Liverpool in 2011 and then transferred with his supervisor Professor AsokeK Nandi to complete his PhD degree at Brunel University London UK He was awarded aprize in the posters section of the 6th Annual Student Research Conference 2013 at BrunelUniversity London Currently he is a research assistant at the Department of Electronic andComputer Engineering at Brunel University London a position to which he was appointedin January 2015 Mr Abu-Jamous has authored or co-authored several journal and internationalconference papers His research interests include bioinformatics computational biology andthe broader areas of information engineering and machine learning
Dr Rui Fa received his PhD degree in electrical engineering and information systems fromthe University of Newcastle UK in 2007 From January 2008 to September 2010 he heldresearch positions in the University of York and the University of Leeds working in radarsignal processing and wireless communication projects From October 2010 Dr Fa extendedhis research fields to bioinformatics and computational biology and joined the University ofLiverpool involving in a collaborative research project with the Universities of OxfordCambridge and Bristol which is funded by the National Institute for Health Research (NIHR)Dr Fa joined Brunel University London as a senior research fellow in 2013 His current researchinterests include bioinformatics and computational biology systems biology machine learn-ing Bayesian statistics statistical signal processing and network science Dr Fa has authoredand co-authored more than 60 peer-reviewed journal and conference papers
Professor Asoke K Nandi joined Brunel University London in April 2013 as the Head ofElectronic and Computer Engineering He received a PhD from the University of Cambridge
and since then has worked in many institutions including CERN Geneva SwitzerlandUniversity of Oxford Oxford UK Imperial College London UK University of StrathclydeGlasgow UK and University of Liverpool Liverpool UK His research spans many differenttopics including bioinformatics communications machine learning (feature selection featuregeneration classification clustering and pattern recognition) and signal processingIn 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three
fundamental particles known as W+ Wminus and Z0 providing the evidence for the unificationof the electromagnetic and weak forces which was recognized by the Nobel Committee forPhysics in 1984 He has been honoured with the Fellowships of the Royal Academy ofEngineering (UK) and the Institute of Electrical and Electronics Engineers (USA) He is aFellow of five other professional institutions including the Institute of Physics (UK) theInstitute of Mathematics and its Applications (UK) the British Computer Society (UK) theInstitution of Mechanical Engineers (UK) and the Institution of Engineering and Technology(UK) His publications have been cited more than 16 000 times and his h-index is 60 (GoogleScholar)
xxiv About the Authors
Part One
Introduction

1 Introduction to Bioinformatics

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology, and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions, and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.
1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion; that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has been rather generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue, or organism. Similarly are the transcriptome, metabolome, glycome, and lipidome for the complete sets of transcripts, metabolites, glycans (carbohydrates), and lipids. In a respective order, large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics, and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques, and tools which target storage, retrieval, organisation, analysis, and presentation of high-throughput biological data.
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike's Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn's Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising, and analysing biological data. With the advent of human genome-sequencing, microarrays, and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects, the design and the application of clustering methods in this field, have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA, AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X      Data matrix
x_n    The nth data object
S      Similarity matrix
N      Number of data objects
M      Number of features or samples
K      Number of clusters
D      Dissimilarity between two input data objects
S      Similarity between two input data objects
Z      Partition (vector sequence with the length of N)
U      Partition matrix (with N rows and K columns)
u_kn   Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C      Cluster set containing K clusters C_1, …, C_K
C_k    The kth cluster, containing the indices of those data objects belonging to it
c_k    The centroid of the kth cluster
κ      Kernel function
m      The fuzzifier
μ      The mean vector
Σ      The covariance matrix
Σ      The within-cluster covariance matrix
T      The matrix transpose operator
−1     The matrix inverse
det    The determinant operator
exp    The exponential function
|·|    The cardinality of a set
‖·‖    The Euclidean norm
Θ      The total parameter set
G      Number of groups
τ      The mixing parameter
       The likelihood function
A      The adjacency matrix
B      The incident matrix
D      The degree matrix
L      The Laplacian matrix
G      A graph
V      The set of vertices
E      The set of edges
M      The modularity matrix
Q      The modularity value
N_V    Number of vertices
N_E    Number of edges
θ      A smaller parameter set
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge, and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing, and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering, and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK), and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development ofmethods for recording organising and analysing biological data With the advent of humangenome-sequencing microarrays and high-performance computing developing software toolsto generate useful biological knowledge has been a major activityClustering techniques have been increasingly used in the analysis of high-throughput
biological datasets Although many generic clustering methods have been used successfullyto analyse biological datasets many specific properties of those datasets require customisedmethods which are specifically designed to meet such properties Therefore both aspects ndashthe design and the application of clustering methods in this field ndash have been under activeinvestigations and considerations by many research groups around the world Indeed therehave been many activities in assimilating the work in this field especially by designingstate-of-the-art methods which uniquely address various issues that are particularly relevantin biological dataThis book attempts to outline the complete pathway from the basics of molecular biology to
the generation of biological knowledge It is supplied with an introductory part to molecularbiology at a level which can be understood by researchers coming from a numeric backgroundsuch as computer scientists and information engineers The introductory part helps those read-ers to get introduced to the basic biological knowledge needed to appreciate the specific appli-cations of the methods in this book The book also explains the structure and properties of manytypes of high-throughput datasets commonly found in biological studies including public repo-sitoriesdatabases pre-processing like normalisation and the identification of differentiallyexpressed genes A major part of the book will cover various clustering methods and cluster-ing-validation techniques as well as their applications to biological datasets representing anintegrative analysis It should be remarked that not all clustering methods have been utilisedin bioinformatics yet Some of the most recent state-of-the-art clustering methods whichcan deal with specific problems that appear in biological datasets are paid much attentionespecially in how they and their possible successors could be used to enhance the pace ofbiological discoveries in the future Although proposed in the context of bioinformatics some
specialised, sophisticated methods can also be applied by other researchers to analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
T        The matrix transpose operator
−1       The matrix inverse
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
         The likelihood function
A        The adjacency matrix
B        The incidence matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set
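As a rough illustration of how some of the clustering symbols in this list relate to one another, the following minimal Python sketch derives a crisp partition matrix U, the cluster set C and the centroids c_k from a partition vector Z. The toy data values, the 0-based cluster labels and the function names (partition_matrix, cluster_sets, centroids) are assumptions made only for this example and are not taken from the book. Every cluster is assumed to be non-empty, and the centroid is taken to be the arithmetic mean of a cluster's members, which is one common choice rather than the only one.

```python
# Toy illustration of the notation; all values below are made up for the example.
X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]  # data matrix: N = 4 objects, M = 2 features
Z = [0, 0, 1, 1]  # partition vector of length N (0-based cluster label per object)
K = 2             # number of clusters


def partition_matrix(Z, K):
    """Crisp partition matrix U with N rows and K columns.

    U[n][k] is 1 if the nth data object belongs to the kth cluster, else 0.
    """
    return [[1 if label == k else 0 for k in range(K)] for label in Z]


def cluster_sets(Z, K):
    """Cluster set C: C[k] holds the indices of the objects in the kth cluster."""
    return [[n for n, label in enumerate(Z) if label == k] for k in range(K)]


def centroids(X, C):
    """Centroid c_k of each cluster as the mean of its members (clusters assumed non-empty)."""
    M = len(X[0])
    return [
        [sum(X[n][m] for n in members) / len(members) for m in range(M)]
        for members in C
    ]


U = partition_matrix(Z, K)
C = cluster_sets(Z, K)
c = centroids(X, C)
```

In this crisp setting each row of U contains exactly one 1. A fuzzy partition matrix would instead hold membership degrees between 0 and 1 in the same N-by-K layout.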
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011 and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the ever-expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The "Omics" Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is, "-omics" and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in respective order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvement, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
Contents

  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike's Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn's Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects, the design and the application of clustering methods in this field, have been under active investigation by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some
specialised sophisticated methods can also be used by other researchers to apply them to otheranalogous problems Therefore the general community of researchers in the field of machinelearning are also targeted by most of the contents of this bookThere are books mainly focusing on various aspects related to microarrays such as
biological experimental design image processing identification of differentially expressedgenes supervised classification etc while covering clustering analysis at a less thorough levelIn any case these tend to focus on clustering of microarray datasets rather than considering awider range of biological datasets Yet other books provide more thorough coverage ofclustering than the aforementioned books but they do not provide sufficient background inthe field of molecular biology for researchers coming from numeric backgrounds to appreciateand understand the origins of the available datasets These kinds of book tend to be mainlydata-clustering books and naturally belong to the computational side of bioinformatics ratherthan to the interface between the computational and the biological sides This book does sit atthis interface and goes far beyond those currently available in that it presents some biologicalpreliminaries as well as some state-of-the-art clustering methods that are specifically designedto suit specific issues which appear in biological datasetsAs the field is still developing such a book cannot be definitive or complete This book is
designed to target researchers in bioinformatics ranging from entry level researchers (eg seniorbachelor students and master students) to the most senior researchers (eg heads of researchgroups) It is hoped that graduate students should be able to learn enough basics before studyingjournal papers researchers in related fields should be able to get a broad perspective on whathas been achieved and current researchers in this field should be able to use it as a referenceFurther to the material provided in the book a companion website hosting a selected
collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics/

A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
T        The matrix transpose operator
−1       The matrix inverse
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
         The likelihood function
A        The adjacency matrix
B        The incident matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.
Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.
Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One
Introduction

1
Introduction to Bioinformatics

1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient, given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the ever-expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The “Omics” Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is, “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
Preface
Bioinformatics is a new and interdisciplinary field which is concerned with the development ofmethods for recording organising and analysing biological data With the advent of humangenome-sequencing microarrays and high-performance computing developing software toolsto generate useful biological knowledge has been a major activityClustering techniques have been increasingly used in the analysis of high-throughput
biological datasets Although many generic clustering methods have been used successfullyto analyse biological datasets many specific properties of those datasets require customisedmethods which are specifically designed to meet such properties Therefore both aspects ndashthe design and the application of clustering methods in this field ndash have been under activeinvestigations and considerations by many research groups around the world Indeed therehave been many activities in assimilating the work in this field especially by designingstate-of-the-art methods which uniquely address various issues that are particularly relevantin biological dataThis book attempts to outline the complete pathway from the basics of molecular biology to
the generation of biological knowledge. It is supplied with an introductory part on molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to acquire the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing such as normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are given much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.
BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015
List of Symbols
X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
T        The matrix transpose operator
−1       The matrix inverse
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
         The likelihood function
A        The adjacency matrix
B        The incident matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set
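As a small illustration of how the partition symbols in this list relate to one another, the following sketch builds a partition matrix U and a cluster set C from a partition vector Z. The toy values of N, K and Z are hypothetical, the indices are 0-based for convenience, and a crisp (non-fuzzy) partition is assumed; none of this is taken from the book's own software.

```python
# Hypothetical toy example: N = 5 data objects grouped into K = 2 clusters.
N, K = 5, 2

# Partition vector Z of length N: Z[n] is the (0-based) cluster index
# assigned to the nth data object.
Z = [0, 1, 0, 1, 1]

# Partition matrix U with N rows and K columns. Entry U[n][k] is 1 when
# object n belongs to cluster k and 0 otherwise (crisp membership assumed).
U = [[1 if Z[n] == k else 0 for k in range(K)] for n in range(N)]

# Cluster set C = {C_1, ..., C_K}: each C_k lists the indices of the data
# objects that belong to the kth cluster.
C = [[n for n in range(N) if Z[n] == k] for k in range(K)]

# The cardinality of each cluster, i.e. the number of objects it contains.
sizes = [len(C_k) for C_k in C]
```

In a crisp partition every row of U sums to one, so Z, U and C carry the same information in three different layouts.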
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).
Part One
Introduction

1
Introduction to Bioinformatics

1.1 Introduction
Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.
1.2 The “Omics” Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, namely “-omics”, along with its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many well-known research journals carry the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.
The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of the proteome and of those four sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics, respectively. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.
All of those biological fields of omics involve high-throughput datasets which call for information engineering and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.
More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.
About the Authors
Basel Abu-Jamous received his BSc degree in computer engineering from the University ofJordan Amman Jordan in 2010 He received his MSc degree in information and intelligenceengineering from the University of Liverpool Liverpool UK in 2011 He was awarded the SirRobin Saxby Prize for the 20102011 academic year based on his performance in his MScdegree Mr Abu-Jamous started his PhD degree in electrical engineering and electronics atthe University of Liverpool in 2011 and then transferred with his supervisor Professor AsokeK Nandi to complete his PhD degree at Brunel University London UK He was awarded aprize in the posters section of the 6th Annual Student Research Conference 2013 at BrunelUniversity London Currently he is a research assistant at the Department of Electronic andComputer Engineering at Brunel University London a position to which he was appointedin January 2015 Mr Abu-Jamous has authored or co-authored several journal and internationalconference papers His research interests include bioinformatics computational biology andthe broader areas of information engineering and machine learning
Dr Rui Fa received his PhD degree in electrical engineering and information systems fromthe University of Newcastle UK in 2007 From January 2008 to September 2010 he heldresearch positions in the University of York and the University of Leeds working in radarsignal processing and wireless communication projects From October 2010 Dr Fa extendedhis research fields to bioinformatics and computational biology and joined the University ofLiverpool involving in a collaborative research project with the Universities of OxfordCambridge and Bristol which is funded by the National Institute for Health Research (NIHR)Dr Fa joined Brunel University London as a senior research fellow in 2013 His current researchinterests include bioinformatics and computational biology systems biology machine learn-ing Bayesian statistics statistical signal processing and network science Dr Fa has authoredand co-authored more than 60 peer-reviewed journal and conference papers
Professor Asoke K Nandi joined Brunel University London in April 2013 as the Head ofElectronic and Computer Engineering He received a PhD from the University of Cambridge
and since then has worked in many institutions including CERN Geneva SwitzerlandUniversity of Oxford Oxford UK Imperial College London UK University of StrathclydeGlasgow UK and University of Liverpool Liverpool UK His research spans many differenttopics including bioinformatics communications machine learning (feature selection featuregeneration classification clustering and pattern recognition) and signal processingIn 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three
fundamental particles known as W+ Wminus and Z0 providing the evidence for the unificationof the electromagnetic and weak forces which was recognized by the Nobel Committee forPhysics in 1984 He has been honoured with the Fellowships of the Royal Academy ofEngineering (UK) and the Institute of Electrical and Electronics Engineers (USA) He is aFellow of five other professional institutions including the Institute of Physics (UK) theInstitute of Mathematics and its Applications (UK) the British Computer Society (UK) theInstitution of Mechanical Engineers (UK) and the Institution of Engineering and Technology(UK) His publications have been cited more than 16 000 times and his h-index is 60 (GoogleScholar)
xxiv About the Authors
Part OneIntroduction
1Introduction to Bioinformatics
11 Introduction
Interesting research fields emerge through the collaboration of researchers from differentsometimes distant disciplines Examples include biochemistry biophysics quantum informa-tion science systems engineering mechatronics business information systems managementinformation systems geophysics biomedical engineering cybernetics art history media tech-nology and others This marriage between disciplines yields findings which blend the views ofdifferent areas over the same subject or set of dataThe stimuli leading to such collaborations are numerous For example one discipline may
develop tools that generate types of data that require another discipline to analyse In othercases one field scratches a layer of unknowns to discover that significant parts of its scopeare actually based on the principles of another field such as the low-level biological studiesof the chemical interactions in the cells which delivered biochemistry as an interdisciplinaryfield Other interdisciplinary fields emerged because of their complementary involvement inbuilding different parts of the same target system or in understanding different sides of the sameresearch question for example mechatronics engineering aims at building systems which haveboth mechanical and electronic parts such as all modern automobiles Interdisciplinary areaslike business information systems and management information systems have emerged due tothe high demand for information systems which target business and management aspectsalthough generic information systems would meet many of those requirements a customisedfield focusing on such applications is indeed more efficient given such high demandThe interdisciplinary field of this bookrsquos focus is bioinformatics The motive behind this
fieldrsquos emergence is the increasingly expanding generation of massive raw biological data fol-lowing the developments in high-throughput techniques in the last couple of decades The scaleof this high-throughput data is orders of magnitude higher than what can be efficiently analysed
Integrative Cluster Analysis in Bioinformatics First Edition Basel Abu-Jamous Rui Fa and Asoke K Nandicopy 2015 John Wiley amp Sons Ltd Published 2015 by John Wiley amp Sons Ltd
in a manual fashion Consequently information engineers were recruited in order to contributeto data analysis by employing their computational methods Cycles of computational analysissharing of results interdisciplinary discussions and abstractions have led and are still leadingto many key discoveries in biology and medicine This success has attracted many informationengineers towards biology and many biologists towards information engineering to meet ina potentially rich intersection area which itself has grown in size to establish the field ofbioinformatics
12 The ldquoOmicsrdquo Era
A new suffix has been introduced to the English language in this era of high-throughput dataexpansion that is ldquo-omicsrdquo and its relatives ldquo-omerdquo and ldquo-omicrdquo This started in the 1930swhen the entire set of genes carried by a chromosome was called the genome blending thewords ldquogenerdquo and ldquochromosomerdquo (OED 2014) Consequently the analysis of the entiregenome was called genomics and many known research journals carried the term ldquogenomerdquoor ldquogenomicsrdquo in their titles such as Genomics Genome Research Genome Biology BMCGenomics Genome Medicine the Journal of Genetics and Genomics (JGG) and othersThe -ome suffix was not kept exclusive for the genome it has been rather generalised to
indicate the complete set of some type of molecule or object The proteome is the completeset of proteins in a cell tissue or organism Similarly are the transcriptome metabolome gly-come and lipidome for the complete sets of transcripts metabolites glycans (carbohydrates)and lipids In a respective order large-scale studies of those complete sets are known asproteomics transcriptomics metabolomics glycomics and lipidomics The -ome suffix wasfurther generalised to include the complete sets of objects other than basic molecules Forexample the microbiome is the complete set of microorganisms (eg bacteria microscopicfungi etc) in a given environment such as a building a sample of soil or the human gut(Kembel et al 2014) More omic fields have also emerged such as agrigenomics (the appli-cation of genomics in agriculture) pharmacogenomics and pharmacoproteomics (the applica-tion of genomics and proteomics to pharmacology) and othersAll of those biological fields of omics involve high-throughput datasets which are subject to
information engineering involvements and therefore reside at the core focus of bioinformaticresearch An even higher level of omics analysis involves integrative analysis of many types ofomic datasets OMICS a Journal of Integrative Biology is a journal which targets researchstudies that consider such collective analysis at different levels from single cells to societiesMore types of high-throughput omic datasets are expected to emerge The role of bioinfor-
matics as an interdisciplinary field will be more important This is not only because each ofthose omic datasets is massive in size when considered individually it is also because ofthe size of information hidden in the relations between those generally heterogeneous datasetswhich requires more sophisticated computational methods to analyse
13 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods techniques and toolswhich target storage retrieval organisation analysis and presentation of high-throughputbiological data
4 Integrative Cluster Analysis in Bioinformatics
and since then has worked in many institutions including CERN Geneva SwitzerlandUniversity of Oxford Oxford UK Imperial College London UK University of StrathclydeGlasgow UK and University of Liverpool Liverpool UK His research spans many differenttopics including bioinformatics communications machine learning (feature selection featuregeneration classification clustering and pattern recognition) and signal processingIn 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three
fundamental particles known as W+ Wminus and Z0 providing the evidence for the unificationof the electromagnetic and weak forces which was recognized by the Nobel Committee forPhysics in 1984 He has been honoured with the Fellowships of the Royal Academy ofEngineering (UK) and the Institute of Electrical and Electronics Engineers (USA) He is aFellow of five other professional institutions including the Institute of Physics (UK) theInstitute of Mathematics and its Applications (UK) the British Computer Society (UK) theInstitution of Mechanical Engineers (UK) and the Institution of Engineering and Technology(UK) His publications have been cited more than 16 000 times and his h-index is 60 (GoogleScholar)
xxiv About the Authors
Part OneIntroduction
1Introduction to Bioinformatics
11 Introduction
Interesting research fields emerge through the collaboration of researchers from differentsometimes distant disciplines Examples include biochemistry biophysics quantum informa-tion science systems engineering mechatronics business information systems managementinformation systems geophysics biomedical engineering cybernetics art history media tech-nology and others This marriage between disciplines yields findings which blend the views ofdifferent areas over the same subject or set of dataThe stimuli leading to such collaborations are numerous For example one discipline may
develop tools that generate types of data that require another discipline to analyse In othercases one field scratches a layer of unknowns to discover that significant parts of its scopeare actually based on the principles of another field such as the low-level biological studiesof the chemical interactions in the cells which delivered biochemistry as an interdisciplinaryfield Other interdisciplinary fields emerged because of their complementary involvement inbuilding different parts of the same target system or in understanding different sides of the sameresearch question for example mechatronics engineering aims at building systems which haveboth mechanical and electronic parts such as all modern automobiles Interdisciplinary areaslike business information systems and management information systems have emerged due tothe high demand for information systems which target business and management aspectsalthough generic information systems would meet many of those requirements a customisedfield focusing on such applications is indeed more efficient given such high demandThe interdisciplinary field of this bookrsquos focus is bioinformatics The motive behind this
fieldrsquos emergence is the increasingly expanding generation of massive raw biological data fol-lowing the developments in high-throughput techniques in the last couple of decades The scaleof this high-throughput data is orders of magnitude higher than what can be efficiently analysed
Integrative Cluster Analysis in Bioinformatics First Edition Basel Abu-Jamous Rui Fa and Asoke K Nandicopy 2015 John Wiley amp Sons Ltd Published 2015 by John Wiley amp Sons Ltd
in a manual fashion Consequently information engineers were recruited in order to contributeto data analysis by employing their computational methods Cycles of computational analysissharing of results interdisciplinary discussions and abstractions have led and are still leadingto many key discoveries in biology and medicine This success has attracted many informationengineers towards biology and many biologists towards information engineering to meet ina potentially rich intersection area which itself has grown in size to establish the field ofbioinformatics
12 The ldquoOmicsrdquo Era
A new suffix has been introduced to the English language in this era of high-throughput dataexpansion that is ldquo-omicsrdquo and its relatives ldquo-omerdquo and ldquo-omicrdquo This started in the 1930swhen the entire set of genes carried by a chromosome was called the genome blending thewords ldquogenerdquo and ldquochromosomerdquo (OED 2014) Consequently the analysis of the entiregenome was called genomics and many known research journals carried the term ldquogenomerdquoor ldquogenomicsrdquo in their titles such as Genomics Genome Research Genome Biology BMCGenomics Genome Medicine the Journal of Genetics and Genomics (JGG) and othersThe -ome suffix was not kept exclusive for the genome it has been rather generalised to
indicate the complete set of some type of molecule or object The proteome is the completeset of proteins in a cell tissue or organism Similarly are the transcriptome metabolome gly-come and lipidome for the complete sets of transcripts metabolites glycans (carbohydrates)and lipids In a respective order large-scale studies of those complete sets are known asproteomics transcriptomics metabolomics glycomics and lipidomics The -ome suffix wasfurther generalised to include the complete sets of objects other than basic molecules Forexample the microbiome is the complete set of microorganisms (eg bacteria microscopicfungi etc) in a given environment such as a building a sample of soil or the human gut(Kembel et al 2014) More omic fields have also emerged such as agrigenomics (the appli-cation of genomics in agriculture) pharmacogenomics and pharmacoproteomics (the applica-tion of genomics and proteomics to pharmacology) and othersAll of those biological fields of omics involve high-throughput datasets which are subject to
information engineering involvements and therefore reside at the core focus of bioinformaticresearch An even higher level of omics analysis involves integrative analysis of many types ofomic datasets OMICS a Journal of Integrative Biology is a journal which targets researchstudies that consider such collective analysis at different levels from single cells to societiesMore types of high-throughput omic datasets are expected to emerge The role of bioinfor-
matics as an interdisciplinary field will be more important This is not only because each ofthose omic datasets is massive in size when considered individually it is also because ofthe size of information hidden in the relations between those generally heterogeneous datasetswhich requires more sophisticated computational methods to analyse
13 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods techniques and toolswhich target storage retrieval organisation analysis and presentation of high-throughputbiological data
4 Integrative Cluster Analysis in Bioinformatics
Part OneIntroduction
1Introduction to Bioinformatics
11 Introduction
Interesting research fields emerge through the collaboration of researchers from differentsometimes distant disciplines Examples include biochemistry biophysics quantum informa-tion science systems engineering mechatronics business information systems managementinformation systems geophysics biomedical engineering cybernetics art history media tech-nology and others This marriage between disciplines yields findings which blend the views ofdifferent areas over the same subject or set of dataThe stimuli leading to such collaborations are numerous For example one discipline may
develop tools that generate types of data that require another discipline to analyse In othercases one field scratches a layer of unknowns to discover that significant parts of its scopeare actually based on the principles of another field such as the low-level biological studiesof the chemical interactions in the cells which delivered biochemistry as an interdisciplinaryfield Other interdisciplinary fields emerged because of their complementary involvement inbuilding different parts of the same target system or in understanding different sides of the sameresearch question for example mechatronics engineering aims at building systems which haveboth mechanical and electronic parts such as all modern automobiles Interdisciplinary areaslike business information systems and management information systems have emerged due tothe high demand for information systems which target business and management aspectsalthough generic information systems would meet many of those requirements a customisedfield focusing on such applications is indeed more efficient given such high demandThe interdisciplinary field of this bookrsquos focus is bioinformatics The motive behind this
fieldrsquos emergence is the increasingly expanding generation of massive raw biological data fol-lowing the developments in high-throughput techniques in the last couple of decades The scaleof this high-throughput data is orders of magnitude higher than what can be efficiently analysed
Integrative Cluster Analysis in Bioinformatics First Edition Basel Abu-Jamous Rui Fa and Asoke K Nandicopy 2015 John Wiley amp Sons Ltd Published 2015 by John Wiley amp Sons Ltd
in a manual fashion Consequently information engineers were recruited in order to contributeto data analysis by employing their computational methods Cycles of computational analysissharing of results interdisciplinary discussions and abstractions have led and are still leadingto many key discoveries in biology and medicine This success has attracted many informationengineers towards biology and many biologists towards information engineering to meet ina potentially rich intersection area which itself has grown in size to establish the field ofbioinformatics
12 The ldquoOmicsrdquo Era
A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, along with its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics, respectively. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria and microscopic fungi) in a given environment such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.
1.3 The Scope of Bioinformatics
The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.