INTEGRATIVE CLUSTER ANALYSIS IN BIOINFORMATICS

Basel Abu-Jamous, Rui Fa and Asoke K. Nandi
Brunel University London, UK

This edition first published 2015
© 2015 John Wiley & Sons, Ltd

Registered Office
John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book, please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data

Abu-Jamous, Basel.
Integrative cluster analysis in bioinformatics / Basel Abu-Jamous, Dr Rui Fa and Prof. Asoke K. Nandi.

pages cm
Includes bibliographical references and index.
ISBN 978-1-118-90653-8 (cloth)

1. Bioinformatics–Mathematics. 2. Cluster analysis. I. Fa, Rui. II. Nandi, Asoke Kumar. III. Title.
QH324.2.A24 2015
519.5′3–dc23

2014032428

A catalogue record for this book is available from the British Library

Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India

1 2015

To
Wael Abu Jamous and Eman Arafat
Hui Jin, Yvonne and Molly Fa
Marion, Robin, David and Anita Nandi

Brief Contents

Preface
List of Symbols
About the Authors

Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics

Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology

Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression

Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion

Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations

Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)

Appendix
Index

Contents

Preface
List of Symbols
About the Authors

Part One Introduction

1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 The “Omics” Era
  1.3 The Scope of Bioinformatics
    1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
    1.3.2 Data Storage, Retrieval and Organisation
    1.3.3 Data Analysis
    1.3.4 Statistical Analysis
    1.3.5 Presentation
  1.4 What Do Information Engineers and Biologists Need to Know?
  1.5 Discussion and Summary
  References

2 Computational Methods in Bioinformatics
  2.1 Introduction
  2.2 Machine Learning and Data Mining
    2.2.1 Supervised Learning
    2.2.2 Unsupervised Learning
  2.3 Optimisation
  2.4 Image Processing: Bioimage Informatics
  2.5 Network Analysis
  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References

Part Two Introduction to Molecular Biology

3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References

4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References

Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References

6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References

7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References

8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References

9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher’s Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References

Part Four Clustering Methods

10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References

11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References

12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References

13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike’s Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn’s Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books, and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015

List of Symbols

X      Data matrix
x_n    The nth data object
S      Similarity matrix
N      Number of data objects
M      Number of features or samples
K      Number of clusters
D      Dissimilarity between two input data objects
S      Similarity between two input data objects
Z      Partition (vector sequence with the length of N)
U      Partition matrix (with N rows and K columns)
u_kn   Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C      Cluster set containing K clusters C_1, …, C_K
C_k    The kth cluster, containing the indices of those data objects belonging to it
c_k    The centroid of the kth cluster
κ      Kernel function
m      The fuzzifier
μ      The mean vector
Σ      The covariance matrix
Σ      The within-cluster covariance matrix
T      The matrix transpose operator
−1     The matrix inverse
det    The determinant operator
exp    The exponential function
|·|    The cardinality of a set
‖·‖    The Euclidean norm
Θ      The total parameter set
G      Number of groups
τ      The mixing parameter
       The likelihood function
A      The adjacency matrix
B      The incident matrix
D      The degree matrix
L      The Laplacian matrix
G      A graph
V      The set of vertices
E      The set of edges
M      The modularity matrix
Q      The modularity value
N_V    Number of vertices
N_E    Number of edges
θ      A smaller parameter set
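As a small illustration of how some of the symbols above relate to one another, the following Python sketch builds a crisp partition matrix U (N rows, K columns, as listed) from a partition vector Z, and computes each centroid c_k as the mean of the data objects assigned to cluster k. The function names `partition_matrix` and `centroids`, the 0-based cluster labels, the use of the arithmetic mean as the centroid, and the toy data are assumptions made for illustration only; they are not taken from the book.

```python
from typing import List, Optional


def partition_matrix(z: List[int], k: int) -> List[List[int]]:
    """Build an N-by-K crisp partition matrix U from a partition vector Z.

    z[n] is the (0-based, assumed) cluster label of the nth data object,
    so U[n][c] is 1 when object n belongs to cluster c and 0 otherwise.
    """
    return [[1 if label == c else 0 for c in range(k)] for label in z]


def centroids(x: List[List[float]], u: List[List[int]]) -> List[Optional[List[float]]]:
    """Compute c_k as the mean of the rows of X that belong to cluster k.

    Returns None for a cluster with no members.
    """
    n_features = len(x[0])
    n_clusters = len(u[0])
    result: List[Optional[List[float]]] = []
    for c in range(n_clusters):
        members = [x[n] for n in range(len(x)) if u[n][c] == 1]
        if not members:
            result.append(None)
            continue
        result.append(
            [sum(row[m] for row in members) / len(members) for m in range(n_features)]
        )
    return result


# Toy data (hypothetical): N = 4 objects, M = 2 features, K = 2 clusters.
X = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
Z = [0, 0, 1, 1]

U = partition_matrix(Z, 2)
C = centroids(X, U)
print(U)  # one row per data object, one column per cluster
print(C)  # one centroid per cluster
```

In a crisp partition each row of U sums to one; fuzzy methods relax the entries to values between 0 and 1.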


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK), and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).

Part One
Introduction

1
Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion; that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics, respectively. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatics research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.

4 Integrative Cluster Analysis in Bioinformatics


Page 3: Thumbnail - download.e-bookshelf.de13 Fuzzy Clustering 167 14 Neural Network-based Clustering 181. ... 2.2 Machine Learning and Data Mining 10 2.2.1 Supervised Learning 10 2.2.2 Unsupervised

The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data

Abu-Jamous, Basel.
Integrative cluster analysis in bioinformatics / Basel Abu-Jamous, Dr Rui Fa, and Prof. Asoke K. Nandi.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-90653-8 (cloth)
1. Bioinformatics–Mathematics. 2. Cluster analysis. I. Fa, Rui. II. Nandi, Asoke Kumar. III. Title.
QH324.2.A24 2015
519.5'3–dc23
2014032428

A catalogue record for this book is available from the British Library.

Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India

1 2015

To
Wael Abu Jamous and Eman Arafat
Hui Jin Yvonne and Molly Fa
Marion Robin David and Anita Nandi

Brief Contents

Preface
List of Symbols
About the Authors

Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics

Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology

Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression

Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion

Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations

Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)

Appendix

Index

Contents

Preface
List of Symbols
About the Authors

Part One Introduction

1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 The “Omics” Era
  1.3 The Scope of Bioinformatics
    1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
    1.3.2 Data Storage, Retrieval and Organisation
    1.3.3 Data Analysis
    1.3.4 Statistical Analysis
    1.3.5 Presentation
  1.4 What Do Information Engineers and Biologists Need to Know?
  1.5 Discussion and Summary
  References

2 Computational Methods in Bioinformatics
  2.1 Introduction
  2.2 Machine Learning and Data Mining
    2.2.1 Supervised Learning
    2.2.2 Unsupervised Learning
  2.3 Optimisation
  2.4 Image Processing: Bioimage Informatics
  2.5 Network Analysis
  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References

Part Two Introduction to Molecular Biology

3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References

4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References

Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References

6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References

7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References

8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References

9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher’s Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References

Part Four Clustering Methods

10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References

11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References

12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References

13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike’s Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn’s Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method – Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part on molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing such as normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are given close attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface, and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015

List of Symbols

X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
T        The matrix transpose operator (superscript)
−1       The matrix inverse (superscript)
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
ℒ        The likelihood function
A        The adjacency matrix
B        The incidence matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set
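As a rough illustration of how some of these symbols relate to one another, the following Python sketch builds a crisp partition matrix U from a partition vector Z and then derives the clusters C_k and centroids c_k. The toy data, the helper name `partition_matrix`, the one-row-per-object layout of X, the zero-based labels and the use of the arithmetic mean as the centroid are all assumptions made for illustration; none of them is taken from the book.

```python
# Illustrative sketch only: data, names and conventions here are assumptions.

def partition_matrix(Z, K):
    """Build a crisp N-by-K partition matrix U from a partition vector Z.

    Z[n] is the (zero-based) cluster label of the n-th data object, so
    U[n][k] == 1 exactly when object n belongs to cluster k.
    """
    return [[1 if label == k else 0 for k in range(K)] for label in Z]


# X: data matrix, assumed here to hold N = 4 objects (rows) with M = 2 features
X = [[0.1, 0.2], [0.0, 0.3], [5.1, 4.9], [4.8, 5.2]]
N, M = len(X), len(X[0])

K = 2             # number of clusters
Z = [0, 0, 1, 1]  # partition: one cluster label per data object
U = partition_matrix(Z, K)

# C[k]: indices of the data objects that belong to the k-th cluster
C = [[n for n in range(N) if U[n][k] == 1] for k in range(K)]

# centroids[k]: centroid of the k-th cluster, taken as the feature-wise mean
centroids = [
    [sum(X[n][m] for n in C[k]) / len(C[k]) for m in range(M)]
    for k in range(K)
]
```

In a fuzzy setting the entries of U would be memberships between 0 and 1 rather than the crisp 0/1 values used in this sketch.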

About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).

Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the ever-expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which has itself grown in size to establish the field of bioinformatics.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, along with its relatives “-ome” and “-omic”. This started in the 1930s, when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many well-known research journals carry the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, respectively, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which call for the involvement of information engineering, and therefore lie at the core of bioinformatics research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target the storage, retrieval, organisation, analysis and presentation of high-throughput biological data.


The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging-in-Publication Data

Abu-Jamous, Basel.
Integrative cluster analysis in bioinformatics / Basel Abu-Jamous, Dr Rui Fa and Prof. Asoke K. Nandi.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-90653-8 (cloth)
1. Bioinformatics–Mathematics. 2. Cluster analysis. I. Fa, Rui. II. Nandi, Asoke Kumar. III. Title.
QH324.2.A24 2015
519.5′3–dc23
2014032428

A catalogue record for this book is available from the British Library.

Set in 10/12pt Times by SPi Publisher Services, Pondicherry, India

1 2015

To
Wael Abu Jamous and Eman Arafat
Hui Jin, Yvonne and Molly Fa
Marion, Robin, David and Anita Nandi

Brief Contents

Preface
List of Symbols
About the Authors

Part One Introduction
1 Introduction to Bioinformatics
2 Computational Methods in Bioinformatics

Part Two Introduction to Molecular Biology
3 The Living Cell
4 Central Dogma of Molecular Biology

Part Three Data Acquisition and Pre-processing
5 High-throughput Technologies
6 Databases, Standards and Annotation
7 Normalisation
8 Feature Selection
9 Differential Expression

Part Four Clustering Methods
10 Clustering Forms
11 Partitional Clustering
12 Hierarchical Clustering
13 Fuzzy Clustering
14 Neural Network-based Clustering
15 Mixture Model Clustering
16 Graph Clustering
17 Consensus Clustering
18 Biclustering
19 Clustering Methods Discussion

Part Five Validation and Visualisation
20 Numerical Validation
21 Biological Validation
22 Visualisations and Presentations

Part Six New Clustering Frameworks Designed for Bioinformatics
23 Splitting-Merging Awareness Tactics (SMART)
24 Tightness-tunable Clustering (UNCLES)

Appendix
Index

Contents

Preface
List of Symbols
About the Authors

Part One Introduction

1 Introduction to Bioinformatics
  1.1 Introduction
  1.2 The “Omics” Era
  1.3 The Scope of Bioinformatics
    1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
    1.3.2 Data Storage, Retrieval and Organisation
    1.3.3 Data Analysis
    1.3.4 Statistical Analysis
    1.3.5 Presentation
  1.4 What Do Information Engineers and Biologists Need to Know?
  1.5 Discussion and Summary
  References

2 Computational Methods in Bioinformatics
  2.1 Introduction
  2.2 Machine Learning and Data Mining
    2.2.1 Supervised Learning
    2.2.2 Unsupervised Learning
  2.3 Optimisation
  2.4 Image Processing: Bioimage Informatics
  2.5 Network Analysis
  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References

Part Two Introduction to Molecular Biology

3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References

4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References

Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References

6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References

7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References

8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References

9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher’s Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References

Part Four Clustering Methods

10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References

11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References

12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References

13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike’s Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn’s Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix
Index

Preface

Bioinformatics is a new and interdisciplinary field concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods designed specifically to address them. Therefore both aspects, the design and the application of clustering methods in this field, have been under active investigation by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part gives those readers the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods which can deal with specific problems that appear in biological datasets are given particular attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015

List of Symbols

X       Data matrix
x_n     The nth data object
S       Similarity matrix
N       Number of data objects
M       Number of features or samples
K       Number of clusters
D       Dissimilarity between two input data objects
S       Similarity between two input data objects
Z       Partition (vector sequence with the length of N)
U       Partition matrix (with N rows and K columns)
u_kn    Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C       Cluster set containing K clusters C_1, …, C_K
C_k     The kth cluster, containing the indices of those data objects belonging to it
c_k     The centroid of the kth cluster
κ       Kernel function
m       The fuzzifier
μ       The mean vector
Σ       The covariance matrix
Σ       The within-cluster covariance matrix
T       The matrix transpose operator
−1      The matrix inverse
det     The determinant operator
exp     The exponential function
        The cardinality of a set
        The Euclidean norm
Θ       The total parameter set
G       Number of groups
τ       The mixing parameter
        The likelihood function
A       The adjacency matrix
B       The incidence matrix
D       The degree matrix
L       The Laplacian matrix
G       A graph
V       The set of vertices
E       The set of edges
M       The modularity matrix
Q       The modularity value
N_V     Number of vertices
N_E     Number of edges
θ       A smaller parameter set
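As an illustration of how the partition vector Z, the partition matrix U and the clusters C_k in the list above relate to each other, the following is a minimal sketch and not code from the book. It assumes a crisp (non-fuzzy) partition, 0-based cluster labels, and the N-by-K orientation stated for U, so that the entry written u_kn in the list is stored at U[n, k]. The function name partition_matrix and the toy labels are hypothetical.

```python
import numpy as np


def partition_matrix(Z, K):
    """Build a crisp N-by-K partition matrix U from a partition vector Z.

    Z[n] is the (0-based) index of the cluster that data object n belongs to.
    U[n, k] is 1 if object n belongs to cluster k, and 0 otherwise.
    """
    Z = np.asarray(Z)
    N = Z.shape[0]
    U = np.zeros((N, K), dtype=int)
    U[np.arange(N), Z] = 1  # one membership per row
    return U


# Toy partition of N = 5 data objects into K = 3 clusters (illustrative values).
Z = [0, 2, 1, 0, 2]
U = partition_matrix(Z, K=3)

# C_k: the indices of the data objects belonging to the kth cluster.
clusters = [np.flatnonzero(U[:, k]).tolist() for k in range(3)]

print(U)
print(clusters)
```

In a crisp partition every row of U sums to one; a fuzzy partition would instead hold membership degrees between 0 and 1 in the same layout.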

About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010 he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010 Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, taking part in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).

Part One
Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field; for example, low-level biological studies of the chemical interactions in cells gave rise to biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question. For example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the ever-expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, along with its relatives “-ome” and “-omic”. This started in the 1930s, when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED 2014). Consequently, the analysis of the entire genome was called genomics, and many well-known research journals carry the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Correspondingly, large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil or the human gut (Kembel et al. 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology) and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvement and therefore lie at the core of bioinformatics research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.

4 Integrative Cluster Analysis in Bioinformatics

Page 5: Thumbnail - download.e-bookshelf.de13 Fuzzy Clustering 167 14 Neural Network-based Clustering 181. ... 2.2 Machine Learning and Data Mining 10 2.2.1 Supervised Learning 10 2.2.2 Unsupervised

ToWael Abu Jamous and Eman Arafat

Hui Jin Yvonne and Molly FaMarion Robin David and Anita Nandi

Brief Contents

Preface xix

List of Symbols xxi

About the Authors xxiii

Part One Introduction 1

1 Introduction to Bioinformatics 32 Computational Methods in Bioinformatics 9

Part Two Introduction to Molecular Biology 19

3 The Living Cell 214 Central Dogma of Molecular Biology 33

Part Three Data Acquisition and Pre-processing 53

5 High-throughput Technologies 556 Databases Standards and Annotation 677 Normalisation 878 Feature Selection 1099 Differential Expression 119

Part Four Clustering Methods 133

10 Clustering Forms 13511 Partitional Clustering 14312 Hierarchical Clustering 15713 Fuzzy Clustering 16714 Neural Network-based Clustering 181

15 Mixture Model Clustering 19716 Graph Clustering 22717 Consensus Clustering 24718 Biclustering 26519 Clustering Methods Discussion 283

Part Five Validation and Visualisation 303

20 Numerical Validation 30521 Biological Validation 32322 Visualisations and Presentations 339

Part Six New Clustering Frameworks Designed for Bioinformatics 363

23 Splitting-Merging Awareness Tactics (SMART) 36524 Tightness-tunable Clustering (UNCLES) 385

Appendix 395

Index 409

viii Brief Contents

Contents

Preface
List of Symbols
About the Authors

Part One Introduction

1 Introduction to Bioinformatics
1.1 Introduction
1.2 The “Omics” Era
1.3 The Scope of Bioinformatics
1.3.1 Areas of Molecular Biology Subject to Bioinformatics Analysis
1.3.2 Data Storage, Retrieval and Organisation
1.3.3 Data Analysis
1.3.4 Statistical Analysis
1.3.5 Presentation
1.4 What Do Information Engineers and Biologists Need to Know?
1.5 Discussion and Summary
References

2 Computational Methods in Bioinformatics
2.1 Introduction
2.2 Machine Learning and Data Mining
2.2.1 Supervised Learning
2.2.2 Unsupervised Learning
2.3 Optimisation
2.4 Image Processing: Bioimage Informatics
2.5 Network Analysis
2.6 Statistical Analysis
2.7 Software Tools and Technologies
2.8 Discussion and Summary
References

Part Two Introduction to Molecular Biology

3 The Living Cell
3.1 Introduction
3.2 Prokaryotes and Eukaryotes
3.3 Multicellularity
3.3.1 Unicellular and Multicellular Organisms
3.3.2 Stem Cells and Cell Differentiation
3.4 Cell Components
3.4.1 Plasma Membrane and Transport Proteins
3.4.2 Cytoplasm
3.4.3 Extracellular Matrix
3.4.4 Centrosome and Microtubules
3.4.5 Actin Filaments and the Cytoskeleton
3.4.6 Nucleus
3.4.7 Vesicles
3.4.8 Ribosomes
3.4.9 Endoplasmic Reticulum
3.4.10 Golgi Apparatus
3.4.11 Mitochondrion and the Energy of the Cell
3.4.12 Lysosome
3.4.13 Peroxisome
3.5 Discussion and Summary
References

4 Central Dogma of Molecular Biology
4.1 Introduction
4.2 Central Dogma of Molecular Biology Overview
4.3 Proteins
4.4 DNA
4.5 RNA
4.6 Genes
4.7 Transcription and Post-transcriptional Processes
4.7.1 Post-transcriptional Processes
4.7.2 Gene-specific TFs
4.7.3 Post-transcriptional Regulation
4.8 Translation and Post-translational Processes
4.8.1 The Genetic Code
4.8.2 tRNA and Ribosomes
4.8.3 The Steps of Translation
4.8.4 Polyribosomes (Polysomes)
4.8.5 Post-translational Processes
4.9 Discussion and Summary
References

Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
5.1 Introduction
5.2 Microarrays
5.2.1 DNA Microarrays
5.2.2 Protein Microarrays
5.2.3 Carbohydrate Microarrays (Glycoarrays)
5.2.4 Other Types of Microarrays
5.3 Next-generation Sequencing (NGS)
5.3.1 DNA Sequencing
5.3.2 RNA Sequencing (Transcriptome Analysis)
5.3.3 Metagenomics
5.3.4 Other Applications of Sequencing
5.4 ChIP on Microarrays and Sequencing
5.5 Discussion and Summary
References

6 Databases, Standards and Annotation
6.1 Introduction
6.2 NCBI Databases
6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
6.2.2 GenBank (Nucleotide Database)
6.2.3 Reference Sequences (RefSeq) Database
6.2.4 Gene Database
6.2.5 Protein Database
6.2.6 Gene Expression Omnibus
6.2.7 Taxonomy and HomoloGene Databases
6.2.8 Sequence Read Archive
6.2.9 Genomic and Epigenomic Variations
6.2.10 Other NCBI Databases
6.3 The EBI Databases
6.4 Species-specific Databases
6.4.1 Animals
6.4.2 Plants
6.4.3 Fungi
6.4.4 Archaea and Bacteria
6.4.5 Viruses
6.5 Discussion and Summary
References

7 Normalisation
7.1 Introduction
7.2 Issues Tackled by Normalisation
7.2.1 Within-slide and Between-slides Normalisation
7.2.2 Normalisation Based on Non-differentially Expressed Genes
7.2.3 Background Correction
7.2.4 Logarithmic Transformation
7.2.5 Intensity-dependent Bias – (MA) Plots
7.2.6 Replicates and Summarisation
7.3 Normalisation Methods
7.3.1 Microarray Suite 5 (MAS 5.0)
7.3.2 Robust Multi-array Average (RMA)
7.3.3 Quantile Normalisation
7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
7.3.5 Scaling Methods
7.3.6 Model-based Expression Index (MBEI)
7.3.7 Other Normalisation Methods
7.4 Discussion and Summary
References

8 Feature Selection
8.1 Introduction
8.2 FS and FG – Problem Definition
8.3 Consecutive Ranking
8.3.1 Forward Search (Most Informative First Admitted)
8.3.2 Backward Elimination (Least Useful First Eliminated)
8.4 Individual Ranking
8.4.1 Information Content
8.4.2 SNR Criteria
8.5 Principal Component Analysis
8.6 Genetic Algorithms and Genetic Programming
8.7 Discussion and Summary
References

9 Differential Expression
9.1 Introduction
9.2 Fold Change
9.3 Statistical Hypothesis Testing – Overview
9.3.1 p-Values and Volcano Plots
9.3.2 The Multiple-hypothesis Testing Problem
9.4 Statistical Hypothesis Testing – Methods
9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
9.4.2 B-Statistic
9.4.3 Fisher’s Exact Test
9.4.4 Likelihood Ratio Test
9.4.5 Methods for Over-dispersed Poisson Distribution
9.5 Discussion and Summary
References

Part Four Clustering Methods

10 Clustering Forms
10.1 Introduction
10.2 Proximity Measures
10.2.1 Distance Metrics for Discrete Feature Objects
10.2.2 Distance Metrics for Continuous Feature Objects
10.3 Clustering Families
10.3.1 Partitional Clustering
10.3.2 Hierarchical Clustering
10.3.3 Fuzzy Clustering
10.3.4 Neural Network-based Clustering
10.3.5 Mixture Model Clustering
10.3.6 Graph-based Clustering
10.3.7 Consensus Clustering
10.3.8 Biclustering
10.4 Clusters and Partitions
10.5 Discussion and Summary
References

11 Partitional Clustering
11.1 Introduction
11.2 k-Means and its Applications
11.2.1 Principles
11.2.2 Variations
11.2.3 Applications
11.3 k-Medoids and its Applications
11.3.1 Principles
11.3.2 Variations
11.3.3 Applications
11.4 Discussion and Summary
References

12 Hierarchical Clustering
12.1 Introduction
12.2 Principles
12.2.1 Agglomerative Methods
12.2.2 Divisive Methods
12.3 Discussion and Summary
References

13 Fuzzy Clustering
13.1 Introduction
13.2 Principles
13.2.1 Fuzzy c-Means
13.2.2 Probabilistic c-Means
13.2.3 Hybrid c-Means
13.2.4 Gustafson–Kessel Algorithm
13.2.5 Gath–Geva Algorithm
13.2.6 Fuzzy c-Shell
13.2.7 FANNY
13.2.8 Other Fuzzy Clustering Algorithms
13.3 Discussion
References

14 Neural Network-based Clustering
14.1 Introduction
14.2 Algorithms
14.2.1 SOM
14.2.2 GLVQ
14.2.3 Neural-gas
14.2.4 ART
14.2.5 OPTOC
14.2.6 SOON
14.3 Discussion
References

15 Mixture Model Clustering
15.1 Introduction
15.2 Finite Mixture Models
15.2.1 Various Mixture Models
15.2.2 Non-Bayesian Methods
15.2.3 Bayesian Methods
15.3 Infinite Mixture Models
15.3.1 DPM Model
15.3.2 CRP Mixture Model
15.3.3 SBP Mixture Model
15.4 Discussion
References

16 Graph Clustering
16.1 Introduction
16.2 Basic Definitions
16.2.1 Graph and Adjacency Matrix
16.2.2 Measures and Metrics
16.2.3 Similarity Matrices
16.3 Graph Clustering
16.3.1 Graph Cut Clustering
16.3.2 Spectral Clustering
16.3.3 AP Clustering
16.3.4 Modularity-based Clustering
16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
16.3.6 Markov Cluster Algorithm
16.4 Resources
16.5 Discussion
References

17 Consensus Clustering
17.1 Introduction
17.2 Overview
17.3 Consensus Functions
17.3.1 P–P Comparison
17.3.2 C–C Comparison
17.3.3 MIC Voting
17.3.4 M–M Co-occurrence
17.4 Discussion
References

18 Biclustering
18.1 Introduction
18.2 Overview
18.2.1 Statement of the Biclustering Problem
18.2.2 Types of Biclusters
18.2.3 Classification of Biclustering
18.3 Biclustering Methods
18.3.1 Variance-minimisation Biclustering Methods
18.3.2 Correlation-maximisation Biclustering Methods
18.3.3 Two-way Clustering Methods
18.3.4 Probabilistic and Generative Methods
18.4 Discussion
References

19 Clustering Methods Discussion
19.1 Introduction
19.2 Hierarchical Clustering
19.2.1 Yeast Cell Cycle Data
19.2.2 Breast Cancer
19.2.3 Diffuse Large B-Cell Lymphoma
19.3 Fuzzy Clustering
19.3.1 DNA Motifs Clustering
19.3.2 Microarray Gene Expression
19.4 Neural Network-based Clustering
19.5 Mixture Model-based Clustering
19.5.1 Examples of Finite Mixture Models
19.5.2 Examples of Infinite Mixture Models
19.6 Graph-based Clustering
19.7 Consensus Clustering
19.8 Biclustering
19.9 Summary
References

Part Five Validation and Visualisation

20 Numerical Validation
20.1 Introduction
20.2 External Criteria
20.2.1 Rand Index
20.2.2 Adjusted Rand Index
20.2.3 Jaccard Index
20.2.4 Normalised Mutual Information
20.3 Internal Criteria
20.3.1 Adjusted Figure of Merit
20.3.2 CLEST
20.4 Relative Criteria
20.4.1 Minimum Description Length
20.4.2 Minimum Message Length
20.4.3 Bayesian Information Criterion
20.4.4 Akaike’s Information Criterion
20.4.5 Partition Coefficient
20.4.6 Partition Entropy
20.4.7 Fukuyama–Sugeno Index
20.4.8 Xie–Beni Index
20.4.9 Calinski–Harabasz Index
20.4.10 Dunn’s Index
20.4.11 Davies–Bouldin Index
20.4.12 I Index
20.4.13 Silhouette
20.4.14 Object-based Validation
20.4.15 Geometrical Index
20.4.16 Validity Index
20.4.17 Generalised Parametric Validity
20.5 Discussion and Summary
References

21 Biological Validation
21.1 Introduction
21.2 GO Analysis
21.2.1 GO Term Enrichment
21.3 Upstream Sequence Analysis
21.4 Gene-network Analysis
21.5 Discussion and Summary
References

22 Visualisations and Presentations
22.1 Introduction
22.2 Methods and Examples
22.2.1 Profile Patterns
22.2.2 Bar Chart
22.2.3 Error Bar
22.2.4 Pie Chart
22.2.5 Box Plot
22.2.6 Histogram
22.2.7 Scatter Plot
22.2.8 Venn Diagram
22.2.9 Tree View
22.2.10 Heat Map
22.2.11 Network Graph
22.2.12 Low-dimension Display
22.2.13 Receiver Operating Characteristics Curves
22.2.14 Kaplan–Meier Plot
22.2.15 Block Diagram
22.3 Summary
References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
23.1 Introduction
23.2 Related Work
23.2.1 SSCL
23.2.2 SSMCL
23.2.3 ULFMM
23.2.4 VBGM
23.2.5 PFClust
23.2.6 DBSCAN
23.3 SMART Framework
23.4 Implementations
23.4.1 SMART-CL
23.4.2 SMART-FMM
23.4.3 SMART-MFA
23.5 Enhanced SMART
23.6 Examples
23.7 Discussion
References

24 Tightness-tunable Clustering (UNCLES)
24.1 Introduction
24.2 Bi-CoPaM Method
24.2.1 Partition Generation
24.2.2 Relabelling
24.2.3 Fuzzy Consensus Partition Matrix Generation
24.2.4 Binarisation
24.3 UNCLES Method - Other Types of External Specifications
24.4 M–N Scatter Plots Technique
24.5 Parameter-free UNCLES with M–N Plots
24.6 Discussion and Summary
References

Appendix
Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015

List of Symbols

X       Data matrix
x_n     The nth data object
S       Similarity matrix
N       Number of data objects
M       Number of features or samples
K       Number of clusters
D       Dissimilarity between two input data objects
S       Similarity between two input data objects
Z       Partition (vector sequence with the length of N)
U       Partition matrix (with N rows and K columns)
u_kn    Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C       Cluster set containing K clusters C_1, …, C_K
C_k     The kth cluster, containing the indices of those data objects belonging to it
c_k     The centroid of the kth cluster
κ       Kernel function
m       The fuzzifier
μ       The mean vector
Σ       The covariance matrix
Σ       The within-cluster covariance matrix
T       The matrix transpose operator
−1      The matrix inverse
det     The determinant operator
exp     The exponential function
|·|     The cardinality of a set
‖·‖     The Euclidean norm
Θ       The total parameter set
G       Number of groups
τ       The mixing parameter
        The likelihood function
A       The adjacency matrix
B       The incidence matrix
D       The degree matrix
L       The Laplacian matrix
G       A graph
V       The set of vertices
E       The set of edges
M       The modularity matrix
Q       The modularity value
N_V     Number of vertices
N_E     Number of edges
θ       A smaller parameter set

About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).

Part One
Introduction

1
Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.

4 Integrative Cluster Analysis in Bioinformatics

Page 6: Thumbnail - download.e-bookshelf.de13 Fuzzy Clustering 167 14 Neural Network-based Clustering 181. ... 2.2 Machine Learning and Data Mining 10 2.2.1 Supervised Learning 10 2.2.2 Unsupervised

Brief Contents

Preface xix

List of Symbols xxi

About the Authors xxiii

Part One Introduction 1

1 Introduction to Bioinformatics 32 Computational Methods in Bioinformatics 9

Part Two Introduction to Molecular Biology 19

3 The Living Cell 214 Central Dogma of Molecular Biology 33

Part Three Data Acquisition and Pre-processing 53

5 High-throughput Technologies 556 Databases Standards and Annotation 677 Normalisation 878 Feature Selection 1099 Differential Expression 119

Part Four Clustering Methods 133

10 Clustering Forms 13511 Partitional Clustering 14312 Hierarchical Clustering 15713 Fuzzy Clustering 16714 Neural Network-based Clustering 181

15 Mixture Model Clustering 19716 Graph Clustering 22717 Consensus Clustering 24718 Biclustering 26519 Clustering Methods Discussion 283

Part Five Validation and Visualisation 303

20 Numerical Validation 30521 Biological Validation 32322 Visualisations and Presentations 339

Part Six New Clustering Frameworks Designed for Bioinformatics 363

23 Splitting-Merging Awareness Tactics (SMART) 36524 Tightness-tunable Clustering (UNCLES) 385

Appendix 395

Index 409

viii Brief Contents

Contents

Preface xix

List of Symbols xxi

About the Authors xxiii

Part One Introduction 1

1 Introduction to Bioinformatics 311 Introduction 312 The ldquoOmicsrdquo Era 413 The Scope of Bioinformatics 4

131 Areas of Molecular Biology Subject to Bioinformatics Analysis 5132 Data Storage Retrieval and Organisation 5133 Data Analysis 5134 Statistical Analysis 6135 Presentation 6

14 What Do Information Engineers and Biologists Need to Know 715 Discussion and Summary 8

References 8

2 Computational Methods in Bioinformatics 921 Introduction 922 Machine Learning and Data Mining 10

221 Supervised Learning 10222 Unsupervised Learning 11

23 Optimisation 1124 Image Processing Bioimage Informatics 1325 Network Analysis 14

26 Statistical Analysis 1527 Software Tools and Technologies 1528 Discussion and Summary 16

References 17

Part Two Introduction to Molecular Biology 19

3 The Living Cell 2131 Introduction 2132 Prokaryotes and Eukaryotes 2133 Multicellularity 22

331 Unicellular and Multicellular Organisms 22332 Stem Cells and Cell Differentiation 23

34 Cell Components 24341 Plasma Membrane and Transport Proteins 25342 Cytoplasm 25343 Extracellular Matrix 26344 Centrosome and Microtubules 26345 Actin Filaments and the Cytoskeleton 27346 Nucleus 27347 Vesicles 27348 Ribosomes 28349 Endoplasmic Reticulum 283410 Golgi Apparatus 293411 Mitochondrion and the Energy of the Cell 293412 Lysosome 303413 Peroxisome 31

35 Discussion and Summary 31References 32

4 Central Dogma of Molecular Biology 3341 Introduction 3342 Central Dogma of Molecular Biology Overview 3343 Proteins 3444 DNA 3745 RNA 3946 Genes 4047 Transcription and Post-transcriptional Processes 41

471 Post-transcriptional Processes 43472 Gene-specific TFs 44473 Post-transcriptional Regulation 44

48 Translation and Post-translational Processes 45481 The Genetic Code 45482 tRNA and Ribosomes 46483 The Steps of Translation 47484 Polyribosomes (Polysomes) 49485 Post-translational Processes 50

x Contents

49 Discussion and Summary 50References 51

Part Three Data Acquisition and Pre-processing 53

5 High-throughput Technologies 5551 Introduction 5552 Microarrays 56

521 DNAMicroarrays 56522 Protein Microarrays 58523 Carbohydrate Microarrays (Glycoarrays) 59524 Other Types of Microarrays 59

53 Next-generation Sequencing (NGS) 60531 DNA Sequencing 60532 RNA Sequencing (Transcripome Analysis) 62533 Metagenomics 63534 Other Applications of Sequencing 63

54 ChIP on Microarrays and Sequencing 6355 Discussion and Summary 64

References 65

6 Databases Standards and Annotation 6761 Introduction 6762 NCBI Databases 67

621 Literature Databases (PubMed PMC the Bookshelf and MeSH) 68622 GenBank (Nucleotide Database) 69623 Reference Sequences (RefSeq) Database 69624 Gene Database 70625 Protein Database 70626 Gene Expression Omnibus 71627 Taxonomy and HomoloGene Databases 71628 Sequence Read Archive 72629 Genomic and Epigenomic Variations 726210 Other NCBI Databases 72

63 The EBI Databases 7364 Species-specific Databases 75

641 Animals 75642 Plants 76643 Fungi 76644 Archaea and Bacteria 77645 Viruses 78

65 Discussion and Summary 78References 82

7 Normalisation 8771 Introduction 8772 Issues Tackled by Normalisation 88

xiContents

721 Within-slide and Between-slides Normalisation 88722 Normalisation Based on Non-differentially Expressed Genes 88723 Background Correction 89724 Logarithmic Transformation 90725 Intensity-dependent Bias ndash (MA) Plots 91726 Replicates and Summarisation 91

73 Normalisation Methods 92731 Microarray Suite 5 (MAS 50) 92732 Robust Multi-array Average (RMA) 92733 Quantile Normalisation 95734 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation 96735 Scaling Methods 98736 Model-based Expression Index (MBEI) 99737 Other Normalisation Methods 100

74 Discussion and Summary 103References 104

8 Feature Selection 10981 Introduction 10982 FS and FG ndash Problem Definition 11083 Consecutive Ranking 111

831 Forward Search (Most Informative First Admitted) 111832 Backward Elimination (Least Useful First Eliminated) 112

84 Individual Ranking 112841 Information Content 112842 SNR Criteria 113

85 Principal Component Analysis 11586 Genetic Algorithms and Genetic Programming 11587 Discussion and Summary 115

References 117

9 Differential Expression
9.1 Introduction
9.2 Fold Change
9.3 Statistical Hypothesis Testing – Overview
9.3.1 p-Values and Volcano Plots
9.3.2 The Multiple-hypothesis Testing Problem
9.4 Statistical Hypothesis Testing – Methods
9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
9.4.2 B-Statistic
9.4.3 Fisher’s Exact Test
9.4.4 Likelihood Ratio Test
9.4.5 Methods for Over-dispersed Poisson Distribution
9.5 Discussion and Summary
References

Part Four Clustering Methods

10 Clustering Forms
10.1 Introduction
10.2 Proximity Measures
10.2.1 Distance Metrics for Discrete Feature Objects
10.2.2 Distance Metrics for Continuous Feature Objects
10.3 Clustering Families
10.3.1 Partitional Clustering
10.3.2 Hierarchical Clustering
10.3.3 Fuzzy Clustering
10.3.4 Neural Network-based Clustering
10.3.5 Mixture Model Clustering
10.3.6 Graph-based Clustering
10.3.7 Consensus Clustering
10.3.8 Biclustering
10.4 Clusters and Partitions
10.5 Discussion and Summary
References

11 Partitional Clustering
11.1 Introduction
11.2 k-Means and its Applications
11.2.1 Principles
11.2.2 Variations
11.2.3 Applications
11.3 k-Medoids and its Applications
11.3.1 Principles
11.3.2 Variations
11.3.3 Applications
11.4 Discussion and Summary
References

12 Hierarchical Clustering
12.1 Introduction
12.2 Principles
12.2.1 Agglomerative Methods
12.2.2 Divisive Methods
12.3 Discussion and Summary
References

13 Fuzzy Clustering
13.1 Introduction
13.2 Principles
13.2.1 Fuzzy c-Means
13.2.2 Probabilistic c-Means
13.2.3 Hybrid c-Means
13.2.4 Gustafson–Kessel Algorithm
13.2.5 Gath–Geva Algorithm
13.2.6 Fuzzy c-Shell
13.2.7 FANNY
13.2.8 Other Fuzzy Clustering Algorithms
13.3 Discussion
References

14 Neural Network-based Clustering
14.1 Introduction
14.2 Algorithms
14.2.1 SOM
14.2.2 GLVQ
14.2.3 Neural-gas
14.2.4 ART
14.2.5 OPTOC
14.2.6 SOON
14.3 Discussion
References

15 Mixture Model Clustering
15.1 Introduction
15.2 Finite Mixture Models
15.2.1 Various Mixture Models
15.2.2 Non-Bayesian Methods
15.2.3 Bayesian Methods
15.3 Infinite Mixture Models
15.3.1 DPM Model
15.3.2 CRP Mixture Model
15.3.3 SBP Mixture Model
15.4 Discussion
References

16 Graph Clustering
16.1 Introduction
16.2 Basic Definitions
16.2.1 Graph and Adjacency Matrix
16.2.2 Measures and Metrics
16.2.3 Similarity Matrices
16.3 Graph Clustering
16.3.1 Graph Cut Clustering
16.3.2 Spectral Clustering
16.3.3 AP Clustering
16.3.4 Modularity-based Clustering
16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
16.3.6 Markov Cluster Algorithm
16.4 Resources
16.5 Discussion
References

17 Consensus Clustering
17.1 Introduction
17.2 Overview
17.3 Consensus Functions
17.3.1 P–P Comparison
17.3.2 C–C Comparison
17.3.3 MIC Voting
17.3.4 M–M Co-occurrence
17.4 Discussion
References

18 Biclustering
18.1 Introduction
18.2 Overview
18.2.1 Statement of the Biclustering Problem
18.2.2 Types of Biclusters
18.2.3 Classification of Biclustering
18.3 Biclustering Methods
18.3.1 Variance-minimisation Biclustering Methods
18.3.2 Correlation-maximisation Biclustering Methods
18.3.3 Two-way Clustering Methods
18.3.4 Probabilistic and Generative Methods
18.4 Discussion
References

19 Clustering Methods Discussion
19.1 Introduction
19.2 Hierarchical Clustering
19.2.1 Yeast Cell Cycle Data
19.2.2 Breast Cancer
19.2.3 Diffuse Large B-Cell Lymphoma
19.3 Fuzzy Clustering
19.3.1 DNA Motifs Clustering
19.3.2 Microarray Gene Expression
19.4 Neural Network-based Clustering
19.5 Mixture Model-based Clustering
19.5.1 Examples of Finite Mixture Models
19.5.2 Examples of Infinite Mixture Models
19.6 Graph-based Clustering
19.7 Consensus Clustering
19.8 Biclustering
19.9 Summary
References

Part Five Validation and Visualisation

20 Numerical Validation
20.1 Introduction
20.2 External Criteria
20.2.1 Rand Index
20.2.2 Adjusted Rand Index
20.2.3 Jaccard Index
20.2.4 Normalised Mutual Information
20.3 Internal Criteria
20.3.1 Adjusted Figure of Merit
20.3.2 CLEST
20.4 Relative Criteria
20.4.1 Minimum Description Length
20.4.2 Minimum Message Length
20.4.3 Bayesian Information Criterion
20.4.4 Akaike’s Information Criterion
20.4.5 Partition Coefficient
20.4.6 Partition Entropy
20.4.7 Fukuyama–Sugeno Index
20.4.8 Xie–Beni Index
20.4.9 Calinski–Harabasz Index
20.4.10 Dunn’s Index
20.4.11 Davies–Bouldin Index
20.4.12 I Index
20.4.13 Silhouette
20.4.14 Object-based Validation
20.4.15 Geometrical Index
20.4.16 Validity Index
20.4.17 Generalised Parametric Validity
20.5 Discussion and Summary
References

21 Biological Validation
21.1 Introduction
21.2 GO Analysis
21.2.1 GO Term Enrichment
21.3 Upstream Sequence Analysis
21.4 Gene-network Analysis
21.5 Discussion and Summary
References

22 Visualisations and Presentations
22.1 Introduction
22.2 Methods and Examples
22.2.1 Profile Patterns
22.2.2 Bar Chart
22.2.3 Error Bar
22.2.4 Pie Chart
22.2.5 Box Plot
22.2.6 Histogram
22.2.7 Scatter Plot
22.2.8 Venn Diagram
22.2.9 Tree View
22.2.10 Heat Map
22.2.11 Network Graph
22.2.12 Low-dimension Display
22.2.13 Receiver Operating Characteristics Curves
22.2.14 Kaplan–Meier Plot
22.2.15 Block Diagram
22.3 Summary
References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
23.1 Introduction
23.2 Related Work
23.2.1 SSCL
23.2.2 SSMCL
23.2.3 ULFMM
23.2.4 VBGM
23.2.5 PFClust
23.2.6 DBSCAN
23.3 SMART Framework
23.4 Implementations
23.4.1 SMART-CL
23.4.2 SMART-FMM
23.4.3 SMART-MFA
23.5 Enhanced SMART
23.6 Examples
23.7 Discussion
References

24 Tightness-tunable Clustering (UNCLES)
24.1 Introduction
24.2 Bi-CoPaM Method
24.2.1 Partition Generation
24.2.2 Relabelling
24.2.3 Fuzzy Consensus Partition Matrix Generation
24.2.4 Binarisation
24.3 UNCLES Method – Other Types of External Specifications
24.4 M–N Scatter Plots Technique
24.5 Parameter-free UNCLES with M–N Plots
24.6 Discussion and Summary
References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015


List of Symbols

X  Data matrix
x_n  The nth data object
S  Similarity matrix
N  Number of data objects
M  Number of features or samples
K  Number of clusters
D  Dissimilarity between two input data objects
S  Similarity between two input data objects
Z  Partition (vector sequence with the length of N)
U  Partition matrix (with N rows and K columns)
u_kn  Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C  Cluster set containing K clusters C_1, …, C_K
C_k  The kth cluster, containing the indices of those data objects belonging to it
c_k  The centroid of the kth cluster
κ  Kernel function
m  The fuzzifier
μ  The mean vector
Σ  The covariance matrix
Σ  The within-cluster covariance matrix
T  The matrix transpose operator
−1  The matrix inverse
det  The determinant operator
exp  The exponential function
|·|  The cardinality of a set
‖·‖  The Euclidean norm
Θ  The total parameter set
G  Number of groups
τ  The mixing parameter
The likelihood function
A  The adjacency matrix
B  The incident matrix
D  The degree matrix
L  The Laplacian matrix
G  A graph
V  The set of vertices
E  The set of edges
M  The modularity matrix
Q  The modularity value
N_V  Number of vertices
N_E  Number of edges
θ  A smaller parameter set


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).


Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion: “-omics”, along with its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.





Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics/

A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015


List of Symbols

X      Data matrix
x_n    The nth data object
S      Similarity matrix
N      Number of data objects
M      Number of features or samples
K      Number of clusters
D      Dissimilarity between two input data objects
S      Similarity between two input data objects
Z      Partition (vector sequence with the length of N)
U      Partition matrix (with N rows and K columns)
u_kn   Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C      Cluster set containing K clusters C_1, …, C_K
C_k    The kth cluster, containing the indices of those data objects belonging to it
c_k    The centroid of the kth cluster
κ      Kernel function
m      The fuzzifier
μ      The mean vector
Σ      The covariance matrix
Σ      The within-cluster covariance matrix
T      The matrix transpose operator
−1     The matrix inverse
det    The determinant operator
exp    The exponential function
|·|    The cardinality of a set
‖·‖    The Euclidean norm
Θ      The total parameter set
G      Number of groups
τ      The mixing parameter
ℒ      The likelihood function
A      The adjacency matrix
B      The incidence matrix
D      The degree matrix
L      The Laplacian matrix
G      A graph
V      The set of vertices
E      The set of edges
M      The modularity matrix
Q      The modularity value
N_V    Number of vertices
N_E    Number of edges
θ      A smaller parameter set


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011 and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).


Part One
Introduction

1
Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient, given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, namely “-omics”, together with its relatives “-ome” and “-omic”. This started in the 1930s, when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvement, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.





Dr Rui Fa received his PhD degree in electrical engineering and information systems fromthe University of Newcastle UK in 2007 From January 2008 to September 2010 he heldresearch positions in the University of York and the University of Leeds working in radarsignal processing and wireless communication projects From October 2010 Dr Fa extendedhis research fields to bioinformatics and computational biology and joined the University ofLiverpool involving in a collaborative research project with the Universities of OxfordCambridge and Bristol which is funded by the National Institute for Health Research (NIHR)Dr Fa joined Brunel University London as a senior research fellow in 2013 His current researchinterests include bioinformatics and computational biology systems biology machine learn-ing Bayesian statistics statistical signal processing and network science Dr Fa has authoredand co-authored more than 60 peer-reviewed journal and conference papers

Professor Asoke K Nandi joined Brunel University London in April 2013 as the Head ofElectronic and Computer Engineering He received a PhD from the University of Cambridge

and since then has worked in many institutions including CERN Geneva SwitzerlandUniversity of Oxford Oxford UK Imperial College London UK University of StrathclydeGlasgow UK and University of Liverpool Liverpool UK His research spans many differenttopics including bioinformatics communications machine learning (feature selection featuregeneration classification clustering and pattern recognition) and signal processingIn 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three

fundamental particles known as W+ Wminus and Z0 providing the evidence for the unificationof the electromagnetic and weak forces which was recognized by the Nobel Committee forPhysics in 1984 He has been honoured with the Fellowships of the Royal Academy ofEngineering (UK) and the Institute of Electrical and Electronics Engineers (USA) He is aFellow of five other professional institutions including the Institute of Physics (UK) theInstitute of Mathematics and its Applications (UK) the British Computer Society (UK) theInstitution of Mechanical Engineers (UK) and the Institution of Engineering and Technology(UK) His publications have been cited more than 16 000 times and his h-index is 60 (GoogleScholar)

xxiv About the Authors

Part OneIntroduction

1Introduction to Bioinformatics

11 Introduction

Interesting research fields emerge through the collaboration of researchers from differentsometimes distant disciplines Examples include biochemistry biophysics quantum informa-tion science systems engineering mechatronics business information systems managementinformation systems geophysics biomedical engineering cybernetics art history media tech-nology and others This marriage between disciplines yields findings which blend the views ofdifferent areas over the same subject or set of dataThe stimuli leading to such collaborations are numerous For example one discipline may

develop tools that generate types of data that require another discipline to analyse In othercases one field scratches a layer of unknowns to discover that significant parts of its scopeare actually based on the principles of another field such as the low-level biological studiesof the chemical interactions in the cells which delivered biochemistry as an interdisciplinaryfield Other interdisciplinary fields emerged because of their complementary involvement inbuilding different parts of the same target system or in understanding different sides of the sameresearch question for example mechatronics engineering aims at building systems which haveboth mechanical and electronic parts such as all modern automobiles Interdisciplinary areaslike business information systems and management information systems have emerged due tothe high demand for information systems which target business and management aspectsalthough generic information systems would meet many of those requirements a customisedfield focusing on such applications is indeed more efficient given such high demandThe interdisciplinary field of this bookrsquos focus is bioinformatics The motive behind this

fieldrsquos emergence is the increasingly expanding generation of massive raw biological data fol-lowing the developments in high-throughput techniques in the last couple of decades The scaleof this high-throughput data is orders of magnitude higher than what can be efficiently analysed

Integrative Cluster Analysis in Bioinformatics First Edition Basel Abu-Jamous Rui Fa and Asoke K Nandicopy 2015 John Wiley amp Sons Ltd Published 2015 by John Wiley amp Sons Ltd

in a manual fashion Consequently information engineers were recruited in order to contributeto data analysis by employing their computational methods Cycles of computational analysissharing of results interdisciplinary discussions and abstractions have led and are still leadingto many key discoveries in biology and medicine This success has attracted many informationengineers towards biology and many biologists towards information engineering to meet ina potentially rich intersection area which itself has grown in size to establish the field ofbioinformatics

12 The ldquoOmicsrdquo Era

A new suffix has been introduced to the English language in this era of high-throughput dataexpansion that is ldquo-omicsrdquo and its relatives ldquo-omerdquo and ldquo-omicrdquo This started in the 1930swhen the entire set of genes carried by a chromosome was called the genome blending thewords ldquogenerdquo and ldquochromosomerdquo (OED 2014) Consequently the analysis of the entiregenome was called genomics and many known research journals carried the term ldquogenomerdquoor ldquogenomicsrdquo in their titles such as Genomics Genome Research Genome Biology BMCGenomics Genome Medicine the Journal of Genetics and Genomics (JGG) and othersThe -ome suffix was not kept exclusive for the genome it has been rather generalised to

indicate the complete set of some type of molecule or object The proteome is the completeset of proteins in a cell tissue or organism Similarly are the transcriptome metabolome gly-come and lipidome for the complete sets of transcripts metabolites glycans (carbohydrates)and lipids In a respective order large-scale studies of those complete sets are known asproteomics transcriptomics metabolomics glycomics and lipidomics The -ome suffix wasfurther generalised to include the complete sets of objects other than basic molecules Forexample the microbiome is the complete set of microorganisms (eg bacteria microscopicfungi etc) in a given environment such as a building a sample of soil or the human gut(Kembel et al 2014) More omic fields have also emerged such as agrigenomics (the appli-cation of genomics in agriculture) pharmacogenomics and pharmacoproteomics (the applica-tion of genomics and proteomics to pharmacology) and othersAll of those biological fields of omics involve high-throughput datasets which are subject to

information engineering involvements and therefore reside at the core focus of bioinformaticresearch An even higher level of omics analysis involves integrative analysis of many types ofomic datasets OMICS a Journal of Integrative Biology is a journal which targets researchstudies that consider such collective analysis at different levels from single cells to societiesMore types of high-throughput omic datasets are expected to emerge The role of bioinfor-

matics as an interdisciplinary field will be more important This is not only because each ofthose omic datasets is massive in size when considered individually it is also because ofthe size of information hidden in the relations between those generally heterogeneous datasetswhich requires more sophisticated computational methods to analyse

13 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods techniques and toolswhich target storage retrieval organisation analysis and presentation of high-throughputbiological data

4 Integrative Cluster Analysis in Bioinformatics


  2.6 Statistical Analysis
  2.7 Software Tools and Technologies
  2.8 Discussion and Summary
  References

Part Two Introduction to Molecular Biology

3 The Living Cell
  3.1 Introduction
  3.2 Prokaryotes and Eukaryotes
  3.3 Multicellularity
    3.3.1 Unicellular and Multicellular Organisms
    3.3.2 Stem Cells and Cell Differentiation
  3.4 Cell Components
    3.4.1 Plasma Membrane and Transport Proteins
    3.4.2 Cytoplasm
    3.4.3 Extracellular Matrix
    3.4.4 Centrosome and Microtubules
    3.4.5 Actin Filaments and the Cytoskeleton
    3.4.6 Nucleus
    3.4.7 Vesicles
    3.4.8 Ribosomes
    3.4.9 Endoplasmic Reticulum
    3.4.10 Golgi Apparatus
    3.4.11 Mitochondrion and the Energy of the Cell
    3.4.12 Lysosome
    3.4.13 Peroxisome
  3.5 Discussion and Summary
  References

4 Central Dogma of Molecular Biology
  4.1 Introduction
  4.2 Central Dogma of Molecular Biology Overview
  4.3 Proteins
  4.4 DNA
  4.5 RNA
  4.6 Genes
  4.7 Transcription and Post-transcriptional Processes
    4.7.1 Post-transcriptional Processes
    4.7.2 Gene-specific TFs
    4.7.3 Post-transcriptional Regulation
  4.8 Translation and Post-translational Processes
    4.8.1 The Genetic Code
    4.8.2 tRNA and Ribosomes
    4.8.3 The Steps of Translation
    4.8.4 Polyribosomes (Polysomes)
    4.8.5 Post-translational Processes
  4.9 Discussion and Summary
  References

Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
  5.1 Introduction
  5.2 Microarrays
    5.2.1 DNA Microarrays
    5.2.2 Protein Microarrays
    5.2.3 Carbohydrate Microarrays (Glycoarrays)
    5.2.4 Other Types of Microarrays
  5.3 Next-generation Sequencing (NGS)
    5.3.1 DNA Sequencing
    5.3.2 RNA Sequencing (Transcriptome Analysis)
    5.3.3 Metagenomics
    5.3.4 Other Applications of Sequencing
  5.4 ChIP on Microarrays and Sequencing
  5.5 Discussion and Summary
  References

6 Databases, Standards and Annotation
  6.1 Introduction
  6.2 NCBI Databases
    6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
    6.2.2 GenBank (Nucleotide Database)
    6.2.3 Reference Sequences (RefSeq) Database
    6.2.4 Gene Database
    6.2.5 Protein Database
    6.2.6 Gene Expression Omnibus
    6.2.7 Taxonomy and HomoloGene Databases
    6.2.8 Sequence Read Archive
    6.2.9 Genomic and Epigenomic Variations
    6.2.10 Other NCBI Databases
  6.3 The EBI Databases
  6.4 Species-specific Databases
    6.4.1 Animals
    6.4.2 Plants
    6.4.3 Fungi
    6.4.4 Archaea and Bacteria
    6.4.5 Viruses
  6.5 Discussion and Summary
  References

7 Normalisation
  7.1 Introduction
  7.2 Issues Tackled by Normalisation
    7.2.1 Within-slide and Between-slides Normalisation
    7.2.2 Normalisation Based on Non-differentially Expressed Genes
    7.2.3 Background Correction
    7.2.4 Logarithmic Transformation
    7.2.5 Intensity-dependent Bias – (MA) Plots
    7.2.6 Replicates and Summarisation
  7.3 Normalisation Methods
    7.3.1 Microarray Suite 5 (MAS 5.0)
    7.3.2 Robust Multi-array Average (RMA)
    7.3.3 Quantile Normalisation
    7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
    7.3.5 Scaling Methods
    7.3.6 Model-based Expression Index (MBEI)
    7.3.7 Other Normalisation Methods
  7.4 Discussion and Summary
  References

8 Feature Selection
  8.1 Introduction
  8.2 FS and FG – Problem Definition
  8.3 Consecutive Ranking
    8.3.1 Forward Search (Most Informative First Admitted)
    8.3.2 Backward Elimination (Least Useful First Eliminated)
  8.4 Individual Ranking
    8.4.1 Information Content
    8.4.2 SNR Criteria
  8.5 Principal Component Analysis
  8.6 Genetic Algorithms and Genetic Programming
  8.7 Discussion and Summary
  References

9 Differential Expression
  9.1 Introduction
  9.2 Fold Change
  9.3 Statistical Hypothesis Testing – Overview
    9.3.1 p-Values and Volcano Plots
    9.3.2 The Multiple-hypothesis Testing Problem
  9.4 Statistical Hypothesis Testing – Methods
    9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
    9.4.2 B-Statistic
    9.4.3 Fisher’s Exact Test
    9.4.4 Likelihood Ratio Test
    9.4.5 Methods for Over-dispersed Poisson Distribution
  9.5 Discussion and Summary
  References

Part Four Clustering Methods

10 Clustering Forms
  10.1 Introduction
  10.2 Proximity Measures
    10.2.1 Distance Metrics for Discrete Feature Objects
    10.2.2 Distance Metrics for Continuous Feature Objects
  10.3 Clustering Families
    10.3.1 Partitional Clustering
    10.3.2 Hierarchical Clustering
    10.3.3 Fuzzy Clustering
    10.3.4 Neural Network-based Clustering
    10.3.5 Mixture Model Clustering
    10.3.6 Graph-based Clustering
    10.3.7 Consensus Clustering
    10.3.8 Biclustering
  10.4 Clusters and Partitions
  10.5 Discussion and Summary
  References

11 Partitional Clustering
  11.1 Introduction
  11.2 k-Means and its Applications
    11.2.1 Principles
    11.2.2 Variations
    11.2.3 Applications
  11.3 k-Medoids and its Applications
    11.3.1 Principles
    11.3.2 Variations
    11.3.3 Applications
  11.4 Discussion and Summary
  References

12 Hierarchical Clustering
  12.1 Introduction
  12.2 Principles
    12.2.1 Agglomerative Methods
    12.2.2 Divisive Methods
  12.3 Discussion and Summary
  References

13 Fuzzy Clustering
  13.1 Introduction
  13.2 Principles
    13.2.1 Fuzzy c-Means
    13.2.2 Probabilistic c-Means
    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike’s Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn’s Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method – Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part on molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part gives those readers the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing such as normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods which can deal with specific problems that appear in biological datasets are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed at https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015

List of Symbols

X      Data matrix
x_n    The nth data object
S      Similarity matrix
N      Number of data objects
M      Number of features or samples
K      Number of clusters
D      Dissimilarity between two input data objects
S      Similarity between two input data objects
Z      Partition (vector sequence with the length of N)
U      Partition matrix (with N rows and K columns)
u_kn   Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C      Cluster set containing K clusters C_1, …, C_K
C_k    The kth cluster, containing the indices of those data objects belonging to it
c_k    The centroid of the kth cluster
κ      Kernel function
m      The fuzzifier
μ      The mean vector
Σ      The covariance matrix
Σ      The within-cluster covariance matrix
T      The matrix transpose operator
−1     The matrix inverse
det    The determinant operator
exp    The exponential function
|·|    The cardinality of a set
‖·‖    The Euclidean norm
Θ      The total parameter set
G      Number of groups
τ      The mixing parameter
ℒ      The likelihood function
A      The adjacency matrix
B      The incidence matrix
D      The degree matrix
L      The Laplacian matrix
G      A graph
V      The set of vertices
E      The set of edges
M      The modularity matrix
Q      The modularity value
N_V    Number of vertices
N_E    Number of edges
θ      A smaller parameter set
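As an informal illustration of how several of the symbols above relate to one another, the following minimal Python sketch builds a toy dataset and derives a crisp partition matrix, the cluster set and the centroids from a partition vector. The data values, the zero-based indices, the variable names and the use of the feature-wise mean as the centroid are assumptions made for illustration only; they are not taken from the book.

```python
# Hypothetical toy data: N = 4 data objects, M = 2 features, K = 2 clusters.
X = [[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.8, 9.4]]  # data matrix, row n is object x_n
Z = [0, 0, 1, 1]  # partition vector of length N: Z[n] is the cluster of object n
K = 2
N = len(X)

# Partition matrix U with N rows and K columns. U[n][k] is the membership of
# the n-th object in the k-th cluster (crisp here, so each row holds a single 1).
U = [[1 if Z[n] == k else 0 for k in range(K)] for n in range(N)]

# Cluster set C: C[k] contains the indices of the objects belonging to cluster k.
C = [[n for n in range(N) if Z[n] == k] for k in range(K)]


def centroid(indices, data):
    """Feature-wise mean of the selected objects (assumed centroid definition)."""
    n_features = len(data[0])
    return [sum(data[n][j] for n in indices) / len(indices)
            for j in range(n_features)]


centroids = [centroid(C[k], X) for k in range(K)]

print("U =", U)
print("C =", C)
print("centroids =", centroids)
```

In a fuzzy partition the entries of U would instead be real-valued memberships between 0 and 1, which is why the partition matrix is listed separately from the partition vector Z.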

About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).

Part One Introduction

1 Introduction to Bioinformatics

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

1.2 The "Omics" Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is "-omics", and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.



4.9 Discussion and Summary
References

Part Three Data Acquisition and Pre-processing

5 High-throughput Technologies
5.1 Introduction
5.2 Microarrays
5.2.1 DNA Microarrays
5.2.2 Protein Microarrays
5.2.3 Carbohydrate Microarrays (Glycoarrays)
5.2.4 Other Types of Microarrays
5.3 Next-generation Sequencing (NGS)
5.3.1 DNA Sequencing
5.3.2 RNA Sequencing (Transcriptome Analysis)
5.3.3 Metagenomics
5.3.4 Other Applications of Sequencing
5.4 ChIP on Microarrays and Sequencing
5.5 Discussion and Summary
References

6 Databases, Standards and Annotation
6.1 Introduction
6.2 NCBI Databases
6.2.1 Literature Databases (PubMed, PMC, the Bookshelf and MeSH)
6.2.2 GenBank (Nucleotide Database)
6.2.3 Reference Sequences (RefSeq) Database
6.2.4 Gene Database
6.2.5 Protein Database
6.2.6 Gene Expression Omnibus
6.2.7 Taxonomy and HomoloGene Databases
6.2.8 Sequence Read Archive
6.2.9 Genomic and Epigenomic Variations
6.2.10 Other NCBI Databases
6.3 The EBI Databases
6.4 Species-specific Databases
6.4.1 Animals
6.4.2 Plants
6.4.3 Fungi
6.4.4 Archaea and Bacteria
6.4.5 Viruses
6.5 Discussion and Summary
References

7 Normalisation
7.1 Introduction
7.2 Issues Tackled by Normalisation


7.2.1 Within-slide and Between-slides Normalisation
7.2.2 Normalisation Based on Non-differentially Expressed Genes
7.2.3 Background Correction
7.2.4 Logarithmic Transformation
7.2.5 Intensity-dependent Bias – (MA) Plots
7.2.6 Replicates and Summarisation
7.3 Normalisation Methods
7.3.1 Microarray Suite 5 (MAS 5.0)
7.3.2 Robust Multi-array Average (RMA)
7.3.3 Quantile Normalisation
7.3.4 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation
7.3.5 Scaling Methods
7.3.6 Model-based Expression Index (MBEI)
7.3.7 Other Normalisation Methods
7.4 Discussion and Summary
References

8 Feature Selection
8.1 Introduction
8.2 FS and FG – Problem Definition
8.3 Consecutive Ranking
8.3.1 Forward Search (Most Informative First Admitted)
8.3.2 Backward Elimination (Least Useful First Eliminated)
8.4 Individual Ranking
8.4.1 Information Content
8.4.2 SNR Criteria
8.5 Principal Component Analysis
8.6 Genetic Algorithms and Genetic Programming
8.7 Discussion and Summary
References

9 Differential Expression
9.1 Introduction
9.2 Fold Change
9.3 Statistical Hypothesis Testing – Overview
9.3.1 p-Values and Volcano Plots
9.3.2 The Multiple-hypothesis Testing Problem
9.4 Statistical Hypothesis Testing – Methods
9.4.1 t-Statistic, Modified t-Statistics and the Analysis of Variance (ANOVA)
9.4.2 B-Statistic
9.4.3 Fisher's Exact Test
9.4.4 Likelihood Ratio Test
9.4.5 Methods for Over-dispersed Poisson Distribution
9.5 Discussion and Summary
References


Part Four Clustering Methods

10 Clustering Forms
10.1 Introduction
10.2 Proximity Measures
10.2.1 Distance Metrics for Discrete Feature Objects
10.2.2 Distance Metrics for Continuous Feature Objects
10.3 Clustering Families
10.3.1 Partitional Clustering
10.3.2 Hierarchical Clustering
10.3.3 Fuzzy Clustering
10.3.4 Neural Network-based Clustering
10.3.5 Mixture Model Clustering
10.3.6 Graph-based Clustering
10.3.7 Consensus Clustering
10.3.8 Biclustering
10.4 Clusters and Partitions
10.5 Discussion and Summary
References

11 Partitional Clustering
11.1 Introduction
11.2 k-Means and its Applications
11.2.1 Principles
11.2.2 Variations
11.2.3 Applications
11.3 k-Medoids and its Applications
11.3.1 Principles
11.3.2 Variations
11.3.3 Applications
11.4 Discussion and Summary
References

12 Hierarchical Clustering
12.1 Introduction
12.2 Principles
12.2.1 Agglomerative Methods
12.2.2 Divisive Methods
12.3 Discussion and Summary
References

13 Fuzzy Clustering
13.1 Introduction
13.2 Principles
13.2.1 Fuzzy c-Means
13.2.2 Probabilistic c-Means


13.2.3 Hybrid c-Means
13.2.4 Gustafson–Kessel Algorithm
13.2.5 Gath–Geva Algorithm
13.2.6 Fuzzy c-Shell
13.2.7 FANNY
13.2.8 Other Fuzzy Clustering Algorithms
13.3 Discussion
References

14 Neural Network-based Clustering
14.1 Introduction
14.2 Algorithms
14.2.1 SOM
14.2.2 GLVQ
14.2.3 Neural-gas
14.2.4 ART
14.2.5 OPTOC
14.2.6 SOON
14.3 Discussion
References

15 Mixture Model Clustering
15.1 Introduction
15.2 Finite Mixture Models
15.2.1 Various Mixture Models
15.2.2 Non-Bayesian Methods
15.2.3 Bayesian Methods
15.3 Infinite Mixture Models
15.3.1 DPM Model
15.3.2 CRP Mixture Model
15.3.3 SBP Mixture Model
15.4 Discussion
References

16 Graph Clustering
16.1 Introduction
16.2 Basic Definitions
16.2.1 Graph and Adjacency Matrix
16.2.2 Measures and Metrics
16.2.3 Similarity Matrices
16.3 Graph Clustering
16.3.1 Graph Cut Clustering
16.3.2 Spectral Clustering
16.3.3 AP Clustering
16.3.4 Modularity-based Clustering


16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
16.3.6 Markov Cluster Algorithm
16.4 Resources
16.5 Discussion
References

17 Consensus Clustering
17.1 Introduction
17.2 Overview
17.3 Consensus Functions
17.3.1 P–P Comparison
17.3.2 C–C Comparison
17.3.3 MIC Voting
17.3.4 M–M Co-occurrence
17.4 Discussion
References

18 Biclustering
18.1 Introduction
18.2 Overview
18.2.1 Statement of the Biclustering Problem
18.2.2 Types of Biclusters
18.2.3 Classification of Biclustering
18.3 Biclustering Methods
18.3.1 Variance-minimisation Biclustering Methods
18.3.2 Correlation-maximisation Biclustering Methods
18.3.3 Two-way Clustering Methods
18.3.4 Probabilistic and Generative Methods
18.4 Discussion
References

19 Clustering Methods Discussion
19.1 Introduction
19.2 Hierarchical Clustering
19.2.1 Yeast Cell Cycle Data
19.2.2 Breast Cancer
19.2.3 Diffuse Large B-Cell Lymphoma
19.3 Fuzzy Clustering
19.3.1 DNA Motifs Clustering
19.3.2 Microarray Gene Expression
19.4 Neural Network-based Clustering
19.5 Mixture Model-based Clustering
19.5.1 Examples of Finite Mixture Models
19.5.2 Examples of Infinite Mixture Models
19.6 Graph-based Clustering


19.7 Consensus Clustering
19.8 Biclustering
19.9 Summary
References

Part Five Validation and Visualisation

20 Numerical Validation
20.1 Introduction
20.2 External Criteria
20.2.1 Rand Index
20.2.2 Adjusted Rand Index
20.2.3 Jaccard Index
20.2.4 Normalised Mutual Information
20.3 Internal Criteria
20.3.1 Adjusted Figure of Merit
20.3.2 CLEST
20.4 Relative Criteria
20.4.1 Minimum Description Length
20.4.2 Minimum Message Length
20.4.3 Bayesian Information Criterion
20.4.4 Akaike's Information Criterion
20.4.5 Partition Coefficient
20.4.6 Partition Entropy
20.4.7 Fukuyama–Sugeno Index
20.4.8 Xie–Beni Index
20.4.9 Calinski–Harabasz Index
20.4.10 Dunn's Index
20.4.11 Davies–Bouldin Index
20.4.12 I Index
20.4.13 Silhouette
20.4.14 Object-based Validation
20.4.15 Geometrical Index
20.4.16 Validity Index
20.4.17 Generalised Parametric Validity
20.5 Discussion and Summary
References

21 Biological Validation
21.1 Introduction
21.2 GO Analysis
21.2.1 GO Term Enrichment
21.3 Upstream Sequence Analysis
21.4 Gene-network Analysis
21.5 Discussion and Summary
References


22 Visualisations and Presentations
22.1 Introduction
22.2 Methods and Examples
22.2.1 Profile Patterns
22.2.2 Bar Chart
22.2.3 Error Bar
22.2.4 Pie Chart
22.2.5 Box Plot
22.2.6 Histogram
22.2.7 Scatter Plot
22.2.8 Venn Diagram
22.2.9 Tree View
22.2.10 Heat Map
22.2.11 Network Graph
22.2.12 Low-dimension Display
22.2.13 Receiver Operating Characteristics Curves
22.2.14 Kaplan–Meier Plot
22.2.15 Block Diagram
22.3 Summary
References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
23.1 Introduction
23.2 Related Work
23.2.1 SSCL
23.2.2 SSMCL
23.2.3 ULFMM
23.2.4 VBGM
23.2.5 PFClust
23.2.6 DBSCAN
23.3 SMART Framework
23.4 Implementations
23.4.1 SMART-CL
23.4.2 SMART-FMM
23.4.3 SMART-MFA
23.5 Enhanced SMART
23.6 Examples
23.7 Discussion
References

24 Tightness-tunable Clustering (UNCLES)
24.1 Introduction
24.2 Bi-CoPaM Method
24.2.1 Partition Generation


24.2.2 Relabelling
24.2.3 Fuzzy Consensus Partition Matrix Generation
24.2.4 Binarisation
24.3 UNCLES Method – Other Types of External Specifications
24.4 M–N Scatter Plots Technique
24.5 Parameter-free UNCLES with M–N Plots
24.6 Discussion and Summary
References

Appendix

Index


Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books, and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015


List of Symbols

X Data matrix
x_n The nth data object
S Similarity matrix
N Number of data objects
M Number of features or samples
K Number of clusters
D Dissimilarity between two input data objects
S Similarity between two input data objects
Z Partition (vector sequence with the length of N)
U Partition matrix (with N rows and K columns)
u_kn Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C Cluster set containing K clusters C_1, …, C_K
C_k The kth cluster containing the indices of those data objects belonging to it
c_k The centroid of the kth cluster
κ Kernel function
m The fuzzifier
μ The mean vector
Σ The covariance matrix
Σ The within-cluster covariance matrix
T The matrix transpose operator
−1 The matrix inverse
det The determinant operator
exp The exponential function
The cardinality of a set
The Euclidean norm
Θ The total parameter set
G Number of groups
τ The mixing parameter
The likelihood function
A The adjacency matrix
B The incident matrix
D The degree matrix
L The Laplacian matrix
G A graph
V The set of vertices
E The set of edges
M The modularity matrix
Q The modularity value
N_V Number of vertices
N_E Number of edges
θ A smaller parameter set
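
To make the clustering notation above concrete, the following is a minimal sketch, not taken from the book, of how X, Z, U, C_k and c_k could be held as arrays. The toy values, the variable names and the use of NumPy are assumptions made for illustration only, and indices are 0-based here whereas the notation above counts objects and clusters from 1.

```python
import numpy as np

# Hypothetical toy data (illustrative values only): N = 4 objects, M = 2 features.
X = np.array([[0.0, 0.1],
              [0.2, 0.0],
              [5.0, 5.1],
              [5.2, 4.9]])
N, M = X.shape
K = 2  # number of clusters, assumed for this example

# Partition vector Z of length N: Z[n] is the cluster index of the nth object.
Z = np.array([0, 0, 1, 1])

# Crisp partition matrix U with N rows and K columns:
# U[n, k] = 1 if the nth object belongs to the kth cluster, else 0.
U = np.zeros((N, K))
U[np.arange(N), Z] = 1.0

# Cluster sets C_k: indices of the objects belonging to each cluster.
C = [np.flatnonzero(Z == k) for k in range(K)]

# Centroids c_k: mean of the members of each cluster (one row per cluster).
centroids = (U.T @ X) / U.sum(axis=0)[:, None]
```

In a fuzzy partition the entries of U would instead be memberships between 0 and 1, and the same array shape would still apply.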


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).




721 Within-slide and Between-slides Normalisation 88722 Normalisation Based on Non-differentially Expressed Genes 88723 Background Correction 89724 Logarithmic Transformation 90725 Intensity-dependent Bias ndash (MA) Plots 91726 Replicates and Summarisation 91

73 Normalisation Methods 92731 Microarray Suite 5 (MAS 50) 92732 Robust Multi-array Average (RMA) 92733 Quantile Normalisation 95734 Locally Weighted Scatter-plot Smoothing (Lowess) Normalisation 96735 Scaling Methods 98736 Model-based Expression Index (MBEI) 99737 Other Normalisation Methods 100

74 Discussion and Summary 103References 104

8 Feature Selection 10981 Introduction 10982 FS and FG ndash Problem Definition 11083 Consecutive Ranking 111

831 Forward Search (Most Informative First Admitted) 111832 Backward Elimination (Least Useful First Eliminated) 112

84 Individual Ranking 112841 Information Content 112842 SNR Criteria 113

85 Principal Component Analysis 11586 Genetic Algorithms and Genetic Programming 11587 Discussion and Summary 115

References 117

9 Differential Expression 11991 Introduction 11992 Fold Change 12093 Statistical Hypothesis Testing ndash Overview 120

931 p-Values and Volcano Plots 121932 The Multiple-hypothesis Testing Problem 121

94 Statistical Hypothesis Testing ndash Methods 123941 t-Statistic Modified t-Statistics and the Analysis

of Variance (ANOVA) 123942 B-Statistic 125943 Fisherrsquos Exact Test 126944 Likelihood Ratio Test 128945 Methods for Over-dispersed Poisson Distribution 129

95 Discussion and Summary 129References 131

xii Contents

Part Four Clustering Methods 133

10 Clustering Forms 135101 Introduction 135102 Proximity Measures 136

1021 Distance Metrics for Discrete Feature Objects 1361022 Distance Metrics for Continuous Feature Objects 137

103 Clustering Families 1371031 Partitional Clustering 1371032 Hierarchical Clustering 1381033 Fuzzy Clustering 1391034 Neural Network-based Clustering 1391035 Mixture Model Clustering 1391036 Graph-based Clustering 1391037 Consensus Clustering 1401038 Biclustering 140

104 Clusters and Partitions 140105 Discussion and Summary 140

References 141

11 Partitional Clustering 143111 Introduction 143112 k-Means and its Applications 144

1121 Principles 1441122 Variations 1461123 Applications 150

113 k-Medoids and its Applications 1511131 Principles 1511132 Variations 1511133 Applications 152

114 Discussion and Summary 153References 154

12 Hierarchical Clustering 157121 Introduction 157122 Principles 158

1221 Agglomerative Methods 1581222 Divisive Methods 162

123 Discussion and Summary 164References 165

13 Fuzzy Clustering 167131 Introduction 167132 Principles 168

1321 Fuzzy c-Means 1681322 Probabilistic c-Means 170

xiiiContents

1323 Hybrid c-Means 1711324 GustafsonndashKessel Algorithm 1741325 GathndashGeva Algorithm 1741326 Fuzzy c-Shell 1751327 FANNY 1761328 Other Fuzzy Clustering Algorithms 176

133 Discussion 177References 177

14 Neural Network-based Clustering 181141 Introduction 181142 Algorithms 182

1421 SOM 1821422 GLVQ 1841423 Neural-gas 1851424 ART 1861425 OPTOC 1891426 SOON 190

143 Discussion 193References 194

15 Mixture Model Clustering
15.1 Introduction
15.2 Finite Mixture Models
15.2.1 Various Mixture Models
15.2.2 Non-Bayesian Methods
15.2.3 Bayesian Methods
15.3 Infinite Mixture Models
15.3.1 DPM Model
15.3.2 CRP Mixture Model
15.3.3 SBP Mixture Model
15.4 Discussion
References

16 Graph Clustering
16.1 Introduction
16.2 Basic Definitions
16.2.1 Graph and Adjacency Matrix
16.2.2 Measures and Metrics
16.2.3 Similarity Matrices
16.3 Graph Clustering
16.3.1 Graph Cut Clustering
16.3.2 Spectral Clustering
16.3.3 AP Clustering
16.3.4 Modularity-based Clustering
16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
16.3.6 Markov Cluster Algorithm
16.4 Resources
16.5 Discussion
References

17 Consensus Clustering
17.1 Introduction
17.2 Overview
17.3 Consensus Functions
17.3.1 P–P Comparison
17.3.2 C–C Comparison
17.3.3 MIC Voting
17.3.4 M–M Co-occurrence
17.4 Discussion
References

18 Biclustering
18.1 Introduction
18.2 Overview
18.2.1 Statement of the Biclustering Problem
18.2.2 Types of Biclusters
18.2.3 Classification of Biclustering
18.3 Biclustering Methods
18.3.1 Variance-minimisation Biclustering Methods
18.3.2 Correlation-maximisation Biclustering Methods
18.3.3 Two-way Clustering Methods
18.3.4 Probabilistic and Generative Methods
18.4 Discussion
References

19 Clustering Methods Discussion
19.1 Introduction
19.2 Hierarchical Clustering
19.2.1 Yeast Cell Cycle Data
19.2.2 Breast Cancer
19.2.3 Diffuse Large B-Cell Lymphoma
19.3 Fuzzy Clustering
19.3.1 DNA Motifs Clustering
19.3.2 Microarray Gene Expression
19.4 Neural Network-based Clustering
19.5 Mixture Model-based Clustering
19.5.1 Examples of Finite Mixture Models
19.5.2 Examples of Infinite Mixture Models
19.6 Graph-based Clustering
19.7 Consensus Clustering
19.8 Biclustering
19.9 Summary
References

Part Five Validation and Visualisation

20 Numerical Validation
20.1 Introduction
20.2 External Criteria
20.2.1 Rand Index
20.2.2 Adjusted Rand Index
20.2.3 Jaccard Index
20.2.4 Normalised Mutual Information
20.3 Internal Criteria
20.3.1 Adjusted Figure of Merit
20.3.2 CLEST
20.4 Relative Criteria
20.4.1 Minimum Description Length
20.4.2 Minimum Message Length
20.4.3 Bayesian Information Criterion
20.4.4 Akaike's Information Criterion
20.4.5 Partition Coefficient
20.4.6 Partition Entropy
20.4.7 Fukuyama–Sugeno Index
20.4.8 Xie–Beni Index
20.4.9 Calinski–Harabasz Index
20.4.10 Dunn's Index
20.4.11 Davies–Bouldin Index
20.4.12 I Index
20.4.13 Silhouette
20.4.14 Object-based Validation
20.4.15 Geometrical Index
20.4.16 Validity Index
20.4.17 Generalised Parametric Validity
20.5 Discussion and Summary
References

21 Biological Validation
21.1 Introduction
21.2 GO Analysis
21.2.1 GO Term Enrichment
21.3 Upstream Sequence Analysis
21.4 Gene-network Analysis
21.5 Discussion and Summary
References

22 Visualisations and Presentations
22.1 Introduction
22.2 Methods and Examples
22.2.1 Profile Patterns
22.2.2 Bar Chart
22.2.3 Error Bar
22.2.4 Pie Chart
22.2.5 Box Plot
22.2.6 Histogram
22.2.7 Scatter Plot
22.2.8 Venn Diagram
22.2.9 Tree View
22.2.10 Heat Map
22.2.11 Network Graph
22.2.12 Low-dimension Display
22.2.13 Receiver Operating Characteristics Curves
22.2.14 Kaplan–Meier Plot
22.2.15 Block Diagram
22.3 Summary
References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
23.1 Introduction
23.2 Related Work
23.2.1 SSCL
23.2.2 SSMCL
23.2.3 ULFMM
23.2.4 VBGM
23.2.5 PFClust
23.2.6 DBSCAN
23.3 SMART Framework
23.4 Implementations
23.4.1 SMART-CL
23.4.2 SMART-FMM
23.4.3 SMART-MFA
23.5 Enhanced SMART
23.6 Examples
23.7 Discussion
References

24 Tightness-tunable Clustering (UNCLES)
24.1 Introduction
24.2 Bi-CoPaM Method
24.2.1 Partition Generation
24.2.2 Relabelling
24.2.3 Fuzzy Consensus Partition Matrix Generation
24.2.4 Binarisation
24.3 UNCLES Method - Other Types of External Specifications
24.4 M–N Scatter Plots Technique
24.5 Parameter-free UNCLES with M–N Plots
24.6 Discussion and Summary
References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigations and considerations by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning are also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015

List of Symbols

X  Data matrix
x_n  The nth data object
S  Similarity matrix
N  Number of data objects
M  Number of features or samples
K  Number of clusters
D  Dissimilarity between two input data objects
S  Similarity between two input data objects
Z  Partition (vector sequence with the length of N)
U  Partition matrix (with N rows and K columns)
u_kn  Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C  Cluster set containing K clusters C_1, …, C_K
C_k  The kth cluster, containing the indices of those data objects belonging to it
c_k  The centroid of the kth cluster
κ  Kernel function
m  The fuzzifier
μ  The mean vector
Σ  The covariance matrix
Σ  The within-cluster covariance matrix
T  The matrix transpose operator
−1  The matrix inverse
det  The determinant operator
exp  The exponential function
|·|  The cardinality of a set
‖·‖  The Euclidean norm
Θ  The total parameter set
G  Number of groups
τ  The mixing parameter
ℒ  The likelihood function
A  The adjacency matrix
B  The incident matrix
D  The degree matrix
L  The Laplacian matrix
G  A graph
V  The set of vertices
E  The set of edges
M  The modularity matrix
Q  The modularity value
N_V  Number of vertices
N_E  Number of edges
θ  A smaller parameter set
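As a rough illustration of how three of the symbols above relate (the partition Z, the partition matrix U and the clusters C_1, …, C_K), the following minimal Python sketch converts a crisp partition vector into a binary partition matrix and into index sets. The function names, the 0-based integer labels and the example data are assumptions made for illustration only; they do not come from the book. The matrix is laid out with N rows and K columns as stated in the list, so the entry written u_kn is stored at u[n][k].

```python
def partition_matrix(z, num_clusters):
    """Build a binary partition matrix U from a partition vector Z.

    Assumes the labels in z are integers 0..num_clusters-1 (an
    illustrative convention) and uses N rows and K columns.
    """
    u = [[0] * num_clusters for _ in z]
    for n, k in enumerate(z):
        u[n][k] = 1  # object n is a member of cluster k
    return u


def cluster_sets(z, num_clusters):
    """Return the clusters as lists of the object indices they contain."""
    clusters = [[] for _ in range(num_clusters)]
    for n, k in enumerate(z):
        clusters[k].append(n)
    return clusters


z = [0, 2, 1, 0]            # N = 4 objects assigned to K = 3 clusters
u = partition_matrix(z, 3)  # u[n][k] == 1 only if object n is in cluster k
c = cluster_sets(z, 3)      # c[k] lists the objects in cluster k
```

In a crisp partition of this kind every row of U contains exactly one 1; a fuzzy partition would instead hold membership values between 0 and 1 in each row.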


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).

Part One
Introduction

1 Introduction to Bioinformatics

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

1.2 The "Omics" Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion; that is "-omics" and its relatives "-ome" and "-omic". This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words "gene" and "chromosome" (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term "genome" or "genomics" in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive for the genome; it has been rather generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil, or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.


Page 12: Thumbnail - download.e-bookshelf.de13 Fuzzy Clustering 167 14 Neural Network-based Clustering 181. ... 2.2 Machine Learning and Data Mining 10 2.2.1 Supervised Learning 10 2.2.2 Unsupervised

Part Four Clustering Methods 133

10 Clustering Forms 135101 Introduction 135102 Proximity Measures 136

1021 Distance Metrics for Discrete Feature Objects 1361022 Distance Metrics for Continuous Feature Objects 137

103 Clustering Families 1371031 Partitional Clustering 1371032 Hierarchical Clustering 1381033 Fuzzy Clustering 1391034 Neural Network-based Clustering 1391035 Mixture Model Clustering 1391036 Graph-based Clustering 1391037 Consensus Clustering 1401038 Biclustering 140

104 Clusters and Partitions 140105 Discussion and Summary 140

References 141

11 Partitional Clustering 143111 Introduction 143112 k-Means and its Applications 144

1121 Principles 1441122 Variations 1461123 Applications 150

113 k-Medoids and its Applications 1511131 Principles 1511132 Variations 1511133 Applications 152

114 Discussion and Summary 153References 154

12 Hierarchical Clustering 157121 Introduction 157122 Principles 158

1221 Agglomerative Methods 1581222 Divisive Methods 162

123 Discussion and Summary 164References 165

13 Fuzzy Clustering 167131 Introduction 167132 Principles 168

1321 Fuzzy c-Means 1681322 Probabilistic c-Means 170

xiiiContents

1323 Hybrid c-Means 1711324 GustafsonndashKessel Algorithm 1741325 GathndashGeva Algorithm 1741326 Fuzzy c-Shell 1751327 FANNY 1761328 Other Fuzzy Clustering Algorithms 176

133 Discussion 177References 177

14 Neural Network-based Clustering 181141 Introduction 181142 Algorithms 182

1421 SOM 1821422 GLVQ 1841423 Neural-gas 1851424 ART 1861425 OPTOC 1891426 SOON 190

143 Discussion 193References 194

15 Mixture Model Clustering 197151 Introduction 197152 Finite Mixture Models 199

1521 Various Mixture Models 1991522 Non-Bayesian Methods 2041523 Bayesian Methods 212

153 Infinite Mixture Models 2201531 DPM Model 2201532 CRP Mixture Model 2221533 SBP Mixture Model 222

154 Discussion 223References 224

16 Graph Clustering 227161 Introduction 227162 Basic Definitions 228

1621 Graph and Adjacency Matrix 2281622 Measures and Metrics 2291623 Similarity Matrices 232

163 Graph Clustering 2331631 Graph Cut Clustering 2331632 Spectral Clustering 2341633 AP Clustering 2361634 Modularity-based Clustering 238

xiv Contents

1635 Multilevel Graph Partitioning and Hypergraph Partitioning 2421636 Markov Cluster Algorithm 243

164 Resources 243165 Discussion 244

References 245

17 Consensus Clustering 247171 Introduction 247172 Overview 248173 Consensus Functions 249

1731 PndashP Comparison 2491732 CndashC Comparison 2531733 MIC Voting 2541734 MndashM Co-occurrence 257

174 Discussion 261References 262

18 Biclustering 265181 Introduction 265182 Overview 266

1821 Statement of the Biclustering Problem 2661822 Types of Biclusters 2661823 Classification of Biclustering 267

183 Biclustering Methods 2681831 Variance-minimisation Biclustering Methods 2681832 Correlation-maximisation Biclustering Methods 2701833 Two-way Clustering Methods 2731834 Probabilistic and Generative Methods 274

184 Discussion 278References 280

19 Clustering Methods Discussion 283191 Introduction 283192 Hierarchical Clustering 283

1921 Yeast Cell Cycle Data 2841922 Breast Cancer 2841923 Diffuse Large B-Cell Lymphoma 286

193 Fuzzy Clustering 2871931 DNA Motifs Clustering 2871932 Microarray Gene Expression 287

194 Neural Network-based Clustering 289195 Mixture Model-based Clustering 290

1951 Examples of Finite Mixture Models 2911952 Examples of Infinite Mixture Models 292

196 Graph-based Clustering 293

xvContents

197 Consensus Clustering 295198 Biclustering 296199 Summary 297

References 298

Part Five Validation and Visualisation 303

20 Numerical Validation 305201 Introduction 305202 External Criteria 306

2021 Rand Index 3062022 Adjusted Rand Index 3072023 Jaccard Index 3082024 Normalised Mutual Information 308

203 Internal Criteria 3082031 Adjusted Figure of Merit 3082032 CLEST 309

204 Relative Criteria 3102041 Minimum Description Length 3102042 Minimum Message Length 3112043 Bayesian Information Criterion 3112044 Akaikersquos Information Criterion 3122045 Partition Coefficient 3122046 Partition Entropy 3122047 FukuyamandashSugeno Index 3122048 XiendashBeni Index 3132049 CalinskindashHarabasz Index 31320410 Dunnrsquos Index 31320411 DaviesndashBouldin Index 31420412 I Index 31420413 Silhouette 31420414 Object-based Validation 31520415 Geometrical Index 31620416 Validity Index 31620417 Generalised Parametric Validity 317

205 Discussion and Summary 318References 320

21 Biological Validation 323211 Introduction 323212 GOAnalysis 323

2121 GO Term Enrichment 326213 Upstream Sequence Analysis 331214 Gene-network Analysis 333215 Discussion and Summary 337

References 338

xvi Contents

22 Visualisations and Presentations 339221 Introduction 339222 Methods and Examples 339

2221 Profile Patterns 3392222 Bar Chart 3412223 Error Bar 3422224 Pie Chart 3452225 Box Plot 3452226 Histogram 3462227 Scatter Plot 3472228 Venn Diagram 3482229 Tree View 35022210 Heat Map 35322211 Network Graph 35422212 Low-dimension Display 35522213 Receiver Operating Characteristics Curves 35522214 KaplanndashMeier Plot 35622215 Block Diagram 358

223 Summary 359References 361

Part Six New Clustering Frameworks Designed for Bioinformatics 363

23 Splitting-Merging Awareness Tactics (SMART) 365231 Introduction 365232 Related Work 366

2321 SSCL 3662322 SSMCL 3672323 ULFMM 3672324 VBGM 3672325 PFClust 3682326 DBSCAN 368

233 SMART Framework 369234 Implementations 370

2341 SMART-CL 3702342 SMART-FMM 3722343 SMART-MFA 374

235 Enhanced SMART 377236 Examples 378237 Discussion 383

References 383

24 Tightness-tunable Clustering (UNCLES) 385241 Introduction 385242 Bi-CoPaM Method 385

2421 Partition Generation 386

xviiContents

2422 Relabelling 3862423 Fuzzy Consensus Partition Matrix Generation 3882424 Binarisation 388

243 UNCLES Method - Other Types of External Specifications 390244 MndashN Scatter Plots Technique 391245 Parameter-free UNCLES with MndashN Plots 393246 Discussion and Summary 393

References 394

Appendix 395

Index 409

xviii Contents

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development ofmethods for recording organising and analysing biological data With the advent of humangenome-sequencing microarrays and high-performance computing developing software toolsto generate useful biological knowledge has been a major activityClustering techniques have been increasingly used in the analysis of high-throughput

biological datasets Although many generic clustering methods have been used successfullyto analyse biological datasets many specific properties of those datasets require customisedmethods which are specifically designed to meet such properties Therefore both aspects ndashthe design and the application of clustering methods in this field ndash have been under activeinvestigations and considerations by many research groups around the world Indeed therehave been many activities in assimilating the work in this field especially by designingstate-of-the-art methods which uniquely address various issues that are particularly relevantin biological dataThis book attempts to outline the complete pathway from the basics of molecular biology to

the generation of biological knowledge It is supplied with an introductory part to molecularbiology at a level which can be understood by researchers coming from a numeric backgroundsuch as computer scientists and information engineers The introductory part helps those read-ers to get introduced to the basic biological knowledge needed to appreciate the specific appli-cations of the methods in this book The book also explains the structure and properties of manytypes of high-throughput datasets commonly found in biological studies including public repo-sitoriesdatabases pre-processing like normalisation and the identification of differentiallyexpressed genes A major part of the book will cover various clustering methods and cluster-ing-validation techniques as well as their applications to biological datasets representing anintegrative analysis It should be remarked that not all clustering methods have been utilisedin bioinformatics yet Some of the most recent state-of-the-art clustering methods whichcan deal with specific problems that appear in biological datasets are paid much attentionespecially in how they and their possible successors could be used to enhance the pace ofbiological discoveries in the future Although proposed in the context of bioinformatics some

specialised sophisticated methods can also be used by other researchers to apply them to otheranalogous problems Therefore the general community of researchers in the field of machinelearning are also targeted by most of the contents of this bookThere are books mainly focusing on various aspects related to microarrays such as

biological experimental design image processing identification of differentially expressedgenes supervised classification etc while covering clustering analysis at a less thorough levelIn any case these tend to focus on clustering of microarray datasets rather than considering awider range of biological datasets Yet other books provide more thorough coverage ofclustering than the aforementioned books but they do not provide sufficient background inthe field of molecular biology for researchers coming from numeric backgrounds to appreciateand understand the origins of the available datasets These kinds of book tend to be mainlydata-clustering books and naturally belong to the computational side of bioinformatics ratherthan to the interface between the computational and the biological sides This book does sit atthis interface and goes far beyond those currently available in that it presents some biologicalpreliminaries as well as some state-of-the-art clustering methods that are specifically designedto suit specific issues which appear in biological datasetsAs the field is still developing such a book cannot be definitive or complete This book is

designed to target researchers in bioinformatics ranging from entry level researchers (eg seniorbachelor students and master students) to the most senior researchers (eg heads of researchgroups) It is hoped that graduate students should be able to learn enough basics before studyingjournal papers researchers in related fields should be able to get a broad perspective on whathas been achieved and current researchers in this field should be able to use it as a referenceFurther to the material provided in the book a companion website hosting a selected

collection of software and links to publicly available datasets can be accessed by using thefollowing URL httpscodegooglecompintegrative-cluster-bioinformaticsA work of this magnitude will unfortunately contain errors and omissions We would like to

take this opportunity to apologise unreservedly for all such indiscretions in advance We wouldwelcome comments and corrections please send them by email to aknandiieeeorg or byany other means

BASEL ABU-JAMOUS RUI FA AND ASOKE K NANDI

London UKFeb 2015

xx Preface

List of Symbols

X Data matrixxn The nth data objectS Similarity matrixN Number of data objectsM Number of features or samplesK Number of clustersD Dissimilarity between two input data objectsS Similarity between two input data objectsZ Partition (vector sequence with the length of N)U Partition matrix (with N rows and K columns)ukn Entry in the partition matrix indicating the membership of the nth data object in the

kth clusterC Cluster set containing K clusters C1hellip CKCk The kth cluster containing the indices of those data objects belonging to itck The centroid of the kth clusterκ Kernel functionm The fuzzifierμ The mean vectorΣΣ

The covariance matrixThe within-cluster covariance matrix

T The matrix transpose operatorminus1 The matrix inverse

det The determinant operatorexp The exponential function

The cardinality of a setThe Euclidean norm

Θ The total parameter set

G Number of groupsτ The mixing parameter

The likelihood functionA The adjacency matrixB The incident matrixD The degree matrixL The Laplacian matrixG A graphV The set of verticesE The set of edgesM The modularity matrixQ The modularity valueNV Number of verticesNE Number of edgesθ A smaller parameter set


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored or co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).


Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.

The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as transcriptomics, metabolomics, glycomics and lipidomics, alongside proteomics for the proteome. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology) and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.


    13.2.3 Hybrid c-Means
    13.2.4 Gustafson–Kessel Algorithm
    13.2.5 Gath–Geva Algorithm
    13.2.6 Fuzzy c-Shell
    13.2.7 FANNY
    13.2.8 Other Fuzzy Clustering Algorithms
  13.3 Discussion
  References

14 Neural Network-based Clustering
  14.1 Introduction
  14.2 Algorithms
    14.2.1 SOM
    14.2.2 GLVQ
    14.2.3 Neural-gas
    14.2.4 ART
    14.2.5 OPTOC
    14.2.6 SOON
  14.3 Discussion
  References

15 Mixture Model Clustering
  15.1 Introduction
  15.2 Finite Mixture Models
    15.2.1 Various Mixture Models
    15.2.2 Non-Bayesian Methods
    15.2.3 Bayesian Methods
  15.3 Infinite Mixture Models
    15.3.1 DPM Model
    15.3.2 CRP Mixture Model
    15.3.3 SBP Mixture Model
  15.4 Discussion
  References

16 Graph Clustering
  16.1 Introduction
  16.2 Basic Definitions
    16.2.1 Graph and Adjacency Matrix
    16.2.2 Measures and Metrics
    16.2.3 Similarity Matrices
  16.3 Graph Clustering
    16.3.1 Graph Cut Clustering
    16.3.2 Spectral Clustering
    16.3.3 AP Clustering
    16.3.4 Modularity-based Clustering
    16.3.5 Multilevel Graph Partitioning and Hypergraph Partitioning
    16.3.6 Markov Cluster Algorithm
  16.4 Resources
  16.5 Discussion
  References

17 Consensus Clustering
  17.1 Introduction
  17.2 Overview
  17.3 Consensus Functions
    17.3.1 P–P Comparison
    17.3.2 C–C Comparison
    17.3.3 MIC Voting
    17.3.4 M–M Co-occurrence
  17.4 Discussion
  References

18 Biclustering
  18.1 Introduction
  18.2 Overview
    18.2.1 Statement of the Biclustering Problem
    18.2.2 Types of Biclusters
    18.2.3 Classification of Biclustering
  18.3 Biclustering Methods
    18.3.1 Variance-minimisation Biclustering Methods
    18.3.2 Correlation-maximisation Biclustering Methods
    18.3.3 Two-way Clustering Methods
    18.3.4 Probabilistic and Generative Methods
  18.4 Discussion
  References

19 Clustering Methods Discussion
  19.1 Introduction
  19.2 Hierarchical Clustering
    19.2.1 Yeast Cell Cycle Data
    19.2.2 Breast Cancer
    19.2.3 Diffuse Large B-Cell Lymphoma
  19.3 Fuzzy Clustering
    19.3.1 DNA Motifs Clustering
    19.3.2 Microarray Gene Expression
  19.4 Neural Network-based Clustering
  19.5 Mixture Model-based Clustering
    19.5.1 Examples of Finite Mixture Models
    19.5.2 Examples of Infinite Mixture Models
  19.6 Graph-based Clustering
  19.7 Consensus Clustering
  19.8 Biclustering
  19.9 Summary
  References

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike’s Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn’s Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master's students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed at the following URL: https://code.google.com/p/integrative-cluster-bioinformatics.

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015


Page 14: Thumbnail - download.e-bookshelf.de13 Fuzzy Clustering 167 14 Neural Network-based Clustering 181. ... 2.2 Machine Learning and Data Mining 10 2.2.1 Supervised Learning 10 2.2.2 Unsupervised

1635 Multilevel Graph Partitioning and Hypergraph Partitioning 2421636 Markov Cluster Algorithm 243

164 Resources 243165 Discussion 244

References 245

17 Consensus Clustering 247171 Introduction 247172 Overview 248173 Consensus Functions 249

1731 PndashP Comparison 2491732 CndashC Comparison 2531733 MIC Voting 2541734 MndashM Co-occurrence 257

174 Discussion 261References 262

18 Biclustering 265181 Introduction 265182 Overview 266

1821 Statement of the Biclustering Problem 2661822 Types of Biclusters 2661823 Classification of Biclustering 267

183 Biclustering Methods 2681831 Variance-minimisation Biclustering Methods 2681832 Correlation-maximisation Biclustering Methods 2701833 Two-way Clustering Methods 2731834 Probabilistic and Generative Methods 274

184 Discussion 278References 280

19 Clustering Methods Discussion 283191 Introduction 283192 Hierarchical Clustering 283

1921 Yeast Cell Cycle Data 2841922 Breast Cancer 2841923 Diffuse Large B-Cell Lymphoma 286

193 Fuzzy Clustering 2871931 DNA Motifs Clustering 2871932 Microarray Gene Expression 287

194 Neural Network-based Clustering 289195 Mixture Model-based Clustering 290

1951 Examples of Finite Mixture Models 2911952 Examples of Infinite Mixture Models 292

196 Graph-based Clustering 293

xvContents

197 Consensus Clustering 295198 Biclustering 296199 Summary 297

References 298

Part Five Validation and Visualisation

20 Numerical Validation
  20.1 Introduction
  20.2 External Criteria
    20.2.1 Rand Index
    20.2.2 Adjusted Rand Index
    20.2.3 Jaccard Index
    20.2.4 Normalised Mutual Information
  20.3 Internal Criteria
    20.3.1 Adjusted Figure of Merit
    20.3.2 CLEST
  20.4 Relative Criteria
    20.4.1 Minimum Description Length
    20.4.2 Minimum Message Length
    20.4.3 Bayesian Information Criterion
    20.4.4 Akaike's Information Criterion
    20.4.5 Partition Coefficient
    20.4.6 Partition Entropy
    20.4.7 Fukuyama–Sugeno Index
    20.4.8 Xie–Beni Index
    20.4.9 Calinski–Harabasz Index
    20.4.10 Dunn's Index
    20.4.11 Davies–Bouldin Index
    20.4.12 I Index
    20.4.13 Silhouette
    20.4.14 Object-based Validation
    20.4.15 Geometrical Index
    20.4.16 Validity Index
    20.4.17 Generalised Parametric Validity
  20.5 Discussion and Summary
  References

21 Biological Validation
  21.1 Introduction
  21.2 GO Analysis
    21.2.1 GO Term Enrichment
  21.3 Upstream Sequence Analysis
  21.4 Gene-network Analysis
  21.5 Discussion and Summary
  References

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, receive particular attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics, rather than to the interface between the computational and the biological sides. This book does sit at this interface, and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015


List of Symbols

X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
^T       The matrix transpose operator
^−1      The matrix inverse
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
ℒ        The likelihood function
A        The adjacency matrix
B        The incidence matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology, and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, which is funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).


Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system, or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book's focus is bioinformatics. The motive behind this field's emergence is the ever-expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s, when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.

The -ome suffix was not kept exclusive for the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, respectively, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment, such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.


12 The ldquoOmicsrdquo Era

A new suffix has been introduced to the English language in this era of high-throughput dataexpansion that is ldquo-omicsrdquo and its relatives ldquo-omerdquo and ldquo-omicrdquo This started in the 1930swhen the entire set of genes carried by a chromosome was called the genome blending thewords ldquogenerdquo and ldquochromosomerdquo (OED 2014) Consequently the analysis of the entiregenome was called genomics and many known research journals carried the term ldquogenomerdquoor ldquogenomicsrdquo in their titles such as Genomics Genome Research Genome Biology BMCGenomics Genome Medicine the Journal of Genetics and Genomics (JGG) and othersThe -ome suffix was not kept exclusive for the genome it has been rather generalised to

indicate the complete set of some type of molecule or object The proteome is the completeset of proteins in a cell tissue or organism Similarly are the transcriptome metabolome gly-come and lipidome for the complete sets of transcripts metabolites glycans (carbohydrates)and lipids In a respective order large-scale studies of those complete sets are known asproteomics transcriptomics metabolomics glycomics and lipidomics The -ome suffix wasfurther generalised to include the complete sets of objects other than basic molecules Forexample the microbiome is the complete set of microorganisms (eg bacteria microscopicfungi etc) in a given environment such as a building a sample of soil or the human gut(Kembel et al 2014) More omic fields have also emerged such as agrigenomics (the appli-cation of genomics in agriculture) pharmacogenomics and pharmacoproteomics (the applica-tion of genomics and proteomics to pharmacology) and othersAll of those biological fields of omics involve high-throughput datasets which are subject to

information engineering involvements and therefore reside at the core focus of bioinformaticresearch An even higher level of omics analysis involves integrative analysis of many types ofomic datasets OMICS a Journal of Integrative Biology is a journal which targets researchstudies that consider such collective analysis at different levels from single cells to societiesMore types of high-throughput omic datasets are expected to emerge The role of bioinfor-

matics as an interdisciplinary field will be more important This is not only because each ofthose omic datasets is massive in size when considered individually it is also because ofthe size of information hidden in the relations between those generally heterogeneous datasetswhich requires more sophisticated computational methods to analyse

13 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods techniques and toolswhich target storage retrieval organisation analysis and presentation of high-throughputbiological data

4 Integrative Cluster Analysis in Bioinformatics

22 Visualisations and Presentations
  22.1 Introduction
  22.2 Methods and Examples
    22.2.1 Profile Patterns
    22.2.2 Bar Chart
    22.2.3 Error Bar
    22.2.4 Pie Chart
    22.2.5 Box Plot
    22.2.6 Histogram
    22.2.7 Scatter Plot
    22.2.8 Venn Diagram
    22.2.9 Tree View
    22.2.10 Heat Map
    22.2.11 Network Graph
    22.2.12 Low-dimension Display
    22.2.13 Receiver Operating Characteristics Curves
    22.2.14 Kaplan–Meier Plot
    22.2.15 Block Diagram
  22.3 Summary
  References

Part Six New Clustering Frameworks Designed for Bioinformatics

23 Splitting-Merging Awareness Tactics (SMART)
  23.1 Introduction
  23.2 Related Work
    23.2.1 SSCL
    23.2.2 SSMCL
    23.2.3 ULFMM
    23.2.4 VBGM
    23.2.5 PFClust
    23.2.6 DBSCAN
  23.3 SMART Framework
  23.4 Implementations
    23.4.1 SMART-CL
    23.4.2 SMART-FMM
    23.4.3 SMART-MFA
  23.5 Enhanced SMART
  23.6 Examples
  23.7 Discussion
  References

24 Tightness-tunable Clustering (UNCLES)
  24.1 Introduction
  24.2 Bi-CoPaM Method
    24.2.1 Partition Generation
    24.2.2 Relabelling
    24.2.3 Fuzzy Consensus Partition Matrix Generation
    24.2.4 Binarisation
  24.3 UNCLES Method - Other Types of External Specifications
  24.4 M–N Scatter Plots Technique
  24.5 Parameter-free UNCLES with M–N Plots
  24.6 Discussion and Summary
  References

Appendix

Index

Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore, both aspects – the design and the application of clustering methods in this field – have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part introduces those readers to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book covers various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be applied by other researchers to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of book tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry-level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will, unfortunately, contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI

London, UK
Feb 2015


List of Symbols

X      Data matrix
x_n    The nth data object
S      Similarity matrix
N      Number of data objects
M      Number of features or samples
K      Number of clusters
D      Dissimilarity between two input data objects
S      Similarity between two input data objects
Z      Partition (vector sequence with the length of N)
U      Partition matrix (with N rows and K columns)
u_kn   Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C      Cluster set containing K clusters C_1, …, C_K
C_k    The kth cluster, containing the indices of those data objects belonging to it
c_k    The centroid of the kth cluster
κ      Kernel function
m      The fuzzifier
μ      The mean vector
Σ      The covariance matrix
Σ_W    The within-cluster covariance matrix
T      The matrix transpose operator
−1     The matrix inverse
det    The determinant operator
exp    The exponential function
|·|    The cardinality of a set
‖·‖    The Euclidean norm
Θ      The total parameter set
G      Number of groups
τ      The mixing parameter
ℒ      The likelihood function
A      The adjacency matrix
B      The incidence matrix
D      The degree matrix
L      The Laplacian matrix
G      A graph
V      The set of vertices
E      The set of edges
M      The modularity matrix
Q      The modularity value
N_V    Number of vertices
N_E    Number of edges
θ      A smaller parameter set
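To make the clustering notation above concrete, the following is a minimal sketch in Python of how X, Z, U, C and c_k relate to one another for a crisp (non-fuzzy) partition. The data values, the 0-based labels and the helper names (partition_matrix, clusters, centroid) are illustrative assumptions and do not come from the book.

```python
# Toy illustration of the clustering notation in the List of Symbols.
# All numbers and function names are hypothetical, chosen for illustration.

# X: data matrix with N data objects (rows) and M features (columns).
X = [
    [1.0, 2.0],
    [1.2, 1.8],
    [8.0, 9.0],
    [7.8, 9.4],
]
N = len(X)       # number of data objects
M = len(X[0])    # number of features
K = 2            # number of clusters

# Z: partition as a length-N sequence of cluster labels (0-based here).
Z = [0, 0, 1, 1]


def partition_matrix(Z, K):
    """Build the crisp partition matrix U (N rows, K columns) from Z."""
    return [[1 if z == k else 0 for k in range(K)] for z in Z]


def clusters(Z, K):
    """Return C = [C_1, ..., C_K]; each C_k holds the indices of its members."""
    return [[n for n, z in enumerate(Z) if z == k] for k in range(K)]


def centroid(X, members):
    """Mean vector of the data objects whose indices are in `members`."""
    n_features = len(X[0])
    return [sum(X[n][m] for n in members) / len(members) for m in range(n_features)]


U = partition_matrix(Z, K)           # U[n][k] is the membership u_kn
C = clusters(Z, K)                   # cluster set
c = [centroid(X, C_k) for C_k in C]  # one centroid c_k per cluster
```

In this sketch every row of U sums to one because each data object belongs to exactly one cluster; a fuzzy partition matrix would instead hold fractional memberships in each row.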


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions in the University of York and the University of Leeds, working in radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.

In 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).


Part One Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics” and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED 2014). Consequently, the analysis of the entire genome was called genomics, and many well-known research journals carry the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.

The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, respectively, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al. 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology) and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the amount of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.


Part One: Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is, “-omics”, and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.

The -ome suffix was not kept exclusive for the genome; it has rather been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids. In the same order, large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology) and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge, and the role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.



Preface

Bioinformatics is a new and interdisciplinary field which is concerned with the development of methods for recording, organising and analysing biological data. With the advent of human genome-sequencing, microarrays and high-performance computing, developing software tools to generate useful biological knowledge has been a major activity.

Clustering techniques have been increasingly used in the analysis of high-throughput biological datasets. Although many generic clustering methods have been used successfully to analyse biological datasets, many specific properties of those datasets require customised methods which are specifically designed to meet such properties. Therefore both aspects, the design and the application of clustering methods in this field, have been under active investigation and consideration by many research groups around the world. Indeed, there have been many activities in assimilating the work in this field, especially by designing state-of-the-art methods which uniquely address various issues that are particularly relevant in biological data.

This book attempts to outline the complete pathway from the basics of molecular biology to the generation of biological knowledge. It is supplied with an introductory part to molecular biology at a level which can be understood by researchers coming from a numeric background, such as computer scientists and information engineers. The introductory part helps those readers to get introduced to the basic biological knowledge needed to appreciate the specific applications of the methods in this book. The book also explains the structure and properties of many types of high-throughput datasets commonly found in biological studies, including public repositories/databases, pre-processing like normalisation, and the identification of differentially expressed genes. A major part of the book will cover various clustering methods and clustering-validation techniques, as well as their applications to biological datasets, representing an integrative analysis. It should be remarked that not all clustering methods have been utilised in bioinformatics yet. Some of the most recent state-of-the-art clustering methods, which can deal with specific problems that appear in biological datasets, are paid much attention, especially in how they and their possible successors could be used to enhance the pace of biological discoveries in the future. Although proposed in the context of bioinformatics, some specialised sophisticated methods can also be used by other researchers to apply them to other analogous problems. Therefore, the general community of researchers in the field of machine learning is also targeted by most of the contents of this book.

There are books mainly focusing on various aspects related to microarrays, such as biological experimental design, image processing, identification of differentially expressed genes, supervised classification, etc., while covering clustering analysis at a less thorough level. In any case, these tend to focus on clustering of microarray datasets rather than considering a wider range of biological datasets. Yet other books provide more thorough coverage of clustering than the aforementioned books, but they do not provide sufficient background in the field of molecular biology for researchers coming from numeric backgrounds to appreciate and understand the origins of the available datasets. These kinds of books tend to be mainly data-clustering books and naturally belong to the computational side of bioinformatics rather than to the interface between the computational and the biological sides. This book does sit at this interface and goes far beyond those currently available, in that it presents some biological preliminaries as well as some state-of-the-art clustering methods that are specifically designed to suit specific issues which appear in biological datasets.

As the field is still developing, such a book cannot be definitive or complete. This book is designed to target researchers in bioinformatics, ranging from entry level researchers (e.g. senior bachelor students and master students) to the most senior researchers (e.g. heads of research groups). It is hoped that graduate students should be able to learn enough basics before studying journal papers, researchers in related fields should be able to get a broad perspective on what has been achieved, and current researchers in this field should be able to use it as a reference.

Further to the material provided in the book, a companion website hosting a selected collection of software and links to publicly available datasets can be accessed by using the following URL: https://code.google.com/p/integrative-cluster-bioinformatics

A work of this magnitude will unfortunately contain errors and omissions. We would like to take this opportunity to apologise unreservedly for all such indiscretions in advance. We would welcome comments and corrections; please send them by email to a.k.nandi@ieee.org or by any other means.

BASEL ABU-JAMOUS, RUI FA AND ASOKE K. NANDI
London, UK
Feb 2015


List of Symbols

X        Data matrix
x_n      The nth data object
S        Similarity matrix
N        Number of data objects
M        Number of features or samples
K        Number of clusters
D        Dissimilarity between two input data objects
S        Similarity between two input data objects
Z        Partition (vector sequence with the length of N)
U        Partition matrix (with N rows and K columns)
u_kn     Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C        Cluster set containing K clusters C_1, …, C_K
C_k      The kth cluster, containing the indices of those data objects belonging to it
c_k      The centroid of the kth cluster
κ        Kernel function
m        The fuzzifier
μ        The mean vector
Σ        The covariance matrix
Σ        The within-cluster covariance matrix
T        The matrix transpose operator
−1       The matrix inverse
det      The determinant operator
exp      The exponential function
|·|      The cardinality of a set
‖·‖      The Euclidean norm
Θ        The total parameter set
G        Number of groups
τ        The mixing parameter
         The likelihood function
A        The adjacency matrix
B        The incidence matrix
D        The degree matrix
L        The Laplacian matrix
G        A graph
V        The set of vertices
E        The set of edges
M        The modularity matrix
Q        The modularity value
N_V      Number of vertices
N_E      Number of edges
θ        A smaller parameter set
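As a reading aid for the notation above, the following minimal Python sketch shows how X, Z, U, C and the centroids c_k could relate to one another for a crisp (non-fuzzy) partition. The toy data, the zero-based indexing, the variable names and the use of the arithmetic mean as the centroid are all assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy data matrix X: N = 4 data objects (rows), M = 2 features (columns).
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0],
              [6.0, 5.0]])
N, M = X.shape
K = 2  # number of clusters (chosen by hand for this sketch)

# Partition vector Z of length N: Z[n] is the index of the cluster holding object n.
Z = np.array([0, 0, 1, 1])

# Partition matrix U with N rows and K columns: entry [n, k] is the
# membership of the nth object in the kth cluster (0 or 1 when crisp).
U = np.zeros((N, K))
U[np.arange(N), Z] = 1.0

# Cluster set C: C[k] contains the indices of the objects belonging to cluster k.
C = [np.flatnonzero(Z == k) for k in range(K)]

# Centroid c_k, taken here (as an assumption) to be the mean of the members of cluster k.
centroids = (U.T @ X) / U.sum(axis=0)[:, None]

print(U)
print([c.tolist() for c in C])
print(centroids)
```

In a fuzzy setting the rows of U would hold fractional memberships instead of 0 and 1; the sketch deliberately stays with the crisp case.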


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011 and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he was involved in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition) and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W+, W− and Z0, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK) and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).







Part OneIntroduction

1Introduction to Bioinformatics

11 Introduction

Interesting research fields emerge through the collaboration of researchers from differentsometimes distant disciplines Examples include biochemistry biophysics quantum informa-tion science systems engineering mechatronics business information systems managementinformation systems geophysics biomedical engineering cybernetics art history media tech-nology and others This marriage between disciplines yields findings which blend the views ofdifferent areas over the same subject or set of dataThe stimuli leading to such collaborations are numerous For example one discipline may

develop tools that generate types of data that require another discipline to analyse In othercases one field scratches a layer of unknowns to discover that significant parts of its scopeare actually based on the principles of another field such as the low-level biological studiesof the chemical interactions in the cells which delivered biochemistry as an interdisciplinaryfield Other interdisciplinary fields emerged because of their complementary involvement inbuilding different parts of the same target system or in understanding different sides of the sameresearch question for example mechatronics engineering aims at building systems which haveboth mechanical and electronic parts such as all modern automobiles Interdisciplinary areaslike business information systems and management information systems have emerged due tothe high demand for information systems which target business and management aspectsalthough generic information systems would meet many of those requirements a customisedfield focusing on such applications is indeed more efficient given such high demandThe interdisciplinary field of this bookrsquos focus is bioinformatics The motive behind this

fieldrsquos emergence is the increasingly expanding generation of massive raw biological data fol-lowing the developments in high-throughput techniques in the last couple of decades The scaleof this high-throughput data is orders of magnitude higher than what can be efficiently analysed

Integrative Cluster Analysis in Bioinformatics First Edition Basel Abu-Jamous Rui Fa and Asoke K Nandicopy 2015 John Wiley amp Sons Ltd Published 2015 by John Wiley amp Sons Ltd

in a manual fashion Consequently information engineers were recruited in order to contributeto data analysis by employing their computational methods Cycles of computational analysissharing of results interdisciplinary discussions and abstractions have led and are still leadingto many key discoveries in biology and medicine This success has attracted many informationengineers towards biology and many biologists towards information engineering to meet ina potentially rich intersection area which itself has grown in size to establish the field ofbioinformatics

12 The ldquoOmicsrdquo Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics” and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG), and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology), and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: a Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will be more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.


List of Symbols

X    Data matrix
x_n    The nth data object
S    Similarity matrix
N    Number of data objects
M    Number of features or samples
K    Number of clusters
D    Dissimilarity between two input data objects
S    Similarity between two input data objects
Z    Partition (vector sequence with the length of N)
U    Partition matrix (with N rows and K columns)
u_kn    Entry in the partition matrix indicating the membership of the nth data object in the kth cluster
C    Cluster set containing K clusters C_1, …, C_K
C_k    The kth cluster, containing the indices of those data objects belonging to it
c_k    The centroid of the kth cluster
κ    Kernel function
m    The fuzzifier
μ    The mean vector
Σ    The covariance matrix
Σ    The within-cluster covariance matrix
T    The matrix transpose operator
−1    The matrix inverse
det    The determinant operator
exp    The exponential function
|·|    The cardinality of a set
‖·‖    The Euclidean norm
Θ    The total parameter set
G    Number of groups
τ    The mixing parameter
ℒ    The likelihood function
A    The adjacency matrix
B    The incidence matrix
D    The degree matrix
L    The Laplacian matrix
G    A graph
V    The set of vertices
E    The set of edges
M    The modularity matrix
Q    The modularity value
N_V    Number of vertices
N_E    Number of edges
θ    A smaller parameter set
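To make the clustering notation above concrete, the following is a minimal sketch in plain Python that builds a toy data matrix X, a partition Z, the corresponding partition matrix U, the cluster set C and the centroids c_k. The data values, the zero-based indexing, the variable name `centroids` and the choice of the feature-wise mean as the centroid are assumptions made for illustration only; they are not taken from the book.

```python
# Toy illustration of the notation in the List of Symbols.
# All numeric values are invented for illustration only.

# X: data matrix with N data objects (rows) and M features (columns).
X = [
    [1.0, 2.0],
    [1.2, 1.8],
    [8.0, 9.0],
    [7.8, 9.4],
    [8.2, 8.6],
]
N = len(X)      # number of data objects
M = len(X[0])   # number of features
K = 2           # number of clusters

# Z: partition as a length-N sequence; Z[n] is the (zero-based) cluster
# index assigned to the nth data object.
Z = [0, 0, 1, 1, 1]

# U: partition matrix with N rows and K columns, as stated in the list.
# U[n][k] plays the role of u_kn; in a crisp partition each row holds
# exactly one 1.
U = [[1 if Z[n] == k else 0 for k in range(K)] for n in range(N)]

# C: cluster set; C[k] holds the indices of the objects in the kth cluster.
C = [[n for n in range(N) if Z[n] == k] for k in range(K)]

# c_k: centroid of the kth cluster, assumed here to be the feature-wise mean.
centroids = [
    [sum(X[n][j] for n in C[k]) / len(C[k]) for j in range(M)]
    for k in range(K)
]

print(U)
print(C)
print(centroids)
```

In a fuzzy partition the entries of U would be membership degrees between 0 and 1 rather than 0/1 indicators; the crisp case is used here only because it is the simplest instance of the notation.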


About the Authors

Basel Abu-Jamous received his BSc degree in computer engineering from the University of Jordan, Amman, Jordan, in 2010. He received his MSc degree in information and intelligence engineering from the University of Liverpool, Liverpool, UK, in 2011. He was awarded the Sir Robin Saxby Prize for the 2010/2011 academic year based on his performance in his MSc degree. Mr Abu-Jamous started his PhD degree in electrical engineering and electronics at the University of Liverpool in 2011, and then transferred with his supervisor, Professor Asoke K. Nandi, to complete his PhD degree at Brunel University London, UK. He was awarded a prize in the posters section of the 6th Annual Student Research Conference 2013 at Brunel University London. Currently, he is a research assistant at the Department of Electronic and Computer Engineering at Brunel University London, a position to which he was appointed in January 2015. Mr Abu-Jamous has authored or co-authored several journal and international conference papers. His research interests include bioinformatics, computational biology, and the broader areas of information engineering and machine learning.

Dr Rui Fa received his PhD degree in electrical engineering and information systems from the University of Newcastle, UK, in 2007. From January 2008 to September 2010, he held research positions at the University of York and the University of Leeds, working on radar signal processing and wireless communication projects. From October 2010, Dr Fa extended his research fields to bioinformatics and computational biology and joined the University of Liverpool, where he took part in a collaborative research project with the Universities of Oxford, Cambridge and Bristol, funded by the National Institute for Health Research (NIHR). Dr Fa joined Brunel University London as a senior research fellow in 2013. His current research interests include bioinformatics and computational biology, systems biology, machine learning, Bayesian statistics, statistical signal processing and network science. Dr Fa has authored and co-authored more than 60 peer-reviewed journal and conference papers.

Professor Asoke K. Nandi joined Brunel University London in April 2013 as the Head of Electronic and Computer Engineering. He received a PhD from the University of Cambridge, and since then has worked in many institutions, including CERN, Geneva, Switzerland; University of Oxford, Oxford, UK; Imperial College, London, UK; University of Strathclyde, Glasgow, UK; and University of Liverpool, Liverpool, UK. His research spans many different topics, including bioinformatics, communications, machine learning (feature selection, feature generation, classification, clustering and pattern recognition), and signal processing.

In 1983, Professor Nandi was a member of the UA1 team at CERN that discovered the three fundamental particles known as W⁺, W⁻ and Z⁰, providing the evidence for the unification of the electromagnetic and weak forces, which was recognized by the Nobel Committee for Physics in 1984. He has been honoured with the Fellowships of the Royal Academy of Engineering (UK) and the Institute of Electrical and Electronics Engineers (USA). He is a Fellow of five other professional institutions, including the Institute of Physics (UK), the Institute of Mathematics and its Applications (UK), the British Computer Society (UK), the Institution of Mechanical Engineers (UK), and the Institution of Engineering and Technology (UK). His publications have been cited more than 16 000 times and his h-index is 60 (Google Scholar).


Part One: Introduction

1 Introduction to Bioinformatics

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology, and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering to meet in a potentially rich intersection area, which itself has grown in size to establish the field of bioinformatics.

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.




and since then has worked in many institutions including CERN Geneva SwitzerlandUniversity of Oxford Oxford UK Imperial College London UK University of StrathclydeGlasgow UK and University of Liverpool Liverpool UK His research spans many differenttopics including bioinformatics communications machine learning (feature selection featuregeneration classification clustering and pattern recognition) and signal processingIn 1983 Professor Nandi was a member of the UA1 team at CERN that discovered the three

fundamental particles known as W+ Wminus and Z0 providing the evidence for the unificationof the electromagnetic and weak forces which was recognized by the Nobel Committee forPhysics in 1984 He has been honoured with the Fellowships of the Royal Academy ofEngineering (UK) and the Institute of Electrical and Electronics Engineers (USA) He is aFellow of five other professional institutions including the Institute of Physics (UK) theInstitute of Mathematics and its Applications (UK) the British Computer Society (UK) theInstitution of Mechanical Engineers (UK) and the Institution of Engineering and Technology(UK) His publications have been cited more than 16 000 times and his h-index is 60 (GoogleScholar)

xxiv About the Authors

Part One
Introduction

1 Introduction to Bioinformatics

Integrative Cluster Analysis in Bioinformatics, First Edition. Basel Abu-Jamous, Rui Fa and Asoke K. Nandi. © 2015 John Wiley & Sons, Ltd. Published 2015 by John Wiley & Sons, Ltd.

1.1 Introduction

Interesting research fields emerge through the collaboration of researchers from different, sometimes distant, disciplines. Examples include biochemistry, biophysics, quantum information science, systems engineering, mechatronics, business information systems, management information systems, geophysics, biomedical engineering, cybernetics, art history, media technology and others. This marriage between disciplines yields findings which blend the views of different areas over the same subject or set of data.

The stimuli leading to such collaborations are numerous. For example, one discipline may develop tools that generate types of data that require another discipline to analyse. In other cases, one field scratches a layer of unknowns to discover that significant parts of its scope are actually based on the principles of another field, such as the low-level biological studies of the chemical interactions in the cells, which delivered biochemistry as an interdisciplinary field. Other interdisciplinary fields emerged because of their complementary involvement in building different parts of the same target system or in understanding different sides of the same research question; for example, mechatronics engineering aims at building systems which have both mechanical and electronic parts, such as all modern automobiles. Interdisciplinary areas like business information systems and management information systems have emerged due to the high demand for information systems which target business and management aspects; although generic information systems would meet many of those requirements, a customised field focusing on such applications is indeed more efficient given such high demand.

The interdisciplinary field of this book’s focus is bioinformatics. The motive behind this field’s emergence is the increasingly expanding generation of massive raw biological data following the developments in high-throughput techniques in the last couple of decades. The scale of this high-throughput data is orders of magnitude higher than what can be efficiently analysed in a manual fashion. Consequently, information engineers were recruited in order to contribute to data analysis by employing their computational methods. Cycles of computational analysis, sharing of results, interdisciplinary discussions and abstractions have led, and are still leading, to many key discoveries in biology and medicine. This success has attracted many information engineers towards biology and many biologists towards information engineering, to meet in a potentially rich intersection area which itself has grown in size to establish the field of bioinformatics.

1.2 The “Omics” Era

A new suffix has been introduced to the English language in this era of high-throughput data expansion, that is “-omics” and its relatives “-ome” and “-omic”. This started in the 1930s when the entire set of genes carried by a chromosome was called the genome, blending the words “gene” and “chromosome” (OED, 2014). Consequently, the analysis of the entire genome was called genomics, and many known research journals carried the term “genome” or “genomics” in their titles, such as Genomics, Genome Research, Genome Biology, BMC Genomics, Genome Medicine, the Journal of Genetics and Genomics (JGG) and others.

The -ome suffix was not kept exclusive to the genome; rather, it has been generalised to indicate the complete set of some type of molecule or object. The proteome is the complete set of proteins in a cell, tissue or organism. Similarly, the transcriptome, metabolome, glycome and lipidome are the complete sets of transcripts, metabolites, glycans (carbohydrates) and lipids, respectively. Large-scale studies of those complete sets are known, in the same order, as proteomics, transcriptomics, metabolomics, glycomics and lipidomics. The -ome suffix was further generalised to include the complete sets of objects other than basic molecules. For example, the microbiome is the complete set of microorganisms (e.g. bacteria, microscopic fungi, etc.) in a given environment such as a building, a sample of soil or the human gut (Kembel et al., 2014). More omic fields have also emerged, such as agrigenomics (the application of genomics in agriculture), pharmacogenomics and pharmacoproteomics (the application of genomics and proteomics to pharmacology) and others.

All of those biological fields of omics involve high-throughput datasets which are subject to information engineering involvements, and therefore reside at the core focus of bioinformatic research. An even higher level of omics analysis involves integrative analysis of many types of omic datasets. OMICS: A Journal of Integrative Biology is a journal which targets research studies that consider such collective analysis at different levels, from single cells to societies.

More types of high-throughput omic datasets are expected to emerge. The role of bioinformatics as an interdisciplinary field will become more important. This is not only because each of those omic datasets is massive in size when considered individually; it is also because of the size of information hidden in the relations between those generally heterogeneous datasets, which requires more sophisticated computational methods to analyse.

1.3 The Scope of Bioinformatics

The scope of bioinformatics includes the development of methods, techniques and tools which target storage, retrieval, organisation, analysis and presentation of high-throughput biological data.

