
Studies in Big Data 6

Astronomy and Big Data

Kieran Jay Edwards · Mohamed Medhat Gaber

A Data Clustering Approach to Identifying Uncertain Galaxy Morphology


Studies in Big Data

Volume 6

Series editor

Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]

For further volumes:

http://www.springer.com/series/11970


About this Series

The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data, quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.


Kieran Jay Edwards · Mohamed Medhat Gaber

Astronomy and Big Data

A Data Clustering Approach to Identifying Uncertain Galaxy Morphology



Kieran Jay Edwards
University of Portsmouth
School of Computing
Hampshire
United Kingdom

Mohamed Medhat Gaber
Robert Gordon University
School of Computing Science and Digital Media
Aberdeen
United Kingdom

ISSN 2197-6503          ISSN 2197-6511 (electronic)
ISBN 978-3-319-06598-4  ISBN 978-3-319-06599-1 (eBook)
DOI 10.1007/978-3-319-06599-1
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014937454

© Springer International Publishing Switzerland 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Dedicated to the memory of my father,
Rodney James Edwards

- Kieran Jay Edwards


Preface

From the shifting of our beloved earth away from the centre of the universe, to discovering the billions of galaxies and stars that make up our intergalactic neighbourhood, Astronomy continues to surprise and astound us. For a long time, empirical results were the only means of describing natural phenomena. The age of enlightenment witnessed the change to experimental generalisation and modelling. With the advent of computers, computational science was the natural progression, where simulation results can efficiently support or discredit scientific hypotheses. In a visionary talk given just before he went missing in 2007, the prominent Microsoft researcher Jim Gray described, in an almost poetic manner, how data exploration has been established as the fourth paradigm in scientific research.

The research reported in this book is conducted in the realm of this fourth paradigm. The billions of galaxies and stars that sky surveys collect every day using state-of-the-art telescopes have overwhelmed astronomers and cosmologists. Inevitably, computational physics has fallen short of addressing the deluge of data, and new tools are needed. The term Big Data has recently been coined to describe such large volumes of data that arrive at high velocity. An unusual idea has found its way to prominent astronomers: to involve the public in some of the tasks that require manual inspection of a huge number of images. Galaxy Zoo has been a pioneering project that sought the help of the public in classifying galaxies into their two main morphological categories, spiral and elliptical, as classified by Edwin Hubble some eighty years ago. This categorisation is of paramount importance to physicists, astronomers and cosmologists in their quest to find the theory of everything.

In this book, we report on how we used data mining, more specifically clustering, to identify galaxies for which the public has shown some degree of uncertainty as to whether they belong to one morphology type or another. The research shows the importance of transition between different data mining techniques in an insightful workflow. Clustering enabled us to identify discriminating features in the analysed data sets, adopting our novel feature selection approach, namely Incremental Feature Selection (IFS). We then used state-of-the-art classification techniques, Random Forests and Support Vector Machines, to validate the acquired results.


The research reported in this book evidences that data mining is both science and art. It is important to design an insightful workflow based on intermediate results. Thus, such a workflow is interactive and adaptable.

We hope the readers find this book enjoyable and beneficial for their future research, and for our quest, as mankind, towards the scientific truth.

Portsmouth, United Kingdom    Kieran Jay Edwards
Aberdeen, United Kingdom      Mohamed Medhat Gaber

March 2014


Acknowledgements

The authors are thankful to the academic and research staff at the Institute of Cosmology and Gravitation of the University of Portsmouth for the fruitful discussion on the results of the research reported in this book. It is also worth acknowledging all the members of the Galaxy Zoo project who have made the data used here publicly available. Thanks are also due to our families for their continuous support and love.

Kieran is deeply grateful to his mother, Rosita Edwards, for the incredible love and support that she has provided and for never losing faith in him. He also acknowledges the love and support of his extended family, including Gilbert Kwa, Shelly Kwa, Karen Poh and Tom Hoyle.

Mohamed acknowledges the support of his family for bearing with him the long time committed to his research work, including what is reported in this book. Many thanks are due to his parents, Dr. Medhat Gaber and Mrs. Mervat Fathy; his wife, Dr. Nesreen Hassaan; and his children, Abdul-Rahman (Boudy) and Mariam.


Contents

1 Introduction
   1.1 Background
   1.2 Aims and Objectives
   1.3 Book Organisation

2 Astronomy, Galaxies and Stars: An Overview
   2.1 Why Astronomy?
   2.2 Galaxies and Stars
       2.2.1 Galaxy Morphology
   2.3 The Big Bang Theory
   2.4 Summary

3 Astronomical Data Mining
   3.1 Data Mining: Definition
       3.1.1 Applications and Challenges
   3.2 Galaxy Zoo: Citizen Science
   3.3 Galaxy Zoo/SDSS Data
   3.4 Data Pre-processing and Attribute Selection
   3.5 Applied Techniques/Tasks
   3.6 Summary and Discussion

4 Adopted Data Mining Methods
   4.1 CRoss-Industry Standard Process for Data Mining (CRISP-DM)
   4.2 K-Means Algorithm
   4.3 Support Vector Machines
       4.3.1 Sequential Minimal Optimisation
   4.4 Random Forests
   4.5 Incremental Feature Selection (IFS) Algorithm
   4.6 Pre- and Post-processing
       4.6.1 Pre-processing
       4.6.2 Post-processing
   4.7 Summary

5 Research Methodology
   5.1 Galaxy Zoo Table 2
   5.2 Data Mining the Galaxy Zoo Mergers
   5.3 Extensive SDSS Data Analysis
       5.3.1 Isolating and Re-Clustering Galaxies Labelled as Uncertain
       5.3.2 Extended Experimentation

6 Development of Data Mining Models
   6.1 Waikato Environment for Knowledge Analysis (WEKA)
       6.1.1 WEKA Implementations
       6.1.2 Initial Experimentation on Galaxy Zoo Table 2 Data Set
       6.1.3 Experiments with Data Mining the Galaxy Zoo Mergers Attributes
       6.1.4 Further Experimentation on the SDSS Data
       6.1.5 Uncertain Galaxy Re-Labelling and Re-Clustering
       6.1.6 Random Forest and SMO Experimentation
   6.2 R Language and RStudio
       6.2.1 RStudio Implementation
   6.3 MySQL Database Queries
   6.4 Development of Knowledge-Flow Models
   6.5 Summary

7 Experimentation Results
   7.1 Galaxy Zoo Table 2 Clustering Results
   7.2 Clustering Results of Lowest DBI Attributes
   7.3 Extensive SDSS Analysis Results
   7.4 Results of Uncertain Galaxy Re-Labelling and Re-Clustering
   7.5 Results of Further Experimentation
   7.6 Summary

8 Conclusion and Future Work
   8.1 Conclusion
       8.1.1 Experimental Remarks
   8.2 Future Work and Big Data
       8.2.1 Analysis of Data Storage Representation
       8.2.2 Output Storage Representation
       8.2.3 Data Mining and Storage Workflow
       8.2.4 Development and Adoption of Data Mining Techniques
       8.2.5 Providing Astronomers with Insights
   8.3 Final Words

References

Index


Chapter 1
Introduction

“I wanted to point out that almost everything about science is changing because of the impact of information technology. Experimental, theoretical, and computational science are all being affected by the data deluge, and a fourth, data-intensive science paradigm is emerging.” by Jim Gray (1944 - 2007)

The fourth paradigm [19], as it is now referred to, describes the emergence of data mining within the various scientific disciplines, including that of astronomy. The Sloan Digital Sky Survey (SDSS) alone possesses, at present, over 1,000,000 galaxies, 30,000 stars and 100,000 quasars collated into several data sets [141]. With such copious amounts of data constantly being acquired from various astronomical surveys, it has become imperative that an automated model for processing this data be developed, so as to be able to generate useful information. The goal of this approach is to produce an outcome that will result in effective human learning. It is the process of characterising the known, assigning the new and discovering the unknown in such a data-intensive discipline that encompasses what astronomical data mining is all about [28]. Big Data is the recently coined term describing technologies that deal with large volumes of data arriving at high speed. This is the typical description of what our state-of-the-art telescopes are capturing every day, from stars and galaxies to black holes and dark matter.

1.1 Background

Data mining is defined as the process of discovering non-trivial, interesting patterns and models from large data repositories. More recently, an important development in the scientific arena based on crowd sourcing, known as citizen science, has surfaced. This provides users with an interface through which they can interact with scientific data repositories, facilitating data labelling/tagging for use by scientists. We argue that the most successful example in this area is the Galaxy Zoo project, where a large collection of galaxy images is annotated by citizen scientists (non-professional users). Figure 1.1 shows the front page of the Galaxy Zoo website.


Fig. 1.1 Galaxy Zoo Website: galaxyzoo.org

A broad and widely adopted classification of data mining techniques categorises them into two groups, supervised and unsupervised. The aim of supervised techniques is to predict the value of unknown measurements/features based on the knowledge of a set of other known features. Classification techniques lie at the heart of the supervised learning category, where such unknown measurements to be predicted are categorical in nature. On the other hand, unsupervised techniques provide a descriptive representation of the data. Clustering and association rule mining are the dominating approaches in this category.

Various classification techniques such as Naïve Bayes, as demonstrated by Henrion et al. [75] and Kamar et al. [87], C4.5, as exemplified by Calleja and Fuentes [37], Gauci et al. [65] and Vasconcellos et al. [149], and Artificial Neural Networks (ANN), as shown by Banerji et al. [22], appear to be the more popular choices of methods when processing astronomical data. However, it is observed that clustering is far less used in this field. Research carried out by Baehr et al. [17], involving calculating the Davies-Bouldin Index (DBI) values of the various attributes to determine the best combination for identifying correlations between morphological attributes and user-selected morphological classes, motivated the direction of this project.
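Since the Davies-Bouldin Index recurs throughout this book, the following minimal sketch shows how a DBI value can be computed for a clustering. It uses scikit-learn in Python rather than the WEKA tooling adopted later in the book, and the synthetic data is purely illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Toy stand-in for a table of per-galaxy morphological attributes:
# two well-separated Gaussian blobs in a 4-dimensional feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(4, 1, (100, 4))])

# Cluster with K-Means, then score the partition with the DBI;
# lower values indicate compact, well-separated clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))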

1.2 Aims and Objectives

Most of the galaxies in the Galaxy Zoo project are labelled as Uncertain. This is partially due to the 80% voting threshold used to determine, with confidence, the morphology of each galaxy. Although such a high threshold is desirable, the


outcome has significantly increased uncertainty instead of helping astronomers and cosmologists in their quest to unfold the facts about our universe. Thus, we aim to use intelligent data analysis techniques to resolve this problem.

An analysis of the Galaxy Zoo Table 2 classification data set is carried out, and an investigation into how to determine the categories of those galaxies labelled as Uncertain using an unsupervised approach is initiated. We also review the astronomical data mining landscape and detail the progress that has been made thus far towards effective classification in this field. The various algorithms that have been utilised to this effect are studied and their results briefly analysed, with particular focus given to the K-Means clustering algorithm, which is the cynosure of this book.

The main research carried out here centres on developing a heuristic technique for attribute selection in order to help further increase the overall accuracies of galaxy classification, together with the utilisation of the K-Means algorithm in an unsupervised setting. The aim is to be able to provide astronomers with a means to assign each galaxy to the right category as accurately and efficiently as possible.

1.3 Book Organisation

This book is organised as follows. Chapter 2 takes a brief look at the history of Astronomy, describes the formation, make-up and populations of stars and galaxies, and looks at the conception and evolution of the Big Bang theory. Chapter 3 provides a thorough review of the emerging area of Astronomical Data Mining. We then analyse the problem of labelling uncertain galaxies in Galaxy Zoo Table 2 in Chapter 4. Details of the methodology used to address the problem are provided in Chapter 5. The implementation of this methodology is in turn discussed in Chapter 6. Experimental results are discussed in Chapter 7, before the book is concluded in Chapter 8 with a summary and directions for future work.


Chapter 2
Astronomy, Galaxies and Stars: An Overview

“Recognize that the very molecules that make up your body, the atoms that construct the molecules, are traceable to the crucibles that were once the centers of high mass stars that exploded their chemically rich guts into the galaxy, enriching pristine gas clouds with the chemistry of life so that we are all connected to each other biologically, to the earth chemically and to the rest of the Universe atomically. That makes me smile and I actually feel quite large at the end of that. It's not that we are better than the Universe. We are part of the Universe. We are in the Universe and the Universe is in us.” by Neil deGrasse Tyson

This chapter provides the reader with the required background knowledge in astronomy, which, in turn, facilitates the understanding of related terms that will appear in subsequent chapters. A special treatment of galaxies and their morphologies is provided, as this is the focus of our research reported in this book.

2.1 Why Astronomy?

Astronomy dates back as far as the Mayans, ancient Chinese and the Harappans, also known as the Indus Valley Civilisation. Astronomy was used as a means of keeping track of time and predicting future events, which was achieved through a combination of religion, astrology and the meticulous study of the positions and motions of various celestial bodies. It is generally believed that priests were the first professional astronomers, the pioneers of the field.

The real renaissance of astronomy began in the 1500s when Nicolaus Copernicus, a Polish university-trained Catholic priest, mathematician and astronomer, proposed a heliocentric model of our Universe in which the Sun, rather than the Earth, is at the centre of the Solar System. Figure 2.1 graphically illustrates this model. Just before he died in 1543, he published a book entitled De revolutionibus orbium coelestium (On the Revolutions of the Celestial Spheres), which became one of the most important contributions towards the scientific revolution [67].

Following this, in 1609, the German astronomer and mathematician Johannes Kepler accurately mapped, with the help of the Danish nobleman Tycho Brahe's observations [145], the motions of the planets through the Solar System in what he described


Fig. 2.1 The Heliocentric model (credit: Nicolaus Copernicus) (source: De revolutionibus orbium coelestium)

as the Three Laws of Planetary Motion [133]. These three laws are described as follows:

1. The Elliptical Orbit Law - The orbits of the planets are ellipses, with the Sun at one focus of the ellipse.

2. The Equal-Area Law - A line joining any given planet to the Sun sweeps out equal areas in equal times as that planet traverses around the ellipse.

3. The Law of Periods - The ratio of the squares of the revolutionary periods of two planets (P) is directly proportional to the ratio of the cubes of their semi-major axes (a):

   \frac{P_1^2}{P_2^2} = \frac{a_1^3}{a_2^3},

   where the time units for the periods and the distance units for the lengths of the semi-major axes are assumed to be consistent between the two planets, and subscripts 1 and 2 distinguish the values between planet 1 and planet 2 respectively.
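As a quick worked check of the third law (the planetary values below are standard approximate figures, not taken from this book): let planet 2 be the Earth, with P_2 = 1 year and a_2 = 1 AU, and let planet 1 be Mars, with a_1 \approx 1.524 AU. Then

   P_1 = \sqrt{\frac{a_1^3}{a_2^3}}\, P_2 = \sqrt{1.524^3} \approx 1.88 \text{ years},

which agrees with Mars's observed orbital period.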


Fig. 2.2 A replica of Galileo Galilei’s telescope (source: museumvictoria.com.au)

Many historians believed that Kepler's laws were, for the most part, ignored up until the publishing of the English mathematician and physicist Sir Isaac Newton's Principia Mathematica [120], but research has since shown otherwise. Soon after, the Italian physicist, mathematician, astronomer and philosopher Galileo Galilei created his first telescope, as seen in Figure 2.2, made various improvements to it and, with it, was able to view the moon, observe a supernova, verify the different phases of Venus and even discover sunspots. More importantly, his discoveries served to solidify the claims of the heliocentric model [74].

Even up until the late 1920s, debate over whether other galaxies were, in fact, island universes made up of billions upon billions of stars, or simply nearby nebulae, was still widespread. Furthermore, it was not until 1992 that the first confirmed detection of exoplanets, or extrasolar planets, planets which live outside the Solar System, was made [126].

When we think of astronomy, the technology behind the science is likely what would first cross the minds of most of us. If there is one thing that cannot be disputed, it is that history has shown us how the science of Astronomy has always pushed the boundaries and limits of technology and science, and very much continues to do so to this day. An excellent example of this is the James Webb Space Telescope, a large infrared telescope with a 6.5-meter diameter gold-coated beryllium primary mirror, a maturation of the Next Generation Space Telescope, which is planned for launch in 2018 for a five-year mission and is designed to assist thousands of astronomers in studying every phase in the history of our Universe [64]. Figure 2.3 shows an artist's impression of what the telescope will look like once launched and deployed.

Astronomy, today, continues to capture the hearts, minds and imaginations of many. As our forefathers before us have done, and their forefathers before them, we continue to look to the sky for answers. Astronomy is an incredibly important science simply because it is one of the best tools that we have with which to aid us in our never-ending search for answers: answers about the origin of our civilisation and our ultimate fate, about our place as a civilisation and as an occupant in this vast Cosmos, and about our uniqueness.


Fig. 2.3 An artist's impression of what the James Webb Space Telescope will look like (credit: jwst.nasa.gov)

This search is what makes us who we are, human, and this search will continue for generations to come, accompanied by the continued vast improvements that will be made to technology. Unfortunately, in today's world, while the pursuit of answers to these questions remains paramount, there have been increasing challenges to the importance of Astronomy and research in this field, a tension expressed rather poetically in the following quote:

“Preserving knowledge is easy. Transferring knowledge is also easy. But making new knowledge is neither easy nor profitable in the short term. Fundamental research proves profitable in the long run, and, as importantly, it is a force that enriches the culture of any society with reason and basic truth.” by Ahmed Zewail, winner of the Nobel Prize in Chemistry (1999).

2.2 Galaxies and Stars

As this book heavily utilises the terms morphology and galaxy, we offer a formal definition to describe what morphology, in relation to galaxies, is and provide a


Fig. 2.4 Left: NGC 1132 - An elliptical galaxy dubbed a fossil group due to its vast concentrations of dark matter. Right: Messier 64 (M64) - A spiral galaxy, the result of a collision between two galaxies. Due to the spectacular dark band of dust surrounding the galaxy's bright nucleus, it has been nicknamed by some the Evil Eye galaxy (credit: hubblesite.org).

brief look into the historical research that has shaped our understanding today of what these terms mean and represent.

A galaxy is defined as a populous system of stars, dust, dark matter and gases that are all bound together by gravity. Galaxies vary widely in size in terms of the number of stars that live within them, ranging anywhere from 10 million to 10 trillion stars. There are two common general shapes that a galaxy can take, either spiral or elliptical. Many variations within each also exist, as well as less-common shapes such as toothpicks or rings [16]. Figure 2.4 provides classic examples of a spiral and an elliptical galaxy.

A star, our Sun being a perfect example, is essentially a sphere of immensely hot gas, mainly hydrogen, partly helium and with minute traces of various other gases. Within its core, an incredible amount of energy is generated through the process of fusion, in which smaller atoms smash together at great speeds to form larger atoms. As with galaxies, astronomers also have a process for classifying stars. They are grouped into spectral types. By spectral, we refer to the temperature and brightness of the surfaces of the stars [115]. Table 2.1 lists the different spectral classes.

One of the most popular astronomy questions asked by many is: How many stars and galaxies are there in the Universe? Consider, for a moment, our human ability to count. If we had perfect eyesight, travelled to both the Northern and Southern Hemispheres and experienced the absence of the moon, providing an ideal, perfectly clear and dark sky, given such an ideal situation, we might be able to cover an area


Table 2.1 The Spectral Sequence

Spectral Class   Principal Characteristics   Temperature (K)
O                Hot Blue Stars              28,000-50,000
B                Blue-White Stars            9,900-28,000
A                White Stars                 7,400-9,900
F                Whitish Stars               6,000-7,400
G                Yellow Stars                4,900-6,000
K                Orange-Red Stars            3,600-4,900
M                Cool Red Stars              2,000-3,600

of up to 9,000 stars. With a decent telescope, that figure skyrockets to 15 million stars. With an observatory, we would be looking at stars in the billions. There is no doubt that that is quite an extraordinary, staggering figure. Bear in mind, as well, that this counts only the stars that live within our own galaxy. We have not even begun to consider the multitude of stars that reside in the billions of other galaxies out there! This still does not answer that age-old question, though, which, for the time being, remains unanswered in accurate terms. The fact is, there is no exact figure. However, given the continuous progression of science and technology, it is currently estimated that over 70 thousand million million million (70 sextillion) stars exist in our Universe. Take note, the repeated use of the word million is by no means a typographical error! According to a study by Nolan et al. [123], it was through the use of a $690 million telescope, used to study pulsars, gamma ray bursts, black holes and neutron stars, dubbed the Fermi telescope, that they were able to determine that our Universe has an average of 1.4 stars per 100 billion cubic light-years. This means that the distance between two stars is approximately 4,150 light-years! A light-year, though sounding very much like a measurement of time, is actually a measurement of distance. It is defined as a unit of astronomical distance equivalent to the distance light can travel in a single year (9.4607 \times 10^{12} km, which works out to be just under 9.5 trillion kilometres or close to 6 million million miles!).
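That kilometre figure is easy to verify (assuming the standard Julian year of 3.15576 \times 10^{7} seconds, a convention not stated in the text):

   1 \text{ ly} = c \times 1 \text{ yr} = 299{,}792.458 \tfrac{\text{km}}{\text{s}} \times 3.15576 \times 10^{7} \text{ s} \approx 9.4607 \times 10^{12} \text{ km}.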

Galaxies are no easier to count than stars, not by a long shot. Even with the world's best equipment available, we are only able to observe a fraction of our Universe. The estimate for the number of galaxies in existence, based on the Hubble Space Telescope observations back in 1999, stood at between 100 and 200 billion and, soon after, doubled when a new camera was installed on the telescope. More recently, a German supercomputer simulation put the estimate for the total number of galaxies in our Universe at approximately 500 billion, many of them older than our own Milky Way [43].

2.2.1 Galaxy Morphology

Galaxy morphology is a visual grouping system used by astronomers for galaxy classification. Almost a hundred years have now passed since it was first discovered that galaxies were independent systems subject to morphology and mergers [153].


Fig. 2.5 NGC-3603: An Open Cluster of Stars (source: astronomy.com)

The most famous system for morphological classification, known as the Hubble sequence, was devised by the American astronomer Edwin Powell Hubble [82, 83] in 1936, as seen in Figure 2.6.

Alternatively, because the complete sequence of galaxy morphology resembles a tuning fork, a result of the spiral series roughly being in parallel, the Hubble sequence also became colloquially known as the Hubble tuning fork diagram. It was also in 1936, when Hubble released his book Realm of the Nebulae [83], that research in the area became abundant and the study of galaxy morphology became a well-established sub-field of optical astronomy. Since then, numerous studies [86, 99, 52, 139, 40] based on the Hubble sequence, and proposed revisions [50, 131, 93, 94] to it, have been published. Hubble is generally regarded, in the field, as one of the most important observational cosmologists of the


Fig. 2.6 The Hubble Sequence, invented by Edwin Hubble in 1936 (source: sdss.org)

20th century, one who played an important role in the establishment of the field of extragalactic astronomy [121].

2.3 The Big Bang Theory

When analysing scientific theories to explain how the Universe came into existence, the Big Bang theory is unquestionably dominant. Most people, at one point in time or another, will have come across it. The Big Bang theory is currently the most consistent cosmological model for the early development of our Universe and is in line with observations made of its past and present states. It was the Belgian priest, astronomer and physics professor Georges Lemaître who first proposed what he dubbed his hypothesis of the primeval atom which, after numerous scientists built upon it, formed the modern synthesis today known as the Big Bang theory [102]. Figure 2.7 shows a graphical depiction of the Big Bang theory.

In 1948, the American cosmologist Ralph Asher Alpher and the American scientist Robert Herman published a prediction that took the Big Bang theory into consideration. They predicted that if the theory were true, the glow of light from atoms first formed 300,000 years after the Big Bang would be visible today [12]. Almost 20 years later, in 1964, the American Nobel laureate, physicist and radio astronomer Arno Allan Penzias and the American Nobel laureate and astronomer Robert Woodrow Wilson of Bell Labs managed to identify this light when they accidentally discovered


Fig. 2.7 A Graphical Illustration of the Big Bang Theory (Credit: NASA/WMAP Science Team)

a microwave signal that was thought to be unwanted noise and attempted to filter it out. This led to their discovery of the Cosmic Microwave Background [117], creating the strongest evidence to date in support of the Big Bang theory [146].

The Big Bang theory suggests that, approximately 13.8 billion years ago, the Universe suddenly started rapidly expanding from an incredibly small, hot and dense state, also referred to as the Singularity, and eventually cooled enough for energy to form the building blocks of life: protons, neutrons and electrons; thus the Universe was born. It is through observations made of the timeline of the Big Bang, the extrapolation of the expansion of the Universe backwards in time, that we are able to begin to understand how the formation of these light elements, as well as that of galaxies, came about. The current estimate of the age of the Universe is as follows:

(13.798 \pm 0.037) \times 10^{9} \text{ years} \quad \text{or} \quad (4.354 \pm 0.012) \times 10^{17} \text{ seconds}
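The two figures are mutually consistent: multiplying by the Julian year (an assumed convention, \approx 3.15576 \times 10^{7} seconds) gives

   13.798 \times 10^{9} \text{ yr} \times 3.15576 \times 10^{7} \tfrac{\text{s}}{\text{yr}} \approx 4.354 \times 10^{17} \text{ s}.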

The Big Bang theory also states that the Universe is continuously expanding even today. As a result of this continuous expansion, the distance between galaxies has increased exponentially, creating unimaginable distances between them [66]. The popularity of this theory increased significantly after microwave background radiation was discovered in the 1970s [106, 119]. However, problems with the theory surfaced towards the end of the 1970s which made it seem largely


Table 2.2 A Summary of the Evidence for the Big Bang theory

Evidence: Cosmic Microwave Background
Conclusion: The background radiation observed is, in fact, the remains of energy produced 300,000 years after the Big Bang

Evidence: Redshift is observed when studying light from other galaxies
Conclusion: Other galaxies are continuously and rapidly moving away from us

Evidence: Redshift appears greater in light that comes from more distant galaxies as opposed to closer galaxies
Conclusion: The change in redshift indicates that the Universe is expanding and that it originated from a single point

incompatible. These problems, for example the Domain Wall problem [100, 51, 98], the Primordial Monopole problem [107, 105, 54] and the Gravitino problem [53, 55, 15], were eventually resolved through a plethora of studies. Table 2.2 summarises the evidence that exists in favour of the Big Bang theory.

2.4 Summary

In this chapter, we provided the reader with an overview of the field of Astronomy with emphasis on galaxies and their morphologies. This quick and interesting journey through our Universe provides the necessary background knowledge for the interdisciplinary research reported in this monograph. We have seen that the wonderful cosmos surprises us with new discoveries. With every new fact revealed about our Universe, we find that what we think we know is still far from what we hope to know!

In the following chapter, we cross the bridge between the two disciplines in our research, namely, data mining and astronomy.


Chapter 3
Astronomical Data Mining

“GALEX, as a whole, produced 20 terabytes of data, and that's actually not that large today. In fact, it's tiny compared to the instruments that are coming, which are going to make these interfaces even more important. We have telescopes coming that are going to produce petabytes (a thousand terabytes) of data. Already, it's difficult to download a terabyte; a petabyte would be, not impossible, but certainly an enormous waste of bandwidth and time. It's like me telling you to download part of the Internet and search it yourself, instead of just using Google.” by Alberto Conti

Various research projects have been conducted in an attempt to explore and improve the classification process of astronomical images as well as enhance the study of the classified data. The growing interest in this area is, to a large extent, attributed to the introduction of citizen science projects like Galaxy Zoo that host copious amounts of such data and encourage the public to involve themselves in classifying and categorising these images. It is also due, in large part, to the profusion of data being collated by numerous sky surveys like the Sloan Digital Sky Survey (SDSS) which, at present, hosts an imaging catalogue of over 350 million objects and is continuously growing [4]. With such copious amounts of data, it would not be unreasonable to state that our capacity today for acquiring data has far outstripped our capacity to analyse it; there is no doubt that manual processing has long become impractical, creating a need for automated methods of analysis and study. This is where data mining comes in, creating a new paradigmatic approach, dubbed fairly recently the fourth paradigm [19, 77]. Data mining emerged as an important field of study at the time of convergence of growing data repository sizes and the computational power of high-performing computational facilities [62]. Finding its roots in machine learning, pattern recognition, statistics, and databases, data mining has grown tremendously over the past two decades. Recently, data mining has faced the challenge of the ever increasing size of data repositories, thanks to advances in both hardware and software technologies. Among the very largest data repositories comes the astronomical data produced by sky surveys.

The volume of literature involving machine learning with Galaxy Zoo is not extremely expansive in this regard, but does show an array of algorithms like Naïve Bayes, kNN, C4.5 and even Artificial Neural Networks being used in an attempt


to enhance our current understanding, make predictions (e.g. galaxy collisions) and calculate probabilities more accurately and reliably. However, clustering techniques, like partitioning clustering represented by K-Means, and hierarchical clustering, are not as widely used in astronomical data mining, for reasons which will be discussed.

In this chapter, we give an overview of the data mining field. We look, briefly, at the Galaxy Zoo project and its conception, and conduct a more detailed analysis of the data that both it and the SDSS provide and the various projects in data mining and machine learning that have stemmed from this citizen science project. The focus will be on the various methods of pre-processing, the algorithms and techniques used and the resulting outcomes of such applications.

3.1 Data Mining: Definition

The definition of data mining, to this day, has yet to be universally agreed upon. With varying descriptions, it is not hard to confuse oneself as to what data mining is and what it represents [92, 14, 72, 132, 88]. In this book, our view is an uncomplicated one. Data mining is akin to discovery. That is to say that the goal of data mining is to conduct analysis and produce an outcome that will result in effective human learning. It is the process of characterising the known, assigning the new and discovering the unknown which encompasses, in a generalised view, what data mining represents [28]. More specifically, data mining is the application of certain specialised algorithms, designed to extract patterns, to data sets [57].

The terms data mining and knowledge discovery were used to describe two interrelated, but different, concepts in the early days of the field. Knowledge discovery described the complete process from data cleansing to pattern visualisation, while data mining was described as the heart of the knowledge discovery workflow. Nowadays the two terms are used interchangeably.

Data mining techniques are usually categorised into descriptive and predictive. Descriptive techniques include methods that can give a general description of the data. Association rule mining and clustering techniques belong to this category. Predictive techniques include methods that build models that can be used to predict the value of one or more attributes. Regression and classification techniques represent this category. Another categorisation, borrowed from machine learning, is into supervised and unsupervised learning techniques. The former describes techniques that build models based on the value of a particular attribute, hence the term supervised. On the other hand, unsupervised learning methods build models utilising all the attributes in the same way.
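The contrast is easy to see in code. The following sketch (using scikit-learn in Python; the data and the target attribute are invented purely for illustration) fits one technique from each category to the same feature matrix:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))            # all attributes, treated alike
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # one particular attribute to predict

# Unsupervised/descriptive: group the objects using every attribute equally.
groups = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Supervised/predictive: learn to predict the chosen attribute from the rest.
model = DecisionTreeClassifier().fit(X, y)
print(groups[:5], model.predict(X[:5]))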

3.1.1 Applications and Challenges

With the amount of data being acquired regularly, adding up to no less than the range of terabytes [24, 73], announcing the era of Big Data [47], and data mining being a suite of techniques that is universally applicable to data of practically any


nature, data mining has been applied successfully to a number of application areas. Examples of these areas include:

• Medical/Patient Database Mining - With the introduction of data mining into areas such as health, analyses of patient databases have been carried out for a multitude of purposes, such as discovering potential contributing factors towards pre-term birth using exploratory factor analysis and assessing the effectiveness of emergency services through Bayesian networks [152, 129, 112].

• Business/Finance Data Mining - Determining good and bad loans through detailed analyses of customer databases and building predictor models of what strategies to employ to attract customers towards products are just some of the areas to which data mining has been applied [96, 30].

• Government Data Mining - Data mining has also gained popularity within government as a means of monitoring the effectiveness of programmes applied towards its citizens and of ferreting out fraud [41, 85, 137].

The main key issues of data mining centre on security and privacy. With all these large databases, some potentially containing very personal information of individuals, being scrutinised, the question beckons: who is monitoring those who are studying this data? How do we know for certain whether our data is safe from prying eyes, or that the person(s) studying it will not exploit it? It is for reasons such as this that some are still wary of the utilisation of data mining techniques. One such example can be seen in the Total Information Awareness project [156], a federal project initiated by the Department of Defense in the United States in 2001 that hosted a surveillance database with the objective of tracking terror suspects through deep analyses of credit card purchase histories, telephone records and travel itineraries. This has led to an area of research involving discreet, privacy-preserving data mining techniques and the means by which to analyse data without being able to exploit it [44, 9, 8, 108, 147, 151]. It is worth noting that the privacy issue surrounding Big Data analytics has made some news stories¹. The issue was recently discussed at the Thirty-third SGAI International Conference on Artificial Intelligence, held in December 2013 in Cambridge².

3.2 Galaxy Zoo: Citizen Science

Due to the enormous number of images (numbering in the millions) produced by various platforms such as the Hubble Space Telescope and the Sloan Digital Sky Survey (SDSS), having a small group of scientists manually review and classify the astronomical objects in these images is no easy feat. As a result, the citizen science project Galaxy Zoo was formed to seek help from volunteer citizen scientists to

¹ http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

² http://www.bcs-sgai.org/ai2013/?section=panel


Fig. 3.1 Galaxies Resembling Latin Alphabets (source: space.com)

manually classify thousands upon thousands of galaxies, thus painting a more detailed picture of our universe.

Galaxy Zoo, which went live in 2007, already has over 200 million classifications, with approximately 3000 prominent mergers in the SDSS positively identified by more than 200,000 volunteers. It promotes scientific outreach through various means, the most popular of which is its forum. Members can discuss their finds, start their own investigations and network with other like-minded citizen scientists. Professional scientists have also come to the forum to seek help with classifications that are not part of the Galaxy Zoo decision tree. There are even non-scientific applications that have been created in conjunction with Galaxy Zoo, like the hunt for galaxies that resemble the shapes of each of the twenty-six letters of the English alphabet to form a galaxy font (see Figure 3.1) [21].

Discoveries such as the Red Spiral Galaxies, Green Peas and Hanny's Voorwerp were all the result of unusual, unique finds that teach participants of Galaxy Zoo to actively investigate our cosmos and not simply rely on what science already teaches us. For example, when the Green Peas, seemingly unresolved round point sources that appeared green, were first discovered, further identification by interested parties was required. This involved identifying common characteristics and generating a signal-to-noise measure unique to the Green Peas. Once all of this was accomplished, the data was provided to Galaxy Zoo investigators for further analysis [130]. Perhaps the most famous of these discoveries is that of Hanny's Voorwerp, discovered by the Dutch school teacher Hanny, which is an emission line nebula neighbouring the spiral galaxy IC 2497 [60]. See Figure 3.2 for the six Green Pea galaxies studied at the University of Michigan³.

³ http://www.astronomy.com/news/2013/04/green-pea-galaxies-could-help-astronomers-understand-early-universe


Fig. 3.2 Six Green Pea Galaxies (source: astronomy.com)

With the exponential rise in the amount of data collected, the use of Galaxy Zoo has also become increasingly popular in the field of data mining, employing machine learning algorithms to convert Galaxy Zoo data from observation into information. This information can then be used to support existing hypotheses, create new theories and make predictions on events like galaxy mergers and morphology. Not everyone thinks as positively as this, however. As not all scientists are experts in the use and manipulation of databases or statistics, some are doubtful and hesitant towards the use and capability of machine learning [19].

This chapter is aimed at demonstrating the power that machine learning has in accomplishing these tasks efficiently and as accurately as possible. These discoveries, coupled with the various research projects that have been developed through the extensive study and use of Galaxy Zoo, have proven it to be an important contributor to both the astronomy and data mining disciplines.

3.3 Galaxy Zoo/SDSS Data

Galaxy Zoo data can be obtained by visiting the Galaxy Zoo website⁴ and downloading any of the seven publicly available data sets. The attributes available include, as shown in Table 3.1, OBJID, RA and DEC, which are all similarly used in the SDSS database for unique galaxy identification. The SPIRAL, ELLIPTICAL and UNCERTAIN attributes provide the final classifications of each of the galaxies in the

⁴ http://data.galaxyzoo.org/


Table 3.1 Sample of Galaxy Zoo Data Set Attributes

Attribute       Description
OBJID           Unique ID for each galaxy
RA              Right ascension
DEC             Declination
NVOTE           Total no. of votes acquired for each galaxy
P_EL            Elliptical morphology score
P_CW            Clockwise spiral morphology score
P_CS_DEBIASED   Final debiased clockwise spiral morphology score
P_EL_DEBIASED   Final debiased elliptical morphology score
SPIRAL          Class label
ELLIPTICAL      Class label
UNCERTAIN       Class label

data set. It is important to note that the vast majority of the galaxies listed in these data sets are classified as UNCERTAIN because of a debiasing function and a threshold of 0.8 applied to the final voting scores [109].
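A minimal sketch of how such a threshold rule plays out (illustrative only: the column names follow Table 3.1, the file name is hypothetical, and the debiasing function itself [109] is not reproduced here):

import pandas as pd

THRESHOLD = 0.8  # the 80% consensus threshold discussed above

def label_galaxy(row):
    # Assign a morphology label from the debiased vote fractions.
    if row["P_EL_DEBIASED"] >= THRESHOLD:
        return "ELLIPTICAL"
    if row["P_CS_DEBIASED"] >= THRESHOLD:
        return "SPIRAL"
    return "UNCERTAIN"  # neither class reaches the consensus threshold

galaxies = pd.read_csv("galaxy_zoo_table2.csv")  # hypothetical file name
galaxies["label"] = galaxies.apply(label_galaxy, axis=1)
print(galaxies["label"].value_counts())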

The Sloan Digital Sky Survey (SDSS) has recently released its latest data collection for SDSS-III, DR9 [10], boasting over 1.2 billion catalogued objects consisting of galaxies, stars, quasars and more. It is from the introduction of these various SDSS data releases [3, 1, 2, 4, 7, 6, 11] over the last 10 years that over 3500 papers have been authored in an attempt to study, analyse and interpret the data.

To better understand this data, it is important to be familiar with the SDSS photometric system, which is commonly utilised in its description. Fukugita et al. [61] describe it as containing five colour bands (u', g', r', i' and z') which divide the entire range from the atmospheric ultraviolet cutoff at 3000 Å (Angstrom) to the sensitivity limit of silicon Charge-Coupled Devices (CCDs) at 11,000 Å into five essentially non-overlapping pass bands. In essence, each letter designates, as described in Table 3.2, a particular section of the electromagnetic spectrum.

Table 3.2 Descriptions of the Five Filter Bands

Filter Descriptionu’ Peaks at 3500Ag’ A blue-green band centred at 4800Ar’ Red passband centred at 6250Ai’ Far red filter centred at 7700Az’ Near-infrared passband centred at 9100A

There are four main categories of data that are housed in the SDSS database: images, spectra, photometric data and spectroscopic data. Photometric and spectroscopic data are most commonly used with data mining. Photometric data provides feature-identifying attributes of galaxies such as brightness, texture and size. This data


has been used for purposes such as the classification of galaxies possessing Active Galactic Nuclei (AGN) [35], predicting galaxy mergers [17] and detecting anomalies in cross-matched astronomical data sets [76], to name just a few. Spectroscopic data, on the other hand, provides assorted measurements of each object's spectrum, like redshift and spectral type, which has been utilised, for example, to identify cataclysmic variables in order to estimate orbital periods [144]. One problem pertaining to the use of spectroscopic data that still lacks a solution is the identification of cluster membership in the absence of spectroscopic redshifts [18].

Obtaining data from the SDSS database can be achieved by visiting the SDSS website⁵ (this is for the latest release, DR9) and submitting MySQL queries to the relevant tables for the required attributes [142, 143]. Each galaxy in the database is uniquely identifiable by its object ID and also by a combination of its right ascension and declination, which forms, in the query, a unique composite key.

⁵ http://www.sdss3.org/dr9

Table 3.3 Sample of SDSS Database Attributes

Attribute      Description
isoAGrad_z     Gradient of the isophotal major axis
expMag_u       Exponential fit
expMagErr_u    Exponential fit error
texture_r      Measurement of surface texture
lnLDeV_g       DeVaucouleurs fit ln(likelihood)
lnLExp_r       Exponential disk fit ln(likelihood)
isoA_z         Isophotal major axis

Table 3.3 provides a minute sample of the attributes obtainable from the PhotoObjAll table in the SDSS database. It is notable that each attribute is linked to one of the five photometric colour bands (i.e. u', g', r', i' and z'). The same data, but with redshift adjustment incorporated, can also be queried from the same table if desired. Queries for photometric data can take the following form:

SELECT
  a.expRad_g, a.deVRad_g, a.expRad_r, a.expRad_i,
  a.expRad_z, a.deVRad_r, a.expRad_u, a.deVRad_i,
  a.deVRad_z, a.isoA_g, a.lnLDeV_g, a.isoBGrad_r,
  a.lnLDeV_r, a.lnLDeV_i, a.isoA_r, a.lnLExp_r,
  a.lnLExp_i, a.isoBGrad_i, a.lnLExp_z, a.isoAGrad_i,
  a.isoPhiGrad_g, a.isoAGrad_r, a.lnLDeV_z,
  a.petroRad_u, a.texture_g, a.deVAB_u, a.modelMag_z,
  a.dered_z, a.expMag_i, a.isoAGrad_g, a.isoPhiGrad_r,
  a.lnLDeV_u, a.isoPhiGrad_i, a.isoColcGrad_g,
  a.isoColcGrad_r
FROM #x x, #upload u, PhotoTag p, PhotoObjAll a
WHERE u.up_id = x.up_id and x.objID = p.objID and p.objID = a.objID
ORDER BY x.up_id

In the above query, 35 attributes are called from the PhotoObjAll table, which is linked to the PhotoTag table by its objID. Together with the query, a list of right ascension and declination values has to be provided, a sample of which is shown below:

name   ra        dec
A1     0.00171   -10.37381
A2     0.00308   -9.22228
A3     0.00429   -10.94667
A4     0.00575   15.50972
A5     0.00646   -0.09258
A6     0.00654   -9.49453
A7     0.00775   0.71925
A8     0.00833   15.69717
A9     0.00875   15.88172
A10    0.01004   14.82194
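For readers who prefer scripted access, such queries can also be submitted programmatically. The following is a minimal illustrative sketch using the astroquery package's SDSS module; SDSS.query_sql and the SkyServer helper function dbo.fGetNearbyObjEq are assumed to be available for the data release in use, and the attribute list and 0.5 arcminute search radius are arbitrary choices, not those used in this study.

from astroquery.sdss import SDSS

# Match objects near one (ra, dec) pair from the sample above;
# dbo.fGetNearbyObjEq(ra, dec, radius_arcmin) is a SkyServer helper
# returning nearby object IDs.
query = """
SELECT p.objID, p.expRad_g, p.deVRad_g, p.texture_r
FROM PhotoObjAll p
JOIN dbo.fGetNearbyObjEq(0.00171, -10.37381, 0.5) n ON n.objID = p.objID
"""
result = SDSS.query_sql(query)  # returns an astropy Table
print(result)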

3.4 Data Pre-processing and Attribute Selection

Data pre-processing, as argued by Zhang et al. [164], is one of the most important steps in the data mining process. It carries its own set of issues, dependent on the nature of the data set(s) to be used. The steps taken during pre-processing are also determined by the algorithm that is subsequently used. Some algorithms may require an object's attributes to be numerical or categorical; in the case of stars and galaxies, they are mainly numerical. Conversion between the two types is also possible through methods such as scalarisation and binning, as sketched below.
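As a small illustration of binning, a numerical attribute can be discretised into categories in a few lines of Python; this is a sketch only, and the pandas library, bin edges and category labels are arbitrary assumptions rather than anything used in this study.

import pandas as pd

# Bin a magnitude-like numerical attribute into three categories.
mags = pd.Series([14.2, 15.8, 17.1, 18.9, 20.3])
binned = pd.cut(mags, bins=[14, 16, 18, 21],
                labels=["bright", "medium", "faint"])
print(binned)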

Often, data requires cleaning. This is either the result of noisy data (e.g. human or computer error at data entry) or inconsistent data (e.g. functional dependency violation). It is commonly the case that the acquired astronomical data contain one or more invalid or missing values. Rectifying this issue is done either by interpolating a value for that field using other information, or by removing that particular object altogether and using the remaining data, as some algorithms cannot accept objects with missing field values [19, 71]. As an interesting side note, significant research effort has also gone into this issue of resolving missing attribute values [5, 68, 164].
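Both strategies, removal and interpolation from other information, are straightforward to express. A minimal sketch in Python with pandas follows; the column names and the choice of column means as the interpolated values are illustrative assumptions.

import numpy as np
import pandas as pd

# Toy objects with missing field values (NaN).
df = pd.DataFrame({"petroRad_u": [1.2, np.nan, 0.9, 1.1],
                   "texture_r": [0.4, 0.5, np.nan, 0.6]})

removed = df.dropna()          # strategy 1: drop objects with missing fields
filled = df.fillna(df.mean())  # strategy 2: fill gaps with the column mean
print(removed, filled, sep="\n\n")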

Baehr et al. [17] faced this issue in their study, which used 6,310 objects, each containing 76 attributes including the merger/non-merger nominal attribute. Considerable pre-processing was required since both C4.5 decision tree and cluster analysis algorithms were chosen for this study. All attributes not representing morphological characteristics were removed. As for missing or bad values, since estimating these values was not possible, the objects were removed. Lastly, a concentration index was generated, while distance-dependent attributes were made distance-independent via redshift.

The final sizes of the data sets used vary depending on the study. However, the integrity of the objects in a data set always takes precedence over the overall size of the set [37, 90, 118, 149]. For example, the study by Vasconcellos et al. [149] presented the acquisition of over one million objects spanning six classes. Objects were selected as having the required features suitable to the study. However, as their study only required objects of spectral class star and galaxy, these made up 884,378 objects of the original set. Further removals of objects, including those containing non-physical values, left a final sample size of 884,126 objects. This final data set was formed through the use of training examples, each with its own set of attributes that eventually led to a fixed class.

McConnell and Skillicorn [118] came up with 5 suitable data sets totalling 36,159 objects with 41 classes. A training set was then created by resampling the data. This process was then repeated on all 5 data sets, resulting in one third of the data not being sampled at all and, instead, becoming a test set. They acknowledged that astronomical data sets are typically partitioned by objects and attributes, possessing horizontal and vertical partitioning respectively. As their study involved a distributed data mining approach for both horizontally and vertically partitioned data, they built a predictor locally on each partition, and these were then consolidated together with their predictions.

An example of an extremely small data set is that of a study by Calleja and Fuentes [37]. They compared the performances of three machine learning algorithms (i.e. Naïve Bayes, C4.5 and Random Forest) using the classification of images produced by numerous digital sky surveys as a medium. A total of 292 images were used. A unique feature of this study is the way the images to be classified were prepared, in two stages before the machine learning phase begins. Each image is rotated horizontally, centred and then cropped in the analysis stage of the experiment. In the data compression phase, the dimensionality of the data is reduced to find a set of features. These are then used to facilitate the machine learning process.

Training and testing sets are indicated as very important, as is their construction, particularly with supervised methods [75, 135, 136, 149]. Kamar et al. [87], amongst other methods used to clean and prepare their data, divided their 816,246 objects from the Galaxy Zoo data set into training, validation and testing sets, containing 628,354, 75,005 and 112,887 objects respectively. Task characteristics, characteristics of the voters for that particular task and votes acquired for that task were defined as well. They also reported that data consisting of both the log entries of voters collected during Galaxy Zoo's operation as well as the voters' reports containing their vote and information about them (e.g. time and day that the report was received) was used for their experiment.

Banerji et al. [22], in their study of galaxy morphology, took a different approach to their research, however, as their study involved the use of artificial neural networks, unlike most studies in this area. They have stated the importance of choosing the appropriate input parameters: those that show, in this case, marked differences across the three morphological classes in their study. The artificial neural network uses these parameters, as well as the training set, to derive its own morphological classifications.

Wagstaff and Laidler [155] have shown how, with clustering algorithms, the choice of method(s) for dealing with missing values in a given data set, particularly with astronomical data, can be particularly important. Missing values can arise for a multitude of reasons, ranging from unfavourable observing conditions to instrument limitations. As a result, they developed a clustering analysis algorithm called K-Means Clustering with Soft Constraints (KSC) [154] and applied it to 1507 objects from the SDSS database. The study concluded that simply removing missing values and then clustering the data set can potentially produce misleading results.

As we can see from these studies, data sets used in machine learning can vary greatly in size depending on their purpose and the algorithm(s) used. The proper definition of the required features and of the training and test sets plays a very important role in the final outcome. Furthermore, missing, incomplete or noisy data are commonplace in these huge astronomical data sets. This only serves to further strengthen the point that the pre-processing of data is a crucial step in the entire knowledge discovery process. Akin to that is the attribute selection process: identifying the best data set(s), object candidates and attributes for use greatly affects the final results [136, 149].

3.5 Applied Techniques/Tasks

A whole host of techniques and algorithms have been applied to the Galaxy Zoo data sets, including C4.5 (a decision tree based classifier), Naïve Bayes, Random Forest and Artificial Neural Networks, to name just a few. In a broad sense, the algorithms used can be divided into supervised and unsupervised methods. With supervised methods, training sets are relied upon for use with classification. These methods are trained on the given set of objects and the result is used to map other objects similarly. Unsupervised methods, however, do not require the use of a training set. Without training data, methods such as clustering and certain artificial neural network approaches infer group structure and models based on similarities [157, 158].

As an example of the use of supervised methods, Calleja and Fuentes [37] used three algorithms, namely Naïve Bayes, C4.5 and Random Forest, all of which are implemented in WEKA [70] (see Figure 3.3 for categorised classifiers in WEKA Explorer). The study was designed to ascertain the best classifier algorithm for use in the problem of galaxy classification. For all three algorithms, 10-fold cross-validation was consistently used. The C4.5 algorithm was implemented using a confidence factor of 0.25 as well as pruning, and the Random Forest algorithm employed 13 trees for all experiments. Their final results leave no doubt that the Random Forest algorithm provided the best accuracy for all galaxy classes, surpassing C4.5 and Naïve Bayes with 91.64% accuracy for the three-class case, 54.72% for the five-class case and 48.62% for the seven-class case. We have also adopted Random Forest in the study reported in this monograph, a choice based on the notable success of the technique, not only for astronomical data sets, but also for other scientific and business applications.

Fig. 3.3 WEKA Explorer offers a variety of classification techniques

Vasconcellos et al. [149] also utilised a similar supervised method for their study, the Decision Tree (DT) method. As they used the WEKA Java software package, which comes with 13 different DT algorithms, they used cross-validation to compute the completeness function for each of these algorithms along with all sets of internal parameters, optimising the parameters to maximise completeness. Once that was achieved, each algorithm's performance was tested. The results showed that the Functional Trees algorithm was the most suitable for this study. A training set was then chosen to construct the final DT for classification. This involved taking all 884,126 objects from the database and narrowing them down to 240,712 objects with 13 attributes. The resultant DT was then applied to the final classification task. When this DT was applied to 561,070 SDSS objects previously classified with an axis-parallel DT that assigns a probability to each object's class type, it performed similarly to the axis-parallel tree but with lower contamination rates of approximately 3%.

Baehr et al. [17] also made use of decision trees, attempting to predict the coalescence of pairs of galaxies by generating a tree in WEKA using the C4.5 algorithm based on three arguments: binarySplits, confidenceFactor and minNumObj. The information gain for each of the acquired attributes for 6,310 objects was first calculated, and then three trees were generated: one trained on all instances, one trained on merger instances with stronger Galaxy Zoo user confidence and one trained similarly on merger instances but with weaker confidence. The tree that used all mergers resulted in approximately 70% accuracy with 66% precision and 68% recall, which was considered less useful compared to the other two trees. It was also noticeable that the strongest predicting attributes appeared to be associated with the SDSS green filter waveband, which turns out to be very crucial, as the green band carries a disproportionate amount of data compared to the others.

A study by Kamar et al. [87], which solely involved the use of the Naïve Bayes algorithm, focused on harnessing the power of machine learning as a means of solving crowdsourcing tasks. This was achieved by constructing Bayesian predictive models from data sets and using these models to fuse contributions, both human and machine, to predict worker behaviour in an attempt to better guide decisions on hiring and routing workers. With the Bayesian-structured learning acquired from the Galaxy Zoo data, a Direct model was generated which was able to predict the correct answer of a given task and predict the next received vote. Subsequently, Naïve Bayes and Iterative Bayes models were also generated for accuracy comparison with a Baseline model that classifies all task instances as the most likely correct answer in the training set. The results in this study are also fairly conclusive. When the number of votes is small, both the Iterative Bayes and Naïve Bayes models perform better than the Direct model. However, once the number of votes increases and gets fairly large, the Direct model provides the greatest accuracy.

Instance-based data mining, more specifically the k-Nearest Neighbour algorithm, has also been used in this combination of astronomy and data mining [20, 150, 27]. The study by Ball et al. [20] involved the use of a data set containing 55,746 objects classified as quasars by the SDSS and 7,642 objects cross-matched from this set to data from the Galaxy Evolution Explorer (GALEX). The results revealed the ideal parameters to be 22 ± 5 nearest neighbours (NN) with a distance weighting (DW) of 3.7 ± 0.5. While there are no regions of catastrophic failure (i.e. groups of objects assigned a redshift very different from the true value) in the final published results when the algorithm assigns redshifts to quasars, further improvement is noted as certainly feasible.

Martinez-Gonzalez et al. [116] have shown an example of the utilisation of the Expectation-Maximization (EM) algorithm with astronomical data by using it to iteratively estimate the power spectrum of the cosmic microwave background (CMB) (see Figure 3.4 for a map of the CMB). The EM algorithm was able to positively provide a straightforward mechanism for reconstructing the CMB map. They acknowledge, as well, that the EM algorithm is highly useful when a many-to-many mapping is involved. The main advantage of applying the EM algorithm in this study stems from the presence of unknown data; parametrising the unknown data allows the EM process to return the best set of free parameters.

Fig. 3.4 A map of the CMB created by the COBE satellite (credit: NASA, DMR, COBE Project) – source: BBC website

The use of the EM algorithm in an unsupervised setting is also exemplified by Kirshner et al. [91], who applied the probabilistic-learning technique in order to spatially orient various galaxy shapes. They successfully classified, with a high degree of accuracy, the various classes of galaxies through model-based galaxy clustering, which clusters these objects based on morphological properties using cross-validation.

Another area that has been generating interest in astronomy and machine learning is the casting of predictions through the use of artificial neural networks. Banerji et al. [22] used an artificial neural network that was trained with three sets of input parameters and were able to clearly distinguish between the different morphological classes (i.e. Early Types, Spirals, Point Sources/Artefacts) depending on the assigned parameters. The neural net probability of an object belonging to a class is plotted against the percentage of genuine objects of that class that are discarded at this probability, thus providing the optimum probability threshold of the neural network for each morphological class. During this study, it was found, however, that if an object had a neural net probability of more than 0.1 in the Point Source/Artefact class, it was possible it also had a probability of more than 0.5 in the Spiral class. As a result, some objects were poorly classified by the neural network and some were placed in more than one class as a result of the probabilities. After adding profile-fitting and adaptive shape parameters to their initial results, the final results revealed that 92% of Early Types, 92% of Spirals and 96% of Point Sources/Artefacts were correctly classified. This showed that with 12 carefully chosen parameters, the neural network results provide greater than 90% accuracy in classifications compared to those already made in the Galaxy Zoo data set.


Table 3.4 Summary of Research Reviewed (objective, followed by the technique(s) applied)

• To improve the task of estimating photometric redshifts using SDSS & GALEX data: kNN (IBk)
• To develop a new modified algorithm for outlier detection: K-Nearest Neighbor Data Distributions (KNN-DD) & PC-OUT
• To derive a multidimensional index to support approximate nearest-neighbour queries over large databases: DBIN (Density-Based Indexing) over K-Means
• To develop a procedure for computing a refined starting condition from a given initial one: K-Means
• To develop a scalable implementation of the Expectation-Maximization (EM) algorithm, based on a decomposition of the basic statistics the algorithm needs: Expectation-Maximization (EM) algorithm
• Estimating cosmic microwave background power spectrum and map reconstruction: Expectation-Maximization (EM) algorithm
• Applying probabilistic model-based learning to automatically classify galaxies: Expectation-Maximization (EM) algorithm
• To develop a scalable clustering framework designed for iterative clustering: Scalable K-Means
• To determine when the coalescence of two galaxies takes place: C4.5 (Information Gain Analysis)
• To automate the classification process of galaxies: Artificial Neural Network (ANN)
• A comparison of three algorithms in the task of galaxy classification: Naïve Bayes, C4.5 (J48) & Random Forest
• Comparing performances when distinguishing between spiral and elliptical galaxies and other galactic objects: CART, C4.5 (J48) & Random Forest
• To apply a developed Bayesian formalism in order to study star/galaxy classification accuracies: Naïve Bayes
• A comparison of the performances of three algorithms in morphological galaxy classification: Support Vector Machines, Random Forests & Naïve Bayes
• A comparison of the efficiency of 13 different decision tree algorithms applied to star/galaxy classification data from the SDSS: J48, J48graft, NBTree, BFTree, ADTree, FT, LMT, LADTree, Simple Cart, REPTree, Decision Stump, Random Tree & Random Forest
• To estimate the accuracy of the photometric redshifts for several SDSS data catalogues: Self-Organising Mapping (SOM) (RMSE)
• To explore Bayesian classifier combinations for the purpose of imperfect decision combination: Variational Bayesian Inference



Table 3.4 provides an overview of some of the work that has been discussed in this chapter involving classifying astronomical data, comparing various methodologies, and improving existing clustering algorithms and classification techniques.

3.6 Summary and Discussion

In this chapter, we looked at how data mining, as a science, is defined and what it contributes to this data-deluged world. There are many varying definitions of what data mining is and so, to minimise confusion, we defined it as being akin to discovery and the application of certain specialised algorithms, designed to extract patterns, to data sets. Methods of pre-processing astronomical data have also been discussed, and it was shown that, with astronomical data in particular, removing bad values is not always advisable as it can produce misleading results. The sizes of data sets were also shown to vary greatly depending on the study, and the attribute selection process was demonstrated to be exceptionally important.

We see a lot of work done on clustering algorithms in areas like density-based indexing over K-Means, refining the initial points for K-Means clustering, scaling both the Expectation-Maximization (EM) and the K-Means algorithms to large databases, and refining the EM algorithm's starting points for clustering [25, 32, 31, 33, 58, 81, 124]. Improvements are constantly being made to these techniques and their applications across domains.

Wozniak et al. [161] conducted a comparison of the effectiveness of Support Vector Machines (SVM) and unsupervised methods including K-Means and Autoclass. With SVM, a preliminary efficiency of 95% was obtained after isolating a selected few defined classes against the rest of the sample used, outperforming the unsupervised methods. However, they acknowledge that this result is to be expected, as supervised methods tend to perform better under these circumstances. As such, unsupervised methods like K-Means should not be underestimated.

In addition to this, Jagannathan and Wright [84] have developed the concept of arbitrarily partitioned data, a generalisation of both horizontally and vertically partitioned data, and have also developed a privacy-preserving K-Means algorithm over arbitrarily partitioned data which uses a novel privacy-preserving protocol. While their results have shown that there is still the occasional data leak, with further improvements made to the privacy-preserving K-Means algorithm, it is yet another reason why the K-Means algorithm should not be immediately discounted. Although privacy is not an issue when dealing with astronomical data, data transformations made to enforce privacy have proven to be adaptable to other tasks, especially in data pre-processing.


Berkhin [26] has conducted an in-depth review of the various hierarchical (i.e. agglomerative and divisive algorithms) and partition clustering (i.e. relocation, probabilistic, K-Means, K-Medoids and density-based algorithms) techniques, amongst others, and has also shown that the use of clustering algorithms comes with certain properties that require careful analysis and consideration for successful implementation. Some of these properties include attribute type, scalability, handling outliers, data order dependency, reliance on a priori knowledge and high-dimensional data. The review also shows that, with clustering algorithms being a key area of research, many of these techniques have been improved upon to successfully tackle these issues.

The application of data mining to large data sets in the field of astronomy is increasing in popularity. This is, in part, due to citizen science projects like Galaxy Zoo that are designed to reach out to both professional scientists and the general public alike. As a result of attempting to manually classify these continuously growing data sets that contain millions of objects, the use of computational classification and identification has become increasingly necessary. From a search of relevant literature, we identified uses of various machine learning algorithms for the purpose of classification, and also of artificial neural networks for making predictions, including solving crowdsourcing tasks, identifying features that lead to galaxy coalescences and distinguishing between different morphological classifications based on predictive models.

At present, it is an exciting time, with maturing hardware and software solutions handling Big Data problems, and astronomy is no exception. The results of the studies covered in this chapter indicate potential future research in this combined field. With accuracies of approximately 90%, depending on the algorithm and applied variables (e.g. data set integrity, confidence factor, binary splitting, pruning), there is no doubt that extended research into enhancing these methods (e.g. the use of a massively parallel computational environment such as the MapReduce framework [48]), their applications and the existing results is feasible [17, 87, 138]. The astronomical data mining landscape is one that is constantly and consistently growing. With sky surveys such as the Sloan Digital Sky Survey producing terabytes of data daily, it is not surprising to see researchers in the fields of data mining and astronomy collaborating more closely. Opportunities arising from this collaboration have recently been highlighted by Borne [29], with reference to how Big Data technologies can be exploited.


Chapter 4
Adopted Data Mining Methods

“This means that, whereas statistics has placed emphasis on modelling and inference, data mining has placed substantially more emphasis on algorithms and search.” by David J. Hand [63]

To conduct the research reported in this monograph, extensive analysis of the Galaxy Zoo and SDSS data sets and the various algorithms utilised is necessary in order to assess the needed requirements. The principal requirement, however, is to be able to successfully identify the actual morphologies of the galaxies labelled as Uncertain in the Galaxy Zoo data set. In this chapter, the adopted methodology will be analysed and shown to be the best fit for this project, together with a review of the K-Means algorithm and the entropy-based Information Gain feature selection technique, which are the methods chosen for clustering and for assessing the importance of the features, respectively. The innovative heuristic algorithm, required for obtaining the best attribute selection and developed through this project, will also be presented and discussed in detail, along with the pre- and post-processing methods that were utilised throughout the data mining process.

4.1 CRoss-Industry Standard Process for Data Mining (CRISP-DM)

The late 1980s/early 1990s saw the inception of the term Knowledge Discovery in Databases (KDD), which generated great interest and, eventually, led to the hurried development and design of efficient data mining algorithms capable of overcoming all the shortfalls of data analysis to produce new knowledge. It was only in the early 2000s that a new methodology, CRISP-DM, was published, eventually becoming the basic standard for data mining project management [113].

Fig. 4.1 The CRISP-DM Model

As shown in Figure 4.1, the CRISP-DM reference model [42] reflects the six project phases for data mining and the respective relationships between them:

• Business Understanding - The goal is to fully understand, from a business perspective, the objectives and requirements of the project, and to then define the data mining problem and design a plan to achieve those objectives.



• Data Understanding - This involves data acquisition and analysis. It is important to understand what the data is about, what features it may possess and what pre-processing may be required.

• Data Preparation - All pre-processing tasks such as attribute selection, cleaning and normalising are encompassed by this phase.

• Modelling - This phase has the potential of cycling back to data preparation, depending on the technique(s) selected based on the data mining problem type.

• Evaluation - By this phase of the project, the data has been thoroughly cleaned and analysed and models have been carefully designed. It is here that careful comparison of the models to the original requirements must be made to evaluate their correctness.

• Deployment - The complexity of this phase will vary, depending on the nature of the project and the requirements of the client(s). At times, the model or the new knowledge acquired may require presentation.

CRISP-DM is presently regarded not just as a project methodology, but also as a means of promoting data mining as an engineering process [127]. In light of this, extensive research and comparisons between it and standard software engineering models have been carried out in order to assess CRISP-DM's suitability and usefulness [160]. Marban et al. [114], for example, have concluded that while CRISP-DM, at present, lacks some of the software engineering processes in enough detail to support much larger, complex projects, it can still be considered an engineering standard. With additions and refinements made to the model, it can certainly be designed to meet all standards set forth in IEEE Std 1074 and ISO 12207.

In the following sections, we shall discuss the three data mining techniques we used in this research project. The rationale behind adopting those techniques is discussed in subsequent chapters, as it is related to the intermediate results achieved.

4.2 K-Means Algorithm

The K-Means algorithm is one of the most popular clustering techniques available, used extensively in both industrial and scientific applications for cluster analysis. Originally proposed back in 1956 by Hugo Steinhaus [140], a modified version of the algorithm was later published by Stuart Lloyd in 1982 [110] which, today, has become the default choice of tool for clustering. The fact that it is used less often in comparison to other algorithms in the field of astronomical data mining is of interest to this research.

The K-Means algorithm is known as a partitional or non-hierarchical clustering technique in which the aim is to partition n objects into k clusters, where each object belongs to the cluster with the closest mean [81]. This is an iterative, non-deterministic approach which, given an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$, starts with an assignment step in which each $x_j$ is allocated to exactly one cluster $S_i^{(t)}$:

$$S_i^{(t)} = \{x_j : \|x_j - m_i^{(t)}\| \le \|x_j - m_{i^*}^{(t)}\| \;\; \forall i^* = 1, \ldots, k\} \qquad (4.1)$$

This is followed by the calculation of the new means, each of which becomes the newly appointed centroid of its cluster:

$$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j \qquad (4.2)$$

The iteration of these two steps continues until convergence is achieved, at which point the assignments of the centroids no longer change. The number of iterations required to achieve convergence can vary greatly, which makes this algorithm potentially computationally intensive, particularly with extremely large data sets. The other issue that the K-Means algorithm presents is initialisation sensitivity: two different initialisations can, in fact, lead to significantly different results. However, a number of variants of this algorithm have been developed to address this and other problems, significantly improving its efficiency and effectiveness [13, 81, 89, 97]. Algorithm 1 shows the steps needed for the K-Means procedure.


Algorithm 1. K-Means Clustering Algorithm
Data: k: number of clusters
Data: D ∈ R^(n×m): data set
Data: maxIterate: maximum number of iterations
Randomly select k points in D;
Assign the k points to C: cluster centroids;
i ← 0;
repeat
    Assign each point in D to its closest c ∈ C;
    Calculate the mean value among all the m attributes for the points attracted by each c ∈ C;
    i ← i + 1;
until ∀c ∈ C did not move from the previous iteration OR i = maxIterate;
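For orientation, a minimal sketch of the same procedure in Python follows, using scikit-learn, which implements Lloyd's algorithm; the data is synthetic and the parameter values are illustrative, not those of the experiments reported later.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: n objects with m attributes, normalised to [0, 1].
rng = np.random.default_rng(0)
D = rng.random((3000, 23))

# k clusters with a cap on iterations, mirroring Algorithm 1.
km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0)
labels = km.fit_predict(D)       # cluster assignment for each object
centroids = km.cluster_centers_  # final cluster means
print(centroids.shape)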

4.3 Support Vector Machines

Support Vector Machines (SVMs) are successful supervised learning and prediction models, also referred to as non-probabilistic binary linear classifiers, that learn by example to assign one of two possible labels to given objects. The technique was first developed by Cortes and Vapnik [46] in 1995 and is now popularly used for regression analysis and classification. The popularity of SVMs can be attributed to their strong mathematical foundations and the several salient properties that they possess which are rarely seen in other techniques. Since their introduction and the focus on SVMs in Vapnik's various publications [46, 148], the last 15 years have seen SVMs gain momentum both in their adoption and in their research.

It can be said that, to be able to fully understand and grasp SVMs, one needs simply to understand four abstractions [36, 122]:

• The Separating Hyperplane - A hyperplane is the generalisation of a straight line to a high-dimensional space, and a separating hyperplane is one that can successfully separate two sets of points fully.

• The Maximum-Margin Hyperplane - Similar in principle to a regular hyperplane, the maximum-margin hyperplane is a unique hyperplane solution that separates the two sets of points fully, but adopts the maximal distance from any given expression profile. By defining the distances from the expression vectors to the separating hyperplane, the SVM adopts the maximum-margin hyperplane, thus increasing its accuracy in classification.

• The Soft Margin - A soft margin is an allowance for some of the anomalous expression profiles to remain misclassified, due to the fact that the data, as a whole, is originally non-linearly separable. This soft margin is user-specified so that a balance can be met between margin size and hyperplane violations.

• The Kernel Function - A kernel function projects data from a space that is lower in dimension to one that is higher in dimension, allowing, if selected and implemented efficiently, for complete separation of two sets of points that were previously inseparable. Figure 4.2 shows an example of the effect of a kernel function.

Fig. 4.2 The Effect of a Kernel Function

The main drawback of SVMs lies in their dependence on data set size, which causes their complexity to grow exponentially, making them less favourable for use in large-scale data mining or machine learning. If the number of features is greater than the number of available samples, there is a good chance of poor performance. The good news, however, is that this has created yet another area of active research towards improvements and enhancements to the methodology of SVMs [45, 49, 59]. One such improvement that is of interest to us, and that is also in line with the theme of this book, is the incorporation of hierarchical clustering with SVMs to overcome this very issue of handling large-scale data sets [162].

On the other hand, advantages of using SVMs can be summarised as follows:

• Maintains effectiveness even in cases where the number of dimensions is greater than the number of presented samples, and also in high-dimensional spaces.

• Memory efficient, as SVMs utilise a subset of training points referred to as support vectors.

• Versatile, as both common and custom kernels can be specified for the decision function as required.

Employing SVMs to solve linear problems can be defined and described mathematically as follows. Given a data set D, a set of n points where the class that $x_i$ belongs to is determined by $y_i$ taking one of two possible values ($y_i = -1$ or $y_i = 1$):

$$D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\; y_i \in \{-1, 1\}\}_{i=1}^{n} \qquad (4.3)$$

Each $x_i$ is a real vector with p dimensions. The objective is to determine the maximum-margin hyperplane that fully divides the set of points into the two classes ($y_i = -1$, $y_i = +1$).


Fig. 4.3 SVM Margins

Any given solution can be described as the set of points x that satisfies the following:

$$w \cdot x - b = 0 \qquad (4.4)$$

where w is the normal vector to the hyperplane and · refers to the dot product. If the data set is indeed linearly separable, it is possible to select two hyperplanes that intersect the data such that there are no points between them, and then attempt to maximise their distance. These two hyperplanes take the values -1 and 1 as follows:

$$w \cdot x - b = -1 \quad \text{and} \quad w \cdot x - b = 1 \qquad (4.5)$$

The distance between the two hyperplanes is therefore calculated as $\frac{2}{\|w\|}$, so $\|w\|$ is what needs to be minimised. In this book, when we utilise SVMs, we select sequential minimal optimisation to solve the quadratic programming (QP) optimisation problem.
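As a concrete illustration, a minimal sketch of training a soft-margin SVM in Python follows; scikit-learn's SVC wraps LIBSVM, which uses an SMO-type solver, and the synthetic data and parameter values are assumptions for demonstration only.

import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data with labels y ∈ {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C is the soft-margin hyperparameter; the RBF kernel plays the role
# of the kernel function described above.
clf = SVC(C=1.0, kernel="rbf").fit(X, y)
print(clf.support_vectors_.shape)  # support vectors retained by the solver
print(clf.predict([[0.5, 3.0]]))   # classify a new point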


4.3.1 Sequential Minimal Optimisation

Sequential Minimal Optimisation (SMO) was designed at Microsoft Research by John Platt in 1998 [128] and, today, is widely used in the training of SVMs. Essentially, SMO is an iterative algorithm that breaks the quadratic programming (QP) problem into a series of smaller, easier-to-solve sub-problems which are then solved analytically. The beauty of SMO is that it opts to solve the smallest possible optimisation problem at every step which, for a standard SVM QP, involves two Lagrange multipliers. After selecting two such multipliers and calculating the optimal values, the SVM is updated to reflect these new values.

According to Platt [128], the QP problem is defined as follows:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j K(x_i, x_j) \alpha_i \alpha_j \qquad (4.6)$$

such that:

$$0 \le \alpha_i \le C \;\; \text{for } i = 1, 2, \ldots, n, \qquad \sum_{i=1}^{n} y_i \alpha_i = 0 \qquad (4.7)$$

$y_i \in \{-1, +1\}$ is a binary label and $x_i$ is an input vector. $K(x_i, x_j)$ represents the kernel function and C refers to the hyperparameter of an SVM, both of which are user-defined. The variables $\alpha_i$ are Lagrange multipliers, designed for finding the local minima and maxima of a given function.

Once broken down into a series of smallest possible sub-problems, for any two of the Lagrange multipliers $\alpha_1$ and $\alpha_2$, the constraints are reduced as follows:

$$0 \le \alpha_1, \alpha_2 \le C, \qquad y_1\alpha_1 + y_2\alpha_2 = k \qquad (4.8)$$

The SMO algorithm then repeats the following steps iteratively until convergence is acquired:

1. Find a Lagrange multiplier α1 that violates the Karush-Kuhn-Tucker (KKT) conditions for the optimisation problem.

2. Pick a second Lagrange multiplier α2 and optimise the pair (α1,α2).

It is important to note that there are some heuristics used in the selection of the two αi variables.

The main advantage of SMO can be found in its analytical approach to the acquisition of the solution. While other algorithms scale at the very least cubically in the number of training patterns, Platt's SMO only scales quadratically. The breaking down of the problem into smaller problems means that the time taken to reach a solution for the QP problem is shortened significantly. Because of this breakdown, SMO also avoids the manipulation of large matrices, preventing the possibility of numerical precision problems. Additionally, the matrix storage required is minimal, such that even larger-scale SVM training problems for a moderately sized data set can fit inside the memory of a standard workstation or PC.

SMO has become so popular that improvements and modifications, such as the addition of parallelisation, fixed thresholds and improved regression training, have since been published [38, 101, 103, 163].

4.4 Random Forests

Originally conceived by Leo Breiman and Adele Cutler [34], the name Random Forests was coined from random decision forests, first proposed at Bell Labs in 1995 by Tin Kam Ho [78]. The term "forests" is named after natural forests of trees (see Figure 4.4).

Fig. 4.4 Random Forests is termed after the natural tree forests

Random Forests is what is referred to as an ensemble learning method used for classification problems. This is a method that utilises multiple models in order to acquire better predictive results, as opposed to a single stand-alone model. With Random Forests, an ensemble of decision trees is constructed, and the class that is the mode of all the classes generated by the individual trees is the output.


In Breiman's paper [34], Random Forests adds an additional layer of randomness to bagging: successive trees are each independently constructed using a bootstrap sample of the data set, and each node is split using the best among a subset of predictors randomly chosen at that node, the size of which is usually set to the square root of the total number of features ($\sqrt{F}$, where F is the total number of features in the data set).

Provided a set of training data as follows:

$$D_n = \{(X_i, Y_i)\}_{i=1}^{n} \qquad (4.9)$$

the weighted neighbourhood scheme [104] predicts a query point X as such:

$$Y = \sum_{i=1}^{n} W_i(X) Y_i \qquad (4.10)$$

The set of points $X_i$ where $W_i(X) > 0$ are referred to as the neighbours of X. Given a forest of M trees, we can therefore write the prediction of the m-th tree for X as follows:

$$T_m(X) = \sum_{i=1}^{n} W_{im}(X) Y_i \qquad (4.11)$$

where $W_{im} = 1/k_m$ if X and $X_i$ are in the same leaf in the m-th tree, $W_{im} = 0$ otherwise, and $k_m$ is the number of training samples which fall into the same leaf as X in the m-th tree.

As such, the prediction of the entire forest can be written in this way [104]:

$$F(X) = \frac{1}{M}\sum_{m=1}^{M} T_m(X) = \frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{n} W_{im}(X) Y_i = \sum_{i=1}^{n}\left(\frac{1}{M}\sum_{m=1}^{M} W_{im}(X)\right) Y_i \qquad (4.12)$$

The above equation shows that the prediction is, in fact, a weighted average of the various values of $Y_i$ with weights:

$$W_i(X) = \frac{1}{M}\sum_{m=1}^{M} W_{im}(X) \qquad (4.13)$$

The neighbours of X in this description are the points $X_i$ which fall into the same leaf as X in at least one tree. As such, the neighbourhood of X depends in a complex way on the structure of the individual trees and, through them, on the structure of the training set.

The Random Forests procedure is given in Algorithm 2. In order to make a prediction, a new sample traverses a tree and is assigned the label of the training samples in the leaf node it finally ends up in. This is repeated over all trees and the mode vote of all the trees is reported as the Random Forests prediction.


Algorithm 2. Random Forests Algorithm
Data: N: number of trees in the forest
Data: S: number of features to split on
Result: A vector of trees RF
Create an empty vector RF;
for i = 1 → N do
    Create an empty tree Ti;
    repeat
        Sample S out of all features F using bootstrap sampling;
        Create a vector of the S features FS;
        Find the best split feature B(FS);
        Create a new node using B(FS) in Ti;
    until no more instances to split on;
    Add Ti to RF;
end
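A minimal sketch of the same idea in Python using scikit-learn's RandomForestClassifier follows; setting max_features="sqrt" gives the √F candidate features per split described above, while the data and the remaining parameter values are synthetic, illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 500 objects, F = 16 features, two classes.
rng = np.random.default_rng(0)
X = rng.random((500, 16))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 100 trees on bootstrap samples, sqrt(F) candidate features per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)
print(rf.predict(X[:3]))  # mode vote across the trees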

4.5 Incremental Feature Selection (IFS) Algorithm

A novel heuristic algorithm for optimising attribute selection to maximise the accuracy of classes-to-clusters evaluation, which we have termed Incremental Feature Selection (IFS), has been developed through this research project, as shown in Algorithm 3.

Algorithm 3. Incremental Feature Selection
Input: attr: Array of all attributes
Output: bestAttributesArray: The best selection of attributes, arranged in order of information gain value
for i = 0 to attr.length do
    Calculate information gain and store in arrayOfInfoGain ← IG(attr[i]);
end
Sort arrayOfInfoGain in descending order;
Add class label to bestAttributesArray;
Add arrayOfInfoGain[0] to bestAttributesArray;
Cluster with attributes from bestAttributesArray and save accuracy as score;
for i = 1 to arrayOfInfoGain.length do
    Add arrayOfInfoGain[i] to bestAttributesArray;
    Cluster with attributes from bestAttributesArray and save accuracy as newScore;
    if newScore < score then
        Remove arrayOfInfoGain[i] from bestAttributesArray;
    end
    else
        score = newScore;
    end
end


A list of attributes is provided and their respective information gain values are calculated and arranged in descending order. The following equation defines the formula for calculating the information gain (IG) for an attribute, where A represents all attributes, O represents all objects in the data set, value(x, a) is a function that returns the value of $x \in O$ with regard to the attribute $a \in A$, and H refers to the entropy:

$$IG(O,a) = H(O) - \sum_{v \in values(a)} \frac{|\{x \in O \mid value(x,a) = v\}|}{|O|} \cdot H(\{x \in O \mid value(x,a) = v\}) \qquad (4.14)$$

After all the information gain values have been acquired, the 1st attribute, with the highest information gain level, together with the class label, is clustered and the accuracy is recorded. The 2nd attribute is then added in and re-clustering is performed. The accuracies are then compared. If the accuracy decreases after adding the 2nd attribute, it is removed and the 3rd attribute is then added in. If the accuracy increases or remains unchanged, the 2nd attribute remains and the 3rd is then added in. This process iterates heuristically until all attributes are processed, as shown in the sketch below. What is left at the end of the algorithm's run is the optimal combination of attributes providing the best possible accuracy for classes-to-clusters evaluation.
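The greedy loop itself is compact. The following Python sketch captures Algorithm 3 in outline; it assumes caller-supplied helper functions info_gain (computing equation (4.14)) and cluster_accuracy (running the clustering and returning the classes-to-clusters accuracy), both of which are hypothetical names, and it is not the WEKA knowledge-flow implementation used in this project.

def incremental_feature_selection(attrs, info_gain, cluster_accuracy):
    """Greedy IFS: keep an attribute only if accuracy does not drop."""
    # Rank attributes by information gain, highest first.
    ranked = sorted(attrs, key=info_gain, reverse=True)

    selected = [ranked[0]]              # start with the top-ranked attribute
    score = cluster_accuracy(selected)  # cluster and record the accuracy
    for attr in ranked[1:]:
        selected.append(attr)
        new_score = cluster_accuracy(selected)
        if new_score < score:           # accuracy decreased: discard it
            selected.pop()
        else:                           # improved or unchanged: keep it
            score = new_score
    return selected, score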

4.6 Pre- and Post-processing

It is an absolute truth that if you cluster flawed data, your output will be nothing short of flawed as well [95]. The data acquired from both Galaxy Zoo and the Sloan Digital Sky Survey is no exception. Issues that had to be addressed included invalid and missing values and the normalisation of all attribute data. Post-processing, in the case of this research, was highly iterative but not extensive.

4.6.1 Pre-processing

It is typically the pre-processing stage that takes the most time, approximately 80%, of any data mining project. However, getting the data as clean as possible is crucial to obtaining results which are as accurate as possible. In an attempt to decrease sparseness, the Galaxy Zoo Table 2 morphology class label, which was initially broken down into three columns (i.e. Uncertain, Spiral, Elliptical), with a 1 representing the derived classification and a 0 in the remaining two, was combined into one column labelled CLASS. After processing the centre point right ascension and centre point declination values for each individual galaxy, submitting numerous queries to the SDSS database and arriving at the initial data set, it was immediately noticed upon analysis that noise was present. A handful of attributes contained the value -9999 in approximately 70-80% of all their entries. A number of objects also possessed a similarly significant number of -9999 values for their attributes. These attributes and objects were eventually removed.


Large variances in the values of the different attributes were also observed. One object would possess a value of, say, -152.3161 while another would possess a value of 15.2885 for the same attribute. As a result, the final step involved normalisation, as shown below, where x represents the original value and $x_{new}$ represents the final normalised value, so that all values fit into the range [0.0, 1.0]:

$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (4.15)$$
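For reference, this final step is a one-liner per attribute. The sketch below applies the -9999 filtering described above followed by the min-max normalisation of equation (4.15); the values are illustrative.

import numpy as np

# Attribute matrix with one object still carrying the -9999 sentinel.
X = np.array([[-152.3161, 0.42],
              [15.2885, -9999.0],
              [3.7, 0.91]])

# Drop objects containing the sentinel, then min-max normalise each
# attribute (column) into [0.0, 1.0] as in equation (4.15).
X = X[~(X == -9999).any(axis=1)]
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X)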

4.6.2 Post-processing

Evaluation of all the resulting clustering accuracies, and modification of the clustering models and the attributes they contain, was a necessary and significant part of the whole process. This is also in accordance with the CRISP-DM model.

4.7 Summary

As stated in the introduction to this chapter, successful identification of galaxy morphologies is the key requirement of this research. In order to achieve this, our IFS algorithm will have to be deployed in a series of experimental models. The unsupervised classes-to-clusters evaluation tool will be utilised after applying the K-Means clustering algorithm, providing the required accuracy measurement for each of the experiments implemented through the use of XML-written knowledge-flow models in WEKA, which are detailed in Chapter 6.


Chapter 5
Research Methodology

“Now my method, though hard to practise, is easy to explain; and it is this. I propose to establish progressive stages of certainty.” by Francis Bacon (1561 - 1626)

The entire research methodological process, which was directed in accordance with the CRISP-DM model, is detailed in this chapter. It is noted that this process included an iterative re-designing of numerous clustering experiments based on new discoveries, which was necessary in order to enhance the resulting accuracies and solidify the direction of this research work.

Initial experimentation began extensively on the Galaxy Zoo Table 2 data set which, in hindsight, was ineffective, as what was crucially required for clustering was the set of actual morphology-identifying attributes for each galaxy (e.g. petrosian radius, isophotal major and minor axis), as opposed to their voting data from the Galaxy Zoo database. However, this is not to say that the entire time spent processing the Galaxy Zoo Table 2 data set was a vain attempt. The analysis has certainly provided a much more comprehensive understanding of the galaxies and the voting system behind their respective morphologies. It was also eventually determined, through this, that the important features required from the Galaxy Zoo Table 2 data set were the morphology class label and the centre point right ascension and centre point declination used to uniquely identify each galaxy within the Sloan Digital Sky Survey (SDSS) database.

5.1 Galaxy Zoo Table 2

The galaxy morphological classification voting data from the Galaxy Zoo Table 2 data set was obtained and 65,535 galaxies were analysed. It was observed that a significant majority of all galaxies, approximately 63%, had been classified as "Uncertain". The reason for this lies in the process by which the final votes for each galaxy are calculated and interpreted. After all votes are counted, a classification debiasing correction function is applied to all the scores. Each galaxy is then subjected to a threshold of 80%. This means that a galaxy will only be classified as Spiral or Elliptical if at least 80% of the final voting score leans towards that class. If not, the galaxy will be classified as Uncertain. As a result, some of the galaxies were found to have just short of 80% of their votes cast towards either Spiral or Elliptical but still ended up being classified as Uncertain because of this threshold. An advantage of such a high threshold, however, is that it provides absolute certainty and high confidence for those properly classified galaxies. Table 5.1 shows the final classification result, and Figure 5.1 shows a pie chart of the three categories of morphological classification in Galaxy Zoo Table 2.

Fig. 5.1 Pie Chart of Galaxy Zoo Table 2 Final Morphological Classifications

Table 5.1 Galaxy Zoo Table 2 Data Set: Final Morphological Classifications

Category     No. of Galaxies
Uncertain    41556
Spiral       17747
Elliptical   6232

After thoroughly pre-processing the data set, which involved removing all non-numerical attributes and attributes unrelated to voting (e.g. RA, DEC), various clustering experiments, shown in Table 5.2, were designed in which the value of k would vary and the galaxies labelled as Uncertain would be included or removed altogether.

The resulting accuracies were unfavourable, reaching a maximum of 51.9417% at best when using all three classes of galaxies, which indicated that the experiments were not successful. This shows the significance of the iterative nature of data science projects: it is usually the case that the initial model shows less than promising results, which leads to investigating the reasons behind these outcomes. In the rest of this chapter, we shall take the reader on our journey to finding the ground truth in this data set.


Table 5.2 Galaxy Zoo Table 2 Data Set: Various Clustering Experiments

Number of Galaxies Per Cluster            Value of k
Spiral    Elliptical    Uncertain
17747     6232          41556             3
17747     6232          41556             4
17747     6232          41556             5
17747     6232          -                 2
17747     6232          -                 3
-         -             41556             2
-         -             41556             3
-         -             41556             4


5.2 Data Mining the Galaxy Zoo Mergers

After the lack of success at clustering the Galaxy Zoo Table 2 data, it was decided that a complete re-design of all the experiment models was necessary, keeping in line with the CRISP-DM methodology. At this point of the research project, a thorough investigation of the literature was carried out to try and find the best method of obtaining the required data pertaining to the morphological features of these galaxies. One paper, entitled "Data Mining the Galaxy Zoo Mergers", provided the necessary direction that this project required. In it, Baehr et al. [17] produced a list of the top 10 attributes, from the SDSS, with the lowest Davies-Bouldin Index (DBI) values for use with decision tree classification and K-Means clustering. It was ascertained that the larger the DBI value an attribute possesses, the less useful it becomes for decision tree classification and clustering; attributes with high DBI values were deemed of little use in both applications.

Being inspired by their work, the same 10 attributes were acquired from the SDSS database and clustering experiments were designed over them. Table 5.3 shows the list of these attributes.

Table 5.3 The 10 Attributes with the Lowest DBI Values

Attribute      Description
isoAGrad u*z   Gradient of the isophotal major axis
petroRad u*z   Petrosian radius
texture u      Measurement of surface texture
isoA z*z       Isophotal major axis
lnLExp u       Log-likelihood of exponential profile fit (typical for a spiral galaxy)
lnLExp g       Log-likelihood of exponential profile fit (typical for a spiral galaxy)
isoA u*z       Isophotal major axis
isoB z*z       Isophotal minor axis
isoBGrad u*z   Gradient of the isophotal minor axis
isoAGrad z*z   Gradient of the isophotal major axis


Table 5.4 The Best Resulting Subset of the Original 10 Attributes

Attribute
isoA z*z
lnLExp g
isoAGrad u*z
isoB z*z

Accurate application of the morphological class labels to each of the galaxies in the data set before clustering was achieved by reference to each galaxy's centre point right ascension and centre point declination. These produced, in the SDSS database query, the object ID for each galaxy, which was then matched to the object ID in the Galaxy Zoo Table 2 data set to obtain the correct label (i.e. Spiral, Elliptical, Uncertain). The K-Means clustering algorithm was applied to the full sample of the data set of 3000 galaxies using classes-to-clusters evaluation with the value of k set to 3, as sketched below. This process was then repeated iteratively using various subsets of the 10 attributes. The best resulting subset, as shown in Table 5.4, contained 4 attributes. It is observed that, out of all five SDSS filter wavebands (i.e. u, g, r, i, z), the majority of the attributes in this subset derive from the z waveband.
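Classes-to-clusters evaluation maps each cluster to a class and scores the agreement. The sketch below uses a simple majority-class mapping for illustration; WEKA's implementation, which was used in this project, searches for the best cluster-to-class assignment, so the two may differ in detail, and the toy data is synthetic.

import numpy as np
from sklearn.cluster import KMeans

def classes_to_clusters_accuracy(X, y, k):
    """Cluster X, map each cluster to its majority class, score agreement."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    correct = 0
    for c in range(k):
        members = y[labels == c]
        if members.size:
            _, counts = np.unique(members, return_counts=True)
            correct += counts.max()  # objects matching the majority class
    return correct / len(y)

# Toy demo: three well-separated groups with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (50, 4)) for c in (0, 2, 4)])
y = np.repeat(np.array(["Elliptical", "Spiral", "Uncertain"]), 50)
print(classes_to_clusters_accuracy(X, y, k=3))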

The resulting accuracies showed no particularly significant improvement. It was originally thought that the reason for the low accuracies was that the majority of the galaxies had been labelled as Uncertain. An alternative clustering attempt, in which 1000 of the 1763 galaxies labelled as Uncertain were removed to stratify the data set, was carried out but proved ineffective, as it showed no accuracy increase whatsoever. In fact, the accuracy dropped even further.

5.3 Extensive SDSS Data Analysis

As the objective of this research is to provide astronomers with an effective tool for accurate galaxy morphology analysis and identification, using the newly acquired knowledge of the SDSS database together with the Galaxy Zoo data set, it was decided that an even more comprehensive analysis, set on a larger scale and involving more attributes, would be carried out with experiments designed accordingly. Our IFS algorithm, discussed in Chapter 4, was also implemented in the subsequent experiments.

It was observed that all 10 attributes acquired in the work of Baehr et al. [17]came from a single table in the SDSS database called PhotoObjAll. This providedstrong indication as to the possibility that this particular table contained most, if notall, of the morphology-identifying attributes of the galaxies. As such, a total of 135attributes, all originating from the PhotoObjAll table, were acquired for 3000 galax-ies. Pre-processing was carried out and the IFS algorithm was applied accordingly.The final data set contained 2987 galaxies and 36 attributes, inclusive of the CLASSlabel. 2000 additional galaxies were also processed, by converting their centre pointright ascension and centre point declination values to the appropriate format for


Table 5.5 summarises these experimental outcomes.

Table 5.5 The Various Experiments Carried Out Utilizing the IFS Algorithm

                  Before Applying the IFS Algorithm      After Applying the IFS Algorithm
No. of Galaxies   No. of Attributes                      No. of Galaxies   No. of Attributes
3000              135                                    2987              36
4000              135                                    3985              28
5000              135                                    4979              23

Table 5.6 lists the 23 attributes (excluding the CLASS label) and their respective information gain levels for the best attribute selection for 5000 galaxies.
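For reference, the information gain used to rank the attributes is the standard entropy-based measure:

\[
IG(C, A) = H(C) - H(C \mid A), \qquad H(C) = -\sum_{c} p(c) \log_2 p(c),
\]

where C is the class variable and A a candidate attribute; an attribute with a higher information gain removes more of the uncertainty about the class.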

The accuracies acquired from these experiments were considerably and consistently higher than in any of the previous experiments, indicating that the IFS algorithm we developed is effective. It is observed, consistently across all three instances, that fewer than half the attributes are retained after applying the IFS algorithm. With more galaxies processed, further accuracy increases are plausible.

5.3.1 Isolating and Re-Clustering Galaxies Labelled as Uncertain

Upon this success, it was determined that the next step would be to actively attempt to identify the morphologies of those galaxies labelled as Uncertain, which is the main goal of this project. Another series of experiments was designed, this time focusing on the Uncertain galaxies from the full data set by actively re-labelling and re-clustering the full set. It was reasoned that the Uncertain galaxies would, logically, possess either an Elliptical or a Spiral label.

The idea was to first remove all galaxies labelled as Uncertain from the data set and cluster the remaining galaxies using K-Means with the value of k set to 2. K-Means is then applied strictly to those galaxies labelled as Uncertain, with k also set to 2. This time, however, instead of using classes-to-clusters evaluation, the training set is used and the cluster labels, cluster0 and cluster1, are held. Once this is completed, these two clusters are reintroduced into the original data set with the remaining galaxies, and the entire data set is clustered with k set to 4. The objective is to observe how closely the clusters are placed to one another. A sketch of this procedure appears below.
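The following R sketch walks through the same pipeline under the assumption of a single data frame with a CLASS column; all file and variable names are illustrative, not the authors' original code.

    # Sketch of the Uncertain re-clustering pipeline (illustrative names).
    full      <- read.csv("galaxies5000.csv")        # assumed input file
    known     <- subset(full, CLASS != "Uncertain")
    uncertain <- subset(full, CLASS == "Uncertain")
    feat <- function(d) d[, setdiff(names(d), "CLASS")]

    kmeans(feat(known), centers = 2)                 # Spiral/Elliptical side, k = 2
    km.unc <- kmeans(feat(uncertain), centers = 2)   # split the Uncertain set, k = 2

    # Hold the cluster assignments of the Uncertain galaxies as labels ...
    uncertain$CLASS <- paste0("cluster", km.unc$cluster - 1)

    # ... then reintroduce them and cluster the whole set with k = 4.
    combined <- rbind(known, uncertain)
    km.all <- kmeans(feat(combined), centers = 4)
    table(cluster = km.all$cluster, class = combined$CLASS)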

The proximity between the four clusters was found to be minute, so the two Uncertain clusters were re-labelled with the various combinations of the Spiral and Elliptical labels in order to determine their morphologies. Clustering on each of these re-labelled data sets was conducted with the value of k set to 2.
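A hedged sketch of this re-labelling loop is shown below. Since K-Means itself ignores the class labels, the clustering is computed once and only the classes-to-clusters mapping changes between combinations; all names are illustrative.

    # Iterate over the four Spiral/Elliptical re-labelling combinations.
    combined <- read.csv("combined_with_cluster_labels.csv")  # assumed input
    features <- combined[, setdiff(names(combined), "CLASS")]
    km <- kmeans(features, centers = 2)              # labels play no role here

    combos <- list(c("Spiral", "Elliptical"), c("Elliptical", "Spiral"),
                   c("Spiral", "Spiral"),     c("Elliptical", "Elliptical"))
    for (m in combos) {
      lab <- combined$CLASS
      lab[lab == "cluster0"] <- m[1]
      lab[lab == "cluster1"] <- m[2]
      tab <- table(km$cluster, lab)                  # classes-to-clusters table
      cat("cluster0 =", m[1], "/ cluster1 =", m[2], ":",
          round(100 * sum(apply(tab, 1, max)) / nrow(combined), 4), "%\n")
    }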


Table 5.6 The Best Combination of Attributes with Respective Information Gain Levels

Attribute    Information Gain    Attribute       Information Gain
expRad g     0.2207              isoAGrad r      0.0775
expRad r     0.1965              lnLDeV z        0.0716
expRad i     0.1831              texture g       0.0706
lnLDeV g     0.1367              isoPhiGrad g    0.0639
lnLDeV r     0.1275              texture r       0.0522
isoB i       0.1206              lnLDeV u        0.0428
isoB r       0.1154              texture i       0.0367
lnLExp r     0.1002              isoPhiGrad i    0.03
lnLExp i     0.0986              texture u       0.0153
isoBGrad g   0.092               isoColcGrad r   0.0115
petroRad u   0.0834
lnLExp z     0.0822

5.3.2 Extended Experimentation

The re-labelling of the cluster0 and cluster1 clusters and the subsequent re-clustering of the full data sets provided the breakthrough in results. In order to verify that these results were consistent, the same data sets were subjected to the Random Forest [34] and Support Vector Machines (SVM) [46] algorithms. With Random Forest, the number of trees used was 100. With SVM, the Sequential Minimal Optimisation (SMO) algorithm, developed by John Platt for efficient optimisation solving [128], was utilised. Both algorithms were introduced in Chapter 4 of this book; the rationale behind their adoption was the notable success both have exhibited in a number of real-world applications.
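In R, the same pair of algorithms can be reached through RWeka, which wraps the WEKA implementations; the snippet below is a sketch under assumed file and column names, and the -I option sets the number of trees in the WEKA versions current at the time of writing.

    library(RWeka)  # provides SMO and make_Weka_classifier

    # WEKA's RandomForest has no ready-made RWeka wrapper, so register one.
    RandomForest <- make_Weka_classifier("weka/classifiers/trees/RandomForest")

    d <- read.csv("relabelled_set.csv")              # assumed input file
    d$CLASS <- as.factor(d$CLASS)

    rf  <- RandomForest(CLASS ~ ., data = d, control = Weka_control(I = 100))
    svm <- SMO(CLASS ~ ., data = d)

    evaluate_Weka_classifier(rf,  numFolds = 10)     # 10-fold cross-validation
    evaluate_Weka_classifier(svm, numFolds = 10)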


Chapter 6
Development of Data Mining Models

“That all our knowledge begins with experience there can be no doubt.” by Immanuel Kant (1724–1804)

This chapter showcases the implementations of the various experiments carried out in the methodology, in order to meet the requirements of accuracy. The data mining tools utilised are discussed, along with any issues that arose during the implementation process. Samples of the various written code, MySQL queries and the designed knowledge-flow models are all presented here.

6.1 Waikato Environment for Knowledge Analysis (WEKA)

WEKA, originally developed in 1993 and first published in 1994 [79], is a unified workbench, written in Java and developed at the University of Waikato, New Zealand, that incorporates numerous machine learning techniques and algorithms for various data mining tasks including pre-processing, clustering, classification, association rules and regression (see Figure 6.1 for WEKA's famous logo). It was only in 2006 that the first public release of WEKA was seen. In 2009, Hall et al. [70] announced that WEKA had undergone a major update, with the entire software having been re-written and various new features added. As an interesting side note, the project of Celis et al. [39] is one of many that have effectively extended WEKA, in this particular case in the area of distributed data mining, providing a distributed cross-validation facility.

Fig. 6.1 Official WEKA Logo



WEKA currently provides (but is not limited to) the following features:

• 49 different tools for data pre-processing
• 76 algorithms for classification
• 8 different algorithms for clustering
• 15 attribute evaluators for feature selection
• 3 association-rule discovery algorithms

WEKA accepts data sets in both CSV and its native Attribute Relation File Format (ARFF). The ARFF structure describes all attributes in the header portion of the file and lists all data samples in the data portion. Anything preceded by a % symbol is considered a comment and is disregarded by WEKA [125]. Figure 6.2 shows the structure of an ARFF file for the classical Iris data set.
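As an abbreviated sketch of the kind of file shown in Figure 6.2, an ARFF rendering of the Iris data set looks as follows (only one sample row per class is reproduced here):

    % Iris data set in ARFF format (abbreviated)
    @relation iris

    @attribute sepallength numeric
    @attribute sepalwidth  numeric
    @attribute petallength numeric
    @attribute petalwidth  numeric
    @attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

    @data
    5.1,3.5,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor
    6.3,3.3,6.0,2.5,Iris-virginica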

Fig. 6.2 The IRIS Data Set ARFF Structure

It is worth mentioning that the current WEKA implementation accepts many other data formats, and can also connect to relational databases.

Among the features that make WEKA advantageous are that it is openly available for free under the GNU General Public License, it is extremely portable as it is implemented completely in Java, and it possesses an easy-to-use GUI that prevents users from becoming overwhelmed.


6.1.1 WEKA Implementations

The output generated by WEKA from the various experiments conducted in this project is explored in this section. To keep this section of the book from becoming excessive, only the significant samples from the initial experiments are explored here, along with the important final results obtained from the design and implementation of the knowledge-flow models for the SDSS data.

6.1.2 Initial Experimentation on Galaxy Zoo Table 2 Data Set

After acquiring the Galaxy Zoo Table 2 data set, the initial experiments in WEKA were carried out using various values of k and with the galaxies labelled as Uncertain isolated from the rest of the data set. Figures 6.3, 6.4 and 6.5 present the WEKA outputs from running all 65,535 galaxies with all three classes and the values of k set to 3, 4 and 5 respectively, all using classes-to-clusters evaluation.

6.1.3 Experiments with Data Mining the Galaxy Zoo Mergers Attributes

After acquiring the 10 attributes listed by Baehr et al. [17] as possessing the lowest DBI values, repeated clustering, as described in Chapter 5, showed the best result to have come from the use of only 4 of the 10 attributes. Figure 6.6 presents the WEKA output for this experiment.

6.1.4 Further Experimentation on the SDSS Data

After acquiring a further 135 attributes from the SDSS database, additional galaxies were also processed. K-Means clustering, after pre-processing and applying the IFS algorithm, was applied to 3000, 4000 and 5000 galaxies using classes-to-clusters evaluation. Figure 6.7 lists the best attribute selection for 3000 galaxies, while Figure 6.8 shows the generated WEKA output of this experiment.

Figure 6.9 details the best attribute selection for 4000 galaxies, while Figure 6.10 shows the generated WEKA output of this respective experiment.

Finally, Figure 6.11 showcases the best attribute selection, this time for 5000 galaxies, while Figure 6.12 shows the generated WEKA output of this respective experiment.

6.1.5 Uncertain Galaxy Re-Labelling and Re-Clustering

The next phase of this project involved clustering the galaxies labelled as Uncertain into two separate clusters, saving their labels, and reintroducing them back into the original data set to then cluster the entire set with the value of k set to 4. Figure 6.13 shows the resulting WEKA output of this experiment.


Fig. 6.3 WEKA Output for Galaxy Zoo Table 2 with k=3


The final phase involved re-labelling the two Uncertain clusters, labelled cluster0 and cluster1 by default, using all possible unique combinations of the Spiral and Elliptical labels to assess the probability of them possessing either spiral or elliptical morphology. All clustering experiments that follow have the value of k set to 2. Figure 6.14 shows the resulting output from WEKA when cluster0 is re-labelled as Spiral and cluster1 as Elliptical.

Figure 6.15 shows the resulting output from WEKA when cluster0 is re-labelled as Elliptical and cluster1 as Spiral.

Figure 6.16 shows the best resulting output from WEKA, obtained when cluster0 and cluster1 are both re-labelled as Spiral.

Lastly, Figure 6.17 displays the resulting output from WEKA when cluster0 and cluster1 are both re-labelled as Elliptical.


Fig. 6.4 WEKA Output for Galaxy Zoo Table 2 with k=4


Fig. 6.5 WEKA Output for Galaxy Zoo Table 2 with k=5


Fig. 6.6 Best WEKA Output for the Lowest-DBI Attributes with k=3


Fig. 6.7 Best Attribute Selection for 3000 Galaxies


Fig. 6.8 WEKA Output for 3000 Galaxies with k=3


Fig. 6.9 Best Attribute Selection for 4000 Galaxies


Fig. 6.10 WEKA Output for 4000 Galaxies with k=3


Fig. 6.11 Best Attribute Selection for 5000 Galaxies


Fig. 6.12 WEKA Output for 5000 Galaxies with k=3


Fig. 6.13 WEKA Output with k=4 after Splitting Uncertain Galaxies into 2 Clusters


Fig. 6.14 WEKA Output with k=2 when cluster0=Spiral and cluster1=Elliptical


Fig. 6.15 WEKA Output with k=2 when cluster0=Elliptical and cluster1=Spiral


Fig. 6.16 WEKA Output with k=2 when cluster0=Spiral and cluster1=Spiral


Fig. 6.17 WEKA Output with k=2 when cluster0=Elliptical and cluster1=Elliptical


6.1.6 Random Forest and SMO Experimentation

This section showcases the WEKA implementations of the extended experiments conducted after arriving at the highest accuracy, achieved when both cluster0 and cluster1 are re-labelled as Spiral.

Figures 6.18 and 6.19 show the resulting WEKA outputs when Random Forest and SMO are applied, respectively, to the set in which cluster0 is re-labelled as Spiral and cluster1 as Elliptical.

Fig. 6.18 WEKA Output with Random Forest (numTrees=100) Applied to the Set with cluster0=Spiral and cluster1=Elliptical

Figures 6.20 and 6.21 show the resulting WEKA outputs when Random Forest and SMO are applied, respectively, to the set in which cluster0 is re-labelled as Elliptical and cluster1 as Spiral.


Fig. 6.19 WEKA Output with SMO Applied to the Set with cluster0=Spiral and cluster1=Elliptical


Fig. 6.20 WEKA Output with Random Forest (numTrees=100) Applied to the Set with cluster0=Elliptical and cluster1=Spiral


Fig. 6.21 WEKA Output with SMO Applied to the Set with cluster0=Elliptical and cluster1=Spiral


Figures 6.22 and 6.23 show the best resulting WEKA outputs when Random Forest and SMO are applied, respectively, to the set in which both cluster0 and cluster1 are re-labelled as Spiral.

Fig. 6.22 WEKA Output with Random Forest (numTrees=100) Applied to the Set with cluster0=Spiral and cluster1=Spiral

Figures 6.24 and 6.25 show the resulting WEKA outputs when Random Forest and SMO are applied, respectively, to the set in which both cluster0 and cluster1 are re-labelled as Elliptical.


Fig. 6.23 WEKA Output with SMO Applied to the Set with cluster0=Spiral and cluster1=Spiral


Fig. 6.24 WEKA Output with Random Forest (numTrees=100) Applied to the Set with cluster0=Elliptical and cluster1=Elliptical


Fig. 6.25 WEKA Output with SMO Applied to the Set with cluster0=Elliptical and cluster1=Elliptical


Fig. 6.26 R Language Logo

6.2 R Language and RStudio

As previously shown, the WEKA Explorer was used to conduct the experiments described in Chapter 5. However, for flexibility of implementation, this section shows the readers how the same processes can be implemented in another statistical development language, namely R.

RStudio, which is written in C++, is another tool utilised in this project that is openly available for free. It is an Integrated Development Environment specially designed for statistical computing based upon the language R. R was initially conceived by Ross Ihaka and Robert Gentleman of the University of Auckland, inspired by the S environment developed by John Chambers at Bell Laboratories; as a result, the two are very similar. The focus of R is mainly on statistical and graphical techniques; however, a vast number of packages have since been developed, through the Comprehensive R Archive Network (CRAN) hosted by Vienna University's Institute for Statistics and Mathematics [80], extending the usefulness of R [23]. The language has a distinct logo, shown in Figure 6.26.

In this book, the main packages utilised are RWeka, RWekajars and rJava. They are linked together such that RWekajars is utilised by RWeka, and both require rJava to run.

6.2.1 RStudio Implementation

There are two main bodies of code used to execute all the RStudio implementations of the experiments. The code used to implement the K-Means function (the kmeans() function from R's base stats package) can be seen in Figure 6.27.

The string nameOfDataSet.csv, which selects the data set to be imported for analysis, and the value of k, which lies in the line kmeans.result <- kmeans(gzTable2, 3) and is currently set to 3, are the two variables that change most frequently. The code takes a data set, stores it in the variable gzTable, creates a duplicate gzTable2 and strips off its CLASS label for the purpose of clustering. After applying the K-Means algorithm to gzTable2, the generated clusters are compared to the CLASS label from gzTable and the results are generated.


Fig. 6.27 R Code Used to Implement the K-Means Function

Fig. 6.28 R Code Implementation on Galaxy Zoo Table 2 Data Set (k=3)

Fig. 6.29 An Attempt at Plotting a Graph Using NVOTE and P_EL

An example of this implementation can be seen in Figure 6.28, where the entire Galaxy Zoo Table 2 data set was clustered with the value of k set to 3.
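Since the figure itself cannot be reproduced here, the following is a plausible reconstruction of that code from the description above; only the variable names and the kmeans() call quoted in the text are taken from the source.

    # Reconstruction of the K-Means script described in the text.
    gzTable  <- read.csv("nameOfDataSet.csv")   # data set selected by file name
    gzTable2 <- gzTable
    gzTable2$CLASS <- NULL                      # strip the label before clustering

    kmeans.result <- kmeans(gzTable2, 3)        # the value of k is changed here

    # Compare the generated clusters against the held-out CLASS labels.
    table(kmeans.result$cluster, gzTable$CLASS)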


The function plot() was used at the start but was not especially effective, mainly because the data is multidimensional. An example of such an attempt, made when clustering the original Galaxy Zoo Table 2 data set using the attributes NVOTE and P_EL, can be seen in Figure 6.29.

The code used to implement the Simple K-Means function, provided by the RWeka package, can be seen in Figure 6.30.

Fig. 6.30 R Code Used to Implement the Simple K-Means Function

This is very similar to the K-Means implementation: a duplicate data set is created and used for clustering, and the CLASS label from the original set is compared with the clustered data. Figure 6.31 shows an example of the output of this implementation from a clustering attempt on the entire Galaxy Zoo Table 2 data set where only the galaxies labelled Elliptical and Spiral were utilised.
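A corresponding sketch using RWeka's SimpleKMeans wrapper might read as follows (the file name is assumed; the -N option sets the number of clusters):

    library(RWeka)
    gzTable  <- read.csv("gzTable2_spiral_elliptical.csv")         # assumed input
    gzTable2 <- gzTable
    gzTable2$CLASS <- NULL

    skm <- SimpleKMeans(gzTable2, control = Weka_control(N = 2))   # k = 2
    table(skm$class_ids, gzTable$CLASS)   # compare clusters with true labels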

Fig. 6.31 R Code Implementation on Galaxy Zoo Table 2 Data Set without Galaxies Labelled Uncertain (k=2)

We observe in the output generated by RStudio that cluster 0 contains only galaxies from the Spiral class, while cluster 1 contains a mixture of galaxies from both the Elliptical and Spiral classes.


6.3 MySQL Database Queries

Once it was realised that the required morphology-identifying attributes were obtainable from the SDSS database, a series of MySQL queries had to be written and submitted to this effect. Figure 6.32 shows the relational algebraic query submitted in order to obtain data for the 10 attributes from the study of Baehr et al. [17].

Fig. 6.32 Relational Algebraic Query to Obtain the 10 Attributes with Lowest DBI Values

After the unsuccessful clustering attempts on these 10 attributes, another database query was submitted to obtain additional attributes. Figure 6.33 displays the lengthy relational algebraic query used to obtain the 135 attributes used for the final phases of the experiments.

These queries had to be submitted through the SDSS website via a form that also attaches the list of centre point right ascension and centre point declination values for all galaxies being queried. Galaxy Zoo provides these values in sexagesimal format (i.e. hh:mm:ss.s, dd:mm:ss.s), but they had to be manually converted into J2000 decimal degree format for submission as part of the MySQL query.
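The conversion itself is the standard sexagesimal-to-decimal-degree calculation; the helpers below are illustrative, not the project's original code.

    # Right ascension: hours to degrees (1 hour = 15 degrees).
    ra_to_deg <- function(h, m, s) 15 * (h + m / 60 + s / 3600)

    # Declination: the sign of the degrees field applies to the whole value
    # (a "-0" degrees field would need extra care, omitted here for brevity).
    dec_to_deg <- function(d, m, s) ifelse(d < 0, -1, 1) * (abs(d) + m / 60 + s / 3600)

    ra_to_deg(12, 30, 49.4)    # 187.7058 degrees
    dec_to_deg(12, 23, 28.0)   # 12.3911 degrees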

6.4 Development of Knowledge-Flow Models

One of the main issues in data science projects is the accessibility of the produced models. The outcomes of such projects are usually only reported, even when the underlying data is openly accessible, as with open data like Galaxy Zoo. We argue that it is important to make these models themselves accessible for re-usability purposes. Hence, we used WEKA's Knowledge-Flow module.

Knowledge-flow models are extremely useful in data mining, especially when the question of which techniques provide the best results beckons. Such a question cannot be answered a priori, so by building these models, running different algorithms simultaneously and on multiple data sets becomes a much more straightforward task.


Fig. 6.33 Relational Algebraic Query Used to Obtain the 135 Attributes from the PhotoObjAll Table

Two main knowledge-flow models have been designed and developed as per the requirements; they are shown and discussed in detail in this section.

The first model assists with comparing the clustering results of 3000, 4000 and 5000 galaxies after applying the IFS algorithm, in order to get the best attribute selection. This can be viewed in Figure 6.34.


Fig. 6.34 The Knowledge-Flow Model Designed for Cluster Accuracy Comparison

An Attribute Selection component is attached between each of the data sets and their respective Class Assigner components, along with a TextViewer attached to the Attribute Selection. This allows the attributes to be viewed in descending order of their information gain values, as set by the Attribute Selection component, which is configured to use the information gain attribute evaluator in WEKA. As this model is designed for comparing the K-Means accuracies after applying the IFS algorithm, SimpleKMeans is utilised along with the ClustererPerformanceEvaluator for the purpose of classes-to-clusters evaluation. The value of k in all the SimpleKMeans components is set to 3.

The second model, seen in Figure 6.35, compares not only four different data sets but also three different algorithms (i.e. K-Means, Random Forest, SMO). The four data sets correspond to the four different re-labelling combinations of the galaxies originally labelled as Uncertain. This model was designed for the final phase of the project: identifying the actual morphologies of those galaxies originally labelled as Uncertain.

All four data sets are attached to SimpleKMeans, RandomForest and SMO components. Each data set is also run through a CrossValidation component with the number of folds set to 10 and then split into a trainingSet and a testSet, which are both sent into the RandomForest and SMO components. The value of numTrees in all the RandomForest components is set to 100, and the value of k in all the SimpleKMeans components is set to 2. All outputs are sent to a centralised TextViewer component that lists all the final results for convenient comparison.


Fig. 6.35 The Knowledge-Flow Model Designed to Compare K-Means, Random Forest and SMO

6.5 Summary

This chapter provided the readers with recipes for developing data mining models using a variety of tools, each of which offers advantages over the others. The WEKA Explorer is an easy-to-use graphical user interface providing interactivity and flexibility. The WEKA Knowledge-Flow module, on the other hand, automates a workflow of data mining processes with persistent storage. Finally, R provides the flexibility of a programming language to manipulate the data as well as the intermediate results.


Chapter 7
Experimentation Results

“The test of all knowledge is experiment. Experiment is the sole judge of scientific truth.” by Richard Phillips Feynman (1918–1988)

In order to meet the requirements discussed in the previous chapters of this book, extensive evaluation of the results of the implementations detailed in Chapter 6 was carried out. The initial experiments on the Galaxy Zoo Table 2 data set and on the 10 attributes obtained from the work of Baehr et al. [17] were eventually deemed not to have met these requirements. However, our IFS algorithm, developed at the start and deployed to the larger data sets acquired from the SDSS database during the later phases of the project, produced a significant increase over the original accuracies, making it a successful implementation that fulfilled the requirements. The re-labelling and iterative clustering of the galaxies originally labelled as Uncertain was also a remarkable success. These results were reinforced by further experimentation with the Random Forest and Sequential Minimal Optimisation (SMO) algorithms.

7.1 Galaxy Zoo Table 2 Clustering Results

Initial clustering experiments were carried out on the processed Galaxy Zoo Table 2 data set, as shown in Table 7.1. These initially appeared promising and encouraging. However, it was eventually realised that K-Means clustering of the Galaxy Zoo Table 2 data itself was irrelevant to the objectives of this research.

When the galaxies labelled Uncertain are isolated from the rest of the data set and the remainder is clustered with the value of k set to 2, the resulting accuracy of 96.9724% is, indeed, exceptional. However, as mentioned previously, clustering the Galaxy Zoo Table 2 data set itself was eventually realised to be irrelevant, because the attributes contained in the set do not pertain to any of the morphological features of the galaxies; instead, they record the voting data of the participants in the Galaxy Zoo project.



Table 7.1 Galaxy Zoo Table 2 Data Set: Clustering Results

Number of Galaxies Per Cluster            Value of k   Accuracy (%)
Spiral   Elliptical   Uncertain
17747    6232         41556               3            42.9541
17747    6232         41556               4            48.2948
17747    6232         41556               5            51.9417
17747    6232         -                   2            96.9724
17747    6232         -                   3            73.4559
-        -            41556               2            61.594
-        -            41556               3            43.842
-        -            41556               4            42.9445

On a positive note, however, the entire process did provide information on the various individual morphological classes. It was also only through these experiments that the importance of each galaxy's centre point right ascension and centre point declination values, as a means of unique identification, was discovered.

7.2 Clustering Results of Lowest DBI Attributes

After first acquiring 1500 galaxies, clustering using all 10 attributes with the lowest DBI values was carried out with the value of k set to 3. This was repeated after acquiring the additional attributes lnLDeV u and lnLDeV g. These two attributes refer to the log-likelihood of the De Vaucouleurs profile fit, which is typical of elliptical galaxies, and were therefore perceived to be useful additions to the original 10 attributes. Subsequently, an additional 1000 galaxies were added to the data set, totalling 2500, and the experiments repeated for comparison. The galaxies labelled Uncertain were also removed, and the galaxies labelled Uncertain and those labelled Spiral or Elliptical were clustered separately. Table 7.2 displays these preliminary results.

It was observed that the two additional attributes, when compared with the clustering results before they were added, made no discernible difference. Full data set clustering up to this point in the project remained weak, with accuracies not exceeding 52%.

Table 7.2 The Initial Clustering Results of all 10 Attributes

No. of Galaxies                 No. of Attributes   Accuracy (%)
1500 (Full Set)                 10                  46.1333
1500 (Full Set)                 12                  46.1333
2500 (Full Set)                 10                  48.92
2500 (Full Set)                 12                  48.92
1464 (Uncertain Only)           10                  72.8825
1036 (Spiral/Elliptical Only)   10                  67.9537


The results of the experiments carried out on the various subsets of the 10 attributes with 3000 galaxies can be viewed in Table 7.3. The objective here was to determine whether there was an optimum number or subset of attributes that would provide enhanced clustering accuracies.

Table 7.3 The Iterative Clustering Results of Various Subsets of the 10 Attributes

No. of Attributes   Accuracy (%)   Within Cluster Sum of Squared Errors
1                   50.8           0.31097208092561296
2                   54.2           73.22356287236981
3                   54.2           3.213059611539209
4                   54.2           56.57163126388849
4                   49.5333        3.213059611539209
5                   49.4           5.1038948660063035
10                  45.8           186.893316665896

It was observed that the accuracy peaked at 54.2% and was consistently lower after the 4th attempt.

Despite the relatively low accuracies obtained from the various experiments designed and carried out by this point, it was ascertained that these attributes from the SDSS database were relevant morphology-identifying features. The main problem lay in the attribute selection process; identifying the best selection of attributes is crucial to maximising success in K-Means clustering.

7.3 Extensive SDSS Analysis Results

After acquiring all 135 attributes from the PhotoObjAll table of the SDSS database and pre-processing the data set, clustering was applied to 3000, 4000 and 5000 galaxies with the value of k set to 3. Table 7.4 displays the results from the same experiments carried out both before and after applying the IFS algorithm.

It is from these results that the IFS algorithm developed and implemented here was deemed successful. Two interesting observations were immediately made. Firstly, in all three instances, utilisation of the IFS algorithm provided a consistent increase in clustering accuracy of approximately 15-20%. Secondly, after applying the IFS algorithm, fewer than half the attributes are kept.

Table 7.4 Results of the IFS Algorithm Implementation

                  Before Applying the IFS Algorithm        After Applying the IFS Algorithm
No. of Galaxies   No. of Attributes   Accuracy (%)         No. of Attributes   Accuracy (%)
3000              115                 46.2487              36                  63.2072
4000              115                 45.872               28                  62.7604
5000              115                 45.7923              23                  65.6156


This was the first major breakthrough in the project, and it led to the subsequent re-labelling and re-clustering of the galaxies labelled as Uncertain in an attempt to successfully identify their morphologies.

7.4 Results of Uncertain Galaxy Re-Labelling and Re-Clustering

With the 5000 galaxies and the 23 attributes deemed the best combination by the IFS algorithm, all 2983 galaxies labelled as Uncertain were taken out of the original data set and clustered with the value of k set to 2. This produced two clusters, labelled cluster0 and cluster1 respectively. These two clusters were returned to the original data set, which was then clustered with the value of k set to 4, producing an accuracy of 57.863%. The two clusters were then re-labelled using various combinations of the Spiral and Elliptical labels, and the entire data set was iteratively clustered with the value of k set to 2. Table 7.5 provides the full results of these experiments.

Table 7.5 Results of the Various Re-Labelling and Re-Clustering Experiments

Data Set Type                                   Number of Galaxies Per Cluster    Accuracy (%)
                                                Spiral   Elliptical   Uncertain
Full Data Set                                   1476     520          2983        65.6156
Spiral/Elliptical Only                          1476     520          -           72.495
Uncertain Only                                  -        -            2983        78.9474
Cluster0 - Spiral / Cluster1 - Elliptical       2104     2875         -           63.0649
Cluster0 - Elliptical / Cluster1 - Spiral       3831     1148         -           77.2444
Cluster0 - Spiral / Cluster1 - Spiral           4459     520          -           82.627
Cluster0 - Elliptical / Cluster1 - Elliptical   1476     3503         -           68.4475

It is notable that the highest clustering accuracy, 82.627%, was obtained when the galaxies from both cluster0 and cluster1 were re-labelled as Spiral. Of the 4979 galaxies in the complete data set, only 865 were incorrectly classified, approximately 17.4% of the entire data set.

7.5 Results of Further Experimentation

Motivated by this boost in accuracy, it was determined that further experimentation would be required in order to solidify this finding. State-of-the-art classification techniques, namely Random Forest and the SMO implementation of Support Vector Machines, were used.


Table 7.6 shows the resulting accuracies of these implementations.

Table 7.6 Results of the Various Random Forest and SMO Implementations

Data Set Type                                   Accuracy (%)
                                                K-Means   Random Forest   SMO
Cluster0 - Spiral / Cluster1 - Elliptical       63.0649   90.6005         86.9452
Cluster0 - Elliptical / Cluster1 - Spiral       77.2444   83.6513         77.9675
Cluster0 - Spiral / Cluster1 - Spiral           82.627    91.3838         89.6566
Cluster0 - Elliptical / Cluster1 - Elliptical   68.4475   83.089          78.3892

The accuracies for all three algorithms, when all the galaxies from cluster0 and cluster1 were re-labelled as Spiral, consistently outperformed the rest of the experiments. With the number of trees set to 100, Random Forest provided an exceptional accuracy of 91.3838%, which supports two concluding remarks that we can state with confidence:

• A significant majority of the galaxies labelled as Uncertain are of spiral morphology. This complies with the scientific fact that spiral and irregular galaxies form 60% of the galaxies in the local universe [111], including our Milky Way galaxy (see Figure 7.1 for an image of this galaxy). This result is also consistent with the Sloan Digital Sky Survey's finding that 77% of all observed galaxies are of spiral morphology [134].

Fig. 7.1 A Spitzer Space Telescope infrared image of hundreds of thousands of stars in the Milky Way's core (credit: NASA/JPL-Caltech)


• There is a further small subset of galaxies amongst those labelled "Uncertain" that are either of elliptical morphology, are stars, or possess an entirely different morphology type.

7.6 Summary

The results shown in this chapter provide evidence of the iterative nature of data science projects: negative results indicate how to achieve better ones. As detailed in this chapter and the previous one, we were able, through successive experiments, to boost the accuracy by almost 40%. Such an approach to data science projects suggests that intelligent data analysis is both science and art.


Chapter 8
Conclusion and Future Work

“I like to think of my galaxies as a social network, that is they share properties in common with other galaxies that share properties in common with other galaxies. We build this network of knowledge and then we try to find the strong links in that network.” by Kirk Borne

The CRISP-DM model was selected as the appropriate methodology for this research and was delineated in Chapter 4. Despite CRISP-DM lacking certain SDLC processes in enough detail to sustain large-scale projects, it is regarded as a means of promoting data mining as an engineering process and, as such, was judged more than sufficient to support a project of this scale.

Following the CRISP-DM methodology, repeated data preparation, modelling and evaluation were required, especially when transitioning from the study of the Galaxy Zoo data sets to the analysis of the SDSS data sets. On reflection, it appears that, while all phases are naturally crucial to the development and deployment of such a data mining project, the data understanding phase carries a significantly heavier weight. Being able to analyse and understand the data beyond the surface greatly assists with the data preparation and modelling phases. This was particularly the case when it came to submitting queries to the SDSS database: it was imperative to observe that all attributes were derived from the PhotoObjAll table and that they were all, in fact, features relating to galaxy morphology. Had this not been known, attributes could have been queried from any of the numerous other tables in the SDSS database, likely hindering the success of the project as a whole. This underlines the importance of domain knowledge when dealing with a data science project.

8.1 Conclusion

Motivated by the fact that over 60% of all galaxies in the Galaxy Zoo Table 2 data set are classified as Uncertain, a means for astronomers to more efficiently and accurately classify these galaxies was designed. A novel heuristic algorithm, called Incremental Feature Selection (IFS), was developed to assist with this task by heuristically selecting the best set of attributes through their calculated information gain, thus providing the optimum attainable clustering accuracy.



A series of experiments was then conducted, involving clustering the galaxies labelled as Uncertain, saving their cluster assignments and then re-introducing them into the original data set. The highest accuracy of 82.627% was obtained when all galaxies from cluster0 and cluster1 were re-labelled as Spiral. Applying the Random Forest and SMO algorithms over all the original experiments showed that same data set to outperform the others, which further reinforced this finding. In addition, the Sloan Digital Sky Survey reports that approximately 77% of all observed galaxies are of spiral morphology, which also indicates consistency in the results of this project [134]. There is no doubt that a majority of the galaxies labelled as Uncertain in the Galaxy Zoo Table 2 data set are of spiral morphology.

8.1.1 Experimental Remarks

The initial experiments carried out on the Galaxy Zoo Table 2 data set, as well as on the 10 attributes listed as having the lowest DBI values, were deemed unsuccessful in that the data was either irrelevant or produced unfavourable results. They were concluded not to have met the accuracy requirements.

However, after acquiring a much larger set of attributes, redesigning the experiments, and developing and implementing our IFS algorithm to facilitate and improve the best attribute selection process, the results were hugely successful, with an increase in accuracy of approximately 15-20%. The knowledge-flow models designed and executed also support these results, with further experimentation carried out using the Random Forest and SMO algorithms.

8.2 Future Work and Big Data

The last post on Jim Gray's Microsoft webpage before he went missing in 2007 was a presentation prepared in collaboration with Alex Szalay on eScience. In this presentation, the fourth paradigm in scientific discovery was defined and a vision detailed. The first three paradigms were identified as empirical, theoretical and computational (see Figure 8.1 for Gray's slide on paving the way for a new era in scientific discovery). The fourth paradigm was proposed to be data exploration, in which data analysis and sharing play an important role in this new era of scientific discovery, characterised by its very large data sets (in contemporary terms, the era of Big scientific data).

8.2.1 Analysis of Data Storage Representation

For decades, the relational model has dominated data storage and retrieval. However, with the success of a reasonable number of NoSQL database models in real-world applications, it has become important to choose the storage model according to a detailed analysis of the factors affecting such a decision, most importantly the size and frequency of read and write database operations. We plan to reveal how these factors can inform the decision to adopt one data model over another, through cost modelling, simulation and experimentation. Application to the Galaxy Zoo database will be based on such detailed analysis.

Fig. 8.1 Jim Gray's Slide on the Fourth Paradigm in Scientific Discovery

8.2.2 Output Storage Representation

The general theme in data mining related projects is that models are treated as the final outcome. They can be stored in a way that allows them to be re-used when required. However, many important features of a model are not stored, including parameter settings, and the notion of chaining is not considered. PMML is an XML representation that captures most of these important features of a model [69]. However, two open issues still need addressing:

1. extensibility of the model to allow emerging knowledge discovery methods to be represented; and

2. storage and retrieval of such models according to planned workflows and ad-hoc queries.


In the former, the stages of the knowledge discovery process are well-defined, and retrieval of such models can be carried out by other analysts who have access to them on an open access platform. The latter is a more difficult problem, as the access patterns of ad-hoc queries are unknown prior to execution. We envision addressing both issues. The application to Galaxy Zoo will allow us to experimentally assess the new methods we shall be investigating in the future.

8.2.3 Data Mining and Storage Workflow

There are many tools that allow us to design data mining workflows, from data retrieval through to visualisation and evaluation. However, these workflows assume in-memory storage of intermediate results, or simply storage as files. We plan to extend data mining workflows to specify how intermediate outcomes are stored and, dynamically, how the data is annotated to show which analysis tasks have been applied to it. This way, if a user attempts a process already executed in a workflow, the stored model can be retrieved instead of being re-executed. Given the iterative nature of data mining methods, such recall can be an efficient way of performing the process, especially as many hybrid approaches to data mining have proven successful. The application to Galaxy Zoo can provide physicists and astronomers with a flexible and efficient tool for workflow execution.

8.2.4 Development and Adoption of Data Mining Techniques

Data mining methods have matured over the last decade. However, when faced with domain-specific data, new techniques can be needed, or existing ones at least modified. For example, DBSCAN clustering [56] was developed for spatial data and later found its way into many other applications. Thus, tailored data mining techniques can lead to new methods. In the future, we plan to conduct a thorough investigation of existing feature engineering, data mining and visualisation methods; the adoption of methods, and the need for new development, will be decided according to this investigation. It is also intended to re-design existing techniques, or design new ones, for parallel processing, adopting the MapReduce framework.

8.2.5 Providing Astronomers with Insights

Surprising results in astronomy obtained using data mining techniques can be traced back to the 1990s, when clustering revealed new types of stars and galaxies [159]. Our future plans aim to provide astronomers with a new perspective on the data. Although the Hubble sequence provides an acceptable way of classifying galaxies, it has been criticised for its subjectivity [50]. The use of unsupervised learning techniques can reveal new ways to classify galaxies, as reported in this monograph.


8.3 Final Words

It has been a long, but enjoyable, journey on our way to accomplishing this research project. Initial results were less than promising; however, they proved extremely important as indicators of direction for subsequent tasks. This demonstrates not only the iterative nature of any data mining project, but also the art of adopting the correct set of parameters and the most suitable techniques.

We believe the lesson delivered by the results reported in this monograph is that data mining workflows should be both interactive and iterative.


References

1. Abazajian, K., Adelman-McCarthy, J.K., Agueros, M.A., Allam, S.S., Anderson, K.S., Anderson, S.F., Annis, J., Bahcall, N.A., Baldry, I.K., Bastian, S., et al.: The second data release of the sloan digital sky survey. The Astronomical Journal 128(1), 502 (2004)
2. Abazajian, K., Adelman-McCarthy, J.K., Agueros, M.A., Allam, S.S., Anderson, K.S., Anderson, S.F., Annis, J., Bahcall, N.A., Baldry, I.K., Bastian, S., et al.: The third data release of the sloan digital sky survey. The Astronomical Journal 129(3), 1755 (2005)
3. Abazajian, K., Adelman-McCarthy, J.K., Agueros, M.A., Allam, S.S., Anderson, S.F., Annis, J., Bahcall, N.A., Baldry, I.K., Bastian, S., Berlind, A., et al.: The first data release of the sloan digital sky survey. The Astronomical Journal 126(4), 2081 (2003)
4. Abazajian, K.N., Adelman-McCarthy, J.K., Agueros, M.A., Allam, S.S., Prieto, C.A., An, D., Anderson, K.S., Anderson, S.F., Annis, J., Bahcall, N.A., et al.: The seventh data release of the sloan digital sky survey. The Astrophysical Journal Supplement Series 182(2), 543 (2009)
5. Acuna, E., Rodriguez, C.: The treatment of missing values and its effect on classifier accuracy. In: Classification, Clustering, and Data Mining Applications, pp. 639–647. Springer (2004)
6. Adelman-McCarthy, J.K., Agueros, M.A., Allam, S.S., Anderson, K.S., Anderson, S.F., Annis, J., Bahcall, N.A., Bailer-Jones, C.A., Baldry, I.K., Barentine, J., et al.: The fifth data release of the sloan digital sky survey. The Astrophysical Journal Supplement Series 172(2), 634 (2007)
7. Adelman-McCarthy, J.K., Agueros, M.A., Allam, S.S., Anderson, K.S., Anderson, S.F., Annis, J., Bahcall, N.A., Baldry, I.K., Barentine, J., Berlind, A., et al.: The fourth data release of the sloan digital sky survey. The Astrophysical Journal Supplement Series 162(1), 38 (2006)
8. Agrawal, D., Aggarwal, C.C.: On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 247–255. ACM (2001)
9. Agrawal, R., Srikant, R.: Privacy-preserving data mining. ACM Sigmod Record 29(2), 439–450 (2000)
10. Ahn, C.P., Alexandroff, R., Prieto, C.A., Anderson, S.F., Anderton, T., Andrews, B.H., Aubourg, E., Bailey, S., Balbinot, E., Barnes, R., et al.: The ninth data release of the sloan digital sky survey: First spectroscopic data from the sdss-iii baryon oscillation spectroscopic survey. The Astrophysical Journal Supplement Series 203(2), 21 (2012)
11. Aihara, H., Prieto, C.A., An, D., Anderson, S.F., Aubourg, E., Balbinot, E., Beers, T.C., Berlind, A.A., Bickerton, S.J., Bizyaev, D., et al.: The eighth data release of the sloan digital sky survey: first data from sdss-iii. The Astrophysical Journal Supplement Series 193(2), 29 (2011)
12. Alpher, R.A., Herman, R.: Evolution of the universe. Nature 162, 774–775 (1948)
13. Alsabti, K.: An efficient k-means clustering algorithm. In: Proceedings of IPPS/SPDP Workshop on High Performance Data Mining (1998)
14. Antunes, C.M., Oliveira, A.L.: Temporal data mining: An overview. In: KDD Workshop on Temporal Data Mining, pp. 1–13 (2001)
15. Asaka, T., Yanagida, T.: Solving the gravitino problem by the axino. Physics Letters B 494(3), 297–301 (2000)
16. Astrophysics, N.: Galaxies, http://science.nasa.gov/astrophysics/focus-areas/what-are-galaxies/
17. Baehr, S., Vedachalam, A., Borne, K.D., Sponseller, D.: Data mining the galaxy zoo mergers. In: CIDU, pp. 133–144 (2010)
18. Ball, N.M.: Astroinformatics, cloud computing, and new science at the canadian astronomy data centre. American Astronomical Society Meeting Abstracts 219 (2012)
19. Ball, N.M., Brunner, R.J.: Data mining and machine learning in astronomy. International Journal of Modern Physics D 19(7), 1049–1106 (2010)
20. Ball, N.M., Brunner, R.J., Myers, A.D., Strand, N.E., Alberts, S.L., Tcheng, D., Llora, X.: Robust machine learning applied to astronomical data sets. ii. quantifying photometric redshifts for quasars using instance-based learning. The Astrophysical Journal 663(2), 774 (2007)
21. Bamford, S.: My galaxies (2012), http://www.mygalaxies.co.uk/
22. Banerji, M., Lahav, O., Lintott, C.J., Abdalla, F.B., Schawinski, K., Bamford, S.P., Andreescu, D., Murray, P., Raddick, M.J., Slosar, A., et al.: Galaxy zoo: reproducing galaxy morphologies via machine learning. Monthly Notices of the Royal Astronomical Society 406(1), 342–353 (2010)
23. Bates, D., Chambers, J., Dalgaard, P., Gentleman, R., Hornik, K., Iacus, S., Ihaka, R., Leisch, F., Lumley, T., Maechler, M., et al.: The r project for statistical computing (2007)
24. Bell, G., Hey, T., Szalay, A.: Beyond the data deluge. Science 323(5919), 1297–1298 (2009)
25. Bennett, K.P., Fayyad, U., Geiger, D.: Density-based indexing for approximate nearest-neighbor queries. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 233–243. ACM (1999)
26. Berkhin, P.: A survey of clustering data mining techniques. In: Grouping Multidimensional Data, pp. 25–71. Springer (2006)
27. Borne, K.: Surprise detection in science datasets using k-nearest neighbour data distributions (knn-dd)
28. Borne, K.: Scientific data mining in astronomy. arXiv preprint arXiv:0911.0505 (2009)
29. Borne, K.D.: Managing the big data avalanche in astronomy-data mining the galaxy zoo classification database. American Astronomical Society Meeting Abstracts 223 (2014)
30. Bose, I., Mahapatra, R.K.: Business data mining — a machine learning perspective. Information & Management 39(3), 211–225 (2001)
31. Bradley, P.S., Fayyad, U., Reina, C.: Efficient probabilistic data clustering: Scaling to large databases. Microsoft Research, Redmond, USA (1998)
32. Bradley, P.S., Fayyad, U.M.: Refining initial points for k-means clustering. In: ICML, vol. 98, pp. 91–99 (1998)
33. Bradley, P.S., Fayyad, U.M., Reina, C., et al.: Scaling clustering algorithms to large databases. In: KDD, pp. 9–15 (1998)
34. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
35. Brescia, M., Cavuoti, S., Djorgovski, G.S., Donalek, C., Longo, G., Paolillo, M.: Extracting knowledge from massive astronomical data sets. In: Astrostatistics and Data Mining, pp. 31–45. Springer (2012)
36. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)
37. de la Calleja, J., Fuentes, O.: Automated classification of galaxy images. In: Negoita, M.G., Howlett, R.J., Jain, L.C. (eds.) KES 2004. LNCS (LNAI), vol. 3215, pp. 411–418. Springer, Heidelberg (2004)
38. Cao, L.J., Keerthi, S.S., Ong, C.J., Zhang, J., Periyathamby, U., Fu, X.J., Lee, H.: Parallel sequential minimal optimization for the training of support vector machines. IEEE Transactions on Neural Networks 17(4), 1039–1049 (2006)
39. Celis, S., Musicant, D.R.: Weka-parallel: machine learning in parallel. Carleton College, CS TR. Citeseer (2002)
40. Cen, R.: On the origin of the hubble sequence: I. insights on galaxy color migration from cosmological simulations. The Astrophysical Journal 781, 38 (2014)
41. Chan, P.K., Fan, W., Prodromidis, A.L., Stolfo, S.J.: Distributed data mining in credit card fraud detection. IEEE Intelligent Systems and their Applications 14(6), 67–74 (1999)
42. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R.: Crisp-dm 1.0. CRISP-DM Consortium (2000)
43. Clark, S.: The Big Questions: The Universe. Quercus (2011)
44. Clifton, C., Marks, D.: Security and privacy implications of data mining. In: ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 15–19. Citeseer (1996)
45. Collobert, R., Bengio, S., Bengio, Y.: A parallel mixture of svms for very large scale problems. Neural Computation 14(5), 1105–1114 (2002)
46. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
47. Cuzzocrea, A., Gaber, M.M.: Data science and distributed intelligence: Recent developments and future insights. In: Fortino, G., Badica, C., Malgeri, M., Unland, R. (eds.) IDC 2012. SCI, vol. 446, pp. 139–146. Springer, Heidelberg (2012)
48. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
49. Dong, J.X., Krzyzak, A., Suen, C.Y.: Fast svm training algorithm with decomposition on very large data sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(4), 603–618 (2005)
50. Dressler, A., Oemler Jr., A., Butcher, H.R., Gunn, J.E.: The morphology of distant cluster galaxies. 1: Hst observations of cl 0939+4713. The Astrophysical Journal 430, 107–120 (1994)
51. Dvali, G., Senjanovic, G.: Is there a domain wall problem? arXiv preprint hep-ph/9501387 (1995)
52. Eliche-Moral, M.C., Gonzalez-García, A.C., Aguerri, J.A.L., Gallego, J., Zamorano, J., Balcells, M., Prieto, M.: Evolution along the sequence of s0 hubble types induced by dry minor mergers i. global bulge-to-disk structural relations. Astronomy & Astrophysics 547 (2012)
53. Ellis, J., Linde, A.D., Nanopoulos, D.V.: Inflation can save the gravitino. Physics Letters B 118(1), 59–64 (1982)
54. Ellis, J., Nanopoulos, D.V., Olive, K.A., Tamvakis, K.: Primordial supersymmetric inflation. Nuclear Physics B 221(2), 524–548 (1983)
55. Ellis, J., Nanopoulos, D.V., Quiros, M.: On the axion, dilaton, polonyi, gravitino and shadow matter problems in supergravity and superstring models. Physics Letters B 174(2), 176–182 (1986)
56. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)
57. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Magazine 17(3), 37 (1996)
58. Fayyad, U.M., Reina, C., Bradley, P.S.: Initialization of iterative refinement clustering algorithms. In: KDD, pp. 194–198 (1998)
59. Fine, S., Scheinberg, K.: Efficient svm training using low-rank kernel representations. The Journal of Machine Learning Research 2, 243–264 (2002)
60. Fortson, L., Masters, K., Nichol, R., Borne, K., Edmondson, E., Lintott, C., Raddick, J., Schawinski, K., Wallin, J.: Galaxy zoo: Morphological classification and citizen science. arXiv preprint arXiv:1104.5513 (2011)
61. Fukugita, M., Ichikawa, T., Gunn, J., Doi, M., Shimasaku, K., Schneider, D.: The sloan digital sky survey photometric system. The Astronomical Journal 111, 1748 (1996)
62. Gaber, M.M.: Scientific data mining and knowledge discovery. Springer (2010)
63. Gaber, M.M.: Journeys to Data Mining: Experiences from 15 Renowned Researchers. Springer Publishing Company, Incorporated (2012)
64. Gardner, J.P., Mather, J.C., Clampin, M., Doyon, R., Greenhouse, M.A., Hammel, H.B., Hutchings, J.B., Jakobsen, P., Lilly, S.J., Long, K.S., et al.: The james webb space telescope. Space Science Reviews 123(4), 485–606 (2006)
65. Gauci, A., Adami, K.Z., Abela, J.: Machine learning for galaxy morphology classification. arXiv preprint arXiv:1005.0390 (2010)
66. Gibson, C.H.: The first turbulent mixing and combustion. IUTAM Turbulent Mixing and Combustion 21 (2001)
67. Gingerich, O.: The book nobody read: chasing the revolutions of Nicolaus Copernicus, vol. 1. Penguin Books (2004)
68. Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001)
69. Guazzelli, A., Zeller, M., Lin, W.C., Williams, G.: Pmml: An open standard for sharing models. The R Journal 1(1), 60–65 (2009)
70. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1), 10–18 (2009)
71. Han, J., Kamber, M., Pei, J.: Data mining: concepts and techniques. Morgan Kaufmann (2006)
72. Hand, D.J., Mannila, H., Smyth, P.: Principles of data mining (adaptive computation and machine learning) (2001)
73. Hassan, A., Fluke, C.J., Barnes, D.G.: Unleashing the power of distributed cpu/gpu architectures: Massive astronomical data analysis and visualization case study. arXiv preprint arXiv:1111.6661 (2011)
74. Heiden, A.V.: The galileo project (1995), http://galileo.rice.edu/
75. Henrion, M., Mortlock, D.J., Hand, D.J., Gandy, A.: A bayesian approach to star–galaxy classification. Monthly Notices of the Royal Astronomical Society 412(4), 2286–2302 (2011)
76. Henrion, M., Mortlock, D.J., Hand, D.J., Gandy, A.: Classification and anomaly detection for astronomical survey data. In: Astrostatistical Challenges for the New Astronomy, pp. 149–184. Springer (2013)
77. Hey, A.J., Tansley, S., Tolle, K.M., et al.: The fourth paradigm: data-intensive scientific discovery (2009)
78. Ho, T.K.: Random decision forests. In: Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
79. Holmes, G., Donkin, A., Witten, I.H.: Weka: A machine learning workbench. In: Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems, pp. 357–361. IEEE (1994)
80. Hornik, K.: The comprehensive r archive network. Wiley Interdisciplinary Reviews: Computational Statistics 4(4), 394–398 (2012)
81. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2(3), 283–304 (1998)
82. Hubble, E.P.: Extragalactic nebulae. The Astrophysical Journal 64, 321–369 (1926)
83. Hubble, E.P.: The realm of the nebulae. Yale University Press (1936)
84. Jagannathan, G., Wright, R.N.: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 593–599. ACM (2005)
85. Jonas, J., Harper, J.: Effective counterterrorism and the limited role of predictive data mining. Cato Institute (2006)
86. Kajisawa, M., Yamada, T.: When did the hubble sequence appear?: Morphology, color, and number-density evolution of the galaxies in the hubble deep field north. Publications of the Astronomical Society of Japan 53(5), 833–852 (2001)

87. Kamar, E., Hacker, S., Horvitz, E.: Combining human and machine intelligence inlarge-scale crowdsourcing. In: Proceedings of the 11th International Conference on Au-tonomous Agents and Multiagent Systems, vol. 1, pp. 467–474. International Founda-tion for Autonomous Agents and Multiagent Systems (2012)

88. Kantardzic, M.: Data mining: concepts, models, methods, and algorithms. John Wiley& Sons (2011)

89. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.:An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans-actions on Pattern Analysis and Machine Intelligence 24(7), 881–892 (2002)

90. Kasivajhula, S., Raghavan, N., Shah, H.: Morphological galaxy classification using ma-chine learning. Monthly Notices Royal Astron. Soc. 8, 1–8 (2007)

91. Kirshner, S., Cadez, I.V., Smyth, P., Kamath, C.: Learning to classify galaxy shapesusing the em algorithm. In: Advances in Neural Information Processing Systems, pp.1497–1504 (2002)

92. Kleissner, C.: Data mining for the enterprise. In: Proceedings of the Thirty-First HawaiiInternational Conference on System Sciences, vol. 7, pp. 295–304. IEEE (1998)

93. Kormendy, J., Bender, R.: A proposed revision of the hubble sequence for ellipticalgalaxies. The Astrophysical Journal Letters 464(2), L119 (1996)

94. Kormendy, J., Bender, R.: A revised parallel-sequence morphological classification ofgalaxies: structure and formation of s0 and spheroidal galaxies. The Astrophysical Jour-nal Supplement Series 198(1), 2 (2012)

95. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Data preprocessing for supervised lean-ing. International Journal of Computer Science 1(2), 111–117 (2006)

96. Kovalerchuk, B., Vityaev, E.: Data mining in finance. Advances in relational and hybridmethods (2000)

Page 108: [Studies in Big Data] Astronomy and Big Data Volume 6 ||

100 References

97. Krishna, K., NarasimhaMurty, M.: Genetic k-means algorithm. IEEE Transactions onSystems, Man, and Cybernetics, Part B: Cybernetics 29(3), 433–439 (1999)

98. Larsson, S.E., Sarkar, S., White, P.L.: Evading the cosmological domain wall problem.Physical Review D 55(8), 5129 (1997)

99. Laurikainen, E., Salo, H., Buta, R., Knapen, J.: Properties of bars and bulges in thehubble sequence. Monthly Notices of the Royal Astronomical Society 381(1), 401–417(2007)

100. Lazarides, G., Shafi, Q.: Axion models with no domain wall problem. Physics LettersB 115(1), 21–25 (1982)

101. Lee, C., Jang, M.G.: Fast training of structured svm using fixed-threshold sequentialminimal optimization. ETRI Journal 31(2), 121–128 (2009)

102. Lemaıtre, G.: The primeval atom hypothesis and the problem of the clusters of galaxies.La Structure et lEvolution de lUnivers, pp. 1–32 (1958)

103. Zhang, L.J.M., Lin, B., An, F.Z.: improvement algorithm to sequential minimal opti-mization. Journal of Software 5, 007 (2003)

104. Lin, Y., Jeon, Y.: Random forests and adaptive nearest neighbors. Journal of the Amer-ican Statistical Association 101(474), 578–590 (2006)

105. Linde, A.: Primordial inflation without primodial monopoles. Physics Letters B 132(4),317–320 (1983)

106. Linde, A., Linde, D., Mezhlumian, A.: From the big bang theory to the theory of astationary universe. Physical Review D 49(4), 1783 (1994)

107. Linde, A.D.: A new inflationary universe scenario: A possible solution of the hori-zon, flatness, homogeneity, isotropy and primordial monopole problems. Physics Let-ters B 108(6), 389–393 (1982)

108. Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Bellare, M. (ed.) CRYPTO2000. LNCS, vol. 1880, pp. 36–54. Springer, Heidelberg (2000)

109. Lintott, C.J., Schawinski, K., Slosar, A., Land, K., Bamford, S., Thomas, D., Raddick,M.J., Nichol, R.C., Szalay, A., Andreescu, D., et al.: Galaxy zoo: morphologies derivedfrom visual inspection of galaxies from the sloan digital sky survey. Monthly Noticesof the Royal Astronomical Society 389(3), 1179–1189 (2008)

110. Lloyd, S.: Least squares quantization in pcm. IEEE Transactions on Information The-ory 28(2), 129–137 (1982)

111. Loveday, J.: The apm bright galaxy catalogue. Monthly Notices of the Royal Astro-nomical Society 278(4), 1025–1048 (1996)

112. Lucas, P.: Bayesian analysis, pattern analysis, and data mining in health care. CurrentOpinion in Critical Care 10(5), 399–403 (2004)

113. Marban, O., Mariscal, G., Segovia, J.: A data mining and knowledge discovery processmodel. In: Data Mining and Knowledge Discovery in Real Life Applications, p. 8. IN-TECH (2009)

114. Marban, O., Segovia, J., Menasalvas, E., Fernandez-Baizan, C.: Toward data miningengineering: A software engineering approach. Information Systems 34(1), 87–107(2009)

115. Martin, J.: The spectral sequence. In: A Spectroscopic Atlas of Bright Stars, pp. 15–21.Springer (2010)

116. Martınez-Gonzalez, E., Diego, J., Vielva, P., Silk, J.: Cosmic microwave backgroundpower spectrum estimation and map reconstruction with the expectation-maximizationalgorithm. Monthly Notices of the Royal Astronomical Society 345(4), 1101–1109(2003)

117. Mather, J., Hinshaw, G., Page, J.D.L.: Cosmic microwave background. In: Planets, Starsand Stellar Systems, pp. 609–684. Springer (2013)

Page 109: [Studies in Big Data] Astronomy and Big Data Volume 6 ||

References 101

118. McConnell, S., Skillicorn, D.: Distributed data mining for astrophysical datasets. In:Astronomical Data Analysis Software and Systems XIV, vol. 347, p. 360 (2005)

119. Murdin, P.: Big bang theory. Encyclopedia of Astronomy and Astrophysics 1, 4801(2000)

120. Newton, I.: Principia (1687). Translated by Andrew Motte 1729 (2004)121. Neyman, J., Scott, E.L.: Statistical approach to problems of cosmology. Journal of the

Royal Statistical Society. Series B (Methodological), 1–43 (1958)122. Noble, W.S.: What is a support vector machine? Nature Biotechnology 24(12), 1565–

1567 (2006)123. Nolan, P., Abdo, A., Ackermann, M., Ajello, M., Allafort, A., Antolini, E., Atwood,

W., Axelsson, M., Baldini, L., Ballet, J., et al.: Fermi large area telescope second sourcecatalog. The Astrophysical Journal Supplement Series 199(2), 31 (2012)

124. Ordonez, C., Omiecinski, E.: Frem: fast and robust em clustering for large data sets. In:Proceedings of the Eleventh International Conference on Information and KnowledgeManagement, pp. 590–599. ACM (2002)

125. Paynter, G., Trigg, L., Frank, E., Kirkby, R.: Attribute-relation file format (arff) (2002)126. Perryman, M.A.: Extra-solar planets. Reports on Progress in Physics 63(8), 1209 (2000)127. Phil, H.D.M.: Data Mining Techniques and Applications: An Introduction. Course

Technology Cengage Learning (2010)128. Platt, J., et al.: Sequential minimal optimization: A fast algorithm for training support

vector machines (1998)129. Prather, J.C., Lobach, D.F., Goodwin, L.K., Hales, J.W., Hage, M.L., Hammond, W.E.:

Medical data mining: knowledge discovery in a clinical data warehouse. In: Proceed-ings of the AMIA Annual Fall Symposium, p. 101. American Medical Informatics As-sociation (1997)

130. Prestopnik, N.R.: Citizen science case study: Galaxy zoo/zooniverse (2012),http://citsci.syr.edu/system/files/galaxyzoo.pdf

131. Roberts, M.S., Haynes, M.P.: Physical parameters along the hubble sequence. AnnualReview of Astronomy and Astrophysics 32, 115–152 (1994)

132. Romero, C., Ventura, S.: Educational data mining: a review of the state of the art.IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Re-views 40(6), 601–618 (2010)

133. Russell, J.L.: Kepler’s laws of planetary motion: 1609-1666. Cambridge Univ Press(1964)

134. SDSS: Spiral galaxies,http://cas.sdss.org/dr7/en/proj/basic/galaxies/spirals.asp

135. Shamir, L.: Automatic morphological classification of galaxy images. Monthly Noticesof the Royal Astronomical Society 399(3), 1367–1372 (2009)

136. Simpson, E., Roberts, S., Psorakis, I., Smith, A.: Dynamic bayesian combination ofmultiple imperfect classifiers. In: Guy, T.V., Karny, M., Wolpert, D.H. (eds.) DecisionMaking and Imperfection. SCI, vol. 474, pp. 1–38. Springer, Heidelberg (2013)

137. Slobogin, C.: Government data mining and the fourth amendment, pp. 317–341. TheUniversity of Chicago Law Review (2008)

138. Smith, A., Lynn, S., Sullivan, M., Lintott, C., Nugent, P., Botyanszki, J., Kasliwal, M.,Quimby, R., Bamford, S., Fortson, L., et al.: Galaxy zoo supernovae. Monthly Noticesof the Royal Astronomical Society 412(2), 1309–1319 (2011)

Page 110: [Studies in Big Data] Astronomy and Big Data Volume 6 ||

102 References

139. Smith, M., Gomez, H., Eales, S., Ciesla, L., Boselli, A., Cortese, L., Bendo, G., Baes,M., Bianchi, S., Clemens, M., et al.: The herschel reference survey: dust in early-typegalaxies and across the hubble sequence. The Astrophysical Journal 748(2), 123 (2012)

140. Steinhaus, H.: Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci. 1,801–804 (1956)

141. Stoughton, C., Lupton, R.H., Bernardi, M., Blanton, M.R., Burles, S., Castander, F.J.,Connolly, A., Eisenstein, D.J., Frieman, J.A., Hennessy, G., et al.: Sloan digital skysurvey: Early data release. The Astronomical Journal 123(1), 485 (2002)

142. Sullivan, D.G.: Skyserver: An astronomical database (2012)143. Szalay, A.S., Gray, J., Thakar, A.R., Kunszt, P.Z., Malik, T., Raddick, J., Stoughton, C.:

vandenBerg, J.: The sdss skyserver: public access to the sloan digital sky server data.In: Proceedings of the 2002 ACM SIGMOD International Conference on Managementof Data, pp. 570–581. ACM (2002)

144. Szkody, P., Anderson, S.F., Hayden, M., Kronberg, M., McGurk, R., Riecken, T.,Schmidt, G.D., West, A.A., Gansicke, B.T., Gomez-Moran, A.N., et al.: Cataclysmicvariables from sdss. vii. the seventh year. The Astronomical Journal 137(4), 4011(2006)

145. Taton, R., Wilson, C., Hoskin, M.: Planetary Astronomy from the Renaissance to theRise of Astrophysics, Part A, Tycho Brahe to Newton, vol. 2. Cambridge UniversityPress (2003)

146. Trefil, J.S.: The moment of creation: Big bang physics from before the first millisecondto the present universe. Courier Dover Publications (2013)

147. Vaidya, J., Clifton, C.: Privacy preserving association rule mining in vertically parti-tioned data. In: Proceedings of the Eighth ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, pp. 639–644. ACM (2002)

148. Vapnik, V.: The nature of statistical learning theory. Springer (2000)149. Vasconcellos, E., de Carvalho, R., Gal, R., LaBarbera, F., Capelato, H., Velho, H.F.C.,

Trevisan, M., Ruiz, R.: Decision tree classifiers for star/galaxy separation. The Astro-nomical Journal 141(6), 189 (2011)

150. Vedachalam, A.: Effective outlier detection in science data streams151. Verykios, V.S., Bertino, E., Fovino, I.N., Provenza, L.P., Saygin, Y., Theodoridis, Y.:

State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1), 50–57(2004)

152. Viveros, M.S., Nearhos, J.P., Rothman, M.J.: Applying data mining techniques to ahealth insurance information system. In: VLDB, pp. 286–294 (1996)

153. Wadadekar, Y.: Morphology of galaxies. arXiv preprint arXiv:1201.2252 (2012)154. Wagstaff, K.: Clustering with missing values: No imputation required. Springer (2004)155. Wagstaff, K.L., Laidler, V.G.: Making the most of missing values: object clustering with

partial data in astronomy. Astronomical Data Analysis Software and Systems XIV 347,172 (2005)

156. Wang, R., Allen, T., Harris, W., Madnick, S.: An information product approach for totalinformation awareness (2002)

157. Way, M.: Galaxy zoo morphology and photometric redshifts in the sloan digital skysurvey. The Astrophysical Journal Letters 734(1), L9 (2011)

158. Way, M.J., Klose, C.: Can self-organizing maps accurately predict photometric red-shifts? Publications of the Astronomical Society of the Pacific 124(913), 274–279(2012)

159. Weir, N., Fayyad, U.M., Djorgovski, S.: Automated star/galaxy classification for digi-tized poss-ii. The Astronomical Journal 109, 2401 (1995)

Page 111: [Studies in Big Data] Astronomy and Big Data Volume 6 ||

References 103

160. Wirth, R., Hipp, J.: Crisp-dm: Towards a standard process model for data mining.In: Proceedings of the 4th International Conference on the Practical Applications ofKnowledge Discovery and Data Mining, pp. 29–39. Citeseer (2000)

161. Wozniak, P., Akerlof, C., Amrose, S., Brumby, S., Casperson, D., Gisler, G., Kehoe, R.,Lee, B., Marshall, S., McGowan, K., et al.: Classification of rotse variable stars usingmachine learning. Bulletin of the American Astronomical Society 33, 1495 (2001)

162. Yu, H., Yang, J., Han, J.: Classifying large data sets using svms with hierarchical clus-ters. In: Proceedings of the ninth ACM SIGKDD International Conference on Knowl-edge Discovery and Data Mining, pp. 306–315. ACM (2003)

163. Zhang, H.R., Han, Z.Z.: An improved sequential minimal optimization learning algo-rithm for regression support vector machine. Journal of Software 14(12), 2006–2013(2003)

164. Zhang, S., Zhang, C., Yang, Q.: Data preparation for data mining. Applied ArtificialIntelligence 17(5-6), 375–381 (2003)

Page 112: [Studies in Big Data] Astronomy and Big Data Volume 6 ||

Index

Astronomy, 5
Big Bang Theory, 12
Big Data, 1
Citizen Science, 17
CRISP-DM, 31
Data Mining, 1
Data Pre-processing, 22
Galaxy, 9
Galaxy Morphology, 10
Galaxy Zoo, 18
Galaxy Zoo Table 2, 43
IFS, 40
Incremental Feature Selection, 40
K-Means Algorithm, 33
Knowledge Discovery, 16
R Language, 75
Random Forests, 38
RStudio, 75
Sequential Minimal Optimisation, 37
SMO, 37
Star, 9
Support Vector Machines, 34
SVM, 34
Waikato Environment for Knowledge Analysis, 49
WEKA, 49