Astroinformatics at the Canadian Astronomy Data...


The Next Generation Virgo Cluster Survey (NGVS) is one of six science projects integrated into CANFAR during its development phase. It is a 104 square degree survey of the Virgo Cluster of galaxies in five optical bands (ugriz), using the MegaCam camera on the Canada-France-Hawaii Telescope, with a limiting magnitude of 25.7 (10σ point source) in the g band. The survey will revolutionize the science of this prototypical high density environment in the local universe.

The survey data size, while not extremely large by modern standards, still represents a substantial dataset that will be amenable to data mining. The expected final dataset is 50 TB, processed by two independent pipelines: MegaPipe at CADC, and TERAPIX at the Institut d'Astrophysique in Paris.

Currently, the problem of deducing cluster membership absent spectroscopic redshifts remains unsolved.

K-Means Clustering

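The aim of K-means is to assign points in a parameter space to clusters optimally, in an unsupervised manner, by minimizing the within-cluster sum of squared distances:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

for observations $x_j$, $k$ clusters $S_i$, with cluster means $\mu_i$.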
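As a minimal sketch of how K-means assigns observations to clusters, the pure-Python example below alternates between assigning each point to its nearest cluster mean and recomputing the means (Lloyd's algorithm), then reports the within-cluster sum of squared distances. It is not the SkyTree implementation. The function names and the deterministic farthest-first initialization are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Illustrative Lloyd's algorithm; returns (labels, means, objective)."""
    # Assumed initialization: start from the first point, then repeatedly
    # add the point farthest from all means chosen so far.
    means = [list(points[0])]
    while len(means) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, m) for m in means)
        )
        means.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest cluster mean.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means will not change
        labels = new_labels
        # Update step: each mean becomes the centroid of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                means[i] = [sum(col) / len(members) for col in zip(*members)]

    # Within-cluster sum of squared distances (the quantity being minimized).
    objective = sum(
        squared_distance(p, means[lab]) for p, lab in zip(points, labels)
    )
    return labels, means, objective
```

Calling `kmeans(points, k=2)` on a list of coordinate tuples returns one label per point. Lloyd's algorithm only finds a local minimum, so practical implementations usually try several initializations.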

Here, we perform dimension reduction (currently PCA) and run the SkyTree kmeans algorithm on

We describe ongoing astroinformatics work at the Canadian Astronomy Data Centre (CADC). With a collection of over 0.5 petabytes of information, and serving nearly 3000 astronomers worldwide, CADC is one of the world's largest astronomy data centres. Its staff combines astronomers and computer specialists, resulting in a rich interaction between world experts that is ideal for fostering developments within astroinformatics. One of CADC's ongoing goals is to retain science drivers as the primary motivator at each step of the process, from the receipt of raw data from telescopes, to the release of that data, and its use by scientists. Thus, the developments remain guided by maximal benefit to the astronomy community.

The Canadian Advanced Network for Astronomical Research (CANFAR) is a University of Victoria and CADC project that builds on the existing CADC infrastructure to provide the storage, processing, and analysis tools needed to enable astronomers to perform data-intensive astronomy on current and next generation datasets. CANFAR provides a Virtual Cluster, accessed via a Virtual Machine environment, over which the user has complete control, and access to Cloud Computing on the Compute Canada Grid. Its services are compliant with the International Virtual Observatory Alliance standards. Hence, rather than build a new infrastructure for a project such as a sky survey, an individual or collaboration may utilize CANFAR.

Although the infrastructure provided by CANFAR is vital, its main focus is on the basic storage and processing of data. To apply methods such as KDD, machine learning, and data mining, further software must be run. By analogy to the argument that CANFAR can provide the generic hardware portions of a data processing pipeline, we implement fast, scalable data mining algorithms that simplify the generic portions of KDD within current and future datasets, further enabling practical data-intensive astronomy.

We show an example of the use of the SkyTree software to perform K-means clustering to determine which galaxies in the Next Generation Virgo Cluster Survey (NGVS) are cluster members. This problem is unsolved within the survey.

Astroinformatics at the Canadian Astronomy Data Centre

Nicholas M. Ball

Canadian Astronomy Data Centre, Herzberg Institute of Astrophysics, Victoria, BC, Canada http://sites.google.com/site/nickballastronomer nick.ball@nrc-cnrc.gc.ca

Introduction

Virgo Cluster Membership via K-Means

Dowler P., et al., 2008, Common Archive Observation Model. ADASS XVII, ASP Conference Proceedings, Vol. 394, eds. Argyle R.W., Bunclark P.S., Lewis J.R., pp 426–429

Gaudet S., the CADC team, 2011, Virtualization and Grid Utilization within the CANFAR Project. ADASS XX, ASP Conference Proceedings, Vol. 442, eds. Evans I.N., Accomazzi A., Mink D.J., Rots A.H., pp 61–64

This research used the facilities of the Canadian Astronomy Data Centre, operated by the National Research Council of Canada with the support of the Canadian Space Agency. Funding for CANFAR was provided by CANARIE via the Network Enabled Platforms Supporting Virtual Organisations program.

The IVOA Interest Group in Knowledge Discovery in Databases (led by G. Longo) aims to deploy practical data mining algorithms of use to astronomers:

“We will develop and test scalable data mining algorithms and the accompanying new standards for VO interfaces and protocols, so that these algorithms can be discovered and used transparently within VO science workflows or in standalone data exploration applications.”

KDD-IG Charter, 2010

As part of achieving these aims, we are constructing an online guide to data mining in astronomy. Prior to this guide, no such tool existed. The guide is designed for the astronomer who is interested in using the methods of data mining to improve their science return, but whose main priority remains getting their science done. The guide is currently situated at http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide.

References & Acknowledgments

Figure 2: K-means clustering results, showing normalized cluster membership for several subsets of objects as a function of cluster number (35 clusters in this case).

Astroinformatics at CADC

Astroinformatics will become the only way to render future datasets comprehensible. It will become increasingly impractical to download data, hence an infrastructure is required in which the data analysis can be done in situ, without the need for downloading and local processing. A significant proportion of the KDD component of CADC's astroinformatics has been in the context of the science requirements of the Next Generation Virgo Cluster Survey (NGVS), guided by the author's science interests, e.g., the galaxy luminosity function. We show an example here.

Canadian Astronomy Data Centre

The Canadian Astronomy Data Centre (CADC), based at the Herzberg Institute of Astrophysics in Victoria, BC, is one of the largest astronomy data centres in the world. Founded in 1986, it currently holds over 500 TB of data, and has served over 100 TB to more than 5000 distinct IP addresses worldwide. CADC combines the expertise of astronomers and computer specialists, and is hence ideal for realizing the scientific benefits of astroinformatics.

the first three PCs. kmeans includes a facility to determine the optimal number of clusters via cross-validation.
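As a rough illustration of the dimension reduction step, the sketch below projects a table of features onto its first three principal components using power iteration with deflation, in pure Python. It is not the SkyTree implementation, and the choice of input features is not specified here. The function name `pca_project`, the iteration count, and the random starting vectors are assumptions for illustration only.

```python
import random


def pca_project(rows, n_components=3, n_iter=500, seed=0):
    """Illustrative PCA: returns (scores, components) for a list of rows."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])

    # Centre each feature on its mean.
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    centred = [[r[c] - means[c] for c in range(d)] for r in rows]

    # Sample covariance matrix of the centred features.
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

    components = []
    for _component in range(n_components):
        # Power iteration converges to the leading eigenvector of cov.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _step in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d)
        )
        components.append(v)
        # Deflation: remove the found direction so the next pass finds
        # the next principal component.
        cov = [
            [cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
            for a in range(d)
        ]

    # Project every centred row onto the retained components.
    scores = [
        [sum(r[c] * comp[c] for c in range(d)) for comp in components]
        for r in centred
    ]
    return scores, components
```

The returned `scores` have one row per object and one column per retained component, which is the form a clustering routine would then take as input. Power iteration assumes the leading eigenvalues are distinct; a production pipeline would normally use a library eigensolver instead.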

Results

Initial results show that the procedure is able to discern meaningful groups of galaxies within the PC1-PC2-PC3 space (Figure 2), e.g., cluster members and background galaxies confirmed by spectroscopy, saturated stars, and bright, low surface brightness artifacts.

Future Work

Obvious refinements include:

• More thorough removal of image/catalogue artifacts

• Testing of non-linear dimension reduction, e.g., kernel PCA

• Use of prior knowledge, e.g., constrained K-means, guided by object spectra

• Probabilistic cluster membership

• Detailed characterization of the objects contained in each cluster

CANFAR & CVO

The Canadian Advanced Network for Astronomical Research (CANFAR; http://canfar.phys.uvic.ca), led by Chris Pritchet at the University of Victoria, and contracted to CADC, is:

“a project ... to provide the delivery, processing, storage, analysis, and distribution of astronomical datasets of unprecedented size. ... The project builds on CADC's existing infrastructure to provide IVOA-compliant tools and services for astronomers, and access to Cloud Computing on the Compute Canada Grid, via a Virtual Machine environment.”

CANFAR Statement of Work, 2008

CANFAR Usage

CANFAR provides a Virtual Cluster, accessible to an individual user or collaboration (Virtual Organization). Each user operates within their own Virtual Machine environment, over which they have complete control. This provides access to CANFAR services, and the Cloud Computing resources of Compute Canada (Figure 1).

VO-Compliant Web Services

Outward-facing CANFAR services use IVOA-compliant protocols, e.g., TAP and VOSpace for data services, UWS for processing, and TLS and X.509 grid certificates for security. Inward-facing infrastructure builds on existing CADC resources. For cloud computing, Cloud Scheduler, Condor, Nimbus, and iRODS are used.
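To make the TAP protocol concrete, the sketch below builds the URL for a synchronous ADQL query using the request parameters defined in the IVOA TAP 1.0 standard (`REQUEST`, `LANG`, `FORMAT`, `QUERY`). The base URL, table name, and column names are hypothetical placeholders and do not refer to an actual CANFAR or CADC endpoint. No network request is made.

```python
from urllib.parse import urlencode


def build_tap_sync_url(base_url, adql):
    """Return the URL for a synchronous TAP 1.0 query (illustrative helper)."""
    params = {
        "REQUEST": "doQuery",  # standard TAP 1.0 request type
        "LANG": "ADQL",        # query language
        "FORMAT": "votable",   # requested output format
        "QUERY": adql,         # the ADQL query text
    }
    return base_url.rstrip("/") + "/sync?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical service and table, for illustration only.
    url = build_tap_sync_url(
        "https://example.org/tap",
        "SELECT TOP 10 ra, dec, g_mag FROM example.catalogue WHERE g_mag < 25.7",
    )
    print(url)
```

The resulting URL could be fetched with any HTTP client. Long-running queries would instead use the asynchronous endpoint, which follows the UWS pattern.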

In addition, CADC data are available in IVOA-compliant form via the Common Archive Observation Model of the Canadian Virtual Observatory (Dowler et al. 2008).

Figure 1: CANFAR infrastructure: Virtual Organizations, such as individual users or survey teams, access CANFAR and Compute Canada resources via a Virtual Machine environment. Green boxes show new components resulting from CANFAR. From Gaudet et al. (2011).

CANFAR Science

Several hundred thousand processor hours have been logged on CANFAR in aid of science projects. In particular, six projects, including the NGVS, have been integral to CANFAR's development. Extensive analysis that would not be possible on a desktop, e.g., running the NGVS MegaPipe pipeline and fitting galaxy profiles, is now being performed, including by astronomers who are not data specialists.

Guide to Data Mining in Astronomy

Fast Data Mining Algorithms

Until recently, most data mining algorithms have scaled as $N^2$, rendering them intractable for modern datasets. However, fast libraries which implement data mining algorithms scaling as $N \log N$, or better, are now available.
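The difference in scaling can be seen in a deliberately simple one-dimensional toy problem: finding the distance from each value to its nearest neighbour. The brute-force version compares every pair of values, while the sorted version sorts once and inspects only adjacent values. Fast libraries typically achieve similar gains in many dimensions with tree structures. This one-dimensional example and its function names are assumptions chosen for brevity, not a description of SkyTree's algorithms.

```python
def nearest_neighbour_brute(values):
    """Nearest-neighbour distance for each value, comparing all pairs."""
    result = []
    for i, v in enumerate(values):
        # Every other value is examined, so the total work grows as N^2.
        result.append(min(abs(v - w) for j, w in enumerate(values) if j != i))
    return result


def nearest_neighbour_sorted(values):
    """Same result via one sort (N log N) and a linear pass over neighbours."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, idx in enumerate(order):
        gaps = []
        if rank > 0:
            gaps.append(values[idx] - values[order[rank - 1]])
        if rank < len(order) - 1:
            gaps.append(values[order[rank + 1]] - values[idx])
        # In sorted order the nearest neighbour is always adjacent.
        result[idx] = min(gaps)
    return result
```

Both functions assume at least two input values and return the same distances. Only the amount of work differs as the input grows.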

Thus the installation of such libraries on the CADC infrastructure will enable the practical use of these algorithms by astronomers who are not data mining specialists, enabling useful science.

While each specific usage of such software will remain science-driven, the underlying tools are not dataset-specific, hence the effort to make available such generic tools is appropriate.

We have installed and are running the SkyTree software (http://www.fast-lab.org), and have confirmed that its algorithms scale as required.

[Figure 1 diagram. Component labels recoverable from the image include: Astronomer's Desktop; CANFAR enabled applications; VO enabled applications; HTTP clients; Collaboration User Interfaces; Survey Web Page; Virtual Organization Management; Data Services; Processing Services; VOSpace; TAP; UWS; GMS; Data Web Service; Condor Web Service; Virtual Cluster Scheduler (Condor); Cloud Scheduler; Nimbus; Data Storage (iRODS); CADC Storage Cluster; CADC RDBMS; CADC Processing Cluster; Grid Storage Clusters; Grid Processing Clusters; Virtual Resource Advertisement.]
