The Next Generation Virgo Cluster Survey (NGVS) is one of six science projects integrated into CANFAR during its development phase. It is a 104 square degree survey of the Virgo Cluster of galaxies in 5 optical bands (ugriz), using the MegaCam camera on the Canada-France-Hawaii Telescope, with a limiting magnitude of 25.7 (10σ point source) in the g band. The survey will revolutionize the science of this prototypical high density environment in the local universe.

The survey data size, while not extremely large by modern standards, still represents a substantial dataset that will be amenable to data mining. The expected final dataset is 50 TB, processed by two independent pipelines: MegaPipe at CADC, and TERAPIX at the Institut d'Astrophysique in Paris.

Currently, the problem of deducing cluster membership absent spectroscopic redshifts remains unsolved.

K-Means Clustering

The aim of K-means is to optimally assign points in a parameter space to clusters, in an unsupervised manner:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

for observations $x_j$, $k$ clusters $S_i$, with cluster means $\mu_i$.

Here, we perform dimension reduction (currently PCA) and run the SkyTree kmeans algorithm on the reduced data.

We describe ongoing astroinformatics work at the Canadian Astronomy Data Centre (CADC). With a collection of over 0.5 petabytes of information, and serving nearly 3000 astronomers worldwide, CADC is one of the world's largest astronomy data centres. Its staff combines astronomers and computer specialists, and the resulting interaction between experts is ideal for fostering developments within astroinformatics. One of CADC's ongoing goals is to retain science drivers as the primary motivator at each step of the process, from the receipt of raw data from telescopes, to the release of that data, and its use by scientists. Thus, the developments remain guided by maximal benefit to the astronomy community.

The Canadian Advanced Network for Astronomical Research (CANFAR) is a University of Victoria and CADC project that builds on the existing CADC infrastructure to provide the storage, processing, and analysis tools needed to enable astronomers to perform data-intensive astronomy on current and next generation datasets. CANFAR provides a Virtual Cluster, accessed via a Virtual Machine environment, over which the user has complete control, and access to Cloud Computing on the Compute Canada Grid. Its services are compliant with the International Virtual Observatory Alliance standards. Hence, rather than build a new infrastructure for a project such as a sky survey, an individual or collaboration may utilize CANFAR.

Although the infrastructure provided by CANFAR is vital, its main focus is on the basic storage and processing of data. To apply methods such as knowledge discovery in databases (KDD), machine learning, and data mining, further software must be run. By analogy to the argument that CANFAR can provide the generic hardware portions of a data processing pipeline, we implement fast, scalable data mining algorithms that simplify the generic portions of KDD within current and future datasets, further enabling practical data-intensive astronomy.

We show an example of the use of the SkyTree software to perform K-means clustering to determine which galaxies in the Next Generation Virgo Cluster Survey (NGVS) are cluster members. This problem is unsolved within the survey.

Astroinformatics at the Canadian Astronomy Data Centre Nicholas M. Ball

Canadian Astronomy Data Centre, Herzberg Institute of Astrophysics, Victoria, BC, Canada http://sites.google.com/site/nickballastronomer [email protected]

Introduction

Virgo Cluster Membership via K-Means

Dowler P., et al., 2008, Common Archive Observation Model. ADASS XVII, ASP Conference Proceedings, Vol. 394, eds. Argyle R.W., Bunclark P.S., Lewis J.R., pp. 426–429

Gaudet S., the CADC team, 2011, Virtualization and Grid Utilization within the CANFAR Project. ADASS XX, ASP Conference Proceedings, Vol. 442, eds. Evans I.N., Accomazzi A., Mink D.J., Rots A.H., pp. 61–64

This research used the facilities of the Canadian Astronomy Data Centre, operated by the National Research Council of Canada with the support of the Canadian Space Agency. Funding for CANFAR was provided by CANARIE via the Network Enabled Platforms Supporting Virtual Organisations program.

The IVOA Interest Group in Knowledge Discovery in Databases (led by G. Longo) aims to deploy practical data mining algorithms of use to astronomers:

“We will develop and test scalable data mining algorithms and the accompanying new standards for VO interfaces and protocols, so that these algorithms can be discovered and used transparently within VO science workflows or in standalone data exploration applications.”

KDD-IG Charter, 2010

As part of achieving these aims, we are constructing an online guide to data mining in astronomy. Prior to this guide, no such tool existed. The guide is designed for the astronomer who is interested in using the methods of data mining to improve their science return, but whose main priority remains getting their science done. The guide is currently situated at http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide .

References & Acknowledgments

Figure 2: K-means clustering results, showing normalized cluster membership for several subsets of objects as a function of cluster number (35 clusters in this case).

Astroinformatics at CADC

Astroinformatics will become the only way to render future datasets comprehensible. It will become increasingly impractical to download data, hence an infrastructure is required in which the data analysis can be done in situ, without the need for downloading and local processing.

A significant proportion of the KDD component of CADC's astroinformatics has been in the context of the science requirements of the Next Generation Virgo Cluster Survey (NGVS), guided by the author's science interests, e.g., the galaxy luminosity function. We show an example here.

Canadian Astronomy Data Centre

The Canadian Astronomy Data Centre (CADC), based at the Herzberg Institute of Astrophysics in Victoria, BC, is one of the largest astronomy data centres in the world. Founded in 1986, it currently holds over 500 TB of data, and has served over 100 TB to more than 5000 distinct IP addresses worldwide. CADC combines the expertise of astronomers and computer specialists, and is hence ideal for realizing the scientific benefits of astroinformatics.

The clustering uses the first three principal components (PCs). The SkyTree kmeans routine includes a facility to determine the optimal number of clusters via cross-validation.
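As a rough illustration of the clustering step, the sketch below implements plain K-means (Lloyd's algorithm) in pure Python on a handful of three-dimensional points standing in for PC1-PC2-PC3 coordinates. It is not the SkyTree implementation: the function names, the toy data, and the random initialization are all assumptions made for this example, and it does not include the cross-validation facility.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm. Returns (means, labels) for a list of points."""
    rng = random.Random(seed)
    # Initial means: k distinct data points chosen at random (an assumption).
    means = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(n_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means are too
        labels = new_labels
        # Update step: each mean becomes the centroid of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous mean
                means[i] = [sum(c) / len(members) for c in zip(*members)]
    return means, labels


def within_cluster_ss(points, means, labels):
    """The K-means objective: summed squared distance to the assigned mean."""
    return sum(squared_distance(p, means[lab]) for p, lab in zip(points, labels))


# Toy data: two well-separated groups in a 3-D (PC1, PC2, PC3)-like space.
group_a = [(0.1 * i, 0.0, 0.1 * i) for i in range(5)]
group_b = [(10.0 + 0.1 * i, 10.0, 10.0 - 0.1 * i) for i in range(5)]
points = group_a + group_b

means, labels = kmeans(points, k=2)
print(labels)
print(within_cluster_ss(points, means, labels))
```

A production run on a survey catalogue would need a scalable implementation and a principled choice of k; this version only shows the assignment and update steps that the objective implies.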

Results

Initial results show that the procedure is able to discern meaningful groups of galaxies within the PC1-PC2-PC3 space (Figure 2), e.g., cluster members and background galaxies confirmed by spectroscopy, saturated stars, and bright, low surface brightness artifacts.
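One way to read a plot such as Figure 2 is as the fraction of each object subset that falls in each cluster. The sketch below computes that quantity, assuming "normalized" means normalized per subset; the subset names and the toy labels are hypothetical and are not taken from the NGVS catalogue.

```python
from collections import Counter


def normalized_membership(labels, subset_names, n_clusters):
    """For each subset, return the fraction of its objects in each cluster."""
    counts = {}
    for lab, name in zip(labels, subset_names):
        counts.setdefault(name, Counter())[lab] += 1
    result = {}
    for name, counter in counts.items():
        total = sum(counter.values())
        # One fraction per cluster number, in order 0 .. n_clusters - 1.
        result[name] = [counter[c] / total for c in range(n_clusters)]
    return result


# Hypothetical cluster labels and the subset each object belongs to.
labels = [0, 0, 1, 2, 2, 2]
subsets = ["member", "member", "background", "star", "star", "member"]

fractions = normalized_membership(labels, subsets, n_clusters=3)
for name, row in fractions.items():
    print(name, row)
```

A subset concentrated in one or two clusters (a row with a few large fractions) suggests that the clustering separates that class of object from the rest.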

Future Work

Obvious refinements include:

• More thorough removal of image/catalogue artifacts

• Testing of non-linear dimension reduction, e.g., kernel PCA

• Use of prior knowledge, e.g., constrained K-means, guided by object spectra

• Probabilistic cluster membership

• Detailed characterization of the objects contained in each cluster

CANFAR & CVO

The Canadian Advanced Network for Astronomical Research (CANFAR; http://canfar.phys.uvic.ca), led by Chris Pritchet at the University of Victoria, and contracted to CADC, is:

“a project ... to provide the delivery, processing, storage, analysis, and distribution of astronomical datasets of unprecedented size. ... The project builds on CADC's existing infrastructure to provide IVOA-compliant tools and services for astronomers, and access to Cloud Computing on the Compute Canada Grid, via a Virtual Machine environment.”

CANFAR Statement of Work, 2008

CANFAR Usage

CANFAR provides a Virtual Cluster, accessible to an individual user or collaboration (Virtual Organization). Each user operates within their own Virtual Machine environment, over which they have complete control. This provides access to CANFAR services, and the Cloud Computing resources of Compute Canada (Figure 1).

VO-Compliant Web Services

Outward-facing CANFAR services use IVOA-compliant protocols, e.g., TAP and VOSpace for data services, UWS for processing, and TLS and X.509 grid certificates for security. Inward-facing infrastructure builds on existing CADC resources. For cloud computing, Cloud Scheduler, Condor, Nimbus, and iRODS are used.

In addition, CADC data are available in IVOA-compliant form via the Common Archive Observation Model of the Canadian Virtual Observatory (Dowler et al. 2008).
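To make the role of TAP concrete, the sketch below builds the URL for a synchronous TAP query using the standard REQUEST, LANG, and QUERY parameters. The service address, table name, and column name are placeholders invented for this example, not actual CADC or CANFAR endpoints, and no network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_tap_sync_url(base_url, adql):
    """Return the URL of a synchronous TAP query for the given ADQL string."""
    params = {
        "REQUEST": "doQuery",  # TAP 1.0 request type for running a query
        "LANG": "ADQL",        # query language
        "QUERY": adql,         # the query text, URL-encoded by urlencode
    }
    return base_url.rstrip("/") + "/sync?" + urlencode(params)


# Hypothetical service address; substitute a real TAP endpoint before use.
TAP_BASE = "https://example.org/tap"
# Hypothetical table and column names, for illustration only.
ADQL = "SELECT TOP 10 * FROM survey.catalogue WHERE g_mag < 25.7"

url = build_tap_sync_url(TAP_BASE, ADQL)
print(url)
```

Because the interface is a standard one, the same function would apply to any TAP-compliant service; only the base address and the ADQL text change.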

Figure 1: CANFAR infrastructure: Virtual Organizations, such as individual users or survey teams, access CANFAR and Compute Canada resources via a Virtual Machine environment. Green boxes show new components resulting from CANFAR. From Gaudet et al. (2011).

CANFAR Science

Several hundred thousand processor hours have been logged on CANFAR in aid of science projects. In particular, six projects, including the NGVS, have been integral to CANFARʼs development. Extensive analysis that would not be possible on a desktop, e.g., the NGVS MegaPipe pipeline, fitting galaxy profiles, etc., is now being performed, including by non-data specialists.

Guide to Data Mining in Astronomy

Fast Data Mining Algorithms

Until recently, most data mining algorithms have scaled as $N^2$, rendering them intractable for modern datasets. However, fast libraries which implement data mining algorithms scaling as $N \log N$, or better, are now available.
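The difference between the two scalings can be seen in a toy problem: finding the smallest separation between any two values in a list. The sketch below compares an all-pairs search with a sort-based one. It is only an illustration of the scaling argument on one-dimensional integer data chosen for this example, and it is not the method used by SkyTree or any other library.

```python
import random


def smallest_gap_all_pairs(values):
    """Compare every pair of values: N(N-1)/2 comparisons, i.e. N^2 scaling."""
    best = float("inf")
    comparisons = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            comparisons += 1
            best = min(best, abs(values[i] - values[j]))
    return best, comparisons


def smallest_gap_sorted(values):
    """Sort first (N log N), then compare only neighbours (N - 1 comparisons)."""
    ordered = sorted(values)
    best = min(b - a for a, b in zip(ordered, ordered[1:]))
    return best, len(ordered) - 1


# 500 distinct integers drawn with a fixed seed so the run is repeatable.
values = random.Random(0).sample(range(10**6), 500)

gap_slow, n_slow = smallest_gap_all_pairs(values)
gap_fast, n_fast = smallest_gap_sorted(values)
print(gap_slow == gap_fast, n_slow, n_fast)
```

For 500 values the all-pairs search makes 124750 comparisons against 499 after sorting, and the ratio grows linearly with the number of objects, which is why $N^2$ methods become impractical for survey-sized catalogues.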

Thus the installation of such libraries on the CADC infrastructure will enable the practical use of these algorithms by astronomers who are not data mining specialists, enabling useful science.

While each specific usage of such software will remain science-driven, the underlying tools are not dataset-specific, hence the effort to make available such generic tools is appropriate.

We have installed and are running the SkyTree software (http://www.fast-lab.org), and have confirmed that its algorithms scale as required.

[Figure 1 diagram. Labelled components: Astronomer's Desktop (CANFAR-enabled applications, HTTP clients, VO-enabled applications); Collaboration User Interfaces (wiki, survey web page); Data Services (VOSpace, TAP); Processing Services (UWS, GMS); Virtual Organization Management; Storage Resources (CADC storage cluster, CADC RDBMS, grid storage clusters, iRODS data storage); Processing Resources (CADC processing cluster, Condor virtual cluster scheduler, Cloud Scheduler, Nimbus grid clusters).]