
To the Problem of Organizing Heterogeneous Information Olga Zhelenkova 1,2, Vladimir Vitkovskij 1,2 (1) SAO RAS (Nizhnij Arkhyz), (2) ITMO University (Saint-Petersburg)


Big Data Across Disciplines: In Search of Symbiosis. 3-5 November, 2014. Groningen, Netherlands


The science use case: a multi-band study of a sample of radio sources (I)

A series of blind surveys of a 20′ sky strip centered on δ1981 = +04°57′ ± 20′ (SS433) was carried out with the RATAN-600 radio telescope in 1980-1999 at 3.9 GHz.

(1) The RC (RATAN COLD) catalogue was obtained from observations of the deep COLD survey in 1980 (a, b). The steep-spectrum RC sample has been studied since the early 1990s (c, d).

(2) The refined RC (RCR) catalogue was obtained from the blind survey observations of 1980-1999 (e). 562 RCR radio sources lie in the range α2000 = [07h, 17h] (~100 sq. deg.), which intersects the SDSS and FIRST surveys; the catalogue is 90% complete at S3.9GHz > 15 mJy (S1.4GHz > 28 mJy) for αmean ~ 0.52 (Sν ~ ν^(-α)). The sources are almost completely identified (96%), with 260 objects identified for the first time (f).

a - Parijskij et al., 1992A&AS...96..583P; b - Parijskij et al., 1993A&AS...98..391P; c - Goss et al., 1992AZh....69..673G; d - Parijskij et al., 2010ARep...54..675P; e - Soboleva et al., 2010AstBu..65...42S; f - Zhelenkova et al., 2013AstBu..68...26Z.

The science use case: a multi-band study of a sample of radio sources (II)

Collect all freely available data for the optical identification and investigation of the RCR sample;
collect, visualize and statistically analyse the data with VO tools: ALADIN (1), TOPCAT (2), VizieR (3), NED (4), ds9 (5), CasJobs (6), SkyView (7);
organize the collected data (PostgreSQL + web interface) for further study (8).

(1) Bonnarel et al., 2000A&AS..143...33B; (2) Taylor, 2005ASPC...347..29; (3) Ochsenbein et al., 2000A&AS..143...23O; (4) Mazzarella et al., 2007ASPC..376..153M; (5) Joye & Mandel, 2003ASPC..295..489J; (6) O’Mullane et al., 2005cs........2072O; (7) McGlynn, 2007ASPC..382...43M; (8) http://www.sao.ru/fetch/cgi-bin/SkyObj/rcrn.cgi
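The optical identification step amounts to a positional cross-match between radio positions and optical catalogue entries. The following is only a minimal sketch of that idea in plain Python, not the procedure used for the RCR sample: the function names, the example coordinates and the 5″ search radius are all assumptions made for illustration.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def nearest_counterpart(source, candidates, radius_arcsec):
    """Return (candidate, separation) for the closest candidate inside the radius, else None."""
    best = None
    for cand in candidates:
        sep = angular_separation_arcsec(source["ra"], source["dec"], cand["ra"], cand["dec"])
        if sep <= radius_arcsec and (best is None or sep < best[1]):
            best = (cand, sep)
    return best


# Hypothetical positions in degrees, for illustration only.
radio = {"name": "radio-1", "ra": 150.0, "dec": 4.95}
optical = [
    {"name": "opt-A", "ra": 150.0005, "dec": 4.9501},
    {"name": "opt-B", "ra": 150.0100, "dec": 4.9600},
]
match = nearest_counterpart(radio, optical, radius_arcsec=5.0)
```

In practice the search radius would be chosen from the positional errors of the catalogues being compared, and ambiguous cases would still need visual inspection with image overlays.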

The science use case: a multi-band study of a sample of radio sources (III)

Catalogue/survey | Spectral range | Resolution, error | Limit

Radio
VLSS | 74 MHz | 80″ | 500 mJy
TXS | 365 MHz | ~10″ | 150 mJy
NVSS | 1.4 GHz | 45″ | 2.5 mJy
FIRST | 1.4 GHz | 5.4″ | 1 mJy
GB6 | 4.85 GHz | 3.5′ | 28-37 mJy

mm, submm
Planck | 30, 44, 70, 100, 143, 217, 353, 545, 857 GHz | 33′, 27′, 13′, 10′, 7′, 5′, 4′, 4′, 4′ | 0.5, 0.6, 0.5, 0.3, 0.2, 0.2, 0.2, 0.4, 0.7 Jy

IR
WISE | 3.4, 4.6, 12, 22 μm | 0.2′, 0.1′, 0.1′, 0.1′; 0.2″ | 16.6, 15.6, 11.3, 8.0 mag
2MASS | J, H, K | 0.2″, 10% | 15.8, 15.1, 14.3 mag
LAS UKIDSS | Y, J, H, K (H+K) | <0.1″ | 20.5, 20.0, 18.8, 18.4 mag

Optical
DSS-II | blue, red, IR | | ~21 mag
SDSS | u, g, r, i, z (g+r+i) | ±0.1″ | 22.0, 22.2, 22.2, 21.3, 20.5 mag (~23 mag)
USNO-B1 | B1, R1, B2, R2, I | 0.2″, 0.3 mag | V = 21 mag
GSC 2.3.2 | J, F, N | 0.2″-0.28″, 0.13-0.22 mag | RF = 20 mag

The science use case: a multi-band study of a sample of radio sources (IV)

The science use case: a multi-band study of a sample of radio sources (V)

The science use case: problems of handling many parameters and images

1st stage: VLSS, NVSS, FIRST, GB6 and DSS (USNO-B1, GSC 2.3), SDSS DR1, 2MASS, also NED;
2nd stage: added LAS UKIDSS, used a newer SDSS release;
3rd stage: added WISE, used newer SDSS and LAS UKIDSS releases;
4th stage: added Planck, used SDSS DR10 and LAS UKIDSS DR9.

1) 9 catalogues (~110 parameters) and images from 7 digital surveys (12 maps, contour overlays);

2) 10 catalogues (~130 parameters) and images from 8 digital surveys (16 maps, contour overlays);

3) 11 catalogues (~150 parameters) and images from 9 digital surveys (18 maps, contour overlays). Results: the RCR sources are almost completely identified (96%), with ~45% of the objects identified for the first time;

4) 12 catalogues (>150 parameters) and images from 10 digital surveys (28 maps, contour overlays).

The science use case: what we need

Thanks to the efforts of the International Virtual Observatory Alliance, we now have excellent tools and web services for accessing and visualizing data, such as ALADIN, SAOImage DS9, TOPCAT, VizieR, NED and so on. Other problems, however, still need further work.

i. Easy access to data (request and download): ++
ii. Visualization of different types of data: +
iii. Keep the collected data up to date: ?
iv. Can easily manipulate collected data: ?
v. Interchange and publish new knowledge about objects: ?
vi. Store together different data and knowledge about an object: ?

Available projects: keep the collected data up to date

VO Data Keeping-up Agent (VOdka) is a web application for keeping users’ data up to date [O. Laurino & S. Smareglia, ASP 442, 571 (2011)]. It aims to:
• let users be notified asynchronously when new data become available,
• give users a quick look at what data relevant to their research interests can be found in the Virtual Observatory,
• make the users’ queries and results persistent.
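The idea of persistent queries with notification can be illustrated by storing the query text together with a digest of its last result and comparing digests on each re-run. This is a minimal sketch under stated assumptions, not VOdka's actual implementation: the class name, the query string and the in-memory "catalogue" are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Order-independent SHA-256 digest of a query result given as a list of dict rows."""
    ordered = sorted(rows, key=lambda row: json.dumps(row, sort_keys=True))
    canonical = json.dumps(ordered, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class StoredQuery:
    """A query kept as text together with the digest of its last known result."""

    def __init__(self, text):
        self.text = text
        self.last_digest = None

    def refresh(self, run_query):
        """Re-run the query; return True when the result differs from the stored one."""
        digest = fingerprint(run_query(self.text))
        changed = digest != self.last_digest
        self.last_digest = digest
        return changed


# Hypothetical data service: an in-memory list stands in for a remote catalogue.
catalogue = [{"id": 1, "mag": 20.5}]
query = StoredQuery("cone ra=150.0 dec=4.95 radius=5arcsec")
first = query.refresh(lambda text: catalogue)      # nothing stored yet, so reported as changed
unchanged = query.refresh(lambda text: catalogue)  # same rows as before
catalogue.append({"id": 2, "mag": 21.1})
updated = query.refresh(lambda text: catalogue)    # a new row has appeared
```

A real service would run the stored query on a schedule and send a notification whenever `refresh` reports a change.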

Available projects: interchange and publish new knowledge about objects with annotations

AstroDAS (Bose et al. 2006IPAW..1445..154B): annotating astronomy catalogues to provide astronomers with the ability to share their assertions about matching celestial objects.

AstroDAbis (Gray N. et al., arXiv:1111.6116, http://astrodabis.roe.ac.uk) provides a stand-off annotation service for astronomical catalogue entries. The service will implicitly create URI names for every object in the catalogues.

SKUA (Semantic Knowledge Underpinning Astronomy, N. Gray & T. Linde, ASP , 2009, https://code.google.com/p/skua/) is a web-application for a semantic infrastructure for astronomy based on the organisation of annotation services.

ADSASS (ADS All-Sky Survey, Pepe A. et al., arXiv:1111.3983) is an ongoing effort aimed at turning the NASA Astrophysics Data System (ADS) into a data resource based on ideas of geo-information systems.

Available formats: store together different data about an object. FITS

FITS is a simple and easily understood self-describing format which holds its information in metadata (header) and data blocks. Metadata are captured as key-value pairs. A header may or may not be followed by a data block. The first header is called the “primary” header, and subsequent headers are known as “extensions”. The standard defines rules for developing new data structures as extensions (Pence et al., A&A 524, A42 (2010)).
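To make the key-value header structure concrete, the sketch below builds a data-less primary header by hand: each keyword record is an 80-character ASCII card, the header ends with an END record, and the whole unit is padded with blanks to a multiple of 2880 bytes. It is an illustration only, covering logical, integer and short string values; the helper names and the OBJECT value are assumptions, and real work would normally rely on an established FITS library.

```python
BLOCK = 2880  # FITS header and data units are padded to multiples of 2880 bytes
CARD = 80     # every header record ("card") is exactly 80 ASCII characters


def card(keyword, value, comment=""):
    """Format one fixed-format keyword record (logical, integer or string value)."""
    if len(keyword) > 8:
        raise ValueError("keyword longer than 8 characters")
    if isinstance(value, bool):  # checked before int, because bool is a subclass of int
        field = ("T" if value else "F").rjust(20)
    elif isinstance(value, int):
        field = str(value).rjust(20)
    else:
        text = str(value).replace("'", "''")  # a quote inside a string is doubled
        field = ("'" + text.ljust(8) + "'").ljust(20)
    record = keyword.upper().ljust(8) + "= " + field
    if comment:
        record += " / " + comment
    if len(record) > CARD:
        raise ValueError("card longer than 80 characters")
    return record.ljust(CARD)


def header(cards):
    """Join the cards, append the END record and pad with blanks to a full block."""
    text = "".join(cards) + "END".ljust(CARD)
    padded_length = -(-len(text) // BLOCK) * BLOCK  # round up to a multiple of BLOCK
    return text.ljust(padded_length).encode("ascii")


primary = header([
    card("SIMPLE", True, "conforms to the FITS standard"),
    card("BITPIX", 8),
    card("NAXIS", 0, "no data array follows the primary header"),
    card("EXTEND", True, "extensions may follow"),
    card("OBJECT", "RCR-demo", "hypothetical object name"),
])
```

An extension would start with its own header, opened by an XTENSION keyword in place of SIMPLE, and be appended after the primary header and data unit.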

Available formats: store together different data about an object. VOTable

VOTable is designed as a flexible storage and exchange format for tabular data. Its interoperability is encouraged through the use of XML. VOTable has built-in features for big data and Grid computing. It allows metadata and data to be stored separately, with the remote data linked (VOTable Format Definition V.1.093, http://cdsweb.u-strasbg.fr/doc/VOTable/1.092/votable.htx).
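A small table in the VOTable layout, with FIELD elements carrying the metadata and TABLEDATA rows carrying the values, can be sketched with the Python standard library. The sketch omits the XML namespace and schema declarations a complete document would carry; the table name, the version string, the column names and the values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_votable(fields, rows):
    """Return a VOTABLE element with FIELD metadata and inline TABLEDATA rows."""
    votable = ET.Element("VOTABLE", version="1.3")  # version string is an assumption
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE", name="demo")
    for attributes in fields:
        ET.SubElement(table, "FIELD", attributes)  # column metadata: name, type, unit, UCD
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return votable


# Hypothetical columns and values.
fields = [
    {"name": "ra", "datatype": "double", "unit": "deg", "ucd": "pos.eq.ra"},
    {"name": "dec", "datatype": "double", "unit": "deg", "ucd": "pos.eq.dec"},
    {"name": "flux", "datatype": "double", "unit": "mJy", "ucd": "phot.flux.density"},
]
rows = [(150.0, 4.95, 28.0), (151.2, 4.9, 35.5)]
xml_text = ET.tostring(build_votable(fields, rows), encoding="unicode")
```

For large tables the inline TABLEDATA block would be replaced by a link to remote binary data, which is the separation of metadata and data mentioned above.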

Summary

Astronomy is very good at freely sharing data, but poorer at sharing knowledge.

The fundamental problem remains: data and knowledge are stored in different places. Archives contain only basic observational data, whereas all the astrophysical interpretation of that data is contained in journal papers.

The next step, which may make discovery and research more effective, is to keep together all the data collected about the object or objects of a researcher’s interest, and to add annotations and a textual representation of the queries (so that the requests can be repeated for updates).

ALADIN stack as a new FITS extension (or a VOTable or HDF5 variant)

The internal format of ALADIN is called a stack. It is a flat, XML-like file that represents all the information collected about an object (images and tables) as planes, with appropriate descriptions and the results of requests.

This data format proved convenient when working with heterogeneous information collected for the study of the objects of interest to the researcher.

The structure of the ALADIN stack can be represented as a new FITS extension.
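The kind of container discussed here, with planes of images and tables, their descriptions, the query text behind each plane, and annotations kept in one place, can be sketched as a small data structure. This is not the ALADIN stack format and not a proposed FITS extension: the class names, the fields and the JSON serialization are assumptions used only to show the idea.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Plane:
    """One layer of collected information: an image reference or a table."""
    name: str
    kind: str      # "image" or "table"
    source: str    # survey or catalogue the plane came from
    query: str     # textual form of the request, kept so it can be repeated later
    content: dict = field(default_factory=dict)


@dataclass
class ObjectStack:
    """Everything gathered for one object, plus free-text annotations."""
    object_id: str
    planes: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def to_json(self):
        """Serialize the whole stack, including nested planes, to JSON text."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Rebuild a stack from the text produced by to_json."""
        raw = json.loads(text)
        planes = [Plane(**plane) for plane in raw["planes"]]
        return cls(raw["object_id"], planes, raw["annotations"])


# Hypothetical object, queries and file name.
stack = ObjectStack("RCR-demo")
stack.planes.append(Plane("radio contours", "image", "FIRST",
                          "cutout ra=150.0 dec=4.95 size=2arcmin",
                          {"file": "first_cutout.fits"}))
stack.planes.append(Plane("optical photometry", "table", "SDSS",
                          "cone ra=150.0 dec=4.95 radius=5arcsec",
                          {"r": 21.3}))
stack.annotations.append("hypothetical note: faint optical counterpart")
restored = ObjectStack.from_json(stack.to_json())
```

Keeping the query text next to each plane is what allows the collected data to be refreshed later without reconstructing the original requests by hand.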

Thank you for your attention!

This work was supported by the Russian Foundation for Basic Research, grants 12-07-00503-a and 14-07-00361-a.