Peter Fox
Data Science – CSCI/ERTH/ITWS-6961
Week 12, November 20, 2012
Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Contents

• Review of reading assignment

• Webs of data and semantic web

• Data on the web, linked data

• Deep web

• Data discovery

• Data integration

• Summary

• Next week

Reading

• Data Quality European Union Presentation

• ISO Technical Standards - General Reference

Webs of data

• Early Web - Web of pages

• http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html

• Semantic web started as a way to facilitate “machine accessible content”

– Initially it was available only to those with familiarity with the languages and tools, e.g. your parents could not use it

• Webs of data grew out of this

– One specific example is W3C’s Linked Open Data

Semantic Web

• http://www.w3.org/2001/sw/

• “The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF). See also the separate FAQ for further information.”

Terminology

• Semantic Web

– An extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation, www.semanticweb.org

– Primer: http://www.ics.forth.gr/isl/swprimer/

• Semantic Grid

– Semantic services to use the resources of many computers connected by a network to solve large scale computational/data problems

• Ontology (n.d.). The Free On-line Dictionary of Computing. http://dictionary.reference.com/browse/ontology

– An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.

Semantic Web Layers

http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/

Application Areas for SW

• Smart search

• Annotation (even simple forms), smart tagging

• Geospatial

• Implementing logic (rules), e.g. in workflows

• Data integration

• Verification …. and the list goes on

• Web services

• Web content mining with natural language parsing

• User interface development (portals)

• Semantic desktop

• Wikis - OntoWiki, SemanticMediaWiki

• Sensor Web

• Software engineering

• Explanation

Semantic Web Basics

• The triple: {subject-predicate-object}

Interferometer is-a optical instrument

Optical instrument has focal length

• W3C is the primary (but not sole) governing org.

– RDF

– OWL 1.0 and 2.0 - Web Ontology Language

• RDF – programming environment for 14+ languages, including C, C++, Python, Java, Javascript, Ruby, PHP,...(no Cobol or Ada yet ;-( )

• OWL programming for Java

• Closed World - where complete knowledge is known (encoded), AI relied on this

• Open World - where knowledge is incomplete/ evolving, SW promotes this
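The triple and the closed/open world distinction can be illustrated with a toy store. This is only a sketch in plain Python, not an RDF library: the tuple encoding, the function names, and the `aperture` fact are assumptions made for illustration. It shows the same missing statement being read as false under a closed world and as unknown under an open world.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("OpticalInstrument", "has", "focalLength"),
}

def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def ask_closed_world(store, triple):
    """Closed world: anything not stated is taken to be false."""
    return triple in store

def ask_open_world(store, triple):
    """Open world: anything not stated is unknown (None), not false."""
    return True if triple in store else None

known = ("Interferometer", "is-a", "OpticalInstrument")
unstated = ("Interferometer", "has", "aperture")  # hypothetical fact, never asserted

print(match(triples, p="is-a"))
print(ask_closed_world(triples, unstated))
print(ask_open_world(triples, unstated))
```

Returning `None` rather than `False` is one simple way to keep "not stated" distinct from "stated to be false".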

Ontology Spectrum

[Figure: the ontology spectrum, from least to most expressive: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints]

Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html

SW != ontologies on the web (!)

• Ontologies are important, but use them only when necessary as identified by use cases

• The Semantic Web is about integrating data on the Web; ontologies (and/or rules) are tools to achieve that when necessary

• SW ontologies != some big (central) ontology

– The ethos of the Semantic Web is on sharing, i.e., sharing possibly many small ontologies

– A huge, central ontology could be difficult to manage in terms of maintenance.

– Semantic web languages such as OWL contain primitives for equivalence and disjointness of terms and meta primitives for versioning info

• The practice:

– SW applications using ontologies mix a large number of ontologies and vocabularies (FOAF, DC, and others)

– the real advantage comes from this mix: that is also how new relationships may be discovered

• One readable background article from the metadata world is available at: http://www.metamodel.com/article.php?story=20030115211223271

Semantic Web Myths

• ‘the Semantic Web is a reincarnation of Artificial Intelligence on the Web’ (closed world versus open world)

• ‘it relies on giant, centrally controlled ontologies for "meaning" (as opposed to a democratic, bottom-up control of terms)’

• ‘one has to add metadata to all Web pages, convert all relational databases, and XML data to use the Semantic Web’

• ‘one has to learn formal logic, knowledge representation techniques, description logic, etc, to use it’

• ‘it is, essentially, an academic project, of no interest for industry’


Integrating Multiple Data Sources

• The Semantic Web lets us merge statements from different sources

• The RDF Graph Model allows programs to use data uniformly regardless of the source

• Figuring out where to find such data is a motivator for Semantic Web Services

[Figure: RDF graph. Nodes: #Ionosphere, #magnetic; literals: “TerrestrialIonosphere”, “100”, “km”; edges: name, hasCoordinates, hasLowerBoundaryValue, hasLowerBoundaryUnit. Different line & text colors represent different data sources.]
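Merging statements from different sources amounts to a set union when every statement is a triple. The sketch below uses plain Python tuples. Which source contributes which statement, and which edge attaches to which node, are assumptions here, since the slide conveys this only through colors in the figure.

```python
# Two hypothetical sources describing the same resource, #Ionosphere.
source_a = {
    ("#Ionosphere", "name", "TerrestrialIonosphere"),
    ("#Ionosphere", "hasCoordinates", "#magnetic"),
}
source_b = {
    ("#Ionosphere", "hasLowerBoundaryValue", "100"),
    ("#Ionosphere", "hasLowerBoundaryUnit", "km"),
    ("#Ionosphere", "name", "TerrestrialIonosphere"),  # also stated by source_a
}

def merge(*graphs):
    """Union of triple sets; identical statements collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged

def describe(graph, subject):
    """Collect every predicate and its sorted values for one subject."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, []).append(o)
    return {p: sorted(values) for p, values in description.items()}

merged = merge(source_a, source_b)
print(len(merged))
print(describe(merged, "#Ionosphere"))
```

After the merge a program reads the description of `#Ionosphere` the same way regardless of which source supplied each statement.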

Drill Down / Focused Perusal

• The Semantic Web uses Uniform Resource Identifiers (URIs) to name things

• These can typically be resolved to get more information about the resource

• This essentially creates a web of data analogous to the web of text created by the World Wide Web

• Ontologies are represented using the same structure as content

– We can resolve class and property URIs to learn about the ontology

[Figure: web of data on the Internet. Nodes: …#NeutralTemperature, ...#ISR, …#Norway, …#EISCAT, ...#FPI, ...#MillstoneHill; edges: measuredby, type, locatedIn, operatedby]

Statements about Statements

• The Semantic Web allows us to make statements about statements

– Timestamps

– Provenance / Lineage

– Authoritativeness / Probability / Uncertainty

– Security classification

– …

• This is an unsung virtue of the Semantic Web

[Figure: statement graph. Nodes: #Aurora, Red, #Danny’s, 20031031; edges: hascolor, hasSource, hasDateTime]

Ontologies Workshop, APL May 26, 2006
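One way to picture a statement about a statement is to treat the statement itself as a value that can appear as the subject of further triples. The sketch below is plain Python, not RDF reification syntax. It assumes that the source and the date in the figure annotate the color statement rather than #Aurora itself, and the `Statement` class is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen makes instances hashable, so they can sit in sets
class Statement:
    subject: str
    predicate: str
    obj: str

# The base statement from the figure.
claim = Statement("#Aurora", "hascolor", "Red")

# Annotations are triples whose subject is the statement itself.
annotations = {
    (claim, "hasSource", "#Danny’s"),
    (claim, "hasDateTime", "20031031"),
}

def about(statement, annotation_set):
    """Return everything that has been said about one statement."""
    return {p: o for s, p, o in annotation_set if s == statement}

print(about(claim, annotations))
```

Timestamps, provenance, or a security classification would be added the same way, as more annotation triples on `claim`.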

‘Collecting’ the ‘data’

• Part of the (meta)data information is present in tools ... but thrown away at output

– e.g., a business chart can be generated by a tool: it ‘knows’ the structure, the classification, etc. of the chart, but, usually, this information is lost

– storing it in web data would be easy!

• SW-aware tools are around (even if you do not know it...), though more would be good:

– Photoshop CS stores metadata in RDF in, say, jpg files (using XMP)

– RSS 1.0 feeds are generated by (almost) all blogging systems (a huge amount of RDF data!)

‘Collecting’ the ‘data’

• Scraping - different tools, services, etc., come around every day:

– get RDF data associated with images, for example: service to get RDF from flickr images

– service to get RDF from XMP

– XSLT scripts to retrieve microformat data from XHTML files

– RSS scraping in use in VO projects in Japan

– scripts to convert spreadsheets to RDF – e.g. see the tools, tutorials, demos at http://logd.tw.rpi.edu
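A spreadsheet-to-RDF script can be very small. The sketch below is not the logd.tw.rpi.edu tooling: the sample rows, the `http://example.org/` namespace, and the choice of one column as the subject key are all assumptions for illustration. It emits one N-Triples-style line per non-key cell, with every value written as a plain literal.

```python
import csv
import io

# Invented sample sheet; the rows are illustrative only.
SHEET = """station,country,instrument
EISCAT,Norway,ISR
MillstoneHill,USA,ISR
"""

BASE = "http://example.org/"  # hypothetical namespace

def rows_to_ntriples(text, key_column, base=BASE):
    """Turn each non-key cell into a '<subject> <predicate> "literal" .' line."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{base}{row[key_column]}>"
        for column, value in row.items():
            if column == key_column:
                continue
            # Escape backslashes and quotes so the literal stays well formed.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{base}{column}> "{escaped}" .')
    return lines

for line in rows_to_ntriples(SHEET, "station"):
    print(line)
```

Column headers become predicates and the key column becomes the subject. A real converter would also need to decide when a cell should be a URI rather than a literal.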

‘Collecting’ the ‘data’

• SQL - A huge amount of data in Relational Databases

– Although tools exist, it is not feasible to convert that data into RDF

– Instead: SQL ⇋ RDF ‘bridges’ are being developed: a query to RDF data is transformed into SQL on-the-fly

– Reading for this week, article by Berners-Lee and Sahoo et al.

– W3C RDB2RDF Working Group - http://www.w3.org/2001/sw/rdb2rdf/

– D2RQ / D2R Server

– Commercial solutions appearing

• NoSQL

• Other ‘graph’ forms…
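The idea of a SQL ⇋ RDF bridge can be sketched with the standard-library `sqlite3` module: the data stays in the relational table, and a triple pattern is translated into SQL only when it is asked. This is not how D2RQ or any RDB2RDF product is implemented. The table, the predicate-to-column mapping, and the `#instrument/` subject prefix are all hypothetical.

```python
import sqlite3

# Hypothetical mapping: predicate -> (table, key column, value column).
MAPPING = {
    "name": ("instrument", "id", "name"),
    "locatedIn": ("instrument", "id", "located_in"),
}

def pattern_to_sql(predicate, obj=None):
    """Translate the pattern (?s, predicate, obj) into SQL text plus parameters."""
    table, key, column = MAPPING[predicate]  # identifiers come only from the fixed mapping
    sql = f"SELECT {key}, {column} FROM {table}"
    params = ()
    if obj is not None:
        sql += f" WHERE {column} = ?"  # the value is bound, never interpolated
        params = (obj,)
    return sql + f" ORDER BY {key}", params

def query(conn, predicate, obj=None):
    """Run the translated SQL and present the rows as triples."""
    sql, params = pattern_to_sql(predicate, obj)
    return [(f"#instrument/{s}", predicate, o) for s, o in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instrument (id INTEGER PRIMARY KEY, name TEXT, located_in TEXT)")
conn.executemany(
    "INSERT INTO instrument VALUES (?, ?, ?)",
    [(1, "EISCAT", "Norway"), (2, "MillstoneHill", "USA")],  # invented rows
)

print(pattern_to_sql("locatedIn", "Norway"))
print(query(conn, "locatedIn", "Norway"))
```

No RDF copy of the table is ever materialized, which is the point the slide makes about why bridges are preferred over conversion.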

More Collecting

• RDFa (formerly known as RDF/A) extends XHTML by:

– extending the link and meta to include child elements

– adding metadata to any element (a bit like the class in microformats, but via dedicated properties)

• It is very similar to microformats, but with more rigor:

– it is a general framework (instead of an ‘agreement’ on the meaning of, say, a class attribute value)

– terminologies can be mixed more easily

• GRDDL - Gleaning Resource Descriptions from Dialects of Languages

• ATOM (follow on to RSS)

Linked open data

• http://linkeddata.org/guides-and-tutorials

• http://tomheath.com/slides/2009-02-austin-linkeddata-tutorial.pdf (we will look at some of these slides now, #1-25 and 30-37)

• And of course:

– http://logd.tw.rpi.edu/

– http://data-gov.tw.rpi.edu/wiki

http://richard.cyganiak.de/2007/10/lod/

Number of datasets in each version of the LOD cloud diagram:

• Latest: 295

• 2011-09-19: 295

• 2010-09-22: 203

• 2009-07-14: 95

• 2009-03-27: 93

• 2009-03-05: 89

• 2008-09-18: 45

• 2008-03-31: 34

• 2008-02-28: 32

• 2007-11-10: 28

• 2007-11-07: 28

• 2007-10-08: 25

• 2007-05-01: 12

2009-03-05 (Chris Bizer)

September 2011

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

(Class 2) Management

• Creation of logical collections

• Physical data handling

• Interoperability support

• Security support

• Data ownership

• Metadata collection, management and access.

• Persistence

• Knowledge and information discovery

• Data dissemination and publication

Data Management and WOD

• Is this the grand solution?

• How is the data managed?

• Found?

• Curated?

• What about the metadata?

• What problems are introduced?

• See: Parsons and Fox (2012): http://mp-datamatters.blogspot.com/

Data on the Web, Internet

• Data behind web services

• Data files on web sites

• We have covered data as service approaches

• Thinking you have found data when you have really only found information and metadata

• The real difference between this topic and the next one is:

– Access and dissemination

– Level of curation (and often description)

Data on the internet

• http://www.dataspaceweb.org/

• Data files on other protocols

– FTP

– RFTP

– GridFTP

– SABUL

– XMPP/AMQP

– Others…

Deep web

• Data behind web services

• Data behind query interfaces (databases or files)

• Introduces a different curation problem

The loose definition

• Something that a crawler cannot find and/or index

– Creates the other definition of shallow web

• Has many implications for discovery, access and use

• Curation is more complex to satisfy this definition, i.e. not a matter of just putting files ‘on the web’

• 50, 100, 1000 times the ‘shallow web’?

Managing (in) the deep web

• Sometimes, the deep web aspect of a data source can be due to extreme obscurity, language peculiarities, NO metadata, NO documentation

• There are no known studies of how effective data management (what you are learning) could change the percentage of deep/shallow

• Semantics are often put forward as a solution http://www.mkbergman.com/458/new-currents-in-the-deep-web/


Internet impacts on management

• Management of data that is… on the Internet!

• Web – ‘stateless’

• Curation, Preservation – highly stateful (by definition)

• You will hear terms such as digital curation and digital preservation (search on these) but what about internet curation and internet preservation (Internet Archive?)

• What others?

(Class 2) Management

• Creation of logical collections

• Physical data handling

• Interoperability support

• Security support

• Data ownership

• Metadata collection, management and access.

• Persistence

• Knowledge and information discovery

• Data dissemination and publication

Thus data frameworks are appearing

• Many – meaning they go beyond web sites, they incorporate many of the data management functions

• Initially syntactic – e.g. OPeNDAP, ADDE, ODATA, OODT

• Application oriented – e.g. virtual observatories

• Semantic – e.g. Virtual Solar-Terrestrial Observatory

• ALL of these are changing the nature of data management and role of data ‘providers’ cf. ?


BOM, Melbourne, VIC 20071015 (Fox)

Some Definitions

DAP = Data Access Protocol

– Model used to describe the data;

– Request syntax and semantics; and

– Response syntax and semantics.

OPeNDAP

– The software;

– Numerous reference implementations;

– Core/libraries and services (servers and clients).

OPeNDAP Inc.

– OPeNDAP is a 501(c)(3) non-profit corporation;

– Formed to maintain, evolve and promote the discipline neutral DAP that was the DODS core infrastructure.


Considerations with regard to the development of DAP and OPeNDAP

Many data providers

Many data formats

Many different semantic representations of the data

Many different security requirements

Many different client types


Broad Vision

A world in which a single data access protocol is used for the exchange of data between network based applications regardless of discipline.

A layer above TCP/IP providing for syntactic and semantic consistency not available in existing protocols such as FTP.

Practical Considerations

The broad vision:

Is syntactically achievable, but

Was not semantically achievable, at least not fully, but perhaps in the near term.


OPeNDAP Inc. Mission Statement

To maintain, evolve and promote a data access protocol (DAP) and reference implementation software (OPeNDAP) for the syntactically consistent exchange of data over the network.

The DAP should provide syntactic interoperability across disciplines and allow for semantic interoperability within disciplines.

The Data Access Protocol (DAP)

The DAP has been designed to be as general as possible without being constrained to a particular discipline or world view.

The DAP is a discipline neutral data access protocol; it is being used in astronomy, medicine, earth science,…

Provides data format and location, and data organization transparency

Is metadata neutral

DAP comparisons

• File-based

– GridFTP/FTP

– HTTP

– SRB

• Service-based

– Open Geospatial Consortium: WCS, WMS, WFS, …

– Virtual Observatory (Astronomy): SIAP, SSAP, STAP, …

Who is using DAP/OPeNDAP?

• Science examples

– PMEL with their Tsunami inundation modeling

– Ocean regional modelers to extract open boundary conditions

– Visualization of data sets using MATLAB/IDL/…

• Service examples

– Live Access Server

– Mapserver – OGC services and OPeNDAP data access (future)

– Digital Library Service - metadata and catalogue info


Data Access Protocol (DAP2) - Current

DAP2 currently a NASA/ESE ‘Standard’

Current servers implement DAP2

DAP 2 + XML responses (implemented)

DAP3

DAP4

DAP4 improvements over DAP3:

Additional datatypes

– Swath

– Blob - GIF, MPEG,…

Additional functionality

– Check sum

– Modulo

The additional datatypes will enable the DAP to be used in a wider variety of circumstances and are a direct response to users’ requests.

What DAP means to me

• Data access and transport

• Response types: DAP objects versus file type

– A DAP URL is essentially an HTTP URL with additional restrictions placed on the abs-path component.

– DAP2-URL = "http://" host [ ":" port ] [ abs-path ]

• abs-path = server-path data-source-id [ "." ext [ "?" query ] ]

• server-path = [ "/" token ]

• data-source-id = [ "/" token ]

• ext = "das" | "dds" | "dods"

– The server-path is the pathname to the server, whereas data-source-id is the pathname to the data.
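The URL grammar above can be exercised with a short sketch that assembles a DAP2 URL from its parts and takes one apart again. The host, paths, and query string are hypothetical, and the functions are illustrative helpers, not part of any OPeNDAP library. The splitter cannot tell server-path from data-source-id, because nothing in the URL text marks where one ends and the other begins.

```python
from urllib.parse import urlsplit

EXTENSIONS = ("das", "dds", "dods")  # the three ext values in the grammar

def build_dap2_url(host, server_path, data_source_id, ext, query=None, port=None):
    """Assemble "http://" host [":" port] server-path data-source-id "." ext ["?" query]."""
    if ext not in EXTENSIONS:
        raise ValueError(f"ext must be one of {EXTENSIONS}")
    authority = host if port is None else f"{host}:{port}"
    url = f"http://{authority}{server_path}{data_source_id}.{ext}"
    return url if query is None else f"{url}?{query}"

def split_dap2_url(url):
    """Split a DAP2 URL; server-path and data-source-id stay joined in 'path'."""
    parts = urlsplit(url)
    path, _, ext = parts.path.rpartition(".")
    if parts.scheme != "http" or ext not in EXTENSIONS:
        raise ValueError("not a DAP2 URL under this grammar")
    return {
        "host": parts.hostname,
        "port": parts.port,
        "path": path,
        "ext": ext,
        "query": parts.query or None,
    }

# Hypothetical server and dataset names.
url = build_dap2_url("example.org", "/opendap", "/data/sst.nc", "dds", query="sst")
print(url)
print(split_dap2_url(url))
```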

OPeNDAP V3 Architecture

[Diagram labels: Data, Client]

CGI-style access

– Uses web server HTTP protocol

– Several request and response types

– Reads data files, databases, etc., returns info

– May return DAP2 objects or other data

– Client can be application, web browser or specialized server/service

OPeNDAP V4 (Hyrax) Architecture

[Diagram labels: Client, OLFS, BES, Data]

OPeNDAP Lightweight Front end Server (OLFS)

– Receives requests and asks the BES to fill them

– Uses Java Servlets

– Does not directly ‘touch’ data

– Multi-protocol

Back End Server (BES)

– Reads data files, databases, etc., returns info

– May return DAP2 objects or other data

– Does not require web server
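The division of labour between the front end and the back end can be pictured with two small classes. This is a conceptual sketch only: the real OLFS is built on Java Servlets, and the class names, the in-memory store, and the `ascii`/`info` request suffixes used here are assumptions for illustration. It shows a front end that parses a request and delegates, and a back end that is the only component reading the data.

```python
class BackEndServer:
    """Stand-in for the BES: the only component that reads the data store."""

    def __init__(self, store):
        self._store = store

    def fetch(self, dataset, response_type):
        values = self._store[dataset]
        if response_type == "ascii":
            return ", ".join(str(v) for v in values)
        if response_type == "info":
            return f"{dataset}: {len(values)} values"
        raise ValueError(f"unsupported response type: {response_type}")


class FrontEndServer:
    """Stand-in for the OLFS: parses requests and forwards them; never touches data."""

    def __init__(self, backend):
        self._backend = backend

    def handle(self, request_path):
        # "/sst.ascii" -> dataset "sst", response type "ascii"
        dataset, _, response_type = request_path.strip("/").rpartition(".")
        return self._backend.fetch(dataset, response_type)


bes = BackEndServer({"sst": [14.2, 15.1, 15.8]})  # invented values
olfs = FrontEndServer(bes)
print(olfs.handle("/sst.ascii"))
print(olfs.handle("/sst.info"))
```

Because the front end holds no reference to the store, another request protocol could be added in front without changing how the data are read.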


Binaries Generated

There are approximately 80 binaries built on a nightly basis. They are built for the following platforms/operating systems:

Linux FC4 FC5

MacOS-X (universal binaries when possible)

Windows XP, win32

Java 1.5 (Tomcat 5.5)

IRIX (in four variants), Solaris, AIX, OSF

OPeNDAP System Elements

The OPeNDAP data access protocol is used by a variety of system elements.

Clients

Browser Interfaces

Data System Integrators (ODC)

Servers

Processing Servers

Aggregating Servers - OPeNDAP chains

Ancillary Information Services


Clients

Clients make requests and receive responses via the DAP.

Clients convert data from the OPeNDAP data model to the form required in the client application.

OPeNDAP Clients

[Figure: client applications reaching OPeNDAP over the Internet. Labels: netCDF C (Ferret, GrADS); netCDF Java (IDV, VisAD, ncBrowse); Matlab, Matlab Client; Access, Excel; IDL, IDL Client; ArcGIS; pyDAP; NCL, NCL Client; Web Browser; OPeNDAP Data Connector]

OC (2009)

A pure OPeNDAP C API (OC) for the client-side

Applications:

– DAP-aware ‘commands’ for commercial analysis programs (e.g., IDL, matlab)

– Scripting tools (e.g., Perl, python)


Browser interfaces


Servers

Servers receive requests and provide responses via the DAP.

Servers convert the data from the form in which they are stored to the DAP.

Servers provide for subsetting of the data and more.
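Server-side subsetting can be sketched as a function that slices stored values before anything is sent back. The `name[start:stride:stop]` form with an inclusive stop is modeled loosely on the array subsetting used in DAP constraint expressions, but the exact syntax handled here is an assumption and is not taken from the slides. The variable name and values are invented.

```python
import re

# One-dimensional hyperslab: name[start:stride:stop], stop inclusive (assumed form).
HYPERSLAB = re.compile(r"^(\w+)\[(\d+):(\d+):(\d+)\]$")

def subset(stored, constraint):
    """Return only the requested elements of one stored variable."""
    m = HYPERSLAB.match(constraint)
    if m is None:
        raise ValueError(f"unsupported constraint: {constraint}")
    name = m.group(1)
    start, stride, stop = (int(g) for g in m.groups()[1:])
    if stride < 1:
        raise ValueError("stride must be at least 1")
    # Python slices exclude the end index, so add 1 for an inclusive stop.
    return stored[name][start:stop + 1:stride]

stored = {"sst": [10, 11, 12, 13, 14, 15, 16]}  # invented values
print(subset(stored, "sst[0:2:4]"))
print(subset(stored, "sst[1:1:3]"))
```

The client receives three values instead of seven, which is the practical benefit of subsetting at the server rather than after transfer.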

OPeNDAP Servers

[Figure: OPeNDAP servers exposing stored data over the Internet. Labels: HDF5, HDF4, JDBC, FreeForm, FITS, CDF, CEDAR, netCDF, DSP, JGOFS, Tables, SQL, Flat Binary, General, ESML, CDM]

OPeNDAP Servers (specialized processing)

[Figure: specialized OPeNDAP servers exposing data over the Internet. Labels: GDS, GRIB, BUFR, CODAR, FDS, DAPPER, TDS, ESG, pyDAP, General, netCDF, OPeNDAP]


Servers

Servers may also provide other services

Directory traversal.

Browser-based form to build URL.

Ascii or other representations of data.

Metadata associated with the data.

Server side functions.

OPeNDAP Aggregation Servers

[Figure: OPeNDAP aggregation servers exposing data over the Internet. Labels: pyDAP, GDS, GRIB, BUFR, CODAR, FDS, DAPPER, TDS, JGOFS, ESG, General, netCDF, OPeNDAP]

The Aggregation Server: An Example

[Figure labels: Aggregation Server; DSP Data Set (files); netCDF Data Set (files); Matlab; Matlab Client; Local; OPeNDAP; DSP; HTML, GIF]

OPeNDAP’s Hyrax (‘Server4’)

• Uses a modular architecture to support different application-level protocols

– Data access using DAP2 (DAP3)

– Catalogs using THREDDS

– Browsing using HTML and ASCII

• Modules for data access

– Different file types

– Potential for database and scripting

• Modules for commands

– Commands provide varying operations for different protocols


OPeNDAP Lightweight Front end Server

[Figure labels: Request from client; Response to client; Request Formulation**; GridFTP DAP2; HTTP DAP2; ASCII output; HTML form; Info output; THREDDS; BES; SOAP-DAP (HTTP); DAP2 (GridFTP, HTTP)]

Page 65: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


BES

[Diagram: BES Framework. Labels: Network Protocol and Process start/stop activities (PPT*, Initialization/Termination); Data Store Interfaces (DAP2 Access; NetCDF3, HDF4, FreeForm, …; Data Catalogs); Commands** (BES Commands / XML Documents); Data]

*PPT is built in (other protocols)
**Some commands are built in

Page 66: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


Ancillary Information Service
• Current capability: Attributes only
• Client-side only
• Local and remote resources
• Local resource databases

The AIS enables users to augment the metadata for a data source in a controlled way without requiring write access to the original data. By using the DAP, users are also isolated from data format issues.

Page 67: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


AIS Server

[Diagram: a client linked with DAP software, the AIS server, the data source, and the AIS resource, connected by arrows numbered 0 to 3]

0. Client requests metadata from the AIS server (which appears no different from any other DAP server).
1. The AIS server gets metadata from the data source.
2. The AIS server gets the matching AIS resource using the AIS database and merges it into the metadata.
3. The AIS server returns the resulting metadata object.
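The numbered steps can be sketched as a small merge function. This is a minimal Python sketch under stated assumptions: metadata is modeled as nested dictionaries (variable name to attributes), the dataset URL and attribute values are invented, and letting ancillary attributes override source attributes is an illustrative policy choice, not something the slides specify.

```python
import copy


def ais_get_metadata(dataset_url, data_source, ais_database):
    """Return source metadata merged with any matching ancillary attributes."""
    # Step 1: get the metadata from the data source (copied, never modified).
    metadata = copy.deepcopy(data_source[dataset_url])
    # Step 2: find the matching AIS resource and merge it into the metadata.
    ancillary = ais_database.get(dataset_url, {})
    for variable, attributes in ancillary.items():
        metadata.setdefault(variable, {}).update(attributes)
    # Step 3: return the resulting metadata object to the client.
    return metadata


# Hypothetical data source and AIS database, for illustration only.
URL = "http://example.org/opendap/sst.nc"
data_source = {URL: {"sst": {"long_name": "sea surface temperature"}}}
ais_database = {URL: {"sst": {"units": "degC"}}}

merged = ais_get_metadata(URL, data_source, ais_database)
print(merged)
```

Because the function works on a copy, the original metadata stays untouched, which mirrors the point that users can augment metadata without write access to the source.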

Page 68: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

04/21/23 Bureau of Meteorology, Melbourne Australia


Lessons (Re)Learned

1. Modularity provides for flexibility

The more modular the underlying infrastructure, the more flexible the system. This is particularly important for network-based systems, for which the technology, both software and hardware, is changing rapidly.

Page 69: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


Lessons (Re)Learned

2. Data of interest will be stored in a variety of formats.

Regardless of how much one might want to define the format to be used by system participants, in the end the data will be stored in a variety of formats.

2a. The same is true of use metadata!

Page 70: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


Lessons Learned

3. Structural representation of sequence data sets is a major obstacle to interoperability

Care must be given to the organizational structure (as opposed to the format) of the data.

Page 71: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


Lessons Learned

7. The lack of a consistent structure for data inventories is a major obstacle to the use of distributed systems.

Page 72: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


Lesser Lessons Learned

9. Some surprises/observations encountered in the OPeNDAP effort

Metadata focus in the past has been on data discovery not on data use, but metadata for use is where it’s at.

Number of variables increases almost linearly with the number of data sets.

Users will take advantage of all of the flexibility offered by a system, sometimes to the disadvantage of all.

Incredible variability in the structural organization of data.

Page 73: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration


Lessons Learned

10. Time to maturity is on the order of 10 years, not 3

Developing new infrastructure takes time, particularly to iron out all of the %^*% little details.

Page 74: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Summary

Tetherless World Constellation


Page 75: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Data discovery
• Free text search on the internet/web

• Data portals

• What makes discovery work?
– For Deep Web?
– For Linked Data?

Page 76: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Data discovery
• What makes discovery work?
– Metadata
– Logical organization
– Attention to the fact that someone would want to discover it
– It turns out that file types are a key enabler or inhibitor to discovery

• What does not work?
– Result ranking using *any* conventional algorithms

Page 77: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Smart search
• Semantically aware search, e.g. http://noesis.itsc.uah.edu

• Faceted search, e.g.
– mspace (http://mspace.fm)
– jSpace (Clark and Parsia)
– Exhibit (MIT)
– S2S – e.g. International Open Government Dataset Catalog (IOGDC; http://logd.tw.rpi.edu)
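To make "faceted search" concrete, here is a minimal Python sketch. It is not how mspace, jSpace, Exhibit, or S2S are implemented; the catalog records, facet names, and both helper functions are invented for illustration. The idea it shows is the general one: the user narrows a result set by picking facet values, and the interface shows how many records remain under each remaining value.

```python
from collections import Counter

# Hypothetical catalog records, for illustration only.
CATALOG = [
    {"title": "Sea surface temperature", "format": "netCDF", "country": "US"},
    {"title": "School locations", "format": "CSV", "country": "US"},
    {"title": "Mineral deposits", "format": "CSV", "country": "AU"},
]


def filter_by_facets(records, **selected):
    """Keep only records matching every selected facet value."""
    return [
        record
        for record in records
        if all(record.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record[facet] for record in records if facet in record)


csv_records = filter_by_facets(CATALOG, format="CSV")
print([record["title"] for record in csv_records])
print(dict(facet_counts(csv_records, "country")))
```

Note that this only works when the records carry consistent metadata fields, which is the same dependency the discovery slides point to.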

Page 78: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

NOESIS


Page 79: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Search Application integration!

Page 80: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Deep web dashboards…


Page 81: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Intl. Open Govt. Data Cat. http://logd.tw.rpi.edu

Page 82: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Federated search
• “is the simultaneous search of multiple online databases or web resources and is an emerging feature of automated, web-based library and information retrieval systems. It is also often referred to as a portal or a federated search engine.” (Wikipedia)

• Libraries have been doing this for a long time (Z39.50, ISO 23950)

• Key is consistent search metadata fields (keywords)
• E.g. Geospatial One Stop http://www.geodata.gov
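The definition above can be sketched as follows. This is a minimal Python illustration, not Z39.50 or any real portal: the two in-memory "sources", their records, and the helper names are assumptions. It shows the two ingredients the slide names, a simultaneous query to several sources and a shared `keywords` field that every source agrees on.

```python
from concurrent.futures import ThreadPoolExecutor


def make_source(name, records):
    """Create a search function over one (in-memory, hypothetical) source."""

    def search(keyword):
        wanted = keyword.lower()
        return [
            {"source": name, "title": record["title"]}
            for record in records
            if wanted in (k.lower() for k in record["keywords"])
        ]

    return search


def federated_search(sources, keyword):
    """Send the same keyword query to every source at once and pool the hits."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda source: source(keyword), sources))
    return [hit for hits in result_lists for hit in hits]


# Two hypothetical catalogs that share the "keywords" metadata field.
library = make_source("library", [
    {"title": "Ocean atlas", "keywords": ["ocean", "maps"]},
])
portal = make_source("portal", [
    {"title": "Sea surface temperature", "keywords": ["Ocean", "temperature"]},
    {"title": "School locations", "keywords": ["education"]},
])

print(federated_search([library, portal], "ocean"))
```

If one source called the field `subjects` instead of `keywords`, the pooled search would silently miss its records, which is why consistent search metadata is the key requirement.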

Page 83: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Data integration
• “involves combining data residing in different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example).”
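A "unified view" over different sources can be illustrated with a small Python sketch. The two sources, their field names, their units, and the target schema are all invented for illustration; real integration also has to deal with semantics, provenance, and quality, which this sketch ignores.

```python
# Two hypothetical sources describing the same kind of observation
# with different field names and different units.
SOURCE_A = [{"station_id": "A1", "temp_c": 12.0}]
SOURCE_B = [{"id": "B7", "temperature_f": 68.0}]


def from_source_a(row):
    """Map a source A row onto the shared schema (already in Celsius)."""
    return {"station": row["station_id"], "temperature_c": row["temp_c"], "source": "A"}


def from_source_b(row):
    """Map a source B row onto the shared schema, converting F to C."""
    celsius = round((row["temperature_f"] - 32) * 5 / 9, 2)
    return {"station": row["id"], "temperature_c": celsius, "source": "B"}


def unified_view():
    """Present both sources to the user as one list with one schema."""
    return [from_source_a(r) for r in SOURCE_A] + [from_source_b(r) for r in SOURCE_B]


for record in unified_view():
    print(record)
```

Each mapping function encodes knowledge about one source (its names and units) that would otherwise have to be rediscovered by every user of the data.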

Page 84: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Data integration
• “Data integration appears with increasing frequency as the volume and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. In management circles, people frequently refer to data integration as "Enterprise Information Integration" (EII)” (Wikipedia)

• Is this a data science/management challenge (rhetorical question)?

Page 85: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Value Chain – data.gov – Integration Context

[Diagram labels: Supply Side; Community of Suppliers; Acquire Data; Build Dataset; Publish Dataset; Supply Chain Management – no geo integration focus; Data.gov; Enable Discovery; Enable Use; Use Side; Community of Users; Connect; Discover; Participate; Access and Interoperability Focused]

Page 86: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Simple supply side questions that are very hard to answer?

• Who produces the information I need?
• Are they “the” recognized authority? How can I tell?
• How often will it be re-published?
– Is the supply predictable and reliable? Can I count on it?
• Do the data have a geospatial characteristic?
– What are its geospatial qualities (specs) and provenance?
– Is it consistently defined in its meaning?
– What is the scope of its coverage?
• Will the data be maintained?
– Geometry and models
– Attributes and metadata
• Where do I get it and in what forms?

Page 87: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

They should not have to ask if it has been integrated?


Page 88: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

What is stopping us from answering these basic questions?

Page 89: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Barriers to integration

• What is preventing our data from being integrated?
– Acquisition
  • Uncoordinated data acquisition strategies at national level
  • Barrier between business data and geospatial data (e.g. schools, minerals)
  • Few means to broker and optimize requirements from consumers
– Production
  • Quality of our metadata and when and how we get it
  • Unclear operational roles in a national data framework (NSDI)
  • Absence of a granular or meaningful trustworthy data chain of authority?
  • Absence of a schedule to communicate what is going to be happening?

Page 90: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Barriers

• What is preventing our data from being integrated?
– Data Management
  • Cataloging
  • Fundamental Semantics (A16)
– Policy, Organization and Culture
  • Federated political and government collection and production environments
  • Divergent data quality requirements – national, state, local, regional
  • Stove-piped national Geodetic policy (A16)
  • Shifting market expectations and tolerances for lower quality in favor of access?
  • Legacy institutional barriers and thinking
  • They are national assets, not just a program’s data.

Page 91: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Where are the problems occurring in the Value Chain?

[Diagram labels: Supply Side; Community of Suppliers; Acquire Data; Build / Intra Dataset Integration; Publish Dataset; Supply Chain Management, Data Integration Focused; Data.gov; Enable Discovery; Enable Use; Use Side; Community of Users; Connect; Discover; Participate; Access and Interoperability Focused]

[Problem callouts: Gap in planning view of Acquisition; Gap in what gets integrated; Ambiguous Cataloging and semantics; Downstream Data Integration $$$]

Page 92: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

We resemble this!

Page 93: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Why we need to think differently!

Page 94: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Aiding data integration
• Standards – formats for sure but also
– Metadata
– Semantics
– Designing for integratability!

• The goal should be to REDUCE the curation barrier to data integration

• What would you do? What have you done?

Page 95: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Summary
• Theme of data management in the chaotic and enabling environment of the web, internet

• Emergence of frameworks that encompass some aspects of data management

• Unlocking data in an integratable way is an immense challenge

• Anything/everything you can do by following what you have learned in this course will help

• http://tw.rpi.edu/web/Workshop/Community/GeoData2011

Page 96: 1 Peter Fox Data Science – CSCI/ERTH/ITWS-6961 Week 12, November 20, 2012 Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

What is next
• Nov. 27 – project presentations

• Final assignment to be handed in

• Reading for this week:
– Semantic Deep Web, James Geller, Soon Ae Chun, and Yoo Jung An
– The Deep Web (Internet Tutorials)
– Digital Image Resources on the Deep Web
– Parsons and Fox: Is Data Publication the Right Metaphor?