1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13, November 29, 2010


Foundations VII: Data life-cycle, Mining and Knowledge Discovery

Deborah McGuinness and Joanne Luciano

With Peter Fox and Li Ding

CSCI-6962-01

Week 13, November 29, 2010


Contents

• Review assignment

• More advanced topics; life cycle, mining and adding to your knowledge base

• Summary

• Next week (your presentations)


Semantic Web Methodology and Technology Development Process

• Establish and improve a well-defined methodology vision for Semantic Technology based application development

• Leverage controlled vocabularies, etc.

[Figure: development process diagram. Stages shown: Use Case; Small Team, mixed skills; Analysis; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Use Tools; Science/Expert Review & Iteration; Develop model/ontology; Evaluation]


Data->Information->Knowledge


Data Life Cycle

• Life cycle (we will define these shortly)
– Acquisition, curation, preservation
– Long term stewardship

• Data and information – we use this to get to the discussion of knowledge
– Content; the values
– Context; the background, setting, etc.
– Structure; organization and form

• Representation/storage
– Analog
– Digital (and born digital)


Why it is important

• 1976 NASA Viking mission to Mars (A. Hesseldahl, “Saving Dying Data,” Forbes, Sep. 12, 2002. [Online]. Available: http://www.forbes.com/2002/09/12/0912data_print.html)

• 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675)

• R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt


Why (cont’d)

• e-science aims to derive new knowledge from (possibly) multiple sources of data

• The data needs to be persistent, available and usable

• The rate of creation of knowledge representations is increasing; they are a representation of the known ‘facts’ based on the data

• We studied KR creation, engineering, evolution and iteration

• Knowledge needs a life-cycle as well


At the heart of it

• Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc.

• Inability to know the inter-relations, assumptions and missing information

• We’ll look at a (data) use case for this shortly

• But first we will look at what, how and who in terms of the full life cycle


What to collect?

• Documentation
– Metadata
– Provenance

• Ancillary Information

• Knowledge


Who does this?

• Roles:
– Data creator
– Data analyst
– Data manager
– Data curator


How it is done


Acquisition


Curation


Preservation

• Usually refers to the full life cycle

• Archiving is a component

• Stewardship is the act of preservation

• Intent is that ‘you can open it any time in the future’ and that ‘it will be there’

• This involves steps that may not be conventionally thought of

• Think 10, 20, 50, 200 years… looking historically gives some guide to future considerations


Some examples and experience

• NASA

• NOAA

• Library community

• Note:
– Mostly in relation to publications, books, etc., but some for data
– Note that knowledge is in publications, but the structure and form are meant for humans, not computers, despite advances in text analysis
– Very little for the type of knowledge we are considering: in machine accessible form


Back in the day...

SEEDS Working Group on Data Lifecycle
• Second Workshop Report
o https://esdswg.eosdis.nasa.gov/documents/W2_Bothwell.pdf
o Many LTA recommendations
• Earth Sciences Data Lifecycle Report
o https://esdswg.eosdis.nasa.gov/documents/lta_prelim_rprt2.pdf
o Many lessons learned from USGS experience, plus some recommendations
• SEEDS Final Report (2003) - Section 4
o https://esdswg.eosdis.nasa.gov/documents/FinRec.pdf
o Final recommendations vis a vis data lifecycle

MODIS Pilot Project
• GES DISC, MODAPS, NOAA/CLASS, ESDIS effort
• Transferred some MODIS Level 0 data to CLASS


Mostly Technical Issues

• Data Preservation
o Bit-level integrity
o Data readability
• Documentation
• Metadata
• Semantics
• Persistent Identifiers
• Virtual Data Products
• Lineage Persistence
• Required ancillary data
• Applicable standards
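Bit-level integrity is commonly checked by recording a checksum when a file enters the archive and recomputing it later. The sketch below is one minimal way to do that in Python. The choice of SHA-256 and the function names `sha256_of` and `verify_fixity` are assumptions made for illustration; the slides do not name a particular algorithm or tool.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, recorded_digest):
    """True when the file still matches the digest recorded at ingest (hypothetical workflow)."""
    return sha256_of(path) == recorded_digest
```

A check like this only covers the first sub-item (the bits are unchanged); data readability still depends on being able to interpret the format.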


Mostly Non-Technical Issues

• Policy (constrained by money…)
• Front end of the lifecycle
o Long-term planning, data formats, documentation...
• Governance and policy
• Legal requirements
• Archive to archive transitions

• Money (intertwined with policy)
• Cost-benefit trades
• Long-term needs of NASA Science Programs
• User input
o Identifying likely users
• Levels of service
• Funding source and mechanism


HDF4 Format "Maps" for Long Term Readability

C. Lynnes, GES DISC
R. Duerr and J. Crider, NSIDC
M. Yang and P. Cao, The HDF Group

Use case: a real live one; deals mostly with structure and (some) content

HDF = Hierarchical Data Format
NSIDC = National Snow and Ice Data Center
GES = Goddard Earth Sciences
DISC = Data and Information Services Center


In the year 2025...

A user of HDF-4 data will run into the following likely hurdles:
• The HDF-4 API and utilities are no longer supported...
o ...now that we are at HDF-7
• The archived API binary does not work on today's OS's
o ...like Android 3.1
• The source does not compile on the current OS
o ...or is it the compiler version, gcc v. 7.x?
• The HDF spec is too complex to write a simple read program...
o ...without re-creating much of the API

What to do?


HDF Mapping Files

Concept: create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now)
• XML
• Stored separately from, but close to the data files
• Includes
o internal metadata
o variable info
o chunk-level info
  – byte offsets and length
  – linked blocks
  – compression information

Task funded by ESDIS project
• The HDF Group, NSIDC and GES DISC


Map sample (extract)

<hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields" objID="xid-DFTAG_NDG-5">
  <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer">
    0 0
  </hdf4:Attribute>
  <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
  <hdf4:Dataspace ndims="2">
    180 360
  </hdf4:Dataspace>
  <hdf4:Datablock nblocks="1">
    <hdf4:Block offset="27266625" nbytes="20582" compression="coder_type=DEFLATE" />
  </hdf4:Datablock>
</hdf4:SDS>
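The point of the map is that a reader needs only the byte offsets, lengths, compression type and datatype, not the HDF-4 library. The Python sketch below illustrates that idea for an element shaped like the extract above. It rests on several assumptions that the slide does not confirm: the namespace URI `urn:example:hdf4-map` is a placeholder (the extract does not show the real declaration), each DEFLATE block is assumed to be an independent zlib stream, only integer datatypes are handled, and the function name `read_sds` is hypothetical.

```python
import struct
import xml.etree.ElementTree as ET
import zlib

# Placeholder namespace URI (assumption): the slide extract omits the real one.
HDF4_NS = "urn:example:hdf4-map"
NS = {"hdf4": HDF4_NS}
# struct codes for signed integers, keyed by the map's dtypeSize attribute.
INT_CODES = {1: "b", 2: "h", 4: "i", 8: "q"}


def read_sds(sds_xml, data_path):
    """Read one 2-D integer SDS described by a map element; returns a list of rows."""
    sds = ET.fromstring(sds_xml)
    dtype = sds.find("hdf4:Datatype", NS)
    shape = [int(n) for n in sds.find("hdf4:Dataspace", NS).text.split()]

    raw = bytearray()
    with open(data_path, "rb") as handle:
        for block in sds.findall("hdf4:Datablock/hdf4:Block", NS):
            handle.seek(int(block.get("offset")))
            piece = handle.read(int(block.get("nbytes")))
            # Assumption: a DEFLATE block is a self-contained zlib stream.
            if "DEFLATE" in block.get("compression", ""):
                piece = zlib.decompress(piece)
            raw += piece

    order = ">" if dtype.get("byteOrder") == "BE" else "<"
    code = INT_CODES[int(dtype.get("dtypeSize"))]
    count = len(raw) // struct.calcsize(order + code)
    values = struct.unpack(f"{order}{count}{code}", bytes(raw[:count * struct.calcsize(order + code)]))

    ncols = shape[-1]
    return [list(values[i:i + ncols]) for i in range(0, count, ncols)]
```

Nothing here depends on HDF-specific software, which is the property the mapping files are meant to preserve for future readers.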


Status and Future

Status
• Map creation utility (part of HDF)
• Prototype read programs
o C
o Perl
• Paper in TGRS special issue
• Inventory of HDF-4 data products within EOSDIS

Possible Future Steps
• Revise XML schema
• Revise map utility and add to HDF baseline
• Implement map creation and storage operationally
o e.g., add to ECS or S4PA metadata files


Examples of NASA context


Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign

Contextual Information:

• Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.)

• Instrument/sensor calibration data and method

• Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product)

• Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product


7th Joint ESDSWG meeting, October 22, Philadelphia, PA
Data Lifecycle Workshop sponsored by the Technology Infusion Working Group


Contextual Information (continued):

• Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive

• Quality assessment information

• Validation record, including identification of validation data sets

• Data structure and format, with definition of all parameters and fields

• In the case of earth based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record

• A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set

• Information received back from users of the data set or product


However…

• Even groups like NASA do not have a governance model for this work

• Governance: definition

• Stakeholders:
– NASA for integrity of their data holdings (is it their responsibility?)
– Public for value for and return on investment
– Scientists for future use (intended and unintended)
– Historians


NOAA


Library community

• OAIS

• OAI (PMH and ORE)


Metadata Standards - PREMIS

• Provide a core preservation metadata set with broad applicability across the digital preservation community

• Developed by an OCLC and RLG sponsored international working group
– Representatives from libraries, museums, archives, government, and the private sector.

• Based on the OAIS reference model


Metadata Standards - PREMIS

• Maintained by the Library of Congress
• Editorial board with international membership
• User community consulted on changes through the PREMIS Implementers Group
• Version 1 was released in June 2005
• Version 2 was just released


PREMIS - Entity-Relationship Diagram

[Figure: entity-relationship diagram linking the five entities listed here]

• Intellectual Entities: “a coherent set of content that is reasonably described as a unit”. For example, a web site, data set or collection of data sets
• Objects: “a discrete unit of information in digital form”. For example, a data file
• Rights: “assertions of one or more rights or permissions pertaining to an object or an agent”, e.g., copyright notice, legal statute, deposit agreement
• Events: “an action that involves at least one object or agent known to the preservation repository”, e.g., created, archived, migrated
• Agents: “a person, organization, or software program associated with preservation events in the life of an object”, e.g., Dr. Spock donated it


PREMIS - Types of Objects

• Representation - “the set of files needed for a complete and reasonable rendition of an Intellectual Entity”

• File

• Bitstream - “contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes”
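The PREMIS entities and object types can be pictured as a small set of linked records. The Python sketch below is only an illustration of how the pieces relate. The class and field names (`PremisObject`, `category`, `agents_for`, and so on) are invented for this example and are not the semantic unit names defined in the PREMIS data dictionary.

```python
from dataclasses import dataclass, field
from typing import List

OBJECT_CATEGORIES = ("representation", "file", "bitstream")


@dataclass
class Agent:
    name: str  # a person, organization, or software program


@dataclass
class Event:
    event_type: str  # e.g. "created", "archived", "migrated"
    agents: List[Agent] = field(default_factory=list)


@dataclass
class Rights:
    statement: str  # e.g. a copyright notice or deposit agreement


@dataclass
class PremisObject:
    identifier: str
    category: str  # one of OBJECT_CATEGORIES
    events: List[Event] = field(default_factory=list)
    rights: List[Rights] = field(default_factory=list)

    def __post_init__(self):
        if self.category not in OBJECT_CATEGORIES:
            raise ValueError(f"unknown object category: {self.category}")


@dataclass
class IntellectualEntity:
    title: str
    objects: List[PremisObject] = field(default_factory=list)


def agents_for(entity):
    """Sorted names of all agents linked, through events, to an entity's objects."""
    return sorted({a.name for o in entity.objects for e in o.events for a in e.agents})
```

For instance, an `IntellectualEntity` for a data set could hold one `PremisObject` of category "file" whose events record who created and who archived it.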


Metadata Standards - METS

• Metadata Encoding and Transmission Standard

• An initiative of the Digital Library Federation

• Based on the Making of America II project


METS - What’s Its Purpose?

• Provides the means to convey the metadata necessary for
– management of digital objects within a repository
– exchange of objects between repositories (or between repositories and their users)

• Designed to facilitate
– shared development of information management tools/services
– interoperable exchange of digital materials


METS - What’s its status?

• Version 1.6 was released in Sept. 2007

• Maintained by the Library of Congress

• International Editorial Board

• NISO registration as of 2006


Backup Materials - MODIS Contextual Info


Instrument/sensor characteristics


Processing Algorithms & Scientific Basis


Ancillary Data


Processing History including Source Code


Quality Assessment Information


Validation Information


Other Factors that can Influence the Record


Bibliography


Information from users

• Data Errors found

• Quality updates

• Things that need further explanation

• Metadata updates/additions?

• Community contributed metadata????


Back to why you need to…

• E-science uses data and it needs to be around when what you create goes into service and you go on to something else

• That’s why someone on the team must address life-cycle (data, information and knowledge – we’ll get to the latter shortly) and work with other team members to implement organizational, social and technical solutions to the requirements


What would you need to do?


(Digital) Object Identifiers

• Object is used here so as not to pre-empt an implementation, e.g. resource, sample, data, catalog

• Examples:
– DOI
– URI
– XRI
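As a small illustration of how two of these identifier types relate, a DOI can be turned into a resolvable URI by prefixing it with the public resolver at https://doi.org/. The Python sketch below assumes that resolver; the function name `doi_to_uri` and the accepted input prefixes are choices made for this example, and the validation is deliberately loose.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"
# Common ways a DOI is written; stripped before building the URI.
KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def doi_to_uri(doi):
    """Normalize a DOI string and return the corresponding resolver URI."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check only: DOIs start with "10." and have a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + quote(doi, safe="/")
```

The identifier itself stays the same when the resolver or the landing location changes, which is what makes it useful for long-term references.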


Versioning


Mining

• We will start with data but the ideas apply to information and knowledge bases as well

• Definition

• History

• Our interest


SAM: Smart Assistant for Earth Science Data Mining

PI: Rahul Ramachandran

Co-I: Peter Fox, Chris Lynnes, Robert Wolf, U.S. Nair


Science Motivation

• Study the impact of natural iron fertilization processes, such as dust storms, on plankton growth and subsequent DMS production
– Plankton plays an important role in the carbon cycle
– Plankton growth is strongly influenced by nutrient availability (Fe/Ph)
– Dust deposition is an important source of Fe over the ocean
– Satellite data is an effective tool for monitoring the effects of dust fertilization

• Analysis entails
– Mine MODIS L1B data for dust storm events and identify the swath of area influenced by the passage of the dust storms.
– Examine correlations between fertilization, plankton growth and DMS production


Current Analysis Process

• MODIS aerosol products don’t provide speciation
• Locate and download all the data to their local machine
• Write code to classify and detect dust accurately [3-4 month effort]
• Write code to classify and detect other dust aerosols [3-4 month effort]
• Write code to segment the detected region in order to account for advection effect and correlation coefficient [2 months effort]


Analysis with SAM

• Create a workflow to perform classification using many different state of the art classifiers on distributed data

• Create a workflow to segment detected regions using image processing services on distributed data

Bottom line:
• Scientist does not have to write all the code to perform the analysis
• Can compose workflows that utilize distributed data/services
• Can share the workflow with others to collaborate, reuse and modify


Conducting Science using Internet as the Primary Computer


Mash-ups Example: Yahoo Pipes


Data Mining in the ‘new’ Distributed Data/Services Paradigm


Too many choices!!

• And that’s only part of the toolkit
• ADaM-IVICS toolkit has over 100 algorithms


SAM Objectives

• Improve usability of Earth Science data by existing data mining services for research, by incorporating semantics into the workflow composition process.
– Semantic search capable of mapping a conceptual task
– Assistance in mining workflow composition
– Verification that services are connected in a semantically correct fashion


Ontology Use


Semi-automated Workflow Composition

Filtering services based on data format


Semi-automated Workflow Composition

Filtering service options based on both data format and task selected
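The two filtering steps can be sketched as simple queries over a service catalog, with a check that adjacent services in a workflow agree on data format. This is a simplified Python illustration, not SAM's implementation: SAM uses an ontology for these decisions, whereas this sketch compares plain strings, and the `Service` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Service:
    name: str
    task: str           # conceptual task, e.g. "classification"
    input_format: str   # format the service can read
    output_format: str  # format the service produces


def candidate_services(catalog: List[Service], data_format: str,
                       task: Optional[str] = None) -> List[Service]:
    """Filter by data format, and additionally by task when one is selected."""
    return [
        s for s in catalog
        if s.input_format == data_format and (task is None or s.task == task)
    ]


def is_well_connected(workflow: List[Service]) -> bool:
    """True when each service's output format matches the next service's input format."""
    return all(a.output_format == b.input_format
               for a, b in zip(workflow, workflow[1:]))
```

Filtering first by format and then by task mirrors the narrowing of options shown on the two slides; the connection check stands in for the semantic verification step.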


Semi-automated Workflow Composition

Final Workflow


Science Motivation

• Study the impact of natural iron fertilization processes, such as dust storms, on plankton growth and subsequent DMS production
– Plankton plays an important role in the carbon cycle
– Plankton growth is strongly influenced by nutrient availability (Fe/Ph)
– Dust deposition is an important source of Fe over the ocean
– Satellite data is an effective tool for monitoring the effects of dust fertilization


Hypothesis

• In remote ocean locations there is a positive correlation between the area averaged atmospheric aerosol loading and oceanic chlorophyll concentration

• There is a time lag between oceanic dust deposition and the photosynthetic activity
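Both parts of the hypothesis can be examined with a lagged correlation: correlate the aerosol series with the chlorophyll series shifted by 0, 1, 2, ... time steps and see where the correlation peaks. The Python sketch below uses the Pearson coefficient as one reasonable choice. The slides do not state which statistic or lag range the study used, so the function names and the `max_lag` parameter are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlations(aerosol, chlorophyll, max_lag):
    """Correlate aerosol[t] with chlorophyll[t + lag] for lag = 0..max_lag."""
    return {
        lag: pearson(aerosol[:len(aerosol) - lag], chlorophyll[lag:])
        for lag in range(max_lag + 1)
    }


def best_lag(aerosol, chlorophyll, max_lag):
    """Lag (in time steps) at which the correlation is highest."""
    scores = lagged_correlations(aerosol, chlorophyll, max_lag)
    return max(scores, key=scores.get)
```

A peak at a non-zero lag would be consistent with a delay between dust deposition and the photosynthetic response, though correlation alone does not establish the physical mechanism.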


Primary source of ocean nutrients

[Figure: map of nutrient sources. Labels: WIND BLOWN DUST; SAHARA; SEDIMENTS FROM RIVER; OCEAN UPWELLING]


Factors modulating dust-ocean photosynthetic effect

[Figure: diagram with labels SAHARA, DUST, SST, CLOUDS, NUTRIENTS, CHLOROPHYLL]

Page 69: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Objectives

• Use satellite data to determine whether atmospheric dust loading and phytoplankton photosynthetic activity are correlated.

• Determine physical processes responsible for observed relationship

Page 70: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Preliminary Results

Page 71: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Data and Method

• Data sets obtained from SeaWiFS and MODIS during 2000 – 2006 are employed

• MODIS-derived AOT

Page 72: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

The areas of study

1 - Tropical North Atlantic Ocean
2 - West coast of Central Africa
3 - Patagonia
4 - South Atlantic Ocean
5 - South Coast of Australia
6 - Middle East
7 - Coast of China
8 - Arctic Ocean

*Figure: annual SeaWiFS chlorophyll image for 2001, with the study areas marked 1 to 8

Page 73: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Tropical North Atlantic Ocean: dust from Sahara Desert

[Figure: panels of chlorophyll versus AOT. The values printed on the panels (presumably correlation coefficients) are all negative, ranging from about -0.09 to -0.86; the per-panel layout is not recoverable from the extraction.]

Page 74: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Arabian Sea: dust from Middle East

[Figure: panels of chlorophyll versus AOT. The values printed on the panels (presumably correlation coefficients) are all positive, ranging from about 0.37 to 0.85; the per-panel layout is not recoverable from the extraction.]

Page 75: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Summary and future work

• Dust impacts the ocean's photosynthetic activity: correlations are positive in some areas and NEGATIVE in others, especially in the Saharan basin

• Hypothesis for explaining observations of negative correlation: in areas that are not nutrient limited, dust reduces photosynthetic activity

• We also need to consider the effects of clouds and ocean currents, and to isolate the effects of dust. The MODIS AOT product includes contributions from dust, DMS, biomass burning, etc.

Page 76: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Case for SAM

• MODIS aerosol products don’t provide speciation
• Why is performing this data analysis hard?
– Need to classify and detect dust accurately
– Need to classify and detect other aerosols (e.g., DMS) accurately
– Need to segment the detected region in order to account for advection effects and correlation coefficient
• What will SAM provide?
– Provide capability to create a workflow to perform classification
– Provide capability to create a workflow to segment detected regions

Bottom line:
• Scientist does not have to write all the code to perform the analysis
• Can compose workflows that utilize distributed data/services
• Can share the workflow with others to collaborate, reuse and modify

Page 77: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Knowledge Discovery

• Has a broad meaning
– Finding ontologies
– Creating new knowledge from
  • Previous knowledge
  • New sources (data, information)
  • Modeling

• We’ll look at a mining approach as an example


Page 78: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,


Ingest/pipelines: problem definition

• Data is coming in faster and in greater volumes, outstripping our ability to perform adequate quality control

• Data is being used in new ways, and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision

• We often fail to capture, represent and propagate manually generated information that needs to go with the data flows

• Each time we develop a new instrument, we develop a new data ingest procedure, collect different metadata and organize it differently. It is then hard to use with previous projects

• The task of event determination and feature classification is onerous and we don't do it until after we get the data

Page 79: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

20080602 Fox VSTO et al.


Page 80: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,


Use cases

• Who (person or program) added the comments to the science data file for the best vignetted, rectangular polarization brightness image from January 26, 2005, 1849:09 UT taken by the ACOS Mark IV polarimeter?

• What were the cloud cover and atmospheric seeing conditions during the local morning of January 26, 2005 at MLSO?

• Find all good images on March 21, 2008.

• Why are the quick look images from March 21, 2008, 1900 UT missing?

• Why does this image look bad?
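Questions such as "who added the comments" can only be answered if each processing step leaves a record behind. The sketch below shows one possible shape for such records and a lookup over them. The `ProvenanceRecord` fields, the agent names, and the log entries are all hypothetical and do not reflect the schema used by the projects on these slides.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of one action applied to one data artifact."""
    artifact: str
    action: str
    agent: str  # a person or a program
    timestamp: datetime


def who_did(records, artifact, action):
    """Return the sorted, de-duplicated agents that applied `action` to `artifact`."""
    return sorted(
        {r.agent for r in records if r.artifact == artifact and r.action == action}
    )


# Made-up log entries, for illustration only.
log = [
    ProvenanceRecord("image_001", "acquired", "instrument", datetime(2020, 1, 1, 18, 0)),
    ProvenanceRecord("image_001", "added_comment", "observer_a", datetime(2020, 1, 1, 18, 5)),
    ProvenanceRecord("image_001", "calibrated", "pipeline_v1", datetime(2020, 1, 1, 18, 30)),
    ProvenanceRecord("image_002", "added_comment", "pipeline_v1", datetime(2020, 1, 1, 19, 0)),
]

print(who_did(log, "image_001", "added_comment"))
```

The "why" questions in the list need more than this flat log, for example links from an image to the observing conditions and to the steps that failed, which is where a richer provenance model would come in.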

Page 81: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,


Page 82: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,


Page 83: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Summary

• (Data) life cycle – key actions
– A
– B

• Mining (data, information and knowledge) – key results and work in progress
– A
– B

• Facilitating new discoveries
– A


Page 84: 1 Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13,

Next week

• This week's assignments:
– Reading: None
– Assignment: None

• Next class (week 14 – December 6):
– Class presentation III: Use case iteration

• Term assignment due – December 6 before class

• Office hours this week – by appointment or drop in
– Winslow 2104 (Professor McGuinness)
– Winslow 2143 (Professor Luciano)

• Questions?
