1 Peter Fox Data Science – ITEC/CSCI/ERTH 4350/6350 Week 2, September 3, 2013 Data and information acquisition (curation) and metadata - management


Peter Fox

Data Science – ITEC/CSCI/ERTH 4350/6350

Week 2, September 3, 2013

Data and information acquisition (curation) and metadata - management


Admin info (keep/print this slide)
• Class: ITEC/CSCI/ERTH-4350/6350 (note new number)
• Hours: 9am-11:50am Tuesday
• Location: Lally 104
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by appt)
• Contact location: Winslow 2120 (or Lally 207A)
• TA: Saurabh Sharma; [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataScience/2013
– Schedule, lectures, syllabus, reading, assignments, etc.


Review from last week

• Data

• Information

• Knowledge

• Metadata/ documentation

• Data life-cycle


Reading Assignments
• Changing Science: Chris Anderson

• Rise of the Data Scientist

• Where to draw the line

• What is Data Science?

• An example of Data Science

• If you have never heard of Data Science

• BRDI activities

• Data policy

• Self-directed study (answers to the quiz)

• Fourth Paradigm, Digital Humanities


And now a more detailed e.g.
• Or … part of my path to data science


Why do we (I) care about the Sun?

• The Sun’s radiation is the single largest external input to the Earth’s atmosphere and thus the Earth system.

• And it varies – in time and wavelength

• Also, for a long time - Solar Energetic Particles and the near Earth environment (and more recently the effect on clouds?)

• Observations commenced ~1940s, with a resurgence in the late 1970s

• Two quantities of scientific interest
– Total Solar Irradiance - TSI in W m⁻² (adjusted to 1 AU)
– Solar Spectral Irradiance - SSI in W m⁻² µm⁻¹ or W m⁻² nm⁻¹

• Measure, model, understand -> construct, predict
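
The "adjusted to 1 AU" step for TSI can be sketched with the inverse-square law: irradiance falls off with the square of the distance from the Sun, so a value measured at distance d is scaled by (d / 1 AU)². This is a minimal illustration, not the processing used for any particular instrument; the function name and the sample reading are hypothetical.

```python
# One astronomical unit in kilometres (IAU 2012 definition).
AU_KM = 149_597_870.7


def adjust_to_1au(irradiance_w_m2: float, sun_distance_km: float) -> float:
    """Scale an irradiance measured at sun_distance_km to its 1 AU equivalent.

    Uses the inverse-square law: I(1 AU) = I(d) * (d / 1 AU)^2.
    """
    return irradiance_w_m2 * (sun_distance_km / AU_KM) ** 2


# Hypothetical reading taken slightly closer to the Sun than 1 AU.
measured = 1400.0            # W m^-2, illustrative value only
distance = 0.99 * AU_KM      # km
print(round(adjust_to_1au(measured, distance), 2))
```

A reading taken closer than 1 AU is scaled down, and one taken farther away is scaled up, so values from different times of year become comparable.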


Solar radiation as a function of altitude


Spectral synthesis components and flow


Summary of Results
• First comprehensive ‘database’ of:
– Empirical models of the thermodynamic structure of the solar atmosphere suitable for different solar magnetic activity levels

• First comprehensive (70 component) synthetic spectral irradiance database in absolute units
– 10 disk angles, 7 models, far ultraviolet to far infrared, multi-resolution
– ~724 GB

• Strong validation in ultraviolet, visible, lines, infrared
– Correct center to limb prediction for red-band irradiances
– Found 30-45% network contribution to Ly-α irradiance

• Several comparisons led to improvements in the atomic parameters

• Led to choice of PICARD (new satellite) filter wavelengths


Which brings us to DATA SCIENCE

• Drum roll…
• Some dirty secrets
• And some … universal truths…


Needs (this is our mantra)

Scientists should be able to access a global, distributed knowledge base of scientific data that:
• appears to be integrated
• appears to be locally available

But… data is obtained by multiple means (models and instruments), using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) meta-data. It may be inconsistent, incomplete, evolving, and distributed. And created in a manner to facilitate its generation NOT its use.

And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology


Back to the TSI time series…


One composite, one assumption


Another composite, different assumption


Data pipelines: we have problems

• Data is coming in faster, in greater volumes and forms and outstripping our ability to perform adequate quality control

• Data is being used in new ways and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision

• We often fail to capture, represent and propagate manually generated information that needs to go with the data flows

• Each time we develop a new instrument, we develop a new data ingest procedure and collect different metadata and organize it differently. It is then hard to use with previous projects

• The task of event determination and feature classification is onerous and we don't do it until after we get the data

• And now much of the data is on the Internet/Web (good or bad?)


20080602 Fox VSTO et al.


Provenance
• Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility
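
One way to make this definition concrete is to treat each element of it as a field in a record that travels with the data. The sketch below is only an illustration: the class name, field names, and sample values are all hypothetical and do not follow any particular provenance standard.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical record mirroring the elements of the definition above."""
    source: str          # origin or source
    intended_use: str    # intention for use
    generated_for: str   # who/what it was generated for
    method: str          # manner of manufacture
    place: str           # sense of place
    time: str            # time of production, as an ISO 8601 string
    history: List[str] = field(default_factory=list)  # subsequent owners/steps

    def add_step(self, description: str) -> None:
        """Append one processing or ownership step to the history."""
        self.history.append(description)


# Illustrative values only.
record = ProvenanceRecord(
    source="hypothetical radiometer",
    intended_use="irradiance time series",
    generated_for="class exercise",
    method="direct measurement",
    place="Troy, NY",
    time="2013-09-03T09:00:00-04:00",
)
record.add_step("calibrated against reference")
record.add_step("archived by student A")
print(asdict(record))
```

Keeping the history as an ordered list is a deliberate choice here: reproducibility depends on knowing the sequence of steps, not just that they happened.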


Contents
• Preparing for data collection

• Managing data

• Data and metadata formats

• Data life-cycle: acquisition
– Modes of collecting
– Examples
– Information as data
– Bias, provenance

• Curation

• Assignment 1


Data-Information-Knowledge Ecosystem

[Figure: Data, Information and Knowledge, spanning Producers to Consumers. Labels: Creation, Gathering, Presentation, Organization, Integration, Conversation, Context, Experience]


MIT DDI Alliance Life Cycle


Modes of collecting data, information

• Observation

• Measurement

• Generation

• Driven by
– Questions
– Research idea
– Exploration


Data Management reading
• http://libraries.mit.edu/guides/subjects/data-management/cycle.html

• http://esipfed.org/DataManagement

• http://wiki.esipfed.org/index.php/Data_Management_Workshop

• http://lisa.dachev.com/ESDC/

• Moore et al., Data Management Systems for Scientific Applications, IFIP Conference Proceedings; Vol. 188, pp. 273 – 284 (2000)

• Data Management and Workflows http://www.isi.edu/~annc/papers/wses2008.pdf

• Metadata and Provenance Management http://arxiv.org/abs/1005.2643

• Provenance Management in Astronomy http://arxiv.org/abs/1005.3358

• Web Data Provenance for QA http://www.slideshare.net/olafhartig/using-web-data-provenance-for-quality-assessment


Management
• Creation of logical collections
– The primary goal of a Data Management system is to abstract the physical data into logical collections. The resulting view of the data is a uniform homogeneous library collection.

• Physical data handling
– This layer maps between the physical and the logical data views. Here you find items like data replication, backup, caching, etc.
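
The split between a logical collection and physical data handling can be sketched as a small catalogue that maps one logical name to any number of physical copies. This is a toy illustration under assumed names; the class, method names, and storage locations are hypothetical and not taken from any data management system.

```python
from typing import Dict, List


class LogicalCollection:
    """Toy catalogue: users see logical names, the catalogue tracks replicas."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._replicas: Dict[str, List[str]] = {}

    def register(self, logical_name: str, physical_location: str) -> None:
        """Record one physical copy (original, backup, cache) of an item."""
        self._replicas.setdefault(logical_name, []).append(physical_location)

    def locate(self, logical_name: str) -> List[str]:
        """Return all known physical locations for a logical name."""
        return list(self._replicas.get(logical_name, []))

    def names(self) -> List[str]:
        """The uniform view: logical names only, sorted."""
        return sorted(self._replicas)


# Hypothetical locations for illustration.
collection = LogicalCollection("solar-irradiance")
collection.register("tsi/2013-09", "file:///archive/disk1/tsi_201309.dat")
collection.register("tsi/2013-09", "file:///backup/tape7/tsi_201309.dat")
collection.register("ssi/2013-09", "file:///archive/disk2/ssi_201309.dat")

print(collection.names())
print(collection.locate("tsi/2013-09"))
```

Replication and backup change only what `locate` returns; the list of logical names that a user browses stays the same.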


Management
• Interoperability support
– Normally the data does not reside in the same place, or various data collections (like catalogues) should be put together in the same logical collection.

• Security support
– Data access authorization and change verification. This is the basis of trusting your data.

• Data ownership
– Define who is responsible for data quality and meaning


Management
• Metadata collection, management and access.
– Metadata are data about data.

• Persistence
– Definition of data lifetime. Deployment of mechanisms to counteract technology obsolescence.

• Knowledge and information discovery
– Ability to identify useful relations and information inside the data collection.


Management
• Data dissemination and publication
– Mechanism to make interested parties aware of changes and additions to the collections.


Logical Collections
• Identifying naming conventions and organization

• Aligning cataloguing and naming to facilitate search, access, use

• Provision of contextual information


Physical Data Handling
• Where and who does the data come from?

• How is it transferred into a physical form?

• Backup, archiving, and caching...

• Data formats

• Naming conventions


Interoperability Support
• Bit/byte and platform/wire neutral encodings

• Programming or application interface access

• Data structure and vocabulary (metadata) conventions and standards

• Definitions of interoperability?
– Smallest number of things to agree on so that you do not need to agree on anything else


Security
• What mechanisms exist for securing data?

• Who performs this task?

• Change and versioning (yes, the data may change), who does this, how?

• Who has access?

• How are access methods controlled, audited?

• Who and what – authentication and authorization?

• Encryption and data integrity
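
For the data integrity and change verification questions above, one common building block is a cryptographic checksum recorded when the data is stored and rechecked when it is read. The sketch below uses SHA-256 from Python's standard `hashlib` module; the helper names and sample bytes are hypothetical. Note that a checksum detects change but does not by itself say who made it.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Compare the current digest with the one recorded at storage time."""
    return sha256_of_bytes(data) == recorded_digest


# Hypothetical data file contents.
original = b"time,tsi\n2013-09-03T09:00:00,1361.0\n"
recorded = sha256_of_bytes(original)   # store this alongside the data

tampered = original.replace(b"1361.0", b"1360.0")
print(is_unchanged(original, recorded))
print(is_unchanged(tampered, recorded))
```

Storing the digest in the metadata, separately from the data file, is what lets a later user verify the copy they received.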


Data Ownership
• Rights and policies – definition and enforcement

• Limitations on access and use

• Requirements for acknowledgement and use

• Who defines and ensures quality, and how?

• To whom may ownership migrate?

• How to address replication?

• How to address revised/ derivative products?


Metadata
• Know what conventions, standards, best practices exist

• Use them – can be hard, use tools

• Understand costs of incomplete and inconsistent metadata

• Understand the line between metadata and data and when it is blurred

• Know where and how to manage metadata and where to store it (and where not to)

• Metadata CAN be added later in many cases


Persistence
• Where will you put your data so that someone else (e.g. one of your class members) can access it?

• What happens after the class, the semester, after you graduate?

• What other factors are there to consider?


Discovery
• If you choose (see ownership and security), how does someone find your data?

• How would you provide discovery of collections, versus files, versus ‘bits’?

• How to enable the narrowest/ broadest discovery?


Dissemination


• Who should do this?

• How and what needs to be put in place?

• How to advertise?

• How to inform about updates?

• How to track use, significance?


Data Formats - preview
• ASCII, UTF-8, ISO 8859-1

• Self-describing formats

• Table-driven

• Markup languages and other web-based

• Database

• Graphs

• Unstructured

• Discussion… because this is part of your assignment
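
As a small starting point for that discussion, the three character encodings in the first bullet differ in which characters they can represent and how many bytes they use. The sketch below uses only Python's built-in `str.encode` and `bytes.decode`; the helper name and sample word are illustrative.

```python
from typing import Optional


def encoded_length(text: str, encoding: str) -> Optional[int]:
    """Return the number of bytes text needs in encoding, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


word = "café"  # contains one non-ASCII character

for name in ("ascii", "iso-8859-1", "utf-8"):
    print(name, encoded_length(word, name))

# Reading bytes with the wrong encoding silently produces garbled text,
# which is why the encoding belongs in the metadata.
utf8_bytes = word.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(misread)
```

ASCII cannot represent the accented character at all, ISO 8859-1 uses one byte for it, and UTF-8 uses two.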


Metadata formats
• ASCII, UTF-8, ISO 8859-1
• Table-driven
• Markup languages and other web-based
• Database, graphs, …
• Unstructured
• Look familiar? Yes, same as data
• Next week we’ll look at things like
– Dublin Core (dc.x)
– Encoding/wrapper standards - METS
– ISO in general, e.g. ISO/IEC 11179
– Geospatial, ISO 19115-2, FGDC
– Time, ISO 8601, xsd:dateTime
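
As a preview of the time standard in the last item, the sketch below writes and reads an ISO 8601 timestamp with Python's standard `datetime` module. The chosen instant (the class start time with a UTC-4 offset) is only an illustrative value.

```python
from datetime import datetime, timedelta, timezone

# Illustrative instant: 9:00 local time on 3 September 2013, offset UTC-4.
local_zone = timezone(timedelta(hours=-4))
stamp = datetime(2013, 9, 3, 9, 0, 0, tzinfo=local_zone)

# Serialize with an explicit offset so the value is unambiguous.
text = stamp.isoformat()
print(text)

# Parse it back and convert to UTC for comparison across sources.
parsed = datetime.fromisoformat(text)
as_utc = parsed.astimezone(timezone.utc)
print(as_utc.isoformat())
```

Recording the offset with every timestamp avoids the "which time zone was this?" question that otherwise has to be answered from memory or guesswork.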


Acquisition
• Learn / read what you can about the developer of the means of acquisition
– Even if it is you (the observer)
– Beware of bias!!!

• Document things
– See notes from Class 1

• Have a checklist (see Management) and review it often

• Be mindful of who or what comes after your step in the data pipeline


Modes of collecting data, information

• Observation

• Measurement

• Generation

• Driven by
– Questions
– Research idea
– Exploration


Example 1
• “the record of the time when the CDTA bus 87 arrives at the bus stop on 15th street under the RPI walk over bridge. The data collection need is being driven by the desire to have a more precise idea of the time when the bus will arrive at that bus stop in the hopes that it will be closer to reality than the official CDTA schedule for bus 87.”

• Lessons:
– Other buses, hard to see the bus, calibrated time source, unanticipated metadata, better to have prepared tables for recording, …
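
The "prepared tables for recording" lesson can be sketched as a fixed set of columns decided before going to the bus stop, including columns for the metadata that turned out to matter (time source, notes). The column names, sample rows, and times below are hypothetical, not the student's actual data.

```python
import csv
import io
from datetime import datetime

# Columns agreed on before collection starts (hypothetical choice).
FIELDS = ["route", "stop", "scheduled", "observed", "time_source", "notes"]


def delay_minutes(scheduled: str, observed: str) -> float:
    """Difference observed - scheduled in minutes, for same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(observed, fmt) - datetime.strptime(scheduled, fmt)
    return delta.total_seconds() / 60


def write_rows(rows, handle) -> None:
    """Write the observations as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Illustrative observations only.
rows = [
    {"route": "87", "stop": "15th street", "scheduled": "08:15",
     "observed": "08:19", "time_source": "phone (network time)", "notes": ""},
    {"route": "87", "stop": "15th street", "scheduled": "08:45",
     "observed": "08:44", "time_source": "phone (network time)",
     "notes": "view blocked by another bus"},
]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
print([delay_minutes(r["scheduled"], r["observed"]) for r in rows])
```

Having a `notes` column from the start gives unanticipated metadata somewhere to go instead of being lost.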


Example 2
• ‘The goal of the data collection was to explore the relative intensity of the wavelengths in a white-light source through a colored plastic film. By measuring this we can find properties of this colored plastic film.’

• ‘We used a special tool called a spectrometer to measure the relative intensity of this light. It’s connected to a computer and records all values by using a software program that interacts with the spectrometer.’

• Lessons
– Noise from external light, inexperience with the software, needed to get help from experienced users, more metadata than expected, software used different logical organization, ...


Example 3
• ‘The goal of my data collection exercise was to observe and generate historical stock price data of large financial firms within a specified time frame of the years 2007 to 2009. This objective was primarily driven by general questions and exploration purposes – in particular, a question I wanted to have answered was how severe the ramifications of the economic crisis were on major financial firms.’

• Lessons
– Irregularities in data due to company changes (buy-out, bankrupt), no metadata – had to create it all, quality was very high, choice of sampling turned out to be crucial, …


Example 4
• I performed a survey among a sample set of people to determine how many prefer carbonated drinks (like Coke) to fruit juice. The goal of this data collection exercise was to determine which option is more popular and if any health related issues occur due to the consumption of these drinks. The data collection need was primarily driven by the question - whether consumption of caffeine, soda and excessive sugar present in these drinks actually cause health problems like obesity, cholesterol, dental decay etc. The mode of data collection was by observation.

• Lessons
– The measurement unit for the amount of drink consumed daily was not fixed before starting the data collection exercise. During the data collection process, some gave me the amount in ml whereas some in ounces and some others in number of glasses. Later, I had to convert those units to the standard unit that I was using – ml.
– Some people were reluctant to disclose health related issues and I had to guarantee them anonymity. This solved the problem to a great extent
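
The unit problem in the first lesson can be sketched as a single conversion step applied to every response. The US fluid ounce factor is standard; the size of a "glass" is not, so the 240 ml value below is purely an assumption that would need to be fixed and documented before collection. The function and table names are hypothetical.

```python
# Millilitres per unit. "glass" is an ASSUMED size and must be documented
# in the metadata, since respondents' glasses differ.
ML_PER_UNIT = {
    "ml": 1.0,
    "oz": 29.5735,    # US fluid ounce
    "glass": 240.0,   # assumption for illustration only
}


def to_ml(amount: float, unit: str) -> float:
    """Convert a reported amount to millilitres, rejecting unknown units."""
    key = unit.strip().lower()
    if key not in ML_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return amount * ML_PER_UNIT[key]


# Hypothetical survey responses as (amount, unit) pairs.
responses = [(500, "ml"), (12, "oz"), (2, "glass")]
print([round(to_ml(amount, unit), 1) for amount, unit in responses])
```

Rejecting unknown units, rather than guessing, keeps a surprise answer such as "cans" from silently entering the data set.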


Example 5
• This data collection exercise was driven by questions, as to what the ratio of Male to Female friends would a set of people have on their Facebook profile. Many questions can be asked and analyzed from this data, like do Females predominantly have females as their Facebook friends, or males? (Similar questions can be asked about the Males) Is the person more of an outgoing person who likes to meet and make new friends, or are they selective about their friends? Why does the ratio for a particular person have a marked departure from that of the others? Does this data vary or is it affected by age? … The mode of collection was Observation, which was carried out by observing the number of Male and Female friends on the Facebook profile.

• Lessons:
– One major problem faced was the obtaining of the data. As many people had a very large number of friends, and Facebook implicitly does not have any mechanism for filtering based on Gender, the data collection had to be manually carried out by the people, which was not an easy task, and also has a higher probability for miscounting. Another problem faced was when trying to use a Facebook application to identify the stats from a person’s Facebook profile. It was discovered after a period of time that the application was not providing accurate results, and hence the data had to be collected again. Lessons learnt were that the next time, it would be more accurate and faster to obtain the data by writing a Facebook code to obtain this data automatically


Example 6
• The idea is to analyse the weekly trends which are prevailing in the popular social networking site Twitter.com and link the trends to the latest news and happenings or holding events which happened in the same date/week, thereby finding some common conduit which connects the two. It is driven by the rationale to observe how recent or important events affect the psyche of people or what vox populi is at a given time.

• Lessons: Psyche is much harder to characterize and collect data about than first thought. Amount and type of news and happenings is very large… needed to think about categories.


Example 7
• The goal of my data collection exercise is to generate the average rainfall levels during the hurricane seasons from the year 2005-2010. Along with this data, we would generate data on the average height of the Hudson River during the hurricane season months. The data collection need is being driven by questions such as how likely will it be for the Hudson River to cause major flood damage during the hurricane season to the cities surrounding it.

• Lesson(s): some historical data is very hard to find. Places where average height is measured do not provide optimal comparison with where rainfall is measured.


Example 8
• Politicians and economists constantly talk about what is best for the country or the economy by proposing their own beliefs into various laws or actions. However, what is the true impact of these ideologies in the real world? Do these ideas really work? Can everything be solved by a tax cut? Spending increases? Changing the tax code? Which ideologies serve people best? Which ideologies work best under what conditions? A lot of theories are proposed, but how do we know which will work and which won’t? That is the goal of this proposal.

• The mode of collection will be a complex one. It would require assistance from government agencies, students, and a series of IT experts to manage the data. Within the last decade, government agencies and other organizations have attempted to make all information publicly available. States, countries, and public corporations have all published balance sheets for each year they have been in business, and corporations also have to file every quarter. Various investing institutes have been collecting information on the markets, both the indices and various stocks, and how they have been performing. However, most of this information is not publicly available. Out of what is available, information can only be retrieved in the last 10 years. In addition, most information is not in a format that can be digested automatically. Someone would have to do manual data entry.

• The information that would have to be gathered would have to be no less than 100 years’ worth of information. It must contain the following information:
– All indices’ prices, in intervals no greater than once per quarter, with the time and date of the prices listed. This must be for all indices for all nations part of the global economy.
– All stock prices, in intervals no greater than once per quarter, with the time and date of all prices listed. This must be for all publicly traded stocks all over the globe.
– All laws, passed or not, in all countries, states, and territories throughout the world. A time and date would have to be recorded with who voted for and against the law.
– All world events would have to be recorded, with times and dates. Also, any response taken by anyone would have to be recorded as well.
– All balance sheets from all countries, states, and territories would have to be recorded.


Data-Information-Knowledge Ecosystem

[Figure: Data, Information and Knowledge, spanning Producers to Consumers. Labels: Creation, Gathering, Presentation, Organization, Integration, Conversation, Context, Experience]


Information as a basis for data
• Don’t over think this… data extracted from an information source, e.g. a web page, an image, a table

• If information is data in context (for human use) then there is data behind the information, e.g. name, address, for a web page form, measure of intensity of light for an image, numerical values for a table

• But data can also be acquired from information with a different context, e.g. the number of people in an image that are wearing green


Bias
• To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster]

• A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception]

• For acquisition – sampling bias is your enemy

• So let’s talk about it…
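
To seed that discussion, sampling bias can be shown with a few numbers: if observations are taken only during one part of the day, the average describes that part of the day, not the whole. All values below are invented for illustration, loosely in the spirit of the bus-arrival example.

```python
from statistics import mean

# Hypothetical bus delays in minutes, grouped by time of day.
delays = {
    "morning_rush": [6, 8, 7, 9],
    "midday": [1, 2, 1, 2],
    "evening": [3, 4, 3, 4],
}


def mean_delay(periods) -> float:
    """Average delay over all observations in the given periods."""
    values = [d for period in periods for d in delays[period]]
    return mean(values)


# An observer who only collects data on the way to a morning class:
biased = mean_delay(["morning_rush"])

# The same statistic over every period:
overall = mean_delay(delays.keys())

print(biased, round(overall, 2))
```

Nothing is wrong with any single measurement here; the distortion comes entirely from which measurements were collected.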


Provenance*
• Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility

• Internal?
• External?
• Mode?


• Provenance in this data pipeline

• Provenance is metadata in context

• What context?
– Who you are
– What you are asking
– What you will use the answer for


It is an entire ecosystem
• The elements that make up provenance are often scattered
• But these are what enable scientists to explore/confirm/deny their data science investigations

[Figure: concepts connected to Provenance: Accountability, Proof, Explanation, Justification, Verifiability, “Transparency”, Trust, Identity]


Curation (partial)
• Consider the organization and presentation of your data

• Document what has been (and has not been) done

• Consider and address the provenance of the data to date, put yourself in the place of the next person

• Be as technology-neutral as possible

• Look to add information and metainformation


What comes to mind?
• Assignment 1 – propose two data collection exercises and perform a survey of data formats, metadata and application support for data management suitable for the data you intend to collect in two weeks (10% of grade) – see web page

• Note this is due NEXT week – why?


What is next
• Reading – see web page (Data Management, Provenance)

• Next week (Data formats, metadata standards, conventions, reading and writing data and information)
