Peter Fox

Xinformatics – ITEC 6961/CSCI 6960/ERTH-6963-01

Week 12, April 27, 2010

Information discovery and integration and course summary

1

Contents

• Review of last class, reading

• Information discovery

• Information integration

• Summary of this course and what you needed to learn

• Objectives

• Discussion of reading

• Next class

2

Recall forms of information

• Structured/unstructured

• Presentation and organization

• Syntax-semantics-pragmatics

• Managed, designed and architected.

• Goal of this part of the class is to understand how discovery and integration are enabled or disabled based on these factors

3

Discovery

• How does someone find your information?

• How would you provide discovery of
– collections
– files
– ‘bits’

• How would you find ->

4

Discovery

o Federated Search
o Folksonomies (user contributed)
o Intelligent Agents
o Search Engines
o Taxonomies

o Find photos of Kim
o Boy or girl?

5

Use cases

• Find a sound recording of a swallow.

• Excuse me?

6

Use cases

• Find a sound recording of an African Swallow

• Find a sound recording of a bird that sounds like an African Swallow

• Media types – how can you discover them?

7

Use cases

• Find the movie that Jeanne Tripplehorn first starred in / that was her most successful / in which she was lead actress?

• Has anyone gene sequenced a mouse?

• Discovery can often involve information integration

8

9

Three level ‘metadata’ solution for DATA

Level 1: Data Registration at the Discovery Level, e.g. volcano location and activity

Level 2: Data Registration at the Inventory Level, e.g. list of datasets, times, products

Level 3: Data Registration at the Item Detail Level, e.g. access to individual quantities

Ontology-based Data Integration using scientific workflows

Earth Sciences Virtual Database: a Data Warehouse where the schema heterogeneity problem is solved; schema-based integration

Data Discovery

Data Integration

A.K. Sinha, Virginia Tech, 2006

10

Three level ‘metadata’ solution?

Level 1: Registration at the Discovery Level, e.g. find the upper-level entry point to a source

Level 2: Registration at the Inventory Level, e.g. list of datasets, using the logical organization

Level 3: Registration at the Item Detail Level, i.e. annotation, e.g. tagging

Integration using mapping management

Catalog/Index: schema-based integration

Information Discovery

Information Integration

A.K. Sinha, Virginia Tech, 2006

Information discovery

• What makes discovery work?
– Metadata
– Logical organization
– Attention to the fact that someone would want to discover it
– It turns out that file types are a key enabler or inhibitor of discovery

• What does not work?
– Result ranking using *any* conventional algorithm

11

Federated search

• “is the simultaneous search of multiple online databases or web resources and is an emerging feature of automated, web-based library and information retrieval systems. It is also often referred to as a portal or a federated search engine.” (wikipedia)

• Libraries have been doing this for a long time (Z39.50, ISO 23950)

• Key is consistent search metadata fields (keywords)

• E.g. Geospatial One Stop http://www.geodata.gov

12
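
The point about consistent search metadata fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two catalogs, the `Record` type and the function names are hypothetical, and the sources are queried one after another rather than truly simultaneously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str       # which catalog the record came from
    title: str
    keywords: tuple   # the shared search metadata field


# Two hypothetical catalogs that expose the same metadata fields.
CATALOG_A = [
    Record("catalog_a", "Volcano activity map", ("volcano", "map")),
    Record("catalog_a", "Land cover imagery", ("land", "imagery")),
]
CATALOG_B = [
    Record("catalog_b", "Volcano location index", ("volcano", "location")),
]


def search_catalog(catalog, keyword):
    """Search one source on the agreed keyword field (case-insensitive)."""
    wanted = keyword.lower()
    return [r for r in catalog if wanted in (k.lower() for k in r.keywords)]


def federated_search(catalogs, keyword):
    """Send the same query to every source and merge the answers."""
    merged = []
    for catalog in catalogs:
        merged.extend(search_catalog(catalog, keyword))
    return merged


hits = federated_search([CATALOG_A, CATALOG_B], "Volcano")
for hit in hits:
    print(hit.source, "-", hit.title)
```

The merge step only works because both catalogs agree on what `keywords` means; if one source used a different field or vocabulary, a mapping step would be needed first.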


Search engines (1)

• Contains an automated spider or crawler

• No theoretical limits in the amount of indexing (limited by hardware)

• Support remote indexing

• Continual background indexing of content

• Custom metatag support (some low-end products do not support this feature)

• Support for indexing PDF, .doc, etc. (some low-end products do not support this feature)

• Supports URL and word exclusions & inclusions

13

Search engines (2)

• SSI supported

• Search by custom metatags

• Case sensitive or insensitive searching

• Simple Customizable search/results pages

• Boolean Searching capabilities

• Provide users meta description and page title in search results

• Inexpensive – $200

• Easily customizable search/results interface

14

Search engines (3)

• Result weighting feature

• URL Inclusion list

• Require significant memory (RAM) and disk space as the collection grows

• Low-end alternatives often do not possess the capabilities to do phrase or natural language searching.

15
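
Several of the features listed on these slides (indexing, case-insensitive searching, Boolean searching) can be illustrated with a toy inverted index. This is a minimal sketch, not how any particular product works; the pages and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical pages: URL -> text content.
PAGES = {
    "/birds/swallow": "Sound recording of an African Swallow",
    "/birds/sparrow": "Photo of a sparrow",
    "/maps/volcano": "Volcano location map and sound archive",
}


def build_index(pages):
    """Map each lower-cased word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search_and(index, *words):
    """Boolean AND: URLs containing every word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


def search_or(index, *words):
    """Boolean OR: URLs containing at least one word."""
    result = set()
    for w in words:
        result |= index.get(w.lower(), set())
    return result


INDEX = build_index(PAGES)
print(sorted(search_and(INDEX, "sound", "swallow")))
print(sorted(search_or(INDEX, "sparrow", "volcano")))
```

Phrase and natural language searching, which the slide notes low-end products often lack, would need word positions or linguistic processing on top of this.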

Improve www discovery

• Implement metatags on your and your partners' web sites

• Update content frequently

• Register your site with the major search engines (tools exist to aid in this process)

• Perform a basic study of where your site appears in the results of the major search engine providers

• Do not spam the search engine providers

• Re-evaluate your web site directory structure to ensure information is appropriately categorized/described within your URL strings

16

Improve www discovery

• Look through your server log files to determine what users are trying to find on your site and/or the path they are using to find information

• Perform basic usability testing of your site to determine what users expect and can easily gather from your site. This also may determine why users go to an Internet search engine provider versus accessing your site directly.

• Realize that Internet search engines don’t all act the same, index at the same time period, and often value a particular metatag, document date, etc. more than another vendor product.

17
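
The advice about server log files can be tried with a short script. The sketch below assumes the server writes the Common Log Format; the sample lines, the regular expression and the function name are invented for illustration and would need adjusting for a real server.

```python
import re
from collections import Counter

# Assumed Common Log Format; these sample lines are invented for illustration.
SAMPLE_LOG = """\
10.0.0.1 - - [27/Apr/2010:10:00:00 -0400] "GET /data/volcano.html HTTP/1.1" 200 5120
10.0.0.2 - - [27/Apr/2010:10:00:05 -0400] "GET /search?q=swallow HTTP/1.1" 200 2048
10.0.0.3 - - [27/Apr/2010:10:00:09 -0400] "GET /data/volcano.html HTTP/1.1" 200 5120
10.0.0.4 - - [27/Apr/2010:10:00:12 -0400] "GET /old/page.html HTTP/1.1" 404 312
"""

LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def summarize(log_text):
    """Count requested paths and collect the paths that returned 404."""
    requested = Counter()
    missing = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like requests
        requested[match["path"]] += 1
        if match["status"] == "404":
            missing[match["path"]] += 1
    return requested, missing


requested, missing = summarize(SAMPLE_LOG)
print(requested.most_common(1))
print(list(missing))
```

Frequently requested paths suggest what users come for, while repeated 404s point at information they expected to find but could not.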

Smart search

• Semantically aware search, e.g. http://noesis.itsc.uah.edu

• Faceted search, e.g. mspace (http://mspace.fm), Earth System Grid (http://esg.prototype.ucar.edu), Exhibit (MIT)

18


NOESIS

19

Faceted search

• Semantically aware search, e.g. http://noesis.itsc.uah.edu

• Faceted search, e.g. mspace (http://mspace.fm), Earth System Grid (http://esg.prototype.ucar.edu)

20
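
As a rough idea of what faceted search does, the sketch below filters records by selected facet values and counts what remains under another facet. The records and facet names are hypothetical and are not taken from mspace or the Earth System Grid.

```python
from collections import Counter

# Hypothetical records with facet fields.
RECORDS = [
    {"title": "Swallow call", "media": "audio", "region": "Africa"},
    {"title": "Swallow in flight", "media": "image", "region": "Europe"},
    {"title": "Dawn chorus", "media": "audio", "region": "Europe"},
]


def filter_by_facets(records, **selected):
    """Keep records whose facet values match every selected facet."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selected.items())]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records)


audio = filter_by_facets(RECORDS, media="audio")
print([r["title"] for r in audio])
print(dict(facet_counts(audio, "region")))
```

The counts are what a faceted interface typically shows next to each remaining choice, so the user can see how a further selection would narrow the results.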


Summary - discovery

• Useful to write a few discovery use cases to drive how your design is developed

• Evolution of your role in facilitating discovery and what/how others implement access to your information

21

Information integration

• “Involves combining information residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example).

• Integration appears with increasing frequency as the volume and the need to share existing information explodes.

• It has become the focus of extensive theoretical work, and numerous open problems remain unsolved.

• In management circles, people frequently refer to data integration as "Enterprise Information Integration" (EII).” (wikipedia)

• Is this an information management challenge? (rhetorical question)

22
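
A very small example of "a unified view" over sources with different structure and units is sketched below. The two sources, their field names and the target schema are all hypothetical; real integration usually also has to reconcile semantics, not just names and units.

```python
# Two hypothetical sources describing the same kind of thing with different schemas.
SOURCE_A = [{"station_id": "A1", "temp_c": 21.5}]
SOURCE_B = [{"id": "B7", "temperature_f": 68.0}]


def from_a(row):
    """Map a source A row onto the shared schema (already in Celsius)."""
    return {"station": row["station_id"], "temperature_c": row["temp_c"], "source": "A"}


def from_b(row):
    """Map a source B row onto the shared schema, converting Fahrenheit to Celsius."""
    celsius = (row["temperature_f"] - 32.0) * 5.0 / 9.0
    return {"station": row["id"], "temperature_c": round(celsius, 1), "source": "B"}


def unified_view(source_a, source_b):
    """Present both sources to the user as one list in one schema."""
    return [from_a(r) for r in source_a] + [from_b(r) for r in source_b]


view = unified_view(SOURCE_A, SOURCE_B)
for row in view:
    print(row)
```

Keeping the `source` field in the unified rows preserves a little provenance, which matters once the integrated product is itself managed and shared.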

Aiding integration

• Standards – formats for sure but also

• Metadata

• Semantics

• As such any integration capability is HIGHLY curated or left entirely to the end user

• If left to the user, results in a new product which should also be managed and shared

• What do you do?

23

Recall elements/ forms of information

• Structured/ un-structured, content, context

• Presentation and organization

• Syntax-semantics-pragmatics

• Managed, designed and architected.

• Integration poses an important challenge here
– Two forms presented/organized differently
– Different structure, semantics…

• Information back to data back to information

24

Micro life cycle of data

Geospatial

26

• Much of the work on information integration has focused on the dynamic integration of structured data sources, such as databases or XML data.

• With the more complex geospatial data types, such as imagery, maps, and vector data, researchers have focused on the integration of specific types of information, such as placing points or vectors on maps, but much of this integration is only partially automated.

• The challenge is that the dynamic integration of online data and geospatial data is beyond the state of the art of existing integration systems.

Geospatial

27

• The conflation process divides into the following tasks: (1) find a set of conjugate point pairs, termed "control point pairs", in both vector and image datasets, (2) filter control point pairs, and (3) utilize algorithms, such as triangulation and rubber-sheeting, to align the rest of the points and lines in the two datasets using the control point pairs.

• Typically, human input has been essential to find control point pairs and/or filter control points
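
Steps (2) and (3) of the conflation process can be sketched in a heavily simplified form. The code below is an assumption-laden illustration: the control point pairs are invented, filtering uses a simple median-displacement test, and alignment is a single global shift standing in for triangulation and rubber-sheeting, which warp the data locally.

```python
from statistics import mean, median

# Hypothetical control point pairs: (vector_xy, image_xy).
CONTROL_PAIRS = [
    ((0.0, 0.0), (2.0, 1.0)),
    ((10.0, 0.0), (12.0, 1.0)),
    ((0.0, 10.0), (2.0, 11.0)),
    ((5.0, 5.0), (30.0, 40.0)),   # a bad match that filtering should reject
]


def displacement(pair):
    """Offset from the vector point to its matched image point."""
    (vx, vy), (ix, iy) = pair
    return ix - vx, iy - vy


def filter_pairs(pairs, tolerance=1.0):
    """Step 2: drop pairs whose displacement is far from the median one."""
    shifts = [displacement(p) for p in pairs]
    mdx = median(dx for dx, _ in shifts)
    mdy = median(dy for _, dy in shifts)
    return [p for p, (dx, dy) in zip(pairs, shifts)
            if abs(dx - mdx) <= tolerance and abs(dy - mdy) <= tolerance]


def align(points, pairs):
    """Step 3 (simplified): shift every point by the mean displacement."""
    shifts = [displacement(p) for p in pairs]
    dx = mean(s[0] for s in shifts)
    dy = mean(s[1] for s in shifts)
    return [(x + dx, y + dy) for x, y in points]


good_pairs = filter_pairs(CONTROL_PAIRS)
print(len(good_pairs))
print(align([(1.0, 1.0)], good_pairs))
```

Finding the control point pairs in the first place (step 1) is the part the slide says has typically needed human input, and it is not attempted here.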

Vectors on maps

28

Different contexts?

• Heavily relies on metadata, especially on structural/use metadata

• Is more often than not what leads to new findings, and abduction!

• Exercise – how does integration occur for the other aspects of information?

29

Review of course content

30

Abduction

• Is a method of logical inference introduced by Peirce which comes prior to induction and deduction, for which the colloquial name is to have a "hunch".

• Abductive reasoning starts when an inquirer considers a set of seemingly unrelated facts, armed with an intuition that they are somehow connected.

• The term abduction is commonly presumed to mean the same thing as hypothesis; however, an abduction is actually the process of inference that produces a hypothesis as its end result

31

Use Case

• … is a collection of possible sequences of interactions between the system under discussion and its actors, relating to a particular goal.

• The collection of Use Cases should define all system behavior relevant to the actors to assure them that their goals will be carried out properly.

• Any system behavior that is irrelevant to the actors should not be included in the use cases.
– is a prose description of a system's behavior when interacting with the outside world.
– is a technique for capturing functional requirements of business systems and, potentially, of an IT system to support the business system.
– can also capture non-functional requirements

Developed for NASA TIWG

Table of Contents

• ==Plain Language Description==
• ===Short Definition===
• ===Purpose===
• ===Describe a scenario of expected use===
• ===Definition of Success===
• ==Formal Use Case Description==
• ===Use Case Identification===
• ===Revision Information===
• ===Definition===
• ===Successful Outcomes===
• ===Failure Outcomes===
• ==General Diagrams==
• ===Schematic of Use case===
• ==Use Case Elaboration==
• ===Actors===
• ====Primary Actors====
• ====Other Actors====
• ===Preconditions===
• ===Postconditions===
• ===Normal Flow (Process Model)===
• ===Alternative Flows===
• ===Special Functional Requirements===
• ===Extension Points===
• ==Diagrams==
• ===Use Case Diagram===
• ===State Diagram===
• ===Activity Diagram===
• ===Other Diagrams===
• ==Non-Functional Requirements==
• ===Performance===
• ===Reliability===
• ===Scalability===
• ===Usability===
• ===Security===
• ===Other Non-functional Requirements===
• ==Selected Technology==
• ===Overall Technical Approach===
• ===Architecture===
• ===Technology A===
• ====Description====
• ====Benefits====
• ====Limitations====
• ===Technology B===
• ====Description====
• ====Benefits====
• ====Limitations====
• ==References==


Information theory

• Semiotics, also called semiotic studies or semiology, is the study of sign processes (semiosis), or signification and communication, signs and symbols. It is divided into three branches:
– Syntactics: Relation of signs to each other in formal structures
– Semantics: Relation between signs and the things to which they refer; their denotata
– Pragmatics: Relation of signs to their impacts on those who use them

34

Information integrity

• The information (entropy) of a random variable is defined as $H = -\sum_i p_i \log p_i$, where $p_i$ is the probability of outcome $i$. It represents the uncertainty of the variable.

• In later classes we will cover cognitive and social factors in decreasing the conditional entropy and thus reducing the uncertainty and thus increasing information content and value

• We will also cover semiotics (signs) as a prelude to visualization as a presentation mechanism for information

35
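
The entropy definition can be checked numerically. The sketch below uses base-2 logarithms, so the result is in bits; the choice of base and the function name are assumptions made for the example.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: the sum of -p * log2(p) over outcomes with p > 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])       # two equally likely outcomes
biased_coin = entropy([0.9, 0.1])     # one outcome is much more likely
certain = entropy([1.0])              # only one possible outcome

print(fair_coin, biased_coin, certain)
```

The fair coin has the highest uncertainty of the three and the certain outcome has none, which matches the reading of entropy as uncertainty.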

Information gain/loss

• The mutual information of two variables defines how much information one variable contains about the other.

• It is therefore defined as the decrease of the uncertainty of one variable by knowing the other.

• In probabilistic terms, the entropy of one variable decreases (or at least does not increase) when conditioning on the other.

• What does this mean for an information system? E.g. a website or web service?

36
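
Mutual information can be computed from a joint probability table using the identity I(X;Y) = H(X) + H(Y) - H(X,Y). The following is a minimal sketch with two invented joint distributions of binary variables; base-2 logarithms are an assumption, giving results in bits.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]      # marginal distribution of Y
    pxy = [p for row in joint for p in row]     # flattened joint distribution
    return entropy(px) + entropy(py) - entropy(pxy)


# Hypothetical joint distributions of two binary variables.
INDEPENDENT = [[0.25, 0.25], [0.25, 0.25]]   # knowing X says nothing about Y
IDENTICAL = [[0.5, 0.0], [0.0, 0.5]]         # knowing X fixes Y

print(mutual_information(INDEPENDENT))
print(mutual_information(IDENTICAL))
```

For an information system, one reading is that a well-chosen piece of metadata behaves like the second case: knowing it removes most of the uncertainty about the resource it describes.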

Noise

• Most often refers to ‘data’ but does apply to information

• Uncertainty, especially any that is introduced, is a source of noise, or more accurately – bias in the use or interpretation of the information

• Noise/bias is context and structure dependent

• Noise/bias contamination is rampant in information systems

• Quality control and verification is less developed for information sources, e.g. ‘people do not report problems’

37

Library science

• Curates the artifacts of knowledge

• Organizes and manages them for consumers
– Cataloging and classification

• Preservation
– ‘maintaining or restoring access to artifacts, documents and records through the study, diagnosis, treatment and prevention of decay and damage’ (wikipedia)

• Digital age
– Curation and preservation

38

Cognitive Science

• Cognitive science is an interdisciplinary study of the mind and intelligence

• It operates at the intersection of psychology, philosophy, computer science, linguistics, anthropology, and neuroscience.

• Of relevance for data and information science are three significant theoretical underpinnings
– mental representation,
– the nature of expertise,
– and intuition

• Very relevant to model, metamodel choice

39

Social Science

• Branch of humanities

• Especially as it relates to networks of scientists

• Exploits sociology of groups, teams

• Cultural norms as well as discipline norms
– Modes of what and how rewards are given
– Between those who produce and those who consume data and information
– How you collect, understand, model and design models and architectures is as much a social as a technical skill

40

Presentation

• Separation of content from presentation!!

• The theory here is more empirical or semi-empirical

• Is developed based on a solid understanding of minimizing information uncertainty beginning with content, context and structural considerations and, as we will see, adding cognitive and social factors to reduce uncertainty.

• Physiology for humans, color, …

41

Organization

• Organizations as producers and consumers

• Organization of information presentation, e.g. layout on a web page

• Also (again) content, context and structure

• How do you organize
– Information you’ve collected this semester
– Information given to you by others

42

Context

• Internal - Human context, tacit knowledge

• External

43

Structure

• Is information stored or only presented?

• Structural representation of information content can bias presentation, e.g.
– Modern image capture devices (digital camera) often convert 2 byte integer to float, or 4 byte integer; what are the implications?

• Appropriate choice of information structure can significantly decrease uncertainty, e.g. returning land images in GeoTIFF, which can encode geographic location, instead of PNG

44

Content

• Presentation

• Translation

• Encoding

45

Mental Representation

• Thinking = representational structures + procedures that operate on those structures.

• Analogy: data structures + algorithms = running programs; mental representations + procedures = thinking

• Methodological consequence: study the mind by developing computer simulations of thinking.

46

Semiotics

• Also called semiotic studies or semiology, is the study of sign processes (semiosis), or signification and communication, signs and symbols

47

Semiotic model

48

Syntax

• Relation of signs to each other in formal structures

• … the term syntax is also used to refer directly to the rules and principles that govern the …

• But not the meaning or the use!

49

Semantics

• Relation between signs and the things to which they refer; their denotata

• Study of meaning of … (anything?)

• Mainly need to worry about failures

50

Pragmatics

• Relation of signs to their impacts on those who use them

• the ways in which context contributes to meaning, conveying and use

51

Information Models

• Conceptual models, sometimes called domain models, are typically used to explore domain concepts

• High-level conceptual models are often created as part of initial requirements envisioning efforts as they are used to explore the high-level static business or science or medicine structures and concepts.

• Conceptual models are often created as the precursor to logical models or as alternatives to them

• Followed by logical and physical models

52

Object models

• A data model is a logical organization of the real world objects (entities), constraints on them, and the relationships among objects.
– A database (DB) language is a concrete syntax for an object (data) model.
– A DB system implements that model.

53

Object design

• Object-oriented modeling is a formal way of representing something in the real world. It draws from traditional set theory and classification theory. Some basics to keep in mind in object-oriented modeling are that:
– Instances are things.
– Properties are attributes.
– Relationships are pairs of attributes.
– Classes are types of things.
– Subclasses are subtypes of things.

54
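
The object-oriented modeling basics map directly onto code. The sketch below uses Python dataclasses; the class names, fields and values are hypothetical and borrow the swallow example from the discovery use cases only for flavor.

```python
from dataclasses import dataclass


@dataclass
class Bird:                  # a class is a type of thing
    name: str                # properties are attributes
    wingspan_cm: float


@dataclass
class Swallow(Bird):         # a subclass is a subtype of thing
    migratory: bool = True


@dataclass
class Recording:
    subject: Bird            # a relationship links two things (Recording -> Bird)
    media_type: str


# An instance is a thing; the wingspan value is made up for the example.
african_swallow = Swallow(name="African Swallow", wingspan_cm=30.0)
clip = Recording(subject=african_swallow, media_type="audio")

print(clip.subject.name, clip.media_type)
print(isinstance(african_swallow, Bird))
```

Because `Swallow` is a subtype of `Bird`, a `Recording` of a swallow satisfies any query for recordings of birds, which is the kind of inference a flat keyword tag cannot provide.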

Architectures

• Building on content, context, and users, some illustrate information architecture as an iceberg.

• Just like an iceberg, the majority of information architecture work is out of sight, "below the water."

• The work includes the creation of plans, controlled-vocabularies, and blueprints all before any user interfaces are created.

55

Interfaces

• Increasingly in tiered architectures there are numerous interfaces

• Information flow at interfaces and thus software engineering at those interfaces becomes a very important consideration

56

And relation to design?

• “In the context of information systems design, information architecture refers to the analysis and design of the data stored by information systems, concentrating on entities, their attributes, and their interrelationships.

• It refers to the modeling of data for an individual database and to the corporate data models an enterprise uses to coordinate the definition of data in several (perhaps scores or hundreds) of distinct databases.”

57

Design theory

• Elements
– Form
– Value
– Texture
– Lines
– Shapes
– Direction
– Size
– Color

• Relate these to signs and relations between them

58

Principles of design

• Balance

• Gradation

• Repetition

• Contrast

• Harmony

• Dominance

• Unity

59

Reference architectures

• “provides a proven template solution for an architecture for a particular domain. It also provides a common vocabulary with which to discuss implementations, often with the aim to stress commonality.

• A reference architecture often consists of a list of functions and some indication of their interfaces (or APIs) and interactions with each other and with functions located outside of the scope of the reference architecture.” (wikipedia)

60

Stateful versus stateless

• A key distinction between Grids and Web environments is state, i.e. the knowledge of ‘who’ knows and remembers ‘what’

• Increasingly there is a need for maintaining some form of state, i.e. reducing information entropy in web and internet-based architectures

• Thus, enter the need for ‘state for a defined purpose’…

61

Life-cycle elements

• Acquisition: Process of recording or generating a concrete artefact from the concept (see transduction)

• Curation: The activity of managing the use of data from its point of creation to ensure it is available for discovery and re-use in the future

• Preservation: Process of retaining usability of data in some source form for intended and unintended use

• Stewardship: Process of maintaining integrity across acquisition, curation and preservation

62

Acquisition

• Learn / read what you can about the developer of the means of acquisition
– Documents may not be easy to find

– Remember bias!!!

• Document as you go

• Have a checklist (the Management list) and review it often

63

Curation

• From Producers to Consumers

• Consider the organization and presentation of the data

• Document what has been (and has not been) done

• Consider and address the provenance to date

• Be as technology-neutral as possible

• Look to add metainformation

64

Preservation

• Refers to the full life cycle

• Archiving is a component

• Stewardship is the act of preservation

• Intent is that ‘you can open it any time in the future’ and that ‘it will be there’

• This involves steps that may not be conventionally thought of

• Think 10, 20, 50, 200 years… looking historically gives some guide to future considerations

65

Summary of Management

• Creation of logical collections

• Physical handling

• Interoperability support

• Security support

• Ownership

• Metadata collection, management and access.

• Persistence

• Knowledge and information discovery

• Dissemination and publication

66

67

Workflow

• General definition: series of tasks performed to produce a final outcome

• Information workflow – “analysis pipeline”
– Automate tedious jobs that users traditionally performed by hand for each dataset
– Process large volumes of data/information faster than one could do by hand

68

Workflows

• Formal models of the flow of data/ information among processing components

• May be simple and linear or more complex

• Can process many data/information types:
– Archives
– Web pages
– Streaming/real time
– Images (e.g., medical or satellite)
– Simulation output
– Observational data

Visualization?

• Reducing amount of data, quantization

• Patterns

• Features

• Events

• Trends

• Irregularities

• Exit points for analysis

• Leading to presentation of data, cognitive science and the mental representation

69

Types of visualization

• Color coding (including false color)

• Classification of techniques is based on
– Dimensionality
– Information being sought, i.e. purpose

• Line plots

• Contours

• Surface rendering techniques

• Volume rendering techniques

• Animation techniques

• Non-realistic, including ‘cartoon/artist’ style

70

Metadata

• Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.

• Metadata is often called data about data or information about information.

71

Different types of metadata

• Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords.

• Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters (used).

• Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it.

72

Sub-types (admin)

• Rights management metadata, which deals with intellectual property rights

• Preservation metadata, which contains information needed to archive and preserve a resource.

73
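
The three metadata types and the two administrative sub-types can be pictured as one structured record. The sketch below is illustrative only: the class and field names are hypothetical, the title, author and date come from this deck's title slide, and the file type, rights and preservation values are invented.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Descriptive:            # for discovery and identification
    title: str
    author: str
    keywords: list = field(default_factory=list)


@dataclass
class Structural:             # how the compound object is put together
    parts_in_order: list


@dataclass
class Administrative:         # for managing the resource
    created: str
    file_type: str
    rights: str               # rights management sub-type
    preservation_note: str    # preservation sub-type


@dataclass
class ResourceMetadata:
    descriptive: Descriptive
    structural: Structural
    administrative: Administrative


deck = ResourceMetadata(
    descriptive=Descriptive(
        title="Information discovery and integration and course summary",
        author="Peter Fox",
        keywords=["discovery", "integration", "metadata"],
    ),
    structural=Structural(parts_in_order=["Contents", "Discovery", "Integration", "Summary"]),
    administrative=Administrative(
        created="2010-04-27",
        file_type="presentation",                    # assumed value
        rights="unspecified",                        # assumed value
        preservation_note="keep an open-format copy",  # assumed value
    ),
)

print(asdict(deck)["descriptive"]["keywords"])
```

Only the descriptive part is needed for discovery, but the structural and administrative parts are what make later integration and preservation possible.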

Micro life cycle of data

In one slide?

• Use case – you have to know the goal (+more)

• Conceptual and logical models -> information models

• Understand information flows and uncertainties (sign systems), the life cycle and manage them

• Apply information, library, cognitive, social science, and design elements to developing a design of an architecture

• Think the design through (e.g. get closer to the physical model (workflow?)) and assess the presentation, organization, content, context, structure, syntax, semantics and pragmatics

75

What would your slide include?

76

Objectives

• To instruct future information architects how to sustainably generate information models, designs and architectures

• To instruct future technologists how to understand and support essential data and information needs of a wide variety of producers and consumers

• For both to know tools, and requirements to properly handle data and information

• Will learn and be evaluated on the underpinnings of informatics, including theoretical methods, technologies and best practices.

77

Learning Objectives

• Through class lectures, practical sessions, written and oral presentation assignments and projects, students should:
– Understand and develop skill in Development and Management of multi-skilled teams in the application of Informatics
– Understand and know how to develop Conceptual and Information Models and Explain them to non-experts
– Knowledge and application of Informatics Standards
– Skill in Informatics Tool Use and Evaluation

78

Discussion

• About discovery?

• Integration?

• All of the material?

79

Reading for this week

• Is retrospective

• Also covers metadata and information modeling

80

What is next

• Break on May 4, no class

• Week 13 – Project presentations (May 11, i.e. in 2 weeks)

• IDEA surveys after April 28

81