Upload
theresa-hunter
View
215
Download
0
Embed Size (px)
Citation preview
1
Peter Fox
Xinformatics – ITEC 6961/CSCI 6960/ERTH-6963-01
Week 12, April 27, 2010
Information discovery and integration and course
summary
Contents• Review of last class, reading
• Information discovery
• Information integration
• Summary of this course and what you needed to learn
• Objectives
• Discussion of reading
• Next class
2
Recall forms of information• Structured/ un-structured
• Presentation and organization
• Syntax-semantics-pragmatics
• Managed, designed and architected.
• Goal of this part of the class is to understand how discovery and integration are enabled or disabled based on these factors
3
Discovery• How does someone find your information?
• How would you provide discovery of – collections – files – ‘bits’
• How would you find ->
4
Discoveryo Federated Searcho Folksonomies (user contributed)o Intelligent Agentso Search Engineso Taxonomies
o Find photos of KimoBoy or girl?
5
Use cases• Find a sound recording of a swallow.
• Excuse me?
6
Use cases• Find a sound recording of an African Swallow
• Find a sound recording of a bird that sounds like an African Swallow
• Media types – how can you discover them?
7
Use cases• Find the movie that Jean Tripplehorn first
starred in/ that was her most successful/ was lead actress?
• Has anyone gene sequenced a mouse?
• Discovery can often involve information integration
8
9
Three level ‘metadata’ solution for DATA
Level 1:
Data Registration at the Discovery Level,
e.g. Volcanolocation and activity
Level 2:
Data Registration at the Inventory Level,
e.g. list of datasets,times, products
Level 3:
Data Registration at the Item Detail
Level, e.g. access toindividual quantities
Ontology basedData IntegrationUsing scientific
workflows
Earth Sciences Virtual DatabaseA Data Warehouse where
Schema heterogeneity problem is Solved; schema based integration
Data Discovery Data Integration
A.K.Sinha, Virginia Tech, 2006
10
Three level ‘metadata’ solution?
Level 1:
Registration at the Discovery Level,
e.g. Find the upperlevel entry point to a
source
Level 2:
Registration at the Inventory Level,
e.g. list of datasets,using the logical
organization
Level 3:
Registration at the Item Detail
Level, i.e. annotatione.g. tagging
Integrationusing mappingmanagement
Catalog/ IndexSchema based integration
Information Discovery
Information
Integration
A.K.Sinha, Virginia Tech, 2006
Information discovery• What makes discovery work?
– Metadata– Logical organization– Attention to the fact that someone would want to
discover it– It turns out that file types are a key enabler or
inhibitor to discovery
• What does not work?– Result ranking using *any* conventional
algorithm11
Federated search• “is the simultaneous search of multiple online
databases or web resources and is an emerging feature of automated, web-based library and information retrieval systems. It is also often referred to as a portal or a federated search engine.” wikipedia
• Libraries have been doing this for a long time (Z39.50, ISO23950)
• Key is consistent search metadata fields (keywords)• E.g. Geospatial One Stop http://www.geodata.gov
12
Search engines (1)• Contains an automated spider or crawler • No theoretical limits in the amount of indexing
(limited by hardware) • Support remote indexing• Continual background indexing of content• Custom metatag support (some low-end
products do not support this feature) • Support for indexing PDF, .doc, etc (some low-
end products do not support this feature) • Supports URL and word exclusions &
inclusions13
Search engines (2)• SSI supported
• Search by custom metatags
• Case sensitive or insensitive searching
• Simple Customizable search/results pages
• Boolean Searching capabilities
• Provide users meta description and page title in search results
• Inexpensive – $200
• Easily customizable search/results interface14
Search engines (3)• Result weighting feature
• URL Inclusion list
• Require significant memory (RAM) and disk space as the collection grows
• Low-end alternatives often do not possess the capabilities to do phrase or natural language searching.
15
Improve www discovery• Implement metatags on your and your partners web
sites• Update content frequently • Register your site with the major search engines
(tools exist to aid in this process)• Perform a basic study of where your site results
within the major search engine providers• Do not spam the search engine providers • Re-evaluate your web site directory structure to
ensure information is appropriately categorized/ described within your URL strings
16
Improve www discovery• Look through your server log files to determine what
users are trying to find on your site and/or the path they are using to find information
• Perform basic usability testing of your site to determine what users expect and can easily gather from your site. This also may determine why users go to an Internet search engine provider versus accessing your site directly.
• Realize that Internet search engines don’t all act the same, index at the same time period, and often value a particular metatag, document date, etc. more than another vendor product. 17
Smart search• Semantically aware search, e.g.
http://noesis.itsc.uah.edu
• Faceted search, e.g. mspace (http://mspace.fm ), Earth System Grid (http://esg.prototype.ucar.edu ), exhibit (MIT)
18
NOESIS
19
Faceted search• Semantically aware search, e.g.
http://noesis.itsc.uah.edu
• Faceted search, e.g. mspace (http://mspace.fm ), Earth System Grid (http://esg.prototype.ucar.edu )
20
Summary - discovery• Useful to write a few discovery use cases to
drive how your design is developed
• Evolution of your role in facilitating discovery and what/ how others implement access to your information
21
Information integration• Involves combining information residing in different sources
and providing users with a unified view of them. This process becomes significant in a variety of situations both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example).
• Integration appears with increasing frequency as the volume and the need to share existing information explodes.
• It has become the focus of extensive theoretical work, and numerous open problems remain unsolved.
• In management circles, people frequently refer to data integration as "Enterprise Information Integration" (EII)” wikipedia
• Is this an information management challenge (rhetorical question)
22
Aiding integration• Standards – formats for sure but also
• Metadata
• Semantics
• As such any integration capability is HIGHLY curated or left entirely to the end user
• If left to the user, results in a new product which should also be managed and shared
• What do you do?23
Recall elements/ forms of information
• Structured/ un-structured, content, context
• Presentation and organization
• Syntax-semantics-pragmatics
• Managed, designed and architected.
• Integration poses an important challenge here– Two forms presented/ organized differently– Different structure, semantics…
• Information back to data back to information 24
Micro life cycle of data
Geospatial
26
• Much of the work on information integration has focused on the dynamic integration of structured data sources, such as databases or XML data.
• With the more complex geospatial data types, such as imagery, maps, and vector data, researchers have focused on the integration of specific types of information, such as placing points or vectors on maps, but much of this integration is only partially automated.
• The challenge is that the dynamic integration of online data and geospatial data is beyond the state of the art of existing integration systems.
Geospatial
27
• The conflation process divides into following tasks: (1) find a set of conjugate point pairs, termed "control point pairs", in both vector and image datasets, (2) filter control point pairs, and (3) utilize algorithms, such as triangulation and rubber-sheeting, to align the rest of the points and lines in two datasets using the control point pairs.
• Typically by human input has been essential to find control point pairs and/or filter control points
Vectors on maps
28
Different contexts?• Heavily relies on metadata, especially on
structural/ use metadata
• Is more than often what leads to new findings, and abduction!
• Exercise – how does integration occur for the other aspects of information?
29
Review of course content
30
Abduction• Is a method of logical inference introduced by
Peirce which comes prior to induction and deduction for which the colloquial name is to have a "hunch".
• Abductive reasoning starts when an inquirer considers of a set of seemingly unrelated facts, armed with an intuition that they are somehow connected.
• The term abduction is commonly presumed to mean the same thing as hypothesis; however, an abduction is actually the process of inference that produces a hypothesis as its end result
31
Use Case
• … is a collection of possible sequences of interactions between the system under discussion and its actors, relating to a particular goal.
• The collection of Use Cases should define all system behavior relevant to the actors to assure them that their goals will be carried out properly.
• Any system behavior that is irrelevant to the actors should not be included in the use cases.– is a prose description of a system's behavior when
interacting with the outside world.– is a technique for capturing functional requirements of
business systems and, potentially, of an IT system to support the business system.
– can also capture non-functional requirements
Developed for NASA TIWG
Table of Contents• ==Plain Language Description==• ===Short Definition===• ===Purpose===• ===Describe a scenario of expected use===• ===Definition of Success===• ==Formal Use Case Description==• === Use Case Identification===• ===Revision Information===• ===Definition===• ===Successful Outcomes===• ===Failure Outcomes===• ==General Diagrams==• ===Schematic of Use case===• ==Use Case Elaboration==• ===Actors===• ====Primary Actors====• ====Other Actors====• ===Preconditions===• ===Postconditions===• ===Normal Flow (Process Model)===• ===Alternative Flows===
• ===Special Functional Requirements===• ===Extension Points===• ==Diagrams==• ===Use Case Diagram===• ===State Diagram===• ===Activity Diagram===• ===Other Diagrams===• ==Non-Functional Requirements==• ===Performance===• ===Reliability===• ===Scalability===• ===Usability===• ===Security===• ===Other Non-functional Requirements===• ==Selected Technology==• ===Overall Technical Approach===• ===Architecture===• ===Technology A===• ====Description====• ====Benefits====• ====Limitations====• ===Technology B===• ====Description====• ====Benefits====• ====Limitations====• ==References==
• ===Special Functional Requirements===• ===Extension Points===• ==Diagrams==• ===Use Case Diagram===• ===State Diagram===• ===Activity Diagram===• ===Other Diagrams===• ==Non-Functional Requirements==• ===Performance===• ===Reliability===• ===Scalability===• ===Usability===• ===Security===• ===Other Non-functional Requirements===• ==Selected Technology==• ===Overall Technical Approach===• ===Architecture===• ===Technology A===• ====Description====• ====Benefits====• ====Limitations====• ===Technology B===• ====Description====• ====Benefits====• ====Limitations====• ==References==
Information theory
• Semiotics, also called semiotic studies or semiology, is the study of sign processes (semiosis), or signification and communication, signs and symbols, into three branches:– Syntactics: Relation of signs to each other in
formal structures– Semantics: Relation between signs and the
things to which they refer; their denotata– Pragmatics: Relation of signs to their impacts on
those who use them
34
Information integrity• Information of a random variable is defined as
the Sum of p x log p, where p=probability. It represents the uncertainty of the variable.
• In later classes we will cover cognitive and social factors in increasing the conditional entropy and thus reducing the uncertainty and thus increasing information content and value
• We will also cover semiotics (signs) as a prelude to visualization as a presentation mechanism for information 35
Information gain/loss• The mutual information of two variables
define how much information one variable contains about the other.
• It is therefore defined as the decrease of the uncertainty of one variable by knowing the other.
• In probabilistic terms, the entropy decreases by conditioning on the distribution.
• What does this mean for an information system? E.g. a website or web service?
36
Noise• Most often refers to ‘data’ but does apply to
information• Uncertainty, especially any that is introduced is a
source of noise, or more accurately – bias in the use or interpretation of the information
• Noise/ bias is context and structure dependent• Noise/ bias contamination is rampant in information
systems• Quality control and verification is less developed for
information sources, e.g. ‘people do not report problems’
37
Library science• Curates the artifacts of knowledge
• Organizes and manages them for consumers– Cataloging and classification
• Preservation– ‘maintaining or restoring access to artifacts,
documents and records through the study, diagnosis, treatment and prevention of decay and damage’ (wikipedia)
• Digital age– Curation and preservation
38
Cognitive Science• Cognitive science is an interdisciplinary study of
the mind and intelligence• It operates at the intersection of psychology,
philosophy, computer science, linguistics, anthropology, and neuroscience.
• Of relevance for data and information science are three significant theoretical underpinnings– mental representation,– the nature of expertise, – and intuition
• Very relevant to model, metamodel choice39
Social Science• Branch of humanities
• Especially as it relates to networks of scientists
• Exploits sociology of groups, teams
• Cultural norms as well as discipline norms– Modes of what and how rewards are given– Between those who produce and those who
consume data and information– How you collect, understand, model and design
models and architectures is as much social as technical skill
40
Presentation• Separation of content from presentation!!
• The theory here is more empirical or semi-empirical
• Is developed based on a solid understanding of minimizing information uncertainty beginning with content, context and structural considerations and, as we will see, adding cognitive and social factors to reduce uncertainty.
• Physiology for humans, color, …41
Organization• Organizations as producers and consumers
• Organization of information presentation, e.g. layout on a web page
• Also (again) content, context and structure
• How do you organize– Information you’ve collected this semester– Information given to you by others
42
Context• Internal - Human context, tacit knowledge
• External
43
Structure• Is information stored or only presented?
• Structural representation of information content can bias presentation, e.g.– Modern image capture devices (digital camera)
often convert 2 byte integer to float, or 4 byte integer, what are the implications
• Appropriate choice of information structure can significantly decrease uncertainty, e.g. returning land images in GeoTIFF, which can encoding geographic location, instead of PNG 44
Content• Presentation
• Translation
• Encoding
45
Mental Representation• Thinking = representational structures +
procedures that operate on those structures.
• Data structures mental representations+ algorithms +procedures= running programs =thinking
• Methodological consequence: study the mind by developing computer simulations of thinking.
46
Semiotics• Also called semiotic studies or semiology, is
the study of sign processes (semiosis), or signification and communication, signs and symbols
47
Semiotic model
48
Syntax• Relation of signs to
each other in formal structures
• … the term syntax is also used to refer directly to the rules and principles that govern the …
• But not the meaning or the use! 49
Semantics
• Relation between signs and the things to which they refer; their denotata
• Study of meaning of … (anything?)
• Mainly need to worry about failures
50
Pragmatics• Relation of signs to their
impacts on those who use them
• the ways in which context contributes to meaning, conveying and use
51
Information Models• Conceptual models, sometimes called domain
models, are typically used to explore domain concepts
• High-level conceptual models are often created as part of initial requirements envisioning efforts as they are used to explore the high-level static business or science or medicine structures and concepts.
• Conceptual models are often created as the precursor to logical models or as alternatives to them
• Followed by logical and physical models 52
Object models• A data model is a logic organization of the
real world objects (entities), constraints on them, and the relationships among objects. – A database (DB) language is a concrete syntax
for an object (data) model. – A DB system implements that model.
53
Object design• Object-oriented modeling is a formal way of
representing something in the real world. It draws from traditional set theory and classification theory. Some basics to keep in mind in object-oriented modeling are that:– Instances are things.– Properties are attributes.– Relationships are pairs of attributes.– Classes are types of things.– Subclasses are subtypes of things.
54
Architectures• Building on content, context,
and users, some illustrate information architecture as an iceberg.
• Just like an iceberg, the majority of information architecture work is out of sight, "below the water."
• The work includes the creation of plans, controlled-vocabularies, and blueprints all before any user interfaces are created.
55
Interfaces
• Increasingly in tiered architectures there are numerous interfaces
• Information flow at interfaces and thus software engineering at those interfaces becomes a very important consideration
56
And relation to design?• “In the context of information systems design,
information architecture refers to the analysis and design of the data stored by information systems, concentrating on entities, their attributes, and their interrelationships.
• It refers to the modeling of data for an individual database and to the corporate data models an enterprise uses to coordinate the definition of data in several (perhaps scores or hundreds) of distinct databases.
57
Design theory• Elements
– Form– Value– Texture– Lines– Shapes– Direction– Size– Color
• Relate these to signs and relations between them
58
Principles of design
• Balance
• Gradation
• Repetition
• Contrast
• Harmony
• Dominance
• Unity
59
Reference architectures• “provides a proven template solution for an
architecture for a particular domain. It also provides a common vocabulary with which to discuss implementations, often with the aim to stress commonality.
• A reference architecture often consists of a list of functions and some indication of their interfaces (or APIs) and interactions with each other and with functions located outside of the scope of the reference architecture.” (wikipedia) 60
Statefull versus stateless• A key distinction between Grids and Web
environments is state, i.e. the knowledge of ‘who’ knows and remembers ‘what’
• Increasingly there is a need for maintaining some form of state, i.e. reducing information entropy in web and internet-based architectures
• Thus, enter the need for ‘state for a defined purpose’…
61
Life-cycle elements• Acquisition: Process of recording or
generating a concrete artefact from the concept (see transduction)
• Curation: The activity of managing the use of data from its point of creation to ensure it is available for discovery and re-use in the future
• Preservation: Process of retaining usability of data in some source form for intended and unintended use
• Stewardship: Process of maintaining integrity across acquisition, curation and preservation 62
Acquisition• Learn / read what you can
about the developer of the means of acquisition– Documents may not be
easy to find
– Remember bias!!!
• Document as you go
• Have a checklist (the Management list) and review it often 63
Curation• From Producers to Consumers
• Consider the organization and presentation of the data
• Document what has been (and has not been) done
• Consider and address the provenance to date
• Be as technology-neutral as possible
• Look to add metainformation64
Preservation• Refers to the full life cycle
• Archiving is a component
• Stewardship is the act of preservation
• Intent is that ‘you can open it any time in the future’ and that ‘it will be there’
• This involves steps that may not be conventionally thought of
• Think 10, 20, 50, 200 years…. looking historically gives some guide to future considerations 65
Summary of Management• Creation of logical collections
• Physical handling
• Interoperability support
• Security support
• Ownership
• Metadata collection, management and access.
• Persistence
• Knowledge and information discovery
• Dissemination and publication 66
67
Workflow• General definition: series of tasks performed
to produce a final outcome
• Information workflow – “analysis pipeline”– Automate tedious jobs that users traditionally
performed by hand for each dataset– Process large volumes of data/ information faster
than one could do by hand
68
Workflows
• Formal models of the flow of data/ information among processing components
• May be simple and linear or more complex• Can process many data/ information types:
– Archives– Web pages– Streaming/ real time– Images (e.g., medical or satellite)– Simulation output– Observational data
Visualization?• Reducing amount of data, quantization
• Patterns
• Features
• Events
• Trends
• Irregularities
• Exit points for analysis
• Leading to presentation of data cognitive science and the mental
representation69
Types of visualization• Color coding (including false color)
• Classification of techniques is based on– Dimensionality– Information being sought, i.e. purpose
• Line plots
• Contours
• Surface rendering techniques
• Volume rendering techniques
• Animation techniques
• Non-realistic, including ‘cartoon/ artist’ style70
Metadata• Metadata is structured information that
describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.
• Metadata is often called data about data or information about information.
71
Different types of metadata• Descriptive metadata describes a resource
for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords.
• Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters (used).
• Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. 72
Sub-types (admin)• Rights management metadata, which deals
with intellectual property rights
• Preservation metadata, which contains information needed to archive and preserve a resource.
73
Micro life cycle of data
In one slide?• Use case – you have to know the goal (+more)• Conceptual and logical models -> information
models• Understand information flows and uncertainties
(sign systems), the life cycle and manage them• Apply information, library, cognitive, social science,
and design elements to developing a design of an architecture
• Think the design through (e.g. get closer to the physical model (workflow?)) and assess the presentation, organization, content, context, structure, syntax, semantic and pragmatics 75
What would your slide include?
76
Objectives• To instruct future information architects how to
sustainably generate information models, designs and architectures
• To instruct future technologists how to understand and support essential data and information needs of a wide variety of producers and consumers
• For both to know tools, and requirements to properly handle data and information
• Will learn and be evaluated on the underpinnings of informatics, including theoretical methods, technologies and best practices.
77
Learning Objectives• Through class lectures, practical sessions,
written and oral presentation assignments and projects, students should:– Understand and develop skill in Development
and Management of multi-skilled teams in the application of Informatics
– Understand and know how to develop Conceptual and Information Models and Explain them to non-experts
– Knowledge and application of Informatics Standards
– Skill in Informatics Tool Use and Evaluation78
Discussion• About discovery?
• Integration?
• All of the material?
79
Reading for this week• Is retrospective
• Also covers metadata and information modeling
80
What is next• Break on May 4, no class
• Week 13 – Project presentations (May 11, i.e. in 2 weeks)
• IDEA surveys after April 28
81