CSE494/598 Principles of Information Engineering Information Life Cycle


Page 1: CSE494/598 Principles of Information Engineering Information Life Cycle

CSE494/598 Principles of Information Engineering

Information Life Cycle

Page 2: CSE494/598 Principles of Information Engineering Information Life Cycle

Lesson Objectives: 

1.     Describe the parts of the Information Life Cycle.

2.     Explain the advantages and disadvantages of coding and compression.

3.     Discuss considerations for information storage.

4.     Describe common factors that must be addressed for proper information presentation.

5.     Relate information analysis to information retrieval.

6.     Classify characteristics of various generations of database management.

 

Advance Reading Material: 

Read “The Top 10 Data Mining Questions”. It can be found at:

http://www.datamining.com/top10

Page 3: CSE494/598 Principles of Information Engineering Information Life Cycle

Information Life Cycle

[Figure: Information Life Cycle diagram, labeled "Information Science", with the following stages]

• Acquisition

• Analysis/mining/processing

• Coding/Compression

• Storage

• Re-engineering

• Preservation

• Retrieval

• Packaging/Visualization

• Presentation

• Transport

• Discard

Page 4: CSE494/598 Principles of Information Engineering Information Life Cycle

1. Information Acquisition

• Acquiring of business-related information in digital form

• Traditionally, record-based data, mostly in table form

• Now multimedia data

    – Conversion to digital form for on-line processing

    – Overall organization for seamless integration

Page 5: CSE494/598 Principles of Information Engineering Information Life Cycle

2. Coding & Compression

• Coding of data to minimize its representation, reducing both the storage requirement and the bandwidth requirement in communication

• Need different techniques for each type of media and even each type of object

    – Facsimile vs. aerial pictures vs. portrait

• Technique must be fast, one-pass, adaptive and invertible, and must not impose unreasonable requirements on resources.
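As a small illustration of the invertibility requirement, the sketch below round-trips a byte string through Python's standard zlib module (DEFLATE, a general-purpose lossless coder). The sample data is made up, and this is only one possible technique; as noted above, a media-specific coder would normally be chosen for images, audio, or video.

```python
import zlib

# Made-up, highly repetitive sample data (an assumption for illustration).
original = b"ABABABABAB" * 100

# Compress with a mid-range effort level, then invert the coding.
compressed = zlib.compress(original, 6)
restored = zlib.decompress(compressed)

ratio = len(compressed) / len(original)
print(f"{len(original)} bytes -> {len(compressed)} bytes (ratio {ratio:.3f})")
print("invertible:", restored == original)
```

Repetitive data like this compresses very well; data that is already compressed or close to random may not shrink at all.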

Page 6: CSE494/598 Principles of Information Engineering Information Life Cycle

3. Analysis & mining

• Raw numbers, words, images and sounds are not immediately useful; their contents must be analyzed and represented in machine-processable form:

    – Mining of databases for useful information

    – Extraction of contents from images and video

    – Conceptualization of text

    – Feature analysis of audio segments

Page 7: CSE494/598 Principles of Information Engineering Information Life Cycle

4. Storage

• Business data can be very large and heterogeneous with respect to all parameters

• Appropriate storage techniques ensure: proper management, location and distribution, and the flow of objects.

• Among issues to be considered:

    – Data placement

    – What technology (medium) to use for storage

    – Distribution: local, remote, out-sourced

    – Speed of delivery

Page 8: CSE494/598 Principles of Information Engineering Information Life Cycle

5. Re-engineering

• Legacy systems make up most of the business data systems

• Maintenance and modernization of these systems represent a large portion of IT efforts

• Important decisions:

    – Maintain

    – Replace & migrate

    – Modernize for co-existence

Page 9: CSE494/598 Principles of Information Engineering Information Life Cycle

Is legacy code like Chernobyl?

• Remember Chernobyl? The meltdown of the nuclear reactor. Officials poured concrete over it and hoped that, someday, it would just go away!

• Legacy code and Chernobyl: Too messy to clean up but too dangerous to ignore!

Page 10: CSE494/598 Principles of Information Engineering Information Life Cycle

Legacy code...

• Theory: rebuild the legacy system from the ground up with

    – a relational (or OO) database

    – graphical user interfaces

    – client/server architecture

• Practice: expensive and risky, because of size, complexity and poor documentation.

Page 11: CSE494/598 Principles of Information Engineering Information Life Cycle

Case study 1

• 700 clients

• 120,000,000 credit cards (mid-90s figure)

• Over 14 terabytes of data

• 2 billion transactions per month

    – 19 billion disk/tape I/O per month

• Around 23 million transactions are processed from 8:00 pm to 2:00 am
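To put these figures in perspective, the short calculation below derives a few rates from them. It is only a back-of-the-envelope sketch: the 30-day month is an assumption, and the variable names are illustrative.

```python
# Figures taken from the slide; the 30-day month is an assumption.
transactions_per_month = 2_000_000_000
io_per_month = 19_000_000_000
window_transactions = 23_000_000
window_seconds = 6 * 3600  # 8:00 pm to 2:00 am

io_per_transaction = io_per_month / transactions_per_month
average_per_day = transactions_per_month / 30
window_rate = window_transactions / window_seconds

print(f"Disk/tape I/Os per transaction: {io_per_transaction:.1f}")
print(f"Average transactions per day: {average_per_day:,.0f}")
print(f"Transactions per second, 8 pm to 2 am: {window_rate:,.0f}")
```

Under these assumptions each transaction costs roughly 9.5 disk/tape I/Os, and the evening window sustains on the order of a thousand transactions per second.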

Page 12: CSE494/598 Principles of Information Engineering Information Life Cycle

Case study 2

• 22 million telephone customers

• “zero downtime” must be guaranteed

• COBOL code: Hundreds of millions of lines

• Many terabytes of data “owned” by applications – no sharing -> redundant storage

• Regulatory change: “rate of return” to “price cap”

• Reengineer 80% of the business process

Page 13: CSE494/598 Principles of Information Engineering Information Life Cycle

Case study 2

• Incremental migration into a client server computing architecture

• Began in the late 80s, still ongoing…

• Around 10,000 workstations, and growing

• Biggest challenge: Inability of the mainframe to participate in distributed C/S computing

    – CICS unable to cooperate in a nested sub-transaction

    – Integrity?

Page 14: CSE494/598 Principles of Information Engineering Information Life Cycle

About Legacy Systems

• Large, with millions of lines of code

• Over 10 years old

• Mission-critical - 24-hour operation

• Difficulty in supporting current/future business requirements

• 80-90% of IT budget

• Instinctive solution: Migrate!

Page 15: CSE494/598 Principles of Information Engineering Information Life Cycle

Migration Strategies

• Complete rewrite of legacy code

    – Many problems

    – Risky

    – Prone to failure

• Incremental migration

    – Migrate the legacy system in place by small incremental steps

    – Control risk by choosing increment size.

Page 16: CSE494/598 Principles of Information Engineering Information Life Cycle

One-step Migration Impediments

• Business conditions never stand still

• Specifications rarely exist

• Undocumented dependencies

• Management of large projects

    – too hard, tend to bloat

• Migration with live data

• Analysis paralysis sets in

• Fear of change

Page 17: CSE494/598 Principles of Information Engineering Information Life Cycle

Incremental Migration

• Incrementally analyze the legacy IS

• Incrementally decompose

• Incrementally design the target interfaces

• Incrementally design the target applications

• Incrementally design the target database

• Incrementally install the target environment

• Create and install the necessary gateways

• Incrementally migrate the legacy database

• Incrementally migrate the legacy applications

• Incrementally migrate the legacy interfaces

• Incrementally cut over to the target IS
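The gateway step in the list above can be pictured with a small routing sketch. This is an illustration under stated assumptions, not a prescribed design: the `Gateway` class, the handler functions, and the function names are all hypothetical. The idea shown is that callers talk only to the gateway, which forwards each request to the legacy or the target system depending on what has been cut over so far.

```python
class Gateway:
    """Routes each request to the legacy or the target system."""

    def __init__(self, legacy_handler, target_handler):
        self.legacy_handler = legacy_handler
        self.target_handler = target_handler
        self.migrated = set()  # names of functions already cut over

    def cut_over(self, function_name):
        """Mark one function as migrated (one incremental step)."""
        self.migrated.add(function_name)

    def handle(self, function_name, payload):
        if function_name in self.migrated:
            return self.target_handler(function_name, payload)
        return self.legacy_handler(function_name, payload)


# Hypothetical stand-ins for the two systems.
def legacy_system(name, payload):
    return f"legacy:{name}:{payload}"


def target_system(name, payload):
    return f"target:{name}:{payload}"


gateway = Gateway(legacy_system, target_system)
before = gateway.handle("billing", 42)   # still served by the legacy system
gateway.cut_over("billing")              # one small migration increment
after = gateway.handle("billing", 42)    # now served by the target system
print(before, after)
```

Because each `cut_over` call moves a single function, the size of a step (and so the risk) is chosen by the migration team, and a failed step affects only that function.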

Page 18: CSE494/598 Principles of Information Engineering Information Life Cycle

A Comparison

              One step                       Incremental
Suited for    Non-decomposable               Decomposable
Risk          Huge                           Controllable
Failure       Entire project                 Step at a time
Benefits      Immediate                      Incremental
Outlook       Unpredictable until deadline   Conservatively optimistic

Page 19: CSE494/598 Principles of Information Engineering Information Life Cycle

6. Preservation

• Similar to physical security measures for protecting buildings, cash and other tangible assets, information must be protected while recorded, processed, stored, shared, transmitted, or retrieved.

• Must protect against loss, alteration, and disclosure

• Must prevent unauthorized access and unauthorized use of

    – Computer systems

    – Networks

    – Information
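One of the threats listed above, alteration, can be illustrated with a digest check. The sketch below uses SHA-256 from Python's standard hashlib module; the record contents and function names are made up. It covers only detection of alteration, not loss or disclosure, and it assumes the stored digest itself is kept somewhere an attacker cannot modify.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, expected_digest: str) -> bool:
    """Compare digests in constant time to detect any change in the data."""
    return hmac.compare_digest(fingerprint(data), expected_digest)


record = b"account=1001;balance=250.00"    # made-up record
stored_digest = fingerprint(record)        # saved when the record is stored

tampered = b"account=1001;balance=950.00"  # one character changed
print(is_unaltered(record, stored_digest))
print(is_unaltered(tampered, stored_digest))
```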

Page 20: CSE494/598 Principles of Information Engineering Information Life Cycle

7. Retrieval

• Query languages have come a long way, from old-style navigational queries to today’s content-based query languages

• Important: Any constraint (e.g., a processable feature) may be used as the criterion for search

• Require efficient retrieval techniques, similar to those for data retrieval, for all types of information

Page 21: CSE494/598 Principles of Information Engineering Information Life Cycle

Web Search Engines

• A text retrieval system with a Web interface

• The document collection of a search engine can be either a pre-compiled special collection or a set of Web pages collected from many Web servers by a program called a Web robot.

• Each document is preprocessed and represented as a vector of terms with weights.

Page 22: CSE494/598 Principles of Information Engineering Information Life Cycle

Web Search Engines (cont’d..)

• The preprocessing steps are:

• Stopword removal: Remove non-content words such as “a” and “is” from each document.

• Stemming: Map variations of the same word into a single term.

• Term weighting: Assign a weight to each term in a document to indicate the importance of the term in representing the contents of the document.
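The three preprocessing steps can be sketched in a few lines. Everything specific here is an assumption for illustration: the stopword list is tiny, the `stem` function is a crude suffix stripper standing in for a real stemmer, and the weight is a simple normalized term frequency rather than any particular engine's scheme.

```python
from collections import Counter

STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}  # tiny illustrative list


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_vector(text):
    """Turn a document into a {term: weight} vector."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    # Stopword removal, then stemming.
    terms = [stem(w) for w in words if w and w not in STOPWORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Term weighting: relative frequency of each term in the document.
    return {term: count / total for term, count in counts.items()}


doc_vector = to_vector("Indexing is the indexing of documents")
print(doc_vector)
```

Here "Indexing" and "indexing" collapse to the single term "index", and "is", "the", and "of" are dropped before weighting.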

Page 23: CSE494/598 Principles of Information Engineering Information Life Cycle

Web Search Engines (cont’d..)

• A query is also transformed into a vector with weights.

• The similarity between a query and a document can be measured by the dot product of their respective vectors.

• When documents are HTML web pages, other factors can influence a term's weight in a document:

    – whether the term appears in the title or a header

    – whether the term is enclosed in special tags or set in special fonts

    – Google and AltaVista use tag and location information
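The dot-product similarity and the tag-based adjustment described above can be sketched as follows. The vectors, page names, and the boost factor of 2.0 are made-up values; real engines use their own weighting, which is not documented here.

```python
def dot_product(query_vec, doc_vec):
    """Similarity: sum of weight products over the terms both vectors share."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())


def boost_terms(doc_vec, special_terms, factor=2.0):
    """Raise the weight of terms found in the title, headers, or special tags.

    The factor is an arbitrary illustrative value.
    """
    return {t: w * factor if t in special_terms else w for t, w in doc_vec.items()}


query = {"life": 0.5, "cycle": 0.5}
documents = {
    "page_a": {"life": 0.4, "cycle": 0.4, "storage": 0.2},
    "page_b": {"life": 0.2, "retrieval": 0.8},
}
# Suppose "life" appears in the title of page_b.
documents["page_b"] = boost_terms(documents["page_b"], {"life"})

scores = {name: dot_product(query, vec) for name, vec in documents.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Even with the title boost, page_b ranks below page_a because it shares only one of the two query terms.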

Page 24: CSE494/598 Principles of Information Engineering Information Life Cycle

Web Search Engines (cont’d..)

• Web pages are hyperlinked. There are pointers going from one page to another.

• Associated with each pointer are words (anchor terms), which show users what they are likely to find if the pointer is followed.

• Anchor terms are utilized to index referenced/pointed pages.

• Linkage information can also be combined with similarity information to improve the retrieval effectiveness.
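A minimal sketch of anchor-term indexing follows. The link triples and URLs are invented, and counting in-links is used here only as one simple example of linkage information; it is not claimed to be how any particular engine combines linkage with similarity.

```python
from collections import defaultdict

# (source page, anchor text, target page); all URLs are made up.
links = [
    ("a.example/home", "course syllabus", "b.example/cse494"),
    ("c.example/links", "information engineering syllabus", "b.example/cse494"),
    ("a.example/home", "campus map", "d.example/map"),
]

# Index each pointed-to page under the anchor terms of links that reference it.
anchor_index = defaultdict(set)
for _source, anchor_text, target in links:
    for term in anchor_text.lower().split():
        anchor_index[term].add(target)

# A very simple linkage signal: how many links point at each page.
in_links = defaultdict(int)
for _source, _anchor_text, target in links:
    in_links[target] += 1

print(sorted(anchor_index["syllabus"]))
print(dict(in_links))
```

Note that b.example/cse494 becomes retrievable under "syllabus" even if that word never appears on the page itself.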

Page 25: CSE494/598 Principles of Information Engineering Information Life Cycle

Meta-Search Engines

• A metasearch engine has a number of modules.

• The user interface module accepts the user’s query, which is forwarded, with any necessary reformatting, by the query dispatcher module to the various search engines.

• When the search engines return the sets of the retrieved documents to the metasearch engine, these sets are merged by the result merger module into a single ranked list of documents.

Page 26: CSE494/598 Principles of Information Engineering Information Life Cycle

Meta-Search Engines (cont’d..)

• A certain number of the top documents from this list are displayed to the user.

• When the number of search engines underlying a metasearch engine is large, forwarding each user query to each search engine is very inefficient. To overcome this, a database selection module is included.

• Its function is to identify for each user query the search engines that are likely to return useful documents to the user.
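The modules described above can be put together in a short sketch. The engines, their term summaries, and the scores are all hypothetical, and the merger simply keeps the best score per document; in practice scores from different engines are not directly comparable and need normalization.

```python
def select_engines(query, engines, summaries):
    """Database selection: keep engines whose term summary overlaps the query."""
    terms = set(query.lower().split())
    return [name for name in engines if terms & summaries[name]]


def metasearch(query, engines, summaries, top_k=3):
    merged = {}
    for name in select_engines(query, engines, summaries):
        # Query dispatcher: forward the query to each selected engine.
        for doc, score in engines[name](query):
            # Result merger: keep the best score seen for each document.
            merged[doc] = max(score, merged.get(doc, 0.0))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]  # only the top documents are shown to the user


# Hypothetical engines, each returning (document, score) pairs.
engines = {
    "news": lambda q: [("news/1", 0.9), ("shared/7", 0.4)],
    "papers": lambda q: [("papers/3", 0.7), ("shared/7", 0.6)],
    "recipes": lambda q: [("recipes/2", 0.8)],
}
# Hypothetical per-engine summaries used for database selection.
summaries = {
    "news": {"data", "mining"},
    "papers": {"data", "warehouse"},
    "recipes": {"soup"},
}

results = metasearch("data mining", engines, summaries)
print(results)
```

The "recipes" engine is never queried for "data mining", which is the saving that database selection is meant to provide.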

Page 27: CSE494/598 Principles of Information Engineering Information Life Cycle

8. Presentation

• Information must be presented to the user in a form that is usable

    – Cookies take care of part of the issue

• Issues are diverse and range from formatting and visualization to language and even cultural barriers

• In the case of multimedia information, both temporal and spatial issues must be dealt with

Page 28: CSE494/598 Principles of Information Engineering Information Life Cycle

9. Transport

• Moving of data/information from one location to another

    – Most common form: digital communication

• Technology selection for information transport:

    – What communication service?

    – What protocols?

    – What quality of service?

    – What physical resources?

Page 29: CSE494/598 Principles of Information Engineering Information Life Cycle

10. Information discard

• Destruction of information once its useful life is over

    – Generally, preserve data unless discard is needed

• Methods for discard

• Legal issues must be taken into account

Page 30: CSE494/598 Principles of Information Engineering Information Life Cycle

3. Analysis & mining

Additional notes…

Page 31: CSE494/598 Principles of Information Engineering Information Life Cycle

Information Analysis and Mining

• In multimedia objects:

    – Extraction of features

    – Their representation

    – Indexing on the basis of contents

• For data: Mining in order to find useful patterns and correlations

• For text:

    – Conceptual representation

    – Ontological classification of concepts

Page 32: CSE494/598 Principles of Information Engineering Information Life Cycle

Analysis of Images

• Extract features

    – Color

    – Shape

    – Texture

    – Spatial relationships

• Create a logical representation for the image

    – Semantic nets are effective

• Classify and index so that the search process will be efficient

Page 33: CSE494/598 Principles of Information Engineering Information Life Cycle

Analysis of Video

• Determine video segments by detecting scene cuts (scene cut detection process)

• Select a representative frame for each segment

• Extract spatial features:

    – color, texture, shape, and relative object positions

• Extract temporal features:

    – object trajectories, camera motion, viewing perspective

    – temporal relationships among objects

• Represent each segment with an object that can be efficiently indexed by its features
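Scene cut detection, the first step above, is often approached by comparing some summary of consecutive frames. The sketch below is one simple variant, assuming frames are flat lists of 8-bit intensity values and using a histogram difference with an arbitrary threshold; the slides do not say which detection method is intended.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalized intensity histogram of one frame (a flat list of pixels)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [count / len(frame) for count in counts]


def scene_cuts(frames, threshold=0.5):
    """Return indices of frames that start a new segment.

    The threshold is an arbitrary illustrative value.
    """
    if not frames:
        return []
    cuts = []
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        difference = sum(abs(a - b) for a, b in zip(previous, current))
        if difference > threshold:
            cuts.append(index)
        previous = current
    return cuts


# Made-up four-pixel frames: two dark, two bright, then dark again.
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 250]
frames = [dark, dark, bright, bright, dark]

cuts = scene_cuts(frames)
print(cuts)
```

Each detected cut starts a new segment, from which a representative frame could then be chosen, for example the middle frame of the segment.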

Page 34: CSE494/598 Principles of Information Engineering Information Life Cycle

Video Indexing process

[Figure: video indexing process. Processing stages: scene change detection; representative frame selection/creation; object segmentation; camera operation and object motion extraction; spatial features extraction; text analysis; closed caption analysis; audio analysis. Extracted index features: keywords, description text, objects, camera operation, object trajectory, shape, sketch, texture, color, spatial relationships, sound characteristic.]

Page 35: CSE494/598 Principles of Information Engineering Information Life Cycle

Analysis of Audio

• For Speech:

    – Textual information from speech (then sound retrieval becomes text retrieval)

    – Speaker information (identification)

• For Generic Sound:

    – Loudness

    – Pitch

    – Tone

    – Cepstrum

    – Derivatives

• For Music:

    – Rhythm

    – Event

    – Instrument

Page 36: CSE494/598 Principles of Information Engineering Information Life Cycle

Analysis of data

• The hardest task: Integration of data from multiple databases

    – Despite many years of work, we still have difficulty in this area

• Data mining tasks: descriptive, predictive

    – Descriptive: Characterize general properties of the data

    – Predictive: Perform inference on the data to make predictions

• Most common types: Specialized abstracts and integrated tables

Page 37: CSE494/598 Principles of Information Engineering Information Life Cycle

Data Collection and Database Creation

(1960s and earlier)

-Primitive file processing

Early days of databases

Page 38: CSE494/598 Principles of Information Engineering Information Life Cycle

Database Management Systems (1970s-early 1980s)

- Hierarchical and network database systems

- Relational database systems

- Data modeling tools: entity-relationship model, etc.

- Indexing and data organization techniques: B+-tree, hashing, etc.

- Query languages: SQL, etc.

- User interfaces, forms and reports

- Query processing and query optimization

- Transaction management: recovery, concurrency control, etc.

- On-line transaction processing (OLTP)

Database management systems

Page 39: CSE494/598 Principles of Information Engineering Information Life Cycle

Advanced Database Systems (mid-1980s-present)

- Advanced data models: extended-relational, object-oriented, object-relational, deductive

- Application-oriented: spatial, temporal, multimedia, active, scientific, knowledge bases

Current databases

Page 40: CSE494/598 Principles of Information Engineering Information Life Cycle

Data Warehousing and Data Mining

(late 1980s-present)

-Data warehouse and OLAP technology

-Data mining and knowledge discovery

Data Integration

Web-based Database Systems

(1990s-present)

-XML based database systems

-Web mining

Page 41: CSE494/598 Principles of Information Engineering Information Life Cycle

[Figure: summary of the database generations from the preceding slides, with one addition]

- Data Collection and Database Creation (1960s and earlier)

- Database Management Systems (1970s-early 1980s)

- Advanced Database Systems (mid-1980s-present)

- Data Warehousing and Data Mining (late 1980s-present)

- Web-based Database Systems (1990s-present)

- New Generation of Integrated Information Systems (2000-…)

Putting it all Together…

Page 42: CSE494/598 Principles of Information Engineering Information Life Cycle

Information mining process

• Data cleaning

    – Reformatting and conversion may be necessary

• Data integration

    – Heterogeneity possible in any aspect

• Data selection

• Data transformation

• Data mining and evaluation of patterns

• Presentation of knowledge
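The stages listed above can be traced on a toy example. All records, field names, and the selection threshold below are invented for illustration, and the "mining" step is just a frequency count standing in for a real pattern-discovery algorithm.

```python
from collections import Counter

# Two made-up sources with inconsistent field names and formatting.
source_a = [
    {"customer": " Alice ", "item": "Printer", "amount": "120.0"},
    {"customer": "bob", "item": "Paper", "amount": "n/a"},
]
source_b = [
    {"cust": "ALICE", "product": "paper", "total": 15.5},
    {"cust": "Carol", "product": "printer", "total": 130.0},
]


def clean(record, name_key, item_key, amount_key):
    """Cleaning: reformat and convert; drop records that cannot be converted."""
    try:
        amount = float(record[amount_key])
    except (TypeError, ValueError):
        return None
    return {
        "customer": record[name_key].strip().lower(),
        "item": record[item_key].strip().lower(),
        "amount": amount,
    }


# Cleaning and integration into one common record layout.
candidates = [clean(r, "customer", "item", "amount") for r in source_a]
candidates += [clean(r, "cust", "product", "total") for r in source_b]
integrated = [r for r in candidates if r is not None]

# Selection: keep only the records relevant to the question (large purchases).
selected = [r for r in integrated if r["amount"] >= 100]

# Transformation: reduce each record to the attribute being mined.
items = [r["item"] for r in selected]

# Mining and evaluation: which items dominate large purchases?
patterns = Counter(items)

# Presentation of knowledge.
for item, count in patterns.most_common():
    print(f"{item}: {count} large purchase(s)")
```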

Page 43: CSE494/598 Principles of Information Engineering Information Life Cycle

[Figure: the information mining process. Flat files and databases -> cleaning and integration -> data warehouse -> selection and transformation -> data mining -> patterns -> evaluation and presentation -> knowledge]

Page 44: CSE494/598 Principles of Information Engineering Information Life Cycle

Data Warehousing and ETL

• An organized repository of data from multiple data sources

    – A unified schema for all of the participating databases

• Provides data analysis capabilities, collectively known as On-Line Analytical Processing (OLAP)

• A number of pieces are needed: tools, gateways, and conversion routines
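A miniature version of this idea is sketched below using Python's standard sqlite3 module as a stand-in warehouse. The source records, field names, and table layout are all assumptions for illustration: two sources with different schemas are converted to one unified schema, loaded, and then summarized with an aggregate query of the kind OLAP tools generate.

```python
import sqlite3

# Made-up records from two sources with differing schemas.
location_1 = [
    {"region": "East", "sales_usd": 100.0},
    {"region": "West", "sales_usd": 50.0},
]
location_2 = [{"area": "east", "revenue": 25.0}]


def to_unified(record, region_key, amount_key):
    """Conversion routine: map a source record onto the unified schema."""
    return (record[region_key].strip().title(), float(record[amount_key]))


# Extract and transform.
rows = [to_unified(r, "region", "sales_usd") for r in location_1]
rows += [to_unified(r, "area", "revenue") for r in location_2]

# Load into an in-memory database standing in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# A simple analytical query: total sales per region across all sources.
totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Without the conversion step, "East" and "east" would be treated as different regions, which is the kind of heterogeneity the unified schema is meant to remove.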

Page 45: CSE494/598 Principles of Information Engineering Information Life Cycle

[Figure: Typical architecture of a data warehouse. Data sources in Locations 1 through 4 are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]