EIM Intro - Information Lifecycle - Doc



    CRF-RDTE-TR-20091102-02

    2/2/2009

    Public Distribution| Michael Corsello

    CORSELLO

    RESEARCH

    FOUNDATION

INFORMATION LIFECYCLE BASICS

INTRODUCTION TO ENTERPRISE INFORMATION MANAGEMENT


Abstract

Information follows a basic lifecycle from creation to disposal. Managing information means architecting the structures and practices that govern this lifecycle and all stages of information handling and processing.


Table of Contents

Abstract
Introduction
Information Lifecycle
Creation
    Capture
        Continual
        Bulk
        Manual
        Derived
    Assessment
    Ingestion
Distribution and Use
    Discoverability
    Accessibility
    Usability
Maintenance
Disposition
Conclusions
Appendices
    References


Introduction

Knowledge is defined by the Oxford English Dictionary as (i) expertise and skills acquired by a person through experience or education; the theoretical or practical understanding of a subject; (ii) what is known in a particular field or in total; facts and information; or (iii) awareness or familiarity gained by experience of a fact or situation (Wikipedia contributors, 2009).

Knowledge, as it pertains to information technology (IT), is the understanding of a subject by a human based upon information presented by a computer and the human's prior knowledge of related subjects.

    This is a key concept in that IT is responsible for the information that then results in knowledge. If

    information is poorly presented to a person, knowledge is not effectively gained.

    Information presented by a computer to a human is a set of data aggregated, processed and formatted

    for human interpretation. The computer aggregates data based upon rules, such as through queries

    that intend to limit which data is aggregated. Since computers can only perform actions as they are

    programmed to, this process of sub-setting a corpus of data is constrained by how the data is managed

    and how the computer is programmed.

    Data is a set of simple values. When collected and presented under a context, those simple values

    become information. The handling and management of data over its relevant lifetime is the information

    lifecycle.

Information Lifecycle

The information lifecycle comprises the processes by which data comes into existence, is managed over time and is eventually discarded. There are generally four basic states of the information lifecycle:

- Creation, collection or capture
- Distribution, use and access
- Maintenance, update or change
- Disposition, archival or destruction

Each of these states contributes significantly to how effectively data can serve as information that may then become knowledge for a user. Each information state may also involve multiple IT systems, or none at all (such as paper notes). The information lifecycle includes activities involving information both inside and outside of IT systems, as well as the movement of data between IT systems and applications.
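The four states above can be sketched as a simple state model. The following Python sketch is illustrative only; the state names and allowed transitions are our reading of the lifecycle, not part of the report:

```python
from enum import Enum

class LifecycleState(Enum):
    """The four basic states of the information lifecycle."""
    CREATION = "creation, collection or capture"
    DISTRIBUTION = "distribution, use and access"
    MAINTENANCE = "maintenance, update or change"
    DISPOSITION = "disposition, archival or destruction"

# Assumed transitions: data cycles between use and maintenance
# until it is finally disposed of (archived or destroyed).
TRANSITIONS = {
    LifecycleState.CREATION: {LifecycleState.DISTRIBUTION},
    LifecycleState.DISTRIBUTION: {LifecycleState.MAINTENANCE,
                                  LifecycleState.DISPOSITION},
    LifecycleState.MAINTENANCE: {LifecycleState.DISTRIBUTION,
                                 LifecycleState.DISPOSITION},
    LifecycleState.DISPOSITION: set(),  # terminal state
}

def can_transition(src: LifecycleState, dst: LifecycleState) -> bool:
    """Return True if data may move from state src to state dst."""
    return dst in TRANSITIONS[src]
```

In practice each transition may span several IT systems, or none, as noted above.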


Creation

The creation of data spans the entire process from initial data generation through to final storage within a permanent repository. For some data, the entire lifecycle may occur outside of IT systems, as with paper records; in that case, analysis results produced from the data may then be managed electronically.

The data creation phase of the information lifecycle is broken into three primary areas:

- Capture
- Assessment and Approval
- Ingestion

    Capture

Data capture can be divided into four primary categories:

- Continual (feed)
- Bulk
- Manual
- Derived

    Continual

    Data is continual if captured via an automated mechanism that directly provides the collected data to an

    information system. This is commonly known as a data feed, such as in supervisory control and data

    acquisition (SCADA) sources. Feed data is characterized by a minimum of human handling and is directly

fed into an information system or database. This level of capture expects high fidelity (minimal data loss). Error detection and quality control for this type of data are more immediate than for other forms and often require staging of feed data to allow for assessment. In continual data, gaps are primarily caused by system failures or calibration errors.

    Bulk

    When data is captured via an automated device and stored on a local, immediate storage device (such

    as in a data logger), it is classed as a bulk capture system. Like a continual system, there is minimal

    human interaction with the data itself as it is logged automatically. However, unlike a continual feed,

there may be additional delays and data losses due to the handling of the devices used to transfer the bulk data.

    Continual and bulk mechanisms are often used together to form a data assessment chain in which

    continual data is fed into a queue where it is evaluated in bulk for quality (quality assessment / QA) then

    transferred into the permanent store as a block of records.
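A minimal sketch of such an assessment chain, assuming a hypothetical range check as the QA criterion (the record format, batch size and threshold are our own illustration):

```python
from collections import deque

def qa_check(record):
    """Hypothetical quality check: reject readings outside a plausible range."""
    return 0.0 <= record["value"] <= 100.0

def stage_and_transfer(feed, store, batch_size=3):
    """Stage continual feed records in a queue, assess them in bulk,
    and transfer accepted batches into the permanent store."""
    staging = deque()
    rejected = []
    for record in feed:
        staging.append(record)
        if len(staging) >= batch_size:
            batch = [staging.popleft() for _ in range(batch_size)]
            store.extend(r for r in batch if qa_check(r))     # block transfer
            rejected.extend(r for r in batch if not qa_check(r))
    return rejected

store = []
feed = [{"value": v} for v in (12.5, 47.0, 999.0, 63.2, 8.8, 55.1)]
bad = stage_and_transfer(feed, store)
```

Here the out-of-range reading (999.0) is held back during bulk assessment rather than flowing straight into the permanent store.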

    Manual

When a human records data by hand, either in writing or in a manual data logger (such as a field GPS unit), the result is manually collected data. Manual data is then entered by hand into an information system, or transferred from the manual data logger. Manual data is prone to several additional sources of error


    including typos and general human blunders. These errors are non-systematic and automatic correction

    is not possible in most cases.

    Derived

If a data set is generated as the output of an automated process, such as an analysis routine, it is a derived data set. Derived data is often of the greatest direct value in a business, and is generated using data created by other means. Derived data will contain errors propagated from its constituent source data and from analytic errors from any of a number of sources:

- Numeric precision or rounding
- Incompatible input sources or scales
- Faulty analysis choices, such as false positives and false negatives
- Incomplete input source data

Derived data is often generated on an automated schedule and can be of great value when persisted in a usable form for other analyses. Derived data sets are commonly used in reports, which may result in a data format that is only useful for direct human interpretation.
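Numeric precision and rounding, the first error source above, is easy to demonstrate: repeated floating-point addition accumulates error that a compensated summation avoids.

```python
import math

# Ten nominally identical readings of 0.1 units each.
readings = [0.1] * 10

naive_total = sum(readings)          # 0.9999999999999999 due to binary rounding
stable_total = math.fsum(readings)   # 1.0 via compensated summation
```

A derived total built with the naive sum would silently carry this error into every downstream analysis.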

    Assessment

Once data is captured, the assessment process involves the evaluation of data to ensure it meets pre-defined criteria for acceptance. This process has two primary parts:

- Quality Assurance (QA)
- Quality Control (QC)

Quality assurance is the set of practices performed before data is created to ensure it will meet acceptance criteria. This involves activities such as maintenance and calibration of instruments and the use of proper instruments. Additionally, evaluation of data once collected drives the quality assurance activities for future collections.

Quality control is the set of practices, including QA, that ensure the quality of data within a business repository will meet or exceed quality criteria. The term QC generally refers to the subset of quality-control practices that are distinct from the QA portion of quality control.

    The control of data quality includes the evaluation of data for quality prior to loading into the business

    repositories and the management of all activities related to the history of the data quality controlled.

This history is a chain of custody and a sequence of events that define how the data came to be in a usable state within the business repository. Data within a repository can therefore be selected by this

    chain of custody if any portion of the activities within the chain are found to be suspect (such as a faulty

    collection instrument). It is common to construct information systems specifically around this facet of

    information management.
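Selecting data by its chain of custody can be sketched as follows; the record structure and event names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Record:
    value: float
    custody: list  # ordered chain-of-custody events for this record

def select_by_custody(records, suspect_event):
    """Select records whose custody chain contains a suspect activity,
    such as collection by an instrument later found to be faulty."""
    return [r for r in records if suspect_event in r.custody]

data = [
    Record(21.3, ["sensor-A", "qa-pass", "loaded"]),
    Record(19.8, ["sensor-B", "qa-pass", "loaded"]),
    Record(22.1, ["sensor-A", "qa-pass", "loaded"]),
]
# sensor-A is later found to have a calibration fault: flag its records.
suspect = select_by_custody(data, "sensor-A")
```

A real system would persist these events alongside the data so that such queries remain possible years after collection.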


    Ingestion

Only once data has been captured and assessed for quality is it loaded into the appropriate business

    repositories. The process of ingesting data into a repository may involve transformation of the data to

    match the destination format. This is a common requirement for automated collection mechanisms

    where the source instrument produces a fixed data format. To maintain a full and verifiable chain of

    custody, the raw data is kept in addition to the transformed data that resides within the business

    repositories. For space savings, the raw files are often archived to an offline store such as optical media.
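A sketch of ingestion with raw-data retention, assuming a hypothetical pipe-delimited instrument format; the schema and field names are our own:

```python
import hashlib

def ingest(raw_record: str, repository: list, raw_archive: dict):
    """Transform a fixed-format instrument record into the repository
    schema while retaining the untouched raw record for chain of custody."""
    # Keep the raw record, keyed by content hash, before any transformation.
    key = hashlib.sha256(raw_record.encode()).hexdigest()
    raw_archive[key] = raw_record
    # Assumed source format: "station|timestamp|value"
    station, ts, value = raw_record.split("|")
    repository.append({"station": station, "time": ts,
                       "value": float(value), "raw_ref": key})

repo, archive = [], {}
ingest("ST01|2009-11-02T10:00|42.5", repo, archive)
```

The `raw_ref` back-pointer lets any transformed record be verified against its original bytes, even after the raw archive is moved to offline media.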

Distribution and Use

The use of data within a repository is the primary purpose for the data's existence. Data use is considered in several ways:

- Discoverability
- Accessibility
- Usability

    Discoverability

Once data is within a repository, it may be used. In order to use that data, it must first be discovered by a potential user. The mechanisms put in place to facilitate the location of data are discovery mechanisms. If data cannot be found, it cannot be used. Discoverability is therefore a key concern in how data is stored and in the availability of that storage to a user's system. If a user must search in multiple locations to find the data required, the data is of marginal discoverability. For data to be discovered, the discovery data (metadata or catalog) must also be accessible.
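A discovery mechanism over a metadata catalog can be as simple as matching catalog entries against search criteria; the catalog fields and dataset names here are hypothetical:

```python
def search_catalog(catalog, **criteria):
    """Search a metadata catalog for datasets matching all given criteria.
    Discovery operates on the metadata, never on the data itself."""
    return [entry for entry in catalog
            if all(entry.get(k) == v for k, v in criteria.items())]

catalog = [
    {"name": "wells_2008", "theme": "hydrology", "region": "east"},
    {"name": "wells_2009", "theme": "hydrology", "region": "west"},
    {"name": "roads_2009", "theme": "transport", "region": "west"},
]
hits = search_catalog(catalog, theme="hydrology", region="west")
```

Note that the catalog itself must be reachable by the user, which is the accessibility concern discussed next.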


    Accessibility

    Once discovered, the data must be accessed to provide value. The accessibility of data involves aspects

    such as security, logical location and format. If data is secured so that potential users cannot access it,

    the value of the data is diminished to those users. In sensitive domains, this is expected and desired.

    Logical location further limits accessibility if the data is contained within a repository that cannot be

    accessed, such as behind a firewall. Further, if the logical location is simply far away, then data

    transfer may take too long, rendering the data irrelevant once finally accessible. Finally, if the data is in

    formats that are proprietary or poorly supported, then the data may not be accessible to the tools used

    to process it. Overall, accessibility is a balancing act with security, need and cost.

    Usability

After data has been discovered and accessed, it must then be usable. If the data is in a format the available tools cannot process, it is effectively unusable. If data must be processed prior to being used, such as by reformatting or translation, it will be less usable. If that processing takes too much time, the data may become irrelevant once it is finally in a usable form.

Usability also has subtler implications, such as the scale, accuracy and precision of the data itself. Low-precision data cannot be used in a high-precision analysis. As the cost of data creation is a function of its volume and quality, this is always a trade-off against anticipated use.

Maintenance

For any data that changes over time, the maintenance of the data values is critical. This data editing is still subject to discovery, access and usage, in addition to the need for performing the edits. In some scenarios only the current values are relevant, while in others the temporal changes are of greater significance than the current values; this distinction will influence data management strategies.

    The entire set of practices and processes that govern how data is managed and maintained within the

    business repositories is the maintenance phase of the lifecycle. Issues such as archival, availability,

continuity of operations (COOP), fault tolerance, performance and total costs are key considerations in

    the maintenance phase.

Disposition

The disposition phase of the information lifecycle involves the processes and practices by which data is aged within the business repository. Disposition includes the archival or removal of old data, segregation of history data from live data, and mechanisms for making segregated data available. It is common for disposition to be driven by storage costs and legal mandates such as Sarbanes-Oxley, Clinger-Cohen or the Health Insurance Portability and Accountability Act (HIPAA).
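Age-based segregation of history data from live data, the core of many disposition policies, can be sketched as follows (the retention period and record fields are assumptions, not mandated values):

```python
from datetime import date, timedelta

def disposition(records, today, retention_days=365):
    """Segregate live records from history records by age; history
    records would move to a cheaper but still-discoverable store."""
    cutoff = today - timedelta(days=retention_days)
    live = [r for r in records if r["created"] >= cutoff]
    history = [r for r in records if r["created"] < cutoff]
    return live, history

records = [
    {"id": 1, "created": date(2009, 11, 1)},
    {"id": 2, "created": date(2007, 5, 20)},
]
live, history = disposition(records, today=date(2009, 11, 13))
```

Actual retention periods would be set by the applicable legal mandate rather than a fixed constant.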


Conclusions

Architecting information solutions for an organization is a complex set of practices and trade-offs intended to maximize capabilities while minimizing cost. Given that information solutions take a great deal of time and care to construct, proper planning is required well in advance of need, so that solutions are available by the time the need arises and without wasted effort.

Various strategies exist for planning information repositories, software implementations and user-facing

    applications. Planning for reuse of repositories and software back-end components and services is of

    great importance. Stakeholders involved with information strategies need to understand the difference

    between the data repositories containing data, back-end software processing data and the user

    interfaces that present data and processing. The separation of these concepts in the minds of those

    involved in planning can yield great results in long-term cost savings and capabilities realized.

    Appendices

    References

Wikipedia contributors. (2009, November 13). Knowledge. Retrieved November 13, 2009, from Wikipedia, The Free Encyclopedia: http://en.wikipedia.org/w/index.php?title=Knowledge&oldid=325539292