CRF-RDTE-TR-20091102-02
2/2/2009
Public Distribution | Michael Corsello
CORSELLO RESEARCH FOUNDATION

INFORMATION LIFECYCLE BASICS
INTRODUCTION TO ENTERPRISE INFORMATION MANAGEMENT
Abstract
Information follows a basic lifecycle from creation to disposal. Management of information is based upon the concepts of architecting the structures and practices for managing the lifecycle of information and all stages of information handling and processing.
Table of Contents

Abstract
Introduction
Information Lifecycle
Creation
    Capture
        Continual
        Bulk
        Manual
        Derived
    Assessment
    Ingestion
Distribution and Use
    Discoverability
    Accessibility
    Usability
Maintenance
Disposition
Conclusions
Appendices
    References
Introduction
Knowledge is defined by the Oxford English Dictionary as (i) expertise and skills acquired by a person through experience or education; the theoretical or practical understanding of a subject, (ii) what is known in a particular field or in total; facts and information, or (iii) awareness or familiarity gained by experience of a fact or situation (Wikipedia contributors, 2009).
Knowledge, as it pertains to information technology (IT), is the understanding of a subject by a human based upon information presented by a computer and the human's prior knowledge of related subjects.
This is a key concept in that IT is responsible for the information that then results in knowledge. If
information is poorly presented to a person, knowledge is not effectively gained.
Information presented by a computer to a human is a set of data aggregated, processed and formatted
for human interpretation. The computer aggregates data based upon rules, such as through queries
that intend to limit which data is aggregated. Since computers can only perform actions as they are
programmed to, this process of sub-setting a corpus of data is constrained by how the data is managed
and how the computer is programmed.
Data is a set of simple values. When collected and presented under a context, those simple values
become information. The handling and management of data over its relevant lifetime is the information
lifecycle.
Information Lifecycle
The information lifecycle comprises the processes by which data comes into existence, is managed over time and is eventually discarded. There are generally four basic states of the information lifecycle:

- Creation, collection or capture
- Distribution, use and access
- Maintenance, update or change
- Disposition, archival or destruction
Each of these states contributes significantly to the effectiveness of data to participate as information
that may then become knowledge to a user. Each information state may also involve multiple IT
systems, or none at all (such as paper notes). The information lifecycle includes activities involving
information both inside and outside of IT systems, as well as the movement of data between IT systems
and applications.
Creation
The creation of data involves the entire process from initial data generation through to the final storage of data within a permanent repository. For some data, the entire lifecycle may be outside of IT systems, such as paper records. In this case, analysis results produced from the data may then be managed electronically.
The data creation phase of the information lifecycle is broken into three primary areas:
- Capture
- Assessment and Approval
- Ingestion
Capture
Data capture can be divided into four primary categories:
- Continual (feed)
- Bulk
- Manual
- Derived
Continual
Data is continual if captured via an automated mechanism that directly provides the collected data to an
information system. This is commonly known as a data feed, such as in supervisory control and data
acquisition (SCADA) sources. Feed data is characterized by a minimum of human handling and is directly
fed into an information system or database. This level of capture expects high fidelity (minimal data loss). Error handling and quality control for this type of data are more immediate than for other forms and often require staging of feed data to allow for assessment. In continual data, gaps are primarily caused by system failures or calibration errors.
Bulk
When data is captured via an automated device and stored on a local, immediate storage device (such
as in a data logger), it is classed as a bulk capture system. Like a continual system, there is minimal
human interaction with the data itself as it is logged automatically. However, unlike a continual feed,
there may be additional data losses in time due to handling of devices to transfer the bulk data.
Continual and bulk mechanisms are often used together to form a data assessment chain in which
continual data is fed into a queue where it is evaluated in bulk for quality (quality assessment / QA) then
transferred into the permanent store as a block of records.
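A minimal sketch of such an assessment chain, assuming an in-memory queue as the staging area and a simple range check as the QA criterion (the record shape, sensor ID and thresholds here are illustrative, not from the source):

```python
from collections import deque

staging = deque()          # continual feed lands here first
permanent_store = []       # the business repository

def feed(reading):
    """Continual capture: append each raw reading to the staging queue."""
    staging.append(reading)

def qa_pass(reading, low=-40.0, high=60.0):
    """Illustrative QA rule: accept only readings in a plausible range."""
    return low <= reading["value"] <= high

def commit_block():
    """Bulk step: evaluate everything staged, then transfer as one block."""
    block = [r for r in staging if qa_pass(r)]
    staging.clear()
    permanent_store.extend(block)
    return block

# A short simulated feed: one reading is out of range and is rejected.
for v in (12.5, 13.1, 999.0, 12.8):
    feed({"sensor": "T-01", "value": v})
accepted = commit_block()
```

The queue decouples the continual feed from the bulk QA step, so the permanent store only ever receives records as assessed blocks.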
Manual
When a human records data by hand, either in writing or into a manual data logger (such as a field GPS unit), that data is manually collected. Manual data is then entered by hand into an information system, or transferred from the manual data logger. Manual data is prone to several additional sources of error,
including typos and general human blunders. These errors are non-systematic and automatic correction
is not possible in most cases.
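Because manual-entry errors are non-systematic, a system can typically only flag suspect records for human review rather than correct them automatically. A minimal sketch of such flagging, where the field names and validation rules are purely illustrative:

```python
def flag_suspect(record):
    """Return a list of problems found in a hand-entered record.

    These are illustrative checks, not corrections: a flagged record
    is routed back to a person for review.
    """
    problems = []
    if not record.get("site_id", "").startswith("SITE-"):
        problems.append("malformed site_id")
    try:
        depth = float(record.get("depth_m", ""))
        if not 0.0 <= depth <= 500.0:
            problems.append("depth_m out of plausible range")
    except ValueError:
        problems.append("depth_m is not numeric")
    return problems

ok = flag_suspect({"site_id": "SITE-042", "depth_m": "12.7"})
bad = flag_suspect({"site_id": "S1TE-042", "depth_m": "12,7"})
```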
Derived
If a data set is generated as the output of an automated process, such as an analysis routine, it is a derived data set. Derived data is often of the greatest direct value in a business, and is generated using data created by other means. Derived data will contain errors propagated from its constituent source data and from analytic errors from any of a number of sources:

- Numeric precision or rounding
- Incompatible input sources or scales
- Faulty analysis choices, such as false positives and false negatives
- Incomplete input source data

Derived data is often generated on an automated schedule and can be of great value when persisted in a usable form for other analyses. Derived data sets are commonly used in reports, which may result in a data format that is only useful for direct human interpretation.
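The first error source above, numeric precision, is easy to demonstrate: binary floating point cannot represent many decimal values exactly, so a derived total silently drifts from its exact decimal counterpart. A generic illustration, not taken from the source:

```python
from decimal import Decimal

# Summing 0.1 ten times in binary floating point does not yield exactly
# 1.0; a derived total carries this representation error forward.
float_total = sum(0.1 for _ in range(10))

# Decimal arithmetic preserves the intended decimal values exactly.
decimal_total = sum(Decimal("0.1") for _ in range(10))
```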
Assessment
Once data is captured, the assessment process involves the evaluation of data to ensure it meets pre-defined criteria for acceptance. This process has two primary parts:
- Quality Assurance (QA)
- Quality Control (QC)
Quality assurance is the set of practices that are performed to ensure data will meet acceptance criteria
prior to being created. This involves activities such as maintenance and calibration of instruments and
the usage of proper instruments. Additionally, evaluation of data once collected drives the quality
assurance activities for future collections.
Quality control is the set of practices, including QA, that ensure the quality of data within a business repository will meet or exceed quality criteria. The term QC generally refers to the subset of quality control practices that are distinct from the QA portion of quality control.
The control of data quality includes the evaluation of data for quality prior to loading into the business
repositories and the management of all activities related to the history of the data quality controlled.
This history is a chain of custody: a sequence of events that define how the data came to be in a usable state within the business repository. Data within a repository can therefore be selected by this
chain of custody if any portion of the activities within the chain are found to be suspect (such as a faulty
collection instrument). It is common to construct information systems specifically around this facet of
information management.
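One way to make the chain of custody queryable is to attach the sequence of handling events to each record, so that records touched by a suspect instrument can be pulled back for review. A minimal sketch, where the event fields and instrument IDs are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    value: float
    # Chain of custody: ordered events recording which instrument
    # collected the value and which QC step approved it.
    custody: list = field(default_factory=list)

def records_touched_by(records, instrument_id):
    """Select records whose custody chain involves a given instrument."""
    return [r for r in records
            if any(e.get("instrument") == instrument_id for e in r.custody)]

repo = [
    Record(21.4, [{"step": "capture", "instrument": "probe-A"},
                  {"step": "qc", "approved": True}]),
    Record(19.9, [{"step": "capture", "instrument": "probe-B"},
                  {"step": "qc", "approved": True}]),
]

# probe-A is later found to be out of calibration: recall its records.
suspect = records_touched_by(repo, "probe-A")
```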
Ingestion
Once data is captured and assessed for quality, it is only then loaded into the appropriate business
repositories. The process of ingesting data into a repository may involve transformation of the data to
match the destination format. This is a common requirement for automated collection mechanisms
where the source instrument produces a fixed data format. To maintain a full and verifiable chain of
custody, the raw data is kept in addition to the transformed data that resides within the business
repositories. For space savings, the raw files are often archived to an offline store such as optical media.
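The transform-and-retain pattern described above can be sketched as follows, assuming a hypothetical fixed-width instrument line and a simple dictionary as the destination format (both the layout and the field names are illustrative):

```python
raw_archive = []      # raw lines kept for the chain of custody
repository = []       # transformed records in the destination format

def ingest(raw_line):
    """Transform a fixed-format instrument line, retaining the raw input.

    Assumed layout (hypothetical): chars 0-7 sensor ID, 8-15 reading.
    """
    raw_archive.append(raw_line)          # keep the raw data verbatim
    record = {
        "sensor": raw_line[0:8].strip(),
        "value": float(raw_line[8:16]),
    }
    repository.append(record)
    return record

raw = "T-01".ljust(8) + "12.50".rjust(8)
rec = ingest(raw)
```

Keeping the raw line alongside the transformed record is what makes the chain of custody verifiable: the transformation can be re-run or audited later.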
Distribution and Use
The use of data within a repository is the primary purpose for the data's existence. Data use is considered in several ways:

- Discoverability
- Accessibility
- Usability
Discoverability
Once data is within a repository it may be used. In order to use that data, it must be discovered by a potential user. The mechanisms put in place to facilitate the location of data are discovery mechanisms.
If data cannot be found, it cannot be used. Discoverability is key in the storage of data and the
availability of that storage from a user system. If a user must search in multiple locations to find the
data required, it is of marginal discoverability. For data to be discovered, the discovery data (metadata
or catalog) must also be accessible.
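A discovery mechanism can be as simple as a searchable metadata catalog kept alongside the repositories; the dataset names, keywords and locations below are illustrative:

```python
catalog = [
    {"dataset": "stream_temps_2009",
     "keywords": {"temperature", "hydrology"}, "location": "repo-a"},
    {"dataset": "well_levels_2009",
     "keywords": {"groundwater", "hydrology"}, "location": "repo-b"},
]

def discover(keyword):
    """Return catalog entries whose keywords include the search term."""
    return [e for e in catalog if keyword in e["keywords"]]

hits = discover("hydrology")
```

Note that the catalog itself must be accessible: if a user cannot reach the discovery data, the underlying data is effectively undiscoverable.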
Accessibility
Once discovered, the data must be accessed to provide value. The accessibility of data involves aspects
such as security, logical location and format. If data is secured so that potential users cannot access it,
the value of the data is diminished to those users. In sensitive domains, this is expected and desired.
Logical location further limits accessibility if the data is contained within a repository that cannot be
accessed, such as behind a firewall. Further, if the logical location is simply far away, then data
transfer may take too long, rendering the data irrelevant once finally accessible. Finally, if the data is in
formats that are proprietary or poorly supported, then the data may not be accessible to the tools used
to process it. Overall, accessibility is a balancing act with security, need and cost.
Usability
After data has been discovered and accessed, it must then be usable. If the data is in an unusable format given the available tools, it is unusable. If data must be processed prior to being used, such as through reformatting or translation, it will be less usable. If the data processing takes too much time, the data may become irrelevant once it is in a usable format.
Usability also has more subtle implications, such as the scale, accuracy and precision of the data itself. Low-precision data cannot be used in a high-precision analysis. As the cost of data creation is a function of its volume and quality, this is always a trade-off against anticipated use.
Maintenance
For any data that changes over time, the maintenance of the data values is critical. This data editing is still subject to discovery, access and usage in addition to the need for performing the edits. In some scenarios only the current values are relevant, while in others temporal changes are of greater significance than the current values; this distinction will influence data management strategies.
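When temporal changes matter more than current values, one common strategy is to append a timestamped version for every edit instead of overwriting, so both the latest value and the full history remain available. A minimal sketch under that assumption (the store layout and key names are illustrative):

```python
history = {}   # key -> list of (timestamp, value), append-only

def update(key, timestamp, value):
    """Record a new value without discarding earlier ones."""
    history.setdefault(key, []).append((timestamp, value))

def current(key):
    """The latest value for a key (ISO timestamps sort lexically)."""
    return max(history[key])[1]

update("gauge-7", "2009-01-05", 3.2)
update("gauge-7", "2009-06-18", 3.9)   # an edit; the old value is kept
```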
The entire set of practices and processes that govern how data is managed and maintained within the
business repositories is the maintenance phase of the lifecycle. Issues such as archival, availability,
continuity of operations (COOP), fault-tolerance, performance and total costs are of key consideration in
the maintenance phase.
Disposition
The disposition phase of the information lifecycle involves the processes and practices by which data is aged within the business repository. Disposition includes the archival or removal of old data, the segregation of history data from live data and mechanisms for making segregated data available. It is common for disposition to be driven by storage costs and legal mandates such as Sarbanes-Oxley, Clinger-Cohen or the Health Insurance Portability and Accountability Act (HIPAA).
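An age-based disposition policy can be sketched as a routine that segregates records older than a retention cutoff into an archive while live data stays in place (the cutoff year and record shape are illustrative):

```python
live = [
    {"id": 1, "year": 2001, "value": 10.0},
    {"id": 2, "year": 2008, "value": 11.5},
    {"id": 3, "year": 2009, "value": 12.0},
]
archive = []

def disposition(retain_from_year):
    """Move records older than the retention cutoff into the archive."""
    global live
    old = [r for r in live if r["year"] < retain_from_year]
    archive.extend(old)
    live = [r for r in live if r["year"] >= retain_from_year]
    return len(old)

moved = disposition(2008)
```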
Conclusions
Architecting information solutions for an organization is a complex set of practices and trade-offs to maximize capabilities while minimizing cost. Given that information solutions take a great deal of time and care to construct, proper planning is required well in advance of need to ensure solutions are available by the time the need arises, without wasted effort.
Various strategies exist for planning information repositories, software implementations and user facing
applications. Planning for reuse of repositories and software back-end components and services is of
great importance. Stakeholders involved with information strategies need to understand the difference
between the data repositories containing data, back-end software processing data and the user
interfaces that present data and processing. The separation of these concepts in the minds of those involved in planning can yield significant long-term cost savings and realized capabilities.
Appendices
References
Wikipedia contributors. (2009, November 13). Knowledge. Retrieved November 13, 2009, from Wikipedia, The Free Encyclopedia: http://en.wikipedia.org/w/index.php?title=Knowledge&oldid=325539292