EIM Intro - Information Lifecycle - Doc



    CRF-RDTE-TR-20091102-02

    2/2/2009

    Public Distribution| Michael Corsello

    CORSELLO

    RESEARCH

    FOUNDATION

INFORMATION LIFECYCLE BASICS

INTRODUCTION TO ENTERPRISE INFORMATION MANAGEMENT


Abstract

Information follows a basic lifecycle from creation to disposal. Managing information means architecting the structures and practices that govern this lifecycle and all stages of information handling and processing.


Table of Contents

Abstract
Introduction
Information Lifecycle
Creation
    Capture
        Continual
        Bulk
        Manual
        Derived
    Assessment
    Ingestion
Distribution and Use
    Discoverability
    Accessibility
    Usability
Maintenance
Disposition
Conclusions
Appendices
    References


Introduction

Knowledge is defined by the Oxford English Dictionary as (i) expertise and skills acquired by a person through experience or education; the theoretical or practical understanding of a subject; (ii) what is known in a particular field or in total; facts and information; or (iii) awareness or familiarity gained by experience of a fact or situation (Wikipedia contributors, 2009).

Knowledge, as it pertains to information technology (IT), is the understanding of a subject by a human based upon information presented by a computer and the human's prior knowledge of related subjects.

    This is a key concept in that IT is responsible for the information that then results in knowledge. If

    information is poorly presented to a person, knowledge is not effectively gained.

    Information presented by a computer to a human is a set of data aggregated, processed and formatted

    for human interpretation. The computer aggregates data based upon rules, such as through queries

    that intend to limit which data is aggregated. Since computers can only perform actions as they are

    programmed to, this process of sub-setting a corpus of data is constrained by how the data is managed

    and how the computer is programmed.

    Data is a set of simple values. When collected and presented under a context, those simple values

    become information. The handling and management of data over its relevant lifetime is the information

    lifecycle.

Information Lifecycle

The information lifecycle comprises the processes by which data comes into existence, is managed over time and is eventually discarded. There are generally four basic states of the information lifecycle:

- Creation, collection or capture
- Distribution, use and access
- Maintenance, update or change
- Disposition, archival or destruction

Each of these states contributes significantly to how effectively data can serve as information that may then become knowledge for a user. Each information state may also involve multiple IT systems, or none at all (such as paper notes). The information lifecycle includes activities involving information both inside and outside of IT systems, as well as the movement of data between IT systems and applications.
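The four states above can be sketched as a simple state model. The following Python sketch is illustrative only; the state names and allowed transitions are our reading of the lifecycle, not part of the report:

```python
from enum import Enum

class LifecycleState(Enum):
    """The four basic states of the information lifecycle."""
    CREATION = "creation, collection or capture"
    DISTRIBUTION = "distribution, use and access"
    MAINTENANCE = "maintenance, update or change"
    DISPOSITION = "disposition, archival or destruction"

# Assumed transitions: data cycles between use and maintenance
# until it is finally disposed of (archived or destroyed).
TRANSITIONS = {
    LifecycleState.CREATION: {LifecycleState.DISTRIBUTION},
    LifecycleState.DISTRIBUTION: {LifecycleState.MAINTENANCE,
                                  LifecycleState.DISPOSITION},
    LifecycleState.MAINTENANCE: {LifecycleState.DISTRIBUTION,
                                 LifecycleState.DISPOSITION},
    LifecycleState.DISPOSITION: set(),  # terminal state
}

def can_transition(src: LifecycleState, dst: LifecycleState) -> bool:
    """Return True if data may move from state src to state dst."""
    return dst in TRANSITIONS[src]
```

In practice each transition may span several IT systems, or none, as noted above.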


Creation

The creation of data spans the entire process from initial data generation through to final storage within a permanent repository. For some data, the entire lifecycle may occur outside of IT systems, as with paper records; in that case, analysis results produced from the data may then be managed electronically.

The data creation phase of the information lifecycle is broken into three primary areas:

- Capture
- Assessment and Approval
- Ingestion

    Capture

Data capture can be divided into four primary categories:

- Continual (feed)
- Bulk
- Manual
- Derived

    Continual

    Data is continual if captured via an automated mechanism that directly provides the collected data to an

    information system. This is commonly known as a data feed, such as in supervisory control and data

    acquisition (SCADA) sources. Feed data is characterized by a minimum of human handling and is directly

fed into an information system or database. This level of capture expects high fidelity (minimal data loss). Error detection and quality control for this type of data are more immediate than for other forms and often require staging of feed data to allow for assessment. In continual data, gaps are primarily caused by system failures or calibration errors.

    Bulk

    When data is captured via an automated device and stored on a local, immediate storage device (such

    as in a data logger), it is classed as a bulk capture system. Like a continual system, there is minimal

    human interaction with the data itself as it is logged automatically. However, unlike a continual feed,

there may be additional delays and data losses due to the handling of the devices used to transfer the bulk data.

    Continual and bulk mechanisms are often used together to form a data assessment chain in which

    continual data is fed into a queue where it is evaluated in bulk for quality (quality assessment / QA) then

    transferred into the permanent store as a block of records.
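A minimal sketch of such an assessment chain, assuming a hypothetical range check as the QA criterion (the record format, batch size and threshold are our own illustration):

```python
from collections import deque

def qa_check(record):
    """Hypothetical quality check: reject readings outside a plausible range."""
    return 0.0 <= record["value"] <= 100.0

def stage_and_transfer(feed, store, batch_size=3):
    """Stage continual feed records in a queue, assess them in bulk,
    and transfer accepted batches into the permanent store."""
    staging = deque()
    rejected = []
    for record in feed:
        staging.append(record)
        if len(staging) >= batch_size:
            batch = [staging.popleft() for _ in range(batch_size)]
            store.extend(r for r in batch if qa_check(r))     # block transfer
            rejected.extend(r for r in batch if not qa_check(r))
    return rejected

store = []
feed = [{"value": v} for v in (12.5, 47.0, 999.0, 63.2, 8.8, 55.1)]
bad = stage_and_transfer(feed, store)
```

Here the out-of-range reading (999.0) is held back during bulk assessment rather than flowing straight into the permanent store.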

    Manual

When a human records data by hand, either in writing or in a manual data logger (such as a field GPS unit), the result is manually collected data. Manual data is then entered by hand into an information system, or transferred from the manual data logger. Manual data is prone to several additional sources of error


    including typos and general human blunders. These errors are non-systematic and automatic correction

    is not possible in most cases.

    Derived

If a data set is generated as the output of an automated process, such as an analysis routine, it is a derived data set. Derived data is often of the greatest direct value in a business, and is generated using data created by other means. Derived data will contain errors propagated from its constituent source data and from analytic errors from any of a number of sources:

- Numeric precision or rounding
- Incompatible input sources or scales
- Faulty analysis choices, such as false positives and false negatives
- Incomplete input source data

Derived data is often generated on an automated schedule and can be of great value when persisted in a usable form for other analyses. Derived data sets are commonly used in reports, which may result in a data format that is only useful for direct human interpretation.
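Numeric precision and rounding, the first error source above, is easy to demonstrate: repeated floating-point addition accumulates error that a compensated summation avoids.

```python
import math

# Ten nominally identical readings of 0.1 units each.
readings = [0.1] * 10

naive_total = sum(readings)          # 0.9999999999999999 due to binary rounding
stable_total = math.fsum(readings)   # 1.0 via compensated summation
```

A derived total built with the naive sum would silently carry this error into every downstream analysis.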

    Assessment

Once data is captured, the assessment process involves the evaluation of data to ensure it meets pre-defined criteria for acceptance. This process has two primary parts:

- Quality Assurance (QA)
- Quality Control (QC)

Quality assurance is the set of practices performed before data is created to ensure it will meet acceptance criteria. This involves activities such as maintenance and calibration of instruments and the use of proper instruments. Additionally, evaluation of data once collected drives the quality assurance activities for future collections.

Quality control is the set of practices, including QA, that ensure the quality of data within a business repository will meet or exceed quality criteria. The term QC generally refers to the subset of quality-control practices that are distinct from the QA portion of quality control.

    The control of data quality includes the evaluation of data for quality prior to loading into the business

    repositories and the management of all activities related to the history of the data quality controlled.

This history is a chain of custody and a sequence of events that define how the data came to be in a usable state within the business repository. Data within a repository can therefore be selected by this

    chain of custody if any portion of the activities within the chain are found to be suspect (such as a faulty

    collection instrument). It is common to construct information systems specifically around this facet of

    information management.
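Selecting data by its chain of custody can be sketched as follows; the record structure and event names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Record:
    value: float
    custody: list  # ordered chain-of-custody events for this record

def select_by_custody(records, suspect_event):
    """Select records whose custody chain contains a suspect activity,
    such as collection by an instrument later found to be faulty."""
    return [r for r in records if suspect_event in r.custody]

data = [
    Record(21.3, ["sensor-A", "qa-pass", "loaded"]),
    Record(19.8, ["sensor-B", "qa-pass", "loaded"]),
    Record(22.1, ["sensor-A", "qa-pass", "loaded"]),
]
# sensor-A is later found to have a calibration fault: flag its records.
suspect = select_by_custody(data, "sensor-A")
```

A real system would persist these events alongside the data so that such queries remain possible years after collection.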


    Ingestion

Only once data has been captured and assessed for quality is it loaded into the appropriate business

    repositories. The process of ingesting data into a repository may involve transformation of the data to

    match the destination format. This is a common requirement for automated collection mechanisms

    where the source instrument produces a fixed data format. To maintain a full and verifiable chain of

    custody, the raw data is kept in addition to the transformed data that resides within the business

    repositories. For space savings, the raw files are often archived to an offline store such as optical media.
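A sketch of ingestion with raw-data retention, assuming a hypothetical pipe-delimited instrument format; the schema and field names are our own:

```python
import hashlib

def ingest(raw_record: str, repository: list, raw_archive: dict):
    """Transform a fixed-format instrument record into the repository
    schema while retaining the untouched raw record for chain of custody."""
    # Keep the raw record, keyed by content hash, before any transformation.
    key = hashlib.sha256(raw_record.encode()).hexdigest()
    raw_archive[key] = raw_record
    # Assumed source format: "station|timestamp|value"
    station, ts, value = raw_record.split("|")
    repository.append({"station": station, "time": ts,
                       "value": float(value), "raw_ref": key})

repo, archive = [], {}
ingest("ST01|2009-11-02T10:00|42.5", repo, archive)
```

The `raw_ref` back-pointer lets any transformed record be verified against its original bytes, even after the raw archive is moved to offline media.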

Distribution and Use

The use of data within a repository is the primary purpose for the data's existence. Data use is considered in several ways:

- Discoverability
- Accessibility
- Usability

    Discoverability

Once data is within a repository, it may be used. In order to use that data, it must first be discovered by a potential user. The mechanisms put in place to facilitate the location of data are discovery mechanisms. If data cannot be found, it cannot be used. Discoverability is therefore a key concern in how data is stored and in the availability of that storage to a user's system. If a user must search in multiple locations to find the data required, the data is of marginal discoverability. For data to be discovered, the discovery data (metadata or catalog) must also be accessible.
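A discovery mechanism over a metadata catalog can be as simple as matching catalog entries against search criteria; the catalog fields and dataset names here are hypothetical:

```python
def search_catalog(catalog, **criteria):
    """Search a metadata catalog for datasets matching all given criteria.
    Discovery operates on the metadata, never on the data itself."""
    return [entry for entry in catalog
            if all(entry.get(k) == v for k, v in criteria.items())]

catalog = [
    {"name": "wells_2008", "theme": "hydrology", "region": "east"},
    {"name": "wells_2009", "theme": "hydrology", "region": "west"},
    {"name": "roads_2009", "theme": "transport", "region": "west"},
]
hits = search_catalog(catalog, theme="hydrology", region="west")
```

Note that the catalog itself must be reachable by the user, which is the accessibility concern discussed next.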


    Accessibility

    Once discovered, the data must be accessed to provide value. The accessibility of data involves aspects

    such as security, logical location and format. If data is secured so that potential users cannot access it,

    the value of the data is diminished to those users. In sensitive domains, this is expected and desired.

    Logical location further limits accessibility if the data is contained within a repository that cannot be

    accessed, such as behind a firewall. Further, if the logical location is simply far away, then data

    transfer may take too long, rendering the data irrelevant once finally accessible. Finally, if the data is in

    formats that are proprietary or poorly supported, then the data may not be accessible to the tools used

    to process it. Overall, accessibility is a balancing act with security, need and cost.

    Usability

After data has been discovered and accessed, it must then be usable. If the data is in a format the available tools cannot process, it is effectively unusable. If data must be processed prior to being used, such as by reformatting or translation, it will be less usable. If that processing takes too much time, the data may become irrelevant once it is finally in a usable form.

Usability also has subtler implications, such as the scale, accuracy and precision of the data itself. Low-precision data cannot be used in a high-precision analysis. As the cost of data creation is a function of its volume and quality, this is always a trade-off against anticipated use.

Maintenance

For any data that changes over time, the maintenance of the data values is critical. This data editing is still subject to discovery, access and usage, in addition to the need for performing the edits. In some scenarios only the current values are relevant, while in others the temporal changes are of greater significance than the current values; this distinction will influence data management strategies.

    The entire set of practices and processes that govern how data is managed and maintained within the

    business repositories is the maintenance phase of the lifecycle. Issues such as archival, availability,

continuity of operations (COOP), fault tolerance, performance and total costs are key considerations in

    the maintenance phase.

Disposition

The disposition phase of the information lifecycle involves the processes and practices by which data is aged within the business repository. Disposition includes the archival or removal of old data, segregation of history data from live data, and mechanisms for making segregated data available. It is common for disposition to be driven by storage costs and legal mandates such as Sarbanes-Oxley, Clinger-Cohen or the Health Insurance Portability and Accountability Act (HIPAA).
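Age-based segregation of history data from live data, the core of many disposition policies, can be sketched as follows (the retention period and record fields are assumptions, not mandated values):

```python
from datetime import date, timedelta

def disposition(records, today, retention_days=365):
    """Segregate live records from history records by age; history
    records would move to a cheaper but still-discoverable store."""
    cutoff = today - timedelta(days=retention_days)
    live = [r for r in records if r["created"] >= cutoff]
    history = [r for r in records if r["created"] < cutoff]
    return live, history

records = [
    {"id": 1, "created": date(2009, 11, 1)},
    {"id": 2, "created": date(2007, 5, 20)},
]
live, history = disposition(records, today=date(2009, 11, 13))
```

Actual retention periods would be set by the applicable legal mandate rather than a fixed constant.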


Conclusions

Architecting information solutions for an organization is a complex set of practices and trade-offs intended to maximize capabilities while minimizing cost. Given that information solutions take a great deal of time and care to construct, proper planning is required well in advance of need, so that solutions are available by the time the need arises and without wasted effort.

Various strategies exist for planning information repositories, software implementations and user-facing

    applications. Planning for reuse of repositories and software back-end components and services is of

    great importance. Stakeholders involved with information strategies need to understand the difference

    between the data repositories containing data, back-end software processing data and the user

    interfaces that present data and processing. The separation of these concepts in the minds of those

    involved in planning can yield great results in long-term cost savings and capabilities realized.

    Appendices

    References

Wikipedia contributors. (2009, November 13). Knowledge. Retrieved November 13, 2009, from Wikipedia, The Free Encyclopedia: http://en.wikipedia.org/w/index.php?title=Knowledge&oldid=325539292