Corsello Term Paper 2008 FINAL

    George Washington University, CSci175 Information Policy

    Document Retention Policies, Law and Issues

    Impacts and issues in the software development process

    Michael Corsello

    10/18/2008

    Abstract

    Document retention has become an area of increasing importance, marked by a dramatic increase in

    regulation of the organizational policies that standardize retention practices for documents

    and content in general.

    This paper will discuss and describe some relevant regulation covering document retention overall and

    specifically detail impacts and issues in the specialized area of software development. The development

    of software includes generation of source code and documents which detail the design and process by

    which the software is developed, raising the question of what a document is and which content

    needs to be retained for regulatory purposes. Furthermore, the software being developed will be

    subject to regulation for content retention as hidden requirements that may dramatically increase the

    overall cost of developing software applications.

    In the practice of software development, there is little distinction between documents and content as

    one is simply a semantically constrained subset of the other. By altering the definition of document, the

    concept of content becomes largely indistinguishable from a document. For this reason,

    this paper treats the two concepts as interchangeable.

    CSci 175 Information Policy Michael Corsello

    Fall 2008 Booz Allen Cohort 2

    Background

    Document retention is a subset of the larger concept of content retention. Content retention is the

    collection of policy and practices surrounding the standardization of practices involving the collection,

    storage, tracking, security, retention and disposal of any content. Any data produced in the course of

    conducting business that is pertinent to the business is content. Content retention is a portion of the

    larger concept of content management, which consists of the portions of management involving the

    disposition of the content from creation to destruction.

    Business case for content retention

    Content retention is of critical importance to any organization simply due to the legal implications of

    non-compliance. Beyond the legal implications, standardization of content management practices

    forces an organization to address the how, what, when, where and why of all information the

    organization owns. Guaranteeing compliance with regulation requires some measure

    of standardization in content management. That standardization

    will include identifying which content to keep, for what purpose, and for how long.

    Regulations on retention

    In recent years the requirements imposed on organizations by the governments of the world have

    dramatically increased with respect to content retention. Public failures of organizational practices,

    such as the Enron scandal and the Veterans Affairs data loss, are partially responsible for the new

    regulations. At the federal level, laws that impact the content retention policies of organizations

    include:

    Clinger-Cohen (National Defense Authorization Act for Fiscal Year 1996)

    Sarbanes-Oxley (Sarbanes-Oxley Act of 2002)

    HIPAA (Health Insurance Portability and Accountability Act)

    DoD 5015.02-STD (Electronic Records Management Software Applications Design Criteria Standard)

    In addition to these regulations defining explicit requirements on content practices, there are many

    other regulations that directly or indirectly require standardization of content management practices.

    Retention practices

    Content retention overall involves the practice of holding or retaining content from the time it is

    captured or created to the time it is released or destroyed. This concept introduces two sides of the

    paradigm of retention: to retain and to destroy. These will continually play against one another in this

    paper.

    Standard time period

    Each type of content has a definable purpose for an organization over a standard

    period. Using the retained content for that purpose brings value to the organization. At the time

    the content is no longer of positive value to the organization, it should be disposed of. The

    standardization and documentation of this duration and the practices involving the transition of the

    content between these states is the primary goal of retention.

    Disposal processes

    Once a document has outlived its standard period of benefit it will be disposed of. The disposal of

    content must also be standardized and documented. Disposal will involve the actual process of

    discovering expired content among the full corpus of content and the process for removing expired

    content from the organization's storage repositories. The disposal process should also document the results of the

    destruction to provide the level of confidence that the content is unrecoverable once disposed of.
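
    The discovery, removal and documentation steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the content types, the retention periods and the in-memory repository are all hypothetical, and the periods are not taken from any regulation.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods per content type, in days (illustrative
# values only, not taken from any regulation).
RETENTION_DAYS = {"order": 365 * 7, "meeting_notes": 365 * 2, "test_result": 90}


@dataclass(frozen=True)
class ContentItem:
    item_id: str
    content_type: str
    created: date


def find_expired(items, today):
    """Discover the items whose standard retention period has elapsed."""
    expired = []
    for item in items:
        period = timedelta(days=RETENTION_DAYS[item.content_type])
        if item.created + period <= today:
            expired.append(item)
    return expired


def dispose(items, today, repository, audit_log):
    """Remove expired items from the repository and record each disposal."""
    for item in find_expired(items, today):
        repository.pop(item.item_id, None)
        audit_log.append((item.item_id, item.content_type, today.isoformat()))
    return audit_log
```

    The audit log stands in for the documented result of destruction. A real process would also have to show that the content is unrecoverable on the storage media, which this sketch does not attempt.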

    Content Retention

    To appreciate the complexity of content retention the individual concepts of content and retention must

    be understood.

    What is a document

    Prior to a discussion on content and document retention, it is critical to understand what each of these

    concepts truly represents. Content can be any information of any type, structured or unstructured. This

    concept of content can include something as simple as a single word. When content is placed in a

    context such as an order form, that content becomes a record. A record that is stored to some

    persistent media is a document. In that manner, an order submitted in a web form is a document once

    saved to a database or printed out. This makes the structure of the persistence mechanism the actual

    form of the saved content and therefore a critical issue to the developers of software persisting such

    content.

    In a software application such as an online shopping site, which will persist the orders as records in a

    database, the persisted structure of the data representing the document will in no way resemble the

    format of the document presented to the user. This presents a number of considerations to a

    developer:

    What must be retained

    How is data to be removed

    What is the beginning and ending of the document

    For the example of an order, several pieces of information comprise the document:

    Customer information

    Billing information

    Shipping information

    Order items

    Metadata

        o Date of transaction

        o Date of shipment

        o Date of arrival

        o Disposition of order (returns)

    In an information system, this information is not stored together as a comprehensive

    document, but instead as fragments of related data that are connected via keys. All of these issues as to

    the nature of a document must be addressed via retention policies to ensure the proper portions of data

    are disposed of when appropriate.
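
    To make the fragmentation concrete, the following sketch stores an order as rows in several tables connected by keys and then disposes of expired orders. The schema, table names and cutoff date are assumptions made for illustration, using Python's standard sqlite3 module with an in-memory database.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical order schema where one 'document' spans tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, order_date TEXT);
        CREATE TABLE order_items (item_id INTEGER PRIMARY KEY,
                                  order_id INTEGER, sku TEXT);
        CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY,
                                order_id INTEGER, shipped TEXT);
        INSERT INTO customers VALUES (1, 'Example Customer');
        INSERT INTO orders VALUES (10, 1, '2001-03-01'), (11, 1, '2008-09-15');
        INSERT INTO order_items VALUES (100, 10, 'A-1'), (101, 10, 'B-2'),
                                       (102, 11, 'C-3');
        INSERT INTO shipments VALUES (1000, 10, '2001-03-04'),
                                     (1001, 11, '2008-09-18');
    """)
    return conn


def dispose_orders_before(conn, cutoff_date):
    """Delete every fragment of each order dated before cutoff_date (ISO text)."""
    with conn:  # one transaction: all fragments are removed, or none are
        expired = [row[0] for row in conn.execute(
            "SELECT order_id FROM orders WHERE order_date < ?", (cutoff_date,))]
        for order_id in expired:
            conn.execute("DELETE FROM order_items WHERE order_id = ?", (order_id,))
            conn.execute("DELETE FROM shipments WHERE order_id = ?", (order_id,))
            conn.execute("DELETE FROM orders WHERE order_id = ?", (order_id,))
    return expired
```

    Note that the customer row is left in place because a newer order still refers to it. Deciding whether such shared fragments belong to the expired document is exactly the "beginning and ending of the document" question a retention policy has to answer.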

    What is retention

    Retention is the entire lifecycle of content on persistent media. The concept of retention must include

    the eventual destruction of the content from the media and potentially the destruction of the media

    itself when no longer viable for re-use. For the media itself, retention must also cover applicable reuse

    of the media once the content it contained is disposed of. For paper, this must include scenarios such as

    secondary use of paper for fax machines. It is obviously critical to ensure that sensitive documents

    printed on paper are not re-used for fax paper once out of date.

    All of the practices regarding keeping, re-using and disposal of content and media are within the scope

    of content retention. Since the coupling of retention is so tight with the practices of management of

    content, the two areas are largely interchangeable, though management also involves other practices as

    well.

    Backups and Continuity of Operations (COOP) are not retention

    There is a critical distinction between retention of content and disaster recovery. In general, any

    disaster recovery or continuity practices are separate from retention practices. However, it is critical to

    understand that disaster recovery content is still subject to use by enforcement agencies as a source of

    data. Given this point, it is critical to include such content in the retention planning process to ensure

    proper destruction of such content to prevent secondary avenues of data exploitation.

    All evaluation of all content must be on an even and level playing field to ensure proper handling and

    disposition. Overall, any information can illustrate both good and bad points depending upon who has

    the information and how they attained that information. Therefore, it is quite important that all

    information that can be disposed of be disposed of as soon as possible to minimize the potential liability.

    This includes the destruction of information on backup and COOP media in addition to all production

    media. Backup usage and planning should also consider this and forbid the use of backups as a standard

    mechanism for restoring content lost through user fault. Such a practice would count as an accepted form of

    content retrieval and would thereby make backups production media as well. As part of the retention plan, backup and COOP

    content must be restricted to use during disasters resulting in hardware failure only.

    Configuration and Content Management (CM)

    The practices of configuration and content management in an organization are not specific to software

    development but do have specialty areas in software development organizations. The overall concepts

    of both forms of CM involve the management of content produced in the course of operations. The

    primary concern in CM is the tendency to desire to retain content. In CM practices, content is generally

    versioned over time to illustrate the history of content. From a retention perspective, this must be

    balanced with the need to purge content as its use is diminished over time.

    Documentation of retention practices

    All practices must be standardized and documented to provide a public and formal proof of internal

    practices. This documentation must also be distributed and actively practiced by members of the organization. Without

    documentation, it is much more difficult to prove, justify and measure actual compliance.

    Software Development Practices

    The development of software is a complex process that is more similar to the process of invention than

    that of manufacturing. The development of a software system will involve several technical

    specialization areas to ensure the system built addresses a business need, meets all regulatory

    requirements, is relatively free of defects, is easy to use and understand and generally functions as

    specified and desired.

    High-level introduction

    The process of software development involves several phases during which a specific portion of the total

    system becomes defined. The basic development phases include inception, elaboration, construction

    and transition. Prior to starting a development project, the customer and software provider commit to

    a contract. This pre-inception phase involves the aggregation of business processes to automate, the

    scoping of the effort, identification of the key stakeholders, baselining an anticipated timeline for

    completion and a rough cost estimate.

    Inception

    The inception phase of the development effort involves the creation of a set of requirements that depict

    the business processes to automate, all applicable regulations and policies, any performance constraints

    and the general constraints on the overall construction. Once completed, this will result in several

    formal documents including meeting notes, possibly audio or video of meetings, rough sketches and

    business documentation from the client. All of this content is managed through the CM processes.

    Elaboration

    During the elaboration phase the content produced in the inception phase is analyzed to produce a

    workable design for the system. The design may also include prototype code for demonstrating design

    concepts. Again, several formal documents are produced and all content is managed through the CM

    processes.

    Construction

    The construction phase is where construction, testing and validation of actual production quality

    software is performed against the documents produced in the earlier phases to ensure compliance with

    the stated requirements. Again, there are formal documents produced as well as the source code for

    the system itself. The CM processes are used to manage all of this content. Technically, at this point,

    the construction is complete and all deliverables are provided to the client. Therefore, it could be argued

    that there is minimal value in the retention of any content produced under this contract at this time.

    Transition

    The transition phase involves the integration of the new software into the client business and the

    continual maintenance and upgrading of the software over time. If the same company has the

    maintenance contract, the entire body of content may be useful to evolve the software. The

    management of all content over time during the ongoing transition phase is still performed under the

    CM processes.

    The general theme of CM in the software development lifecycle (SDLC) is to retain content including all

    revisions to all source code throughout the life of the project and beyond. In many cases, content is

    applicable to multiple contracts and as such is desirable as a source of content to expedite content

    creation. Unfortunately, the contracts themselves often do not address the legality of content reuse at

    all, and unrealistic time expectations alone drive the reuse of such content. The balance

    between the value of reuse and the liability of perpetual or non-standardized retention is not generally

    recognized.

    Process documents

    Over the course of the SDLC there are several documents produced to support the construction of the

    software system.

    Business processes

    Generally business process documentation is produced by the client and delivered as-is to the software

    development team. These processes are entered into the CM library (CML) for retention as

    documentation of the processes to be automated. This serves as accountability for the development

    team back to the client to ensure compliance of software. These process documents are generally

    accompanied in bulk by a signed inventory sheet depicting the versions and delivery of these documents

    to the development team.

    If any changes occur to the formal processes followed by the client during development, any resulting

    changes to the software being developed can be at cost to the client by referencing changes to these

    documents.

    Meeting notes

    Notes are stored in the CML for each meeting throughout the development process. Audio or video is

    often captured for meetings such as requirements elicitation meetings. The size of audio and video

    content is an issue for CML storage, but when captured it is also stored.

    Requirements

    The requirements process generates several documents. The primary

    requirements documents are the Software Requirement Specification (SRS) and the Requirements

    Traceability Matrix (RTM). These two documents form the basis for all work performed during the SDLC

    and are the most critical to retain.

    Design documents

    Based upon the content of the SRS the software design will depict the expected structure and function

    of the application to build. The design will consist of one or more documents collectively known as the

    Software Design Document (SDD). The SDD is mapped to the SRS in the RTM where each design artifact

    in the SDD is mapped to the requirements in the SRS that design artifact will partially or completely

    realize.
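
    The mapping held in the RTM can be pictured as a simple lookup from design artifacts to the requirements they realize. The sketch below is a hypothetical illustration: the identifiers and artifact names are invented, and a real RTM would normally carry more columns (test cases, status, owners).

```python
# Hypothetical requirement identifiers from an SRS.
srs = ["SRS-1", "SRS-2", "SRS-3"]

# Hypothetical RTM: each SDD design artifact maps to the SRS requirements
# it partially or completely realizes.
rtm = {
    "SDD-1 order entry form": ["SRS-1", "SRS-2"],
    "SDD-2 order database schema": ["SRS-2"],
}


def untraced_requirements(srs_ids, matrix):
    """Return requirements that no design artifact claims to realize."""
    covered = {req for reqs in matrix.values() for req in reqs}
    return [req for req in srs_ids if req not in covered]


def artifacts_for(requirement, matrix):
    """Return the design artifacts that realize a given requirement."""
    return sorted(a for a, reqs in matrix.items() if requirement in reqs)
```

    Queries in both directions are what make the RTM useful for accountability: a requirement with no artifact is a gap in the design, and an artifact with no requirement is unjustified work.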

    Design as a process takes a significant amount of time and is argued to be of little practical value in

    an Agile development methodology. However, skipping a detailed design risks reducing the

    accountability and tracking of requirements if they are not otherwise properly documented.

    Testing documents

    Each portion of the software application must be tested to ensure it works to design specification and to

    requirement. The testing process, the tests performed and the results of each round of testing are

    documented and stored in the CML. Once all tests pass and the system is delivered, the results of the

    incremental tests leading up to a passing score are of little business value.

    Configuration and Content Management (CM)

    The overall process of managing content produced in the SDLC is called configuration management. The

    concept of a configuration is any portion of content that results in a specific configuration. A

    configuration is the manner in which an operationally deployed application is structured, configured and

    works. This concept of configuration management is a specialized subset of content management (also

    CM) regarding the software and system development cycles.

    The library

    All configuration content is stored in a repository known collectively as the configuration management

    library or CML. The CML includes all content across the entire lifetime of the project. The CML is

    responsible for the maintenance of proper naming standards (and their enforcement), versioning of

    content and accountability for access and dissemination of content. The only official source of content

    in a development project is from the CML.

    Responsible parties

    The configuration manager and their team manage the CML. A client representative generally will have

    visibility into the content within the CML. The Information Assurance Officer (IAO) will also have

    visibility into the CML and oversight to ensure the management of the CML follows the defined content

    management policies. Finally, all contributing personnel are responsible for submitting content to the

    CM team for inclusion in the CML.

    Source code retention

    One key type of content for any software development project is the source code for the application

    being developed. The source code is the actual content that is or becomes the application itself. There

    is continual modification and refactoring of source code over the lifetime of the project. The

    management of the source code takes place in a special repository called a source code management

    system (SCM).

    Source code concepts

    Source code is represented in text files that contain text written in a computer programming language.

    For many languages this code is then compiled into binary executables such as .exe or .dll files.

    Other languages such as HTML (hypertext markup language) are used directly and not compiled.

    There are several concerns with the management of such files:

    What is managed, source or compiled libraries?

    How are changes tracked over time?

    What is a change?

    Are daily variations tracked?

    Is the SCM part of the CML?

    In general, the source is the only thing that is content managed over time. However, compiled files are

    tracked via the testing process to ensure only tested files make it to use by other developers and to

    production. The SCM tracks changes automatically as deltas, saving what has changed with

    each edit. These changes must still be managed to determine which are significant and make it through the

    testing process. A single edit to a source code file is not significant enough to track on its own. Instead,

    changes are defined more by progress over time than by the individual edits themselves. Likewise, daily check-ins of

    code do not represent changes in this sense, but instead provide a means of sharing code between developers

    to aid in productivity.
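
    As an illustration of what a delta is, the sketch below compares two revisions of a file and keeps only the lines that were added or removed. It uses Python's standard difflib module purely as a stand-in. Actual SCM products use their own storage formats, and nothing here describes any particular tool.

```python
import difflib


def delta(old_text, new_text):
    """Return only the removed ('- ') and added ('+ ') lines between revisions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? ');
    # neither is part of the stored change, so both are dropped.
    return [line for line in diff if line.startswith(("+ ", "- "))]
```

    Every check-in produces such a delta, however trivial, which is why the text above distinguishes the raw edits the SCM records from the changes that are significant enough to manage.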

    Overall, due to the nature of the SCM it should NOT be part of the CML, but instead be governed by the

    CM Team to ensure proper management of the source in the SCM. The only source code that should be

    tracked in the CML are baselines, or releases that have meaning to the schedule or otherwise to the

    client. These should be stored outside of the SCM to ensure distinction from the code in the SCM. In

    practice, this is rarely done and the SCM is considered a key part of the CML. This is largely because of

    how an SCM works.

    Versions and Baselines

    As changes are made to source code, each action of checking in code results in a new change set or

    revision to those files checked in. These change sets are revisions in time that are selected by time. To

    view the entire application at once (often many thousands of individual files), a time is selected from the

    calendar to represent the view of the system to acquire. This view will show the state of all files in the

    system at that point in time. This is used practically as a means of rolling back when a change is made

    that is later found to be less than beneficial.

    As development proceeds, a label may be placed in the SCM on the current state of all files in the SCM

    at that point in time. This label indicates a version for the code base and is often a release milestone.

    Given that the SCM has this capability built in, it is often simply adopted as the de facto means of managing

    source code content.
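
    The relationship between check-ins, point-in-time views and labels can be modeled with a toy revision store. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, whole file contents are stored rather than deltas, and file deletion is not modeled.

```python
from datetime import datetime


class RevisionStore:
    """Toy model of an SCM: check-ins, point-in-time views and labels."""

    def __init__(self):
        self._checkins = []  # list of (timestamp, {path: text})
        self._labels = {}    # label name -> timestamp

    def check_in(self, when, files):
        """Record a change set containing the files checked in at a time."""
        self._checkins.append((when, dict(files)))

    def view_at(self, when):
        """Return the state of every file as of the given time."""
        state = {}
        for timestamp, files in sorted(self._checkins, key=lambda c: c[0]):
            if timestamp <= when:
                state.update(files)  # later check-ins override earlier ones
        return state

    def label(self, name, when):
        """Name the state of the code base at a point in time."""
        self._labels[name] = when

    def view_label(self, name):
        """Return the state of every file at the labeled point in time."""
        return self.view_at(self._labels[name])
```

    In this model a label is nothing more than a named point in time, which is why a labeled baseline remains recoverable only for as long as every underlying change set is retained.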

    Retention or disposal

    Since source code is the application and it evolves over time based upon changes made to the source

    there is a high value placed upon the code itself. The source code in the SCM is considered to be the

    primary source of value in all development projects and may often be reused in part or in whole across

    projects. While there are issues of intellectual property rights at stake with source code, the time

    demands for completing a development project often outweigh any considerations for replicating effort

    for similar work product.

    Due to this high perceived value and the inexpensive nature of storing this content, it

    is rarely ever disposed of until it is entirely out of date. This often results in the retention of source

    code for years beyond the conclusion of a project including all edits ever made during the development

    process.

    Since the cost of developing software is so high and the demands upon development teams are

    generally quite unrealistic, many sloppy processes exist which are poorly followed. The proper disposal

    of software development content including source code is of high importance and is rarely done

    properly. There is a tremendous opportunity to investigate how software was actually developed in

    response to a disappointed client, as almost all projects leave circumstantial evidence around to be

    retrieved many years beyond its practical usefulness.

    The entire process of software development has never been required to address the issue of content

    disposal practices, as IT professionals are primarily concerned with retaining information. Overall, a

    guidance package is required to illustrate the liabilities of not actively planning, scheduling and following

    a standard process for content retention and disposal.

    New software applications are created to solve specific practical problems in business. These solutions

    generally are not planned based upon legal or liability implications of how the applications are used. It

    will become increasingly important to ensure that software applications are developed based upon a

    dynamic set of uses that can be modified to adapt to unplanned purposes.

    Software Implementation Requirements

    Software applications meet a set of requirements defined when planning the application. Any unknown

    or unanticipated requirements at that time are likely to be unsupported by the application when

    completed. For commercial applications there are no clients directing the requirements. Instead, all

    requirements are anticipated needs for expected or current clients using other products.

    For emerging regulatory requirements, few applications can currently support those demands. That

    results in a requirement to modify existing applications to support the requirements or to fulfill those

    requirements outside of the system (often manually). Defining what content in an application must be

    regulated and purged is a challenge when the client will often not understand the legal implications of

    the application data storage.

    A major area of development in software applications is the use of data for new purposes such as data

    mining and analysis. This will have growing legal implications as advanced analytics become easier to

    produce. As applications are developed that increasingly centralize data into consolidated databases,

    these databases may violate regulation implicitly via data aggregation due to poor alignment between

    regulators' understanding of technology and technology implementers' understanding of regulation. The

    centralization of data and security is a major area driving application architecture and enabling

    enterprise analytics. These converging aspects are happening to maintain high levels of performance

    given increased data volumes at a cost of separation of concerns.

    Politically and socially the free sharing of information is at the forefront of progress in the area of

    information technology. However, the issues of security, privacy and piracy are most likely of higher

    importance. There is an increasing number of sophisticated attackers attempting to compromise

    systems and information for profit. Regulation to protect this information and privacy must be in step

    with technologies to ensure both can be realistically implemented and enforced given the workforce

    and tools available. Regulation that is too difficult, technical or costly to implement will not be implemented, and

    skilled workers will not become available as education systems are already being streamlined to

    increase the rate of production.

    Software is arguably the most complex undertaking of mankind with technology being implemented in

    dependent layers. Each layer of technology relies on the one below it, with the lower layers each being

    older than the preceding layer. Older technologies tend to have less emphasis on security and multi-

    user synchronization. Therefore, we should have no expectation of fixing our problems any time soon

    without unrealistic costs. Over time the best solution will be to replace technologies to implement the

    required capabilities before they become regulation.

    Summary and Conclusions

    Content retention is a complex topic that has impacts on all aspects of the software development

    process. The actual practice of developing software is impacted by the content created during the SDLC

    and by retention of that content after completion of the efforts. The applications produced are also

    impacted by content retention issues in a much more significant way than is currently addressed in that

    application developers will be required to construct applications that enforce compliance with retention

    policies and regulations.

    Document retention legislation has a significant impact on the software industry and the personnel

    responsible for the construction of applications overall. The skills of developers in the industry are

    already stretched with much higher demand than supply of skilled workers. Clients do not understand

    the implications of implementing compliant systems and allowances for time costs will likely not be

    acceptable. The practical reality of the need for document retention practices will remain

    overshadowed by the practical costs of doing so for some time.

    Information sharing is of increasing importance to the businesses using information technologies. This

    developing need-to-share culture will further expand the issues of content retention and

    technological implementations to address the social implications of this content sharing. Overall,

    technology is focused on opening up information and capabilities for widespread use, while little

    attention is paid to illegitimate or illegal use of this information.

    In summary, the concepts of content management including retention and disposal are in need of

    immediate attention from technologists and policy makers alike. The emerging trends around sharing,

    privacy, security and discovery must be addressed to ensure a sustainable approach is defined and

    followed by technology implementers and users alike.