
Mars Exploration Rovers Spirit SOL-18 Anomaly: NASA IV&V Involvement April 2004



April 2004 MSR Spirit Anomaly - 2

IV&V Facility

Introduction

• Purpose:
  – This is an informational presentation to discuss IV&V involvement with the MER program as it relates to the memory consumption on Spirit on Sol-18

• Agenda:
  – Background on the system memory problem
  – Background on IV&V involvement with the MER program
  – IV&V issues related to the system memory and file system
  – IV&V Lessons Learned
  – Summary


Summary of Spirit Sol-18 System Memory Consumption

• Sol 18
  – 9:00 LST – The planned DTE HGA communication session began.
  – ~9:11 LST – Event Reports were received indicating uplink errors were occurring. Downlink was spotty.
  – ~9:16 LST – The signal was lost. This was ~14 minutes earlier than expected.
  – 11:20 LST – Commanded a 30-minute high priority HGA communication session. No signal was seen.
  – 12:45 LST – Commanded an LGA beep. The beep occurred as predicted (start and duration).
  – 16:18 LST – Odyssey UHF pass over Spirit; no carrier was seen.

• Sol 19
  – 1:45 LST – The MGS UHF communications session lasted only 2 minutes and 20 seconds. It did start at the correct time, but only a repeating pseudo-noise code was present in the data.
  – 4:39 LST – No early morning UHF communication session with the Odyssey spacecraft (no signal or data).
  – 9:00 LST – No morning HGA DTE communication session. No signal or data were detected.
  – 11:00 LST – Looked for 10 bps LGA DTE communication session initiated by a system fault protection response. No signal was seen.
  – 14:40 LST – Commanded beep at 7.8125 bps. Beep was seen!
  – 15:24 – No afternoon UHF communication session with the Odyssey spacecraft (no signal or data).
  – 15:27 – Attempted to command an LGA DTE communication session. No signal or data was received.

• A system-level fault had occurred on Sol 19 that put the rover in a degraded communication state and allowed some commanding

• At this point JPL was able to determine that FSW was in a continuous delayed reset loop. The first reset occurred during the Sol 18 morning DTE session, coincident with an actuator checkout

• Both commanded and autonomous shutdowns were failing, and the vehicle probably had not shut down in a while


Root Cause

• The root cause was traced to two configuration parameters in the VxWorks operating system
  – Configuration parameters of the dosFsLib module permitted the unbounded consumption of memory from the system memory heap as the FLASH file system was populated with an increasing number of files
  – The configuration parameters of the memPartLib module were set so that the logic would suspend the execution of any task that requested memory when no additional memory was available

• This had the undesirable effect of suspending a critical task when the memory space was exhausted

• Other effects included memory corruption, inability to turn the vehicle off (due to task deadlock), and repeating system resets

• Contributing factors included the compressed development schedule, unanticipated behavior of the FSW, incomplete development (analysis of the effects of the dosFsLib parameters was never fully completed), a test program that was not equivalent to operational use, and inadequate telemetry
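
The interaction of the two settings described on this slide can be illustrated with a toy model. The sketch below is not VxWorks code and does not use the dosFsLib or memPartLib APIs; the `Heap` class, the `populate_files` helper, the task names, and all sizes are hypothetical and chosen only to show the mechanism: per-file bookkeeping drawn from a fixed heap with no cap, combined with an allocator that suspends any caller once the heap is exhausted.

```python
from dataclasses import dataclass, field


class OutOfMemory(Exception):
    """Raised when the heap is configured to fail rather than suspend the caller."""


@dataclass
class Heap:
    capacity: int                # total bytes in the hypothetical system heap
    suspend_on_exhaustion: bool  # True models the "suspend the requesting task" policy
    used: int = 0
    suspended: list = field(default_factory=list)

    def alloc(self, task: str, nbytes: int) -> bool:
        """Return True on success; on exhaustion, suspend the task or raise."""
        if self.used + nbytes <= self.capacity:
            self.used += nbytes
            return True
        if self.suspend_on_exhaustion:
            # In this model a suspended task never runs again.
            self.suspended.append(task)
            return False
        raise OutOfMemory(f"{task} requested {nbytes} bytes, "
                          f"{self.capacity - self.used} free")


def populate_files(heap, n_files, bytes_per_file, max_tracked_files=None):
    """Charge the heap for each file's bookkeeping; return how many files were tracked.

    max_tracked_files=None models the unbounded configuration; an integer
    models a cap on how much heap the file system may consume.
    """
    tracked = 0
    for _ in range(n_files):
        if max_tracked_files is not None and tracked >= max_tracked_files:
            break
        if not heap.alloc("file_system", bytes_per_file):
            break
        tracked += 1
    return tracked


if __name__ == "__main__":
    # Unbounded file bookkeeping plus suspend-on-exhaustion: the critical task stalls.
    heap = Heap(capacity=1000, suspend_on_exhaustion=True)
    populate_files(heap, n_files=12, bytes_per_file=100)
    heap.alloc("critical_task", 50)
    print("suspended tasks:", heap.suspended)

    # Same allocator policy, but the file system's share of the heap is capped.
    bounded = Heap(capacity=1000, suspend_on_exhaustion=True)
    populate_files(bounded, n_files=12, bytes_per_file=100, max_tracked_files=8)
    print("critical task allocated:", bounded.alloc("critical_task", 50))
```

In the first case the file system fills the heap and the next request from an unrelated task is suspended; in the second, bounding the file system's consumption leaves headroom for the critical task. Setting `suspend_on_exhaustion=False` models the alternative allocator policy, where the caller receives an error it can handle instead of being suspended.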


IV&V Activities for MER

• Initial assessment of the MER project performed in June 2001
  – Results of the assessment noted that the file system was a very critical portion of the FSW; however, the scores for the technology being used and the maturity of the software indicated low risk
  – Some portions were rated as high complexity
  – Overall the file system software was within the IV&V scope, though at a low level

• Initial estimate of the IV&V resources was 9-10 FTEs
  – The MER Project had not budgeted for that level of IV&V resources
  – Final IV&V resources were 4-5 FTEs

• Reduction in resources necessitated changes in the approach to IV&V
  – Goal was to cover the MER FSW to a reasonable depth so that the IV&V Team could feel comfortable supporting launch and operational readiness reviews for the project
  – Tasking was “pulled up” to a higher level than normal – analysis applied at a complete FSW level rather than at a software component level

• An additional issue was the limited number of FSW requirements artifacts


IV&V Findings Related to the System Memory

• Requirement and test completeness
  – IV&V Risk #1 on Requirements (later extended to include test) remained in “Significant Concern” status at the time of upload
  – Chief concern was that software requirements discovery was not complete and that software had not been adequately tested at the time of the upload

• Specific TIMs
  – Specific TIMs were written against the insufficient unit tests for portions of the file system using the system memory
  – Project asserted testing was complete but without documentation
  – These TIMs were still in “Open” state at the time of the final upload

• Code Complexity
  – Portions of the file system using the system memory were consistently reported to be very complex
  – Modules were reported to have poor testability and poor maintainability

• Code Stability
  – File system modules were being worked on until the last release (R8.1d, 11/20/03)
  – File Meta Engine had 10% of its total code changed as late as Release 8.0, and had 9% of its total code changed for Release 8.1

• Note that the file system was not the cause of the problem, but brought the lack of memory to light and created the task deadlock


IV&V Concerns over Requirements & Test

• Upload Readiness Review (11/25/03)
  – Plans were to upload the final FSW on 12/2/03; the review was to determine readiness
  – IV&V recommended further testing before upload, delaying upload past Dec 2

• Operational Readiness Review (12/5/03)
  – “Aggregate of requirement and test issues represent a risk being tracked in IV&V Risks”
  – Final Requirements Risk status was “Significant Concern” (middle of three possible levels)
    • IV&V Concern: “There remains an IV&V concern about the possibility of requirements-related surprises during operations. IV&V has a less optimistic view of the requirements discovery than does the project.”
    • Potential Consequence for Surface ops: “Possible loss of science return” (“Possible loss of science return” means the situation we are currently seeing: significant time to detect, understand, and correct problems on the surface)
  – Reiteration of 11/25/03 IV&V recommendation for further testing before upload (which by 12/5/03 had already occurred, the project having proceeded with planned upload on 12/2/03)
    • Recommendation to “Continue testing to the extent possible”
    • Recommendation to “Ensure test results are adequately reviewed”

• Project emphasis on “test as you fly” (vs. formal unit and requirements-based tests) didn’t find the problem


IV&V Lessons Learned

• Resources
  – The low level of resources being applied to such a large and complex project was not sufficient
    • The goal of analyzing the software at a depth that would allow the IV&V Team to feel confident when supporting project readiness reviews had to be maintained
    • Forced a shift from a software component approach to a more whole system approach
  – Resources for IV&V should be such that a software component approach can be maintained throughout a project SDLC

• Lack of Artifacts
  – Current IV&V Facility processes are very requirements driven
  – The lack of FSW requirements artifacts on the MER Project affected the IV&V work being performed and also helped to move the approach away from a component level analysis
  – Given that projects are not generally required to follow a standardized software development life cycle, the IV&V Facility needs to examine its requirements driven approach and generate some alternative approaches to performing IV&V on projects lacking software artifacts


IV&V Lessons Learned

• Pursuing Risks
  – Early on the IV&V Team documented the requirements risk
    • The project would only address with the IV&V Team specific problems that were realizations of the risk, not the risk itself
    • Otherwise, the planned testing program mitigated the risk in the project’s eyes
  – The IV&V Team was still concerned, but the lack of FSW requirements made it difficult to fully examine the consequences and likelihood of the risk
    • The IV&V Team eventually accepted the test program as a mitigation to the risk
    • However, as milestone reviews neared, the testing in some cases had not been completed
      – The project continued testing up to the last minute
    • Additionally, the lack of requirements artifacts placed the MER Project into the position of testing with incomplete requirements
    • Testing was driven more by scenarios generated by system engineers such that they felt that the system was fully exercised
      – IV&V had no insight into how the scenarios were developed
  – The IV&V Team needs to be more proactive in assessing mitigation efforts early in the SDLC so as to more effectively support projects
  – Additionally, projects should enforce and follow good software engineering practices that include good requirements development to support a mature test program


IV&V Contributing Factors

• The IV&V Team needs to be more intimately involved with the development team
  – The MER project’s compressed schedule created a schedule risk from outside parties
  – The IV&V team was not able to work directly with the developer
  – Additionally there was no access to the development issue database or the low level testing artifacts that would allow IV&V to perform a more in-depth analysis
  – Projects need to integrate the IV&V process into the development process in order to gain maximum advantage of the resources being offered

• More specific attention to COTS products
  – The root cause in this case was the incorrect use of a COTS product
  – The IV&V team usually analyzes the use of and interfaces between COTS and developed code since the content of most COTS products is not visible
  – The IV&V team was not able to perform that level of analysis on this mission due to resource constraints


Summary

• The anomaly consumed the available system memory and created a deadlock between tasks

• The IV&V approach was modified based on various project-specific factors that caused the analysis approach to be elevated to a full system approach rather than the normal software component approach

• Even at the full system approach, the IV&V team identified potentially troubling areas involving the system memory usage: risk tracking, issue tracking, code analysis, requirements analysis, test analysis, code complexity, and code stability

• However, the lack of complete requirements documents and testing documentation, both identified by IV&V as project deficiencies, hindered finding the specific problem prior to upload

• The IV&V Facility is examining the lessons learned to determine what actions to take to ensure better service on other IV&V projects