HAB Software Woes


My talk from the UKHAS 2012 conference about problems in HAB software.


HAB Software Woes
John Graham-Cumming
September 2012

Or “My capsule didn’t crash but my software did”

Background

More than 30 years of programming experience
One HAB flight
◦ GAGA-1

http://blog.jgc.org/2011/04/gaga-1-flight.html

https://github.com/jgrahamc/gaga

Where’s your flight’s complexity?

Example: GAGA-1
◦ One balloon, parachute, polystyrene box
◦ Many metres of cord attached with knots
◦ An off-the-shelf camera
◦ 2,836 lines of code
◦ Common to see defect rates of 2 to 4 per KLOC
◦ So GAGA-1 likely has 5 to 10 errors in it
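
A quick check of the arithmetic behind that estimate, as a minimal sketch. The function name is made up for illustration; the 2,836 lines and the 2 to 4 defects per KLOC come from the slide. The result, roughly 6 to 11, is in the same range as the slide's round figure of 5 to 10.

```python
def expected_defects(lines_of_code, defects_per_kloc):
    """Scale a defects-per-thousand-lines rate to a code base of a given size."""
    return lines_of_code / 1000.0 * defects_per_kloc

GAGA1_LINES = 2836  # figure quoted on the slide

low = expected_defects(GAGA1_LINES, 2)
high = expected_defects(GAGA1_LINES, 4)
print("Expected defects: %.1f to %.1f" % (low, high))
```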

Real Stuff Seen on HAB flights

Complete computer crash
Altitude going negative
Latitude and longitude garbled
Cutdown triggered in back of car
Long periods of no transmission
Not setting the GPS up before launch
Not turning the camera on
Running out of camera disk space
Altitude jumping around rhythmically

The Curse and Joy of Determinism

Computers do what you tell them to
◦ Precisely what you tell them to
◦ Not what you think you told them to do
A Curse
◦ Will do things you don’t expect
◦ Will process bogus input without complaint
The Joy
◦ Easy to test that it does what’s expected

HAB Is A Harsh Environment

Cold
Vibration
Stuff breaks in flight

Software needs to be able to cope with failing hardware

Very important to think about failure modes

YOUR CODE IS ON ITS OWN OUT THERE

Deadly Sins

The “It works!” Fallacy
The Last Minute Change
Being Far Too Clever
Overlooking Odd Behaviour
Copying Other People’s Code
Assuming Finding A Bug Solves The Problem

The “It works!” Fallacy

If you’re an inexperienced (and sometimes experienced) programmer…
◦ You hack some code together
◦ It works once
◦ You assume it will always work
Only solution to this is
◦ Testing
◦ Paranoia

The Last Minute Change

Never, ever change anything in code at the last minute no matter how simple.
Example: HABE 1
◦ Complete camera failure
◦ Maximum integer size in uBASIC on CHDK is 999,999
◦ Last minute change of integer from 600,000 to 1,000,000 caused total failure

Being Far Too Clever

Example: GAGA-1

◦Entered the wrong value of 2 * pi in code to do GPS position conversion from radians to degrees

◦Caught before flight because I verified the location of my own back garden

◦Note to self: 2 * pi != 6.2818.

https://github.com/jgrahamc/gaga/blob/master/gaga-1/flight/gaga1/gps.cpp#L113
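
To see why checking against a known location catches this kind of slip, here is a minimal sketch. It is not the GAGA-1 code: the function, the test latitude of 51.5 degrees and the rough figure of 111 km per degree of latitude are all assumptions for illustration. Only the mistyped constant 6.2818 comes from the slide.

```python
import math

WRONG_TWO_PI = 6.2818  # mistyped constant quoted on the slide

def rad_to_deg(radians, two_pi=2 * math.pi):
    """Convert radians to degrees using the supplied value of 2 * pi."""
    return radians * 360.0 / two_pi

# Hypothetical known location used as a sanity check.
known_lat_deg = 51.5
lat_rad = math.radians(known_lat_deg)

good = rad_to_deg(lat_rad)                # uses the library value of 2 * pi
bad = rad_to_deg(lat_rad, WRONG_TWO_PI)   # uses the mistyped value

# Roughly 111 km per degree of latitude (approximation).
error_km = abs(bad - good) * 111.0
print("Position error from the typo: about %.1f km" % error_km)
```

A small error in a constant gives a position that looks plausible but is off by a distance that matters when you are searching a field for a payload.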

Overlooking Odd Behaviour

Example: GAGA-1

◦ In tests RTTY output was fine some of the time, garbled at other times

◦Turned out to be interrupts from the GPS messing up the RTTY timing

◦Solution: disable GPS serial interface while sending RTTY string

ALWAYS BE HONEST WITH YOURSELF ABOUT YOUR CODE

EXPECT THE SPANISH INQUISITION!

https://github.com/jgrahamc/gaga/blob/master/gaga-1/flight/gaga1/tsip.cpp#L229

Copying Other People’s Code

Don’t do this, you have no idea what you are copying or who they copied it from
Better practice is to look at other people’s code and…
◦ Write your own version
◦ That you understand
◦ That you are able to test
◦ Example: GAGA-1
Read lots of people’s RTTY code, wrote my own
https://github.com/jgrahamc/gaga/blob/master/gaga-1/flight/gaga1/rtty.cpp

APRS Tracker using copied code

If the altitude in metres contained an 8 or a 9 the altitude reported would be wrong

http://sharon.esrac.ele.tue.nl/users/pe1rxq/aprstracker/aprstracker.html

Assuming Finding The Bug Solves The Problem

Just because you’ve found A bug doesn’t mean it was THE bug
Lots of research in computer science shows bugs tend to cluster
Example: CLOUD1, CLOUD2
◦ Three bugs in printing latitude, longitude and altitude
◦ One fixed on CLOUD1, …

“The One Thing I Didn’t Test”

http://ukhas.org.uk/guides:common_coding_errors_payload_testing

Common problems with uC

Lack of floating point support
Small integers
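
A minimal sketch of what "small integers" can mean in practice, assuming a signed 16-bit int (common on 8-bit microcontrollers, but check your own platform). The helper name and the 33/10 metres-to-feet factor are illustrative assumptions; the factor avoids floating point and matches the round numbers used in the unit test slide.

```python
import ctypes

def as_int16(value):
    """Return the value a signed 16-bit integer would hold (wraps silently)."""
    return ctypes.c_int16(value).value

altitude_m = 10000

# Integer-only metres to feet. Python integers do not overflow, so only the
# final store into a 16-bit variable is modelled here; on a real 16-bit
# platform the intermediate product could overflow as well.
altitude_ft = altitude_m * 33 // 10

print(as_int16(30000))        # fits in 16 bits: 30000
print(as_int16(altitude_ft))  # 33000 does not fit: -32536
```

No error is raised in either case, which is why range checks and tests at realistic flight altitudes are worth the effort.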

You might never be a great programmer…

… but you can be a paranoid tester!

Good Things To Do

No infinite loops
Self-Checking
Unexpected Error Handling
Handle Exceptions
Simulation
Simplify, Simplify, Simplify
Unit Test
Write Log Files

No Infinite Loops

Never sit in a loop waiting forever
Example: ATLAS 3

  while (1) {
    // Make sure data is available to read
    if (Serial.available()) {
      b = Serial.read();

      if(bytePos == 8){
        navmode = b;
        return true;
      }

      bytePos++;
    }
    // Timeout if no valid response in 3 seconds
    if (millis() - startTime > 3000) {
      navmode = 0;
      return false;
    }
  }
}

https://github.com/jamescoxon/Atlas-Flight-Computer/blob/master/Atlas3/Atlas3_3.pde#L211
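
The same idea as a minimal Python sketch, for anyone whose flight or recovery computer runs Python. The function and its `read_byte` parameter are hypothetical: `read_byte` stands in for whatever non-blocking read your hardware offers and is assumed to return None when nothing is available.

```python
import time

def read_with_timeout(read_byte, timeout_s=3.0, clock=time.monotonic):
    """Poll read_byte() until it returns a value or timeout_s has elapsed.

    Returns the value read, or None if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        b = read_byte()
        if b is not None:
            return b
        time.sleep(0.01)  # brief pause so the loop does not spin flat out
    return None

# A device that never answers: the call gives up instead of hanging.
print(read_with_timeout(lambda: None, timeout_s=0.05))
# A device that answers straight away.
print(read_with_timeout(lambda: 6, timeout_s=0.05))
```

The caller still has to decide what to do with None; the point is only that control always comes back.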

Self-Checking

-- Now enter a self-check of the manual mode settings

log( "Self-check started" )

assert_prop( 49, -32764, "Not in manual mode" )
assert_prop(  5,      0, "AF Assist Beam should be Off" )
assert_prop(  6,      0, "Focus Mode should be Normal" )
assert_prop(  8,      0, "AiAF Mode should be On" )
assert_prop( 21,      0, "Auto Rotate should be Off" )
assert_prop( 29,      0, "Bracket Mode should be None" )
assert_prop( 57,      0, "Picture Mode should be Superfine" )
assert_prop( 66,      0, "Date Stamp should be Off" )
assert_prop( 95,      0, "Digital Zoom should be None" )
assert_prop( 102,      0, "Drive Mode should be Single" )
assert_prop( 133,      0, "Manual Focus Mode should be Off" )
assert_prop( 143,      2, "Flash Mode should be Off" )
assert_prop( 149,    100, "ISO Mode should be 100" )
assert_prop( 218,      0, "Picture Size should be L" )
assert_prop( 268,      0, "White Balance Mode should be Auto" )
assert_gt( get_time("Y"), 2009, "Unexpected year" )
assert_gt( get_time("h"), 6, "Hour appears too early" )
assert_lt( get_time("h"), 20, "Hour appears too late" )
assert_gt( get_vbatt(), 3000, "Batteries seem low" )
assert_gt( get_jpg_count(), ns, "Insufficient card space" )

https://github.com/jgrahamc/gaga/blob/master/gaga-1/camera/gaga-1.lua#L96

Self-Checking

Example: ATLAS 3
Makes sure uBlox GPS will work at high altitude; fixes it if not

    // Every tenth pass, re-check the GPS navigation mode
    if((count % 10) == 0) {
      digitalWrite(6, LOW);
      checkNAV();
      delay(1000);
      // Anything other than the expected mode triggers a fresh setup
      if(navmode != 6){
        setupGPS();
        delay(1000);
      }
      checkNAV();
      delay(1000);
      digitalWrite(6, HIGH);
    }

https://github.com/jamescoxon/Atlas-Flight-Computer/blob/master/Atlas3/Atlas3_3.pde#L342

Unexpected Error Handling

def temperature():
    t = at.cmd( 'AT#TEMPMON=1' )

    # Command returns something like:
    #
    # #TEMPMEAS: 0,28
    #
    # OK
    #
    # So split on whitespace first to isolate the temperature 0,28
    # and then split on comma to get the temperature

    w = t.split()
    if len(w) < 2:
        logger.log( "Temperature read returned %s" % t )
        return -1000
    m = w[1].split(',')
    if len(m) != 2:
        logger.log( "Temperature read returned %s" % t )
        return -1000
    else:
        return int(m[1])

https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/util.py

Handle Exceptions

If your language can generate exceptions then you’d better handle them!
Example: GAGA-1
◦ Recovery computer used Python
◦ Exception could have killed it
◦ Global exception handler
Bonus: What’s wrong with that code?

    except:
        logger.log( "Caught exception in main loop: %s" % sys.exc_info()[1] )

https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/gaga-1.py#L144
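
One possible take on a global handler, as a minimal sketch and not necessarily the answer the bonus question has in mind. It catches Exception instead of using a bare except, so KeyboardInterrupt and SystemExit still get through, and it logs the full traceback instead of only the message. The names run_forever, step and log are hypothetical stand-ins for the main loop, one pass of flight code and the logger.

```python
import traceback

def run_forever(step, log, max_iterations=None):
    """Call step() repeatedly; log any exception with its traceback and carry on.

    max_iterations exists only so the sketch can be run and tested.
    """
    n = 0
    while max_iterations is None or n < max_iterations:
        n += 1
        try:
            step()
        except Exception:
            # format_exc() gives the exception type, message and call stack
            log("Caught exception in main loop:\n%s" % traceback.format_exc())

# Demo: a step that fails on its second call only.
calls = []
messages = []

def flaky_step():
    calls.append(1)
    if len(calls) == 2:
        raise ValueError("bad GPS sentence")

run_forever(flaky_step, messages.append, max_iterations=3)
print(len(calls), len(messages))
```

The loop survives the failure and the log shows where it came from, which is what you want when reading the file after recovery.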

Simulation

Simulate a flight
Example: UKHAS wiki has example of using a PC as a fake GPS
Example: GAGA-1
◦ To test the embedded Telit module I wrote modules that faked the entire Telit Python interface.

http://www.ukhas.org.uk/guides:common_coding_errors_payload_testing

https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/GPS.py

https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/MDM.py
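
A minimal sketch of the faking idea. This is not the GAGA-1 GPS.py or MDM.py: the FakeGPS class, its get_fix method and the format_position function are hypothetical names, and the scripted positions are made-up test data. The point is that the code under test cannot tell a scripted flight from a real one.

```python
class FakeGPS:
    """Hypothetical stand-in for a GPS module that replays a scripted flight."""

    def __init__(self, fixes):
        self._fixes = list(fixes)
        self._i = 0

    def get_fix(self):
        """Return the next (lat, lon, alt_m) tuple, or None once the script
        runs out, which simulates losing the GPS lock."""
        if self._i >= len(self._fixes):
            return None
        fix = self._fixes[self._i]
        self._i += 1
        return fix

def format_position(gps):
    """Code under test: build a telemetry field from whatever the GPS returns."""
    fix = gps.get_fix()
    if fix is None:
        return "NOFIX"
    lat, lon, alt = fix
    return "%.4f,%.4f,%d" % (lat, lon, alt)

# Scripted flight: on the ground, at altitude, then a bogus negative altitude.
flight = [(52.0, -0.1, 100), (52.01, -0.09, 30000), (52.02, -0.08, -5)]
gps = FakeGPS(flight)
lines = [format_position(gps) for _ in range(4)]
print(lines)
```

Scripting odd inputs (negative altitude, lost lock) is cheap on the bench and impossible to arrange on demand in flight.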

Simplify, Simplify, Simplify

Make your code as simple as possible
Never have ‘duplicated’ or ‘copy and paste’ code
Break it up into small functions that you understand
Make sure you understand the limitations of the functions you call

Unit Test

Break your program up into small, separate functions
Write tests that call each function and make sure it does what you expect.
Lots of ways to do this
◦ Use something like cpptest
◦ ArduinoUnit
◦ Write your own test program

Unit Test Example

In the bad APRS program
Turn metres to feet code into a separate function: int m_to_f(int m)

assertEquals(m_to_f(1000),3300)
assertEquals(m_to_f(2000),6600)
assertEquals(m_to_f(3000),9900)
assertEquals(m_to_f(4000),13200)
assertEquals(m_to_f(5000),16500)
assertEquals(m_to_f(6000),19800)
assertEquals(m_to_f(7000),23100)
assertEquals(m_to_f(8000),26400)
assertEquals(m_to_f(9000),29700)
assertEquals(m_to_f(10000),33000)
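
The same checks as a runnable sketch using Python's unittest module. The body of m_to_f is an assumption: the 33/10 integer factor was picked only because it reproduces the expected values on the slide, and it is not the APRS tracker's code. The second test adds altitudes containing the digits 8 and 9, since those were the ones reported wrong.

```python
import unittest

def m_to_f(m):
    """Metres to feet with integer maths only (assumed 33/10 factor)."""
    return m * 33 // 10

class MetresToFeetTest(unittest.TestCase):
    def test_every_thousand_metres(self):
        # Same values as the slide: 1000 -> 3300 up to 10000 -> 33000
        for m in range(1000, 11000, 1000):
            self.assertEqual(m_to_f(m), m // 1000 * 3300)

    def test_altitudes_containing_8_or_9(self):
        self.assertEqual(m_to_f(8900), 29370)
        self.assertEqual(m_to_f(9800), 32340)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(MetresToFeetTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```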

Write Log Files

Write detailed log files to non-volatile memory for post flight debugging

Data sent via RTTY or APRS is limited

Log exceptions and errors in detail

Make sure you have a timestamp
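
A minimal sketch of a timestamped log writer in Python, assuming a flight or recovery computer with a normal file system. The function name and log format are illustrative choices. Each line is flushed and synced so that it has the best chance of surviving a crash or power loss.

```python
import os
import tempfile
import time

def log_line(path, message, now=time.time):
    """Append one UTC-timestamped line to the log and force it to storage."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now()))
    with open(path, "a") as f:
        f.write("%s %s\n" % (stamp, message))
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write it to the device

# Demo: write two entries to a temporary file with fixed clock values.
demo_path = os.path.join(tempfile.mkdtemp(), "flight.log")
log_line(demo_path, "GPS lock acquired", now=lambda: 0)
log_line(demo_path, "Temperature read returned garbage", now=lambda: 60)

with open(demo_path) as f:
    entries = f.read().splitlines()
print(entries)
```

On a microcontroller writing to an SD card the calls will differ, but the idea of one timestamped, flushed line per event carries over.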

Perform system testing

Test your entire system before flight
◦ Put your tracker in the garden
◦ Get a GPS lock
◦ Listen to the RTTY on your radio
◦ Look at the decoded RTTY on your computer
◦ Test uploaded data on the tracker*
◦ *I didn’t do that step; on the day, people had to fix the tracker for me.
