Recovering Device Drivers Michael M Swift, Muthukaruppan Annamalai, Brian N Bershad and Henry Levy

Introduction

► Device drivers fail more than anything else
  XP: 85% of all crashes
  Linux: 7x the bug rate of the mainline kernel
► Existing work protects the kernel
► Applications left to fend for themselves

Principles

► Device driver failures should be concealed from the driver’s clients
► Recovery logic should be centralized in a single subsystem
► Driver recovery logic should be generic
► Recovery services should have low overhead when not needed

Shadow Drivers

► Conceals driver failure from application
► Logs driver activity
  Driver state (ioctls)
  IO requests/calls
► On failure
  Intercepts IO requests
  Resets driver state by replaying log
► Model is abstract enough to be implemented for wide range of drivers

Why programs crash

► “Most drivers fail due to bugs that result from unexpected inputs or events [34]”
  [34] V. Orgovan, Systems Crash Analyst, Windows Core OS Group, Microsoft Corp. private communication, 2004
  Do we really need a reference for this?
  What sort of reference is that anyway?

Driver Faults

► Deterministic
  Set sequence of repeatable configuration or IO requests
  Unrecoverable with generic tools
► Transient
  Infrequent inputs or environment settings
► Fail-stop
  Kernel is protected from failing drivers
  Faults are detected before collateral damage occurs
► Shadow drivers require transient and fail-stop behavior

Nooks

► Earlier work in kernel protection
► Provides fail-stop facilities
  Detects memory violations
  Excessive CPU usage
  Bad kernel parameters
  75% success rate
► Simply reboots the driver after a fault

Shadow Driver Operation

► Passive Mode
  Normal operation
  Monitors all explicit communication
    ► Replicated procedure calls
    ► Not DMA
  Logs driver configuration
► Active Mode
  Recovery operation
  Reinitializes driver to known state
  Impersonates driver to the kernel

Taps

► Mechanism allowing replication and redirection of communication channels
► Passive Operation
  Calls driver function then shadow function
► Active mode
  Redirects all calls to shadow driver
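
The two tap modes can be illustrated with a small user-space sketch. This is not the kernel mechanism itself: `Tap`, `TapMode`, `driver_fn` and `shadow_fn` are hypothetical names chosen for illustration, and a real tap sits on a kernel-to-driver call path rather than wrapping a Python callable.

```python
from enum import Enum
from typing import Any, Callable


class TapMode(Enum):
    PASSIVE = "passive"  # normal operation: replicate each call to the shadow
    ACTIVE = "active"    # recovery: redirect each call to the shadow


class Tap:
    """Hypothetical tap wrapping one entry point of a driver."""

    def __init__(self, driver_fn: Callable[..., Any], shadow_fn: Callable[..., Any]):
        self.driver_fn = driver_fn
        self.shadow_fn = shadow_fn
        self.mode = TapMode.PASSIVE

    def __call__(self, *args: Any) -> Any:
        if self.mode is TapMode.ACTIVE:
            # Active mode: the failed driver is bypassed entirely.
            return self.shadow_fn(*args)
        # Passive mode: call the real driver first, then hand the same
        # call to the shadow so it can update its bookkeeping.
        result = self.driver_fn(*args)
        self.shadow_fn(*args)
        return result
```

In this sketch the caller always receives the driver's result in passive mode and the shadow's result in active mode, so flipping `mode` is the only change needed to start or end redirection.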

Passive Taps

Active Taps

Shadow Manager

► Controls all shadow drivers
► Manages recovery operations
► Controls Tap insertion
► Monitors device failures

General Infrastructure

► Nooks
  Isolation service
  Redirection mechanism
  Object tracking service
► Shadow Manager
  Installs shadow drivers

Architecture

Passive Monitoring

► Tracks IO requests
  Connection-oriented: offset/positioning
  Request-oriented: pending request log
► Logs configuration commands
  Only information stored in a persistent log
  Does not replicate driver state
► Tracks kernel objects obtained by driver
  Prevents memory leaks
► Many of the replicated calls are no-ops
  Read/write to sound device
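
As a rough illustration of how little a shadow needs to remember in passive mode, the sketch below keeps one field per item on this slide. The class name, method names and the string identifiers for kernel objects are all assumptions made for the example, not taken from the original system.

```python
from dataclasses import dataclass, field


@dataclass
class PassiveShadow:
    """Hypothetical bookkeeping kept by a shadow driver in passive mode."""

    offset: int = 0                                    # connection-oriented: position only
    pending: dict = field(default_factory=dict)        # request-oriented: id -> request
    config_log: list = field(default_factory=list)     # configuration commands, in order
    kernel_objects: set = field(default_factory=set)   # resources the driver holds

    def on_transfer(self, nbytes: int) -> None:
        # The data itself is never copied; only the position advances.
        self.offset += nbytes

    def on_submit(self, req_id: int, request: object) -> None:
        self.pending[req_id] = request

    def on_complete(self, req_id: int) -> None:
        # A completed request no longer needs to be resubmitted.
        self.pending.pop(req_id, None)

    def on_config(self, command: str, value: object) -> None:
        self.config_log.append((command, value))

    def on_acquire(self, obj: str) -> None:
        self.kernel_objects.add(obj)

    def on_release(self, obj: str) -> None:
        self.kernel_objects.discard(obj)
```

Nothing here mirrors the driver's internal state, which matches the slide's point that the shadow logs inputs rather than replicating the driver.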

Active Mode Recovery

► Impersonates driver to kernel and applications
► Recovers driver
  Stops failed driver
  Reinitializes driver
  Transfers state back into driver

Stopping the Failed Driver

► Shadow manager
  Signals shadow driver of failure
  Switches taps to redirection
► Shadow Driver
  Disables hardware device
  Garbage collects unnecessary resources

Reinitializing the Driver

► Shadow driver uses cached data section
► Initializes driver
► Reattaches driver to kernel resources
► Reenables hardware resources

Transferring Driver State

► Shadow Driver resubmits any outstanding IO requests
  Possible replication of IO
  If device cannot handle duplicate IO, request is canceled
► Replays logged configuration commands
► Shadow Driver signals Shadow Manager
► Taps set back to passive mode
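
The stop, reinitialize and transfer steps from the last three slides can be strung together in one sketch. Everything named here (`FakeDriver`, `recover`, the `can_duplicate_io` flag, taps modeled as dictionaries) is a hypothetical stand-in, and the steps simply follow the order in which the slides list them.

```python
class FakeDriver:
    """Stand-in for a real driver; it only records what recovery does to it."""

    def __init__(self):
        self.events = []

    def stop(self):
        self.events.append("stop")

    def initialize(self):
        self.events.append("init")

    def submit(self, req):
        self.events.append(("submit", req))

    def configure(self, command, value):
        self.events.append(("config", command, value))


def recover(driver, pending, config_log, taps, can_duplicate_io):
    """Hypothetical end-to-end recovery; returns the requests it canceled."""
    for tap in taps:
        tap["mode"] = "active"            # redirect calls to the shadow
    driver.stop()                         # disable the device, drop stale resources
    driver.initialize()                   # restart from a known state
    canceled = []
    for req in list(pending):             # iterate over a copy; pending may shrink
        if can_duplicate_io:
            driver.submit(req)            # may repeat IO the device already performed
        else:
            pending.remove(req)           # device cannot tolerate a duplicate
            canceled.append(req)
    for command, value in config_log:
        driver.configure(command, value)  # replay configuration in logged order
    for tap in taps:
        tap["mode"] = "passive"           # resume normal monitoring
    return canceled
```

Returning the canceled requests makes the cost of recovery visible to the caller: anything in that list is work the application may have to redo.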

Proxying of Requests

► Depends on driver mechanics and interface
► Possible actions
  Respond with recorded information
  Silently drop request
  Queue request for later
  Block request
  Report driver busy
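
One way to picture the five actions is as a per-request-type policy table consulted while the driver is being recovered. The request kinds and the mapping in `POLICY` below are invented for illustration; which action suits which request depends on the driver class, as the slide notes.

```python
from enum import Enum, auto


class ProxyAction(Enum):
    RESPOND_RECORDED = auto()  # answer from information logged earlier
    DROP = auto()              # silently discard the request
    QUEUE = auto()             # hold the request until recovery finishes
    BLOCK = auto()             # make the caller wait
    REPORT_BUSY = auto()       # tell the caller to retry later


# Hypothetical policy: request kind -> action taken during recovery.
POLICY = {
    "get_config": ProxyAction.RESPOND_RECORDED,
    "send_packet": ProxyAction.DROP,
    "write_block": ProxyAction.QUEUE,
}


def proxy(kind, payload, recorded, queued, policy=POLICY):
    """Return (action, reply) for one request arriving during recovery."""
    action = policy.get(kind, ProxyAction.REPORT_BUSY)
    reply = None
    if action is ProxyAction.RESPOND_RECORDED:
        reply = recorded.get(payload)     # answer from logged state
    elif action is ProxyAction.QUEUE:
        queued.append((kind, payload))    # deliver after recovery
    # DROP, BLOCK and REPORT_BUSY carry no reply in this sketch; a real
    # BLOCK would suspend the caller until the driver is back.
    return action, reply
```

Defaulting unknown request kinds to `REPORT_BUSY` is a design choice of this sketch: it is the one action that never loses data and never pretends the request succeeded.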

Limitations

► Requires dynamic loading and unloading
► Requires explicit communication channels
  DMA doesn’t work
► Assumes driver failure has no external effects
► Requires effective isolation and protection service
► Cannot make real-time guarantees

Evaluation

► Performance
  Overhead during passive mode
► Fault-Tolerance
  Does it work
► Limitations
  How many failures can be dealt with
► Code Size
  Amount of kernel modification needed
► Either the advisor is a jerk or the grad students need a social life

Tested Drivers

Tested Applications

Performance

► Three configurations
  Linux-Native: Stock kernel
  Linux-Nooks: kernel protection
  Linux-SD: Shadow driver implementation
► No additional penalty vs Linux-Nooks
► Only 1-3% performance hit vs Linux-Native

Relative Performance

CPU Utilization

Fault Tolerance

► Bugs culled from bug-fixes posted to the linux-kernel mailing list
► Bugs were replicated inside each driver
► Placed bugs in rarely taken paths
  Unusual hardware conditions
► Forced driver to take unusual path
► What is the difference between that and adding a faulting ioctl?

Fault Tolerance

Recovery Behavior

► Not completely seamless
► Noticeable gap during recovery
► Possible temporary data loss

Limitations

► How do shadow drivers perform with non fail-stop errors
► Large scale fault injection experiments
► Cases
  Failure detected
    ► Recovery hidden from application?
  Failure not detected

What would you do for a PhD?

► In total, we ran 2100 trials across the three drivers and six applications. Between trials, we reset the system and reloaded the driver. For each trial, we injected five random errors into the driver while the application was using it. We ensured the errors were transient by removing them during recovery. After injection, we visually observed the impact on the application and the system to determine whether a failure or recovery had occurred.

Undetected Failures

► 3 Cases
  IO requests that never complete
  Driver <-> Device interaction
  Certain bad parameters/return codes
► Need better understanding of driver semantics

Fault Outcomes

Code Size