RAS: What is it? Why do we need it? Harb Abdulhamid (Qualcomm) Fu Wei (Red Hat) Yazen Ghannam (AMD)

Las16 200 - firmware summit - ras what is it- why do we need it


Page 1: Las16 200 - firmware summit - ras what is it- why do we need it

RAS: What is it? Why do we need it?

Harb Abdulhamid (Qualcomm)

Fu Wei (Red Hat)

Yazen Ghannam (AMD)

Page 2: Las16 200 - firmware summit - ras what is it- why do we need it

ENGINEERS AND DEVICES WORKING TOGETHER

What is it?

● Reliability

○ Computation needs to be correct and reliable.

○ Failures and errors need to be detected and reported.

○ Computation needs to fail when an error is not handled.

● Availability

○ System needs to remain available as long as possible.

○ Errors should be corrected and failures handled so that operation can continue.

● Serviceability

○ System should provide information to the administrator to aid in system servicing.

○ Service time needs to be minimized to maximize uptime.
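
The link between service time and uptime can be made concrete with the common steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names and the hour figures are assumptions chosen for the example, not numbers from the talk.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    mtbf_hours: mean time between failures (reliability side).
    mttr_hours: mean time to repair (serviceability side).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: same failure rate, two different service times.
    for mttr in (8.0, 1.0):
        a = availability(1000.0, mttr)
        print(f"MTTR={mttr}h availability={a:.5f} "
              f"downtime/year={downtime_hours_per_year(a):.1f}h")
```

With the failure rate held fixed, shortening the repair time is the only term that changes, which is why serviceability feeds directly into availability.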

Page 3: Las16 200 - firmware summit - ras what is it- why do we need it

Why do we need it?

● Increase in system uptime (productivity)

● Less time spent debugging bad or failing hardware (productivity/cost)

● Fewer hardware replacement calls (cost/mindshare)

Page 4: Las16 200 - firmware summit - ras what is it- why do we need it

Hardware Architecture (How do we do it?)

● x86: Machine Check Exceptions (MCE) & Machine Check Architecture (MCA)

○ Architectural features/extensions.

○ Defines a register set that can be used for multiple devices (IMPORTANT!).

○ Poll for correctable errors.

○ APIC LVT or SMI interrupts for correctable-error thresholding and deferred errors.

○ MCE for uncorrectable errors.

● PCI-E: Advanced Error Reporting (AER)

○ Similar concepts to MCE/MCA.

● Implementation-specific features

○ ECC in memory controllers

○ ECC in I/O RAMs

○ Poison/bad data markers

○ Flooding I/O links (e.g. Sync Flood)
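
As a rough illustration of "poll for correctable errors", the sketch below decodes a few flag bits of an x86 MCA bank status value (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57) and walks a list of banks. The list of integers is a simulated stand-in for the per-bank status registers; real code would read them as MSRs from privileged code, and the function names here are hypothetical.

```python
# Flag bits of an MCA bank status register (MCi_STATUS).
VAL = 1 << 63    # register holds a valid error
OVER = 1 << 62   # an earlier error was overwritten
UC = 1 << 61     # error was not corrected by hardware
EN = 1 << 60     # error reporting was enabled for this error
MISCV = 1 << 59  # the MISC register holds extra information
ADDRV = 1 << 58  # the ADDR register holds a valid address
PCC = 1 << 57    # processor context may be corrupt


def classify(status: int) -> str:
    """Return 'none', 'corrected' or 'uncorrectable' for one status value."""
    if not status & VAL:
        return "none"
    return "uncorrectable" if status & UC else "corrected"


def poll_banks(banks: list) -> list:
    """Collect corrected errors and clear the banks that logged them.

    Uncorrectable errors are left in place: on real hardware they are
    delivered through a machine check exception, not found by polling.
    Returns (bank_index, error_code) pairs, error_code being bits 15:0.
    """
    found = []
    for index, status in enumerate(banks):
        if classify(status) == "corrected":
            found.append((index, status & 0xFFFF))
            banks[index] = 0  # simulated clear of the status register
    return found


if __name__ == "__main__":
    simulated = [0, VAL | EN | 0x0135, VAL | UC | EN | PCC | 0x0001]
    print(poll_banks(simulated))
```

Because every bank shares the same register layout, one decoder serves all the devices that report through MCA, which is the point the slide marks as important.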

Page 5: Las16 200 - firmware summit - ras what is it- why do we need it

Platform Firmware (How do we do it?)

● Platform Firmware has intimate knowledge of the system and can handle RAS features not available through standardized mechanisms.

● Privileged code runs on the main cores or a separate microcontroller.

● Can mask registers from OS view and handle interrupts.

● Handling can be done without the OS’s knowledge and information can be exposed to the OS if desired.

● Preferably, will use a standard mechanism, like ACPI, to inform the OS of errors.

● Can directly inform sysadmin of errors using sideband communications like a baseboard management controller (BMC).

● Can pinpoint bad hardware for easy replacement.

Page 6: Las16 200 - firmware summit - ras what is it- why do we need it

Kernel (How do we do it?)

● Error Detection and Correction (EDAC) for system-specific handling and decoding.

● ISA-specific handling in arch/.

● Drivers for PCI-E AER and ACPI.

● Ideally, most RAS code in the Kernel would be obsoleted by Platform Firmware handling of errors.

● Kernel could then be only responsible for reporting errors received through standard mechanisms (e.g. ACPI).

● Kernel could also perform error handling relevant at the kernel-level (e.g. killing processes or retiring bad/poisoned pages).
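
The split of duties described above (firmware reports, kernel reacts only at the kernel level) could be sketched as a small decision function. Everything here is a hypothetical policy for illustration: the severity names, the action strings and the rule that an error outside user memory leads to a panic are assumptions, not the behaviour of any particular kernel.

```python
from enum import Enum


class Severity(Enum):
    CORRECTED = "corrected"      # hardware or firmware already fixed it
    RECOVERABLE = "recoverable"  # uncorrected, but contained
    FATAL = "fatal"              # system state cannot be trusted


def choose_actions(severity: Severity, page_owned_by_user_process: bool) -> tuple:
    """Map a reported error to kernel-level actions (hypothetical policy)."""
    if severity is Severity.CORRECTED:
        # Nothing to repair; only make the event visible to the administrator.
        return ("report",)
    if severity is Severity.RECOVERABLE and page_owned_by_user_process:
        # Contain the damage: stop using the page and kill its owner.
        return ("report", "retire_page", "kill_process")
    # Uncontained or fatal errors: computation must fail rather than continue.
    return ("report", "panic")


if __name__ == "__main__":
    print(choose_actions(Severity.RECOVERABLE, True))
```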

Page 7: Las16 200 - firmware summit - ras what is it- why do we need it

User-space (How do we do it?)

● mcelog

○ Generally considered obsolete.

○ x86 only.

○ Reads data from /dev/mcelog.

● rasdaemon

○ More actively maintained.

○ Can be updated to handle various platforms.

○ Reads data from Kernel tracepoints.

○ Can effectively obsolete EDAC modules for error decoding.

Page 8: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI (How do we do it?)

● We’ll get into this next...

Page 9: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI APEI BERT

● Scenarios: Record errors in an emergency (OS crash/reset)

● BERT: Boot Error Record Table

● Mechanism: report unhandled errors that occurred in a previous boot.

○ WHERE are the error records
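
The "WHERE" in BERT comes down to two fields that follow the standard 36-byte ACPI table header: a 4-byte Boot Error Region Length and an 8-byte Boot Error Region address. A minimal parsing sketch follows, assuming the raw table bytes are already in hand; the function name and the returned dictionary keys are hypothetical.

```python
import struct

# ACPI header (signature, length, revision, checksum, OEM ID, OEM table ID,
# OEM revision, creator ID, creator revision), then the two BERT fields.
BERT_FORMAT = "<4sIBB6s8sI4sIIQ"
BERT_SIZE = struct.calcsize(BERT_FORMAT)  # 48 bytes


def parse_bert(table: bytes) -> dict:
    """Return the location and size of the boot error region."""
    if len(table) < BERT_SIZE:
        raise ValueError("table too short for a BERT")
    fields = struct.unpack_from(BERT_FORMAT, table)
    signature, length = fields[0], fields[1]
    if signature != b"BERT":
        raise ValueError("not a BERT table")
    # ACPI tables are valid when all their bytes sum to zero modulo 256.
    if sum(table[:length]) % 256 != 0:
        raise ValueError("bad checksum")
    return {"region_length": fields[9], "region_address": fields[10]}


if __name__ == "__main__":
    # Synthetic table for demonstration; all values are made up.
    raw = bytearray(struct.pack(BERT_FORMAT, b"BERT", BERT_SIZE, 1, 0,
                                b"OEMID ", b"OEMTABLE", 1, b"TEST", 1,
                                0x1000, 0x80000000))
    raw[9] = (-sum(raw)) % 256  # fix up the checksum byte
    print(parse_bert(bytes(raw)))
```

On a Linux system whose firmware provides the table, the raw bytes are typically readable by root from /sys/firmware/acpi/tables/BERT; the error records themselves live in the region the table points to, not in the table.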

Page 10: Las16 200 - firmware summit - ras what is it- why do we need it

UEFI spec CPER

Page 11: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI APEI BERT

Page 12: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI APEI HEST

● Scenarios: Record errors at runtime (OS can still work)

● HEST: Hardware Error Source Table

● Mechanism: a standardized way for platforms to describe their error sources through Error Source Structures:

○ HOW to inform

○ WHERE are the error records

○ WHEN records can be freed

Page 13: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI APEI HEST

● Error Source Structure:

○ For IA-32: MCE/CMC/NMI

○ For PCI: AER Root Port/Endpoint/Bridge

○ Generic Hardware: GHES v1/v2

● For ARM64: GHES v2

○ HOW to inform: Notification Structure

○ WHERE are the error records: Error Status Address (GAS: Generic Address Structure)

○ WHEN records can be freed: Read Ack Register
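
The HOW/WHERE/WHEN handshake of GHES v2 can be modelled in a few lines. This is a simulation, not firmware or kernel code: the class and function names are hypothetical, a plain attribute stands in for the memory behind the Error Status Address, and the choice of bit 0 as the acknowledgement bit is an assumption. The acknowledgement itself follows the usual rule of keeping the bits selected by Read Ack Preserve and setting the bits in Read Ack Write.

```python
ACK_BIT = 0x1
MASK64 = 0xFFFFFFFFFFFFFFFF


class ErrorSource:
    """Simulated GHES v2 error source shared by firmware and OS."""

    def __init__(self) -> None:
        self.status_block = None            # WHERE: error status block
        self.notified = False               # HOW: pending notification
        self.read_ack = ACK_BIT             # WHEN: starts out acknowledged
        self.read_ack_preserve = ~ACK_BIT & MASK64
        self.read_ack_write = ACK_BIT


def firmware_report(src: ErrorSource, record: str) -> bool:
    """Firmware side: publish a record only if the previous one was acked."""
    if not src.read_ack & ACK_BIT:
        return False                        # OS still owns the status block
    src.status_block = record
    src.read_ack &= ~ACK_BIT                # hand the block over to the OS
    src.notified = True                     # e.g. raise the configured interrupt
    return True


def os_handle(src: ErrorSource):
    """OS side: consume the record, then acknowledge so it can be reused."""
    if not src.notified:
        return None
    record = src.status_block
    src.notified = False
    src.read_ack = (src.read_ack & src.read_ack_preserve) | src.read_ack_write
    return record


if __name__ == "__main__":
    source = ErrorSource()
    print(firmware_report(source, "memory error A"))  # accepted
    print(firmware_report(source, "memory error B"))  # refused until ack
    print(os_handle(source))
    print(firmware_report(source, "memory error B"))  # accepted after ack
```

The acknowledgement is what distinguishes v2 from v1: without it, firmware has no standard way to know when the status block may be overwritten.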

Page 14: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI APEI HEST

Page 15: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI APEI ERST

● Scenarios: Record and retrieve errors in persistent storage

● ERST: Error Record Serialization Table

● Mechanism: Abstracts the operations; provides the details necessary to communicate with on-board persistent storage

● Plan B: use the UEFI runtime variable services to carry out error record persistence operations
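
The idea of abstracting the operations can be pictured as one small interface with interchangeable backends. The sketch below is purely illustrative: the class names are hypothetical, and the in-memory backend stands in for whatever actually persists the data, be it the ERST-described storage or, as in plan B, UEFI runtime variables.

```python
from abc import ABC, abstractmethod


class ErrorRecordStore(ABC):
    """Write, read and clear error records keyed by a record ID."""

    @abstractmethod
    def write(self, record_id: int, data: bytes) -> None: ...

    @abstractmethod
    def read(self, record_id: int) -> bytes: ...

    @abstractmethod
    def clear(self, record_id: int) -> None: ...

    @abstractmethod
    def record_ids(self) -> list: ...


class InMemoryStore(ErrorRecordStore):
    """Stand-in backend; a real one would talk to persistent storage."""

    def __init__(self) -> None:
        self._records = {}

    def write(self, record_id: int, data: bytes) -> None:
        self._records[record_id] = bytes(data)

    def read(self, record_id: int) -> bytes:
        return self._records[record_id]

    def clear(self, record_id: int) -> None:
        del self._records[record_id]

    def record_ids(self) -> list:
        return sorted(self._records)


def drain(store: ErrorRecordStore) -> list:
    """After reboot: retrieve every saved record, then free its slot."""
    saved = []
    for record_id in store.record_ids():
        saved.append(store.read(record_id))
        store.clear(record_id)
    return saved


if __name__ == "__main__":
    store = InMemoryStore()
    store.write(1, b"record saved while crashing")
    print(drain(store), store.record_ids())
```

Callers such as `drain` depend only on the interface, so swapping the persistence mechanism does not change the code that saves or retrieves records.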

Page 16: Las16 200 - firmware summit - ras what is it- why do we need it

ACPI APEI EINJ

● Scenarios: Test the OSPM error handling stack

● EINJ: Error Injection Table

● Mechanism: Abstracts the operations; provides a generic interface through which OSPM can inject hardware errors into the platform without requiring platform-specific software.
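
On Linux, the EINJ driver exposes this generic interface through debugfs files such as `param1`, `param2`, `error_type` and `error_inject` under /sys/kernel/debug/apei/einj. The sketch below writes those files in order; it assumes a kernel built with EINJ support, debugfs mounted, root privileges and firmware that provides the table. The function name and the example address are hypothetical, and the `einj_dir` parameter exists so the sequence can be tried against a scratch directory first.

```python
from pathlib import Path

EINJ_DIR = Path("/sys/kernel/debug/apei/einj")

# Error type value defined by the ACPI EINJ specification.
MEMORY_CORRECTABLE = 0x8


def inject_error(error_type: int, address: int, mask: int,
                 einj_dir=EINJ_DIR) -> None:
    """Request one error injection by writing the EINJ control files.

    param1 carries the target physical address and param2 the address
    mask; writing 1 to error_inject triggers the injection.
    """
    base = Path(einj_dir)
    steps = [
        ("param1", address),
        ("param2", mask),
        ("error_type", error_type),
        ("error_inject", 1),  # must come last: this starts the injection
    ]
    for name, value in steps:
        (base / name).write_text(f"{value:#x}\n")


if __name__ == "__main__":
    import tempfile

    # Dry run against a scratch directory instead of the real debugfs.
    with tempfile.TemporaryDirectory() as scratch:
        inject_error(MEMORY_CORRECTABLE, 0x12345000,
                     0xFFFFFFFFFFFFF000, einj_dir=scratch)
        print(sorted(p.name for p in Path(scratch).iterdir()))
```

Injecting errors on a real machine can corrupt data or crash it, so a run against the real path belongs on a test system only.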

Page 17: Las16 200 - firmware summit - ras what is it- why do we need it

RAS on ARM64

● Architectural support for RAS is not available, but it is not needed.

● In other words, no need to follow the same historical path as other architectures.

● Focus should be on Platform Firmware handling of errors.

● Reporting should be through standard methods like ACPI.

● Will possibly need to implement kernel-relevant error handling based on information received from Platform Firmware.

Page 18: Las16 200 - firmware summit - ras what is it- why do we need it

Current Work

● Add support for ACPI RAS features.

● Testing Platform Firmware to OS interface.

● No platform-specific RAS feature testing.

● Using modified QEMU for testing.

Page 19: Las16 200 - firmware summit - ras what is it- why do we need it

Future Work

● Finish ACPI implementation.

● Investigate kernel handling of poisoned pages and processes.

● Investigate I/O-related error handling in the Kernel.

Page 20: Las16 200 - firmware summit - ras what is it- why do we need it

Demo

Page 21: Las16 200 - firmware summit - ras what is it- why do we need it

Thank You

#LAS16

For further information: www.linaro.org

LAS16 keynotes and videos on: connect.linaro.org