Office 365 service management


[Figure: monthly uptime by region (Americas, European, Asia-Pacific), January through the following January; vertical axis from 99.80% to 100.00%]

In a 12-month period, the uptime of O365 applications averaged more than 99.9%.

Redundancy
• Physical redundancy
• Data redundancy
• Functional redundancy

Resiliency
• Active load balancing
• Recovery across “failure domains” regularly tested

Human backup
• Automated recovery alerts 24x7 on-call engineer
• On-call engineers are core product group members

Distributed Workloads
• Distributed components are more resilient
• Most failures are contained to a single service
• Service component isolation

Complexity avoidance and graceful degradation
• Standardized hardware
• Fully automated deployment
• Built-in workload management mechanisms

Inspectability and predictability
• Detailed log and tracing
• Deep internal monitoring augmented by extensive outside-in monitoring diagnostics

Additional Channels

Primary Channels

Incident Status

Status | Description

Investigating | Monitors have indicated a service anomaly and/or Microsoft has received reports of a potential service incident. Microsoft is currently investigating.

Service Interruption | Microsoft has confirmed that normal services are being impacted. Microsoft is taking immediate action to understand the cause of the failure and determine the best course of action to restore service.

Service Degradation | Services are still active, but service responsiveness and/or delivery times may be slower than usual. Microsoft is working to restore normal service responsiveness.

Restoring Service | Microsoft has isolated the likely cause of the incident and is in the process of restoring service.

Extended Recovery | Services are restored and may be slower than usual.

Service Restored | Normal system services have been restored.

False Positive | The service is healthy and a service incident did not actually occur.

Additional Information | There is additional information provided.

Normal Service | The service is healthy.
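For administrators who script around these status labels, the table above can be captured as an enumeration. The following is a minimal sketch, not an official API: the names `IncidentStatus`, `ONGOING`, and `is_ongoing` are hypothetical, and the grouping of statuses into "ongoing" is an assumption based only on the descriptions above.

```python
from enum import Enum


class IncidentStatus(Enum):
    """Status labels as listed in the Incident Status table."""
    INVESTIGATING = "Investigating"
    SERVICE_INTERRUPTION = "Service Interruption"
    SERVICE_DEGRADATION = "Service Degradation"
    RESTORING_SERVICE = "Restoring Service"
    EXTENDED_RECOVERY = "Extended Recovery"
    SERVICE_RESTORED = "Service Restored"
    FALSE_POSITIVE = "False Positive"
    ADDITIONAL_INFORMATION = "Additional Information"
    NORMAL_SERVICE = "Normal Service"


# Assumption: statuses whose descriptions imply the service is not yet
# back to normal are treated as "ongoing". This grouping is illustrative.
ONGOING = frozenset({
    IncidentStatus.INVESTIGATING,
    IncidentStatus.SERVICE_INTERRUPTION,
    IncidentStatus.SERVICE_DEGRADATION,
    IncidentStatus.RESTORING_SERVICE,
    IncidentStatus.EXTENDED_RECOVERY,
})


def is_ongoing(label: str) -> bool:
    """Return True if the status label denotes an incident still in progress.

    Raises ValueError for a label that is not in the table.
    """
    return IncidentStatus(label.strip()) in ONGOING


print(is_ongoing("Service Degradation"))  # True
print(is_ongoing("False Positive"))       # False
```

Looking a label up by value (`IncidentStatus("...")`) means an unexpected label fails loudly instead of being silently treated as healthy.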


http://status.office365.com/

Click on “View history for past 30 days”

Click on “Incident ID MO2708”

• For a limited set of service incidents
• Explanation of incident
• Localized content

Roles and Responsibilities

Post-incident reports (PIRs) are published for service availability issues that span multiple customers

Available within 5 business days

PIR downloadable document accessible from SHD

A PIR includes:
• Incident Information
• Summary
• Customer Impact
• Incident Start Date and Time
• Root Cause
• Next Steps

30 day historical view in SHD

Click on “Post-incident report published”

Type | Description | Channel

Planned Maintenance Update | 5 business days prior notification of planned service maintenance. Notification includes start and end time. | Service Health Dashboard; RSS Admin Feed (for subscribed admins)
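As a worked example of the five-business-day notice window for planned maintenance, the sketch below counts back from a maintenance start date. It assumes business days are Monday through Friday and ignores public holidays; the function name and the sample date are hypothetical.

```python
from datetime import date, timedelta


def subtract_business_days(start: date, days: int) -> date:
    """Step back `days` business days from `start`.

    Assumption: business days are Monday-Friday; holidays are not considered.
    """
    current = start
    remaining = days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Hypothetical maintenance window starting on Wednesday, 20 March 2024.
maintenance_start = date(2024, 3, 20)
latest_notice = subtract_business_days(maintenance_start, 5)
print(latest_notice)  # 2024-03-13
```

Because a weekend falls inside the window, five business days of notice here spans seven calendar days.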

Transparent non-customer impacting service hygiene

More detailed information and programmatic approach around service updates and service incidents

Tenant Level Reporting

Service Health Dashboard Customer Preview Programs

Service Communication Panel Concept

Office 365 Community.

http://community.office365.com/en-us/preview/tools/troubleshooting.aspx

http://community.office365.com/en-us/preview/wikis/diagnostic_tools/2146.aspx

http://community.office365.com/en-us/preview/wikis/diagnostic_tools/2146.aspx#smallbusinesses

http://community.office365.com/en-us/preview/wikis/diagnostic_tools/2146.aspx#enterprises

https://outlook.com/owa

https://<domain>.sharepoint.com/<pagename>.aspx

https://<domain>.sharepoint.com/personal/<UserAlias>_<domain>/Documents/Forms/All.aspx


Web browser

Office client

Operating system

Mainstream Support: 2013 to 2018

Extended Support: 2018 to 2023


Service health summary with quick access to detailed dashboard

Simplified navigation bar with quick access to all workloads

Reports on service usage and performance


Manage mailboxes, groups and objects

Search for properties

Conduct an advanced search

Manage roles and permissions

Create policies

Track message delivery


Who

Where

What


Systems administrator All


Edit contact details

Manage groups

Manage voice mail and phone settings

The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.

Longer outages have a greater impact on the percentage.

Outages that affect a greater number of users have a greater impact.

More severe outages, in terms of users or duration, lead to greater deviations from 100%, which can be used to determine service credits as a remedy.

The Office 365 service level agreement expresses uptime in this way:

\[
\frac{\text{User minutes} - \text{Downtime}}{\text{User minutes}} \times 100\%
\]

The aggregate uptime of service components can be expressed similarly.
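The uptime expression above, user minutes minus downtime divided by user minutes, can be illustrated with a short calculation. This is a sketch under stated assumptions: the tenant size, month length, and outage figures are hypothetical, and downtime is taken as affected users multiplied by outage minutes.

```python
def uptime_percentage(user_minutes: float, downtime_minutes: float) -> float:
    """Return (user minutes - downtime) / user minutes, as a percentage."""
    if user_minutes <= 0:
        raise ValueError("user_minutes must be positive")
    if not 0 <= downtime_minutes <= user_minutes:
        raise ValueError("downtime_minutes must be between 0 and user_minutes")
    return (user_minutes - downtime_minutes) / user_minutes * 100.0


# Hypothetical tenant: 1,000 users over a 30-day month.
users = 1000
minutes_in_month = 30 * 24 * 60          # 43,200 minutes
user_minutes = users * minutes_in_month  # 43,200,000 user minutes

# Hypothetical outage: 200 users affected for 90 minutes.
downtime = 200 * 90                      # 18,000 user minutes

print(f"{uptime_percentage(user_minutes, downtime):.4f}%")  # 99.9583%
```

Doubling either the number of affected users or the outage duration doubles the downtime term, which is why longer or broader outages pull the percentage further from 100%.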

Hardware or software failures

Monitoring alerts

Service incidents

Customer-reported incidents

SPO

EXO

Microsoft Online ID

Office 365 Portal

Office 365 Provisioning

Lync

Avoid unnecessary assumptions by on-call engineers

Isolate issues to root cause

• Executed from two+ locations to ensure accuracy and redundancy
• Simulates full end user and system transactions
• Supports every major system and user scenario
• Failures at any point are turned into alerts and escalated to engineers
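The outside-in monitoring described above can be sketched as a set of scripted transactions run from several locations, where the first failing step becomes an alert. This is an illustrative sketch only, not the actual monitoring system: the location names, step names, and the `Alert`, `run_transaction`, and `probe_all` helpers are hypothetical, and the steps are stubbed with simple callables instead of real service calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A step is any callable that returns True on success (hypothetical contract).
Step = Callable[[], bool]


@dataclass(frozen=True)
class Alert:
    location: str
    step: str


def run_transaction(location: str, steps: Dict[str, Step]) -> List[Alert]:
    """Run the steps in order; the first failure (or exception) becomes an alert."""
    for name, step in steps.items():
        try:
            ok = step()
        except Exception:
            ok = False
        if not ok:
            return [Alert(location, name)]
    return []


def probe_all(plans: Dict[str, Dict[str, Step]]) -> List[Alert]:
    """Run one transaction per location and collect every alert raised."""
    alerts: List[Alert] = []
    for location, steps in plans.items():
        alerts.extend(run_transaction(location, steps))
    return alerts


# Stubbed end-user transaction: sign in, open a mailbox, send a message.
healthy = {"sign_in": lambda: True, "open_mailbox": lambda: True, "send_message": lambda: True}
failing = {"sign_in": lambda: True, "open_mailbox": lambda: False, "send_message": lambda: True}

alerts = probe_all({"probe-a": healthy, "probe-b": failing})
for alert in alerts:
    print(f"ALERT: step '{alert.step}' failed from {alert.location}")
```

Running the same transaction from more than one location makes it possible to tell a failure seen everywhere from one seen at a single probe; how such results are correlated and escalated is left out of this sketch.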

Technology
