motty-ben-atia
Office 365 service management
[Figure: monthly uptime charts for the Americas, European, and Asia-Pacific regions; x-axis January through the following January, y-axis 99.80% to 100.00%]
Over a 12-month period, the uptime of Office 365 (O365) applications averaged more than 99.9%.
Redundancy
• Physical redundancy
• Data redundancy
• Functional redundancy
Resiliency
• Active load balancing
• Recovery across “failure domains” regularly tested
Human backup
• Automated recovery alerts
• 24x7 on-call engineer
• On-call engineers are core product group members
Distributed Workloads
• Distributed components are more resilient
• Most failures are contained to a single service
• Service component isolation
Complexity avoidance and graceful degradation
• Standardized hardware
• Fully automated deployment
• Built-in workload management mechanisms
Inspectability and predictability
• Detailed log and tracing
• Deep internal monitoring augmented by extensive outside-in monitoring diagnostics
Additional Channels
Primary Channels
Incident Status
Status Description SHD icon
Investigating: Monitors have indicated a service anomaly and/or Microsoft has received reports of a potential service incident. Microsoft is currently investigating.
Service Interruption: Microsoft has confirmed that normal services are being impacted. Microsoft is taking immediate action to understand the cause of the failure and determine the best course of action to restore service.
Service Degradation: Services are still active, but service responsiveness and/or delivery times may be slower than usual. Microsoft is working to restore normal service responsiveness.
Restoring Service: Microsoft has isolated the likely cause of the incident and is in the process of restoring service.
Extended Recovery: Services are restored and may be slower than usual.
Service Restored: Normal system services have been restored.
False Positive: The service is healthy and a service incident did not actually occur.
Additional Information: There is additional information provided.
Normal Service: The service is healthy.
http://status.office365.com.
Click on “View history for past 30 days”
Click on “Incident ID MO2708”
For Limited Set of Service Incidents
• Explanation of Incident
• Localized Content
Roles and Responsibilities
Post-incident reports (PIRs) are published for service availability issues that span multiple customers
Available within 5 business days
PIR downloadable document accessible from SHD
A PIR includes:
• Incident Information
• Summary
• Customer Impact
• Incident Start Date and Time
• Root Cause
• Next Steps
30 day historical view in SHD
Click on “Post-incident report published”
Type: Planned Maintenance Update
Description:
• 5 business days prior notification of planned service maintenance.
• Notification includes start and end time.
Channel:
• Service Health Dashboard
• RSS Admin Feed (for subscribed admins)
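The 5-business-day notice for planned maintenance can be illustrated with a small date calculation. This is a minimal sketch, not Microsoft's scheduling logic: the function name `latest_notification_date` is hypothetical, business days are assumed to be Monday through Friday, public holidays are ignored, and the maintenance date is made up.

```python
from datetime import date, timedelta


def latest_notification_date(maintenance_start: date, business_days: int = 5) -> date:
    """Walk back the given number of business days (Mon-Fri) from the start date."""
    current = maintenance_start
    remaining = business_days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # weekday() 0-4 means Monday to Friday
            remaining -= 1
    return current


# Hypothetical maintenance window starting Wednesday, 15 January 2014.
notify_by = latest_notification_date(date(2014, 1, 15))
print(notify_by)
```

Weekends are skipped, so the calendar gap is longer than five days whenever the window crosses a Saturday and Sunday.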
Transparent non-customer impacting service hygiene
More detailed information and programmatic approach around service updates and service incidents
Tenant Level Reporting
Service Health Dashboard Customer Preview Programs
Service Communication Panel Concept
Office 365 Community.
http://community.office365.com/en-us/preview/tools/troubleshooting.aspx
http://community.office365.com/en-us/preview/wikis/diagnostic_tools/2146.aspx#smallbusinesses
http://community.office365.com/en-us/preview/wikis/diagnostic_tools/2146.aspx#enterprises
https://outlook.com/owa
https://<domain>.sharepoint.com/<pagename>.aspx
https://<domain>.sharepoint.com/personal/<UserAlias>_<domain>/Documents/Forms/All.aspx
Web browser
Office client
Operating system
Support timeline: Mainstream Support 2013 to 2018, Extended Support 2018 to 2023
Service health summary with quick access to detailed dashboard
Simplified navigation bar with quick access to all workloads
Reports on service usage and performance
Manage mailboxes, groups and objects
Search for properties
Conduct an advanced search
Manage roles and permissions
Create policies
Track message delivery
Who
Where
What
Systems administrator All
Edit contact details
Manage groups
Manage voice mail and phone settings
The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.
Longer outages have a greater impact on the percentage.
Outages that affect a greater number of users have a greater impact.
More severe outages, in terms of users or duration, lead to greater deviations from 100%, which can be used to determine service credits as a remedy.
The Office 365 service level agreement expresses uptime in this way:
\[
\frac{\text{User minutes} - \text{Downtime}}{\text{User minutes}} \times 100\%
\]
The aggregate uptime of service components can be expressed similarly.
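As a worked illustration of the uptime expression, the sketch below computes the percentage from user minutes and downtime. The `Outage` record, the function name `uptime_percent`, and all numbers are hypothetical. Downtime is assumed to be affected users multiplied by outage minutes, which is consistent with longer and wider outages weighing more but is not quoted from the SLA text.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    affected_users: int  # hypothetical: users impacted by the outage
    minutes: int         # hypothetical: outage duration in minutes


def uptime_percent(total_users: int, period_minutes: int, outages: list[Outage]) -> float:
    """Return (user minutes - downtime) / user minutes * 100."""
    user_minutes = total_users * period_minutes
    if user_minutes <= 0:
        raise ValueError("user minutes must be positive")
    # Assumption: downtime is measured in user minutes (users x duration).
    downtime = sum(o.affected_users * o.minutes for o in outages)
    return (user_minutes - downtime) / user_minutes * 100.0


# Illustrative numbers only: 1,000 users over a 30-day month.
month_minutes = 30 * 24 * 60
short_narrow = uptime_percent(1000, month_minutes, [Outage(100, 30)])
long_wide = uptime_percent(1000, month_minutes, [Outage(1000, 120)])
print(f"{short_narrow:.4f}% vs {long_wide:.4f}%")
```

With these made-up figures, a 30-minute outage for 100 users leaves the month above 99.99%, while a 2-hour outage for all users pulls it below 99.9%.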
Hardware or software failures
Monitoring alerts
Service incidents
Customer reported incidents
SPO
EXO
Microsoft Online ID
Office 365 Portal
Office 365 Provisioning
Lync
Avoid unnecessary assumptions by on-call engineers
Isolate issues to root cause
• Executed from two+ locations to ensure accuracy and redundancy
• Simulates full end user and system transactions
• Supports every major system and user scenario
• Failures at any point are turned into alerts and escalated to engineers
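The outside-in checks described above can be sketched as a loop that runs each simulated transaction from each location and turns any failure into an alert. This is only an illustration under stated assumptions: the function `run_synthetic_checks`, the location names, and the scenario names are hypothetical, and the probes are stand-in functions rather than real sign-in or mail transactions.

```python
from typing import Callable, Dict, List

# A probe runs one simulated end-user transaction and returns True on success.
Probe = Callable[[], bool]


def run_synthetic_checks(probes_by_location: Dict[str, Dict[str, Probe]]) -> List[str]:
    """Run every scenario from every location; return one alert per failure."""
    if len(probes_by_location) < 2:
        raise ValueError("run probes from at least two locations")
    alerts = []
    for location, scenarios in probes_by_location.items():
        for scenario, probe in scenarios.items():
            try:
                ok = probe()
            except Exception:
                ok = False  # a crashed probe counts as a failed transaction
            if not ok:
                alerts.append(f"ALERT {scenario} failed from {location}")
    return alerts


# Stand-in probes; real ones would sign in, send mail, open a document, etc.
probes = {
    "location-a": {"sign-in": lambda: True, "send-mail": lambda: True},
    "location-b": {"sign-in": lambda: True, "send-mail": lambda: False},
}
alerts = run_synthetic_checks(probes)
print(alerts)
```

Running the same scenario from more than one location helps separate a service failure from a problem local to a single probe site; in a real system the alert list would feed the on-call escalation path.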