Live CS1000 Failover Exercise April 12, 2012 Jessica Mosher School Employees Credit Union of Washington


Page 1: Live CS1000 Failover Exercise

Live CS1000 Failover Exercise

April 12, 2012

Jessica Mosher

School Employees Credit Union of Washington

Page 2: Live CS1000 Failover Exercise

Topics Covered

• Disaster Modeled in the Exercise
• Planning Tasks
• Set the Date & Obtain Executive Sign-off
• Develop Exercise Timeline & Materials
• Post-exercise
• Some Discoveries & Action Items

Page 3: Live CS1000 Failover Exercise

Disaster Modeled in The Exercise

Normal Operating Conditions at School Employees Credit Union of Washington

Page 4: Live CS1000 Failover Exercise

Disaster Modeled in the Exercise

Loss of Phone Services Only at the Seattle Office

Page 5: Live CS1000 Failover Exercise

Disaster Modeled in the Exercise

Failover Operating Conditions at School Employees Credit Union of Washington

Page 6: Live CS1000 Failover Exercise

Planning Tasks

• Start with a To-Do List that will be dynamically updated
• Define the Benefits of the Exercise & Share with Call Center Managers
  – Staff (trainers) will know exactly what incoming calls will be like for agents
  – We’ll learn what hotkeys, shortcuts & sub-queues will no longer work
  – Can our system at the failover site handle the volume?
  – How will staff change call handling without a queue display?
  – How will member calls change without automated telephone banking?
  – We will see if BTN re-routing to the failover site actually works
• Schedule a meeting with technical providers to determine their requirements. In our case:
  – Arrow S3: Nortel services provider
  – XO Communications: telecommunications provider
• Set Basic Expectations for the exercise
  – Redirect our main 206-, 509-, and toll-free numbers to our failover system in Spokane
  – Effect open/close hours on the Spokane PBX (normally handled in Contact Center)
  – Exercise our call overflow/closed automated message in the failover system
  – Recover to normal operations prior to the start of the next business day

Page 7: Live CS1000 Failover Exercise

Planning Tasks

• Meeting with Technical Players
  Agenda (shared with invitees in advance):
  1. Introductions
  2. Outline disaster scenario & expectations
  3. Explain School Employees Credit Union’s Disaster Policy
  4. Time to implement?
  5. Security requirements: individuals authorized to declare disaster and verify with Arrow S3 and XO Communications
  6. Recovery (fail back to normal operations) procedure
     1. How will we know main circuits are ready to receive calls?
     2. What information needs to be retained to restore Spokane’s original configuration?
     3. What coordination needs to occur between XO and Arrow S3?
  7. Next steps
• We held a few conference calls over the next few months to plan the BTN failover and recovery with XO and to define coordination points & condition tests

Page 8: Live CS1000 Failover Exercise

Set the Date & Obtain Executive Sign-off

• November 12, 2010 – a Friday
• Our phones would have an unannounced “extension” of customer service hours from 6:00 to 6:30pm, starting 30 minutes after our published 5:30pm closing time
• Participants:
  – School Employees Credit Union IT Resources
  – XO Resources & Arrow S3 Resources (one at each location)
  – Member Services Manager
  – Phone staffing requirements:
    • One agent in each office and a telecommuter. At least one on [our back office queue].
    • At least one Supervisor
    • Must be able to handle “cold” calls
  – Must have a cellphone to use during the exercise
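The staffing requirements above lend themselves to a quick roster check before the exercise date. The following is a minimal Python sketch, not something taken from the presentation: the `Volunteer` record, the `roster_gaps` helper, the placeholder volunteer names, and the assumption that "each office" means Seattle, Lynnwood, and Spokane (the three offices named in the exercise timeline) are all illustrative.

```python
from dataclasses import dataclass

# Assumption: "each office" means the three offices named in the timeline.
OFFICES = {"Seattle", "Lynnwood", "Spokane"}


@dataclass
class Volunteer:
    name: str                        # placeholder names only
    location: str                    # an office name or "Telecommuter"
    role: str                        # "agent" or "supervisor"
    back_office_queue: bool = False
    has_cellphone: bool = True


def roster_gaps(roster):
    """Return a list of staffing requirements the roster does not meet."""
    gaps = []
    agent_locations = {v.location for v in roster if v.role == "agent"}
    # One agent in each office, plus a telecommuter.
    for place in sorted(OFFICES | {"Telecommuter"}):
        if place not in agent_locations:
            gaps.append(f"no agent for {place}")
    # At least one person on the back office queue.
    if not any(v.back_office_queue for v in roster):
        gaps.append("no one on the back office queue")
    # At least one Supervisor.
    if not any(v.role == "supervisor" for v in roster):
        gaps.append("no supervisor")
    # Everyone needs a cellphone during the exercise.
    gaps += [f"{v.name} has no cellphone" for v in roster if not v.has_cellphone]
    return gaps


# Hypothetical roster used only to exercise the check.
roster = [
    Volunteer("Agent A", "Seattle", "agent"),
    Volunteer("Agent B", "Lynnwood", "agent"),
    Volunteer("Agent C", "Spokane", "agent"),
    Volunteer("Agent D", "Telecommuter", "agent", back_office_queue=True),
    Volunteer("Supervisor E", "Seattle", "supervisor"),
]

print(roster_gaps(roster))      # full roster: no gaps reported
print(roster_gaps(roster[:4]))  # supervisor removed: one gap reported
```

The "cold call" requirement is a judgment about each volunteer's experience, so it is left out of the automated check.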

Page 9: Live CS1000 Failover Exercise

Develop Exercise Timeline & Materials

• Questions asked frequently while planning:
  – Then what happens?
  – How long will that take?
  – Is this a checkpoint?
• Information to Include in Planning:
  – “Need to Know” list, Tester Contact Info, Tests
  – Customer Service Rep (CSR) expectations & scripts:
    • "We're sorry, we wouldn't normally do this, but they are going to work on the phone system shortly. Is there a number I can call you back on Monday?"
    • "We can't review credit reports right now."
    • "Wires are finished at 4:00pm."
    • "If I lose you, here's the number…." (transferring call)
    • "We wouldn't normally be here to help you, but we're working on the phone system. I can help you with simple things right now, otherwise let me take down your name and number and we'll call you back on Monday."

Page 10: Live CS1000 Failover Exercise

Develop Exercise Timeline & Materials

Our Exercise Timeline (last saved date: 03-Apr-2012 05:52PM):

| Start Time | End Time | Task | Questions |
|---|---|---|---|
| In advance of test day | | Schedule a Closed-hours failover exercise | Done. Six weeks in advance. |
| In advance of test day | | Bring the Spokane MIRAN card online | Done. |
| In advance of test day | | Arrow S3 tests off-duty ACD phones to the Spokane PBX | Done. |
| In advance of test day | | Recruit Phone CSR volunteers | Done. |
| In advance of test day | | Volunteer preparation meeting (expectations) | Done. |
| In advance of test day | | Switch failover ticket opened with XO Communications | Working with Annette & Stacy at XO |
| November 10th | | Put Member Alert on our website for Telephone Banking outage | Done. |
| Test Day | | Phone in Disregard to Arrow S3 NOC - 800-XXX-XXXX; Option (5) | Done. |
| Test Day | | Log disregard to the SECUWA NOC for the three telecom servers | Done. |
| 17:00 | 17:35 | Ensure all phones are logged off (no "Not Ready" statuses) | |
| 17:35 | 17:40 | Invoke a switch "disaster" in Seattle | |

Page 11: Live CS1000 Failover Exercise

Develop Exercise Timeline & Materials

| Start Time | End Time | Task | Questions |
|---|---|---|---|
| | | Arrow S3 brings the switch offline: | |
| 17:38 | 17:40 | Arrow S3 disables signalling servers in Seattle | |
| 17:50 | 17:52 | Phones reset to Spokane signalling server | |
| 17:52 | 17:54 | Make a test-911 outbound call from Seattle line: Test911script.doc | |
| 17:52 | 17:54 | Make a test-911 outbound call from Spokane phone: Test911script.doc | [ Spokane volunteer name ] to carry out. |
| 17:52 | 17:54 | Make a test-911 outbound call from Lynnwood fax line: Test911script.doc | [ Lynnwood volunteer name ] to carry out. |
| | | XO Communications re-routes our 4010 BTN: | |
| 17:50 | | 1-888-628-4010 & 206-628-4010 re-routed to Spokane circuits | 509-XXX-XXXX is the Emergency Forward Number |
| 17:50 | 17:51 | Telecommuter enters new Server IP address in 2050 soft phone: howto-DRSoftPhone.doc | [ Telecommuter volunteer ] will carry out. |
| 17:55 | 18:00 | Check screen on IP 2500 telecommuter phone | |
| 18:00 | 18:30 | Agents log into phones and take calls ("open") | How many queue calls should we instigate? |
| 18:00 | 18:10 | [ Non main queue ACD ] phone is logged in and placed into "Not Ready" | |
| 18:00 | 18:15 | Execute testing: TestSteps-PhoneSystem.xls | |
| 18:15 | 18:30 | Test four-digit queue numbers (see list), both staffed and unstaffed | |
| 18:30 | 18:40 | Test "closing" phones: agents log out, then call into 4010 & hear closed recording | |
| 18:40 | | Invoke a switch "recovery" in Seattle | |

Page 12: Live CS1000 Failover Exercise

Develop Exercise Timeline & Materials

| Start Time | End Time | Task | Questions |
|---|---|---|---|
| | | Arrow S3 brings the switch online: | |
| 18:40 | 18:55 | Arrow S3 boots up [ Contact Center Manager ] | |
| 18:55 | 19:00 | Arrow S3 boots up [ CallPilot card ] | |
| 18:55 | 19:00 | Arrow S3 re-enables Signalling Servers in Seattle | |
| | | XO Communications re-routes our 4010 BTN: | |
| 19:00 | 19:05 | Phones in Lynnwood & Seattle reset to Seattle | |
| 19:05 | 19:15 | Test phones in Spokane rebooted to detect Seattle | |
| 19:15 | 19:20 | Telecommuter restores Seattle Server IP address in 2050 soft phone | |
| 19:30 | 19:30 | Restoration of test phones to Seattle confirmed | |
| 19:30 | 19:45 | Execute testing TestSteps-PhoneSystem.xls on a Seattle phone | |
| 19:30 | 19:45 | Execute testing TestSteps-PhoneSystem.xls on a Lynnwood phone | |
| 19:30 | 19:45 | Telecommuter tests from TestSteps-PhoneSystem.xls | |
| 19:30 | 19:45 | Execute testing TestSteps-PhoneSystem.xls in Spokane | |
| 19:50 | 20:00 | Reboot the Spokane network switch to force a phone reset | If needed |
| 19:50 | 20:00 | Reboot the Lynnwood switch to force a phone reset | If needed |
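A timeline like this one can be sanity-checked mechanically, which helps answer the planning questions "How long will that take?" and "Then what happens?". The sketch below is illustrative only: the `Step` record and the `check_timeline` and `window_minutes` helpers are hypothetical names, and the sample rows are a subset of the recovery steps with times copied from the timeline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Step:
    start: str  # "HH:MM", 24-hour clock, zero-padded
    end: str
    task: str


def minutes(step):
    """Length of a step in whole minutes (negative if end precedes start)."""
    fmt = "%H:%M"
    delta = datetime.strptime(step.end, fmt) - datetime.strptime(step.start, fmt)
    return int(delta.total_seconds() // 60)


def check_timeline(steps):
    """Flag steps that end before they start or are listed out of order."""
    problems = []
    previous_start = None
    for step in steps:
        if minutes(step) < 0:
            problems.append(f"ends before it starts: {step.task}")
        # Zero-padded "HH:MM" strings sort in chronological order.
        if previous_start is not None and step.start < previous_start:
            problems.append(f"listed out of order: {step.task}")
        previous_start = step.start
    return problems


def window_minutes(steps):
    """Minutes from the earliest start to the latest end."""
    first = min(s.start for s in steps)
    last = max(s.end for s in steps)
    return minutes(Step(first, last, "window"))


# A subset of the recovery steps, with times taken from the timeline.
steps = [
    Step("18:40", "18:55", "Arrow S3 boots up Contact Center Manager"),
    Step("18:55", "19:00", "Arrow S3 boots up CallPilot card"),
    Step("18:55", "19:00", "Arrow S3 re-enables Signalling Servers in Seattle"),
    Step("19:00", "19:05", "Phones in Lynnwood & Seattle reset to Seattle"),
    Step("19:05", "19:15", "Test phones in Spokane rebooted to detect Seattle"),
    Step("19:15", "19:20", "Telecommuter restores Seattle Server IP address"),
    Step("19:30", "19:30", "Restoration of test phones to Seattle confirmed"),
]

print(check_timeline(steps))   # an empty list means no problems were found
print(window_minutes(steps))   # length of this part of the recovery, in minutes
```

Rows without an end time, such as the "Invoke" steps, would need a placeholder end time or a separate checkpoint type before they fit this structure.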

Page 13: Live CS1000 Failover Exercise

Post-Exercise

• Supervisors & Volunteers provided with checklist reference materials – added in Business Continuity plan materials in their final form

• IT team members phoned in test calls during failover window to ensure call routing was exercised

• CSRs kept a call activity & “hurdle” log
  – Four real member calls (one successfully routed to back office queue)
  – One wrong number
  – Wrote down member callback information for the next business day in the absence of voicemail

Page 14: Live CS1000 Failover Exercise

Post-Exercise

• New lists: For Follow-up, Staff Feedback
• Final Report titled “Lessons Learned”, which listed points only under these headings:
  – What did we do well?
  – What went as expected?
  – What did not occur as expected?
  – What would we do differently?
  – References:
  – Materials Created for Future use:

Page 15: Live CS1000 Failover Exercise

Some Discoveries & Action Items

• Follow-up items:
  – 911 dispatch thought we were at an address five miles from the Spokane office (1-509-XXX-XXXX) and a different business
  – Spokane phones did not fail over properly, as they connect to a different Node.
  – The phone clocks were slow from Spokane. We need to connect the PBXes to the NTP service.
  – Everyone received incoming calls, regardless of which queue they had built into their phone.
  – Some phones did not fail over to Spokane.
  – Calls could not be transferred to unmonitored queues.
  – Phones were logged in unless they were "MakeSetBusy."
  – XO's engineers were not sure what a "Successful failover" state sounds like in test calls
  – Back office telecommuter's soft phone came up as a main queue phone from Spokane.
  – The PBX had difficulties picking up calls after we failed back.
  – All phones came up "MakeSetBusy." (a good thing)
  – Arrow S3 had difficulty "resetting" phones so they would find the Seattle PBX on recovery.
  – Agent Greeting did not come up on Monday.
  – The Lynnwood ERL does not exist in the Spokane PBX
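For the slow-clock finding, it can help to put a number on how far a device clock is from a reference before and after the PBXes are connected to NTP. The sketch below uses the standard NTP clock-offset and round-trip-delay formulas on the four timestamps of one request/response exchange. The function name and the 90-second figure are hypothetical; the presentation does not say how slow the clocks were or how they were measured.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and round-trip delay from one NTP exchange.

    t1: request sent      (client clock, seconds)
    t2: request received  (server clock, seconds)
    t3: reply sent        (server clock, seconds)
    t4: reply received    (client clock, seconds)

    A positive offset means the client clock is behind the server.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Hypothetical exchange: the client clock runs 90 s slow, the network takes
# 0.02 s each way, and the server spends 0.01 s handling the request.
offset, delay = ntp_offset_and_delay(
    t1=1000.00,   # client sends (client clock)
    t2=1090.02,   # server receives (server clock)
    t3=1090.03,   # server replies (server clock)
    t4=1000.05,   # client receives (client clock)
)
print(f"offset ~ {offset:.2f} s, round-trip delay ~ {delay:.2f} s")
```

The offset estimate assumes the network delay is roughly the same in both directions, which is the same assumption NTP itself makes.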

Page 16: Live CS1000 Failover Exercise

Some Discoveries & Action Items

From the final report (Business & Member perspective):

• What did not occur as expected?
  – The Lynnwood tester’s phone was not built correctly in the Spokane PBX & didn’t come up
  – Calls routed to any available agent or Supervisor; it was expected that only [ main queue ] phones would receive incoming calls
  – Phone system behavior when a queue was unavailable was different than expected
  – Calls did not “ring” on Agent phones; they connected with just a buzz
  – Calls could not be transferred or conferenced to unstaffed queues
  – Agents were not trained on how to reboot phones without disconnecting the powered Ethernet cord

Page 17: Live CS1000 Failover Exercise

Final Thoughts

• Start small: we proved our failover system would take live calls with a minimum of risk to our equipment
• Plan early, especially if there is customer exposure
• Prepare phone staff with polished, well-written materials & a meeting, including possible problems they may experience. If IT/Telecom seems prepared, they will feel prepared
• Mitigate live monitoring system alarms
• Test backup components that are normally “dormant” to the fullest extent possible
• Detail the recovery to normal operations as much as the failover
• Build documentation time into the project & repurpose the documentation into the overall Business Continuity Plan

Questions?