GlobalNOC Services Update 2015 Internet2 Global Summit


Annual Report

๏ http://globalnoc.iu.edu/annual-report/2014/

4/28/15

Service Desk

๏ Welcomed ARE-ON and OSHEAN to the GlobalNOC Family

๏ All I2 FootPrints Projects Consolidated Into 1 = 1/5 of the Former Notifications

๏ Grown by 4 Staff and 1 Robot

April 28, 2015

Year in Review:

Service Desk

๏ Conducted DR Exercise in Early December 2014 with Positive Result

๏ Created and Implemented a Major Incident Communication Policy


Year in Review:

Service Desk

Activity Metrics for 2014

  • 1.9 million alarms/year ~ 5200/day
  • 30,000 tickets created/year ~ 82/day
  • 15,600 phone calls received/year ~ 43/day
  • 264,000 e-mails sent and received ~ 720/day
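The per-day figures above follow from dividing each annual total by 365. A minimal sketch of that arithmetic, assuming a 365-day year (the dictionary keys and the `per_day` helper are illustrative names, not part of any GlobalNOC tool):

```python
# Annual totals as stated on the slide (2014).
ANNUAL_TOTALS = {
    "alarms": 1_900_000,
    "tickets": 30_000,
    "phone_calls": 15_600,
    "emails": 264_000,
}


def per_day(annual_total: int, days_in_year: int = 365) -> float:
    """Average daily volume for an annual total."""
    return annual_total / days_in_year


for name, total in ANNUAL_TOTALS.items():
    print(f"{name}: about {per_day(total):.0f}/day")
```

The slide rounds the results for readability, so small differences from the exact quotients are expected.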


Service Desk

๏ Pursuing ISO/IEC 20000 certification
  • Why?
  • By When?
  • What Will the Net Effect Be?

Year Ahead:

2015 Priorities

2015 Focus Areas

Automation

Goal

๏ Find the worst things to do by hand. Make a machine do those things.
๏ Things that are:
  • Dangerous
  • Slow
  • Annoying

Focus Areas

๏ Business Processes
  • on-call button
  • auto-assign issues
  • auto-notify
  • auto-discover devices in a new network
๏ Reporting
  • How many times did we call an engineer?
๏ Config automation
  • alerting on config drift
  • generate template config for new boxes
  • push & pipeline
๏ Incident Advisor
  • auto-fix
  • hints
  • Annoying
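As an illustration of the "alerting on config drift" item, here is a minimal sketch that compares an intended configuration with the running one using Python's standard `difflib`. The function name `config_drift` and the sample configuration text are assumptions made for the example; the slides do not describe how GlobalNOC implements this.

```python
import difflib


def config_drift(intended: str, running: str) -> list[str]:
    """Return unified-diff lines between two configs; empty list if identical."""
    diff = difflib.unified_diff(
        intended.splitlines(),
        running.splitlines(),
        fromfile="intended",
        tofile="running",
        lineterm="",
    )
    return list(diff)


# Hypothetical configs that differ in one line.
intended = "hostname rtr1\ninterface xe-0/0/0\n mtu 9000\n"
running = "hostname rtr1\ninterface xe-0/0/0\n mtu 1500\n"

drift = config_drift(intended, running)
if drift:
    print("ALERT: config drift detected")
    print("\n".join(drift))
```

A real deployment would fetch the running config from the device and route the alert into the monitoring system rather than printing it.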

Service Management

Goal

๏ MINIMIZE
  • unplanned work
  • confusion
  • inconsistency
๏ Stay flexible, agile, and custom

Huh?

๏ STANDARDIZE: for processes where consistency is most important
๏ ORGANIZE: a simple lightweight structure where custom and novel work happens

2 Parts

๏ Part 1: ISO/IEC 20000 Certification
  • Sparked by Internet2 effort, working to reach certification
  • Aligned with ITIL
  • Incident Management
  • Change Management
  • Capacity Management
  • Availability Management
  • etc…

2 Parts

๏ Part 2: Other service-level improvements
  • Service Dashboard (end users, network owners)
  • Prioritize improvements
  • Faster Turn-up
  • Change Management

So what…

๏ It’s not good enough anymore to talk about boxes and circuits. Everything is more complicated now.

๏ We don’t deliver networks, we deliver services
๏ Requires rigor to make sure those services work, and agility to make sure those services evolve quickly

example

๏ What’s the availability of everyone’s IP Service for Internet2?
๏ complexities:
  • multiple sessions
  • connectors back each other up
๏ Let’s define available!
๏ First, a service is down if packets have to be retransmitted
๏ So:
  • Up = ALL BGP sessions are established, no loss known
  • At Risk = At least 1 session is down, but at least one route is still in the routing table
  • Down = no routes
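The three states defined above can be sketched as a small classifier. This is an illustration only: the names `ServiceState` and `classify` are hypothetical, and treating known loss as Down is an assumption drawn from the statement that a service is down if packets have to be retransmitted.

```python
from enum import Enum


class ServiceState(Enum):
    UP = "Up"
    AT_RISK = "At Risk"
    DOWN = "Down"


def classify(sessions_established: list[bool], routes_in_table: int,
             loss_known: bool = False) -> ServiceState:
    """Classify a connector's IP service from its BGP sessions and routes."""
    # Down = no routes (also covers a service with no sessions at all).
    if routes_in_table == 0 or not sessions_established:
        return ServiceState.DOWN
    # Assumption: known packet loss counts as Down.
    if loss_known:
        return ServiceState.DOWN
    # Up = ALL BGP sessions are established, no loss known.
    if all(sessions_established):
        return ServiceState.UP
    # At Risk = at least one session down, but routes are still present.
    return ServiceState.AT_RISK


print(classify([True, True], routes_in_table=120).value)
print(classify([True, False], routes_in_table=60).value)
print(classify([False, False], routes_in_table=0).value)
```

Evaluating this per connector at each reporting interval gives the per-state time totals that an availability figure can be computed from.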

Data Model

[Diagram: data model. Box labels: Entity; Routed R&E Service; BGP Peering (two boxes); ASN; Peer IP; Reporting Engine; BGP Routing Data; Routes; Peer State; Weekly Report; SLA]

Service Awareness

Corresponding process

[Flowchart: corresponding process. Labels: report generated; SLA met?; send to NPT; outage in GRNOC control?; recommend changes; Recommended Changes; Approve Changes?; Published Report; Published Report with Outline of Changes; NTP; Dir of Op; Sys; Network Owner. Decision branches are labeled yes/no.]

Work Management

Goal

๏ Get a coherent system to manage our work
  • systems
  • tools
  • disciplines
  • processes

๏ In other words, track, prioritize, and measure everything we do.

This means

๏ For the people who do work:
  • "Where do I go to see everything I'm supposed to be doing? What should I be doing first?"
๏ For the managers:
  • "Are we too busy? Are we working on the right things?"
๏ For the strategic view:
  • "Are we doing well/better than a year ago?"

How does work get tracked

๏ Tickets
๏ Emails
๏ Post-its
๏ Workflow records
๏ Meeting docs
๏ Many todo lists

The future

๏ Review ticketing
๏ Look at structured processes
๏ Project management
๏ Unified view of workload and results

Recruiting

Goal

๏Make sure we have enough talented people…now and 5 years from now

Parts

๏ Attract & hire
๏ Pipeline
  • Get more students in
  • Improve Development

Attracting

๏ How do we attract experts that fit?
๏ Challenges
  • Scary job descriptions
  • People don’t know what R&E or GlobalNOC does
  • Indiana - No really, it’s a nice place!

Pipeline

๏ Getting people into the pipeline
  • Students have worked very well
  • Summer of Networking
  • How do we get more?
๏ Keeping the talent growing
  • Develop people well
  • Level up!

What’s New With GlobalNOC Software?

SNAPP

๏ High performance SNMP measurement/visualization tool
๏ 3 major revisions, project began in 2002
๏ RRDtool based storage
๏ High performance SNMP data collector
๏ Web-based data browser and Web-services API

SNAPP 4 with TSDS

๏ Moving from RRDtool to a non-relational database
  • “TSDS” Database based on MongoDB
  • Sophisticated query language: TSQL
  • Rich meta-data integrated with data. Allows for powerful queries; long-term longitudinal analysis
๏ General Time Series Data Store, not just SNMP data
  • Ex. NOC activity metrics / key performance indicators; optical characteristics (light levels, loss, etc.); environmental/power data; aggregate flow data; OWAMP; BWCTL
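To illustrate why storing metadata alongside measurements allows richer queries, here is a small in-memory sketch. It does not use TSDS, TSQL, or MongoDB; the point layout, field names, and the `query` helper are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Each point carries its metadata with the measurement (hypothetical layout).
points = [
    {"ts": 0, "value": 10.0, "meta": {"node": "rtr1", "intf": "xe-0/0/0"}},
    {"ts": 60, "value": 20.0, "meta": {"node": "rtr1", "intf": "xe-0/0/0"}},
    {"ts": 300, "value": 30.0, "meta": {"node": "rtr1", "intf": "xe-0/0/0"}},
    {"ts": 0, "value": 99.0, "meta": {"node": "rtr2", "intf": "xe-1/0/0"}},
]


def query(points: list[dict], where: dict, bucket_seconds: int) -> dict[int, float]:
    """Average value per time bucket for points whose metadata matches `where`."""
    buckets = defaultdict(list)
    for p in points:
        if all(p["meta"].get(k) == v for k, v in where.items()):
            bucket_start = p["ts"] - p["ts"] % bucket_seconds
            buckets[bucket_start].append(p["value"])
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


print(query(points, where={"node": "rtr1"}, bucket_seconds=300))
```

Because the filter works on arbitrary metadata keys, the same query shape applies to SNMP counters, optical light levels, or activity metrics.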

Alertmon Improvements

๏ Alert Collapsing
  • Collapse services on a host when host is not reachable
  • Root cause analysis based on dependency graph allows for intelligent collapsing of alerts and suggests root cause of multiple alerts.
  • Monitoring of management VPN endpoints to collapse alerts behind VPN when management network access is impaired
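The dependency-graph collapsing described above can be sketched as follows. This is not Alertmon's implementation; the function names, the `depends_on` mapping, and the sample node names are hypothetical. An alert is treated as a root cause when nothing it depends on (directly or transitively) is also alerting, and every other alert is collapsed under its root cause.

```python
def alerting_ancestors(node: str, depends_on: dict[str, list[str]],
                       alerting: set[str]) -> set[str]:
    """All alerting nodes that `node` depends on, directly or transitively."""
    seen, found = set(), set()
    stack = list(depends_on.get(node, []))
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent in alerting:
            found.add(parent)
        stack.extend(depends_on.get(parent, []))
    return found


def collapse_alerts(alerting: set[str],
                    depends_on: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each suggested root cause to the alerts collapsed beneath it."""
    ancestors = {a: alerting_ancestors(a, depends_on, alerting) for a in alerting}
    roots = {a for a, anc in ancestors.items() if not anc}
    collapsed = {root: [] for root in sorted(roots)}
    for alert, anc in ancestors.items():
        for root in anc & roots:
            collapsed[root].append(alert)
    return {root: sorted(children) for root, children in collapsed.items()}


# Services depend on hosts; hosts sit behind a management VPN endpoint.
depends_on = {
    "rtr1:bgp": ["rtr1"], "rtr1:snmp": ["rtr1"], "rtr1": ["vpn-a"],
    "rtr2:snmp": ["rtr2"], "rtr2": ["vpn-a"],
}
print(collapse_alerts({"rtr1", "rtr1:bgp", "rtr1:snmp", "rtr2:snmp"}, depends_on))
```

When the VPN endpoint itself is alerting, the same function collapses every host and service alert behind it under that single endpoint.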

NOAA Operations Portal

๏ High-level overview of network status
  • Operational Status Map
  • Performance Measurement Overview
  • Operations Calendars
  • Detailed data pulled from other GlobalNOC tools

๏ Multi-network aggregate views


SciPass Science DMZ

๏ Campus Networks are enterprise infrastructure
  • large number of small flows
  • security is a required capability
๏ not elephant flow friendly
๏ could just bypass but that doesn’t provide required security
๏ what about performance assurance?

Approach

๏ Combine
  • OpenFlow Switch
  • Bro
  • PerfSonar
๏ create reactive system
๏ default to secure / slow path
๏ use IDS to control what goes on fast path

Reactive Bypass Performance

  • 64 ms - time to detect and bypass
  • 250 ms - doubled throughput of firewall
  • 1.5 sec - same throughput as no firewall
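The reactive approach (default to the secure/slow path, let the IDS decide what moves to the fast path) can be sketched as a small state machine. This is not SciPass code and uses no OpenFlow or Bro APIs; the `Flow` and `ReactiveBypass` classes and their methods are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    dst_port: int


@dataclass
class ReactiveBypass:
    # Flows the IDS has approved for the fast (firewall-bypass) path.
    fast_path: set[Flow] = field(default_factory=set)

    def path_for(self, flow: Flow) -> str:
        """Unknown flows default to the secure/slow path."""
        return "fast" if flow in self.fast_path else "slow"

    def on_ids_verdict(self, flow: Flow, trusted: bool) -> None:
        """Install or remove a bypass based on the IDS verdict."""
        if trusted:
            self.fast_path.add(flow)
        else:
            self.fast_path.discard(flow)


controller = ReactiveBypass()
transfer = Flow("192.0.2.10", "198.51.100.20", 2811)
print(controller.path_for(transfer))   # before any IDS verdict
controller.on_ids_verdict(transfer, trusted=True)
print(controller.path_for(transfer))   # after the IDS approves the flow
```

In a real deployment the verdict handler would push a flow rule to the switch, which is where the detect-and-bypass latency reported above would be incurred.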

Find Out More

๏ Software Page
  • https://globalnoc.iu.edu/sdn/scipass.html
๏ Code Repository
  • https://github.com/GlobalNOC/SciPass
๏ email
  • [email protected]
  • [email protected]

FlowSpace Firewall

๏ Developed in partnership with Internet2
๏ Open Source Software
๏ OpenFlow Hypervisor
  • “Slice” OpenFlow 1.0 based on VLAN ID
๏ Currently running on Internet2 AL2S
๏ Other deployments growing. We’re interested in helping get FlowSpace Firewall running on your OpenFlow network
๏ More Information/Download: http://globalnoc.iu.edu/sdn/fsfw.html/