Richard van der Ven, Function Cluster Architect, Litho Computing Platform. 21-11-2017. Alert & Health Monitoring: A Splunk and ITSI implementation. Public.

SplunkLive! Utrecht 2017 - ASML Customer Presentation


Richard van der Ven

21-11-2017

Alert & Health Monitoring: A Splunk and ITSI implementation

Public

Function Cluster Architect Litho Computing Platform

21 Nov 2017

Slide 2

Public

• Who am I?

• Environment

• Alert & Health Monitoring

• Wrap-up

Slide 3

Who am I?

Worked at ASML for 16 years

• 13 years - IT Infrastructure

• DBA, Storage, ITIL processes

• IT Management

• 3 years - Functional Cluster Architect

• Litho Computing Platform

• Alert & Health Monitoring

Richard van der Ven

Slide 4

ASML makes the machines for making chips

• Lithography is the critical tool for producing chips

• All of the world’s top chip makers are our customers

• 2016 sales: €6.8 bln

• More than 17,000 employees (FTE) worldwide

Slide 5

A global presence

3,900 employees

Source: ASML Q1 2017

Offices in over 60 cities in 16 countries worldwide

9,600 employees 3,600 employees

Slide 6

A tightly integrated set of solutions for scaling and yield

Image

Compute/SW

Measure

Slide 7

Litho Computing Platform

• A cloud infra stack, called the Litho Computing Platform, designed for high availability and scalability

• Virtual machines are abstracted from the hardware

HW may change or break, virtual machines stay up: highly available

• It centralizes all applications in one place

• It can serve 40 Scanners & 50 Yieldstars

• It runs in a dark site at ASML customers

An extendable HW platform that scales with application needs

Slide 8

Availability is key

Availability %            Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")          36.5 days           72 hours              16.8 hours
95%                       18.25 days          36 hours              8.4 hours
97%                       10.96 days          21.6 hours            5.04 hours
98%                       7.30 days           14.4 hours            3.36 hours
99% ("two nines")         3.65 days           7.20 hours            1.68 hours
99.5%                     1.83 days           3.60 hours            50.4 minutes
99.8%                     17.52 hours         86.23 minutes         20.16 minutes
99.90% ("three nines")    8.76 hours          43.2 minutes          10.1 minutes
99.95%                    4.38 hours          21.56 minutes         5.04 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds

NOTE: This is availability of the functionality that we sell as perceived by the customer, thus Infra + HW + Virtualization layer + Application + Connectivity
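The downtime figures in the table follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year, a 30-day month and a 7-day week (the function name `downtime_seconds` and the period table are illustrative, not from the presentation):

```python
# Period lengths in seconds. The 30-day month is an assumption; it matches
# most rows of the table (e.g. 90% -> 72 hours per month).
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def downtime_seconds(availability_pct: float, period: str) -> float:
    """Return the allowed downtime in seconds for one period at the given availability."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * PERIOD_SECONDS[period]


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year_h = downtime_seconds(pct, "year") / 3600
        per_month_min = downtime_seconds(pct, "month") / 60
        per_week_min = downtime_seconds(pct, "week") / 60
        print(f"{pct}%: {per_year_h:.2f} h/year, "
              f"{per_month_min:.2f} min/month, {per_week_min:.2f} min/week")
```

Under these assumptions the monthly values for 99.8% and 99.95% come out as 86.4 and 21.6 minutes, slightly different from the 86.23 and 21.56 minutes shown in the table.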

Slide 9

Some history

2011

• 1st LCP in the field

• Started off with monitoring infrastructure components with Nagios

• Supported by PHP development

• Changing knowledge experts on a custom-built setup

End of 2015

• The need for improved monitoring and local analysis came up after some situations where:

• Engineers didn’t notice application components failing

• It took a long time to get requested log files via customer approval

• It required several iterations to get the log files needed

Timeline

Slide 10

Alert & Health Monitoring

• Avoid unplanned downtime

• Reduce planned maintenance times

• A smart and robust monitoring solution platform to enable live monitoring

AHM product will enable CS engineers to

• Identify if LCP operation is at risk

• Diagnose root cause of incidents

• Perform pro-active maintenance

• Plan capacity

• Verify configuration state

Why Alert and Health Monitoring?

Slide 11

Alert & Health Monitoring

Alerting

• Alert when KPI over threshold

Monitoring / quick troubleshooting

• Health monitoring: HW / SW / FW / environment health, including network infra, databases, OSes

• Configuration reporting: exact HW/SW/FW config and changes, including licenses and serial numbers

Analysis / debugging

• Timeline reconstruction: chronological list of major events and threshold alerts

• Diagnostics deep dive

• Data downloading

Key features
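Timeline reconstruction amounts to merging events from several sources into one chronological list. A minimal sketch, assuming each source already yields its events in time order; the `Event` type, the function name and the sample data are hypothetical and not taken from the AHM implementation:

```python
import heapq
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # e.g. a host or component name (hypothetical)
    message: str


def reconstruct_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge per-source event streams (each sorted by time) into one chronological list."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


if __name__ == "__main__":
    major_events = [
        Event(datetime(2017, 11, 21, 9, 0), "app", "service restarted"),
        Event(datetime(2017, 11, 21, 9, 30), "app", "service healthy"),
    ]
    threshold_alerts = [
        Event(datetime(2017, 11, 21, 8, 55), "host", "memory usage over threshold"),
    ]
    for event in reconstruct_timeline(major_events, threshold_alerts):
        print(event.timestamp.isoformat(), event.source, event.message)
```

`heapq.merge` only works when every input stream is already sorted; unsorted sources would need sorting first.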

Slide 12

Alert & Health Monitoring

Support flow & organization

[Diagram: support flow between AHM, the customer, ASML local equipment support and ASML GSC equipment support. Labels: App 1, App 2, Alert email, Monitoring, Status Report, Troubleshooting, VPN, Remote intervention (under virtual escort by customer), Action Plan.]

Slide 13

AHM High-level Architecture

[Architecture diagram. Data sources: Hardware, Virtualization, Operating Systems, Middleware, Litho apps. Data onboarding: AHM Data Collection Scripts, Forwarders. Data processing: Index, Search Head. Functions: Alerting; Monitoring / Quick troubleshooting; Analysis / Debugging. Other labels: Central Instance, Config Manager, AHM Configurator, Configuration, Metrics.]

Slide 14

Alert & Health Monitoring: Key figures

1x

165-239 KPIs

< 5GB daily

6-10

500GB

~221 sourcetypes

77 hosts

~2125 sources

> 20GB daily

> 25TB

> 50x

> 3000 hosts

Slide 15

Alert & Health Monitoring

• Lead time

• Importance of log files for monitoring

• What determines application availability

• Changing requirements from stakeholders

• Service model

• Implementation of ITSI

Challenges

Slide 16

Alert & Health Monitoring

• Service Model

• Not usable out of the box

• Generated with own tool

• UI: not usable

• ITSI Dashboard: not configurable to our needs

• Glass tables: static, where we need flexibility due to variable applications

• Event alerting

• Implementation of customer-specific thresholds

Challenges with Splunk core and ITSI

Slide 17

Alert & Health Monitoring

• Service Model

• Generated with own configuration tool

• ‘Manual’ regeneration at every change to the applications

• Using Mind Maps for discussions

• UI

• Dashboards built with tables and hyperlinks

• New drill-down feature looks promising

• Event alerting

• Aligning ITSI queries and core Splunk

• Implementation of customer-specific thresholds

How did we solve it?
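One way to picture customer-specific thresholds is a set of default KPI thresholds with per-customer overrides. The sketch below is an assumption for illustration only: the KPI names, customer names and values are hypothetical, and it is plain Python rather than Splunk or ITSI configuration.

```python
from typing import Dict, List

# Default alert thresholds per KPI (hypothetical names and values).
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "cpu_load_pct": 90.0,
    "disk_used_pct": 85.0,
}

# Per-customer overrides; KPIs not listed here fall back to the defaults.
CUSTOMER_OVERRIDES: Dict[str, Dict[str, float]] = {
    "customer_a": {"disk_used_pct": 75.0},
}


def thresholds_for(customer: str) -> Dict[str, float]:
    """Return the default thresholds with this customer's overrides applied."""
    merged = dict(DEFAULT_THRESHOLDS)
    merged.update(CUSTOMER_OVERRIDES.get(customer, {}))
    return merged


def kpis_over_threshold(customer: str, kpi_values: Dict[str, float]) -> List[str]:
    """Return the names of KPIs whose value exceeds the customer's threshold."""
    limits = thresholds_for(customer)
    return sorted(
        name for name, value in kpi_values.items()
        if name in limits and value > limits[name]
    )


if __name__ == "__main__":
    sample = {"cpu_load_pct": 50.0, "disk_used_pct": 80.0}
    print(kpis_over_threshold("customer_a", sample))
    print(kpis_over_threshold("customer_b", sample))
```

Keeping overrides separate from defaults means a new customer needs no configuration until one of its thresholds actually has to differ.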

Slide 18

Implementation Splunk and ITSI

Easy and clear drill down dashboards

Users are non-IT

Slide 19

Alert & Health Monitoring

• Easier access to log files, metrics and application data

• Less time spent on regular service checks

• Combine application and infra data

• Unforeseen side effects of changes diagnosed in the field and during internal testing

• More confidence in actual system state

• Memory leak issue spotted in field, before impact

Benefits

Slide 20

Alert & Health Monitoring: Wrap up