National Archives and Records Administration
Preventative, Incidental, and Routine Maintenance Document

Version 1.0
July 10, 2018

Prepared for:
National Archives and Records Administration II (NARA II)
College Park, MD

Prepared by:
1760 Old Meadow Road
McLean, VA 22102

This document contains proprietary information provided by DSA Company. Handle in accordance with proprietary and confidential restrictions.

Preventative, Incidental, and Routine Maintenance Document
July 10, 2018

Version 1.0

Acknowledgements:

DSA would like to acknowledge the significant technical contributions made by the following staff:

Dr. Urmi Majumder
Adil Latiwala
Bertina Battou
Raven Moore
Byrav Kadalur
Timothy Reynolds
Ephriam Toh
Gang Chen
Veluna Christopher
Fawad Shaikh

Record of Changes

Preventive, Incidental, and Routine Maintenance Document Change Log

Version   Date of Change   Changed by                     Summary of Changes
.05       06/20/2018       Urmi Majumder & Fawad Shaikh   Initial document
.06       06/25/2018       Fawad Shaikh                   Update the Template and Format of the document
.07       07/09/2018       Gang Chen                      Added information for NAC
1.0       07/10/2018       Kellye Sheehan                 Minor editing changes

DSA Internal Approvals

Name             Role              Date
Kellye Sheehan   Program Manager   7/10/2018

Modifications made to this plan since the last printing are as follows:

Table of Contents

1 INTRODUCTION
2 OVERVIEW FOR PREVENTIVE MAINTENANCE
3 SCOPE
4 DAS PRESENTATION TIER
    4.1 DAS SERVICE TIER
    4.2 DAS DATA TIER PLAN
    4.3 DAS INFRASTRUCTURE PLAN
    4.4 DAS PREVENTATIVE MAINTENANCE SCHEDULE
5 NAC SYSTEM PREVENTIVE MAINTENANCE
    5.1 NAC SYSTEM PRESENTATION TIER
        5.1.1 Daily Monitoring
        5.1.2 Weekly Monitoring
    5.2 NAC SERVICE TIER
        5.2.1 Daily Monitoring
        5.2.2 Weekly Reports
    5.3 NAC DATA TIER PLAN
    5.4 NAC INFRASTRUCTURE PLAN
    5.5 NAC PREVENTATIVE MAINTENANCE SCHEDULE

1 Introduction

This document contains, or points to, content describing the following material:

A. Preventive Maintenance
B. Incident Management
C. Routine Maintenance

Section A.

The National Archives and Records Administration (NARA) and Data Systems Analysts (DSA) have established and agreed upon a policy for how the contractor handles a ‘Preventive Maintenance Window’. This is a regular period of time that should be used when scheduling planned outages to services. The Preventive Maintenance Window is not intended to be used for emergency work, nor should it delay the responses needed to handle critical and significant incidents.

This document contains the material that pertains to the Preventive Maintenance Window.

Section B.

The content for Incident Management already exists in a separate document called “Incident Management Plan”. The most recent version of this document was provided to the Government on 4/19/2018.

Section C.

Regarding Routine Maintenance documentation, DSA already provides separate weekly O&M Release reports describing its ongoing Routine Maintenance plans. The List of Known Defects is also reported weekly out of TFS as part of the Routine Maintenance documents.

In addition, DSA regularly provides a document called the Scheduled Maintenance Window Document, which also relates to Routine Maintenance. The most recent version of the Scheduled Maintenance Window Document was provided to the Government on 4/19/2018.

NARA DAS & NAC System Support Team by DSA, Inc. 1

2 Overview for Preventive Maintenance

DSA has developed a regularly scheduled preventive maintenance plan and schedule. The purpose of this plan is to ensure the continued operational health of both the DAS and NAC system environments. In addition, with the AWS server migration from PV to HVM completed, improvements in the reliability and stability of the systems and applications will be observed and reviewed progressively.

This maintenance plan includes high-level descriptions of the systems that will be monitored and maintained: the system presentation tier, service tier, data tier, infrastructure, and the Amazon Web Services (AWS) environment hosting the application components.

3 Scope

For the DAS system, the system presentation tier, service tier, and data tier environments will have two established checklists. The first set of checklists is executed nightly (including on all weekends) and will be automated going forward. The second set of checklists is for manual execution on the 1st and 3rd weekends of every month; it is more thorough and involves enhanced assessments. The DAS infrastructure tier will have one checklist, which will be performed on the 1st and 3rd weekends of every month.

For the NAC system, DSA will establish both daily monitoring processes and weekly reports to check the system presentation tier, service tier, data ingestion, and digital object storage periodically and automatically. With the retirement of the PRTG monitoring system, AWS CloudWatch has been implemented as its replacement to monitor hardware utilization and perform health checks at the system tier. Splunk will continue to collect system log information daily and generate weekly reports. Validation of data ingestion and the Lambda process will be implemented. Backups of the system and service logs will also be automated.

4 DAS Presentation Tier

1. Nightly Verifications
   a. Verify the continual operational status of the Apache web server application that hosts the DAS UI deployments
      i. This includes verifying the operating integrity of Apache as well as analysing Apache logs for any errors
   b. Verify the continual operational status of the Apache server infrastructure hosting the DAS UI
      i. This includes verifying the operating integrity of the Red Hat server that powers the Apache web server application as well as analysing the AWS instance for defects or scheduled maintenance.
2. 1st & 3rd Weekend Verifications
   a. Verify the continuity of the Apache web server’s Amazon Machine Image (AMI)
      i. This includes comparing the running copy of the server to the image backup on file
      ii. A new server image will be created if differences are found
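The nightly Apache log analysis in item 1.a.i could be automated along the following lines. This is a minimal sketch, assuming the Apache 2.4 default error log layout (`[time] [module:level] [pid ...] message`). The function name `count_apache_errors`, the set of levels treated as errors, and the sample log lines are illustrative assumptions, not part of the DAS checklist.

```python
import re
from collections import Counter

# Apache 2.4 default error log layout: "[time] [module:level] [pid N] ... message"
LEVEL_RE = re.compile(r"^\[[^\]]+\] \[(?:[\w.-]*):(\w+)\]")

# Assumption: these severities are the ones worth flagging in a nightly review.
ERROR_LEVELS = {"emerg", "alert", "crit", "error"}


def count_apache_errors(lines):
    """Return a Counter of error-level log lines, keyed by log level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match and match.group(1) in ERROR_LEVELS:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "[Tue Jul 10 02:15:01.123456 2018] [core:error] [pid 4121] "
    "[client 10.0.0.5:51234] File does not exist: /var/www/html/favicon.ico",
    "[Tue Jul 10 02:15:07.000321 2018] [mpm_prefork:notice] [pid 4100] "
    "resuming normal operations",
    "[Tue Jul 10 02:16:44.554210 2018] [ssl:crit] [pid 4122] "
    "hypothetical critical message",
]
print(count_apache_errors(sample))
```

A non-empty result would be the cue to open the log and investigate; the sketch only counts lines and does not judge whether an error is benign.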

4.1 DAS Service Tier

1. Nightly Verifications
   a. Verify the continual operational status of the JBoss platform application environment
      i. This includes verifying the operating integrity of the JBoss Java application environment (JVM)
   b. Verify the continual operational status of the JBoss platform logging processes.
      i. This includes verifying the operation status of the log rotate process that manages daily JBoss logging
   c. Verify the storage of JBoss platform log files
      i. This includes verifying the JBoss platform application logs are being uploaded to an offsite permanent storage location.
2. 1st & 3rd Weekend Verifications
   a. Verify the health of the JBoss SOA platform
      i. This includes checking for possible memory leaks and investigating any errors discovered in the JBoss application logs or the logging process.
   b. Verify connectivity to the NARA Enterprise LDAP (eLDAP) system
      i. Verify the authenticity of the NARA eLDAP SSL certificate and compare it to the currently installed certificate. This is intended to catch any certificate changes before they impact the NARA DAS users.
   c. Verify the continuity of the JBoss SOA platform Amazon Machine Image (AMI)
      i. This includes comparing the running copy of the server to the image backup on file
      ii. A new server image will be created if differences are found
      iii. This does not include deployments, as those are stored and maintained elsewhere, then delivered to the application server.
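The eLDAP certificate comparison in item 2.b.i can be illustrated with a fingerprint check. This is a sketch under stated assumptions: the helper names are hypothetical, port 636 (the standard LDAPS port) is assumed rather than taken from the DAS configuration, and the installed certificate is assumed to be available as a PEM string.

```python
import hashlib
import ssl


def fingerprint(pem_cert):
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def certificate_changed(installed_pem, served_pem):
    """True when the served certificate differs from the installed copy."""
    return fingerprint(installed_pem) != fingerprint(served_pem)


def fetch_served_certificate(host, port=636):
    """Fetch the PEM certificate a server presents.

    Assumption: the directory is reachable over LDAPS on port 636.
    """
    return ssl.get_server_certificate((host, port))
```

In use, the output of `fetch_served_certificate` for the directory host would be passed to `certificate_changed` together with the installed PEM file's contents. A `True` result only signals a difference; it does not validate the new certificate's trust chain.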

 

4.2 DAS Data Tier Plan

1. Nightly Tasks
   a. Oracle Database will be completely re-indexed to support normal business operations and migrate imported or modified data into the primary database tables.
      i. This includes rebuilding the Oracle context indexes on large tables.
   b. Oracle Database logging will be verified and managed.
      i. This includes migrating logs no longer required by the Oracle Database software off the Oracle server to a permanent storage location.
   c. Verify Oracle Database server volume health.
      i. This includes comparing the storage capacity of the Oracle server’s storage volumes against the current storage utilization data.
      ii. This also includes analysing the accrual of Oracle data as compared to the data volume capacity to ascertain the health of the volume and volume data accrual rates.
      iii. This will assist PPC with identifying any volume health or capacity issues before they become problematic.
2. 1st & 3rd Weekend Verifications
   a. Oracle database architectural health will be verified
      i. Health of Oracle ASM (Automatic Storage Management) and data volumes will be assessed
      ii. Oracle table partitions will be assessed and partitions will be added if existing partitions are filling up with data
   b. Any alerts which have been issued by the database will be assessed
      i. These include the following items:
         1. Health of database software
         2. Health of ASM instance
         3. Health of table partitions
   c. Verify the continuity of the Oracle database Amazon Machine Image (AMI)
      i. This includes comparing the running copy of the server to the image backup on file
      ii. A new server image will be created if differences are found
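The volume accrual analysis described in the nightly volume health task can be sketched as a simple projection. This is an illustration only: the function name, the sample figures, and the linear-growth assumption (average rate between the first and last sample) are not taken from the DAS procedures.

```python
from datetime import date


def days_until_full(samples, capacity_gb):
    """Estimate days until a volume fills, from (date, used_gb) samples.

    Uses the average growth rate between the first and last sample.
    Returns None when usage is flat or shrinking.
    """
    (start_day, start_used), (end_day, end_used) = samples[0], samples[-1]
    elapsed_days = (end_day - start_day).days
    if elapsed_days <= 0:
        raise ValueError("samples must span at least one day")
    rate = (end_used - start_used) / elapsed_days  # GB per day
    if rate <= 0:
        return None
    return (capacity_gb - end_used) / rate


# Hypothetical utilization samples for a 1000 GB data volume.
samples = [(date(2018, 7, 1), 700.0), (date(2018, 7, 8), 714.0)]
print(days_until_full(samples, capacity_gb=1000.0))  # 143.0
```

A short projected horizon would flag the volume for the weekend review, where partitions or storage can be added.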

4.3 DAS Infrastructure Plan

1. 1st & 3rd Weekend Verifications
   a. Verify integrity of Red Hat & Oracle Enterprise Linux servers
      i. This includes verifying patches, running processes and monitoring overall system health
   b. Verify integrity of server volumes
      i. This includes verifying the volumes are operating at a high level of quality and efficiency by performing regular read/write and availability tests
   c. Verify that Amazon Machine Images (AMIs) are up to date
      i. This includes launching and testing existing AMIs to verify they function in the current DAS Production operating environment.
   d. Verify Amazon Web Services maintenance schedules and, if required, identify any NARA DAS servers that may be part of AWS scheduled maintenance.
      i. This includes reboots of server instances to accommodate AWS hardware and virtual hypervisor maintenance.
   e. Verify the continuity of the Amazon Machine Image (AMI) infrastructure
      i. This includes spinning up system AMIs and verifying they boot and load into the operating system environment properly
      ii. New server images will be created if existing images are found defective
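The read/write and availability test in item 1.b.i could look like the sketch below. It is a minimal illustration, not the procedure DSA runs: the function name and the default scratch size are assumptions, and because the data may be read back from the operating system's cache, the check confirms basic availability rather than the health of the underlying device.

```python
import os
import tempfile


def volume_read_write_ok(mount_point, size_bytes=1024 * 1024):
    """Write a scratch file on the volume, read it back, and compare."""
    payload = os.urandom(size_bytes)
    try:
        # The scratch file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=mount_point) as scratch:
            scratch.write(payload)
            scratch.flush()
            os.fsync(scratch.fileno())  # push the data out to the device
            scratch.seek(0)
            return scratch.read() == payload
    except OSError:
        return False  # missing, read-only, or full volume
```

Calling `volume_read_write_ok` once per mount point gives a pass/fail value that can be recorded on the weekend checklist.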

4.4 DAS Preventative Maintenance Schedule

1. Nightly Maintenance Window
   a. NARA provides PPC with a nightly maintenance window of 9PM EST to 5AM EST
   b. Nightly maintenance windows provide for activities such as:
      i. Assess any alerts issued by Oracle Enterprise Manager system
      ii. Merge imported data into primary table space
      iii. Rebuild XML text indexes
      iv. Perform a backup of the database tables
      v. Verify the health of the database software
      vi. Verify the health of the database operating system
      vii. Verify the health of the JBoss SOA platform software
      viii. Verify the health of the JBoss servers and operating systems
2. Weekend Maintenance Window
   a. NARA provides PPC with a weekend maintenance window on the 1st and 3rd weekend of every month, starting at 9PM EST Friday, continuing to 5AM EST Monday
      i. It is possible that the OPA export cannot be performed during the weekend maintenance. If PPC is unable to perform the OPA export, this must be communicated to the NARA DAS Project Manager and NARA DAS Program Manager at least two weeks in advance for authorization from the NARA PM and the NARA Program Manager.
      ii. Weekend maintenance windows provide opportunities for PPC to execute nightly maintenance activities as well as more extensive maintenance actions, such as:
         1. Assess and implement solutions to alerts issued by the Oracle Enterprise Manager system
         2. Management of the ASM instance
         3. Management of the table partitions
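The two maintenance windows above can be expressed as a small check that automated jobs might call before starting work. This is a sketch under stated assumptions: timestamps are taken to be naive local Eastern times, and a weekend is numbered by its Saturday (the 1st weekend is the one containing the first Saturday of the month), which this document does not specify. All function names are hypothetical.

```python
from datetime import datetime, time, timedelta

WINDOW_START = time(21, 0)  # 9PM
WINDOW_END = time(5, 0)     # 5AM


def in_nightly_window(ts):
    """True between 9PM and 5AM the following morning."""
    return ts.time() >= WINDOW_START or ts.time() < WINDOW_END


def weekend_saturday(ts):
    """Saturday of the Fri 9PM to Mon 5AM span containing ts, else None."""
    weekday = ts.weekday()  # Monday is 0, Sunday is 6
    day = ts.date()
    if weekday == 4 and ts.time() >= WINDOW_START:
        return day + timedelta(days=1)
    if weekday == 5:
        return day
    if weekday == 6:
        return day - timedelta(days=1)
    if weekday == 0 and ts.time() < WINDOW_END:
        return day - timedelta(days=2)
    return None


def in_weekend_window(ts):
    """True inside the 1st or 3rd weekend window of the month."""
    saturday = weekend_saturday(ts)
    if saturday is None:
        return False
    ordinal = (saturday.day - 1) // 7 + 1  # 1 for the first Saturday, etc.
    return ordinal in (1, 3)


print(in_weekend_window(datetime(2018, 7, 6, 21, 0)))   # True: Friday 9PM, 1st weekend
print(in_weekend_window(datetime(2018, 7, 14, 12, 0)))  # False: 2nd weekend
```

If NARA counts weekends differently (for example by Friday's date), only the `ordinal` calculation needs to change.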

5 NAC System Preventive Maintenance

5.1 NAC System Presentation Tier

5.1.1 Daily Monitoring

1. Two AWS CloudWatch Dashboards (Production-Lambda and Production-Web) were established to perform daily NAC PROD system monitoring and provide the following NAC system information:
   i. Time Interval: 5 minutes; Duration: 1h – 15 months
   ii. Server Instances: pw01 – pw04; pa01 – pa04; ps01 – ps04
       AWS/Lambda service functions and AWS/Queue: metadata-extraction; vips-processing; lambda-dlq-prod
   iii. Verify the continual operational status of the NAC PROD system by visually checking whether the entire PROD system (including web servers, API servers, Solr search engine servers, and AWS/Lambda services) is healthy
   iv. Monitoring will be extended to other PROD servers (such as database, Lambda app, and content processing servers) and to the UAT and DEV servers.
2. The Splunk Monitor Console provides daily disk space usage for all 17 PROD servers
   i. Time Interval: 2 hours
   ii. Server Instances: pw01 – pw04; pa01 – pa04; pdb01 – pdb02; pcp01 – pcp02; ps01 – ps04; pl01
   iii. This verifies the disk space usage of each PROD server so that enough disk space is maintained for data and logs.
3. Shut down the 17 UAT server instances daily to reduce server maintenance and cost
   i. Stop Time: 9pm – 6am; Start Time: 6am – 9pm
   ii. Server Instances: uw01 – uw04; ua01 – ua04; udb01 – udb02; ucp01 – ucp02; us01 – us04; ul01
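The server instance ranges listed above can be expanded into individual host names, which is useful when a monitoring or shutdown script needs to loop over every server. The sketch below is illustrative: `expand_instances` is a hypothetical helper, and it assumes the short names follow the prefix-plus-number pattern used in this section.

```python
import re

# Matches ranges such as "pw01 – pw04" (en dash) or "us01 -us04" (hyphen).
RANGE_RE = re.compile(r"^([a-z]+)(\d+)\s*[–-]\s*([a-z]+)(\d+)$")


def expand_instances(spec):
    """Expand a semicolon-separated list of names and ranges into host names."""
    hosts = []
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue
        match = RANGE_RE.match(item)
        if match is None:
            hosts.append(item)  # a single host such as "pl01"
            continue
        prefix, first, end_prefix, last = match.groups()
        if prefix != end_prefix:
            raise ValueError(f"mismatched range: {item!r}")
        width = len(first)  # keep the zero padding used in the names
        hosts.extend(
            f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)
        )
    return hosts


PROD = "pw01 – pw04; pa01 – pa04; pdb01 – pdb02; pcp01 – pcp02; ps01 – ps04; pl01"
prod_hosts = expand_instances(PROD)
print(len(prod_hosts))  # 17
```

The count of 17 matches the number of PROD servers stated for the Splunk disk space check.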

5.1.2 Weekly Monitoring

1. Every week, all PROD and UAT servers are inspected for health status, memory and CPU utilization, and disk space usage
   i. This includes creating AMIs (Amazon Machine Images) for each server every Friday
   ii. A new server AMI will be created as needed when the server configuration changes (such as network configuration or security group)
2. Each Friday, review and inspect AWS notifications
   i. Verify any impact from AWS service or hardware retirement plans or schedules
   ii. Plan or schedule modifications to the NAC system in line with AWS changes

5.2 NAC Service Tier

5.2.1 Daily Monitoring

1. Daily Manual Verifications
   i. Verify the NAC web application on each PROD web server (pw01, pw02, pw03, pw04) with a health check via the Squid proxy service
   ii. Verify the NAC API application on each PROD API server (pa01, pa02, pa03, pa04) with a test query via the Squid proxy service
   iii. Verify the NAC Solr service on each PROD Solr server (ps01, ps02, ps03, ps04) using the Apache Solr admin utility tool to check the status of the Solr nodes on both shard1 and shard2
   iv. Verify the NAC content processing service on each PROD cp server (pcp01, pcp02) using the Aspire utility tool to check the status of the DAS Feeder and Annotation services
   v. During work hours (i.e. 6am – 6pm), perform the above verification steps (i, ii, iii, iv) for the UAT environment
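Although these verifications are described as manual, the web and API health checks in steps i and ii lend themselves to a small script. The following is a sketch only: the URL template, the proxy address, and the function names are hypothetical, and a response status of 200 is assumed to mean "healthy", which may not match the real health check.

```python
import urllib.error
import urllib.request


def http_probe(url, proxy=None, timeout=10):
    """Return True when the URL answers with HTTP 200, optionally via a proxy."""
    handlers = []
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # unreachable host, timeout, or HTTP error status


def check_servers(hosts, url_template, probe=http_probe):
    """Probe every host and return a {host: healthy} mapping."""
    return {host: probe(url_template.format(host=host)) for host in hosts}
```

A call such as `check_servers(["pw01", "pw02"], "http://{host}/health", probe=lambda url: http_probe(url, proxy="http://squid.example.internal:3128"))` would route the checks through a proxy; both the `/health` path and the proxy address here are placeholders.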

5.2.2 Weekly Reports

1. Weekly Manual Verifications
   i. Verify the weekly data content processing with the following Splunk reports:
      Authority Records in the last week
      Descriptions with Digital Objects in the last week
      Descriptions without Digital Objects in the last week
      Webpages (Archives.gov, Presidential Libraries) in the last week
2. Weekly Reports Sent to NARA Clients
   A total of 16 Splunk reports are sent out each week for NARA staff and clients to review and verify

5.3 NAC Data Tier

5.3.1 Weekly Routine Data Ingestion

1. Post-validation for each weekly NAC data ingestion
   i. Verify and analyse the completed and error counts after each weekly data ingestion
   ii. Validate the ingested data records in the Solr server with scripts (TBD)
2. Validation of Digital Object URL Links
   i. Verify valid and invalid URL links for digital objects via the API after weekly data ingestion
   ii. Verify the existence of digital objects in AWS storage buckets
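Since the validation scripts are still marked TBD, the sketch below only illustrates the shape such checks might take. The record fields (`status`), the helper names, and the idea of comparing against a pre-fetched listing of bucket keys are all assumptions; the URL check is purely syntactic and does not contact the API or AWS.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize_ingestion(records):
    """Count ingestion records by status, e.g. completed versus error."""
    return Counter(record["status"] for record in records)


def is_well_formed_url(url):
    """Syntactic check only: an http(s) scheme and a host are present."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def missing_objects(object_keys, bucket_keys):
    """Keys referenced by ingested records but absent from the bucket listing."""
    return sorted(set(object_keys) - set(bucket_keys))


# Hypothetical records from one weekly ingestion run.
records = [
    {"id": "a1", "status": "completed"},
    {"id": "a2", "status": "error"},
    {"id": "a3", "status": "completed"},
]
print(summarize_ingestion(records))
```

A non-zero error count or a non-empty result from `missing_objects` would feed the on-demand recovery process described in section 5.3.2.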

5.3.2 On-Demand Data Ingestion

1. Recover data that went missing during NAC data ingestion
2. Recover data that went missing during the DAS export process

5.4 NAC Infrastructure Plan

1. 1st & 3rd Weekend Verifications
   i. Verify integrity of Red Hat & Amazon Enterprise Linux servers
      This includes verifying patches, running processes and monitoring overall system health
   ii. Verify integrity of server volumes
      This includes verifying the volumes are operating at a high level of quality and efficiency by performing regular read/write and availability tests
   iii. Verify that Amazon Machine Images (AMIs) are up to date
      This includes launching and testing existing AMIs to verify they function in the current NAC Production operating environment. New server images will be created if existing images are found defective

5.5 NAC Preventative Maintenance Schedule

1. Monthly System Patch for each server in each NAC environment
2. Nightly Maintenance Window
   i. NARA authorizes DSA to use a nightly maintenance window between 9PM EST and 5AM EST
   ii. Nightly maintenance windows provide time for preventive maintenance activities, which can include any of the following:
      Refresh the servers to clear server memory or cache issues
      Flush TCP/IP ports to clean up network traffic
      Compress historical logs or data to free disk space
      Add disk space where necessary
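The log compression activity listed above could be scripted as follows. This is a minimal sketch: the function name, the seven-day default age, and the choice of gzip are assumptions, and it treats every plain file in the directory as a log, so a real job would need to filter by file name.

```python
import gzip
import os
import shutil
import time


def compress_old_logs(log_dir, max_age_days=7, now=None):
    """Gzip plain files older than max_age_days; return the bytes freed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    freed = 0
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if name.endswith(".gz") or not os.path.isfile(path):
            continue  # already compressed, or not a regular file
        if os.path.getmtime(path) >= cutoff:
            continue  # still recent; it may be in active use
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        freed += os.path.getsize(path) - os.path.getsize(path + ".gz")
        os.remove(path)
    return freed
```

The returned byte count can be logged each night to show how much disk space the job released.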
