CSE 190: Internet E-Commerce
Lecture 14: Operations
Operations
• Everything it takes to keep a web site up and running, 24x7
  – Deployment process
  – Monitoring (SNMP)
  – Build system
  – Link rot
  – Maintenance window
  – Load testing
  – Browser compliance
  – Log rotation
  – Database backups
  – Disk failure
  – Router failure
  – Robots
  – Staffing
  – Data centers
• Expense of running a high availability site is comparable to running a physical store front
Deployment Process
• Proceeds in three phases
  – Development
    • Within corporation, not accessible outside
  – Stage
    • Within internet environment
    • UAT run here
    • Only operations staff may access
  – Live
    • Accessible to outside world
Monitoring
• SNMP (Simple Network Management Protocol)
  – Used to monitor both hardware and software
  – Provides: counters, values, triggers, statistics
  – Remote control of services
  – Information stored in MIB (Management Information Base)
  – RMON sometimes used as alternative to SNMPv2
• Software
  – HP OpenView
Maintenance Window
• Installation
  – Standard: J2EE standard web service descriptor (XML file with tarball of files)
  – InstallShield
  – Custom installation scripts
• Upgrades
  – Defined time on Friday or weekend to upgrade site, posted on web site
  – Process:
    • Front page linked to ‘Site down’
    • Load balancer redirected if appropriate
    • Application stops accepting new clients
    • (Pause) Application terminates all active sessions
    • Application upgraded
    • Sanity checks performed
    • Servers rebooted
    • Load balancer restored
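The upgrade process above is an ordered checklist, so it can be sketched as a small runner that executes the steps in sequence. This is a minimal sketch, not the course's tooling: the step names paraphrase the slide, the `actions` callables are hypothetical stand-ins for real operations scripts, and stopping at the first failed step is an assumed design choice.

```python
from typing import Callable, Dict, List

# Step names paraphrase the upgrade process on the slide, in the same order.
UPGRADE_ORDER: List[str] = [
    "link front page to 'Site down'",
    "redirect load balancer",
    "stop accepting new clients",
    "terminate active sessions",
    "upgrade application",
    "run sanity checks",
    "reboot servers",
    "restore load balancer",
]


def run_upgrade(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each step in order and stop at the first step that reports failure.

    `actions` maps a step name to a hypothetical callable returning True on
    success. Returns the names of the steps that completed.
    """
    completed: List[str] = []
    for name in UPGRADE_ORDER:
        if not actions[name]():
            # Assumed policy: leave the site in maintenance mode so that
            # operations staff can inspect it before traffic is restored.
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Stand-in actions that always succeed, for illustration only.
    stub_actions = {name: (lambda: True) for name in UPGRADE_ORDER}
    for step in run_upgrade(stub_actions):
        print("done:", step)
```

Because the load balancer is restored last, a failed sanity check in this sketch never exposes a half-upgraded site to outside traffic.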
Link Rot
• Link rot: the continual process by which links become invalid over time
• Tracked with custom tools
• Best practice: Pages have permanent URLs
• Referral field:
  – Tracking this in logs shows who’s linking to what URL on your site
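Tracking the referral field can be sketched as a small log tally. The example below assumes the server writes Apache "combined" format logs, where the quoted field after the response size is the HTTP Referer header; the function name, the `only_status` parameter, and the sample log lines are hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional

# Assumed layout: Apache "combined" log format.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)


def referrers_by_path(lines: Iterable[str],
                      only_status: Optional[str] = None) -> Dict[str, Counter]:
    """Map each requested path to a count of the pages that linked to it.

    If `only_status` is given (for example "404"), only requests with that
    response status are counted, which highlights inbound links that have rotted.
    """
    result: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if only_status is not None and match.group("status") != only_status:
            continue
        referrer = match.group("referrer")
        if referrer and referrer != "-":  # "-" means no referrer was sent
            result[match.group("path")][referrer] += 1
    return dict(result)


if __name__ == "__main__":
    # Made-up log lines for illustration.
    sample = [
        '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /old.html HTTP/1.0" '
        '404 209 "http://example.com/links.html" "Mozilla/4.08"',
        '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" '
        '200 2326 "-" "Mozilla/4.08"',
    ]
    for path, counts in referrers_by_path(sample, only_status="404").items():
        print(path, dict(counts))
```

Filtering on status 404 turns the same tally into a list of external pages that still point at URLs the site no longer serves.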
Load Testing
• Network load (60% bandwidth max)
  – Average page size (~20-30k)
• CPU load: occurs at at least three levels
  – HTTP level
  – Application level
  – DB query level
  – Metrics: maximum number of simultaneous users, latency vs. users
• Memory usage (256 M – 1 G per machine)
• Disk I/O load
  – 1 Gb per machine typical
• Tools
  – Mercury Interactive: WinRunner
  – Segue: SilkTest
  – Rational: SiteLoad
  – Microsoft: WCAT
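The network-load figures above imply a simple capacity estimate: cap utilization at 60% of the link and divide by the average page size. The sketch below works through that arithmetic. The function name and the 10 Mbit/s link are hypothetical, and "k" is assumed to mean 1,000 bytes.

```python
def max_pages_per_second(link_bits_per_second: float,
                         page_bytes: float,
                         max_utilization: float = 0.60) -> float:
    """Estimate pages served per second when the link is capped at max_utilization.

    The 60% default comes from the slide; headers, images fetched separately,
    and protocol overhead are ignored in this rough estimate.
    """
    usable_bits = link_bits_per_second * max_utilization
    return usable_bits / (page_bytes * 8)  # 8 bits per byte


if __name__ == "__main__":
    # Hypothetical 10 Mbit/s link, 25,000-byte average page (middle of ~20-30k).
    print(max_pages_per_second(10_000_000, 25_000))
```

With those assumed inputs, 6,000,000 usable bits per second divided by 200,000 bits per page gives about 30 pages per second, a ceiling to compare against the request rate measured by a load-testing tool.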
Browser Compatibility
• Cost of testing proportional to the number of platforms you’re compatible with
• The same product isn’t the same on different operating systems
  – E.g. IE4.5 isn’t the same on Mac vs. Windows
• Incompatible DOMs between MS, Netscape, Mozilla
• Browser archive
  – http://browsers.evolt.org/
Robots
• Robots: Automatically traverse web pages to retrieve documents, link structure, data
• Used for:
  – Indexing
  – HTML validation
  – Link validation
  – Mirroring
• Problems:
  – Too much rapid access from single IP
  – May be indexing dynamic, obsolete data
• Robot exclusion file:

    # /robots.txt file for mysite.com
    User-agent: webcrawler
    Disallow:

    User-agent: lycra
    Disallow: /

    User-agent: *
    Disallow: /jsp
    Disallow: /logs
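To see how a well-behaved robot would read the exclusion file above, the rules can be fed to Python's standard `urllib.robotparser` module. This is a sketch for checking the rules offline: the helper name `build_parser`, the agent name `otherbot`, and the example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The exclusion file from the slide, one directive per line.
ROBOTS_TXT = """\
# /robots.txt file for mysite.com
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /jsp
Disallow: /logs
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    checks = [
        ("webcrawler", "/jsp/cart.jsp"),   # empty Disallow: everything allowed
        ("lycra", "/index.html"),          # Disallow: / blocks the whole site
        ("otherbot", "/jsp/cart.jsp"),     # falls under the * entry
        ("otherbot", "/index.html"),
    ]
    for agent, path in checks:
        print(agent, path, rules.can_fetch(agent, path))
```

Note that the file is advisory: it only restrains robots that choose to honor it, so it does not by itself address the rapid-access problem listed above.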
Failure Models
• Mean Time To Failure (MTTF) = average amount of time the system is up
• Mean Time Between Failures (MTBF) = average amount of time between failures
• Mean Time To Repair (MTTR) = average amount of time the system is down after it fails: active repair time (diagnostics and repair)
• Mean Down Time (MDT) = average amount of time the system is down after it fails: active repair time + preventive maintenance + logistics time (time spent waiting for personnel, etc.)
• Intrinsic availability = MTTF / (MTTF + MTTR)
• Operational availability = MTBF / (MTBF + MDT)
[Figure: failure rate over time. Hardware failure rate: burn-in, useful life, wear-out. Software failure rate: integration & test, useful life, obsolete.]
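The two availability ratios defined on this slide are easy to evaluate directly. The sketch below computes them and converts an availability figure into expected downtime per year. The function names and the sample numbers are hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def intrinsic_availability(mttf: float, mttr: float) -> float:
    """MTTF / (MTTF + MTTR): counts only active repair time as downtime."""
    return mttf / (mttf + mttr)


def operational_availability(mtbf: float, mdt: float) -> float:
    """MTBF / (MTBF + MDT): also counts maintenance and logistics delays."""
    return mtbf / (mtbf + mdt)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Made-up figures: 999 hours up, then 1 hour of active repair.
    a = intrinsic_availability(mttf=999.0, mttr=1.0)
    print(a, downtime_hours_per_year(a))
```

In this made-up case the availability is 0.999, which corresponds to roughly 8.76 hours of downtime per year. Because MDT adds waiting and maintenance time on top of active repair, the operational figure for a real site is typically the lower of the two.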
When things go wrong
• Network operations
  – Software recovers from common failures
  – Network staff paged by email if server not available (via SNMP)
  – Usually rotating assignment
• Application developers may be called in if restarting servers, etc. fails completely, and only if it doesn’t look like a network problem
Data Centers
• Data centers: facilities that host your machines on the provider’s premises
  – Also called “colocation”
• Features
  – Security: controlled entrance, exit
  – Climate: maintained temperature, humidity
  – Power: backup power, available circuits
  – Bandwidth: OC-192 connections
  – Monitoring: 24/7 staff, may reboot misbehaving machines
• Machines typically arranged in “cages”; 1U, 2U machines
• Server blades
• Examples
  – NTT / Verio
  – Exodus / Global Crossing