
CSE291 (H00) Modern Data Center Systems

Yiying Zhang

Before We Start
• This course is mainly about datacenter software systems. If you feel that this is not for you, e.g., if you are in the wrong class (this is CSE291-H), if you hate reading papers, if you don't like doing one big group project per semester, or if you have no software development experience, please feel free to leave now.

• Class website: https://cseweb.ucsd.edu/~yiying/cse291h-fall19/

• Class time: WF 9am – 10:20am @CSE2154

Course Structure
• Mostly seminar style, some lectures

• Paper reading and discussion (~2 papers per week) + learning new systems (40%)
  – Write short summaries of the papers and short answers to 2-3 questions (due by 9am on the class day)
  – Participate in class discussion
  – Lead discussion

• Class attendance
  – Two absences allowed throughout the semester

• One research-oriented team project (60%)

More about Paper Discussion and Projects

• Paper discussion
  – One leader per paper (there may be multiple leaders per lecture)
  – Lead the discussion of the paper in class by beginning with a paper summary and preparing several questions
  – Assigned in alphabetical order by name (last name, first name)
  – Leaders will be posted on the course website
  – Starts from the week of 10/9

• Projects
  – Three reports: project proposal (1 page, 15%), project progress report (1 page, 5%), project summary report with results (6 pages, 30%)
  – One presentation/demo (10-15 min per group, including questions, 10%)

Course Project
• One research-oriented project
  – Groups of 1-3
  – Be prepared to do substantial programming; start early!
  – Most projects have assigned topics (self-defined projects are allowed, but discuss with me ASAP)
  – Form your group by the end of next Friday!

Sample Project Topics
• Study: cloud application startup time
• Study: impact of different cloud offerings
• Study+Build: cloud benchmark
• Study+Build: VM vs. container vs. serverless
• Build: cloud FPGA-based application
• Build: OS as microservices
• Build: learned file system

• Propose your own

More Exciting Research at WukLab!

Sample Turn-in of Paper Summary
9/27, Yiying Zhang
Above the Clouds: A Berkeley View of Cloud Computing

Summary and your overall feeling of the paper: 2-5 sentences

• Q1: Name three pros and three cons of cloud.
• A: 1-3 sentences

• Q2: Despite the obstacles listed in the paper, cloud has happened and is almost everywhere in our lives now. What do you think are the fundamental reasons behind its success?
• A: 1-3 sentences

• Q3: What do you think is the future of cloud computing?
• A: 1-3 sentences

Rough Plan of the Semester
• 1 week: Basics of data center systems and cloud computing
• 1 week: Virtualization and containers
• 1 week: Serverless computing and resource disaggregation
• 1 week: Historical systems and consensus
• 1 week: Storage and databases
• 1 week: Networking (and remote memory)
• 1 week: Resource management and data flow systems
• 1 week: Applications (machine learning, streaming, or graphs)
• 1 week: Case study and hardware
• 1 week: Summary

• Most are state-of-the-art systems used in production, e.g., Docker, Kubernetes, AWS Lambda, ZooKeeper, GFS, Amazon Dynamo, Borg, MapReduce, Spark, TensorFlow, Amazon Nitro, and many more

Another Chance to Leave Now

Reading This Week
• The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines
  – By Luiz André Barroso, Jimmy Clidaras, Urs Hölzle
  – (Ch 1, 2, 6, 7; briefly Ch 3, 4, 5)

• Above the Clouds: A Berkeley View of Cloud Computing

Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

• Acknowledgement
  – Slide sources include Jennifer Rexford, Cheng-Zhong Xu, Steve Gribble

Google Data Center in Oregon

Source: Google

Google Data Center Locations

Source: Google

Data Centers
• Large server and storage farms

• Traditional data centers
  – Host a large number of relatively small- or medium-sized applications, each running on a dedicated hardware infrastructure that is decoupled and protected from other systems in the same facility
  – Usually serve multiple organizational units or companies

• Modern data centers (AKA WSCs, Warehouse-Scale Computers)
  – Usually belong to a single company and run a small number of large-scale applications
    • Google, Facebook, Microsoft, Amazon, Alibaba…
  – Use relatively homogeneous hardware and system software
  – Share a common systems management layer
  – Sizes can vary depending on needs

Scale Up vs. Scale Out
• Scale up: high-cost powerful CPUs, more cores, more memory
• Scale out: adding more low-cost, commodity servers

• Supercomputer vs. data center
• Scale
  – Blue Waters = 40K 8-core "servers"
  – Microsoft Chicago Data Center = 50 containers = 100K 8-core servers
• Network Architecture
  – Supercomputers: InfiniBand, low-latency, high-bandwidth protocols
  – Data centers: (mostly) Ethernet-based networks
• Data Storage
  – Supercomputers: separate data farm
  – Data centers: disks on each node + memory caches

Data Center Architecture

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines

A Row of Servers in a Google Data Center

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines

Hierarchy of Data Centers

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines

Latency, Bandwidth, and Capacity of a Data Center

Assume: 2000 servers (8GB memory and 1TB disk each), 40 per rack, connected by a 48-port 1Gbps switch (8 uplinks).
Architecture: bridge the gap in a cost-efficient manner.
Software: hide the complexity, exploit data locality.

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines
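To make the bandwidth gap concrete, here is a back-of-the-envelope sketch (Python) using exactly the rack numbers assumed above; it computes the rack's oversubscription ratio, i.e., how much less bandwidth each server gets when talking off-rack than in-rack:

```python
# Back-of-the-envelope oversubscription for the rack described above.
servers_per_rack = 40          # servers, each with a 1 Gbps link
uplinks = 8                    # 1 Gbps uplinks out of the 48-port switch
link_gbps = 1

intra_rack_bw = servers_per_rack * link_gbps   # 40 Gbps if traffic stays in-rack
uplink_bw = uplinks * link_gbps                # 8 Gbps leaving the rack

oversubscription = intra_rack_bw / uplink_bw             # 5:1
per_server_external_gbps = uplink_bw / servers_per_rack  # 0.2 Gbps per server

print(f"oversubscription {oversubscription:.0f}:1, "
      f"{per_server_external_gbps:.1f} Gbps/server off-rack")
```

This 5:1 gap is the "bridge the gap" problem the slide refers to: software must exploit data locality because off-rack bandwidth per server is a fifth of the local link speed.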

Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

Data Center $ Cost
• Total cost of ownership (TCO) of a datacenter

• Capital expenses (Capex)
  – Investments made upfront and then depreciated over a certain time frame
  – E.g., the construction cost of a datacenter, the purchase price of a server

• Operational expenses (Opex)
  – Recurring costs of actually running the equipment, excluding depreciation
  – E.g., electricity costs, repairs and maintenance, salaries of on-site personnel

• TCO = datacenter depreciation + datacenter Opex + server depreciation + server Opex
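A minimal sketch of the TCO formula above in Python. All dollar figures and depreciation periods are made-up assumptions for illustration, not real costs:

```python
# TCO = datacenter depreciation + datacenter Opex
#     + server depreciation + server Opex   (per month; made-up numbers)

dc_capex = 120_000_000        # facility construction cost ($), assumed
dc_life_months = 12 * 12      # depreciate the facility over ~12 years
server_capex = 40_000_000     # total server purchase price ($), assumed
server_life_months = 4 * 12   # depreciate servers over ~4 years

dc_opex_per_month = 400_000      # electricity, repairs, on-site staff ($)
server_opex_per_month = 600_000  # server power, maintenance ($)

tco_per_month = (dc_capex / dc_life_months        # datacenter depreciation
                 + dc_opex_per_month              # datacenter Opex
                 + server_capex / server_life_months  # server depreciation
                 + server_opex_per_month)         # server Opex
print(f"TCO: ${tco_per_month:,.0f} per month")
```

Note how the shorter server depreciation period makes servers weigh as heavily as the far more expensive building: servers are replaced every few years, the facility lasts over a decade.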

Power and Cooling Management in DC

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines

Uninterruptible Power Systems (UPS)

• A transfer switch chooses the active power input, either utility power or generator power

• Contains some form of energy storage (electrical or mechanical) to bridge the time between a utility failure and the availability of generator power

• Conditions the incoming power feed, removing voltage spikes, sags, and harmonic distortions in the AC feed


Power Distribution Units (PDUs)
• A PDU takes the UPS output (typically 200-480V) and breaks it up into many 110V or 220V circuits that feed the actual servers on the floor
  – Each circuit is protected by its own breaker

• A typical PDU handles 75-225kW of load, whereas a typical circuit handles 20 or 30A at 110-220V (a max of ~6kW)

• PDUs provide additional redundancy (at the circuit level)

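A quick sanity check (Python) on the figures above, ignoring breaker derating: a 30A circuit at 220V carries about 6.6kW, consistent with the ~6kW maximum quoted, so a 75-225kW PDU feeds on the order of a dozen to a few dozen such circuits:

```python
# Sanity check on the PDU/circuit figures above (derating ignored).
circuit_amps = 30
circuit_volts = 220
circuit_kw = circuit_amps * circuit_volts / 1000   # 6.6 kW, ~ the 6 kW max

for pdu_kw in (75, 225):
    print(f"{pdu_kw} kW PDU -> ~{pdu_kw / circuit_kw:.0f} circuits of "
          f"{circuit_amps} A @ {circuit_volts} V")
```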

Datacenter Cooling Systems
• Loop system to bring heat outside and cool medium in
• Two-loop system
  – CRAC units (computer room air conditioning)
  – The liquid supply discharges heat to the outside environment

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines

A Three-Loop Cooling System

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines

"Free Cooling"
• Cold weather
• Close to water
• Cheap energy bill

Approximate Distribution of Peak Power Usage

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines

Energy Efficiency
• DCPE (datacenter performance efficiency): ratio of the amount of computational work to the total energy consumed

(a) Facility efficiency
  – Power usage effectiveness (PUE): ratio of total building power to IT power (currently 1.5 to 2.0)
(b) Server power conversion efficiency
  – Server PUE (SPUE): ratio of total server input power to its useful power
(c) Server's architectural efficiency

Source: The Datacenter as a Computer -- An Introduction to the Design of Warehouse-Scale Machines
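The first two ratios multiply: of every watt entering the building, only 1/(PUE × SPUE) reaches the server's useful components. A small sketch, assuming a PUE of 1.7 (within the 1.5-2.0 range cited) and an illustrative SPUE of 1.2:

```python
# End-to-end power overhead: building power -> IT power -> useful server power.
pue = 1.7    # building power / IT power (slide cites 1.5-2.0)
spue = 1.2   # server input power / useful server power (assumed value)

total_pue = pue * spue            # watts drawn at the utility per useful watt
useful_fraction = 1 / total_pue   # ~0.49

print(f"{total_pue:.2f} W from the grid per useful W "
      f"({useful_fraction:.0%} reaches CPUs/memory/disks)")
```

With these assumed values, roughly half the facility's power is lost to cooling, power conversion, and distribution before doing any computation, which is why (c), the server's architectural efficiency, still matters on top.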

Energy Management in Datacenters

• Energy: a major component of the operational cost of datacenters
  – Large data centers have energy bills of several million dollars
  – Where does it come from? Power for servers and cooling

• Datacenters also have a large carbon footprint

• How to reduce energy usage? Need energy-proportional systems
  – Energy proportionality: energy use proportional to load
  – But: current hardware is not energy proportional

Energy Management
• Many approaches are possible
• Within a server:
  – Shut down certain components (cores, disks) when idling or at low loads
  – Use DVFS for the CPU
• Most effective: shut down servers you don't need (see the sketch below)
  – Consolidate workload onto a smaller number of servers
  – Turn the others off
• Thermal management: move workload to cooling, or move cooling to where workloads are
  – Requires sensors and intelligent cooling systems
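Why consolidation beats per-server tuning on non-energy-proportional hardware: an idle server still burns a large fraction of its peak power, so packing the load onto fewer machines and powering off the rest saves far more. A sketch with assumed power numbers:

```python
# Illustrative consolidation math on non-energy-proportional servers.
idle_w, peak_w = 100.0, 200.0     # assumed: idle draw is 50% of peak
n_servers, total_load = 10, 2.0   # cluster-wide load = 2 servers' worth

def power(util):
    """Linear power model between idle and peak draw."""
    return idle_w + (peak_w - idle_w) * util

# Spread the load evenly across all 10 servers (20% utilization each).
spread = n_servers * power(total_load / n_servers)
# Consolidate onto 2 servers at 100% utilization; power off the other 8.
consolidated = 2 * power(1.0)

print(f"spread: {spread:.0f} W, consolidated: {consolidated:.0f} W")
```

Under these assumptions, spreading burns 1200 W while consolidating burns 400 W: the 8 idle servers' baseline draw dominates.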

Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

Datacenter Networking
• Modern datacenter networks are usually built with commodity Ethernet switches
  – A switch has a limited number of ports and limited bandwidth
  – How to support large-scale data centers?

• Tradeoff between speed, scale, and cost

What characterizes a network?
• Topology (what)
  – The physical interconnection structure of the network graph
  – Regular vs. irregular (random)
• Routing Algorithm (which)
  – Restricts the set of paths that messages may follow
  – Table-driven, or routing-algorithm based
• Switching Strategy (how)
  – How the data in a message traverses a route
  – Store-and-forward vs. cut-through (compared in the sketch below)
• Flow Control Mechanism (when)
  – When a message, or portions of it, traverses a route
  – What happens when traffic is encountered?
• The interplay of all of these determines performance
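As one example of how the switching strategy affects performance, here is a sketch comparing per-packet latency under store-and-forward (each switch receives the whole packet before forwarding) versus cut-through (forwarding begins once the header has arrived). Link speed, header size, and hop count are assumed values; propagation delay is ignored:

```python
# Store-and-forward vs. cut-through latency over several switch hops.
packet_bits = 1500 * 8   # one full-size Ethernet frame
header_bits = 64 * 8     # assume forwarding can start after ~64 bytes
link_bps = 1e9           # 1 Gbps links
hops = 3                 # switches between source and destination

# Store-and-forward: full serialization delay at the sender and every switch.
sf = (hops + 1) * packet_bits / link_bps
# Cut-through: one full serialization, plus only a header delay per switch.
ct = packet_bits / link_bps + hops * header_bits / link_bps

print(f"store-and-forward: {sf * 1e6:.1f} us, cut-through: {ct * 1e6:.1f} us")
```

With these numbers, store-and-forward pays the 12 µs serialization delay at every hop (48 µs total) while cut-through pays it once (~13.5 µs total), which is why low-latency fabrics favor cut-through designs.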

Common Data Center Topology

[Figure: a three-tier tree between the Internet and the servers inside the data center. Layer-3 routers form the core, layer-2/3 switches form the aggregation tier, and layer-2 access switches connect down to the servers.]

Fat Tree Topology
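The fat tree is the standard answer to "how do commodity switches support large-scale data centers": a k-ary fat tree built entirely from identical k-port switches can deliver full bisection bandwidth to k³/4 hosts (the classic result from the fat-tree datacenter design of Al-Fares et al.). A small sketch of the scaling:

```python
# Scale of a k-ary fat tree built from identical k-port commodity switches
# (per the classic fat-tree datacenter design, e.g., Al-Fares et al.).
for k in (24, 48, 64):
    hosts = k**3 // 4          # servers supported at full bisection bandwidth
    switches = 5 * k**2 // 4   # k^2/4 core switches + k pods of k switches
    print(f"k={k:2d}: {hosts:6d} hosts, {switches:5d} switches")
```

For example, 48-port switches (a common commodity size) suffice for 27,648 hosts with no oversubscription, using only cheap identical parts: the speed/scale/cost tradeoff from the previous slide.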

Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

Data Center Storage
• Large scale => things may fail
  – Google deploys desktop-class disk drives instead of enterprise-grade disks

• Reliability and availability are important
  – Data replication, e.g., exact copies (Google) or erasure coding (Microsoft); see the sketch below
  – Geo-replication

• Tiered storage
  – Disks, SSDs, NVM
  – Caching

• Ensure performance for various workloads
  – Random/sequential, read/write, coarse/fine access

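To see why erasure coding is attractive at scale, here is a sketch comparing the storage overhead of 3-way replication against a Reed-Solomon-style code with k data and m parity fragments. The (10, 4) parameters are illustrative, not a specific production configuration:

```python
# Storage overhead: 3-way replication vs. (k data + m parity) erasure coding.
user_data_pb = 1.0   # logical data to protect, in PB

replicas = 3
replication_raw = user_data_pb * replicas   # 3.0 PB on disk

k, m = 10, 4                                 # illustrative code parameters
erasure_raw = user_data_pb * (k + m) / k     # 1.4 PB on disk
tolerated = m                                # any m of k+m fragments can be lost

print(f"replication: {replication_raw} PB raw, survives {replicas - 1} lost copies")
print(f"erasure ({k}+{m}): {erasure_raw} PB raw, survives {tolerated} lost fragments")
```

The code tolerates more simultaneous losses at less than half the raw capacity; the price is reconstruction cost, since a lost fragment must be rebuilt from k others over the network.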

Data Center Storage Systems
• Distributed storage/compute
  – Each machine handles both compute and storage
  – Faster when accessing local data
  – More difficult to manage

• Disaggregated storage (Network Attached Storage, NAS)
  – Manage (and build) compute and storage separately (a compute pool connected to a storage pool over the network)
  – Flexible management, easy to deploy
  – Could be slower

• Global distributed file systems (e.g., Google's GFS/Spanner)
  – Hard to implement at the cluster level, but lower hardware costs and better networking fabric utilization


Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

Datacenter Software Infrastructure
• Platform-level software: present on all individual servers, providing basic server-level services

• Cluster-level infrastructure: the collection of distributed systems software that manages resources and provides services at the cluster level
  – Mesos, MapReduce, Hadoop, Spark, Dynamo, Dryad, etc.

• Application-level software: implements a specific service
  – Online services like web search and Gmail
  – Offline computations, e.g., data analysis, or generating data used by online services, such as building an index


Cluster-level SW
• Resource Management
  – Map user tasks to hardware resources, enforce priorities and quotas, and provide basic task management services
  – Ranges from simple allocation to automated allocation, fair sharing of resources at a finer granularity, and power/energy considerations
  – E.g., Google Borg

• HW Abstraction and Basic Services
  – E.g., reliable distributed storage, message passing, cluster-level synchronization (GFS, Dynamo, gRPC)


Cluster-level SW (cont.)
• Deployment and maintenance
  – Software image distribution, configuration management, monitoring of service performance and quality, alarm triggers for operators in emergencies, etc.
  – E.g., Microsoft's Autopilot, Google's Health Infrastructure

• Programming Frameworks
  – Tools like MapReduce improve programmer productivity by automatically handling data partitioning, distribution, and fault tolerance (see the word-count sketch below)

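To illustrate what "automatically handling data partitioning, distribution, and fault tolerance" buys the programmer, here is the canonical MapReduce word count expressed as just a map function and a reduce function. The runner is a local, single-process stand-in; a real framework would shard the input, shuffle by key across machines, and rerun failed tasks:

```python
from collections import defaultdict

# The programmer writes only these two functions; the framework handles
# partitioning the input, shuffling by key, and retrying failed tasks.
def map_fn(document):
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run_local(documents):
    """Single-process stand-in for the distributed MapReduce runtime."""
    shuffle = defaultdict(list)
    for doc in documents:                  # "map phase"
        for key, value in map_fn(doc):
            shuffle[key].append(value)
    # "reduce phase": one reduce call per distinct key
    return dict(reduce_fn(k, v) for k, v in shuffle.items())

print(run_local(["the quick brown fox", "the lazy dog"]))
```

The same two functions run unchanged whether the input is two strings or two petabytes; scaling and fault tolerance live entirely in the runtime.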

Workload Management
• Internet/cloud applications => dynamic workloads
• How much capacity to allocate to an application?
  – An incorrect workload estimate leads to over- or under-provisioned capacity
  – Desire: auto-scaling and right-sizing
  – A major issue for internet-facing applications
    • Workload surges / flash crowds cause overloads
    • Long-term incremental growth (workload doubles every few months for many newly popular apps)
  – Traditional approach: IT admins estimate peak workloads and provision sufficient servers
    • Flash crowd => react manually by adding capacity
    • Time scale of hours: lost revenue, bad publicity for the application

Dynamic Provisioning
• Track workload and dynamically provision capacity
• Monitor -> Predict -> Provision
• Predictive versus reactive provisioning (a minimal reactive loop is sketched below)
  – Predictive: predict future workload and provision for it
  – Reactive: react whenever capacity falls short of demand
• Traditional data centers: bring up a new server
  – Borrow from the free pool or reclaim an under-used server
• Virtualized data centers: exploit virtualization to speed up application startup time
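A minimal sketch of a reactive provisioning loop. The thresholds, per-server capacity, and load values are all assumptions; a predictive controller would replace the observed load with a forecast:

```python
import math

def reactive_provision(servers, observed_load, capacity_per_server=100,
                       high=0.8, low=0.3):
    """Return a new server count: scale out when utilization is high,
    scale in when it is low (thresholds are illustrative)."""
    utilization = observed_load / (servers * capacity_per_server)
    if utilization > high:
        # Add enough servers to bring utilization back under the high mark.
        servers = math.ceil(observed_load / (high * capacity_per_server))
    elif utilization < low:
        # Release servers, but keep enough headroom for the current load.
        servers = max(1, math.ceil(observed_load / (high * capacity_per_server)))
    return servers

# Example: a flash crowd triggers scale-out, then traffic subsides.
n = 4
for load in (250, 900, 900, 200):    # requests/sec, made-up trace
    n = reactive_provision(n, load)
    print(f"load={load:4d} req/s -> {n} servers")
```

Note the lag built into this design: capacity changes only after the load has already shifted, which is exactly why the slide contrasts it with predictive provisioning and why fast (virtualized) startup matters.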

Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

Basic Concepts
• Failure: a system failure occurs when the delivered service deviates from the specified service, where the service specification is an agreed description of the expected service. [Avizienis & Laprie 1986]

• Fault: the root cause of failures, defined as a defective state in materials, design, or implementation. Faults may remain undetected for some time; once a fault becomes visible, it is called an error.
  – Faults are unobserved defective states
  – An error is a "manifestation" of a fault

(Source: Salfner '08)

Causes of Faults
• Operation error (human, configuration, etc.)

• Software error

• Hardware error
  – Network
  – Memory
  – Disks
  – Flash

• Power outage

• Natural disaster

Causes of Service-Level Failures
• Field data study 1, on Internet services: operator-caused or misconfiguration errors are the larger contributors; hardware-related faults (server or networking) account for about 10-25% [Oppenheimer 2003]

• Field data study 2, on early Tandem systems: hardware faults (<10%), software faults (~60%), operations/maintenance (~20%) [Gray '90]

• Google's observations over a period of 6 weeks


Availability and Reliability
• Availability: a measure of the time that a system was actually usable, as a fraction of the time that it was intended to be usable ("x nines")
  – Yield: the ratio of requests satisfied by the service to the total number of requests

• Reliability Metrics (combined in the sketch below):
  – Time to failure (TTF)
  – Time to repair (TTR)
  – Mean time to failure (MTTF)
  – Mean time to repair (MTTR)

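The availability and "nines" vocabulary follow directly from these metrics: steady-state availability = MTTF / (MTTF + MTTR). A small sketch with assumed failure and repair times:

```python
import math

def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure/repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative numbers: a server that fails about twice a year (MTTF ~4380 h)
# and takes 4 hours to repair.
a = availability(mttf_hours=4380, mttr_hours=4)
nines = -math.log10(1 - a)   # 0.999 -> 3 nines, 0.9999 -> 4 nines, etc.
print(f"availability {a:.5f} (~{nines:.1f} nines)")
```

The formula makes the two levers explicit: you can buy availability either by failing less often (raise MTTF) or by recovering faster (lower MTTR); datacenter software tends to attack the latter.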

Challenges of High Service Availability

• Challenge: modern data centers often operate at a very large scale
  – Faults in hardware, software, and operations are inevitable
  – At Google, about 45% of servers need to reboot at least once over a 6-month window; over 95% reboot less often than once a month, but the tail is relatively long
  – The average downtime is ~3 hours, implying 99.85% availability

• Determining the appropriate level of reliability is fundamentally a trade-off between the cost of failures and the cost of preventing them


Failure Handling
• Redundancy
  – ECC
  – RAID
  – Erasure coding
  – Service replication
  – Geo-replication
  – Long-term backup

Failure Handling (cont.)
• Monitoring and catching errors

• Restart

• Reconstruct

* For some applications, occasional faults/inconsistency are OK (e.g., Google search, Amazon adding items to cart)

Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

Examples of Application Software
• Web 2.0 applications (e.g., search, social networks, vlogs, e-commerce, etc.)
  – User facing
  – Can be bursty (e.g., Amazon on Thanksgiving)

• Data-intensive workloads (e.g., data analytics, deep learning, data lakes, etc.)
  – Business oriented
  – Typically run longer and in large volumes
  – Can also be bursty

• Traditional business in today's datacenters (e.g., banking, health, government)
  – Traditionally run in on-prem clouds (private, local small data farms)
  – Security can be a major concern (e.g., GDPR, HIPAA)


Big Data Workloads
• Data is measured by the 3 V's:
  – Volume: TBs
  – Velocity: TB/sec; the speed of creation or change
  – Variety: type (text, audio, video, images, geospatial, ...)

• Increasing processing power, storage capacity, and networking have caused data to grow in all 3 dimensions

• Extended dimensions: Volume, Location, Velocity, Churn, Variety, Veracity (accuracy, correctness, applicability)

Big Data Examples
• Social network data, sensor networks, Internet search, genomics, astronomy, …

A Buzzword: Deep Learning

Outline
• Datacenter Overview
• Datacenter Architecture
• $ cost and energy
• Datacenter Networking
• Datacenter Storage
• Resource Management
• Reliability and Availability
• Datacenter Applications
• Datacenter Hardware

Datacenter Hardware
• General-purpose, off-the-shelf hardware
  – Racks of cheap, commodity hardware (essentially the same as your desktop or laptop)
  – Use software to handle failures
  – Different software for different applications

• Purpose-built hardware
  – GPGPUs, ASICs, FPGAs, SoCs, SmartNICs, programmable switches, TPUs, etc.
  – Accelerate applications, storage, network, etc.
  – Customize hardware for a type of application
  – When does it make economic sense to build your own hardware?

In Summary
• Hardware
  – Building blocks are commodity server-class machines, consumer- or enterprise-grade disk drives, and Ethernet-based networking fabrics
  – The performance of the network fabric and storage subsystems can be more relevant than that of the CPU and memory

• Software
  – Fault-tolerant software for high service availability (99.99%)
  – Programmability, parallel efficiency, manageability

• Economics: cost effectiveness
  – Power and energy factors
  – Utilization characteristics require systems and components to be energy efficient across a wide load spectrum, particularly at low utilization levels


In Summary: Key Challenges
• Rapidly changing workloads
  – New applications with a large variety of computational characteristics emerge at a fast pace
  – Need creative solutions from both hardware and software, but few benchmarks are available

• Building balanced systems from imbalanced components
  – Processors have outpaced memory and magnetic storage in performance and power efficiency; more research should shift to the non-CPU subsystems

• Curbing energy usage
  – Power becomes a first-order resource, like speed
  – Performance under a power/energy budget
