CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA
After the Internet
We’re living at the end of history…
For government types, that refers to the fall of the Berlin Wall and the collapse of the USSR
For us, it refers to the .COM boom and bust
The Internet had infinite promise until 2000… now it is approaching maturity
What do we know how to do?
What major challenges do we face as we look at the “after the Internet” picture?
Critical Infrastructure: Rapidly Expanding Web of Dependency
Massive rollout underway:
Control of the restructured power grid
New medical information systems link hospitals to other providers, reach right into the home
Telephony infrastructure
Financial systems: eMoney replaces cash!
Disaster response and coordination
Future military will be extremely dependent on information resources and solutions
Tangled Interdependencies
[Diagram: the power grid, Internet, telephony, and banking all rest on a shared base of Internet software and COTS technology]
Multiple Concerns
Infrastructure industries have been dangerously naïve about the challenges of using Internet and computing technologies in critical ways
Nationally critical information systems are poorly protected, fragile, easily disrupted
Stems from pervasive use of COTS components
Vendors poorly motivated to address the issue
Yet academic research is having little impact
No sense of “excitement” or importance
Few significant technology transition successes
Most serious issue? Loss of public interest and enthusiasm
Government shares this view
“It’s just software; we buy it from Microsoft”
Academic researchers often seen as freeloading at the taxpayer’s expense
Critical infrastructure components often look “less critical” considered in isolation
Ten thousand networked medical care systems would worry us, but not individual instances
Concrete Examples of Threats?
Power system requires a new generation of technology for preventing cascaded failures and implementing load-following power contracts
Industry requires solutions but has no idea how to build them. Technical concern “masked” by politics
DOE effort is completely inadequate
Three branches of the military are separately developing real-time information support tools
Scale will be orders of magnitude beyond anything ever done with Internet technologies
Goals recall the FAA’s AAS fiasco (lost $6B!)
Concrete examples of threats? 2003 East Coast blackout
Restructuring of the power grid broke it into multiple competing producers and consumers
But the technology to monitor and control the restructured grid lagged the need
Consequences of this deficiency? Operators were unable to make sense of a slowly cascading instability that ultimately engulfed the whole East Coast!
Vendor Perspective?
Little interest in better security
“You have zero privacy anyway. Get over it.” (Scott McNealy, CEO, Sun Microsystems, 1/99)
In contrast, Bill Gates has often stated that MSFT needs to improve
But he doesn’t have critical infrastructure in mind, and he doesn’t point to Internet issues
Internet technology is adequate for the most commercially lucrative Web functions
But reliability and security are inadequate for other emerging needs, including CIP requirements
Issue is that the market is the main driver for product evolution, and the market for critical solutions is small
Security: Often mistaken for the whole story
Even today, most CIP work emphasizes security and denial-of-service attacks
But critical applications must also work:
Correctly
When and where required
Even when components fail or are overloaded
Even when the network size grows or the application itself is used on a large scale
Even when the network is disrupted by failures
Market failure
Refers to situations in which a good technology is unsuccessful as a product
For example, everyone wants reliability
Many people like group communication
But how much will they pay for it?
One metric: “as a fraction of their total software investment for the same machines”
Probably not more than 5-10%
Revenue stream may be too small to sustain healthy markets and product growth
Let’s get technical
A digression to illustrate both the potential for progress and the obstacles we confront!
Scalability: Achilles’ Heel of a Networked World?
1980’s: Client-server architectures
1 server, 10’s of simultaneous clients
1990’s: Web servers
Small server cluster in a data center or farm
1000’s of simultaneous clients
First decade of 2000?
Server “geoplex”: large farms in a WAN setting
10’s of 1000’s of simultaneous clients
Emergence of peer-to-peer applications: “live” collaboration and sharing of objects
Wireless clients could add another factor of 10 to client load
Technologies need to keep pace
We want predictable, stable performance, reliability, security… despite:
Large numbers of users
Large physical extent of the network
Increasing rates of infrastructure disruption (purely because of the growing span of the network)
Wide range of performance profiles
Growth in the actual volume of work applications are being asked to do
Scalable Publish-Subscribe
A popular paradigm; we’ll use it to illustrate our points
Used to link large numbers of information sources in commercial or military settings to even larger numbers of consumers
Track down the right servers
Updates in real time as data changes
Happens to be a top military priority, so one could imagine the government tackling it…
[Diagram: a publish-subscribe server cluster. Subjects are partitioned among the server sets; in this example there are four partitions: blue, green, yellow, and red. The server set and partition function can adjust dynamically. A publisher offers new events to a proxy server: like the subscribers, each publisher connects to the “best” proxy (or proxies) given its own location in the network, and the one selected must belong to the partition handling the subject of the event. Each event is logged, then published. A subscriber must identify the best servers, and since subjects are partitioned among servers, one subscriber may need multiple connections.]
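To make the partitioning concrete, here is a minimal sketch of how a client might map each subject to a partition and pick a proxy. The subject, partition, and server names are hypothetical; a real deployment would choose the “best” proxy by network location rather than simply taking the first one.

```python
import hashlib

# Four partitions, as in the diagram above (names illustrative).
PARTITIONS = ["blue", "green", "yellow", "red"]

# Each partition is served by a set of proxy servers (hypothetical names).
SERVERS = {
    "blue":   ["proxy-b1", "proxy-b2"],
    "green":  ["proxy-g1", "proxy-g2"],
    "yellow": ["proxy-y1"],
    "red":    ["proxy-r1", "proxy-r2"],
}

def partition_for(subject: str) -> str:
    """Deterministically assign a subject to a partition by hashing it."""
    digest = hashlib.sha1(subject.encode()).digest()
    return PARTITIONS[digest[0] % len(PARTITIONS)]

def proxies_for(subjects):
    """A subscriber with several subjects may need several connections."""
    return {s: SERVERS[partition_for(s)][0] for s in subjects}

print(proxies_for(["quotes.IBM", "quotes.GOOG", "radar.track.17"]))
```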
Large-scale applications with similar technical requirements
Restructured electric power grid
Large-scale financial applications
Disaster response
Community medical systems
Large-scale online information provision
Decentralized stock markets
Network monitoring and control
Poor Scalability
Long “rumored” for distributed computing technologies and tools
Famous study by Jim Gray points to scalability issues in distributed databases
Things that scale well:
Tend to be stateless or based on soft state
Have weak reliability semantics
Are loosely coupled
Do current technologies scale?

Category                                     | Typical large use                        | Limits?
Client-server / object-oriented environments | LAN system, perhaps 250 simultaneous clients | Server capacity limits scale
Web-like architectures                       | Internet, hundreds of clients            | No reliability guarantees
Publish-subscribe / group multicast          | About 50 receivers, 500 in hierarchies   | Throughput becomes unstable with scale; multicast storms
Many-many DSM                                | Rarely seen except in small clusters     | Update costs grow with cluster size
Shared database                              | Farm: 50-100; RACS: 100’s; RAPS: 10’s    | Few successes with rapidly changing real-time data
Recall the Stock Exchange Problem: Vsync multicast is too “fragile”
Most members are healthy…
… but one is slow
With 32 processes…
[Chart: virtually synchronous Ensemble multicast protocols. X-axis: perturb rate (0 to 0.9); Y-axis: average throughput on non-perturbed members (0 to 250). The “actual” curve falls far below the “ideal” one as the perturb rate grows.]
The problem got worse as the system scaled up
[Chart: the same experiment at group sizes 32, 64, and 96. X-axis: perturb rate; Y-axis: average throughput on non-perturbed members. Throughput collapses faster as the group size grows.]
Why doesn’t anything scale?
With weak semantics…
Faulty behavior may occur more often as system size increases (think “the Internet”)
With strong semantics…
We encounter a system-wide cost (e.g. membership reconfiguration, congestion control)
That can be triggered more often as a function of scale (more failures, more network “events”, bigger latencies)
Gray’s O(n²) database degradation reflects very similar issues… a new law of nature?
Serious issue for our scalable publish-subscribe technology
What if we build it for the military or some other critical use, and it works in the laboratory but not in the field?
Early evaluation has ruled out most off-the-shelf networking technologies
They just don’t have the necessary scalability!
In fact, this happened with the Navy’s Cooperative Engagement Capability (CEC)
They built it… but it melts down under stress!
Fight fire with fire!
Turn to randomized protocols…
… with probabilistic reliability goals
This overcomes the scalability problems just seen
Then think about how to “present” the mechanism to the user
Tools in our toolkit
Traditional deterministic tools:
Virtual synchrony: only in small groups
Paxos
Transactions
New-age probabilistically reliable ones (see the gossip sketch below):
Bimodal multicast
Astrolabe
DHTs
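To show why gossip-style protocols such as Bimodal Multicast scale, here is a minimal push-gossip simulation: each round, every node that has the message forwards it to a few peers chosen at random, so delivery probability climbs toward 1 without any global coordination. The parameters are illustrative, not taken from the actual protocol.

```python
import random

def gossip_rounds(n=1000, fanout=2, rounds=10, loss=0.05):
    """Simulate push gossip; return the fraction of nodes reached."""
    infected = {0}                           # node 0 initially has the message
    for _ in range(rounds):
        newly = set()
        for node in infected:
            for _ in range(fanout):          # each round, push to `fanout` peers
                peer = random.randrange(n)
                if random.random() > loss:   # gossip tolerates message loss
                    newly.add(peer)
        infected |= newly
    return len(infected) / n

# Delivery is probabilistic but, with high probability, nearly complete.
print(f"delivered to {gossip_rounds():.1%} of nodes")
```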
[Diagram, revisited: the same publish-subscribe server cluster as before. We can use Bimodal Multicast to disseminate events; the replication problem within each server partition looks like an instance of virtual synchrony; a subscriber can perhaps use Astrolabe to pick the best servers; and the publisher uses Astrolabe to identify the correct set of receivers before logging and publishing each event.]
Astrolabe manages configuration and connection parameters, and tracks system membership and state. Each site maintains a local table; SQL queries summarize them into a virtual table:

San Francisco
Name     | Load | Weblogic? | SMTP? | Word Version | …
swift    | 2.0  | 0         | 1     | 6.2          | …
falcon   | 1.5  | 1         | 0     | 4.1          | …
cardinal | 4.5  | 1         | 0     | 6.0          | …

New Jersey
Name     | Load | Weblogic? | SMTP? | Word Version | …
gazelle  | 1.7  | 0         | 0     | 4.5          | …
zebra    | 3.2  | 0         | 1     | 6.2          | …
gnu      | 0.5  | 1         | 0     | 6.2          | …

Virtual “summary” table (one row per site, each computed by an SQL query)
Name  | AvgLoad | WL contact   | SMTP contact
SF    | 2.6     | 123.45.61.3  | 123.45.61.17
NJ    | 1.8     | 127.16.77.6  | 127.16.77.11
Paris | 3.1     | 14.66.71.8   | 14.66.71.12
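As a sketch of how one site’s summary row could be derived, the snippet below runs an SQL aggregation over the San Francisco table using an in-memory SQLite database. This is only an illustration of the idea: table and column names come from the example data above, and the real Astrolabe evaluates such queries peer-to-peer rather than against a central database.

```python
import sqlite3

# One zone's local table (data from the San Francisco example above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sf (name TEXT, load REAL, weblogic INT, smtp INT)")
db.executemany("INSERT INTO sf VALUES (?,?,?,?)", [
    ("swift",    2.0, 0, 1),
    ("falcon",   1.5, 1, 0),
    ("cardinal", 4.5, 1, 0),
])

# Each zone summarizes itself with an aggregation query; the one-row
# results from all zones form the virtual "summary" table.
avg_load = db.execute("SELECT AVG(load) FROM sf").fetchone()[0]
wl_contact = db.execute(
    "SELECT name FROM sf WHERE weblogic = 1 ORDER BY load LIMIT 1"
).fetchone()[0]

# The slide shows contacts as IP addresses; we print the machine name here.
print(f"SF: AvgLoad={avg_load:.1f}, WL contact={wl_contact}")
```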
The combined technologies solve the initial problem!
A glimpse inside a data center
Pub-sub combined with point-to-point communication technologies like TCP
[Diagram: routers feed “front-end applications”, web sites, and web services; behind them sit many load-balanced (LB) service instances, plus legacy systems…]
Cornell: QuickSilver platform in a datacenter
[Diagram: query and update sources reach services hosted at data centers A and B, each with its own server pool, but accessible system-wide.]
To send a query, a client needs a way to “map” to the appropriate partition of the target service and then to locate a suitable representative of the appropriate cluster.
To send an update, we not only need to find the cluster, but also to initiate some form of replication protocol: a multicast, chain update, 1SR transaction, etc.
Notice the potentially huge number of replication “groups”: the selected technology must not only be fault-tolerant and fast, but it also needs to scale in the number of distribution patterns… a dimension as yet unexplored by the research community and overlooked in most products!
System administrators will need a way to monitor the state of all these services. This hierarchical database is a good match with Astrolabe, an example of a P2P solution Cornell has been exploring. They also need a way to update various control parameters at what may be tens of thousands of locations. The resulting “scalable” reliable multicast problem is also one Cornell has looked at recently.
The best hope for dealing with legacy components is to somehow “wrap” them in a software layer designed to integrate them with the monitoring and control infrastructure and bring autonomic benefits to bear on them where practical. By intercepting inputs or replicating checkpoints, we may be able to harden these to some degree.
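As one illustration of the replication options named above, here is a minimal chain-update sketch: the update enters at the head replica, flows down the chain, and is acknowledged only once the tail has applied it. Replica names are hypothetical, and a production chain would also handle failure detection and reconfiguration.

```python
class Replica:
    """One node in a replication chain (illustrative sketch only)."""
    def __init__(self, name, successor=None):
        self.name, self.successor, self.store = name, successor, {}

    def apply(self, key, value):
        self.store[key] = value                 # apply the update locally
        if self.successor:                      # then forward it downstream
            return self.successor.apply(key, value)
        return f"ack from tail {self.name}"     # tail acknowledges the update

# Build a three-replica chain: r1 (head) -> r2 -> r3 (tail).
tail = Replica("r3")
mid = Replica("r2", successor=tail)
head = Replica("r1", successor=mid)

# The ack returns only after all three replicas hold the update.
print(head.apply("position.track17", (42.1, -76.5)))
```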
Good things?
We seem to have technologies that can overcome Internet limitations using randomized P2P gossip
However, Internet routing can “defeat” our clever solutions unless we know the network topology
These have great scalability and can survive under stress
And both are backed by formal models as well as real code and experimental data
Indeed, the analysis is “robust” too!
Bad things?
These are middleware, and the bottom line is that only MSFT can sell middleware!
Current commercial slump doesn’t help; nobody is buying anything
Indeed, while everything else advances at “Internet speed”… the Internet architecture has somehow gotten stuck circa 1985!
Is this an instance of a market failure?
The modern Internet: Unsafe at any speed?
The Internet “policy”
Assumes almost everything uses TCP
TCP is designed to be greedy: ratchet bandwidth up until congestion occurs
Routers are designed to drop packets
They use RED (Random Early Detection): throw away packets at random until TCP gets the point and slows down (see the sketch below)
Our problem? We’re not running TCP, and this policy penalizes us, although it “works” for TCP…
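A sketch of the RED idea just described: as the average queue length grows between a low and a high threshold, a router drops arriving packets with rising probability, signalling TCP senders to back off before the queue overflows. The threshold values here are examples, not standard settings.

```python
import random

MIN_TH, MAX_TH, MAX_P = 20, 80, 0.1   # illustrative RED parameters

def red_drop(avg_queue_len: float) -> bool:
    """Decide whether to drop an arriving packet under RED."""
    if avg_queue_len < MIN_TH:
        return False                   # queue short: never drop
    if avg_queue_len >= MAX_TH:
        return True                    # queue long: always drop
    # In between, drop probability rises linearly toward MAX_P.
    p = MAX_P * (avg_queue_len - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p

print([red_drop(q) for q in (10, 50, 90)])
```

Note that this feedback loop assumes senders react to loss the way TCP does, which is exactly why non-TCP protocols like ours are penalized by it.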
Internet itself: Main weak point
Our hardest open problems arise in the Internet
Astrolabe and Bimodal Multicast don’t do much for security
They need to know the network topology… but the Internet conceals this information
We could perhaps use these tools to detect and react to a DoS attack at the application layer, but in fact such an attack can only be stopped in the network itself
Butler Lampson: “The Internet and the Web are successful precisely because they don’t need to work very well to succeed”
The Internet got stuck in 1985
Critical Infrastructure Protection is hostage to a perception that the Internet is perfect!
Must somehow recapture the enthusiasm of the field and the commercial sector for evolution and change
Scalability: building massive systems that work really well and yet make full use of COTS
Awesome performance, even under stress
Better Internet: time for a “Supernet”?
Lagging public interest
An extremely serious problem
The Internet boomed… then it melted down. And we’re Internet people
Even worse in the CIP area
We predicted disaster in 1996… 1999… 2000: cyberterrorists… “Internet will melt down”
We’re the people who keep crying wolf
Realistically, we can’t fight this perception
Argues that CIP success will have to come from other pressures, not a direct public clamor!
A missing “pipeline”
[Diagram: a research-to-product pipeline. Long-term research (researchers at Cornell and SRI) addresses fundamental questions on a 10-year horizon and yields new practical options 5 years from products; industry stakeholders (e.g. developers at the Electric Power Research Institute) are ready to apply good ideas in real settings; companies are interested in ideas for new products. Basic needs flow up from research, practical needs from stakeholders, and COTS solutions from vendors.]
Best hope?
Government must work with all three communities: CIP stakeholders, researchers, vendors
A tricky role: consider the MSFT initiative on security
Will MSFT trigger a wave of commercial products? Or will the 800 lb gorilla just crush the whole market?
Reexamine the legal basis for “hold harmless” clauses that indemnify software vendors against damages even if products are defective through outright negligence
Growing need for military and homeland defense helps
But need to balance against the understandable inclination to keep such programs “black”
Conclusions
CIP is hostage to complacency as an undramatic threat slowly grows!
Nationally critical infrastructure is exposed to security and reliability problems, and this exposure is growing, yet is largely ignored
Research effort has contracted around an overly theoretical security community
Current trend is a recipe for economic stagnation: inadequate technology blocks new markets