A Measurement Based Memory Performance Evaluation of High
Throughput Servers
Garba Isa Yau
Department of Computer Engineering
King Fahd University of Petroleum & Minerals
Dhahran, Saudi Arabia
April 14, 2003
Motivation
• CPU–memory speed gap: CPU speed doubles in about 18 months (Moore’s Law), while memory access time improves by only about one-third in 10 years
• A hierarchical memory architecture was introduced to alleviate the CPU–memory speed gap
It works on locality of reference in the data: temporal locality and spatial locality
• Network bandwidth has improved significantly: gigabit-per-second links are already deployed on LANs
NICs operate at up to 10 Gbps, and Ethernet switches are also available in that range
Motivation
• Do all applications benefit from the memory hierarchy? Some data have poor temporal locality (continuous data); the working set might be too large to fit into the cache even if the data have good spatial locality; some data are never reused
• For applications using these types of data, the hierarchical memory architecture becomes ineffective
SO WHERE EXACTLY IS THE BOTTLENECK?
Memory Access
Streaming Media Servers
• Streaming media content is continuous data: the working set is normally large and cannot fit into the cache, and it has very poor temporal locality (data reuse is poor)
• A typical streaming media transaction scenario (RTSP):
client ↔ server: RTSP
client ↔ server: RTP
client ↔ server: RTCP
• The transaction has: stringent timing requirements, high bandwidth requirements, CPU-intensive processing, and high memory requirements
[Diagram: disk → memory (data block) → CPU cache → data with RTP header → TCP/IP stack → IP packet with RTP payload → client]
Typical data flow in streaming using RTP
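A minimal sketch of this copy path, assuming a hand-rolled 12-byte RTP header, a fixed 1400-byte payload, and an already-created UDP socket (these names and sizes are illustrative assumptions, not details from the slides; real servers such as dss and wms do far more):

```c
/* Sketch of the streaming send path: each loop iteration performs the copies in
 * the diagram above: disk block into a user buffer, RTP header prepended in
 * user space, then a copy into the kernel TCP/IP stack via sendto(). */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PAYLOAD 1400                        /* keep packets under the Ethernet MTU */

struct rtp_hdr {                            /* 12-byte fixed RTP header (RFC 3550) */
    uint8_t  vpxcc, mpt;
    uint16_t seq;
    uint32_t ts, ssrc;
};

int stream_file(const char *path, int sock, const struct sockaddr_in *dst)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    uint8_t pkt[sizeof(struct rtp_hdr) + PAYLOAD];
    struct rtp_hdr hdr = { .vpxcc = 0x80, .mpt = 96, .ssrc = htonl(0x1234) };
    uint16_t seq = 0;
    ssize_t n;

    /* copy 1: disk block -> user buffer (through the page cache) */
    while ((n = read(fd, pkt + sizeof hdr, PAYLOAD)) > 0) {
        hdr.seq = htons(seq++);
        hdr.ts  = htonl((uint32_t)seq * 3600u);   /* toy timestamp increment */
        memcpy(pkt, &hdr, sizeof hdr);            /* copy 2: prepend RTP header */
        /* copy 3: user buffer -> kernel TCP/IP stack (UDP send) */
        sendto(sock, pkt, sizeof hdr + (size_t)n, 0,
               (const struct sockaddr *)dst, sizeof *dst);
    }
    close(fd);
    return 0;
}
```

Since every byte of media crosses the CPU at least once on this path, per-stream cache and memory traffic grows with the encoding rate, which is what the streaming measurements later exercise.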
Memory Access
Web Servers
• Web content is normally a set of small files that make up a web document
The working set is normally composed of small files (average aggregate size is about 10 KB), with poor temporal locality and little or no data reuse
• Web transaction (HTTP)
client ↔ server: HTTP
[Diagram: disk → memory (data block) → CPU cache → data with HTTP header → TCP/IP stack → IP packet with HTTP payload → client]
Typical data flow in HTTP transaction
• The transaction has: relaxed timing requirements, but also high bandwidth requirements and a high connection rate (connections are established and torn down within a short time with HTTP/1.0)
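As a hedged illustration of the copy cost per transaction, the sketch below contrasts the classic read()/write() loop with Linux sendfile(), which lets the kernel move page-cache data straight to the socket; the function serve_file() and the 8 KB buffer are assumptions for illustration, not part of the measured servers:

```c
/* Serving one static file on an HTTP/1.0-style connection: classic buffered copy
 * path versus the sendfile() path that skips the user-space copy. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int serve_file(int conn, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    char hdr[128];
    int hlen = snprintf(hdr, sizeof hdr,
                        "HTTP/1.0 200 OK\r\nContent-Length: %ld\r\n\r\n",
                        (long)st.st_size);
    write(conn, hdr, hlen);                 /* small header: one copy into the stack */

#ifdef USE_SENDFILE
    /* kernel moves data from the page cache directly to the socket buffers */
    off_t off = 0;
    while (off < st.st_size)
        if (sendfile(conn, fd, &off, (size_t)(st.st_size - off)) <= 0)
            break;
#else
    /* classic path: disk -> page cache -> user buffer -> socket buffer */
    char buf[8192];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(conn, buf, (size_t)n);
#endif
    close(fd);
    return 0;
}
```

For the small files that dominate web working sets, the per-connection setup and teardown cost dwarfs either copy path, which is the effect seen in the transaction-rate results later.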
Memory Access
IP Forwarding
• IP packets are generally small (the maximum IP packet size is 65,535 bytes). Due to datagram fragmentation at routers, packets are typically no larger than the link MTU (about 1.5 KB on Ethernet).
Packets are just forwarded; no data associated with any packet is reused. Apart from the need for high speed, no strict timing needs to be maintained. At high throughput, a lot of memory copying is still involved: a lot of data (IP headers) is moved into the cache for processing.
[Diagram: incoming packet → memory → packet header → CPU cache → TCP/IP stack → outgoing packet]
Typical data flow in IP forwarding
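A minimal sketch of the per-packet work on this path, assuming a plain struct iphdr and ignoring the routing-table lookup; the incremental checksum update mirrors the Linux kernel's ip_decrease_ttl() helper, but this is an illustration, not the kernel's actual ip_forward() code:

```c
/* Only the IP header is touched when forwarding: decrement TTL and patch the
 * header checksum incrementally (RFC 1624); the payload is never processed. */
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <stdint.h>

/* Returns 1 if the packet can be forwarded, 0 if the TTL has expired. */
int forward_header(struct iphdr *ip)
{
    if (ip->ttl <= 1)
        return 0;                     /* would normally trigger ICMP Time Exceeded */

    ip->ttl--;

    /* Adding 0x0100 to the checksum compensates for the one-lower TTL byte. */
    uint32_t check = ip->check;
    check += htons(0x0100);
    ip->check = (uint16_t)(check + (check >= 0xFFFF));   /* fold the carry */

    /* A real forwarder would now look up the next hop using ip->daddr only;
     * the packet payload stays where it is in memory. */
    return 1;
}
```

This is why the memory traffic per forwarded packet is small and roughly constant, while per-packet interrupt and context-switch costs dominate, as the forwarding results later show.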
Server Platform
• Pentium 4 processor (2.0 GHz): 8 KB L1 cache, 512 KB L2 cache
• Peripherals: 1 Gbps NIC, 40 GB EIDE hard drive (Western Digital WD400); main memory: 256 MB
• Operating systems: Red Hat Linux 7.2 (kernel 2.4.7-10) and Windows 2000 Server
• Network (LAN): 1 Gbps Layer 2 switch
Memory Transfer Test
• ECT (extended copy transfer)
Characterizes memory performance in order to observe the impact of the OS on memory performance
[Figure: Memory bandwidth (Mbytes/sec) vs. block size (working set), for Linux and Windows]
• Locality of reference: temporal locality is exercised by varying the working-set size (block size); spatial locality by varying the access pattern (strides)
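A minimal sketch of such a test, assuming a simple byte-touch loop rather than the actual ECT benchmark; the block sizes, strides, and repetition counts are illustrative only:

```c
/* Vary the working-set size (temporal locality) and the stride (spatial
 * locality) and report how fast the touched bytes are delivered. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double run_pass(const char *buf, size_t block, size_t stride, int reps)
{
    struct timespec t0, t1;
    volatile char sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)                 /* repetition -> temporal reuse */
        for (size_t i = 0; i < block; i += stride) /* stride -> spatial pattern   */
            sink ^= buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)reps * (double)(block / stride);  /* bytes touched */
    return bytes / sec / 1e6;                                 /* MB/s touched  */
}

int main(void)
{
    size_t max = 64u * 1024 * 1024;
    char *buf = malloc(max);
    for (size_t i = 0; i < max; i++)
        buf[i] = (char)i;

    for (size_t block = 4 * 1024; block <= max; block *= 4)   /* working set */
        for (size_t stride = 1; stride <= 512; stride *= 8)   /* access pattern */
            printf("block=%8zu stride=%4zu  %9.1f MB/s\n",
                   block, stride, run_pass(buf, block, stride, 16));
    free(buf);
    return 0;
}
```

Once the block no longer fits in the L1 or L2 cache, or the stride wastes most of each cache line, the delivered bandwidth drops, which is the shape of the curves in the figure above.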
Performance of streaming media servers
• L1 Cache Performance
[Figures: L1 cache misses (millions) vs. number of streams (clients), at 56 kbps and 300 kbps encoding rates, for dss and wms servers serving unique and multiple media content]
• L1 cache misses are mostly influenced by the number of streams
• Worst-case performance occurs when the number of streams is high, with a 300 kbps encoding rate and clients requesting multiple media contents
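The slides do not name the counter tool used for these figures; as a hedged modern-Linux illustration only (the perf_event_open interface post-dates this 2003 study), L1 data-cache misses around a workload could be counted like this:

```c
/* Count L1 data-cache read misses for this process (and its children) while a
 * workload runs between the ENABLE and DISABLE calls. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size     = sizeof attr;
    attr.type     = PERF_TYPE_HW_CACHE;
    attr.config   = PERF_COUNT_HW_CACHE_L1D |
                    (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                    (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.inherit  = 1;                       /* include forked worker processes */

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0 /* self */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run or exec the server workload to be measured here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof misses);
    printf("L1D read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```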
• Memory Performance and throughput
[Figure: Page fault rate (page faults/sec) vs. number of streams (clients) at 300 kbps, for dss and wms serving unique and multiple media content]
• Requests for a unique media object do not incur many page faults, since the object can easily be served from memory
• Requests for multiple objects lead to a high page fault rate, since many data blocks have to be fetched from disk
• A high page fault rate leads to client timeouts due to long delays
[Figure: Throughput (kbps) vs. number of streams (clients) at 300 kbps, for dss and wms serving unique and multiple media content]
Performance of Web servers
• Transactions and Throughput
[Figures: Transactions/sec and throughput (Mbytes/sec) vs. file size (5 B to 50 MB), for Apache and IIS with 1 and 400 clients]
• Smaller files are transferred within a short time, hence connections are established and released at a high rate
• For larger files, throughput is high even though transactions/sec is low (fewer connections are made)
Performance of Web servers
• Cache Performance
[Figures: L1 and L2 cache misses (millions) vs. file size (5 B to 5 MB), for Apache and IIS with 1 and 400 clients]
• L1 and L2 cache performance is poor when the document size is small. WHY?
Performance of Web servers
• Page Fault and Latency
[Figures: Page fault rate (page faults/sec) and latency (sec) vs. file size (5 B to 50 MB), for Apache and IIS with 1 and 400 clients]
• Unlike a small file, a large file has to be fetched from disk continuously, leading to more page faults
• Large files significantly increase the average latency of the server; clients that wait too long may time out
Performance of IP forwarding
Experimental setup
[Diagram: Experimental setup with a Linux router/server, four NIC cards, and router clients]
• Routing (creating and updating the routing table) is done by ‘routed’
• IP forwarding is performed in Linux kernel space
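For reference, the kernel forwarding path is switched on from user space through the procfs sysctl entry; the sketch below is an assumed illustration, equivalent to echo 1 > /proc/sys/net/ipv4/ip_forward, and must run as root:

```c
/* Enable kernel-space IP forwarding by writing "1" to the procfs sysctl file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/sys/net/ipv4/ip_forward", O_WRONLY);
    if (fd < 0) { perror("open ip_forward"); return 1; }

    if (write(fd, "1\n", 2) != 2) { perror("write"); close(fd); return 1; }

    close(fd);
    puts("IP forwarding enabled: packets now take the kernel forwarding path");
    return 0;
}
```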
Performance of IP forwarding
Routing configuration
[Diagrams: host connection topologies for the four routing configurations listed below]
• Configurations 1 and 2: 1-1 communication (simplex and duplex)
• Configurations 3 and 4: double 1-1 communication (simplex and duplex)
• Configurations 5 and 6: 1-4 communication (simplex and duplex)
• Configurations 7 and 8: ring communication (simplex and duplex)
Performance of IP forwarding
• Bandwidth
[Figures: Bandwidth (Mbits/sec) on Ethernet interfaces 0-3 vs. configuration (1-8), for 64-byte, 10 KB, and 64 KB packets]
Performance of IP forwarding
• Maximum bandwidth: 449 Mbps at configuration 2, with only two NICs involved in the router; CPU utilization (system) was a mere 19.04%, context switching 1312/sec (only two NICs switched), and active pages 1006.48 (the highest observed)
• A very small packet size (64 bytes) degrades performance and accounts for the highest context-switching rate
• The fairly uniform distribution of active-page figures indicates that memory activity is not very intensive
Performance of IP forwarding
• Other metrics
[Figures: CPU utilization (%), context switches/sec, and number of active pages vs. configuration (1-8), for 64-byte, 10 KB, and 64 KB packets]
Conclusion
Streaming servers:
• Performance is highly degraded due to cache misses and page faults
• Streaming uses continuous data with a large working set and poor temporal locality (little or no data reuse)
Web servers:
• A small working set does not help much, as frequent connection setup and teardown degrades performance significantly
• When the document is large, the server delay becomes unacceptably high, leading to client timeouts
• A large document size also leads to a high page fault rate
IP forwarding:
• Memory performance is not the main factor in the overall performance of IP forwarding in the Linux kernel
• Context-switching overhead is highly significant and a key factor in performance degradation: the more interfaces involved in forwarding packets, the more contention for resources (bus contention)
• All CPU activity (kernel space only) is below 100%; if bus contention were resolved, more throughput could be obtained