A Measurement Based Memory Performance Evaluation of High
Throughput Servers
Garba Isa Yau
Department of Computer Engineering
King Fahd University of Petroleum & Minerals
Dhahran, Saudi Arabia
April 14, 2003
Motivation
• CPU–memory speed gap: CPU speed doubles in about 18 months (Moore’s Law), while memory access time improves by only about one-third in 10 years
• A hierarchical memory architecture was introduced to alleviate the CPU–memory speed gap
It works on locality of reference in the data: temporal locality and spatial locality
• Network bandwidth has improved significantly: gigabit-per-second links are already deployed on LANs
NICs operate at up to 10 Gbps, and Ethernet switches are also available in that range
Motivation
• Do all applications benefit from the memory hierarchy? Some data have poor temporal locality (continuous data); the working set might be too large to fit into the cache even if the data have good spatial locality; some data are never reused
• For applications using these types of data, the hierarchical memory architecture becomes ineffective
SO WHERE EXACTLY IS THE BOTTLENECK?
Memory Access
Streaming Media Servers
• Streaming media content is continuous data: the working set is normally large and cannot fit into the cache, and it has very poor temporal locality (data reuse is poor)
• A typical streaming media transaction scenario (RTSP):
client ↔ server: RTSP
client ↔ server: RTP
client ↔ server: RTCP
• The transaction has: stringent timing requirements, high bandwidth requirements, CPU-intensive processing, and high memory requirements
[Diagram: disk → memory (data block) → CPU cache → data with RTP header → TCP/IP stack → IP packet with RTP payload → client]
Typical data flow in streaming using RTP
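A minimal sketch of this copy path, assuming a hand-rolled 12-byte RTP header, a fixed 1400-byte payload, and an already-created UDP socket (these names and sizes are illustrative assumptions, not details from the slides; real servers such as dss and wms do far more):

```c
/* Sketch of the streaming send path: each loop iteration performs the copies in
 * the diagram above: disk block into a user buffer, RTP header prepended in
 * user space, then a copy into the kernel TCP/IP stack via sendto(). */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PAYLOAD 1400                        /* keep packets under the Ethernet MTU */

struct rtp_hdr {                            /* 12-byte fixed RTP header (RFC 3550) */
    uint8_t  vpxcc, mpt;
    uint16_t seq;
    uint32_t ts, ssrc;
};

int stream_file(const char *path, int sock, const struct sockaddr_in *dst)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    uint8_t pkt[sizeof(struct rtp_hdr) + PAYLOAD];
    struct rtp_hdr hdr = { .vpxcc = 0x80, .mpt = 96, .ssrc = htonl(0x1234) };
    uint16_t seq = 0;
    ssize_t n;

    /* copy 1: disk block -> user buffer (through the page cache) */
    while ((n = read(fd, pkt + sizeof hdr, PAYLOAD)) > 0) {
        hdr.seq = htons(seq++);
        hdr.ts  = htonl((uint32_t)seq * 3600u);   /* toy timestamp increment */
        memcpy(pkt, &hdr, sizeof hdr);            /* copy 2: prepend RTP header */
        /* copy 3: user buffer -> kernel TCP/IP stack (UDP send) */
        sendto(sock, pkt, sizeof hdr + (size_t)n, 0,
               (const struct sockaddr *)dst, sizeof *dst);
    }
    close(fd);
    return 0;
}
```

Since every byte of media crosses the CPU at least once on this path, per-stream cache and memory traffic grows with the encoding rate, which is what the streaming measurements later exercise.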
Memory Access
Web Servers
• Web content is normally a set of small files that make up a web document
The working set is normally composed of small files (average aggregate size is about 10 KB), with poor temporal locality and little or no data reuse
• Web transaction (HTTP)
client ↔ server: HTTP
[Diagram: disk → memory (data block) → CPU cache → data with HTTP header → TCP/IP stack → IP packet with HTTP payload → client]
Typical data flow in HTTP transaction
• The transaction has: relaxed timing requirements, but also high bandwidth requirements and a high connection rate (connections are established and torn down within a short time with HTTP/1.0)
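As a hedged illustration of the copy cost per transaction, the sketch below contrasts the classic read()/write() loop with Linux sendfile(), which lets the kernel move page-cache data straight to the socket; the function serve_file() and the 8 KB buffer are assumptions for illustration, not part of the measured servers:

```c
/* Serving one static file on an HTTP/1.0-style connection: classic buffered copy
 * path versus the sendfile() path that skips the user-space copy. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int serve_file(int conn, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    char hdr[128];
    int hlen = snprintf(hdr, sizeof hdr,
                        "HTTP/1.0 200 OK\r\nContent-Length: %ld\r\n\r\n",
                        (long)st.st_size);
    write(conn, hdr, hlen);                 /* small header: one copy into the stack */

#ifdef USE_SENDFILE
    /* kernel moves data from the page cache directly to the socket buffers */
    off_t off = 0;
    while (off < st.st_size)
        if (sendfile(conn, fd, &off, (size_t)(st.st_size - off)) <= 0)
            break;
#else
    /* classic path: disk -> page cache -> user buffer -> socket buffer */
    char buf[8192];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(conn, buf, (size_t)n);
#endif
    close(fd);
    return 0;
}
```

For the small files that dominate web working sets, the per-connection setup and teardown cost dwarfs either copy path, which is the effect seen in the transaction-rate results later.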
Memory Access
IP Forwarding
• IP packets are generally small (the maximum IP packet size is 65,535 bytes). Due to datagram fragmentation at routers, packets are typically no larger than the link MTU (about 1.5 KB on Ethernet).
Packets are just forwarded; no data associated with any packet is reused. Apart from the need for high speed, no strict timing needs to be maintained. At high throughput, a lot of memory copying is still involved: a lot of data (IP headers) is moved into the cache for processing.
[Diagram: incoming packet → memory → packet header → CPU cache → TCP/IP stack → outgoing packet]
Typical data flow in IP forwarding
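A minimal sketch of the per-packet work on this path, assuming a plain struct iphdr and ignoring the routing-table lookup; the incremental checksum update mirrors the Linux kernel's ip_decrease_ttl() helper, but this is an illustration, not the kernel's actual ip_forward() code:

```c
/* Only the IP header is touched when forwarding: decrement TTL and patch the
 * header checksum incrementally (RFC 1624); the payload is never processed. */
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <stdint.h>

/* Returns 1 if the packet can be forwarded, 0 if the TTL has expired. */
int forward_header(struct iphdr *ip)
{
    if (ip->ttl <= 1)
        return 0;                     /* would normally trigger ICMP Time Exceeded */

    ip->ttl--;

    /* Adding 0x0100 to the checksum compensates for the one-lower TTL byte. */
    uint32_t check = ip->check;
    check += htons(0x0100);
    ip->check = (uint16_t)(check + (check >= 0xFFFF));   /* fold the carry */

    /* A real forwarder would now look up the next hop using ip->daddr only;
     * the packet payload stays where it is in memory. */
    return 1;
}
```

This is why the memory traffic per forwarded packet is small and roughly constant, while per-packet interrupt and context-switch costs dominate, as the forwarding results later show.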
Server Platform
• Pentium 4 processor (2.0 GHz): 8 KB L1 cache, 512 KB L2 cache
• Peripherals: 1 Gbps NIC, 40 GB EIDE hard drive (Western Digital WD400); main memory: 256 MB
• Operating systems: Red Hat Linux 7.2 (kernel 2.4.7-10) and Windows 2000 Server
• Network (LAN): 1 Gbps Layer 2 switch
Memory Transfer Test
• ECT (extended copy transfer)
Characterizes memory performance in order to observe the impact of the OS on memory performance
[Figure: Memory bandwidth (Mbytes/sec) vs. block size (working set), for Linux and Windows]
• Locality of reference: temporal locality is exercised by varying the working-set size (block size); spatial locality by varying the access pattern (strides)
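A minimal sketch of such a test, assuming a simple byte-touch loop rather than the actual ECT benchmark; the block sizes, strides, and repetition counts are illustrative only:

```c
/* Vary the working-set size (temporal locality) and the stride (spatial
 * locality) and report how fast the touched bytes are delivered. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double run_pass(const char *buf, size_t block, size_t stride, int reps)
{
    struct timespec t0, t1;
    volatile char sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)                 /* repetition -> temporal reuse */
        for (size_t i = 0; i < block; i += stride) /* stride -> spatial pattern   */
            sink ^= buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = (double)reps * (double)(block / stride);  /* bytes touched */
    return bytes / sec / 1e6;                                 /* MB/s touched  */
}

int main(void)
{
    size_t max = 64u * 1024 * 1024;
    char *buf = malloc(max);
    for (size_t i = 0; i < max; i++)
        buf[i] = (char)i;

    for (size_t block = 4 * 1024; block <= max; block *= 4)   /* working set */
        for (size_t stride = 1; stride <= 512; stride *= 8)   /* access pattern */
            printf("block=%8zu stride=%4zu  %9.1f MB/s\n",
                   block, stride, run_pass(buf, block, stride, 16));
    free(buf);
    return 0;
}
```

Once the block no longer fits in the L1 or L2 cache, or the stride wastes most of each cache line, the delivered bandwidth drops, which is the shape of the curves in the figure above.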
Performance of streaming media servers
• L1 Cache Performance
[Figures: L1 cache misses (millions) vs. number of streams (clients), at 56 kbps and 300 kbps encoding rates, for dss and wms servers serving unique and multiple media content]
• L1 cache misses are mostly influenced by the number of streams
• Worst-case performance occurs when the number of streams is high, with a 300 kbps encoding rate and clients requesting multiple media contents
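The slides do not name the counter tool used for these figures; as a hedged modern-Linux illustration only (the perf_event_open interface post-dates this 2003 study), L1 data-cache misses around a workload could be counted like this:

```c
/* Count L1 data-cache read misses for this process (and its children) while a
 * workload runs between the ENABLE and DISABLE calls. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size     = sizeof attr;
    attr.type     = PERF_TYPE_HW_CACHE;
    attr.config   = PERF_COUNT_HW_CACHE_L1D |
                    (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                    (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.inherit  = 1;                       /* include forked worker processes */

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0 /* self */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run or exec the server workload to be measured here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof misses);
    printf("L1D read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```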
• Memory Performance and throughput
[Figure: Page fault rate (page faults/sec) vs. number of streams (clients) at 300 kbps, for dss and wms serving unique and multiple media content]
• Requests for a unique media object do not incur many page faults, since the object can easily be served from memory
• Requests for multiple objects lead to a high page fault rate, since many data blocks have to be fetched from disk
• A high page fault rate leads to client timeouts due to long delays
[Figure: Throughput (kbps) vs. number of streams (clients) at 300 kbps, for dss and wms serving unique and multiple media content]
Performance of Web servers
• Transactions and Throughput
[Figures: Transactions/sec and throughput (Mbytes/sec) vs. file size (5 B to 50 MB), for Apache and IIS with 1 and 400 clients]
• Smaller files are transferred within a short time, hence connections are established and released at a high rate
• For larger files, throughput is high even though transactions/sec is low (fewer connections are made)
Performance of Web servers
• Cache Performance
[Figures: L1 and L2 cache misses (millions) vs. file size (5 B to 5 MB), for Apache and IIS with 1 and 400 clients]
• L1 and L2 cache performance is poor when the document size is small. WHY?
Performance of Web servers
• Page Fault and Latency
[Figures: Page fault rate (page faults/sec) and latency (sec) vs. file size (5 B to 50 MB), for Apache and IIS with 1 and 400 clients]
• Unlike a small file, a large file has to be fetched from disk continuously, leading to more page faults
• Large files significantly increase the average latency of the server; clients that wait too long may time out
Performance of IP forwarding
Experimental setup
[Diagram: Experimental setup with a Linux router/server, four NIC cards, and router clients]
• Routing (creating and updating the routing table) is done by ‘routed’
• IP forwarding is performed in Linux kernel space
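For reference, the kernel forwarding path is switched on from user space through the procfs sysctl entry; the sketch below is an assumed illustration, equivalent to echo 1 > /proc/sys/net/ipv4/ip_forward, and must run as root:

```c
/* Enable kernel-space IP forwarding by writing "1" to the procfs sysctl file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/sys/net/ipv4/ip_forward", O_WRONLY);
    if (fd < 0) { perror("open ip_forward"); return 1; }

    if (write(fd, "1\n", 2) != 2) { perror("write"); close(fd); return 1; }

    close(fd);
    puts("IP forwarding enabled: packets now take the kernel forwarding path");
    return 0;
}
```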
Performance of IP forwarding
Routing configuration
[Diagrams: host connection topologies for the four routing configurations listed below]
• Configurations 1 and 2: 1-1 communication (simplex and duplex)
• Configurations 3 and 4: double 1-1 communication (simplex and duplex)
• Configurations 5 and 6: 1-4 communication (simplex and duplex)
• Configurations 7 and 8: ring communication (simplex and duplex)
Performance of IP forwarding
• Bandwidth
[Figures: Bandwidth (Mbits/sec) on Ethernet interfaces 0-3 vs. configuration (1-8), for 64-byte, 10 KB, and 64 KB packets]
Performance of IP forwarding
• Maximum bandwidth: 449 Mbps at configuration 2, with only two NICs involved in the router; CPU utilization (system) was a mere 19.04%, context switching 1312/sec (only two NICs switched), and active pages 1006.48 (the highest observed)
• A very small packet size (64 bytes) degrades performance and accounts for the highest context-switching rate
• The fairly uniform distribution of active-page figures indicates that memory activity is not very intensive
Performance of IP forwarding
• Other metrics
[Figures: CPU utilization (%), context switches/sec, and number of active pages vs. configuration (1-8), for 64-byte, 10 KB, and 64 KB packets]
Conclusion
Streaming servers:
• Performance is highly degraded due to cache misses and page faults
• Streaming uses continuous data with a large working set and poor temporal locality (little or no data reuse)
Web servers:
• A small working set does not help much, as frequent connection setup and teardown degrades performance significantly
• When the document is large, the server delay becomes unacceptably high, leading to client timeouts
• A large document size also leads to a high page fault rate
IP forwarding:
• Memory performance is not the main factor in the overall performance of IP forwarding in the Linux kernel
• Context-switching overhead is highly significant and a key factor in performance degradation: the more interfaces involved in forwarding packets, the more contention for resources (bus contention)
• All CPU activity (kernel space only) is below 100%; if bus contention were resolved, more throughput could be obtained