Chapter 8: Storage, Networks, and Other Peripherals

Medical Image Processing Lab., Chuan-Yu Chang Ph.D.

Instructor: Chuan-Yu Chang Ph.D.

E-mail: [email protected]

Tel: (05)5342601 ext. 4337


Interfacing Processors and Peripherals

• I/O design is affected by many factors (expandability, resilience)
• I/O system performance is more complex to assess than CPU performance:
– Some devices emphasize access latency
– Some devices emphasize throughput
• I/O system performance depends on many aspects of the system:
– the connection between devices and the system
– the memory hierarchy
– the operating system


Example

– Suppose we have a benchmark that executes in 100 seconds of elapsed time, where 90 seconds is CPU time and the rest is I/O time. If CPU time improves by 50 % per year for the next five years but I/O time doesn’t improve, how much faster will our program run at the end of five years?

– Solution: elapsed time = CPU time + I/O time, so 100 = 90 + I/O time and I/O time = 10 seconds.

After n years   CPU time (s)    I/O time (s)   Elapsed time (s)   % I/O time
0               90              10             100                10%
1               90/1.5 = 60     10             70                 10/70 = 14%
2               60/1.5 = 40     10             50                 10/50 = 20%
3               40/1.5 = 27     10             37                 10/37 = 27%
4               27/1.5 = 18     10             28                 10/28 = 36%
5               18/1.5 = 12     10             22                 10/22 = 45%

After five years, CPU performance has improved by 90/12 = 7.5 times, but elapsed time has improved by only 100/22 = 4.5 times, and the share of elapsed time spent on I/O has grown from 10% to 45%.
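The year-by-year figures above can be reproduced with a short calculation. The following is a minimal Python sketch; the function name `io_share_over_time` and its parameters are illustrative assumptions, not part of the original slides. It carries unrounded CPU times forward, so the later rows come out slightly different from the slide values, which round each year to a whole second.

```python
def io_share_over_time(cpu_time, io_time, yearly_speedup, years):
    """Return (year, cpu_time, elapsed, io_fraction) rows.

    CPU time shrinks by `yearly_speedup` each year; I/O time stays fixed.
    """
    rows = []
    for year in range(years + 1):
        elapsed = cpu_time + io_time
        rows.append((year, cpu_time, elapsed, io_time / elapsed))
        cpu_time /= yearly_speedup
    return rows


rows = io_share_over_time(cpu_time=90.0, io_time=10.0, yearly_speedup=1.5, years=5)
for year, cpu, elapsed, io_frac in rows:
    print(f"year {year}: CPU {cpu:5.1f} s, elapsed {elapsed:5.1f} s, I/O {io_frac:.0%}")

# Overall improvement in elapsed time after five years.
speedup = rows[0][2] / rows[-1][2]
print(f"elapsed-time speedup: {speedup:.1f}x")
```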


Type and characteristics of I/O Devices

• I/O devices vary widely, but three characteristics are useful for classifying them:
– behavior (i.e., input vs. output)
– partner (who is at the other end?)
– data rate

Device             Behavior          Partner   Data rate (KB/sec)
Keyboard           input             human     0.01
Mouse              input             human     0.02
Voice input        input             human     0.02
Scanner            input             human     400.00
Voice output       output            human     0.60
Line printer       output            human     1.00
Laser printer      output            human     200.00
Graphics display   output            human     60,000.00
Modem              input or output   machine   2.00-8.00
Network/LAN        input or output   machine   500.00-6000.00
Floppy disk        storage           machine   100.00
Optical disk       storage           machine   1000.00
Magnetic tape      storage           machine   2000.00
Magnetic disk      storage           machine   2000.00-10,000.00



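• Mouse
– The interface between the mouse and the system can take one of the following forms:
  • The mouse generates a series of pulses as it moves.
  • The mouse increments or decrements counters as it moves.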

[Figure: the initial position of the mouse and the eight directions of movement, each labeled with its counter change of +20 or –20 in X and/or Y]


I/O Example: Disk Drives

[Figure: a disk drive, showing its platters, tracks, and sectors]

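• Components of a hard disk
– Platter
– Track
– Sector
  • The smallest unit that can be read or written.
  • With Logical Block Addressing (LBA), the smallest unit that can be read or written becomes the block.
  • Traditionally, every track has the same number of sectors.
  • Zone Bit Recording (ZBR) puts more sectors on the outer tracks to increase capacity.
– Cylinder
• There is a gap between sectors.
• Each sector contains an error-correcting code (ECC).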


• Disk access time:
– Seek time:
  • Move the head to the proper track (8 to 20 ms on average)
– Rotational latency:
  • Wait for the desired sector to rotate under the read/write head
– Transfer time:
  • Grab the data (one or more sectors) at 2 to 15 MB/sec
– Controller time:
  • The overhead the controller imposes in performing an I/O access
– Disk access time = Seek time + Rotational latency + Transfer time + Controller time



• Compared with floppy disks, hard disks have the following advantages:
– The hard disk can be larger because it is rigid.
– The hard disk has higher density because it can be controlled more precisely.
– The hard disk has a higher data rate because it spins faster.
– Hard disks can incorporate more than one platter.


Example

• Disk read time
– What is the average time to read or write a 512-byte sector for a typical disk rotating at 5400 RPM? The advertised average seek time is 12 ms, the transfer rate is 5 MB/sec, and the controller overhead is 2 ms. Assume that the disk is idle so that there is no waiting time.
– Solution:
  • Disk access time = seek time + rotation time + transfer time + controller overhead
  • On average the disk must turn half a revolution before the sector arrives, so:

$$
\text{Disk access time} = 12 + \frac{1}{2}\times\frac{1}{5400}\times 60\times 1000 + \frac{512}{5\,\text{M}}\times 1000 + 2 = 12 + 5.6 + 0.1 + 2 = 19.7\ \text{ms}
$$
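The same sum can be checked numerically. This is a minimal sketch, assuming 5 MB/sec means 5 × 10^6 bytes per second; the function name `disk_access_time_ms` and its parameter names are hypothetical and chosen only for illustration.

```python
def disk_access_time_ms(seek_ms, rpm, sector_bytes, transfer_bytes_per_sec, controller_ms):
    """Average access time in ms: seek + half a rotation + transfer + controller."""
    rotational_ms = 0.5 * (60_000.0 / rpm)   # half a revolution on average
    transfer_ms = sector_bytes / transfer_bytes_per_sec * 1000.0
    return seek_ms + rotational_ms + transfer_ms + controller_ms


t = disk_access_time_ms(seek_ms=12, rpm=5400, sector_bytes=512,
                        transfer_bytes_per_sec=5e6, controller_ms=2)
print(f"{t:.1f} ms")
```

Seek time and rotational latency dominate the result; the transfer of a single sector contributes only about 0.1 ms.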


RAID

• Redundant Array of Independent Disks (originally: Redundant Array of Inexpensive Disks)
• Seven levels are defined, RAID 0 through RAID 6
• Not a hierarchy
• Set of physical disks viewed as a single logical drive by the O/S
• Data distributed across physical drives
• Can use redundant capacity to store parity information


RAID 0

• No redundancy
• Data striped across all disks
• Round-robin striping
• Increases speed
– Multiple data requests are probably not on the same disk
– Disks seek in parallel
– A set of data is likely to be striped across multiple disks
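Round-robin striping amounts to a simple mapping from a logical strip number to a physical location. The sketch below is illustrative only: the function name `raid0_location` and the choice of numbering strips and disks from zero are assumptions, not something the slides specify.

```python
def raid0_location(logical_strip, num_disks):
    """Map a logical strip number to (disk index, strip index on that disk)."""
    return logical_strip % num_disks, logical_strip // num_disks


# Consecutive logical strips land on consecutive disks, so a large
# sequential request is spread over all disks.
layout = [raid0_location(s, num_disks=4) for s in range(8)]
print(layout)
```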


RAID 1

• Mirrored disks
• Data is striped across disks
• 2 copies of each stripe on separate disks
• Read from either
• Write to both
• Recovery is simple
– Swap faulty disk & re-mirror
– No down time
• Expensive


RAID 2

• Disks are synchronized
• Very small stripes
– Often single byte/word
• Error correction calculated across corresponding bits on disks
• Multiple parity disks store Hamming code error correction in corresponding positions
• Lots of redundancy
– Expensive
– Not used


RAID 3

• Similar to RAID 2
• Only one redundant disk, no matter how large the array
• Simple parity bit for each set of corresponding bits
• Data on a failed drive can be reconstructed from surviving data and parity info
• Very high transfer rates
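Reconstruction from parity relies on XOR: the parity is the XOR of all data blocks, so XOR-ing the surviving blocks with the parity yields the missing block. The following is a small sketch with made-up byte values; `xor_blocks` is a hypothetical helper, and a real array works on whole disks rather than two-byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"\x0f\xa0", b"\x33\x55", b"\xc1\x7e"]   # three hypothetical data disks
parity = xor_blocks(data)                         # contents of the parity disk

# Pretend disk 1 failed: rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```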


RAID 4

• Each disk operates independently
• Good for high I/O request rate
• Large stripes
• Bit-by-bit parity calculated across stripes on each disk
• Parity stored on parity disk


RAID 5

• Like RAID 4
• Parity striped across all disks
• Round-robin allocation for parity stripe
• Avoids RAID 4 bottleneck at parity disk
• Commonly used in network servers
• N.B. RAID 5 does not mean 5 disks!


RAID 6

• Two parity calculations
• Stored in separate blocks on different disks
• User requirement of N disks needs N+2
• High data availability
– Three disks need to fail for data loss
– Significant write penalty


RAID 0, 1, 2


RAID 3 & 4


RAID 5 & 6


Data Mapping For RAID 0


Optical Storage CD-ROM

• Originally for audio
• 650 MB, giving over 70 minutes of audio
• Polycarbonate coated with a highly reflective coat, usually aluminium
• Data stored as pits
• Read by reflecting laser
• Constant packing density
– A CD-ROM contains a single spiral track
– Sectors near the outside of the disk are the same length as those near the inside.
– Information is packed evenly across the disk in segments of the same size.
• Constant linear velocity
– The disk rotates more slowly for accesses near the outer edge than for those near the center.


I/O Example: Buses

• Shared communication link (one or more wires)
• Advantages of a bus:
– versatility, low cost
• Difficult design:
– may be a bottleneck
– length of the bus
– number of devices
– tradeoffs (buffers for higher bandwidth increase latency)
– support for many different devices
– cost


Buses: Connecting I/O Device to Processor and Memory

• Bus transaction
– Read: transfers data from memory
– Write: writes data to memory
– Input: moves data from the device to memory
– Output: data is read from memory and sent to the device
• Output operation (writing memory contents to the disk)
1. The CPU sends a Read control signal and the address to memory.
2. Memory reads the requested data.
3. Memory places the data on the data lines and signals the disk that the data is available.
4. The disk writes the data on the data lines to the disk.



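• Input operation (loading disk contents into memory)
1. The CPU sends a write request control signal and the address to memory.
2. The disk is notified that memory is ready.
3. The disk places the data on the data lines.
4. Memory writes the data on the data lines into memory.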



• Types of buses:
– Processor-memory bus
  • Short, high speed, to maximize memory-processor bandwidth
– Backplane bus
  • Allows processor, memory, and I/O devices to coexist on a single bus.
  • High speed, often standardized, e.g., PCI
– I/O bus
  • Lengthy, different devices, standardized, e.g., SCSI


I/O Bus Standards

• Today we have two dominant bus standards:



[Figure: three bus organizations. (a) Processor, memory, and I/O devices share a single backplane bus. (b) A processor-memory bus with bus adapters that connect to I/O buses. (c) A processor-memory bus with a bus adapter to a backplane bus, which in turn connects through bus adapters to I/O buses.]



• Synchronous and asynchronous buses
• Synchronous
– Uses a clock and a synchronous protocol, e.g., a processor-memory bus
– The bus can run very fast and the interface logic will be small
– Disadvantages:
  • Every device must operate at the same rate
  • Clock skew requires the bus to be short
• Asynchronous
– Does not use a clock; uses handshaking instead
– Handshaking: assume that there are three control lines:
  • ReadReq: indicates a read request for memory. The address is put on the data lines at the same time.
  • DataRdy: indicates that the data word is now ready on the data lines.
  • Ack: used to acknowledge the ReadReq or DataRdy signal of the other party.


An I/O device reads a word from memory

[Figure: timing diagram of the ReadReq, Data, Ack, and DataRdy lines, with the seven handshake steps marked]

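1. The I/O device raises ReadReq and, at the same time, places the address on the data bus.
2. Memory responds with Ack and reads the address from the data bus. When the I/O device sees Ack, it releases ReadReq and the data bus.
3. Memory sees ReadReq go low and releases Ack.
4. When memory has the data ready, it places the data on the data bus and raises DataRdy to notify the I/O device.
5. The I/O device sees DataRdy, reads the data from the data bus, and raises Ack to notify memory.
6. Memory sees Ack and releases DataRdy and the data bus.
7. The I/O device sees DataRdy go low and releases Ack. The transfer is complete.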
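The seven steps can be traced in a few lines of code. The sketch below is a sequential model under simplifying assumptions: both parties are folded into a single function, signal lines are dictionary entries, and there are no real delays or concurrency. The function name `async_read` and the sample memory contents are hypothetical.

```python
def async_read(memory, address):
    """Walk through the seven handshake steps; return (data word, trace)."""
    lines = {"ReadReq": 0, "DataRdy": 0, "Ack": 0, "Data": None}
    trace = []

    def snapshot(step):
        trace.append((step, dict(lines)))

    # 1. The I/O device raises ReadReq and drives the address.
    lines["ReadReq"], lines["Data"] = 1, address
    snapshot(1)
    # 2. Memory latches the address and raises Ack; the device then
    #    releases ReadReq and the data lines.
    latched = lines["Data"]
    lines["Ack"] = 1
    lines["ReadReq"], lines["Data"] = 0, None
    snapshot(2)
    # 3. Memory sees ReadReq low and drops Ack.
    lines["Ack"] = 0
    snapshot(3)
    # 4. Memory drives the data and raises DataRdy.
    lines["Data"], lines["DataRdy"] = memory[latched], 1
    snapshot(4)
    # 5. The device reads the data and raises Ack.
    word = lines["Data"]
    lines["Ack"] = 1
    snapshot(5)
    # 6. Memory sees Ack, drops DataRdy, and releases the data lines.
    lines["DataRdy"], lines["Data"] = 0, None
    snapshot(6)
    # 7. The device sees DataRdy low and drops Ack.
    lines["Ack"] = 0
    snapshot(7)
    return word, trace


word, trace = async_read(memory={0x10: 0xCAFE}, address=0x10)
for step, state in trace:
    print(step, state)
```

Every line is back at its idle value after step 7, which is what lets the next transaction start cleanly.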


Example

• Performance analysis of synchronous vs. asynchronous buses
– The synchronous bus has a clock cycle time of 50 ns, and each bus transmission takes 1 clock cycle. The asynchronous bus requires 40 ns per handshake. The data portion of both buses is 32 bits wide. Find the bandwidth for each bus when performing one-word reads from a 200-ns memory.
– Solution (synchronous bus): each bus transmission takes 1 clock cycle (50 ns), so reading one word from memory takes:
  1. Send the address to memory: 50 ns
  2. Read the memory: 200 ns
  3. Send the data to the device: 50 ns
  Total time = 300 ns, so transferring 4 bytes takes 300 ns: 4 bytes / 300 ns = 13.3 MB/sec
– Solution (asynchronous bus): each handshake takes 40 ns, and several of the seven steps can overlap. Steps 2 to 4 overlap with the memory access, which takes longer, so reading one word from memory takes:
  Step 1: 40 ns
  Steps 2, 3, 4: max(3 × 40 ns, 200 ns) = 200 ns
  Steps 5, 6, 7: 3 × 40 ns = 120 ns
  Total time = 360 ns, so transferring 4 bytes takes 360 ns: 4 bytes / 360 ns = 11.1 MB/sec
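The two bandwidth figures follow directly from the step timings. This is a minimal sketch, assuming 1 MB = 10^6 bytes; the helper `bandwidth_mb_per_s` and the constant names are illustrative choices, not part of the original example.

```python
def bandwidth_mb_per_s(bytes_moved, time_ns):
    """Bandwidth in MB/sec (1 MB = 10**6 bytes) for a transfer time in ns."""
    return bytes_moved / (time_ns * 1e-9) / 1e6


WORD_BYTES = 4
MEMORY_NS = 200

# Synchronous: address cycle + memory access + data cycle.
sync_ns = 50 + MEMORY_NS + 50

# Asynchronous: step 1, then steps 2-4 overlapped with the memory access,
# then steps 5-7.
HANDSHAKE_NS = 40
async_ns = HANDSHAKE_NS + max(3 * HANDSHAKE_NS, MEMORY_NS) + 3 * HANDSHAKE_NS

sync_bw = bandwidth_mb_per_s(WORD_BYTES, sync_ns)
async_bw = bandwidth_mb_per_s(WORD_BYTES, async_ns)
print(f"synchronous:  {sync_ns} ns, {sync_bw:.1f} MB/sec")
print(f"asynchronous: {async_ns} ns, {async_bw:.1f} MB/sec")
```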



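• Increasing the bus bandwidth
– Data bus width
  • Widen the data bus.
– Separate versus multiplexed address and data lines
  • Use separate buses for data and addresses, so that an address and data can be sent in the same bus cycle.
– Block transfer
  • Let the bus transfer multiple words back to back without sending an address or releasing the bus in between, which reduces the time needed to transfer a large block.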


Example:

• Performance analysis of two bus schemes
– Suppose we have a system with the following characteristics:
  • A memory and bus system supporting block access of 4 to 16 32-bit words.
  • A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle required to send an address to memory.
  • Two clock cycles needed between each bus operation. (Assume the bus is idle before an access.)
  • A memory access time for the first four words of 200 ns; each additional set of four words can be read in 20 ns. Assume that a bus transfer of the most recently read data and a read of the next four words can be overlapped.
– Find the sustained bandwidth and the latency for a read of 256 words for transfers that use 4-word blocks and for transfers that use 16-word blocks. Also compute the effective number of bus transactions per second for each case. Recall that a single bus transaction consists of an address transmission followed by data.


Example:

• Solution:
– Bus clock = 200 MHz, so one clock cycle = 1/200 MHz = 5 ns
– For 4-word block transfers, each block requires:
  • Send the address to memory: 1 clock cycle
  • Read the data from memory: 200 ns / 5 ns = 40 clock cycles
  • Transfer the data from memory: 2 clock cycles
  • Idle time between bus operations: 2 clock cycles
– Each block takes 1 + 40 + 2 + 2 = 45 clock cycles, and 256/4 = 64 transactions are needed.
– Total: 45 × 64 = 2880 clock cycles = 2880 × 5 ns = 14400 ns (the latency)
– Transactions per second = 64 / 14400 ns = 4.44 M transactions/sec
– Bus bandwidth = (256 × 4 bytes) / 14400 ns = 71.11 MB/sec
– For 16-word block transfers, each block requires:
  • Send the address to memory: 1 clock cycle
  • Read the first four words from memory: 200 ns / 5 ns = 40 clock cycles
  • Transfer the data from memory: 2 clock cycles × 4 = 8 clock cycles
  • Idle time between bus operations: 2 clock cycles × 4 = 8 clock cycles (the reads of the later four-word sets overlap with these cycles)
– Each block takes 1 + 40 + 8 + 8 = 57 clock cycles, and 256/16 = 16 transactions are needed.
– Total: 57 × 16 = 912 clock cycles = 912 × 5 ns = 4560 ns (the latency)
– Transactions per second = 16 / 4560 ns = 3.51 M transactions/sec
– Bus bandwidth = (256 × 4 bytes) / 4560 ns = 224.56 MB/sec
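Both cases follow one pattern, so they can be computed by a single function. The sketch below is one way to write it, assuming 1 MB = 10^6 bytes and that every four-word set costs 2 transfer cycles plus 2 idle cycles, as in the solution above; `block_read` and the constant names are hypothetical.

```python
CYCLE_NS = 5            # 200 MHz bus clock
TOTAL_WORDS = 256
WORD_BYTES = 4


def block_read(block_words):
    """Return (cycles per transaction, latency in ns, transactions/sec, MB/sec)."""
    groups = block_words // 4                 # four-word sets per block
    # 1 address cycle + 40 cycles for the first set, then for every set
    # 2 transfer cycles and 2 idle cycles (later memory reads are overlapped).
    cycles = 1 + 200 // CYCLE_NS + groups * (2 + 2)
    transactions = TOTAL_WORDS // block_words
    latency_ns = cycles * transactions * CYCLE_NS
    seconds = latency_ns * 1e-9
    return (cycles, latency_ns, transactions / seconds,
            TOTAL_WORDS * WORD_BYTES / seconds / 1e6)


for words in (4, 16):
    cycles, latency_ns, tps, mbps = block_read(words)
    print(f"{words}-word blocks: {cycles} cycles, {latency_ns} ns, "
          f"{tps / 1e6:.2f} M transactions/sec, {mbps:.2f} MB/sec")
```

Larger blocks amortize the 40-cycle initial memory access over more data, which is why bandwidth roughly triples even though the number of transactions per second falls.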



• Obtaining access to the bus
– In a single-master system, all bus requests must be controlled by the processor.
– Disadvantage: the processor must be involved in every bus transaction.

[Figure: three steps (a, b, c) of a bus transaction in a single-master system, showing the processor, memory, disks, the bus, and the bus request lines]



• Bus Arbitration:

– Deciding which bus master gets to use the bus next.

– Arbitration must take both bus priority and fairness into account.
– Four bus arbitration schemes:

• Daisy chain arbitration (not very fair)

• Centralized arbitration (requires an arbiter), e.g., PCI

• Distributed arbitration by self selection, e.g., NuBus used in Macintosh

• Distributed arbitration by collision detection, e.g., Ethernet

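[Figure: daisy chain arbitration. The bus arbiter's grant line passes from Device 1 (highest priority) through Device 2 to Device n (lowest priority); the devices share the request and release lines.]

[Figure: centralized arbitration (requires an arbiter), e.g., PCI. Devices D1 to D4 each connect to the bus controller through their own bus request (BR) and bus grant (BG) lines and share the bus and a bus busy line.]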
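Daisy chain arbitration, the first of the four schemes listed above, can be modeled as a grant signal that travels down the chain until a requesting device intercepts it. The following is a simplified sketch: `daisy_chain_grant` is a hypothetical name, and the model ignores the release line and signal timing.

```python
def daisy_chain_grant(requests):
    """Return the index of the device that receives the grant, or None.

    `requests` is ordered from highest priority (closest to the arbiter)
    to lowest. The grant passes down the chain until a requesting device
    intercepts it.
    """
    for device, wants_bus in enumerate(requests):
        if wants_bus:
            return device
    return None


print(daisy_chain_grant([False, True, True]))    # prints 1
print(daisy_chain_grant([False, False, False]))  # prints None
```

In the first call the lowest-priority device also requested the bus but never sees the grant, which illustrates why the scheme is described as not very fair.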


• Communicating with the Processor:

– Polling
  • The process of periodically checking status bits to see if it is time for the next I/O operation.
– Interrupts
  • Used when an I/O device requires attention from the processor.
– Direct Memory Access (DMA)
  • Off-loading the processor and having the device controller transfer data directly to or from the memory without involving the processor.


Interrupts

• Mechanism by which other modules (e.g., I/O) may interrupt the normal sequence of processing
– Program
  • e.g., overflow, division by zero
– Timer
  • Generated by internal processor timer
  • Used in pre-emptive multi-tasking
– I/O
  • From I/O controller
– Hardware failure
  • e.g., memory parity error


Program Flow Control


Interrupt Cycle

• Added to instruction cycle
• Processor checks for interrupt
– Indicated by an interrupt signal
• If no interrupt, fetch next instruction
• If interrupt pending:
– Suspend execution of current program
– Save context
– Set PC to start address of interrupt handler routine
– Process interrupt
– Restore context and continue interrupted program


Instruction Cycle (with Interrupts) - State Diagram


Multiple Interrupts

• Disable interrupts
– Processor will ignore further interrupts whilst processing one interrupt
– Interrupts remain pending and are checked after first interrupt has been processed
– Interrupts handled in sequence as they occur
• Define priorities
– Low priority interrupts can be interrupted by higher priority interrupts
– When higher priority interrupt has been processed, processor returns to previous interrupt


Multiple Interrupts - Sequential


Multiple Interrupts - Nested


Concept of DMA

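• Direct Memory Access (DMA)
• DMA is an interface in which a DMA controller (DMAC) uses cycle stealing to transfer large amounts of data directly between memory and a peripheral.
• The CPU starts the DMAC by sending it the starting address and the number of words to transfer. Because the CPU does not use the system bus while it decodes and executes an instruction, the DMAC uses this phase to have the CPU give up the system bus. The DMAC can then use the system bus to transfer data directly between the peripheral and memory, without management by the CPU. This technique is called cycle stealing.
• DMA differs from programmed I/O in that it does not use the CPU's registers; it steals memory cycles directly to transfer data, which makes it suitable for high-speed peripherals.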


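1. The peripheral sends a service request to the DMAC.
2. The DMAC requests the system bus from the CPU (HRQ).
3. The CPU completes its current cycle and responds with HLDA.
4. The DMAC acknowledges the peripheral's service request.
5. The DMAC sends the starting address of the data to be transferred, and the transfer between memory and the peripheral begins.
6. When the transfer is complete, the DMAC deasserts HRQ and returns control of the system bus to the CPU.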


Example:

• Overhead of polling in an I/O system
– Assume that the number of clock cycles for a polling operation is 400 and the processor executes with a 500 MHz clock. Determine the fraction of CPU time consumed for the following three cases, assuming that you poll often enough so that no data is ever lost and assuming that the devices are potentially always busy:
  • The mouse must be polled 30 times per second to ensure that we do not miss any movement made by the user.
  • The floppy disk transfers data to the processor in 16-bit units and has a data rate of 50 KB/sec. No data transfer can be missed.
  • The hard disk transfers data in four-word chunks and can transfer at 4 MB/sec. Again, no transfer can be missed.


Example:

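• Solution:
– (a) Mouse
  Polling takes 30 × 400 = 12000 cycles per second.
  Fraction of the processor consumed = 12000 / 500M ≈ 0.002%
– (b) Floppy disk
  The floppy transfers 50 KB per second in units of 16 bits = 2 bytes, so it needs 50 KB / 2 = 25K polls per second. Each poll takes 400 clock cycles, for a total of 25K × 400 = 10000K cycles per second.
  Fraction of the processor consumed = 10000K / 500M = 2%
– (c) Hard disk
  The hard disk transfers 4 MB per second in chunks of 4 words = 16 bytes, so it needs 4 MB / 16 = 250K polls per second. Each poll takes 400 clock cycles, for a total of 250K × 400 = 100000K cycles per second.
  Fraction of the processor consumed = 100000K / 500M = 20%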
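All three cases are the same calculation with a different polling rate. This is a minimal sketch, assuming 1 KB = 10^3 bytes and 1 MB = 10^6 bytes; the function `polling_fraction` and the constant names are hypothetical.

```python
CLOCK_HZ = 500e6
POLL_CYCLES = 400


def polling_fraction(polls_per_sec):
    """Fraction of CPU cycles spent polling at the given rate."""
    return polls_per_sec * POLL_CYCLES / CLOCK_HZ


mouse = polling_fraction(30)
floppy = polling_fraction(50e3 / 2)    # 50 KB/sec in 2-byte units
disk = polling_fraction(4e6 / 16)      # 4 MB/sec in 16-byte chunks
print(f"mouse {mouse:.4%}, floppy {floppy:.0%}, hard disk {disk:.0%}")
```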


Example:

• Overhead of interrupt-driven I/O
– Suppose we have the same hard disk and processor we used in the previous example, but we use interrupt-driven I/O. The overhead for each transfer is 500 clock cycles. Find the fraction of the processor consumed if the hard disk is only transferring data 5% of the time.
• Solution
– The hard disk transfers 4 MB per second in chunks of 4 words = 16 bytes, so it needs 4 MB / 16 = 250K interrupts per second. Each interrupt takes 500 clock cycles, for a total of 250K × 500 = 125000K cycles per second.
– Fraction of the processor consumed while transferring = 125000K / 500M = 25%
– Because the hard disk is only transferring data 5% of the time, the fraction of the processor consumed on average = 25% × 5% = 1.25%
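The interrupt-driven case differs from polling in that the overhead is paid only while the disk is actually transferring. A minimal sketch, assuming 1 MB = 10^6 bytes; the variable names are illustrative.

```python
CLOCK_HZ = 500e6
INTERRUPT_CYCLES = 500

transfers_per_sec = 4e6 / 16                      # 4 MB/sec in 16-byte chunks
busy_fraction = transfers_per_sec * INTERRUPT_CYCLES / CLOCK_HZ
average_fraction = busy_fraction * 0.05           # disk active 5% of the time
print(f"while transferring: {busy_fraction:.0%}, on average: {average_fraction:.2%}")
```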


Example:

• Overhead of I/O using DMA
– Suppose we have the same processor and hard disk we used in the previous example. Assume that the initial setup of a DMA transfer takes 1000 clock cycles for the processor, and assume the handling of the interrupt at DMA completion requires 500 clock cycles for the processor. The hard disk has a transfer rate of 4 MB/sec and uses DMA. If the average transfer from the disk is 8 KB, what fraction of the 500 MHz processor is consumed if the disk is actively transferring 100% of the time?
– Solution: the disk transfers at 4 MB/sec and each transfer averages 8 KB, so each transfer takes 8 KB / (4 MB/sec) = 0.002 seconds.
  If the disk transfers continuously, the processor spends (1000 + 500) / 0.002 = 750000 clock cycles per second.
  Fraction of the processor consumed = 750000 / 500M = 0.15%
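With DMA the processor pays only the setup and completion costs once per 8 KB transfer, which the calculation below makes explicit. This is a sketch under the assumption that 1 KB = 10^3 bytes and 1 MB = 10^6 bytes; the names are illustrative.

```python
CLOCK_HZ = 500e6
SETUP_CYCLES = 1000
COMPLETION_CYCLES = 500

seconds_per_transfer = 8e3 / 4e6                  # 8 KB at 4 MB/sec
cycles_per_sec = (SETUP_CYCLES + COMPLETION_CYCLES) / seconds_per_transfer
dma_fraction = cycles_per_sec / CLOCK_HZ
print(f"{cycles_per_sec:.0f} cycles/sec, {dma_fraction:.2%} of the processor")
```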


I/O System Design

• Consider the following computer system:
– A CPU that sustains 3 billion instructions per second and averages 100,000 instructions in the operating system per I/O operation.
– A memory backplane bus capable of sustaining a transfer rate of 1000 MB/sec.
– SCSI Ultra320 controllers with a transfer rate of 320 MB/sec and accommodating up to 7 disks.
– Disk drives with a read/write bandwidth of 75 MB/sec and an average seek plus rotational latency of 6 ms.
• If the workload consists of 64 KB reads (where the block is sequential on a track) and the user program needs 200,000 instructions per I/O operation, find the maximum sustainable I/O rate and the number of disks and SCSI controllers required. Assume that the reads can always be done on an idle disk if one exists (i.e., ignore disk conflicts).


I/O System Design

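• The two fixed components of the system are the memory bus and the CPU. First find the I/O rate each can sustain, to see which one is the bottleneck.
– Each I/O takes 200,000 user instructions and 100,000 OS instructions, so
  maximum I/O rate of CPU = instruction execution rate / instructions per I/O = 3 × 10^9 / ((200 + 100) × 10^3) = 10000 I/Os/second
– Each I/O transfers 64 KB, so
  maximum I/O rate of bus = bus bandwidth / bytes per I/O = 1000 × 10^6 / (64 × 10^3) = 15625 I/Os/second
– Because 10000 I/Os/second < 15625 I/Os/second, the CPU is the bottleneck.
– Next, determine how many disks are needed to deliver 10000 I/Os per second.
  • First compute the time each I/O spends at the disk:
    time per I/O at disk = seek + rotational time + transfer time = 6 ms + 64 KB / (75 MB/sec) = 6.9 ms
  • Each disk can therefore complete 1000 ms / 6.9 ms ≈ 146 I/Os per second.
  • To meet the CPU's rate of 10000 I/Os per second, 10000/146 ≈ 69 disks are needed.
– To find the number of SCSI buses, check the average transfer rate per disk:
  • Transfer rate = transfer size / transfer time = 64 KB / 6.9 ms ≈ 9.56 MB/sec
  • A SCSI bus holds at most 7 disks, and 7 × 9.56 MB/sec ≈ 67 MB/sec is well below 320 MB/sec, so the disks will not saturate the bus.
  • The 69 disks therefore need 69/7, rounded up to 10, SCSI buses and controllers.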
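The whole design calculation can be laid out as a short script. This is a sketch under stated assumptions: it uses 64 KB = 64 × 10^3 bytes throughout and carries unrounded intermediate values, and all constant and variable names are illustrative. The final step, dividing the disks among controllers that hold at most 7 disks each, is the step needed to answer the question about SCSI controllers.

```python
import math

CPU_IPS = 3e9
INSTR_PER_IO = 200_000 + 100_000      # user + OS instructions per I/O
BUS_BYTES_PER_SEC = 1000e6
IO_BYTES = 64e3                       # assumes 64 KB = 64 * 10**3 bytes
DISK_BYTES_PER_SEC = 75e6
SEEK_ROT_MS = 6.0
SCSI_BYTES_PER_SEC = 320e6
DISKS_PER_CONTROLLER = 7

# The slower of the two fixed components sets the maximum I/O rate.
cpu_rate = CPU_IPS / INSTR_PER_IO
bus_rate = BUS_BYTES_PER_SEC / IO_BYTES
max_rate = min(cpu_rate, bus_rate)

# Time per I/O at one disk, and the number of disks needed.
disk_ms = SEEK_ROT_MS + IO_BYTES / DISK_BYTES_PER_SEC * 1000
ios_per_disk = 1000 / disk_ms
disks = math.ceil(max_rate / ios_per_disk)

# A fully populated controller must stay below the SCSI transfer rate.
per_disk_bytes_per_sec = IO_BYTES * ios_per_disk
assert DISKS_PER_CONTROLLER * per_disk_bytes_per_sec < SCSI_BYTES_PER_SEC
controllers = math.ceil(disks / DISKS_PER_CONTROLLER)

print(f"max I/O rate {max_rate:.0f}/sec, {disks} disks, {controllers} controllers")
```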
